How to handle large documents in RAG
In RAG, split the document into smaller chunks before embedding and indexing. Use a vector database to retrieve the relevant chunks at query time, so the LLM only ever processes a manageable amount of context.
Why this happens
Large documents exceed the token limits of LLMs and embedding models, causing errors or degraded performance. For example, passing a full book or long report directly to an embedding model or LLM results in truncation or API errors. This happens because models enforce a maximum context length (e.g., 8k or 32k tokens) and reject or truncate any input beyond it; an embedding model in particular produces one fixed-size vector per input, so the whole input must fit within its context window.
Typical broken code tries to embed or query the entire document at once:
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

large_text = open("large_document.txt").read()

# Incorrect: embedding the entire large document at once
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=large_text
)
embedding_vector = response.data[0].embedding
openai.BadRequestError: This model's maximum context length is 8192 tokens, but you requested 12000 tokens.
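A cheap guard catches this before the API call: estimate the token count and refuse oversized inputs. The sketch below uses a rough rule of thumb of about four characters per token for English text; the ratio and the helper names are illustrative assumptions, not part of any API. For exact counts, use the model's tokenizer (e.g., the tiktoken package).

```python
# Rough token estimate: ~4 characters per token for English text (heuristic,
# not exact; use the model's tokenizer for precise counts).
EMBEDDING_TOKEN_LIMIT = 8192  # text-embedding-3-large context length

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def check_fits(text: str) -> bool:
    """Return True if the text is likely within the embedding model's limit."""
    return estimate_tokens(text) <= EMBEDDING_TOKEN_LIMIT

print(check_fits("short paragraph"))  # True
print(check_fits("x" * 100_000))      # False: ~25,000 estimated tokens
```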
The fix
Split the large document into smaller, semantically meaningful chunks (e.g., paragraphs or 500-token segments). Embed each chunk separately and store these embeddings in a vector database like FAISS or Chroma. At query time, embed the query and retrieve the most relevant chunks to pass to the LLM for generation.
This approach respects token limits and improves retrieval relevance by focusing on smaller, contextually coherent pieces.
from openai import OpenAI
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Load the large document
large_text = open("large_document.txt").read()

# Split into overlapping chunks (chunk_size is measured in characters;
# use a token-based splitter if you need an exact token budget)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(large_text)

# Embed the chunks and store them in a FAISS vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = FAISS.from_texts(chunks, embeddings)

# Query: embed the question and retrieve the most similar chunks
query = "Explain the main topic of the document"
relevant_docs = vector_store.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# Pass the retrieved chunks to the LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Based on these excerpts:\n{context}\n\nAnswer: {query}"}
    ]
)
print(response.choices[0].message.content)
The main topic of the document is ... (LLM-generated answer based on the retrieved chunks)
Preventing it in production
Implement automatic chunking and embedding pipelines for all large documents before indexing. Use vector databases with efficient similarity search to scale retrieval. Add validation to check chunk sizes against model token limits. Incorporate retry logic for API rate limits and fallback to smaller chunk sizes if needed.
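The validation and fallback logic can be sketched as below. The helper names, the 4-characters-per-token heuristic, and the halving fallback are illustrative assumptions, not a standard API; in a real pipeline the splitter and the embedding call would come from your RAG stack.

```python
import time

MAX_CHUNK_TOKENS = 8192      # embedding model context limit
APPROX_CHARS_PER_TOKEN = 4   # rough heuristic for English text

def split_fixed(text: str, chunk_chars: int, overlap: int = 50) -> list[str]:
    """Hypothetical fixed-size splitter with overlap (stand-in for a real one)."""
    step = max(chunk_chars - overlap, 1)
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]

def chunk_with_fallback(text: str, chunk_chars: int = 2000) -> list[str]:
    """Halve the chunk size until every chunk fits the model's token limit."""
    while chunk_chars > 100:
        chunks = split_fixed(text, chunk_chars)
        if all(len(c) // APPROX_CHARS_PER_TOKEN <= MAX_CHUNK_TOKENS for c in chunks):
            return chunks
        chunk_chars //= 2
    raise ValueError("could not produce chunks within the token limit")

def embed_with_retry(embed_fn, chunk: str, retries: int = 3):
    """Retry an embedding call with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            return embed_fn(chunk)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)

chunks = chunk_with_fallback("lorem ipsum dolor sit amet " * 1000)
print(len(chunks), "chunks, all within the token limit")
```

In production, `embed_fn` would wrap the real embeddings client, and the backoff should distinguish rate-limit errors (retry) from validation errors (shrink the chunk instead).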
Monitor retrieval quality and update chunking strategies (e.g., semantic vs. fixed-size) to optimize relevance. Cache embeddings and query results to reduce latency and cost.
Key takeaways
- Always chunk large documents before embedding to respect token limits.
- Use vector databases like FAISS or Chroma for efficient similarity search.
- Retrieve relevant chunks dynamically to keep LLM context manageable.
- Validate chunk sizes and implement retries to handle API limits.
- Optimize chunking strategy based on document structure for best retrieval.