Debug Fix intermediate · 4 min read

How to handle multi-document RAG

Quick answer
To handle multi-document Retrieval-Augmented Generation (RAG), split documents into manageable chunks, embed them using OpenAI or similar embeddings, then retrieve relevant chunks for each query. Pass these retrieved chunks as context in your chat.completions.create calls to the LLM to generate accurate, context-aware responses.
ERROR TYPE code_error
⚡ QUICK FIX
Implement document chunking and retrieval before calling chat.completions.create to ensure relevant context is included for multi-document RAG.

Why this happens

Multi-document RAG often fails when raw documents are passed directly to the LLM without chunking or retrieval, causing context length overflow or irrelevant context. For example, sending entire documents in a single messages array leads to truncated inputs or noisy responses.

Typical broken code:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Passing full documents directly: the f-string interpolates the raw
# Python list repr, which can exceed the context window and adds noise
documents = ["Long document text 1...", "Long document text 2..."]

messages = [{"role": "user", "content": f"Answer based on these docs:\n{documents}"}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
print(response.choices[0].message.content)
output
Error or irrelevant output due to input length or noisy context

The fix

Split documents into smaller chunks, embed them with an embedding model like text-embedding-3-small, and store vectors in a vector store (e.g., FAISS). At query time, embed the query, retrieve top relevant chunks, and pass them as context to the LLM. This ensures focused, relevant input within token limits.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example document chunking
documents = ["Long document text 1...", "Long document text 2..."]
chunks = []
chunk_size = 500  # characters; prefer token-based chunking in production
for doc in documents:
    for i in range(0, len(doc), chunk_size):
        chunks.append(doc[i:i+chunk_size])

# Embed chunks
embeddings_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)
chunk_vectors = [data.embedding for data in embeddings_response.data]

# Retrieve the most relevant chunks by cosine similarity
# (in production, use a vector store such as FAISS or Pinecone)
import numpy as np

query = "What is the main topic?"
query_vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vector, vec) for vec in chunk_vectors]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
relevant_chunks = [chunks[i] for i in top_indices]

# Prepare prompt with retrieved context
context = "\n---\n".join(relevant_chunks)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Answer the question using the following context:\n{context}\nQuestion: {query}"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
print(response.choices[0].message.content)
output
The main topic is ... (accurate answer based on retrieved chunks)

Preventing it in production

Use robust, token-based chunking strategies, persistent vector stores such as FAISS or Pinecone, and the same embedding model for both queries and documents. Implement caching and fallback logic for retrieval failures, monitor token usage to avoid truncation, retry transient API errors automatically, and validate the relevance of retrieved context before calling the LLM.
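The automated retries mentioned above can be sketched with a small stdlib-only wrapper; the helper name and backoff parameters here are illustrative, not part of the OpenAI SDK:

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in practice, catch specific errors such as openai.RateLimitError
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # exponential backoff (1s, 2s, 4s, ...) plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage: wrap the API call in a zero-argument callable
# response = with_retries(lambda: client.chat.completions.create(model="gpt-4o-mini", messages=messages))
```

Catching a narrow exception type matters: retrying on every exception can mask real bugs such as malformed requests.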

Key Takeaways

  • Always chunk large documents before embedding for RAG workflows.
  • Use vector similarity search to retrieve relevant document chunks per query.
  • Pass only retrieved chunks as context to the LLM to stay within token limits.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small