Debug Fix intermediate · 4 min read

How to handle multi-document RAG

Quick answer
To handle multi-document Retrieval-Augmented Generation (RAG), split documents into manageable chunks, embed them using OpenAI or similar embeddings, then retrieve relevant chunks for each query. Pass these retrieved chunks as context in your chat.completions.create calls to the LLM to generate accurate, context-aware responses.
ERROR TYPE code_error
⚡ QUICK FIX
Implement document chunking and retrieval before calling chat.completions.create to ensure relevant context is included for multi-document RAG.

Why this happens

Multi-document RAG often fails when raw documents are passed directly to the LLM without chunking or retrieval, causing context length overflow or irrelevant context. For example, sending entire documents in a single messages array leads to truncated inputs or noisy responses.

Typical broken code:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Passing full documents directly: the f-string interpolates the raw
# Python list repr, which can exceed the context window and adds noise
documents = ["Long document text 1...", "Long document text 2..."]

messages = [{"role": "user", "content": f"Answer based on these docs:\n{documents}"}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
print(response.choices[0].message.content)
output
Error or irrelevant output due to input length or noisy context

The fix

Split documents into smaller chunks, embed them with an embedding model like text-embedding-3-small, and store vectors in a vector store (e.g., FAISS). At query time, embed the query, retrieve top relevant chunks, and pass them as context to the LLM. This ensures focused, relevant input within token limits.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example document chunking
documents = ["Long document text 1...", "Long document text 2..."]
chunks = []
chunk_size = 500  # characters; prefer token-based chunking in production
for doc in documents:
    for i in range(0, len(doc), chunk_size):
        chunks.append(doc[i:i+chunk_size])

# Embed chunks
embeddings_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)
chunk_vectors = [data.embedding for data in embeddings_response.data]

# Retrieve the most relevant chunks by cosine similarity
# (in production, use a vector store such as FAISS or Pinecone)
import numpy as np

query = "What is the main topic?"
query_vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vector, vec) for vec in chunk_vectors]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
relevant_chunks = [chunks[i] for i in top_indices]

# Prepare prompt with retrieved context
context = "\n---\n".join(relevant_chunks)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Answer the question using the following context:\n{context}\nQuestion: {query}"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
print(response.choices[0].message.content)
output
The main topic is ... (accurate answer based on retrieved chunks)

Preventing it in production

Use robust, token-based chunking strategies, persistent vector stores such as FAISS or Pinecone, and the same embedding model for both queries and documents. Implement caching and fallback logic for retrieval failures, monitor token usage to avoid truncation, retry transient API errors automatically, and validate the relevance of retrieved context before calling the LLM.
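The automated retries mentioned above can be sketched with a small stdlib-only wrapper; the helper name and backoff parameters here are illustrative, not part of the OpenAI SDK:

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in practice, catch specific errors such as openai.RateLimitError
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # exponential backoff (1s, 2s, 4s, ...) plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage: wrap the API call in a zero-argument callable
# response = with_retries(lambda: client.chat.completions.create(model="gpt-4o-mini", messages=messages))
```

Catching a narrow exception type matters: retrying on every exception can mask real bugs such as malformed requests.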

Key Takeaways

  • Always chunk large documents before embedding for RAG workflows.
  • Use vector similarity search to retrieve relevant document chunks per query.
  • Pass only retrieved chunks as context to the LLM to stay within token limits.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small