High severity · Intermediate · Fix: 5-15 min

ValueError

builtins.ValueError

What this error means
The retrieved document chunks together contain more tokens than the LLM's maximum context window, so the model cannot process the input.

Stack trace

traceback
ValueError: Retrieved document chunks exceed the model's maximum context window size of 8192 tokens.
QUICK FIX
Limit the number or size of retrieved chunks so their combined tokens fit within the model's maximum context window.

Why it happens

RAG pipelines concatenate the retrieved document chunks into the context passed to the LLM. If the combined token count of those chunks, together with the rest of the prompt, exceeds the model's maximum context window, the LLM cannot process the input and this error is raised. Typical triggers are chunks that are too large, too many chunks per query, or a model whose context window is smaller than expected.

Detection

Monitor the total token count of retrieved chunks before passing them to the LLM. Log or assert if the combined tokens exceed the model's max context window to catch this early.
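A minimal sketch of such a pre-flight check, assuming the retrieved chunks are available as plain strings. The names `estimate_tokens` and `check_context_budget` and the four-characters-per-token estimate are illustrative assumptions, not part of LangChain; for accurate counts, pass the model's real tokenizer as `count_tokens`.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def check_context_budget(
    chunks: Iterable[str],
    max_context_tokens: int = 8192,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> int:
    """Return the combined token count, raising early if it exceeds the window."""
    total = sum(count_tokens(chunk) for chunk in chunks)
    logger.info("Retrieved chunks use %d of %d tokens", total, max_context_tokens)
    if total > max_context_tokens:
        raise ValueError(
            f"Retrieved chunks use {total} tokens, "
            f"above the {max_context_tokens}-token window"
        )
    return total
```

Calling `check_context_budget` on the chunk texts just before the LLM call turns an opaque downstream failure into a logged, early one.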

Causes & fixes

1

Retrieved document chunks are too large individually or too many chunks are retrieved, exceeding the model's context window.

✓ Fix

Reduce the chunk size during document splitting or limit the number of retrieved chunks to fit within the model's maximum context window.

2

Using a model with a smaller context window than expected for your retrieval setup.

✓ Fix

Switch to a model with a larger context window (e.g., gpt-4o, which has a 128,000-token context window) or adjust retrieval parameters to fit the smaller window.

3

Not accounting for prompt tokens and other input tokens when calculating total context size.

✓ Fix

Calculate total tokens including prompt, retrieved chunks, and any system messages to ensure the sum fits within the model's context window.
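The budgeting described in the fixes above can be sketched as simple arithmetic. `ContextBudget` and its field names are hypothetical, and the token figures are made-up examples. Reserving room for the completion is an added assumption: it applies to APIs where the generated answer shares the context window with the input.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    context_window: int     # model's maximum context window, in tokens
    system_tokens: int      # system message(s)
    prompt_tokens: int      # prompt template plus the user question
    completion_tokens: int  # room reserved for the generated answer (assumption)

    @property
    def available_for_chunks(self) -> int:
        """Tokens left for retrieved chunks after all other inputs are counted."""
        overhead = self.system_tokens + self.prompt_tokens + self.completion_tokens
        return max(0, self.context_window - overhead)

    def max_chunks(self, chunk_tokens: int) -> int:
        """Largest number of equally sized chunks that fit in the remaining budget."""
        if chunk_tokens <= 0:
            raise ValueError("chunk_tokens must be positive")
        return self.available_for_chunks // chunk_tokens


budget = ContextBudget(
    context_window=8192, system_tokens=150, prompt_tokens=250, completion_tokens=792
)
print(budget.available_for_chunks)  # 7000
print(budget.max_chunks(1000))      # 7
print(budget.max_chunks(500))       # 14
```

The value returned by `max_chunks` is a reasonable upper bound for the retriever's top-k setting, and halving the chunk size doubles how many chunks fit.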

Code: broken vs fixed

Broken - triggers the error
python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI  # gpt-4o is a chat model

# max_tokens caps the completion length; it does not set the context window
llm = ChatOpenAI(model="gpt-4o", max_tokens=1024)
retriever = ...  # returns many large chunks
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# This line raises ValueError when the retrieved chunks exceed the context window
result = qa.invoke({"query": "Explain the document contents."})
Fixed - works correctly
python
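from typing import List

import tiktoken
from langchain.chains import RetrievalQA
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI

# ChatOpenAI reads OPENAI_API_KEY from the environment; no manual assignment needed.
# max_tokens caps the completion length; it does not set the context window.
llm = ChatOpenAI(model="gpt-4o", max_tokens=1024)
retriever = ...  # configure retriever with smaller chunk size or limit

# gpt-4o tokenizer (needs a tiktoken release that ships o200k_base)
encoding = tiktoken.get_encoding("o200k_base")


class TokenBudgetRetriever(BaseRetriever):
    """Wraps a retriever and keeps only the leading chunks that fit a token budget."""

    base: BaseRetriever
    max_tokens: int = 7000  # leave room for prompt

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        limited_docs, total_tokens = [], 0
        for doc in self.base.invoke(query):
            doc_tokens = len(encoding.encode(doc.page_content))  # real token count
            if total_tokens + doc_tokens > self.max_tokens:
                break  # stop at the first chunk that would overflow the budget
            limited_docs.append(doc)
            total_tokens += doc_tokens
        return limited_docs


# RetrievalQA needs a BaseRetriever instance, not a plain function
qa = RetrievalQA.from_chain_type(llm=llm, retriever=TokenBudgetRetriever(base=retriever))

result = qa.invoke({"query": "Explain the document contents."})
print(result["result"])  # Works without exceeding context window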
The fix caps the retrieved chunks at a token budget so that their combined tokens, plus the prompt, fit within the model's maximum context window, preventing the ValueError.

Workaround

Catch the ValueError and retry retrieval with fewer or smaller chunks, or truncate retrieved documents before passing to the LLM.
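One way to sketch the retry, assuming the pipeline can be called with an explicit chunk count. `answer_with_backoff` and the `ask(query, k)` callable are hypothetical names, and `fake_ask` only stands in for a real chain so the example runs on its own.

```python
from typing import Callable


def answer_with_backoff(
    ask: Callable[[str, int], str], query: str, start_k: int = 8, min_k: int = 1
) -> str:
    """Call ask(query, k), halving k each time a ValueError is raised."""
    k = start_k
    while True:
        try:
            return ask(query, k)
        except ValueError:
            # In real code, inspect the message first so unrelated
            # ValueErrors are not retried.
            if k <= min_k:
                raise  # nothing left to trim; surface the original error
            k = max(min_k, k // 2)


def fake_ask(query: str, k: int) -> str:
    """Stand-in for a RAG call that overflows when more than 2 chunks are used."""
    if k > 2:
        raise ValueError(
            "Retrieved document chunks exceed the model's maximum context window"
        )
    return f"answered with {k} chunks"


print(answer_with_backoff(fake_ask, "Explain the document contents."))
# answered with 2 chunks
```

Halving keeps the number of retries small, at the cost of sometimes dropping more chunks than strictly necessary.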

Prevention

Design your retrieval pipeline to estimate token counts of chunks and total context size before LLM calls, and use models with context windows that match your retrieval scale.

Python 3.9+ · langchain-core >=0.1.0 · tested on 0.2.x
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-haiku-20241022