High severity · Intermediate · Fix: 5-15 min

ValueError

builtins.ValueError

What this error means

The combined token count of the retrieved document chunks exceeds the LLM's maximum context window, so the model cannot process the input.

Stack trace

traceback

ValueError: Retrieved document chunks exceed the model's maximum context window size of 8192 tokens.

QUICK FIX

Limit the number or size of retrieved chunks so their combined tokens fit within the model's maximum context window.

Why it happens

RAG pipelines concatenate the retrieved document chunks and pass them to the LLM as context. If the combined token count exceeds the model's maximum context window, the model cannot process the input and this error is raised. Typical triggers are chunks that are too large, too many chunks retrieved per query, or a model whose context window is smaller than the pipeline assumes.

Detection

Count the tokens in the retrieved chunks (and the prompt) before passing them to the LLM. Log a warning or raise an error when the total exceeds the model's maximum context window so the problem surfaces early.
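A minimal sketch of such a check, assuming the rough rule of thumb of about 4 characters per token for English text. For exact counts, pass a real tokenizer's length function (for example, one built on tiktoken) as `count_tokens`. `estimate_tokens` and `check_context_fits` are hypothetical helpers, not part of LangChain.

```python
import logging
import math
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption): about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def check_context_fits(
    chunks: Sequence[str],
    prompt: str,
    context_window: int,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> int:
    """Return the total input token count; log and raise ValueError if it exceeds the window."""
    total = count_tokens(prompt) + sum(count_tokens(c) for c in chunks)
    if total > context_window:
        logger.warning(
            "Context overflow: %d tokens for a %d-token window (%d chunks)",
            total, context_window, len(chunks),
        )
        raise ValueError(
            f"Retrieved chunks need {total} tokens; the window is {context_window}"
        )
    return total


# Three 4,000-character chunks are ~3,000 tokens, which fits in an 8192-token window.
chunks = ["x" * 4000] * 3
print(check_context_fits(chunks, "Explain the document contents.", 8192))  # prints 3008
```

Calling this right before the LLM call turns a late, opaque failure into an explicit error that names the token total and the number of chunks involved.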

Causes & fixes

Retrieved document chunks are too large individually or too many chunks are retrieved, exceeding the model's context window.

✓ Fix

Reduce the chunk size during document splitting or limit the number of retrieved chunks to fit within the model's maximum context window.
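As a sketch of the first option, the splitter below caps each chunk at a fixed number of words with a small overlap between neighbouring chunks. Words are used as a stand-in for tokens here (English text usually yields somewhat more tokens than words), and `split_into_chunks` is a hypothetical helper; in LangChain you would normally lower the text splitter's `chunk_size` and `chunk_overlap` settings instead.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most `chunk_size` words, overlapping by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each new window starts this many words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# 1,000 words with chunk_size=200 and overlap=20: windows start every 180 words.
text = " ".join(f"w{i}" for i in range(1000))
chunks = split_into_chunks(text, chunk_size=200, overlap=20)
print(len(chunks), max(len(c.split()) for c in chunks))  # prints 6 200
```

Smaller chunks let the retriever return the same number of results while using less of the window; the overlap keeps sentences that straddle a boundary from being lost entirely.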

Using a model with a smaller context window than expected for your retrieval setup.

✓ Fix

Switch to a model with a larger context window (for example, gpt-4o supports 128,000 tokens), or adjust retrieval parameters such as chunk size and the number of retrieved chunks to fit the smaller window.

Not accounting for prompt tokens and other input tokens when calculating total context size.

✓ Fix

Calculate total tokens including prompt, retrieved chunks, and any system messages to ensure the sum fits within the model's context window.
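The arithmetic is simple but easy to get wrong, because the window must also hold the model's answer. A minimal sketch, assuming every chunk is roughly the same size; `chunk_budget` and the numbers in the usage line are illustrative, not taken from any library.

```python
from typing import Tuple


def chunk_budget(
    context_window: int,
    system_tokens: int,
    prompt_tokens: int,
    reserved_output_tokens: int,
    tokens_per_chunk: int,
) -> Tuple[int, int]:
    """Return (tokens left for retrieved chunks, how many whole chunks fit)."""
    available = context_window - system_tokens - prompt_tokens - reserved_output_tokens
    if available <= 0:
        raise ValueError("No room left for retrieved chunks; shrink the prompt or output reserve")
    return available, available // tokens_per_chunk


# 8192-token window, 200-token system message, 300 tokens of prompt template + question,
# 1024 tokens reserved for the answer, chunks of ~512 tokens each.
available, k = chunk_budget(8192, 200, 300, 1024, 512)
print(available, k)  # prints 6668 13
```

With these assumed numbers, the retriever should return at most 13 chunks; leaving some slack below that is safer because real chunk sizes vary.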

Code: broken vs fixed

Broken - triggers the error

python

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI  # gpt-4o is a chat model, so use ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", max_tokens=8192)
retriever = ...  # returns many large chunks
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# This line raises ValueError due to too many tokens in retrieved chunks
result = qa.invoke({"query": "Explain the document contents."})

Fixed - works correctly

python

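from typing import List

import tiktoken
from langchain.chains import RetrievalQA
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI

# ChatOpenAI reads OPENAI_API_KEY from the environment; no need to set it in code.
# max_tokens caps the *answer*, and the answer must also fit in the context window.
llm = ChatOpenAI(model="gpt-4o", max_tokens=1024)
retriever = ...  # configure retriever with smaller chunk size or limit

encoding = tiktoken.encoding_for_model("gpt-4o")  # count real tokens, not words


class TokenBudgetRetriever(BaseRetriever):
    """Wraps another retriever and stops adding chunks once a token budget is used up."""

    base_retriever: BaseRetriever
    max_tokens: int = 7000  # leave room for the prompt template, question, and answer

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        docs = self.base_retriever.invoke(query)
        total_tokens = 0
        limited_docs = []
        for doc in docs:
            doc_tokens = len(encoding.encode(doc.page_content))
            if total_tokens + doc_tokens > self.max_tokens:
                break  # the next chunk would overflow the budget
            limited_docs.append(doc)
            total_tokens += doc_tokens
        return limited_docs


# RetrievalQA needs a BaseRetriever instance, not a plain function
qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=TokenBudgetRetriever(base_retriever=retriever)
)

result = qa.invoke({"query": "Explain the document contents."})
print(result["result"])  # Works without exceeding context window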

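The fix wraps the original retriever in a small BaseRetriever subclass that counts each chunk's tokens with the model's tokenizer and stops adding chunks once a 7,000-token budget is reached. The retrieved context, the prompt template, the question, and the 1,024-token answer then fit within an 8,192-token window, which prevents the ValueError.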


Workaround

Catch the ValueError and retry with fewer or smaller chunks, or truncate the retrieved documents before passing them to the LLM.
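A minimal sketch of the retry approach, assuming the chunks arrive ordered from most to least relevant and that the LLM call raises `ValueError` on overflow, as in the trace above. `answer_with_fallback` and `call_llm` are hypothetical names; `call_llm` stands in for whatever function sends the chunks and question to your model.

```python
from typing import Callable, List, Sequence


def answer_with_fallback(
    call_llm: Callable[[List[str], str], str],
    chunks: Sequence[str],
    question: str,
) -> str:
    """Call the LLM, halving the number of chunks each time the context overflows."""
    kept = list(chunks)
    while True:
        try:
            return call_llm(kept, question)
        except ValueError:
            if len(kept) <= 1:
                raise  # even one chunk does not fit; truncate it or shrink the prompt
            kept = kept[: len(kept) // 2]  # keep the most relevant half


# Usage with a fake LLM that accepts at most 3 chunks.
def fake_llm(chunks: List[str], question: str) -> str:
    if len(chunks) > 3:
        raise ValueError("Retrieved document chunks exceed the model's maximum context window")
    return f"answered with {len(chunks)} chunks"


docs = [f"chunk {i}" for i in range(10)]
print(answer_with_fallback(fake_llm, docs, "Explain the document contents."))
# prints "answered with 2 chunks" (10 -> 5 -> 2)
```

Halving converges in a few calls but can discard more context than necessary; catching every ValueError can also hide unrelated errors, so checking the token count up front is the more reliable fix.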


Prevention

Design your retrieval pipeline to estimate token counts of chunks and total context size before LLM calls, and use models with context windows that match your retrieval scale.

Python 3.9+ · langchain-core >=0.1.0 · tested on 0.2.x

Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-haiku-20241022
