ValueError
builtins.ValueError
Stack trace
ValueError: Retrieved document chunks exceed the model's maximum context window size of 8192 tokens.
Why it happens
RAG pipelines concatenate multiple retrieved document chunks into the context passed to the LLM. If the combined token count of these chunks, together with the rest of the prompt, exceeds the model's maximum context window, the LLM cannot process the input and this error is raised. Typical triggers are chunks that are too large, too many retrieved chunks, or a model whose context window is smaller than expected.
Detection
Monitor the total token count of retrieved chunks before passing them to the LLM. Log or assert if the combined tokens exceed the model's max context window to catch this early.
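As a minimal sketch of such a check, the helper below sums an estimated token count over the retrieved chunks and logs a warning when the limit is exceeded. The names `approx_token_count` and `context_fits` are hypothetical, and the 4-characters-per-token estimate is only a rough heuristic for English text; for real counts you would pass in the tokenizer that matches your model as `count_tokens`.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("rag.context")


def approx_token_count(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def context_fits(
    chunks: Sequence[str],
    max_context_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> bool:
    """Return True if the combined chunks fit; log a warning and return False otherwise."""
    total = sum(count_tokens(chunk) for chunk in chunks)
    if total > max_context_tokens:
        logger.warning(
            "Retrieved chunks use %d tokens, over the %d-token limit",
            total,
            max_context_tokens,
        )
        return False
    return True
```

Calling `context_fits(chunks, 8192)` just before the LLM call lets you log, assert, or trim the chunks instead of letting the request fail.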
Causes & fixes
Retrieved document chunks are too large individually or too many chunks are retrieved, exceeding the model's context window.
Reduce the chunk size during document splitting or limit the number of retrieved chunks to fit within the model's maximum context window.
Using a model with a smaller context window than expected for your retrieval setup.
Switch to a model with a larger context window (e.g., gpt-4o with a 128,000-token context window) or adjust retrieval parameters to fit the smaller window.
Not accounting for prompt tokens and other input tokens when calculating total context size.
Calculate total tokens including the prompt, retrieved chunks, any system messages, and the tokens reserved for the model's completion, and ensure the sum fits within the model's context window.
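The accounting described above can be sketched as simple arithmetic. The function names and the example numbers (150 system tokens, 50 prompt tokens, 1024 reserved for the completion, 512-token chunks) are illustrative assumptions, not values from any particular pipeline. The sketch also assumes the completion counts against the same window as the input, which is true for many hosted APIs but worth confirming for your model.

```python
def chunk_token_budget(
    context_window: int,
    system_tokens: int,
    prompt_tokens: int,
    reserved_output_tokens: int,
) -> int:
    """Tokens left for retrieved chunks after all other inputs and the completion."""
    budget = context_window - system_tokens - prompt_tokens - reserved_output_tokens
    if budget <= 0:
        raise ValueError("No room left for retrieved chunks in the context window")
    return budget


def max_chunks(budget: int, chunk_size_tokens: int) -> int:
    """Largest number of fixed-size chunks that fits in the budget."""
    return budget // chunk_size_tokens


budget = chunk_token_budget(
    context_window=8192,
    system_tokens=150,
    prompt_tokens=50,
    reserved_output_tokens=1024,
)
print(budget)                   # 6968
print(max_chunks(budget, 512))  # 13
```

With these example numbers, retrieving more than 13 chunks of 512 tokens would overflow an 8192-token window, so either the chunk size or the number of retrieved chunks has to come down.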
Code: broken vs fixed
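# Broken: every retrieved chunk is passed to the LLM unchecked
# (import paths vary across LangChain versions)
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# gpt-4o is a chat model; max_tokens caps the completion, not the context window
llm = ChatOpenAI(model="gpt-4o", max_tokens=1024)
retriever = ...  # returns many large chunks
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
# This line raises ValueError due to too many tokens in retrieved chunks
result = qa.invoke({"query": "Explain the document contents."})

# Fixed: wrap the retriever so only chunks that fit the budget reach the LLM
import os
from typing import List

from langchain.chains import RetrievalQA
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI

# Read the API key from the environment and fail early if it is missing
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable")

llm = ChatOpenAI(model="gpt-4o", max_tokens=1024)
base_retriever = ...  # configure retriever with smaller chunk size or limit


class LimitedRetriever(BaseRetriever):
    """Wrap a retriever and keep only the leading chunks that fit a budget."""

    retriever: BaseRetriever
    # Counted in words, which undercount tokens, so stay well below the window
    max_chunk_tokens: int = 5000

    def _get_relevant_documents(self, query: str, *, run_manager) -> List[Document]:
        docs = self.retriever.invoke(query)
        total_tokens = 0
        limited_docs = []
        for doc in docs:
            doc_tokens = len(doc.page_content.split())  # approximate token count
            if total_tokens + doc_tokens > self.max_chunk_tokens:
                break
            limited_docs.append(doc)
            total_tokens += doc_tokens
        return limited_docs


# RetrievalQA expects a BaseRetriever instance, not a plain function
qa = RetrievalQA.from_chain_type(llm=llm, retriever=LimitedRetriever(retriever=base_retriever))
result = qa.invoke({"query": "Explain the document contents."})
print(result["result"])  # Works without exceeding context window
Workaround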
Catch the ValueError and retry retrieval with fewer or smaller chunks, or truncate retrieved documents before passing to the LLM.
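A minimal sketch of the retry approach follows, assuming two hypothetical callables: `retrieve(query, k)` returns the top `k` chunks, and `generate(query, chunks)` calls the LLM and raises `ValueError` when the context is too large. Neither is a real library function; adapt them to your own pipeline.

```python
from typing import Callable, List


def answer_with_retry(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    k: int = 8,
    min_k: int = 1,
) -> str:
    """Retry generation with half as many chunks each time the context overflows."""
    while True:
        chunks = retrieve(query, k)
        try:
            return generate(query, chunks)
        except ValueError:
            # In real code, inspect the message so unrelated ValueErrors are not swallowed
            if k <= min_k:
                raise  # already at the minimum; give up and surface the error
            k = max(min_k, k // 2)
```

Halving `k` keeps the number of retries small. If a single chunk can overflow the window on its own, this loop ends by re-raising the error, and truncating the chunk text is the remaining option.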
Prevention
Design your retrieval pipeline to estimate token counts of chunks and total context size before LLM calls, and use models with context windows that match your retrieval scale.