How to handle documents longer than context window
Why this happens
Large language models have a fixed context window, ranging from a few thousand tokens on older models to 128,000 or more on recent ones. When you input a document longer than this limit, the model either truncates the input or the API returns an error. For example, sending a 10,000-token document to a model with an 8,192-token context window will cause truncation or an outright failure.
Typical symptoms are a context-length-exceeded error from the API, or silent truncation that leads to incomplete or incorrect responses.
Example of problematic code:
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
long_text = """Very long document text exceeding the model's context window..."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": long_text}]
)
print(response.choices[0].message.content)

Error: ContextLengthExceeded: input tokens exceed model's max context window
The fix
Split the document into chunks smaller than the model's context window (e.g., 2,000 tokens) and process each chunk separately. You can then aggregate or summarize the outputs.
Alternatively, use retrieval-augmented generation (RAG) where you index the document chunks and retrieve only relevant parts for the query, keeping input size manageable.
This approach works because the model never receives input exceeding its token limit, avoiding truncation or errors.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def chunk_text(text, max_tokens=2000):
    # Simple whitespace chunking (use a tokenizer like tiktoken for accuracy)
    words = text.split()
    chunks = []
    current_chunk = []
    current_len = 0
    for word in words:
        current_len += 1
        current_chunk.append(word)
        if current_len >= max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_len = 0
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
long_text = """Very long document text exceeding the model's context window..."""
chunks = chunk_text(long_text)
responses = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": chunk}]
    )
    responses.append(response.choices[0].message.content)
# Aggregate or summarize responses
final_summary = "\n".join(responses)
print(final_summary)

Summary or processed output for each chunk combined ...
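The RAG alternative can be sketched with a toy retriever. A real pipeline would use an embedding model and a vector database; the bag-of-words cosine similarity below is only a stand-in to show the shape of the retrieval step, and the example chunks are hypothetical:

```python
from collections import Counter
import math

def cosine_sim(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(chunks, query, top_k=2):
    # Rank chunks by similarity to the query and keep the best top_k;
    # only these retrieved chunks would be sent to the model.
    q_vec = Counter(query.lower().split())
    scored = [(cosine_sim(Counter(c.lower().split()), q_vec), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:top_k]]

chunks = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping details: orders ship within 2 business days.",
    "Warranty terms: hardware is covered for one year.",
]
relevant = retrieve(chunks, "How long do I have to return an item?", top_k=1)
```

Swapping `cosine_sim` over word counts for embedding-based similarity from a vector store gives the production version of the same flow, without ever sending the full document to the model.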
Preventing it in production
- Implement input validation to check token length before sending to the model.
- Use chunking libraries or tokenizers (like tiktoken) to split inputs precisely.
- Integrate a vector database (e.g., Pinecone, FAISS) for RAG workflows to retrieve relevant document parts dynamically.
- Apply hierarchical summarization: summarize chunks first, then summarize summaries to reduce input size.
- On a context-length error, retry with a summarized or further-chunked input as a fallback.
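A minimal pre-flight validation check might look like the sketch below. The character-based estimate (roughly four characters per token for English text) is a rough heuristic used here so the example is self-contained; production code should count exactly with a tokenizer such as tiktoken, and the 8,192-token limit is an assumed value you should replace with your model's actual context window:

```python
MAX_INPUT_TOKENS = 8192   # assumed limit; look up your model's actual context window
CHARS_PER_TOKEN = 4       # rough heuristic for English text; use tiktoken for exact counts

def estimate_tokens(text):
    # Approximate token count from character length
    return len(text) // CHARS_PER_TOKEN

def validate_input(text, max_tokens=MAX_INPUT_TOKENS):
    # Fail fast client-side instead of letting the API request fail server-side
    estimated = estimate_tokens(text)
    if estimated > max_tokens:
        raise ValueError(
            f"Input is ~{estimated} tokens, above the {max_tokens}-token limit; "
            "chunk or summarize it first."
        )

validate_input("short prompt")      # passes silently
# validate_input("x" * 100_000)     # would raise ValueError
```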
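The hierarchical summarization bullet can be structured as a map-reduce loop: summarize each chunk, then summarize groups of summaries until one remains. In this sketch, `summarize` is a caller-supplied function standing in for a model call; the toy version below just truncates text so the scaffolding runs without an API key:

```python
def hierarchical_summarize(chunks, summarize, group_size=3):
    # Map step: summarize each chunk individually
    summaries = [summarize(c) for c in chunks]
    # Reduce step: repeatedly summarize groups of summaries
    # until a single top-level summary remains
    while len(summaries) > 1:
        summaries = [
            summarize(" ".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]
    return summaries[0]

# Stand-in for a model call: keep the first 50 characters
toy_summarize = lambda text: text[:50]
final = hierarchical_summarize([f"chunk {i} content" for i in range(9)], toy_summarize)
```

In production, `summarize` would be a chat-completion call with a summarization prompt; each call's input stays well under the context window because it only ever sees one group of summaries at a time.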
Key Takeaways
- Always split documents into chunks smaller than the model's context window before inference.
- Use retrieval-augmented generation to dynamically fetch relevant document parts for queries.
- Validate input token length and apply summarization to handle very long documents efficiently.