How to handle documents longer than context window
Why this happens
Large language models have a fixed context window, ranging from a few thousand tokens on older models to 128,000 or more on recent ones. When you input a document longer than this limit, the model either truncates the input or the API returns an error. For example, sending a 10,000-token document to a model with an 8,192-token context window will cause truncation or an outright failure.
Typical symptoms are a context-length-exceeded error from the API, or silent truncation that leads to incomplete or incorrect responses.
Example of problematic code:
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
long_text = """Very long document text exceeding the model's context window..."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": long_text}]
)
print(response.choices[0].message.content)

Error: ContextLengthExceeded: input tokens exceed model's max context window
The fix
Split the document into chunks smaller than the model's context window (e.g., 2,000 tokens) and process each chunk separately. You can then aggregate or summarize the outputs.
Alternatively, use retrieval-augmented generation (RAG) where you index the document chunks and retrieve only relevant parts for the query, keeping input size manageable.
This approach works because the model never receives input exceeding its token limit, avoiding truncation or errors.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def chunk_text(text, max_tokens=2000):
    # Simple whitespace chunking (use a tokenizer like tiktoken for accuracy)
    words = text.split()
    chunks = []
    current_chunk = []
    current_len = 0
    for word in words:
        current_len += 1
        current_chunk.append(word)
        if current_len >= max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_len = 0
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
long_text = """Very long document text exceeding the model's context window..."""
chunks = chunk_text(long_text)
responses = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": chunk}]
    )
    responses.append(response.choices[0].message.content)
# Aggregate or summarize responses
final_summary = "\n".join(responses)
print(final_summary)

Summary or processed output for each chunk combined ...
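The RAG alternative can be sketched with a toy retriever. A real pipeline would use an embedding model and a vector database; the bag-of-words cosine similarity below is only a stand-in to show the shape of the retrieval step, and the example chunks are hypothetical:

```python
from collections import Counter
import math

def cosine_sim(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(chunks, query, top_k=2):
    # Rank chunks by similarity to the query and keep the best top_k;
    # only these retrieved chunks would be sent to the model.
    q_vec = Counter(query.lower().split())
    scored = [(cosine_sim(Counter(c.lower().split()), q_vec), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:top_k]]

chunks = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping details: orders ship within 2 business days.",
    "Warranty terms: hardware is covered for one year.",
]
relevant = retrieve(chunks, "How long do I have to return an item?", top_k=1)
```

Swapping `cosine_sim` over word counts for embedding-based similarity from a vector store gives the production version of the same flow, without ever sending the full document to the model.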
Preventing it in production
- Implement input validation to check token length before sending to the model.
- Use chunking libraries or tokenizers (like tiktoken) to split inputs precisely.
- Integrate a vector database (e.g., Pinecone, FAISS) for RAG workflows to retrieve relevant document parts dynamically.
- Apply hierarchical summarization: summarize chunks first, then summarize summaries to reduce input size.
- On a context-length error, retry with a summarized or further-chunked input as a fallback.
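A minimal pre-flight validation check might look like the sketch below. The character-based estimate (roughly four characters per token for English text) is a rough heuristic used here so the example is self-contained; production code should count exactly with a tokenizer such as tiktoken, and the 8,192-token limit is an assumed value you should replace with your model's actual context window:

```python
MAX_INPUT_TOKENS = 8192   # assumed limit; look up your model's actual context window
CHARS_PER_TOKEN = 4       # rough heuristic for English text; use tiktoken for exact counts

def estimate_tokens(text):
    # Approximate token count from character length
    return len(text) // CHARS_PER_TOKEN

def validate_input(text, max_tokens=MAX_INPUT_TOKENS):
    # Fail fast client-side instead of letting the API request fail server-side
    estimated = estimate_tokens(text)
    if estimated > max_tokens:
        raise ValueError(
            f"Input is ~{estimated} tokens, above the {max_tokens}-token limit; "
            "chunk or summarize it first."
        )

validate_input("short prompt")      # passes silently
# validate_input("x" * 100_000)     # would raise ValueError
```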
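The hierarchical summarization bullet can be structured as a map-reduce loop: summarize each chunk, then summarize groups of summaries until one remains. In this sketch, `summarize` is a caller-supplied function standing in for a model call; the toy version below just truncates text so the scaffolding runs without an API key:

```python
def hierarchical_summarize(chunks, summarize, group_size=3):
    # Map step: summarize each chunk individually
    summaries = [summarize(c) for c in chunks]
    # Reduce step: repeatedly summarize groups of summaries
    # until a single top-level summary remains
    while len(summaries) > 1:
        summaries = [
            summarize(" ".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]
    return summaries[0]

# Stand-in for a model call: keep the first 50 characters
toy_summarize = lambda text: text[:50]
final = hierarchical_summarize([f"chunk {i} content" for i in range(9)], toy_summarize)
```

In production, `summarize` would be a chat-completion call with a summarization prompt; each call's input stays well under the context window because it only ever sees one group of summaries at a time.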
Key Takeaways
- Always split documents into chunks smaller than the model's context window before inference.
- Use retrieval-augmented generation to dynamically fetch relevant document parts for queries.
- Validate input token length and apply summarization to handle very long documents efficiently.