
How to handle documents longer than context window

Quick answer
Documents longer than a model's context window must be split into smaller chunks or processed with techniques like retrieval-augmented generation (RAG) or hierarchical summarization. These methods ensure the model only sees manageable input sizes within its token limit.
ERROR TYPE model_behavior
⚡ QUICK FIX
Split the document into smaller chunks fitting within the context window and process them sequentially or use a retrieval system to fetch relevant parts dynamically.

Why this happens

Large language models have a fixed context window, commonly ranging from 4,096 tokens on older models to 128,000 or more on recent ones. When you send a document longer than this limit, the request either fails with an error or the input is silently truncated. For example, sending a 10,000-token document to a model with an 8,192-token limit will cause truncation or an outright failure.

The typical symptom is a context-length error from the API (the OpenAI SDK surfaces this as a 400 BadRequestError with the code context_length_exceeded) or silent truncation that leads to incomplete or incorrect responses.
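You can catch the problem before it reaches the API by estimating the token count up front. The sketch below uses a rough 4-characters-per-token heuristic (a common rule of thumb for English text); a real tokenizer such as tiktoken gives exact counts, and the 8,192 limit here is purely illustrative.

python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use a real tokenizer (e.g., tiktoken) for exact counts.
    return max(1, len(text) // 4)

def fits_context(text: str, max_tokens: int = 8192) -> bool:
    # In practice, leave extra headroom for the system prompt and the reply.
    return estimate_tokens(text) <= max_tokens

short_doc = "A short note."
long_doc = "word " * 50_000  # ~250,000 characters
print(fits_context(short_doc))  # True
print(fits_context(long_doc))   # False

A check like this lets you decide whether to send the document as-is or fall back to chunking before any request is made.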

Example of problematic code:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

long_text = """Very long document text exceeding the model's context window..."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_text}]
)
print(response.choices[0].message.content)
output
openai.BadRequestError: Error code: 400 - context_length_exceeded: input tokens exceed the model's maximum context length

The fix

Split the document into chunks smaller than the model's context window (e.g., 2,000 tokens) and process each chunk separately. You can then aggregate or summarize the outputs.

Alternatively, use retrieval-augmented generation (RAG) where you index the document chunks and retrieve only relevant parts for the query, keeping input size manageable.

This approach works because the model never receives input exceeding its token limit, avoiding truncation or errors.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text, max_words=1500):
    # Naive whitespace chunking: words are not tokens, so stay well below
    # the real limit (use a tokenizer such as tiktoken for exact counts).
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

long_text = """Very long document text exceeding the model's context window..."""
chunks = chunk_text(long_text)

responses = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Summarize the following text:\n\n{chunk}",
        }]
    )
    responses.append(response.choices[0].message.content)

# Aggregate the per-chunk summaries (or summarize them again for brevity)
final_summary = "\n".join(responses)
print(final_summary)
output
Summary or processed output for each chunk combined
...
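The RAG alternative can be sketched without any external services. Below, a toy bag-of-words vector stands in for a real embedding model (in production you would call an embeddings API and store vectors in FAISS or Pinecone); only the top-scoring chunk is sent to the model, so the prompt stays within the context window regardless of document length.

python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" for illustration only; a real system
    # would call an embedding model instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, query, top_k=1):
    # Rank chunks by similarity to the query and keep the best ones.
    vectors = [embed(c) for c in chunks]
    q = embed(query)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(vectors[i], q),
        reverse=True,
    )
    return [chunks[i] for i in ranked[:top_k]]

chunks = [
    "Billing: invoices are issued on the first of each month.",
    "Security: rotate API keys every 90 days.",
    "Support: contact the help desk for login issues.",
]
print(retrieve(chunks, "how often should I rotate my API keys?"))

Only the retrieved chunk, not the whole document, is then passed to the chat completion call shown above.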

Preventing it in production

  • Implement input validation to check token length before sending to the model.
  • Use chunking libraries or tokenizers (like tiktoken) to split inputs precisely.
  • Integrate a vector database (e.g., Pinecone, FAISS) for RAG workflows to retrieve relevant document parts dynamically.
  • Apply hierarchical summarization: summarize chunks first, then summarize summaries to reduce input size.
  • On a context-length error, retry with further-chunked or summarized input instead of failing outright.
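The hierarchical-summarization bullet above amounts to a map-reduce loop: summarize each chunk, then re-summarize the concatenated summaries until the result fits. In this sketch the summarize callable is a placeholder (shown with a trivial truncating stub); in practice it would wrap a chat-completion call like the one in the fix, and it must always return something shorter than its input or the loop will not terminate.

python
def hierarchical_summarize(chunks, summarize, max_chars=500):
    # Map step: summarize each chunk independently.
    summaries = [summarize(c) for c in chunks]
    combined = "\n".join(summaries)
    # Reduce step: keep re-summarizing until the text is small enough.
    while len(combined) > max_chars:
        combined = summarize(combined)
    return combined

# Stub summarizer for illustration; a real one would call the model, e.g.
# client.chat.completions.create(model="gpt-4o", messages=[...]).
def stub_summarize(text):
    return text[: len(text) // 2]

result = hierarchical_summarize(["alpha " * 200, "beta " * 200], stub_summarize)
print(len(result) <= 500)  # True

Each individual call stays under the context limit, so arbitrarily long documents reduce to a single short summary.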

Key Takeaways

  • Always split documents into chunks smaller than the model's context window before inference.
  • Use retrieval-augmented generation to dynamically fetch relevant document parts for queries.
  • Validate input token length and apply summarization to handle very long documents efficiently.
Verified 2026-04 · gpt-4o, gpt-4o-mini