How to handle long documents in prompts
Applies to gpt-4o and similar models.

Why this happens
Large documents can exceed a model's context window, causing context_length_exceeded errors or truncated output. Every model has a fixed maximum context shared between the prompt and the completion (128,000 tokens for gpt-4o, with completion length further capped at a few thousand tokens; older models have much smaller windows). Once the document plus the requested output exceeds that window, the request either fails outright or leaves too little room for a complete response.
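A quick pre-flight check makes the failure mode concrete. The sketch below uses the common rough heuristic of ~4 characters per token (an exact count requires a tokenizer such as tiktoken); the 128,000-token window matches gpt-4o's documented limit, and the output reserve is an assumed headroom value:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken with the model's encoding.
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 128_000  # gpt-4o's context window (input + output)

def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    # Leave headroom for the completion so the reply is not truncated.
    return estimate_tokens(text) <= CONTEXT_WINDOW - reserve_for_output
```

A document that fails this check is exactly the case the broken code below runs into.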
Example of broken code that sends an oversized document in a single prompt:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

long_document = """Very long text exceeding model token limit..."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_document}],
)
print(response.choices[0].message.content)
# Error: context_length_exceeded, or truncated output
```
The fix
Split the document into chunks that fit within the model's context window, then process each chunk separately or summarize the chunks before combining them. This keeps every request under the limit while preserving output quality.
Example code chunking a long document and summarizing each part with gpt-4o:
```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text, max_tokens=2000):
    """Split text into chunks of roughly max_tokens each.

    Uses the ~4 characters per token heuristic; for exact counts,
    use a tokenizer such as tiktoken.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_len = 0
    for word in words:
        word_tokens = max(1, len(word) // 4)  # approximate token count
        if current_len + word_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_len = word_tokens
        else:
            current_chunk.append(word)
            current_len += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

long_document = """Very long text exceeding model token limit..."""

chunks = chunk_text(long_document)
summaries = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize this text:\n{chunk}"}],
    )
    summaries.append(response.choices[0].message.content)

final_summary = "\n".join(summaries)
print(final_summary)
# Output: concise summaries of each chunk, joined into a final summary
```
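If the per-chunk summaries are still too long to combine in one pass, a hierarchical (map-reduce style) reduction can re-chunk and re-summarize until the text fits a budget. A minimal sketch, with the model call abstracted behind a summarize callable (a hypothetical parameter, so the reduction logic stays independent of any SDK):

```python
def reduce_summaries(summaries, summarize, max_chars=8000, max_rounds=5):
    # Map-reduce style reduction: keep combining and re-summarizing
    # until the combined text fits within max_chars (~max_chars/4 tokens).
    combined = "\n".join(summaries)
    for _ in range(max_rounds):
        if len(combined) <= max_chars:
            break
        # Split the oversized text into fixed-size pieces and summarize each.
        pieces = [combined[i:i + max_chars]
                  for i in range(0, len(combined), max_chars)]
        combined = "\n".join(summarize(piece) for piece in pieces)
    return combined
```

Here summarize would wrap the client.chat.completions.create call from the example above; the max_rounds guard prevents an infinite loop if a summary fails to shrink.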
Preventing it in production
Implement automatic chunking and summarization pipelines before sending documents to the model. Use retrieval-augmented generation (RAG) to fetch only the relevant parts of a document at query time. Validate token counts before each request and fall back to chunking when the limit would be exceeded. Retry transient errors with exponential backoff.
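The retry advice above can be sketched as a small wrapper with exponential backoff. In production code you would catch the SDK's specific transient error types (for example, openai.RateLimitError) rather than a bare Exception; the wrapper below stays generic for illustration:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    # Retry a callable, doubling the delay after each failed attempt.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like with_retries(lambda: client.chat.completions.create(...)), wrapping each per-chunk request from the fix above.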
Key Takeaways
- Always split long documents into chunks smaller than the model's max token limit before prompting.
- Use summarization on chunks to condense information and reduce token usage.
- Implement retrieval-augmented generation to dynamically fetch relevant document parts.
- Validate prompt token length programmatically to avoid errors in production.
- Add retry and fallback logic to handle transient API errors gracefully.