How to handle long documents in prompts
Applies to gpt-4o and similar models.

Why this happens
Large documents can exceed a model's context window, causing context_length_exceeded errors or truncated output. Every model has a fixed maximum context shared between the prompt and the completion (128,000 tokens for gpt-4o, with completion length further capped at a few thousand tokens; older models have much smaller windows). Once the document plus the requested output exceeds that window, the request either fails outright or leaves too little room for a complete response.
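A quick pre-flight check makes the failure mode concrete. The sketch below uses the common rough heuristic of ~4 characters per token (an exact count requires a tokenizer such as tiktoken); the 128,000-token window matches gpt-4o's documented limit, and the output reserve is an assumed headroom value:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken with the model's encoding.
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 128_000  # gpt-4o's context window (input + output)

def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    # Leave headroom for the completion so the reply is not truncated.
    return estimate_tokens(text) <= CONTEXT_WINDOW - reserve_for_output
```

A document that fails this check is exactly the case the broken code below runs into.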
Example of broken code that sends an oversized document in a single prompt:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

long_document = """Very long text exceeding model token limit..."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_document}],
)
print(response.choices[0].message.content)
# Error: context_length_exceeded, or truncated output
```
The fix
Split the document into chunks that fit within the model's context window, then process each chunk separately or summarize the chunks before combining them. This keeps every request under the limit while preserving output quality.
Example code chunking a long document and summarizing each part with gpt-4o:
```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text, max_tokens=2000):
    """Split text into chunks of roughly max_tokens each.

    Uses the ~4 characters per token heuristic; for exact counts,
    use a tokenizer such as tiktoken.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_len = 0
    for word in words:
        word_tokens = max(1, len(word) // 4)  # approximate token count
        if current_len + word_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_len = word_tokens
        else:
            current_chunk.append(word)
            current_len += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

long_document = """Very long text exceeding model token limit..."""

chunks = chunk_text(long_document)
summaries = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize this text:\n{chunk}"}],
    )
    summaries.append(response.choices[0].message.content)

final_summary = "\n".join(summaries)
print(final_summary)
# Output: concise summaries of each chunk, joined into a final summary
```
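If the per-chunk summaries are still too long to combine in one pass, a hierarchical (map-reduce style) reduction can re-chunk and re-summarize until the text fits a budget. A minimal sketch, with the model call abstracted behind a summarize callable (a hypothetical parameter, so the reduction logic stays independent of any SDK):

```python
def reduce_summaries(summaries, summarize, max_chars=8000, max_rounds=5):
    # Map-reduce style reduction: keep combining and re-summarizing
    # until the combined text fits within max_chars (~max_chars/4 tokens).
    combined = "\n".join(summaries)
    for _ in range(max_rounds):
        if len(combined) <= max_chars:
            break
        # Split the oversized text into fixed-size pieces and summarize each.
        pieces = [combined[i:i + max_chars]
                  for i in range(0, len(combined), max_chars)]
        combined = "\n".join(summarize(piece) for piece in pieces)
    return combined
```

Here summarize would wrap the client.chat.completions.create call from the example above; the max_rounds guard prevents an infinite loop if a summary fails to shrink.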
Preventing it in production
Implement automatic chunking and summarization pipelines before sending documents to the model. Use retrieval-augmented generation (RAG) to fetch only the relevant parts of a document at query time. Validate token counts before each request and fall back to chunking when the limit would be exceeded. Retry transient errors with exponential backoff.
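The retry advice above can be sketched as a small wrapper with exponential backoff. In production code you would catch the SDK's specific transient error types (for example, openai.RateLimitError) rather than a bare Exception; the wrapper below stays generic for illustration:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    # Retry a callable, doubling the delay after each failed attempt.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like with_retries(lambda: client.chat.completions.create(...)), wrapping each per-chunk request from the fix above.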
Key Takeaways
- Always split long documents into chunks smaller than the model's max token limit before prompting.
- Use summarization on chunks to condense information and reduce token usage.
- Implement retrieval-augmented generation to dynamically fetch relevant document parts.
- Validate prompt token length programmatically to avoid errors in production.
- Add retry and fallback logic to handle transient API errors gracefully.