
How to handle the OpenAI context length exceeded error

Quick answer
The context length exceeded error occurs when the total tokens in your prompt and conversation history exceed the model's maximum context window. To fix it, truncate or summarize earlier messages so the request stays within the token limit before calling client.chat.completions.create.
ERROR TYPE invalid_request_error
⚡ QUICK FIX
Truncate or summarize conversation history to keep total tokens under the model's max context length before sending the request.

Why this happens

The context length exceeded error arises when the combined tokens of your prompt, system instructions, and conversation history exceed the model's maximum context window (e.g., 8,192 tokens for gpt-4; gpt-4o supports up to 128,000). It typically happens in chat applications that accumulate long message histories without pruning.

Example error output:

{"error": {"message": "This model's maximum context length is 8192 tokens, but you requested 9000 tokens.", "type": "invalid_request_error"}}
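You can estimate request size yourself before sending. The sketch below uses the rough rule of thumb of ~4 characters per token for English text, plus a few tokens of per-message overhead; both numbers are assumptions, and a real tokenizer such as tiktoken gives exact counts:

```python
def estimate_tokens(messages):
    """Very rough token estimate: ~4 chars per token (an assumption),
    plus ~4 tokens of per-message formatting overhead."""
    text = "".join(m["content"] for m in messages)
    return len(text) // 4 + 4 * len(messages)

# The same 1,000-message history as the broken example below
msgs = [{"role": "user", "content": "Long conversation message repeated many times..."}] * 1000
print(estimate_tokens(msgs))  # ~16,000 tokens, well over an 8,192-token window
```

Even this crude estimate makes it obvious the request cannot fit an 8,192-token context.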

Broken code example that triggers this error:

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
] + [{"role": "user", "content": "Long conversation message repeated many times..."}] * 1000

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
print(response.choices[0].message.content)
output
{"error": {"message": "This model's maximum context length is 8192 tokens, but you requested 9000 tokens.", "type": "invalid_request_error"}}

The fix

To fix the error, truncate or summarize the conversation history so the total tokens fit within the model's limit. This can be done by keeping only the most recent messages or summarizing older ones.

The corrected code keeps the system prompt plus only the last 10 messages before sending:

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simulated long conversation history
full_history = [
    {"role": "system", "content": "You are a helpful assistant."},
] + [{"role": "user", "content": f"Message {i}"} for i in range(1000)]

# Keep only the last 10 messages plus system prompt
messages = full_history[:1] + full_history[-10:]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
print(response.choices[0].message.content)
output
Assistant's response based on last 10 messages
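A fixed slice like full_history[-10:] can still overflow when individual messages are long. A more robust variant trims by token budget rather than message count. The helper below is a sketch: trim_messages and the ~4-token per-message overhead are assumptions, and the whitespace-based counter is a crude stand-in for a real tokenizer such as tiktoken:

```python
def trim_messages(messages, max_tokens, count_tokens):
    """Keep system messages and drop the oldest user/assistant turns
    until the estimated total fits within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        # ~4 tokens of per-message overhead is a rough assumption
        return sum(count_tokens(m["content"]) + 4 for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message first
    return system + rest

# Crude stand-in counter; swap in tiktoken for real token counts
approx = lambda text: len(text.split())

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"Message {i}"} for i in range(1000)]

trimmed = trim_messages(history, max_tokens=100, count_tokens=approx)
print(len(trimmed))  # system prompt plus the most recent messages that fit
```

Because trimming is driven by the budget rather than a fixed count, the same helper works whether the history holds short questions or long pasted documents.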

Preventing it in production

Implement these strategies to avoid context length errors in production:

  • Token counting: Use tokenizers (like tiktoken) to count tokens before sending requests.
  • Truncation: Automatically truncate or summarize older messages to keep total tokens under the limit.
  • Retries and fallbacks: Catch invalid_request_error and retry with reduced context.
  • Model selection: Choose models with larger context windows if your use case requires long histories.
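The retry-and-fallback strategy above can be sketched as follows. In real code you would catch openai.BadRequestError and check for a context-length failure; here ContextTooLong and fake_send are hypothetical stand-ins so the retry logic is runnable without an API key:

```python
class ContextTooLong(Exception):
    """Stand-in for the API's context-length rejection."""

def send_with_context_fallback(send, messages, max_retries=5):
    """Retry a chat call, dropping the oldest half of the non-system
    history each time the request is rejected as too long."""
    for _ in range(max_retries):
        try:
            return send(messages)
        except ContextTooLong:
            head, rest = messages[:1], messages[1:]  # assumes system message is first
            if len(rest) <= 1:
                raise  # nothing left to drop
            messages = head + rest[len(rest) // 2:]  # keep the newest half
    raise RuntimeError("request still exceeds context after retries")

# Fake "API" that accepts at most 5 messages, for demonstration only
def fake_send(messages):
    if len(messages) > 5:
        raise ContextTooLong()
    return f"ok ({len(messages)} messages)"

msgs = [{"role": "system", "content": "sys"}] + [
    {"role": "user", "content": f"m{i}"} for i in range(20)
]
print(send_with_context_fallback(fake_send, msgs))  # → ok (4 messages)
```

Halving the history on each failure converges quickly, so even a badly oversized request settles within a few retries instead of looping indefinitely.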

Key Takeaways

  • Always monitor and limit total tokens sent to the model to avoid context length errors.
  • Use token counting libraries to programmatically manage prompt size before API calls.
  • Implement automatic truncation or summarization of conversation history in chat apps.
  • Handle invalid_request_error gracefully with retries and context reduction.
  • Select models with appropriate context windows based on your application's needs.
Verified 2026-04 · gpt-4o