High severity HTTP 400 intermediate · Fix: 15-30 min

InvalidRequestError

openai.BadRequestError (HTTP 400: context_length_exceeded); raised as openai.InvalidRequestError by pre-1.0 SDKs

What this error means
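The OpenAI o1 model has a 200,000-token context window. A request whose messages exceed it is rejected with HTTP 400 and the error code 'context_length_exceeded'. The v1 Python SDK (openai >= 1.0) raises this as openai.BadRequestError; pre-1.0 SDKs raised openai.InvalidRequestError.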

Stack trace

traceback
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 200000 tokens. However, your messages resulted in 245630 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
QUICK FIX
Use RAG (vector search over document chunks) instead of passing full documents, and apply a sliding window to the messages array so that conversation history stays under 150K tokens in total.
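
The sliding window can be sketched as follows. This is a minimal illustration, not a definitive implementation: `trim_history`, `keep_last`, and the `count` callable are hypothetical names, and `count` is assumed to return the token count of a string (for example, a thin wrapper around tiktoken).

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def trim_history(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int],
    keep_last: int = 10,
) -> List[Message]:
    """Keep all system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # Consider at most the last `keep_last` non-system messages.
    recent = others[-keep_last:] if keep_last > 0 else []

    # System messages are always kept, so they are charged to the budget first.
    budget = max_tokens - sum(count(m["content"]) for m in system)

    kept: List[Message] = []
    # Walk backwards so the most recent messages are kept first.
    for m in reversed(recent):
        cost = count(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost

    # Restore chronological order.
    return system + kept[::-1]
```

With `max_tokens=150_000` this matches the budget suggested above. The window can start on an assistant message; if that matters for your prompts, drop leading assistant messages after trimming.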

Why it happens

o1 is a reasoning model optimized for complex problem-solving. Its 200K-token context window is larger than GPT-4o's 128K, but it is still a hard limit that covers the entire messages array (system + conversation history + user prompt). The window is also shared with the tokens the model generates, including its hidden reasoning tokens, so a prompt close to the limit leaves little room for a response. When your messages array, including long document context or multi-turn history, exceeds 200K tokens, the API rejects the request with an explicit context_length_exceeded error.
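
A minimal sketch of how this rejection might be recognized in code, assuming the v1 Python SDK, where API errors carry a `code` attribute taken from the error body. The helper names `is_context_overflow` and `ask` are hypothetical.

```python
def is_context_overflow(exc: Exception) -> bool:
    """Return True if the exception looks like a context_length_exceeded rejection."""
    # getattr keeps this safe for exceptions that have no `code` attribute.
    return getattr(exc, "code", None) == "context_length_exceeded"


def ask(messages: list, model: str = "o1") -> str:
    """Send a chat request and turn a context overflow into a clearer error."""
    import openai  # imported lazily so the helper above works without the SDK

    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    try:
        response = client.chat.completions.create(model=model, messages=messages)
    except openai.BadRequestError as exc:
        if is_context_overflow(exc):
            raise ValueError("Prompt is too long for the model; trim or chunk it.") from exc
        raise  # other HTTP 400 errors are not handled here
    return response.choices[0].message.content or ""
```

Checking the `code` field rather than matching on the message text keeps the handler from breaking if the wording of the message changes.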

Detection

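Count tokens before sending to o1 using OpenAI's tiktoken tokenizer: encode each message's `content`, sum the lengths, and check whether the total exceeds 200,000. Encoding `str(messages)` also tokenizes the dict punctuation and keys, so it gives only a rough estimate. Log token counts per turn to identify which parts of your conversation history are consuming tokens fastest.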
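
One way to log sizes per turn is sketched below. `default_encoder`, `tokens_per_message`, and `report` are hypothetical helper names, message contents are assumed to be plain strings, and the fallback to `o200k_base` assumes an older tiktoken release that does not recognize the model name.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def default_encoder() -> Encoder:
    """Build an encode function for o1 (requires the tiktoken package)."""
    import tiktoken

    try:
        enc = tiktoken.encoding_for_model("o1")
    except KeyError:
        # Assumption: this tiktoken version does not know the model name.
        enc = tiktoken.get_encoding("o200k_base")
    # disallowed_special=() treats text such as "<|endoftext|>" as ordinary text
    # instead of raising an error.
    return lambda text: enc.encode(text, disallowed_special=())


def tokens_per_message(messages: List[Dict[str, str]], encode: Encoder) -> List[int]:
    """Token count of each message's content (per-message overhead not included)."""
    return [len(encode(m["content"])) for m in messages]


def report(messages: List[Dict[str, str]], encode: Encoder, limit: int = 200_000) -> int:
    """Print one line per message plus a total, and return the total."""
    counts = tokens_per_message(messages, encode)
    for i, (m, n) in enumerate(zip(messages, counts)):
        print(f"turn {i:>3}  {m['role']:<9}  {n:>7} tokens")
    total = sum(counts)
    print(f"total {total} / {limit}")
    return total
```

Typical usage would be `report(messages, default_encoder())` just before the API call. Passing the encoder in as an argument also makes the helpers easy to exercise without tiktoken installed.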

Causes & fixes

1

Long document context passed directly in system or user message without chunking

✓ Fix

Split large documents into semantic chunks (500-1000 tokens each), embed them, retrieve only the top-k relevant chunks via vector search, and pass only those to o1 instead of the full document.

2

Accumulating full conversation history without pruning or summarization

✓ Fix

Implement a sliding window: keep only the last N messages (e.g., last 5-10 turns), or summarize older turns into a brief recap and replace them with the summary to preserve context while reducing token count.

3

System prompt is extremely verbose with detailed instructions and examples

✓ Fix

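Condense the system prompt to essentials only (aim for under 500 tokens). Moving detailed task descriptions and few-shot examples into a user message does not reduce the total on its own, since every message counts toward the limit; instead, include only the examples a given request needs, or retrieve them on demand as RAG context.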

4

Not using the correct tokenizer to measure input size before API call

✓ Fix

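Use `tiktoken.encoding_for_model('o1')` to count tokens accurately; if your tiktoken version does not recognize the model name, `tiktoken.get_encoding('o200k_base')` should give the same encoding. o1 counts system + all user/assistant messages + a few special tokens per message. Test locally: with `enc = tiktoken.encoding_for_model('o1')`, the sum `sum(len(enc.encode(m['content'])) for m in messages)` should stay comfortably below 200000, because it omits the per-message overhead.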
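
The retrieval step in fix 1 comes down to ranking chunks by similarity to the query and keeping the top k. The sketch below shows that ranking with a toy word-count embedding so it runs offline; `toy_embed`, `cosine`, and `top_k_chunks` are hypothetical names, and in a real system `embed` would call an embedding model and the vectors would usually live in a vector store.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase word counts. Swap in a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(
    query: str,
    chunks: List[str],
    k: int = 3,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Only the returned chunks go into the messages array, so the prompt size is bounded by roughly `k` times the chunk size regardless of how large the source document is.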

Code: broken vs fixed

Broken - triggers the error
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# ❌ BROKEN: Passing the entire document as context
with open('large_document.txt', 'r') as f:
    full_doc = f.read()  # 250K+ tokens

messages = [
    {"role": "system", "content": "You are an expert analyst."},
    {
        "role": "user",
        "content": f"Analyze this document:\n\n{full_doc}\n\nAnswer: What are the key findings?"
    }
]

# This will exceed 200K and fail with context_length_exceeded
response = client.chat.completions.create(
    model="o1",
    messages=messages
)
print(response.choices[0].message.content)
Fixed - works correctly
python
import os
import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# ✅ FIXED: Use RAG to chunk document and retrieve only relevant sections
def count_tokens(messages: list) -> int:
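    """Approximate input tokens for o1 (message contents only, no per-message overhead)."""
    encoding = tiktoken.encoding_for_model("o1")
    # Encode the content itself; str(m) would also count dict keys and punctuation.
    return sum(len(encoding.encode(m["content"])) for m in messages)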

def chunk_document(text: str, chunk_size: int = 1000) -> list:
    """Split document into fixed-size token chunks (a simple stand-in for semantic chunking)."""
    encoding = tiktoken.encoding_for_model("o1")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i+chunk_size]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks

def retrieve_relevant_chunks(query: str, chunks: list, top_k: int = 3) -> str:
    """Simulated RAG: return top-k most relevant chunks (use real embeddings in production)."""
    # In production: embed chunks, embed query, use cosine similarity
    # For demo: return first top_k chunks
    return "\n\n".join(chunks[:top_k])

# Load and chunk the document
with open('large_document.txt', 'r') as f:
    full_doc = f.read()

chunks = chunk_document(full_doc, chunk_size=1000)
relevant_context = retrieve_relevant_chunks("key findings", chunks, top_k=3)

# Build messages with only relevant chunks (not full doc)
messages = [
    {"role": "system", "content": "You are an expert analyst. Provide concise findings."},
    {
        "role": "user",
        "content": f"Based on this context:\n\n{relevant_context}\n\nWhat are the key findings?"
    }
]

# Verify token count is under 200K before calling API
token_count = count_tokens(messages)
print(f"Total input tokens: {token_count}")

if token_count > 200000:
    print(f"ERROR: {token_count} tokens exceeds 200K limit. Reduce context further.")
else:
    response = client.chat.completions.create(
        model="o1",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)
Changed from passing the full document (250K+ tokens) to retrieving only the top 3 chunks (about 3K tokens at 1,000 tokens per chunk), added token counting with tiktoken before the API call, and verified that total input stays under the 200K limit before sending to o1.

Workaround

If you cannot implement RAG immediately, use a multi-step approach: (1) send the user's query alone to o1 to get a focused analysis direction, (2) use that response to decide which document sections are most relevant, (3) re-chunk the document around those sections, (4) send only those chunks to o1. This trades latency for staying under the token limit. Falling back to gpt-4o will not help with this error, because its 128K context window is smaller than o1's 200K; a fallback only makes sense with a model whose context window is larger than your prompt.

Prevention

Architect your system to separate context retrieval from inference: always use semantic search (vector DB, FAISS, or Pinecone) to fetch only relevant document chunks before constructing the messages array. Implement automatic token counting at message build time, add a circuit breaker that rejects messages >180K tokens, and monitor token usage per request in production. For multi-turn conversations, keep a rolling 10-turn window and summarize older turns into a brief recap message before reaching 150K tokens.
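
The circuit breaker can be as small as the sketch below. `ContextBudgetExceeded` and `enforce_budget` are hypothetical names, and the 180K budget is the figure suggested above; the gap between budget and hard limit is headroom for counting error and for the tokens the model generates.

```python
class ContextBudgetExceeded(Exception):
    """Raised before any API call when a prompt is over the configured budget."""


def enforce_budget(
    token_count: int,
    budget: int = 180_000,
    hard_limit: int = 200_000,
) -> int:
    """Return the headroom left under hard_limit, or raise if over budget."""
    if token_count > budget:
        raise ContextBudgetExceeded(
            f"{token_count} input tokens exceeds the {budget}-token budget"
        )
    return hard_limit - token_count
```

Calling this at message build time means an oversized prompt fails fast in your own code, where it can be trimmed or summarized, instead of failing as an HTTP 400 from the API.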

Python 3.9+ · openai >=1.3.0 · tested on 1.50.0
Verified 2026-04 · o1, o1-mini, gpt-4o