InvalidRequestError
openai.BadRequestError (HTTP 400: context_length_exceeded); raised as openai.InvalidRequestError in SDK versions before 1.0
Stack trace
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 200000 tokens. However, your messages resulted in 245630 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
Why it happens
o1 is a reasoning model optimized for complex problem-solving, but the entire input has to fit in its context window before any reasoning starts. The window is larger than GPT-4o's (200K tokens versus 128K), yet the limit is just as hard: it covers everything in the request (system prompt + conversation history + user prompt). Reasoning and output tokens draw on the same window, so a request that barely fits leaves no room for the answer. When your messages array, including long document context or multi-turn history, exceeds 200K tokens, the API rejects the request with an explicit context_length_exceeded error instead of truncating the input.
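To see how far over the limit a failed request was, the numbers can be read straight out of the error message. The sketch below assumes the message wording shown in the stack trace above; `parse_overflow` is a hypothetical helper, not part of the OpenAI SDK.

```python
import re
from typing import Tuple

# Message text copied from the stack trace above.
ERROR_MESSAGE = (
    "This model's maximum context length is 200000 tokens. "
    "However, your messages resulted in 245630 tokens. "
    "Please reduce the length of the messages."
)


def parse_overflow(message: str) -> Tuple[int, int, int]:
    """Return (limit, used, excess) from a context_length_exceeded message.

    Assumes the wording shown in the stack trace; raises ValueError otherwise.
    """
    match = re.search(
        r"maximum context length is (\d+) tokens.*?resulted in (\d+) tokens",
        message,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("unrecognized error message format")
    limit, used = int(match.group(1)), int(match.group(2))
    return limit, used, used - limit


limit, used, excess = parse_overflow(ERROR_MESSAGE)
print(f"limit={limit} used={used} excess={excess}")
# prints: limit=200000 used=245630 excess=45630
```

In this trace the request needs to shed at least 45,630 tokens, plus whatever headroom the response requires.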
Detection
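Count tokens before sending a request to o1 using tiktoken, OpenAI's tokenizer library: compute `len(encoding.encode(str(messages)))` and check whether the result exceeds 200000. Because `str(messages)` also tokenizes the dict syntax and role keys, treat the result as an estimate rather than an exact count. Log token counts per turn to identify which parts of your conversation history consume the most tokens.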
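Per-turn logging might look like the following minimal sketch. It assumes every message has a plain-string `content`. The counter is pluggable: `approx_token_count` is a hypothetical stand-in (about 4 characters per token) so the example runs without extra dependencies, and the comment shows how a tiktoken-based counter could be swapped in.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def approx_token_count(text: str) -> int:
    """Hypothetical stand-in counter: assumes roughly 4 characters per token."""
    return max(1, len(text) // 4)


def tokens_per_message(
    messages: List[Message],
    count: Callable[[str], int] = approx_token_count,
) -> List[Tuple[int, str, int]]:
    """Return (index, role, token_count) for each message, in order."""
    return [(i, m["role"], count(m["content"])) for i, m in enumerate(messages)]


def log_token_usage(
    messages: List[Message],
    limit: int = 200_000,
    count: Callable[[str], int] = approx_token_count,
) -> int:
    """Print messages from largest to smallest and return the estimated total."""
    rows = tokens_per_message(messages, count)
    total = sum(tokens for _, _, tokens in rows)
    for index, role, tokens in sorted(rows, key=lambda row: row[2], reverse=True):
        print(f"message {index:>3} ({role}): {tokens} tokens")
    status = "OVER" if total > limit else "under"
    print(f"total: {total} tokens ({status} the {limit} limit)")
    return total


# To count with the real tokenizer instead (requires tiktoken):
#   enc = tiktoken.get_encoding("o200k_base")
#   log_token_usage(messages, count=lambda text: len(enc.encode(text)))

if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are an expert analyst."},
        {"role": "user", "content": "Summarize the attached report. " * 200},
        {"role": "assistant", "content": "Here is a summary of the report."},
    ]
    log_token_usage(history)
```

Sorting by size puts the largest message first, which is usually the pasted document or an old tool output.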
Causes & fixes
Long document context passed directly in system or user message without chunking
Split large documents into semantic chunks (500-1000 tokens each), embed them, retrieve only the top-k relevant chunks via vector search, and pass only those to o1 instead of the full document.
Accumulating full conversation history without pruning or summarization
Implement a sliding window: keep only the last N messages (e.g., last 5-10 turns), or summarize older turns into a brief recap and replace them with the summary to preserve context while reducing token count.
System prompt is extremely verbose with detailed instructions and examples
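Condense the system prompt to essentials (for example, under 500 tokens). Moving detailed task descriptions and few-shot examples into a user message does not reduce the total by itself, because every message counts toward the limit; cut the examples you do not need, or retrieve only the relevant ones per request as RAG context.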
Not using the correct tokenizer to measure input size before API call
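Use `tiktoken.encoding_for_model('o1')` to count tokens; if your tiktoken version does not recognize the model name, `tiktoken.get_encoding('o200k_base')` loads the same encoding. o1 counts the system message, all user/assistant messages, and the special tokens that delimit them. Test locally: with `enc = tiktoken.encoding_for_model('o1')`, the value of `sum(len(enc.encode(m['content'])) for m in messages)` plus a few tokens of per-message overhead must be ≤ 200000.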
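The sliding-window fix for accumulated history can be sketched as follows. This is one possible shape under stated assumptions, not a prescribed implementation: `trim_history` is a hypothetical helper, messages are assumed to have string `content`, and `count` is whatever token counter you already use (for example one built on tiktoken). The function also returns the dropped turns so they can be summarized into a recap message.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def trim_history(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int],
    keep_last: int = 10,
) -> Tuple[List[Message], List[Message]]:
    """Keep system messages plus the newest turns that fit within max_tokens.

    Returns (kept_messages, dropped_messages). At most keep_last non-system
    messages are kept, walking backwards from the most recent one.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so they come out of the budget first.
    budget = max_tokens - sum(count(m["content"]) for m in system)

    recent = turns[-keep_last:] if keep_last > 0 else []
    kept: List[Message] = []
    for message in reversed(recent):
        cost = count(message["content"])
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order

    dropped = turns[: len(turns) - len(kept)]
    return system + kept, dropped


if __name__ == "__main__":
    def word_count(text: str) -> int:
        # Word count stands in for a real tokenizer in this demo.
        return len(text.split())

    history = [
        {"role": "system", "content": "be brief"},
        {"role": "user", "content": "one two three"},
        {"role": "assistant", "content": "four five"},
        {"role": "user", "content": "six"},
    ]
    kept, dropped = trim_history(history, max_tokens=5, count=word_count)
    print([m["content"] for m in kept])
    print([m["content"] for m in dropped])
```

Stopping at the first turn that does not fit keeps the retained history contiguous, which tends to matter more for coherence than squeezing in an older, smaller message.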
Code: broken vs fixed
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# ❌ BROKEN: Passing entire 500KB document as context
with open('large_document.txt', 'r') as f:
    full_doc = f.read()  # 250K+ tokens

messages = [
    {"role": "system", "content": "You are an expert analyst."},
    {
        "role": "user",
        "content": f"Analyze this document:\n\n{full_doc}\n\nAnswer: What are the key findings?"
    }
]

# This will exceed 200K and fail with context_length_exceeded
response = client.chat.completions.create(
    model="o1",
    messages=messages
)
print(response.choices[0].message.content)

import os
import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# ✅ FIXED: Chunk the document and send only the relevant sections (RAG)
MAX_INPUT_TOKENS = 200_000

def get_encoding():
    """Return the o1 tokenizer; older tiktoken versions may not know the model name."""
    try:
        return tiktoken.encoding_for_model("o1")
    except KeyError:
        return tiktoken.get_encoding("o200k_base")  # encoding used by o1

def count_tokens(messages: list) -> int:
    """Estimate input tokens for o1. str(m) includes dict syntax, so this is approximate."""
    encoding = get_encoding()
    # disallowed_special=() treats text such as "<|endoftext|>" as plain text instead of raising
    return sum(len(encoding.encode(str(m), disallowed_special=())) for m in messages)

def chunk_document(text: str, chunk_size: int = 1000) -> list:
    """Split document into fixed-size chunks of chunk_size tokens (not true semantic chunks)."""
    encoding = get_encoding()
    tokens = encoding.encode(text, disallowed_special=())
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i+chunk_size]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks

def retrieve_relevant_chunks(query: str, chunks: list, top_k: int = 3) -> str:
    """Simulated RAG: return top-k most relevant chunks (use real embeddings in production)."""
    # In production: embed chunks, embed query, rank by cosine similarity
    # For demo: return the first top_k chunks and ignore the query
    return "\n\n".join(chunks[:top_k])

# Load and chunk the document
with open('large_document.txt', 'r') as f:
    full_doc = f.read()
chunks = chunk_document(full_doc, chunk_size=1000)
relevant_context = retrieve_relevant_chunks("key findings", chunks, top_k=3)

# Build messages with only relevant chunks (not full doc)
messages = [
    {"role": "system", "content": "You are an expert analyst. Provide concise findings."},
    {
        "role": "user",
        "content": f"Based on this context:\n\n{relevant_context}\n\nWhat are the key findings?"
    }
]

# Verify token count is under 200K before calling API
token_count = count_tokens(messages)
print(f"Total input tokens: {token_count}")
if token_count > MAX_INPUT_TOKENS:
    print(f"ERROR: {token_count} tokens exceeds 200K limit. Reduce context further.")
else:
    response = client.chat.completions.create(
        model="o1",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)

Workaround
If you cannot implement RAG immediately, use a multi-step approach: (1) send the user's query alone to o1 to get a focused analysis direction, (2) use that response to decide which document sections are most relevant, (3) re-chunk the document around those sections, (4) send only those chunks to o1. This trades latency for staying under the token limit. Falling back to gpt-4o while you implement RAG is only an option once the input has been trimmed, because its 128K window is smaller than o1's 200K and an input that overflows o1 will overflow gpt-4o as well.
Prevention
Architect your system to separate context retrieval from inference: always use semantic search (for example a FAISS index or a vector database such as Pinecone) to fetch only the relevant document chunks before constructing the messages array. Count tokens automatically at message build time, add a circuit breaker that rejects requests above 180K tokens, and monitor token usage per request in production. For multi-turn conversations, keep a rolling 10-turn window and summarize older turns into a brief recap message before the history reaches 150K tokens.
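The two thresholds above (summarize before 150K, reject above 180K) could be enforced at message build time with a small guard like the one below. This is a sketch: `check_budget` and `ContextBudgetExceeded` are hypothetical names, and the token count is assumed to come from whatever counter you already run before the API call.

```python
SUMMARIZE_ABOVE = 150_000  # start summarizing older turns past this point
REJECT_ABOVE = 180_000     # circuit breaker: refuse to send the request


class ContextBudgetExceeded(Exception):
    """Raised when a request is too large to send safely (hypothetical name)."""


def check_budget(token_count: int) -> str:
    """Classify a request by its estimated input token count.

    Returns "ok" or "summarize"; raises ContextBudgetExceeded above the
    circuit-breaker threshold so the request never reaches the API.
    """
    if token_count > REJECT_ABOVE:
        raise ContextBudgetExceeded(
            f"{token_count} tokens exceeds the {REJECT_ABOVE} token circuit breaker"
        )
    if token_count > SUMMARIZE_ABOVE:
        return "summarize"
    return "ok"


if __name__ == "__main__":
    for estimate in (90_000, 160_000, 245_630):
        try:
            print(estimate, check_budget(estimate))
        except ContextBudgetExceeded as error:
            print(estimate, "rejected:", error)
```

Raising an exception rather than returning a flag makes it harder for a caller to ignore the breaker and send the request anyway.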