ContextLengthExceededError
ollama.errors.ContextLengthExceededError
Stack trace
ollama.errors.ContextLengthExceededError: The input context length exceeded the model's num_ctx limit of 2048 tokens.
Why it happens
Ollama models run with a maximum context length (num_ctx) that limits how many tokens can be processed in a single request. When the combined tokens of the prompt, conversation history, and any system messages exceed this limit, the client raises this error to prevent invalid requests.
Detection
Count tokens before sending each request by tokenizing the prompt and conversation history, then log the count or assert that it does not exceed the model's num_ctx limit.
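A rough sketch of such a pre-flight check follows. The helper names (estimate_tokens, fits_context), the four-characters-per-token ratio, and the reserve for the reply are assumptions made for illustration and are not part of the Ollama client; an accurate check would use the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # assumed average; real ratios vary by tokenizer and language


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(messages: list[dict], num_ctx: int = 2048, reserve: int = 256) -> bool:
    """Return True if the estimated prompt tokens plus a reply reserve fit in num_ctx."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total + reserve <= num_ctx


messages = [{"role": "user", "content": "A" * 8000}]
if not fits_context(messages):
    print("Estimated token count exceeds num_ctx; shorten the prompt or history.")
```

Because the estimate is only approximate, a conservative reserve is safer than relying on the exact boundary.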
Causes & fixes
Cause: Prompt plus conversation history tokens exceed the model's maximum context length (num_ctx).
Fix: Truncate or summarize conversation history and reduce prompt length to fit within the model's token limit.
Cause: Using a model with a smaller context window than your application expects.
Fix: Switch to a model variant with a larger context window, or raise num_ctx if the model supports a larger value.
Cause: Unintentionally appending large system or user messages repeatedly to the conversation state.
Fix: Prune or reset conversation history periodically to keep the token count under the limit.
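The pruning fix can be sketched as a sliding window over the conversation. The function below is a hypothetical helper, not an Ollama API: it keeps every system message, then keeps the newest turns that fit a token budget. The token counter is passed in by the caller; the whitespace word count used in the example is only a stand-in for a real tokenizer.

```python
def prune_history(messages, max_tokens, count_tokens):
    """Keep system messages plus the newest turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # System messages are always kept, so they are charged against the budget first.
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
pruned = prune_history(history, max_tokens=5, count_tokens=lambda s: len(s.split()))
print([m["content"] for m in pruned])
```

Dropping whole turns from the oldest end keeps the most recent exchange intact; summarizing the dropped turns instead is an alternative when older context still matters.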
Code: broken vs fixed
Broken:
import ollama

# Very long prompt. Character count only approximates token count;
# this assumes the prompt tokenizes to more than num_ctx tokens.
prompt = "A" * 3000
# This call triggers the context length exceeded error
response = ollama.chat(model="ollama-model", messages=[{"role": "user", "content": prompt}])
print(response)

Fixed:
import ollama

# Reduced prompt length so the request fits within the context window
prompt = "A" * 1500
# Fixed: the shorter prompt avoids the context length exceeded error
response = ollama.chat(model="ollama-model", messages=[{"role": "user", "content": prompt}])
print(response)
Workaround
Catch the ContextLengthExceededError exception, then truncate or summarize the prompt and conversation history before retrying the request.
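A minimal sketch of that catch-and-retry loop is shown here, under stated assumptions. The chat_with_retry helper is hypothetical, and the exception type and send function are passed in as parameters so the sketch does not depend on a particular client class; ContextOverflow and fake_send are stand-ins used only to make the example runnable.

```python
def chat_with_retry(send, messages, overflow_error, max_attempts=3):
    """Call send(messages); on overflow_error drop the oldest non-system message and retry."""
    current = list(messages)  # work on a copy so the caller's history is untouched
    for _ in range(max_attempts):
        try:
            return send(current)
        except overflow_error:
            non_system = [i for i, m in enumerate(current) if m["role"] != "system"]
            if len(non_system) < 2:
                raise  # only the latest message is left; nothing more to drop
            del current[non_system[0]]
    raise RuntimeError("context still too long after retries")


class ContextOverflow(Exception):
    """Stand-in for whatever overflow exception the client raises."""


def fake_send(msgs):
    # Stand-in for a real chat call: pretend only two messages fit.
    if len(msgs) > 2:
        raise ContextOverflow
    return f"sent {len(msgs)} messages"


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "first"},
    {"role": "assistant", "content": "reply"},
    {"role": "user", "content": "second"},
]
result = chat_with_retry(fake_send, history, ContextOverflow)
print(result)
```

With a real client, send would wrap the chat call and overflow_error would be the exception class the client actually raises; this page names ollama.errors.ContextLengthExceededError.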
Prevention
Implement token counting and prompt length checks before sending requests, and use models with larger context windows when handling long conversations or documents.
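One way to combine both prevention steps is to pick a context size from the estimated token count before building the request. The sketch below is illustrative: choose_num_ctx, build_chat_kwargs, the candidate sizes, and the reply reserve are all assumptions, and whether a given model and machine can actually run at a larger num_ctx has to be checked separately.

```python
def choose_num_ctx(estimated_tokens, candidates=(2048, 4096, 8192), reserve=256):
    """Return the smallest candidate window that fits the prompt plus a reply reserve."""
    needed = estimated_tokens + reserve
    for size in sorted(candidates):
        if size >= needed:
            return size
    return None  # nothing fits; the prompt or history must be shortened


def build_chat_kwargs(model, messages, num_ctx):
    """Assemble keyword arguments for a chat call with an explicit num_ctx option."""
    return {"model": model, "messages": messages, "options": {"num_ctx": num_ctx}}


num_ctx = choose_num_ctx(estimated_tokens=3000)
kwargs = build_chat_kwargs(
    "ollama-model",  # placeholder model name, as used elsewhere on this page
    [{"role": "user", "content": "..."}],
    num_ctx,
)
print(kwargs["options"])
# With the Python client installed, the request would be sent as: ollama.chat(**kwargs)
```

Returning None instead of the largest candidate makes the "does not fit anywhere" case explicit, so the caller can fall back to truncating or summarizing.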