TokenLimitExceededError
langchain_core.exceptions.TokenLimitExceededError
Stack trace
langchain_core.exceptions.TokenLimitExceededError: The total tokens in the prompt and expected completion exceed the model's maximum context window size of 8192 tokens.
Why it happens
This error occurs when the input prompt plus the requested completion length exceeds the model's context window. LangChain chains often concatenate several inputs (the prompt template, input documents, and memory state), so the combined token count can overflow the limit even when each piece looks small on its own.
Detection
Count the prompt tokens and add the requested completion length before sending each request. Log both numbers, and catch TokenLimitExceededError so that overflows surface early.
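The pre-flight check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not LangChain's own mechanism: `approx_tokens` and `check_budget` are hypothetical helpers, and the "about 4 characters per token" rule is only a crude estimate that should be replaced with the model's real tokenizer.

```python
import logging

logger = logging.getLogger("token_budget")


def approx_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4) if text else 0


def check_budget(prompt: str, max_output_tokens: int, context_window: int = 8192) -> int:
    """Return the remaining headroom in tokens; a negative value means overflow."""
    prompt_tokens = approx_tokens(prompt)
    headroom = context_window - (prompt_tokens + max_output_tokens)
    logger.info(
        "prompt=%d output=%d window=%d headroom=%d",
        prompt_tokens, max_output_tokens, context_window, headroom,
    )
    if headroom < 0:
        logger.warning("request would exceed the context window by %d tokens", -headroom)
    return headroom
```

Calling `check_budget(prompt_text, 4000)` before each request gives a log line per call and a return value the caller can test before deciding to truncate or summarize.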
Causes & fixes
Cause: Input documents or conversation history exceed the model's maximum token limit when combined.
Fix: Check token lengths, then truncate or summarize the inputs so they fit within the model's context window before passing them to the chain.
Cause: Chain memory accumulates too much context over multiple interactions, causing token overflow.
Fix: Limit the stored context with a memory management strategy such as windowed memory (for example ConversationBufferWindowMemory, which keeps only the last k exchanges) or summarization.
Cause: The completion call requests too many output tokens, so the prompt plus the reserved completion exceeds the limit.
Fix: Reduce the max_tokens parameter in the completion call and shorten the prompt so that both fit within the limit.
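The windowed-memory idea above can be illustrated without any LangChain classes. The sketch below is an assumption-laden stand-in: `window_history` is a hypothetical helper, and its default counter uses whitespace-separated words rather than real model tokens.

```python
from typing import Callable, List


def window_history(
    messages: List[str],
    max_tokens: int,
    count: Callable[[str], int] = lambda m: len(m.split()),
) -> List[str]:
    """Keep the most recent messages whose combined count fits within max_tokens."""
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "hello there",
    "how are you today",
    "fine thanks",
    "tell me about token limits please",
]
recent = window_history(history, max_tokens=8)
```

Passing a real tokenizer as `count` makes the window track actual token usage; older messages that fall outside the window could be summarized instead of dropped.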
Code: broken vs fixed
# Broken: nothing checks that the prompt plus the 4000-token completion fits the context window
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI

# some_long_prompt_template and long_input_text are assumed to be defined elsewhere
llm = ChatOpenAI(model_name='gpt-4o', max_tokens=4000)
chain = LLMChain(llm=llm, prompt=some_long_prompt_template)
# Raises TokenLimitExceededError if prompt + output tokens exceed the limit
result = chain.run(long_input_text)
# Fixed: truncate the input so that prompt + completion fit in the context window
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI

# some_long_prompt_template and long_input_text are assumed to be defined elsewhere

def count_tokens(text):
    # Simplified: counts whitespace-separated words, not real model tokens.
    # For accurate counts use the model's tokenizer, e.g. llm.get_num_tokens(text).
    return len(text.split())

max_context_tokens = 8192  # window size from the error message; set to your model's limit
max_output_tokens = 4000   # must match the max_tokens passed to the model
llm = ChatOpenAI(model_name='gpt-4o', max_tokens=max_output_tokens)

# Budget left for the input once the completion is reserved.
# The prompt template's own tokens also count against this budget.
allowed_input_tokens = max_context_tokens - max_output_tokens
if count_tokens(long_input_text) > allowed_input_tokens:
    # Keep only the first allowed_input_tokens words
    truncated_input = ' '.join(long_input_text.split()[:allowed_input_tokens])
else:
    truncated_input = long_input_text

chain = LLMChain(llm=llm, prompt=some_long_prompt_template)
result = chain.run(truncated_input)  # Fixed: input truncated to fit the context window
print(result)
Workaround
Catch TokenLimitExceededError and, when it is raised, truncate or summarize the input text before retrying the chain call.
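A retry loop along these lines might look like the sketch below. To keep it runnable without LangChain installed, it defines a local stand-in exception with the same name as the error above. `run_with_truncation` and `fake_chain` are hypothetical names, and halving the input on each failure is just one possible shrinking policy.

```python
from typing import Callable


class TokenLimitExceededError(Exception):
    """Local stand-in for the error named above, so the sketch runs on its own."""


def run_with_truncation(
    call: Callable[[str], str],
    text: str,
    shrink: float = 0.5,
    max_attempts: int = 4,
) -> str:
    """Call `call`; on overflow keep the leading `shrink` fraction of words and retry."""
    words = text.split()
    for _ in range(max_attempts):
        try:
            return call(" ".join(words))
        except TokenLimitExceededError:
            keep = int(len(words) * shrink)
            if keep == 0:
                raise  # nothing left to cut
            words = words[:keep]
    raise TokenLimitExceededError(f"input still too long after {max_attempts} attempts")


def fake_chain(text: str) -> str:
    """Demo callable: rejects inputs longer than 10 words."""
    n = len(text.split())
    if n > 10:
        raise TokenLimitExceededError("too long")
    return f"ok:{n}"
```

In real use, `call` would wrap the chain invocation and the `except` clause would name whichever exception the installed library actually raises. Replacing the plain truncation with a summarization step preserves more of the input at the cost of an extra model call.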
Prevention
Design chains with token budget awareness by monitoring token counts continuously and using summarization or windowed memory to keep context within model limits.
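One way to make a chain "budget aware" is to keep the arithmetic in a single place. The sketch below is an assumption, not a LangChain feature: `TokenBudget`, its field names, and the 64-token safety margin are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    context_window: int      # model limit, e.g. 8192 from the error message
    reserved_output: int     # tokens set aside for the completion
    safety_margin: int = 64  # slack for tokenizer and message-format overhead (assumed)

    @property
    def input_budget(self) -> int:
        """Tokens available for prompt template, memory, and input combined."""
        return max(0, self.context_window - self.reserved_output - self.safety_margin)

    def completion_cap(self, prompt_tokens: int) -> int:
        """Largest max_tokens value that still fits after a prompt of this size."""
        room = self.context_window - prompt_tokens - self.safety_margin
        return max(0, min(self.reserved_output, room))


budget = TokenBudget(context_window=8192, reserved_output=4000)
```

Here `budget.input_budget` is the ceiling for truncation or memory windowing, while `budget.completion_cap(prompt_tokens)` lowers max_tokens for a request whose prompt has already grown large.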