LLMPredictionError
llama_index.llm_predictor.LLMPredictionError
Stack trace
llama_index.llm_predictor.LLMPredictionError: Token limit exceeded: input tokens + output tokens exceed model max tokens
File "/app/llama_index/llm_predictor.py", line 123, in predict
raise LLMPredictionError("Token limit exceeded")
File "/app/llama_index/llm_predictor.py", line 98, in _check_token_limit
if total_tokens > self.max_tokens:
Why it happens
Before sending a request to the LLM, LlamaIndex adds the prompt's token count to the expected number of output tokens. If this sum exceeds the model's maximum token limit, it raises LLMPredictionError rather than letting the API reject or truncate the request. This typically happens with large documents or a high max output tokens setting.
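The check described above comes down to a single comparison. The following is a minimal sketch of that arithmetic, not the library's implementation; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def exceeds_token_limit(input_tokens: int, max_output_tokens: int, model_max_tokens: int) -> bool:
    """Return True when the prompt plus the reserved output cannot fit in the model's window."""
    return input_tokens + max_output_tokens > model_max_tokens


# Hypothetical numbers: a 4096-token model and a 3000-token prompt.
print(exceeds_token_limit(3000, 2048, 4096))  # True: 3000 + 2048 = 5048 > 4096
print(exceeds_token_limit(3000, 512, 4096))   # False: 3000 + 512 = 3512 <= 4096
```

Note that the prompt itself fits in both cases; only the output reservation pushes the first request over the limit.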
Detection
Log the input and output token counts before each prediction call, and catch LLMPredictionError to identify which requests breach the token limit.
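One way to wire up this kind of monitoring is a thin wrapper around the prediction call. The sketch below makes several assumptions: LLMPredictionError is defined locally as a stand-in for the exception named in the stack trace, count_tokens is a rough whitespace approximation rather than the model's real tokenizer, and logged_predict and predict_fn are hypothetical names.

```python
import logging

logger = logging.getLogger("token_usage")


class LLMPredictionError(Exception):
    """Local stand-in for the exception named in the stack trace (assumption)."""


def count_tokens(text: str) -> int:
    # Rough approximation: one token per whitespace-separated word.
    # A real tokenizer will usually report a different (often higher) count.
    return len(text.split())


def logged_predict(predict_fn, prompt: str, max_output_tokens: int):
    """Log the token budget, call predict_fn(prompt), and log any limit breach."""
    input_tokens = count_tokens(prompt)
    logger.info("input_tokens=%d max_output_tokens=%d", input_tokens, max_output_tokens)
    try:
        return predict_fn(prompt)
    except LLMPredictionError:
        logger.warning(
            "token limit breached: input_tokens=%d max_output_tokens=%d",
            input_tokens,
            max_output_tokens,
        )
        raise  # Re-raise so callers can still handle the failure.
```

Because the exception is re-raised, the wrapper only adds visibility; it does not change how the error propagates.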
Causes & fixes
Cause: The input document or prompt is too large, so the input tokens alone exceed the model's limit.
Fix: Reduce the input size by chunking documents into smaller pieces or summarizing content before passing it to the LLM.
Cause: The max output tokens parameter is set too high, so the total exceeds the model's max tokens.
Fix: Lower the max output tokens setting in the LLM predictor configuration to fit within the model's token budget.
Cause: The model's max token limit is smaller than expected (e.g., using a base model with lower token capacity).
Fix: Switch to a model with a higher token limit, such as gpt-4o or llama-3.3-70b, to accommodate larger inputs and outputs.
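The chunking fix for the first cause can be sketched as follows. This is a minimal illustration that treats each whitespace-separated word as one token, which is only an approximation of a real tokenizer; chunk_by_tokens is a hypothetical helper, not part of LlamaIndex.

```python
def chunk_by_tokens(text: str, max_tokens_per_chunk: int) -> list[str]:
    """Split text into pieces of at most max_tokens_per_chunk approximate tokens."""
    if max_tokens_per_chunk <= 0:
        raise ValueError("max_tokens_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens_per_chunk])
        for start in range(0, len(words), max_tokens_per_chunk)
    ]


print(chunk_by_tokens("a b c d e", 2))  # ['a b', 'c d', 'e']
```

Each chunk can then be sent as its own prompt, with the chunk size chosen so that chunk tokens plus max output tokens stay under the model limit.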
Code: broken vs fixed
# Broken: the output reservation leaves too little room for the prompt
from llama_index import LLMPredictor, ServiceContext
llm_predictor = LLMPredictor(max_output_tokens=2048)  # Too high for the model
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
# This call raises LLMPredictionError because prompt + output tokens exceed the limit
response = service_context.llm_predictor.predict(prompt="Very large input text...")  # Error here
print(response)

import os
from llama_index import LLMPredictor, ServiceContext
# Read the API key from the environment and fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")
# Lower max_output_tokens to fit model token limit
llm_predictor = LLMPredictor(max_output_tokens=512) # Reduced to avoid token limit error
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
# Chunk or summarize input externally before passing
prompt = "Summarized or chunked input text..."
response = service_context.llm_predictor.predict(prompt=prompt) # Fixed
print(response)
Workaround
Catch LLMPredictionError exceptions and automatically retry with smaller input chunks or reduced max output tokens to avoid token overflow.
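A retry loop for this workaround might look like the sketch below, which halves the output budget after each failure. Everything here is an assumption for illustration: LLMPredictionError is a local stand-in for the exception in the stack trace, predict_fn is a hypothetical callable taking (prompt, max_output_tokens), and fake_predict simulates a 1000-token model with a whitespace word count.

```python
class LLMPredictionError(Exception):
    """Local stand-in for the exception named in the stack trace (assumption)."""


def predict_with_backoff(predict_fn, prompt: str, max_output_tokens: int, min_output_tokens: int = 64):
    """Call predict_fn, halving the output budget on each token-limit failure."""
    budget = max_output_tokens
    while True:
        try:
            return predict_fn(prompt, budget)
        except LLMPredictionError:
            if budget <= min_output_tokens:
                raise  # Shrinking the output further will not help; chunk the input instead.
            budget = max(min_output_tokens, budget // 2)


def fake_predict(prompt: str, max_output_tokens: int) -> str:
    # Simulated model with a 1000-token window (hypothetical).
    if len(prompt.split()) + max_output_tokens > 1000:
        raise LLMPredictionError("Token limit exceeded")
    return f"ok with max_output_tokens={max_output_tokens}"


prompt = " ".join(["word"] * 400)
print(predict_with_backoff(fake_predict, prompt, 2048))  # ok with max_output_tokens=512
```

When the loop re-raises at the minimum budget, the prompt itself is too large, and the caller should fall back to smaller input chunks.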
Prevention
Count tokens and chunk input before prediction calls, choose a model with sufficient token capacity, and set max output tokens conservatively.
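Setting the output budget conservatively can be done by deriving it from the space the prompt leaves free. The helper below is a hypothetical sketch: safe_max_output_tokens and the safety_margin default are assumptions, and input_tokens should come from whatever tokenizer matches the model in use.

```python
def safe_max_output_tokens(
    input_tokens: int,
    model_max_tokens: int,
    desired_output_tokens: int,
    safety_margin: int = 32,
) -> int:
    """Clamp the output budget to the room left after the prompt and a small margin."""
    available = model_max_tokens - input_tokens - safety_margin
    if available <= 0:
        raise ValueError("prompt leaves no room for output; chunk or summarize it first")
    return min(desired_output_tokens, available)


print(safe_max_output_tokens(3000, 4096, 2048))  # 1064 (4096 - 3000 - 32)
print(safe_max_output_tokens(1000, 4096, 512))   # 512 (the desired value already fits)
```

The margin absorbs small differences between an estimated token count and the count the model actually sees.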