High severity · Intermediate · Fix: 5-10 min

LLMPredictionError

llama_index.llm_predictor.LLMPredictionError

What this error means

LlamaIndex raises LLMPredictionError when the prompt's input tokens plus the requested output tokens exceed the model's maximum token limit during LLM prediction.

Stack trace

traceback

llama_index.llm_predictor.LLMPredictionError: Token limit exceeded: input tokens + output tokens exceed model max tokens
  File "/app/llama_index/llm_predictor.py", line 123, in predict
    raise LLMPredictionError("Token limit exceeded")
  File "/app/llama_index/llm_predictor.py", line 98, in _check_token_limit
    if total_tokens > self.max_tokens:

QUICK FIX

Reduce input size or max output tokens to fit within the model's token limit before calling LlamaIndex's LLM predictor.

Why it happens

LlamaIndex calculates the total tokens for the prompt plus expected output tokens before sending a request to the LLM. If this sum exceeds the model's maximum token limit, it raises LLMPredictionError to prevent API rejection or truncation. This often happens with large documents or high max output tokens settings.
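The check described above amounts to a single comparison. The following is a minimal sketch of that arithmetic, not LlamaIndex's actual implementation; the function name `exceeds_token_limit` and the numbers are hypothetical and chosen only for illustration.

```python
def exceeds_token_limit(input_tokens: int, max_output_tokens: int,
                        model_max_tokens: int) -> bool:
    """Return True when the prompt plus the reserved output cannot fit.

    Hypothetical helper: it mirrors the described check
    (input tokens + output tokens > model max tokens).
    """
    return input_tokens + max_output_tokens > model_max_tokens


# Illustrative numbers only, not taken from any specific model.
print(exceeds_token_limit(3000, 2048, 4096))  # 3000 + 2048 = 5048 > 4096
print(exceeds_token_limit(3000, 512, 4096))   # 3000 + 512 = 3512 <= 4096
```

The second call shows why lowering the output budget alone can be enough: the prompt is unchanged, but the sum now fits.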

Detection

Monitor token usage by logging input and output token counts before prediction calls; catch LLMPredictionError exceptions to identify when token limits are breached.
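One way to log token counts before each call is sketched below. It assumes a rough heuristic of about four characters per token, which is only an approximation; the helper names `estimate_tokens` and `log_token_usage` are hypothetical, and a real setup would count tokens with the tokenizer that matches the model.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token_budget")


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def log_token_usage(prompt: str, max_output_tokens: int,
                    model_max_tokens: int) -> int:
    """Log the estimated budget and return the estimated total."""
    input_tokens = estimate_tokens(prompt)
    total = input_tokens + max_output_tokens
    # Warn when the estimate is over the limit, otherwise log at INFO.
    level = logging.WARNING if total > model_max_tokens else logging.INFO
    logger.log(level, "input=%d output_budget=%d total=%d limit=%d",
               input_tokens, max_output_tokens, total, model_max_tokens)
    return total


log_token_usage("Very large input text..." * 1000, 2048, 4096)
```

Because the estimate is approximate, it is safer to treat a total close to the limit as a warning sign rather than as proof that the call will succeed.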

Causes & fixes

Input document or prompt is too large, causing input tokens to exceed model limits.

✓ Fix

Reduce the input size by chunking documents into smaller pieces or summarizing content before passing to the LLM.
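Chunking can be sketched in plain Python as below. This is an illustrative helper, not a LlamaIndex API: `chunk_text` is a hypothetical name, and the default counter treats each whitespace-separated word as one token, which is an assumption to replace with the model's real tokenizer.

```python
from typing import Callable, List


def chunk_text(text: str, max_tokens_per_chunk: int,
               count_tokens: Callable[[str], int] = lambda s: len(s.split())
               ) -> List[str]:
    """Greedily pack words into chunks that stay within the token budget.

    A single word that exceeds the budget on its own is emitted as its
    own (over-budget) chunk rather than being split.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for word in text.split():
        word_tokens = count_tokens(word)
        # Close the current chunk if adding this word would overflow it.
        if current and current_tokens + word_tokens > max_tokens_per_chunk:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(word)
        current_tokens += word_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("one two three four five", max_tokens_per_chunk=2))
# ['one two', 'three four', 'five']
```

Each chunk can then be passed to the predictor separately, with the per-chunk budget chosen so that chunk tokens plus output tokens stay under the model limit.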

Max output tokens parameter is set too high, causing total tokens to exceed the model's max tokens.

✓ Fix

Lower the max output tokens setting in the LLM predictor configuration to fit within the model's token budget.
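Instead of guessing a lower value, the output budget can be derived from what the prompt leaves free. The sketch below is a hypothetical helper (`clamp_max_output_tokens` and `safety_margin` are illustrative names, not library parameters) and assumes the input token count is already known.

```python
def clamp_max_output_tokens(requested: int, input_tokens: int,
                            model_max_tokens: int,
                            safety_margin: int = 0) -> int:
    """Return the largest output budget that still fits, capped at `requested`."""
    available = model_max_tokens - input_tokens - safety_margin
    if available <= 0:
        # No output budget can fix this; the input itself must shrink.
        raise ValueError("Prompt alone fills the model's token limit.")
    return min(requested, available)


# Illustrative numbers only.
print(clamp_max_output_tokens(2048, input_tokens=3000, model_max_tokens=4096))
# 1096
```

A non-zero `safety_margin` is worth considering when the input count comes from an estimate rather than from the model's own tokenizer.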

Model max token limit is smaller than expected (e.g., using a base model with lower token capacity).

✓ Fix

Switch to a model with a higher token limit, such as gpt-4o or llama-3.3-70b, to accommodate larger inputs and outputs.

Code: broken vs fixed

Broken - triggers the error

python

from llama_index import LLMPredictor, ServiceContext

llm_predictor = LLMPredictor(max_output_tokens=2048)  # Too high for model
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# This call triggers LLMPredictionError due to token limit
response = service_context.llm_predictor.predict(prompt="Very large input text...")  # Error here
print(response)

Fixed - works correctly

python

import os
from llama_index import LLMPredictor, ServiceContext

# The API key is read from the environment; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running.")

# Lower max_output_tokens to fit model token limit
llm_predictor = LLMPredictor(max_output_tokens=512)  # Reduced to avoid token limit error
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# Chunk or summarize input externally before passing
prompt = "Summarized or chunked input text..."
response = service_context.llm_predictor.predict(prompt=prompt)  # Fixed
print(response)

The fixed version lowers max_output_tokens and reduces the input size so that total tokens stay within the model's maximum token limit, which prevents LLMPredictionError.

Workaround

Catch LLMPredictionError exceptions and automatically retry with smaller input chunks or reduced max output tokens to avoid token overflow.
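The retry idea can be sketched as follows. To keep the example runnable without the library, it uses a stand-in exception class `TokenLimitError` and a fake `fake_predict` function; both are hypothetical, and in real code the `error_type` argument would be the exception your installed version actually raises. The strategy (halve the output budget first, then halve the prompt) is one possible choice, not a prescribed one.

```python
from typing import Callable, Type


class TokenLimitError(Exception):
    """Stand-in for the library's token-limit exception (assumption)."""


def predict_with_retry(predict: Callable[[str, int], str], prompt: str,
                       max_output_tokens: int,
                       error_type: Type[Exception] = TokenLimitError,
                       max_attempts: int = 4,
                       min_output_tokens: int = 64) -> str:
    """Retry `predict`, shrinking the output budget and then the prompt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return predict(prompt, max_output_tokens)
        except error_type:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error.
            if max_output_tokens > min_output_tokens:
                # First lever: halve the output budget, down to a floor.
                max_output_tokens = max(min_output_tokens, max_output_tokens // 2)
            else:
                # Second lever: drop the second half of the prompt.
                prompt = prompt[: len(prompt) // 2]
    raise AssertionError("unreachable")


def fake_predict(prompt: str, max_output_tokens: int) -> str:
    """Fake predictor with a 1000-token limit and ~4 characters per token."""
    if len(prompt) // 4 + max_output_tokens > 1000:
        raise TokenLimitError("Token limit exceeded")
    return f"ok with {max_output_tokens} output tokens"


print(predict_with_retry(fake_predict, "x" * 2000, 2048))
```

Truncating the prompt discards content, so in practice chunking or summarizing the input before retrying is usually preferable to cutting it blindly.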

Prevention

Implement input chunking and token counting before prediction calls; use models with sufficient token capacity and configure max output tokens conservatively.

Python 3.9+ · llama_index >=0.5.0 · tested on 0.6.x

Verified 2026-04 · gpt-4o, llama-3.3-70b
