ValueError
llama_cpp.Llama.ValueError
Stack trace
Traceback (most recent call last):
  File "app.py", line 25, in <module>
    output = llama_model(prompt, max_tokens=2048) # triggers token limit error
  File "/usr/local/lib/python3.9/site-packages/llama_cpp/llama.py", line 150, in __call__
    raise ValueError(f"max_tokens {max_tokens} exceeds model limit {self.model_token_limit}")
ValueError: max_tokens 2048 exceeds model limit 1024
Why it happens
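The llama_cpp Llama class enforces a maximum token count per model instance. When a call requests more tokens than the instance supports, it raises a ValueError before generation starts, rather than failing mid-run or exhausting memory. The limit is fixed for the lifetime of the instance: it derives from the context window the model was loaded with (the n_ctx constructor argument in llama-cpp-python), which is in turn bounded by the context length the model was trained with, so no single call can exceed it.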
Detection
Check the max_tokens parameter before calling the Llama instance. Log or assert that max_tokens does not exceed the model_token_limit attribute to catch this error early.
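The check described above can be sketched as a small pre-flight function. This is a minimal sketch, not the library's own validation: check_max_tokens is a hypothetical helper, and the limit is passed in as a plain integer because the attribute name may differ by version (this page calls it model_token_limit; llama-cpp-python also exposes the context size through the n_ctx() method, so verify which one your installed version provides).

```python
import logging

logger = logging.getLogger(__name__)


def check_max_tokens(max_tokens: int, token_limit: int) -> None:
    """Fail early, with a clear message, if max_tokens is over the limit.

    token_limit is an assumption: pass in whatever your Llama instance
    reports as its limit.
    """
    if max_tokens > token_limit:
        logger.error("max_tokens=%d exceeds token limit %d", max_tokens, token_limit)
        raise ValueError(f"max_tokens {max_tokens} exceeds model limit {token_limit}")
    logger.debug("max_tokens=%d is within token limit %d", max_tokens, token_limit)


# Hypothetical usage with a hard-coded limit; a value at or below the
# limit passes silently.
check_max_tokens(512, token_limit=1024)
```

Calling this right before the Llama call turns the failure into a logged, application-level error at a point you control.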
Causes & fixes
max_tokens argument exceeds the model's maximum token limit
Reduce the max_tokens parameter so that it is less than or equal to the model_token_limit property of your Llama instance.
Prompt length plus max_tokens exceeds the model's total token capacity
Shorten the prompt or reduce max_tokens so that prompt tokens plus max_tokens do not exceed the model_token_limit.
Using a model with a smaller token limit than expected (the limit does not grow with parameter count, so moving from llama-7b to llama-13b does not raise it)
Verify that the model you loaded supports the token count you need, or switch to a model with a higher token limit.
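The second cause comes down to simple arithmetic: the tokens left for generation are the limit minus the prompt's token count. A minimal sketch of that budget, assuming you already have the prompt's token count, is below. clamp_max_tokens is a hypothetical helper. With llama-cpp-python the count can usually be obtained as len(llama_model.tokenize(prompt.encode("utf-8"))), but treat that call as something to confirm against your installed version.

```python
def clamp_max_tokens(prompt_token_count: int, requested: int, token_limit: int) -> int:
    """Return the largest max_tokens value that still fits next to the prompt."""
    available = token_limit - prompt_token_count
    if available <= 0:
        # The prompt alone fills the limit; reducing max_tokens cannot help.
        raise ValueError(
            f"prompt uses {prompt_token_count} tokens but the limit is {token_limit}; "
            "shorten the prompt"
        )
    return min(requested, available)


# A 300-token prompt against a 1024-token limit leaves 724 tokens.
print(clamp_max_tokens(300, 2048, 1024))  # 724
print(clamp_max_tokens(300, 256, 1024))   # 256
```

When the prompt alone reaches the limit, the helper raises instead of returning zero, because in that case the only fix is a shorter prompt.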
Code: broken vs fixed
# Broken: max_tokens is larger than the model limit
from llama_cpp import Llama
llama_model = Llama(model_path="./models/llama-7b.bin")
prompt = "Hello, world!"
output = llama_model(prompt, max_tokens=2048) # triggers ValueError: token limit exceeded
print(output)

# Fixed: max_tokens is clamped to the model limit
import os
from llama_cpp import Llama
# Use environment variable for model path
model_path = os.environ.get("LLAMA_MODEL_PATH", "./models/llama-7b.bin")
llama_model = Llama(model_path=model_path)
prompt = "Hello, world!"
# Ensure max_tokens does not exceed model limit
max_tokens = min(512, llama_model.model_token_limit)
output = llama_model(prompt, max_tokens=max_tokens) # fixed token limit
print(output)
Workaround
Catch the ValueError exception around the Llama call, then reduce max_tokens dynamically and retry the call until it succeeds.
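One way to sketch this retry loop is to halve max_tokens after each failure and stop at a floor so the loop always terminates. The names generate_with_backoff, min_tokens, and fake_generate are hypothetical. fake_generate only stands in for a real Llama call so the example runs without a model file.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def generate_with_backoff(
    generate: Callable[[int], T],
    max_tokens: int,
    min_tokens: int = 16,
) -> T:
    """Call generate(max_tokens), halving max_tokens after each ValueError."""
    current = max_tokens
    while True:
        try:
            return generate(current)
        except ValueError:
            if current <= min_tokens:
                raise  # give up: even the smallest request was rejected
            current = max(min_tokens, current // 2)


# Stand-in for a Llama call (assumption): rejects requests above 1024 tokens.
def fake_generate(n: int) -> str:
    if n > 1024:
        raise ValueError(f"max_tokens {n} exceeds model limit 1024")
    return f"generated with max_tokens={n}"


print(generate_with_backoff(fake_generate, 2048))  # generated with max_tokens=1024
```

With a real instance the callable could be lambda n: llama_model(prompt, max_tokens=n). The loop catches every ValueError, so it can also retry on unrelated errors. Checking the exception message before retrying is a reasonable refinement.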
Prevention
Always check llama_model.model_token_limit before generation, and size your prompt and max_tokens so that their combined token count never exceeds this limit.