ValueError
vllm.LLM.ValueError
Stack trace
ValueError: max_tokens (4097) exceeds model context length (4096)
Why it happens
vLLM requires max_tokens to be no larger than the model's maximum context length. In the trace above, max_tokens is 4097 against a 4096-token context window, so the library raises a ValueError instead of accepting a request it cannot satisfy.
Detection
Compare max_tokens with the model's context length before calling the generate method. Rejecting or capping an oversized value at that point avoids the runtime exception.
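A minimal sketch of such a pre-flight check follows. The helper name check_max_tokens is hypothetical and not part of vLLM, and context_length is assumed to be a value you already know (for example from the model documentation). The error text simply mirrors the message in the stack trace above.

```python
def check_max_tokens(max_tokens: int, context_length: int) -> None:
    """Fail early if max_tokens cannot fit in the model's context window.

    Hypothetical helper, not a vLLM API. context_length is assumed to be
    supplied by the caller.
    """
    if max_tokens > context_length:
        raise ValueError(
            f"max_tokens ({max_tokens}) exceeds model context length ({context_length})"
        )


# A value equal to the context length passes silently.
check_max_tokens(4096, 4096)

# A value one token over the limit is rejected before any generation call.
try:
    check_max_tokens(4097, 4096)
except ValueError as exc:
    print(exc)
```

Running the check right before building the sampling parameters keeps the failure close to the code that chose the value.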
Causes & fixes
Cause: max_tokens is set larger than the model's maximum context length.
Fix: Reduce max_tokens to be equal to or less than the model's context length, which can be found in the model documentation or via LLM model attributes.
Cause: Using a model with a smaller context length than expected without adjusting max_tokens.
Fix: Verify the model's context length before setting max_tokens, especially when switching models, and adjust max_tokens accordingly.
Cause: Hardcoding max_tokens without dynamic validation against model limits.
Fix: Implement a validation step in your code to dynamically check and cap max_tokens based on the loaded model's context length.
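The capping step described above could look roughly like this. cap_max_tokens is a hypothetical helper, not a vLLM function. The prompt_tokens argument is an addition not mentioned in the text above: it rests on the assumption that the prompt shares the same context window as the generated tokens, so the sketch subtracts it from the budget. How you obtain context_length is also left as an assumption; in recent vLLM versions it is typically exposed on the engine's model configuration, but check the version you run.

```python
def cap_max_tokens(requested: int, context_length: int, prompt_tokens: int = 0) -> int:
    """Return the largest max_tokens value that still fits in the context window.

    Hypothetical helper, not a vLLM API. prompt_tokens is the number of tokens
    already used by the prompt (assumed to count against the same window).
    """
    budget = context_length - prompt_tokens
    if budget < 1:
        raise ValueError(
            f"prompt ({prompt_tokens} tokens) leaves no room in a "
            f"{context_length}-token context window"
        )
    # Never ask for more than the remaining budget.
    return min(requested, budget)


# A hardcoded 4097 is capped to the 4096-token window.
print(cap_max_tokens(4097, 4096))
# With a 100-token prompt, only 3996 tokens remain for generation.
print(cap_max_tokens(4097, 4096, prompt_tokens=100))
```

The returned value can then be passed as max_tokens when constructing SamplingParams, so switching to a model with a smaller window only requires changing context_length.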
Code: broken vs fixed
# Broken: max_tokens is one token over the context length
from vllm import LLM, SamplingParams
llm = LLM(model="llama-3.3-70b")
params = SamplingParams(max_tokens=4097)  # Exceeds the model context length (4096)
outputs = llm.generate("Hello world", sampling_params=params)  # Raises ValueError here
# Fixed: max_tokens capped at the context length
from vllm import LLM, SamplingParams
llm = LLM(model="llama-3.3-70b")
# Adjust max_tokens to not exceed model context length (4096)
params = SamplingParams(max_tokens=4096)  # Fixed max_tokens
outputs = llm.generate("Hello world", sampling_params=params)
print(outputs[0].outputs[0].text)  # Works without error
Workaround
Catch the ValueError exception, then programmatically reduce max_tokens to the model's context length and retry the generation call.
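The catch-and-retry pattern can be sketched as below. To keep the example runnable without a GPU or a model download, generation is passed in as a plain callable that takes max_tokens. generate_with_retry and fake_generate are hypothetical names, and fake_generate only imitates the error from the stack trace. In real code the callable would wrap your own llm.generate call.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def generate_with_retry(
    generate: Callable[[int], T], max_tokens: int, context_length: int
) -> T:
    """Call generate(max_tokens); on ValueError, retry once with a capped value.

    Hypothetical helper, not a vLLM API.
    """
    try:
        return generate(max_tokens)
    except ValueError:
        if max_tokens <= context_length:
            # The error has some other cause; capping would not help.
            raise
        return generate(context_length)


def fake_generate(max_tokens: int) -> str:
    """Stand-in for a real generation call, with a 4096-token limit."""
    if max_tokens > 4096:
        raise ValueError(
            f"max_tokens ({max_tokens}) exceeds model context length (4096)"
        )
    return f"generated with max_tokens={max_tokens}"


print(generate_with_retry(fake_generate, 4097, 4096))
```

Retrying only when max_tokens actually exceeds the limit avoids masking unrelated ValueErrors. Validating up front is still cheaper than catching the exception after the fact.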
Prevention
Always query or document the model's maximum context length and validate max_tokens dynamically before generation calls to avoid exceeding limits.