RuntimeError
llamacpp.RuntimeError: context size exceeded n_ctx
Stack trace
Traceback (most recent call last):
  File "app.py", line 42, in <module>
    output = model.generate(prompt)
  File "llamacpp.py", line 88, in generate
    raise RuntimeError("context size exceeded n_ctx")
llamacpp.RuntimeError: context size exceeded n_ctx
Why it happens
A llama.cpp model is loaded with a fixed context window size (n_ctx) that caps the total number of tokens a single request can hold: the prompt tokens plus every token generated in response. When that total exceeds n_ctx, the runtime raises this error instead of writing past the memory allocated for the context.
Detection
Monitor the token count of your prompt and expected generation length before calling generate(), and log or assert if it approaches or exceeds the model's n_ctx limit.
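One way to make this check concrete is a small budget helper. This is a minimal sketch, not part of any llama.cpp binding: the name check_token_budget, its parameters, and the 90% warning threshold are illustrative assumptions, and it expects a prompt token count you have already obtained from your tokenizer.

```python
import logging

logger = logging.getLogger("ctx_budget")


def check_token_budget(prompt_tokens: int, max_tokens: int, n_ctx: int,
                       warn_ratio: float = 0.9) -> int:
    """Return the number of unused context slots; raise if the request cannot fit.

    All names here are hypothetical helpers for illustration.
    """
    requested = prompt_tokens + max_tokens
    if requested > n_ctx:
        raise ValueError(f"requested {requested} tokens, but n_ctx is {n_ctx}")
    if requested >= warn_ratio * n_ctx:
        # Close to the limit: log so the condition shows up before it becomes an error.
        logger.warning("request uses %d of %d context tokens", requested, n_ctx)
    return n_ctx - requested
```

Calling check_token_budget(1500, 500, 2048) returns 48 and logs a warning, because 2000 tokens is above 90% of the window.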
Causes & fixes
Input prompt plus generation length exceeds the model's n_ctx context window size
Reduce the prompt length or lower the max tokens parameter to ensure total tokens stay within the n_ctx limit.
The model is loaded with a smaller n_ctx than your application needs
Load the model with a larger n_ctx if the model's trained context length and available memory allow it, or switch to a model variant that supports a longer context window.
Not accounting for special tokens or tokenization overhead in token count calculations
Use the llama.cpp tokenizer to accurately count tokens including special tokens before generation.
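The last fix above can be sketched as a counting helper that adds the special-token overhead explicitly. The tokenizer call differs between bindings, so this sketch takes it as a placeholder callable named tokenize. Both count_prompt_tokens and the toy tokenizer are hypothetical, and the default of one extra token assumes the runtime prepends a single beginning-of-sequence token, which you should verify for your model.

```python
from typing import Callable, List


def count_prompt_tokens(text: str,
                        tokenize: Callable[[str], List[int]],
                        special_tokens: int = 1) -> int:
    """Count the tokens in `text` plus special tokens added by the runtime.

    `special_tokens=1` is an assumption (a single BOS token); adjust per model.
    """
    return len(tokenize(text)) + special_tokens


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer for demonstration only: one id per whitespace-separated word."""
    return list(range(len(text.split())))


# Three words plus one assumed BOS token.
print(count_prompt_tokens("hello there world", toy_tokenize))  # prints 4
```

In real use, replace toy_tokenize with the tokenizer that belongs to the loaded model, since character or word counts can differ widely from token counts.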
Code: broken vs fixed
# Broken: prompt plus generated tokens do not fit in n_ctx
from llamacpp import Llama
model = Llama(model_path="./model.bin", n_ctx=2048)
prompt = "A" * 3000  # Prompt too long for n_ctx=2048 (the limit is counted in tokens, not characters)
output = model.generate(prompt)  # RuntimeError: context size exceeded n_ctx
print(output)

# Fixed: keep prompt tokens + max_tokens within n_ctx
import os
from llamacpp import Llama
model = Llama(model_path=os.environ["LLAMA_MODEL_PATH"], n_ctx=2048)
prompt = "A" * 1500  # Shorter prompt
max_tokens = 500  # Prompt tokens + max_tokens must stay <= n_ctx
output = model.generate(prompt, max_tokens=max_tokens)  # Fits in the context window
print(output)  # Works without error
Workaround
Catch the RuntimeError and programmatically truncate the prompt or reduce max_tokens dynamically before retrying generation.
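A retry loop along these lines might look like the sketch below. It is written against a generic generate callable, not a specific binding: generate_with_retry and its parameters are hypothetical, the error is recognized by matching "n_ctx" in the message shown in the stack trace above, and the prompt is shortened by characters as a crude stand-in for token-level truncation.

```python
from typing import Callable


def generate_with_retry(generate: Callable[[str, int], str],
                        prompt: str,
                        max_tokens: int,
                        max_attempts: int = 4) -> str:
    """Call `generate`, dropping the older half of the prompt after each context error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return generate(prompt, max_tokens)
        except RuntimeError as err:
            # Re-raise unrelated errors, and give up after the final attempt.
            if "n_ctx" not in str(err) or attempt == max_attempts - 1:
                raise
            # Keep the most recent half of the prompt (character-based, so approximate).
            prompt = prompt[len(prompt) // 2:]
    raise AssertionError("unreachable: the loop always returns or raises")
```

Keeping the tail of the prompt assumes the most recent text matters most, which suits chat-style input; for instructions placed at the start of the prompt, trimming from the middle or lowering max_tokens may be a better choice.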
Prevention
Implement token counting and validation before generation calls to ensure prompt plus generation length never exceeds the model's n_ctx limit.