High severity · Intermediate · Fix: 2-5 min

RuntimeError

llamacpp.RuntimeError: context size exceeded n_ctx

What this error means

The prompt tokens plus the generated tokens exceeded the context window (n_ctx) that the llama.cpp model was loaded with, so the runtime aborted the request.

Stack trace

traceback

Traceback (most recent call last):
  File "app.py", line 42, in <module>
    output = model.generate(prompt)
  File "llamacpp.py", line 88, in generate
    raise RuntimeError("context size exceeded n_ctx")
llamacpp.RuntimeError: context size exceeded n_ctx

QUICK FIX

Trim the prompt or lower max_tokens so that prompt tokens plus max_tokens stay at or below the model's n_ctx limit.

Why it happens

A llama.cpp model is loaded with a fixed context window size (n_ctx) that caps the total number of tokens it can hold for a single request. When the prompt length plus the generated tokens exceeds this limit, the runtime raises this error rather than risk a buffer overflow or other memory issues.

Detection

Monitor the token count of your prompt and expected generation length before calling generate(), and log or assert if it approaches or exceeds the model's n_ctx limit.
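A minimal sketch of such a pre-flight check, assuming the prompt's token count has already been obtained from the model's tokenizer. The helper name `check_token_budget` and the 90% warning threshold are illustrative choices, not part of llamacpp.

```python
import logging

logger = logging.getLogger("token_budget")


def check_token_budget(prompt_tokens: int, max_tokens: int, n_ctx: int,
                       warn_ratio: float = 0.9) -> str:
    """Classify a request as "ok", "near_limit", or "exceeded" (hypothetical helper)."""
    total = prompt_tokens + max_tokens
    if total > n_ctx:
        logger.error("token budget exceeded: %d > n_ctx=%d", total, n_ctx)
        return "exceeded"
    if total >= warn_ratio * n_ctx:
        logger.warning("token budget near limit: %d of n_ctx=%d", total, n_ctx)
        return "near_limit"
    return "ok"
```

Calling the check before generate() lets the application log or refuse an oversized request instead of discovering the problem through a RuntimeError.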

Causes & fixes

Input prompt plus generation length exceeds the model's n_ctx context window size

✓ Fix

Reduce the prompt length or lower the max tokens parameter to ensure total tokens stay within the n_ctx limit.

Model loaded with a smaller n_ctx than the application expects

✓ Fix

Load the model with a larger n_ctx if memory allows, or switch to a llama.cpp model variant that supports a longer context window if one is available.

Not accounting for special tokens or tokenization overhead in token count calculations

✓ Fix

Use the llama.cpp tokenizer to accurately count tokens including special tokens before generation.
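The token-counting fix can be sketched as follows. Here `tokenize` is a placeholder for whatever tokenizer call your llamacpp version exposes (this page does not name one), and `special_tokens` is an assumed allowance for tokens such as BOS that the runtime may add. The whitespace tokenizer in the demonstration is a toy stand-in only.

```python
from typing import Callable, Sequence


def count_prompt_tokens(text: str,
                        tokenize: Callable[[str], Sequence],
                        special_tokens: int = 1) -> int:
    """Return the token count for text plus an allowance for special tokens.

    tokenize is a stand-in for the model's real tokenizer; special_tokens
    is an assumed number of extra tokens (for example a BOS token) that
    len(tokenize(text)) would not include.
    """
    return len(tokenize(text)) + special_tokens


# Toy demonstration: character length is not a token count.
text = "the quick brown fox"
print(len(text))                             # number of characters
print(count_prompt_tokens(text, str.split))  # toy tokens plus one special token
```

The point of the sketch is that the budget should be computed from tokenizer output, not from len(prompt), and that any tokens added by the runtime count against n_ctx too.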

Code: broken vs fixed

Broken - triggers the error

python

from llamacpp import Llama

model = Llama(model_path="./model.bin", n_ctx=2048)
prompt = "hello " * 3000  # About 3000 words, well over the 2048-token window
output = model.generate(prompt)  # RuntimeError: context size exceeded n_ctx
print(output)

Fixed - works correctly

python

import os
from llamacpp import Llama

model = Llama(model_path=os.environ["LLAMA_MODEL_PATH"], n_ctx=2048)
prompt = "A" * 1500  # Shorter prompt (1500 characters, so no more than about 1500 tokens)
max_tokens = 500  # Cap generation so prompt tokens + max_tokens <= n_ctx
output = model.generate(prompt, max_tokens=max_tokens)  # Request now fits in the context window
print(output)  # Works without error

The fix shortens the prompt and caps max_tokens so that the total token count fits within the model's n_ctx context window, which prevents the runtime error.

⚠

Workaround

Catch the RuntimeError and programmatically truncate the prompt or reduce max_tokens dynamically before retrying generation.
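One way this retry loop might look, as a sketch under stated assumptions: `generate` is a hypothetical callable taking a list of token ids and a max_tokens value, and it raises a RuntimeError whose message contains "n_ctx" when the context is exceeded (as in the stack trace above). The halving strategy is an illustrative choice.

```python
from typing import Callable, List, Tuple


def generate_with_retry(generate: Callable[[List[int], int], str],
                        prompt_tokens: List[int],
                        max_tokens: int,
                        max_attempts: int = 3) -> Tuple[str, int, int]:
    """Retry generation, shrinking the request after each context-size error.

    Returns the output plus the prompt length and max_tokens that succeeded.
    """
    tokens = list(prompt_tokens)
    for attempt in range(max_attempts):
        try:
            return generate(tokens, max_tokens), len(tokens), max_tokens
        except RuntimeError as exc:
            # Re-raise unrelated errors, and give up after the last attempt.
            if "n_ctx" not in str(exc) or attempt == max_attempts - 1:
                raise
            # Keep the most recent half of the prompt and halve the generation cap.
            tokens = tokens[len(tokens) // 2:]
            max_tokens = max(1, max_tokens // 2)
    raise ValueError("max_attempts must be at least 1")
```

Dropping the oldest tokens suits chat-style prompts where recent turns matter most. If the start of the prompt carries the instructions, truncating from the other end or from the middle may be a better fit.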

✓

Prevention

Implement token counting and validation before generation calls to ensure prompt plus generation length never exceeds the model's n_ctx limit.
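As an illustration of validating before the call, the sketch below clamps the requested generation length to whatever room the prompt leaves. The helper name `clamp_max_tokens` is hypothetical, and the prompt token count is assumed to come from the model's tokenizer.

```python
def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int, n_ctx: int) -> int:
    """Return the largest generation cap that keeps the request within n_ctx.

    Hypothetical helper: raises ValueError when the prompt alone leaves no
    room for even one generated token.
    """
    available = n_ctx - prompt_tokens
    if available < 1:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room in n_ctx={n_ctx}"
        )
    return min(requested_max_tokens, available)
```

With n_ctx=2048, a 1500-token prompt keeps a requested cap of 500, while an 1800-token prompt has the cap reduced to 248.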

Python 3.9+ · llamacpp >=0.1.0 · tested on 0.1.x

Verified 2026-04
