Fix llama.cpp out of memory
Quick answer
To fix llama.cpp out-of-memory errors, reduce the n_ctx context window size, use a 4-bit quantized GGUF model (e.g., Q4_K_M), or lower the batch size during inference. These adjustments reduce GPU/CPU memory usage and prevent allocation failures.
Prerequisites
- Python 3.8+
- pip install llama-cpp-python
- A compatible GGUF model file
- Sufficient system RAM or GPU memory
Setup
Install the llama-cpp-python package and prepare a GGUF quantized model file for efficient memory usage.
- Use 4-bit quantized models to reduce memory footprint.
- Ensure Python 3.8+ is installed.
```shell
pip install llama-cpp-python
```

Output:

```
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
```
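llama.cpp only loads GGUF files, so before debugging memory errors it is worth confirming the model file really is GGUF. GGUF files begin with the ASCII magic bytes "GGUF"; a minimal sketch of that check (the model path is a placeholder):

```python
def is_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage with a hypothetical model path:
# print(is_gguf("./models/llama-3.1-8b.Q4_K_M.gguf"))
```

If this returns False, the file is likely a legacy GGML file or an incomplete download, and no amount of memory tuning will make it load.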
Step by step
Load a quantized GGUF model with reduced context size and run inference with a small batch to avoid out of memory errors.
```python
from llama_cpp import Llama

# Load a 4-bit quantized GGUF model with a reduced context window
llm = Llama(
    model_path="./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=2048,       # smaller context window to save memory
    n_gpu_layers=20,  # offload some layers to GPU if available
)

# Run a chat completion with a small prompt
output = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Explain how to fix out of memory errors in llama.cpp."}
], max_tokens=128)
print(output["choices"][0]["message"]["content"])
```

Output:

```
Explain how to fix out of memory errors in llama.cpp by reducing the context size, using quantized models, and adjusting batch sizes to fit your hardware memory limits.
```
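Most of the savings from a smaller n_ctx come from the KV cache, whose size grows linearly with the context length. A rough estimator (a sketch: the default layer and head counts below are the published Llama 3.1 8B shapes, and a 16-bit cache is assumed):

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * dtype_bytes

# Halving n_ctx halves the cache:
print(kv_cache_bytes(4096) / 2**20)  # 512.0 (MiB)
print(kv_cache_bytes(2048) / 2**20)  # 256.0 (MiB)
```

This is on top of the model weights themselves, which is why dropping n_ctx from 4096 to 2048 can be enough to get a model that "almost fits" running.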
Common variations
You can further optimize memory by:
- Lowering n_gpu_layers to offload more layers to CPU if GPU memory is limited.
- Using smaller batch sizes or shorter prompts.
- Running inference on CPU only by setting n_gpu_layers=0.
```python
from llama_cpp import Llama

# CPU-only inference with a smaller context window
llm_cpu = Llama(
    model_path="./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=1024,
    n_gpu_layers=0,
)

output_cpu = llm_cpu.create_chat_completion(messages=[
    {"role": "user", "content": "Summarize llama.cpp memory tips."}
], max_tokens=64)
print(output_cpu["choices"][0]["message"]["content"])
```

Output:

```
To reduce memory usage in llama.cpp, use smaller context windows and run on CPU if GPU memory is insufficient.
```
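One way to choose a value for n_gpu_layers is to estimate how many layers fit in the VRAM you can spare, treating the model file size as spread evenly across layers. A crude sketch (the 32-layer default and the file size/VRAM figures are illustrative assumptions, not measured values):

```python
def pick_n_gpu_layers(model_bytes, vram_budget_bytes, n_layers=32):
    """Estimate how many transformer layers fit in a VRAM budget."""
    per_layer = model_bytes / n_layers  # crude: weights assumed evenly spread
    return min(n_layers, int(vram_budget_bytes // per_layer))

# A ~4.9 GB Q4_K_M file with 4 GiB of spare VRAM:
print(pick_n_gpu_layers(4_900_000_000, 4 * 1024**3))  # 28
```

In practice, leave some headroom below this estimate for the KV cache and compute buffers, which also live in VRAM when layers are offloaded.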
Troubleshooting
If you still encounter out of memory errors:
- Verify your model file is quantized (4-bit GGUF recommended).
- Reduce n_ctx further (e.g., 512).
- Close other GPU-intensive applications.
- Check system RAM and GPU memory availability.
- Use swap memory or run on CPU if GPU memory is too low.
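To check GPU memory availability from a script, you can query nvidia-smi (assuming an NVIDIA GPU with the driver installed) and parse its CSV output. A sketch:

```python
import subprocess

def parse_free_mib(csv_text):
    """Parse the output of:
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    One integer (MiB) per GPU, one per line."""
    return [int(line) for line in csv_text.strip().splitlines()]

def gpu_free_mib():
    """Free memory per GPU in MiB, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_free_mib(out)
```

Comparing the reported free MiB against your estimated model-plus-cache footprint tells you whether to offload fewer layers or fall back to CPU.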
Key Takeaways
- Reduce n_ctx context size to lower memory consumption in llama.cpp.
- Use 4-bit quantized GGUF models to fit large models into limited memory.
- Adjust n_gpu_layers to balance CPU/GPU memory usage.
- Lower batch size and prompt length to avoid allocation failures.
- Run on CPU if GPU memory is insufficient for your model size.