llama.cpp GPU layers configuration
Quick answer
In llama.cpp, configure GPU layers with the n_gpu_layers parameter when initializing the Llama model. Setting n_gpu_layers to a positive integer offloads that many transformer layers to the GPU, trading VRAM usage for inference speed. For example, n_gpu_layers=20 runs 20 layers on the GPU and the rest on the CPU.
Prerequisites
- Python 3.8+
- pip install llama-cpp-python
- A compatible GPU with CUDA support
- A downloaded llama.cpp model file in GGUF format (GGML is the older, legacy format)
Setup
Install the llama-cpp-python package and prepare your environment with a compatible GPU and CUDA drivers. Note that the default pip wheel is typically CPU-only; to enable CUDA offloading you may need to build with the CUDA backend enabled (on recent versions, CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python). Download a GGUF model file (or a legacy GGML file) for inference.
pip install llama-cpp-python

output:
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Step by step
Use the Llama class from llama_cpp and specify n_gpu_layers to control how many transformer layers run on GPU. This balances speed and VRAM usage.
from llama_cpp import Llama
# Initialize model with 20 GPU layers
llm = Llama(
    model_path="./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=20,  # Offload 20 transformer layers to GPU
)
# Generate text
output = llm("Hello, llama.cpp with GPU layers!", max_tokens=50)
print(output["choices"][0]["text"])

output:
(a model-generated continuation of the prompt; the exact text varies by model and sampling settings)
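A rough way to pick n_gpu_layers for your hardware is to divide usable VRAM by an estimated per-layer memory cost. The sketch below is a hypothetical helper, not part of llama-cpp-python; the model size, layer count, and reserve are illustrative assumptions you should replace with your own numbers.

```python
# Hypothetical helper (not a llama-cpp-python API): estimate how many
# layers fit in VRAM, assuming memory scales linearly with offloaded layers.
def estimate_n_gpu_layers(vram_gb, model_size_gb, total_layers, reserve_gb=1.0):
    """Rough estimate of a safe n_gpu_layers value for a given card."""
    per_layer_gb = model_size_gb / total_layers      # assumed uniform layer cost
    usable_gb = max(vram_gb - reserve_gb, 0.0)       # keep headroom for the KV cache etc.
    return min(total_layers, int(usable_gb / per_layer_gb))

# Example: 4 GB card, ~4.9 GB Q4_K_M 8B model with 32 transformer layers
print(estimate_n_gpu_layers(4.0, 4.9, 32))  # → 19
```

This is only a starting point; actual VRAM usage also depends on context size and batch settings, so validate with real loads.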
Common variations
- Set n_gpu_layers=0 to run fully on CPU.
- Increase n_gpu_layers to use more GPU VRAM for faster inference if available.
- Use n_gpu_layers=-1 to offload all layers to GPU if supported.
- Adjust n_ctx to set the context window size.
from llama_cpp import Llama
# Fully CPU inference
llm_cpu = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_gpu_layers=0)
# Offload all layers to GPU
llm_full_gpu = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_gpu_layers=-1)
# Smaller GPU layer count
llm_partial_gpu = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_gpu_layers=10)

output:
No direct output; models initialized with different GPU layer configurations.
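To make the three conventions concrete, here is a small illustrative function (not part of the library) that mirrors how a given n_gpu_layers value splits a model's layers between GPU and CPU:

```python
# Illustration only (not a llama-cpp-python API): how an n_gpu_layers value
# maps to GPU vs. CPU layer counts under the 0 / positive / -1 conventions.
def plan_offload(total_layers, n_gpu_layers):
    """Return (gpu_layers, cpu_layers) counts for a given n_gpu_layers."""
    if n_gpu_layers < 0:                       # -1 convention: offload everything
        return total_layers, 0
    n_gpu = min(n_gpu_layers, total_layers)    # clamp to the model's layer count
    return n_gpu, total_layers - n_gpu

print(plan_offload(32, 0))    # → (0, 32)  fully CPU
print(plan_offload(32, 20))   # → (20, 12) partial offload
print(plan_offload(32, -1))   # → (32, 0)  fully GPU
```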
Troubleshooting
- If you get CUDA out-of-memory errors, reduce n_gpu_layers to offload fewer layers.
- Ensure your GPU drivers and CUDA toolkit are up to date.
- Check model compatibility with llama-cpp-python and use the GGUF format (or legacy GGML).
- Use n_gpu_layers=0 to fall back to CPU if GPU issues persist.
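One way to automate the advice above is to retry initialization with progressively fewer GPU layers until loading succeeds. This is a hedged sketch: load_with_fallback is a hypothetical helper, and the exact exception raised on an out-of-memory failure depends on your llama-cpp-python version, so it catches broadly.

```python
# Hypothetical helper: retry model creation with fewer GPU layers on failure.
# make_model is any callable that takes an n_gpu_layers value and returns a model.
def load_with_fallback(make_model, layer_counts=(35, 20, 10, 0)):
    """Try each n_gpu_layers value in turn; return (model, n_gpu_layers)."""
    last_err = None
    for n in layer_counts:
        try:
            return make_model(n), n
        except Exception as err:  # e.g. a CUDA out-of-memory failure during init
            last_err = err
    raise RuntimeError("every n_gpu_layers setting failed") from last_err
```

With llama-cpp-python you would pass something like lambda n: Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_gpu_layers=n) as make_model. Ending layer_counts with 0 guarantees a CPU-only fallback.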
Key Takeaways
- Use the n_gpu_layers parameter in Llama to control GPU offloading of transformer layers.
- Balance n_gpu_layers to trade off inference speed against GPU memory usage.
- Set n_gpu_layers=0 for CPU-only inference, or -1 to offload all layers to GPU if supported.
- Keep GPU drivers and CUDA updated to avoid memory and compatibility issues.