How to run llama.cpp on GPU
Quick answer
To run llama.cpp on GPU, use the `llama-cpp-python` library with a GGUF model and set `n_gpu_layers` to a non-zero value (e.g., -1 to offload all layers to the GPU). This enables CUDA acceleration for faster local inference with llama.cpp.

Prerequisites

- Python 3.8+
- A CUDA-enabled GPU with proper drivers installed
- `pip install llama-cpp-python`
- A Llama model in GGUF format
Setup
Install the `llama-cpp-python` package, which provides Python bindings for llama.cpp with GPU support; make sure the wheel you install is built with the CUDA backend, since default builds can be CPU-only. Ensure your system has a CUDA-enabled GPU and the appropriate NVIDIA drivers installed. Then download a GGUF model compatible with llama.cpp from Hugging Face or another source.
Output of `pip install llama-cpp-python`:

```
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
```
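If the stock wheel turns out to be CPU-only, `llama-cpp-python` accepts CMake flags through the `CMAKE_ARGS` environment variable at install time; the `GGML_CUDA` flag applies to recent versions (older releases used `-DLLAMA_CUBLAS=on`). A typical reinstall with the CUDA backend compiled in looks like:

```shell
# Rebuild and reinstall llama-cpp-python with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

This compiles llama.cpp from source during installation, so a working CUDA toolkit and C/C++ compiler must be present on the machine.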
Step by step
Use the Llama class from llama_cpp to load your GGUF model and enable GPU acceleration by setting n_gpu_layers=-1. This moves all transformer layers to GPU for faster inference. Then call create_chat_completion with chat messages.
```python
from llama_cpp import Llama
import os

# Path to a local GGUF model
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q8_0.gguf")

llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all transformer layers to the GPU
)

messages = [
    {"role": "user", "content": "Explain the benefits of running llama.cpp on GPU."}
]

response = llm.create_chat_completion(messages=messages, max_tokens=128)
print(response["choices"][0]["message"]["content"])
```

Output:
The benefits of running llama.cpp on GPU include significantly faster inference times compared to CPU-only execution, enabling real-time or near-real-time responses. GPU acceleration leverages parallel processing capabilities, reducing latency and improving throughput for large language models.
Common variations
- Set `n_gpu_layers` to a positive integer to offload only some layers to the GPU, balancing memory and speed.
- Use `n_threads` to control the CPU threads used for layers that remain on the CPU.
- Run asynchronously by integrating with async frameworks, though `llama-cpp-python` is primarily synchronous.
- Use larger GGUF models, such as `llama-3.3-70b.Q8_0.gguf`, on GPUs with more memory.
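A rough way to pick a partial `n_gpu_layers` value is to budget free VRAM against an estimated per-layer size. The helper below is a hypothetical sketch (the name `choose_n_gpu_layers`, the 1 GB reserve, and the per-layer sizes are illustrative assumptions, not anything `llama-cpp-python` provides):

```python
def choose_n_gpu_layers(total_layers: int, layer_size_mb: int,
                        free_vram_mb: int, reserve_mb: int = 1024) -> int:
    """Return how many transformer layers fit in free VRAM, keeping a reserve
    for the KV cache and scratch buffers; the rest stay on the CPU."""
    budget = free_vram_mb - reserve_mb
    if budget <= 0:
        return 0
    return int(min(total_layers, budget // layer_size_mb))

# e.g., a 32-layer model at ~220 MB/layer with 6 GB of free VRAM
print(choose_n_gpu_layers(32, 220, 6144))   # offloads 23 of 32 layers
```

The returned value can then be passed as `n_gpu_layers=...` when constructing `Llama`.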
Troubleshooting
- If you get `CUDA out of memory` errors, reduce `n_gpu_layers` or use a smaller model.
- Ensure your CUDA driver and toolkit versions are compatible with your GPU and your `llama-cpp-python` version.
- If the model fails to load, verify the GGUF model path and format.
- Check that your Python environment matches the CUDA version requirements.
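When a model fails to load, a cheap pre-flight check is to confirm the file exists and begins with the `GGUF` magic bytes, which every valid GGUF file starts with. The helper name `looks_like_gguf` is our own illustration:

```python
import os

def looks_like_gguf(path: str) -> bool:
    """Sanity-check a model path before handing it to Llama():
    the file must exist and start with the 4-byte GGUF magic."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

This won't catch every corrupt file, but it quickly distinguishes "wrong path" and "not a GGUF file" from genuine loader errors.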
Key takeaways

- Use `llama-cpp-python` with `n_gpu_layers=-1` to run all transformer layers on GPU.
- Download GGUF-format models for compatibility with GPU-accelerated llama.cpp.
- Adjust `n_gpu_layers` to balance GPU memory usage and speed.
- Ensure CUDA drivers and environment are properly installed to avoid runtime errors.