How to run llama.cpp on GPU
Quick answer
To run llama.cpp on GPU, use the `llama-cpp-python` library with a GGUF model and set `n_gpu_layers` to a non-zero value (e.g., -1 to offload all layers to the GPU). This enables CUDA acceleration for faster local inference with llama.cpp.

Prerequisites

- Python 3.8+
- A CUDA-enabled GPU with proper drivers installed
- `pip install llama-cpp-python`
- A Llama model in GGUF format
Setup
Install the `llama-cpp-python` package, which provides Python bindings for llama.cpp with GPU support; make sure the wheel you install is built with the CUDA backend, since default builds can be CPU-only. Ensure your system has a CUDA-enabled GPU and the appropriate NVIDIA drivers installed. Then download a GGUF model compatible with llama.cpp from Hugging Face or another source.
Output of `pip install llama-cpp-python`:

```
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
```
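If the stock wheel turns out to be CPU-only, `llama-cpp-python` accepts CMake flags through the `CMAKE_ARGS` environment variable at install time; the `GGML_CUDA` flag applies to recent versions (older releases used `-DLLAMA_CUBLAS=on`). A typical reinstall with the CUDA backend compiled in looks like:

```shell
# Rebuild and reinstall llama-cpp-python with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

This compiles llama.cpp from source during installation, so a working CUDA toolkit and C/C++ compiler must be present on the machine.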
Step by step
Use the Llama class from llama_cpp to load your GGUF model and enable GPU acceleration by setting n_gpu_layers=-1. This moves all transformer layers to GPU for faster inference. Then call create_chat_completion with chat messages.
```python
from llama_cpp import Llama
import os

# Path to a local GGUF model
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q8_0.gguf")

llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all transformer layers to the GPU
)

messages = [
    {"role": "user", "content": "Explain the benefits of running llama.cpp on GPU."}
]

response = llm.create_chat_completion(messages=messages, max_tokens=128)
print(response["choices"][0]["message"]["content"])
```

Output:
The benefits of running llama.cpp on GPU include significantly faster inference times compared to CPU-only execution, enabling real-time or near-real-time responses. GPU acceleration leverages parallel processing capabilities, reducing latency and improving throughput for large language models.
Common variations
- Set `n_gpu_layers` to a positive integer to offload only some layers to the GPU, balancing memory and speed.
- Use `n_threads` to control the CPU threads used for layers that remain on the CPU.
- Run asynchronously by integrating with async frameworks, though `llama-cpp-python` is primarily synchronous.
- Use larger GGUF models, such as `llama-3.3-70b.Q8_0.gguf`, on GPUs with more memory.
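A rough way to pick a partial `n_gpu_layers` value is to budget free VRAM against an estimated per-layer size. The helper below is a hypothetical sketch (the name `choose_n_gpu_layers`, the 1 GB reserve, and the per-layer sizes are illustrative assumptions, not anything `llama-cpp-python` provides):

```python
def choose_n_gpu_layers(total_layers: int, layer_size_mb: int,
                        free_vram_mb: int, reserve_mb: int = 1024) -> int:
    """Return how many transformer layers fit in free VRAM, keeping a reserve
    for the KV cache and scratch buffers; the rest stay on the CPU."""
    budget = free_vram_mb - reserve_mb
    if budget <= 0:
        return 0
    return int(min(total_layers, budget // layer_size_mb))

# e.g., a 32-layer model at ~220 MB/layer with 6 GB of free VRAM
print(choose_n_gpu_layers(32, 220, 6144))   # offloads 23 of 32 layers
```

The returned value can then be passed as `n_gpu_layers=...` when constructing `Llama`.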
Troubleshooting
- If you get `CUDA out of memory` errors, reduce `n_gpu_layers` or use a smaller model.
- Ensure your CUDA driver and toolkit versions are compatible with your GPU and your `llama-cpp-python` version.
- If the model fails to load, verify the GGUF model path and format.
- Check that your Python environment matches the CUDA version requirements.
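When a model fails to load, a cheap pre-flight check is to confirm the file exists and begins with the `GGUF` magic bytes, which every valid GGUF file starts with. The helper name `looks_like_gguf` is our own illustration:

```python
import os

def looks_like_gguf(path: str) -> bool:
    """Sanity-check a model path before handing it to Llama():
    the file must exist and start with the 4-byte GGUF magic."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

This won't catch every corrupt file, but it quickly distinguishes "wrong path" and "not a GGUF file" from genuine loader errors.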
Key takeaways

- Use `llama-cpp-python` with `n_gpu_layers=-1` to run all transformer layers on GPU.
- Download GGUF-format models for compatibility with GPU-accelerated llama.cpp.
- Adjust `n_gpu_layers` to balance GPU memory usage and speed.
- Ensure CUDA drivers and environment are properly installed to avoid runtime errors.