RuntimeError
llamacpp.RuntimeError: CUDA build not found, CPU fallback
Stack trace
RuntimeError: CUDA build not found, CPU fallback
File "llamacpp.py", line 45, in load_model
model = Llama(model_path, use_cuda=True) # triggers error
File "llamacpp.py", line 120, in __init__
raise RuntimeError("CUDA build not found, CPU fallback")
Why it happens
llama.cpp can run on the GPU only when it has been built with CUDA support. If the CUDA build is missing or incompatible, GPU execution is unavailable and only CPU execution remains. This happens when the environment lacks a suitable CUDA toolkit or GPU driver, or when the llama.cpp library was compiled without CUDA support.
Detection
Check the logs for a 'CUDA build not found' RuntimeError during model initialization. Also monitor GPU utilization while the model is running; if it stays at zero, the model is running on the CPU.
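Both detection steps can be scripted. The following is a minimal sketch, not part of llama.cpp: the helper names are hypothetical, the log marker is copied from the error message above, and the GPU query assumes the NVIDIA driver's nvidia-smi tool is installed and on PATH.

```python
import shutil
import subprocess

# Marker text copied from the error message in the stack trace.
ERROR_MARKER = "CUDA build not found"


def find_cuda_fallback(log_lines):
    """Return the log lines that mention the CUDA fallback error."""
    return [line for line in log_lines if ERROR_MARKER in line]


def parse_utilization(csv_text):
    """Parse per-GPU utilization percentages from nvidia-smi CSV output.

    Non-numeric entries (for example "[N/A]") are skipped.
    """
    return [int(value) for value in csv_text.split() if value.isdigit()]


def gpu_utilization():
    """Return utilization percentages per GPU, or None if they cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return parse_utilization(result.stdout)
```

A reading of zero on every GPU while the model is generating output suggests the CPU path is in use; a single zero reading on an idle system proves nothing, so sample it during inference.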
Causes & fixes
Cause: The llama.cpp library was compiled without CUDA support.
Fix: Rebuild or reinstall llama.cpp with CUDA enabled by following the official build instructions for GPU support.
Cause: The CUDA toolkit or GPU drivers are missing or incompatible on the system.
Fix: Install the CUDA toolkit and GPU drivers that match your GPU and the requirements of your llama.cpp CUDA build.
Cause: An environment variable or configuration setting disables CUDA usage.
Fix: Ensure the llama.cpp initialization parameter 'use_cuda=True' is set and that no environment variable disables GPU usage, for example CUDA_VISIBLE_DEVICES set to an empty string or to -1, which hides every GPU from CUDA.
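The second and third causes can be checked quickly from Python before touching the build. This is a rough sketch using only the standard library; the function names are hypothetical, and finding nvidia-smi or nvcc on PATH only hints that the driver or toolkit is installed, it does not prove the versions are compatible.

```python
import os
import shutil


def cuda_hidden_by_env(env):
    """True if CUDA_VISIBLE_DEVICES is set to a value that hides every GPU."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    # Unset means all GPUs are visible; an empty string or -1 hides them all.
    return value is not None and value.strip() in ("", "-1")


def environment_report(env=None):
    """Summarize the environment checks as a dict of booleans.

    The PATH lookups always use the current process environment;
    only the CUDA_VISIBLE_DEVICES check reads from `env`.
    """
    env = os.environ if env is None else env
    return {
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        "nvcc on PATH": shutil.which("nvcc") is not None,
        "GPUs hidden by CUDA_VISIBLE_DEVICES": cuda_hidden_by_env(env),
    }


if __name__ == "__main__":
    for check, value in environment_report().items():
        print(f"{check}: {value}")
```

If the tools are present and no variable hides the GPUs, the remaining suspect is the first cause: a llama.cpp build without CUDA support.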
Code: broken vs fixed
# Broken: the installed llama.cpp build has no usable CUDA support
from llamacpp import Llama
model = Llama("model.bin", use_cuda=True)  # triggers RuntimeError: CUDA build not found, CPU fallback
print("Model loaded")
# Fixed: assumes a CUDA-enabled llama.cpp build and working GPU drivers are installed
import os
# Expose GPU 0 to CUDA; set this before importing the library so it is in place when CUDA initializes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llamacpp import Llama
model = Llama("model.bin", use_cuda=True)  # CUDA build found and used
print("Model loaded with CUDA support")
Workaround
Run llama.cpp without use_cuda=True to explicitly use CPU mode until CUDA support is fixed, accepting slower performance.
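The workaround can also be applied automatically by catching the error and retrying without the flag. The sketch below is an illustration only: load_with_cpu_fallback is a hypothetical helper, and FakeLlama is a stand-in that mimics the Llama class from the examples above so the snippet runs without the library installed.

```python
def load_with_cpu_fallback(model_cls, model_path):
    """Try to load on the GPU; on the CUDA-build error, retry on the CPU.

    Returns a (model, device) tuple where device is "cuda" or "cpu".
    Any other RuntimeError is re-raised unchanged.
    """
    try:
        return model_cls(model_path, use_cuda=True), "cuda"
    except RuntimeError as exc:
        if "CUDA build not found" not in str(exc):
            raise
        # Omitting use_cuda selects CPU mode, as described in the workaround.
        return model_cls(model_path), "cpu"


class FakeLlama:
    """Stand-in for Llama that always fails the CUDA path (demo only)."""

    def __init__(self, model_path, use_cuda=False):
        if use_cuda:
            raise RuntimeError("CUDA build not found, CPU fallback")
        self.model_path = model_path


model, device = load_with_cpu_fallback(FakeLlama, "model.bin")
print(device)  # prints "cpu"
```

Returning the device alongside the model lets the caller log a warning when the slow path is taken, so the fallback does not go unnoticed in production.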
Prevention
Set up continuous integration that verifies llama.cpp is built with CUDA enabled, and test GPU availability before each deployment so that a CPU fallback is caught before it reaches production.
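The GPU-availability half of that check could look like the sketch below. It assumes the NVIDIA driver's nvidia-smi tool, whose -L flag prints one "GPU <index>: ..." line per device; the function names are hypothetical. It does not verify that llama.cpp itself was compiled with CUDA, which still needs a separate step such as loading a small model with use_cuda=True in the pipeline.

```python
import re
import shutil
import subprocess


def count_gpus(listing):
    """Count devices in `nvidia-smi -L` output (one 'GPU <n>:' line each)."""
    return len(re.findall(r"^GPU \d+:", listing, flags=re.MULTILINE))


def gpu_gate():
    """Return 0 if at least one GPU is listed, otherwise 1.

    The return value is meant to be used as a CI step's exit code.
    """
    if shutil.which("nvidia-smi") is None:
        return 1
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 1
    return 0 if count_gpus(result.stdout) > 0 else 1
```

In a pipeline, a step would call sys.exit(gpu_gate()) so that a runner without a visible GPU fails the job instead of silently deploying a CPU-only service.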