How to use GGUF models with llama.cpp
Quick answer
To use GGUF models with llama.cpp, first convert your original model to the GGUF format using the conversion script included in the llama.cpp repository (convert_hf_to_gguf.py in current releases). Then run inference by pointing llama.cpp's command-line interface or API at the resulting .gguf file. This enables efficient quantized model loading and fast local inference.
Prerequisites
- C++ compiler (gcc or clang)
- Git
- Python 3.8+ (for the conversion scripts)
- llama.cpp repository cloned from GitHub
- Original LLaMA or compatible model files
Setup
Clone the llama.cpp repository and build the project. This provides the tools needed to convert and run GGUF models.
- Install a C++ compiler (gcc or clang).
- Clone the repo:
git clone https://github.com/ggerganov/llama.cpp.git
- Build the project:
cd llama.cpp && make
Recent releases build with CMake instead of make:
cmake -B build && cmake --build build --config Release
On success the build produces the inference binaries (llama-cli, or main in older builds).
Step by step
Convert your original model to GGUF format using the conversion script, then run inference with llama.cpp.
# Convert the original model to GGUF format
python convert_hf_to_gguf.py path/to/original-model --outfile gguf-model.gguf
# Run inference with the GGUF model
./llama-cli -m gguf-model.gguf -p "Hello, world!" -n 128
llama.cpp prints the model metadata while loading, then streams the generated continuation of the prompt to the terminal.
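The same inference can be driven programmatically through the community llama-cpp-python bindings (installed with pip install llama-cpp-python). A minimal sketch, assuming those bindings and a converted model file; the model path below is a placeholder:

```python
def generate(model_path, prompt, max_tokens=128, n_threads=8):
    """Generate text from a GGUF model via the llama-cpp-python bindings.

    The import is deferred so this sketch can be defined even when the
    third-party package is not installed.
    """
    from llama_cpp import Llama  # community Python bindings around llama.cpp

    # Loading the model maps the GGUF file into memory; n_threads controls
    # how many CPU threads are used during inference.
    llm = Llama(model_path=model_path, n_threads=n_threads)

    # Calling the model returns an OpenAI-style completion dictionary.
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]


# Usage (requires a real converted model on disk):
# print(generate("gguf-model.gguf", "Hello, world!"))
```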
Common variations
Quantization levels (e.g., 4-bit, 8-bit) are applied with the separate llama-quantize tool after conversion. llama.cpp streams output token by token by default and supports multi-threading for faster inference.
- Pick a quantization type such as Q4_K_M (4-bit) or Q8_0 (8-bit) when running llama-quantize.
- Set the number of CPU threads with -t.
# Quantize the converted model to 4-bit
./llama-quantize gguf-model.gguf gguf-q4.gguf Q4_K_M
# Run with 8 threads; output streams as it is generated
./llama-cli -m gguf-q4.gguf -p "Explain AI" -n 256 -t 8
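To see what a quantization level buys you, a rough file-size estimate is parameters × bits-per-weight ÷ 8. A minimal sketch; the bits-per-weight figures below are approximate community estimates, not exact values from llama.cpp:

```python
# Approximate effective bits per weight for common llama.cpp quantization
# types (rough estimates; real GGUF files also carry metadata overhead).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
}


def estimated_size_gb(n_params, quant_type):
    """Estimate GGUF file size in GB for a model with n_params weights."""
    bpw = BITS_PER_WEIGHT[quant_type]
    return n_params * bpw / 8 / 1e9
```

For a 7B-parameter model this gives 14 GB at F16 versus roughly 4.2 GB at Q4_K_M, which is why lower-bit quantization is the usual choice for local inference.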
Troubleshooting
If you encounter errors loading the GGUF model, verify the conversion completed successfully and the model file path is correct. For performance issues, increase thread count or use lower-bit quantization. Ensure your CPU supports AVX2 or higher for best speed.
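One quick way to verify a conversion produced a structurally valid file is to check the GGUF header: per the GGUF specification, the file starts with the magic bytes GGUF, followed by a little-endian uint32 version and uint64 tensor and metadata counts. A minimal sketch:

```python
import struct


def check_gguf_header(path):
    """Return (version, tensor_count) if the file has a valid GGUF header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Header layout per the GGUF spec: uint32 version, uint64 tensor
        # count, uint64 metadata key/value count, all little-endian.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        return version, tensor_count
```

A truncated or failed conversion typically fails this check immediately, which narrows the problem down before you re-run llama.cpp.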
Key Takeaways
- Use the convert_hf_to_gguf.py script in llama.cpp to convert models to GGUF format.
- Run inference by passing the GGUF file to the llama.cpp CLI for efficient quantized execution.
- Adjust the quantization type and thread count to balance speed, size, and accuracy.
- Ensure your CPU supports AVX2+ instructions for optimal performance.
- Check file paths and conversion logs if model loading fails.