How to use GGUF models with llama.cpp
Quick answer
To use GGUF models with llama.cpp, first convert your original model to the GGUF format using the conversion script included in the llama.cpp repository (convert_hf_to_gguf.py in current releases). Then run inference by pointing llama.cpp's command-line interface or API at the resulting .gguf file. This enables efficient quantized model loading and fast local inference.
Prerequisites
- C++ compiler (gcc or clang)
- Git
- Python 3.8+ (for the conversion scripts)
- llama.cpp repository cloned from GitHub
- Original LLaMA or compatible model files
Setup
Clone the llama.cpp repository and build the project. This provides the tools needed to convert and run GGUF models.
- Install a C++ compiler (gcc or clang).
- Clone the repo:
git clone https://github.com/ggerganov/llama.cpp.git
- Build the project:
cd llama.cpp && make
Recent releases build with CMake instead of make:
cmake -B build && cmake --build build --config Release
On success the build produces the inference binaries (llama-cli, or main in older builds).
Step by step
Convert your original model to GGUF format using the conversion script, then run inference with llama.cpp.
# Convert the original model to GGUF format
python convert_hf_to_gguf.py path/to/original-model --outfile gguf-model.gguf
# Run inference with the GGUF model
./llama-cli -m gguf-model.gguf -p "Hello, world!" -n 128
llama.cpp prints the model metadata while loading, then streams the generated continuation of the prompt to the terminal.
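The same inference can be driven programmatically through the community llama-cpp-python bindings (installed with pip install llama-cpp-python). A minimal sketch, assuming those bindings and a converted model file; the model path below is a placeholder:

```python
def generate(model_path, prompt, max_tokens=128, n_threads=8):
    """Generate text from a GGUF model via the llama-cpp-python bindings.

    The import is deferred so this sketch can be defined even when the
    third-party package is not installed.
    """
    from llama_cpp import Llama  # community Python bindings around llama.cpp

    # Loading the model maps the GGUF file into memory; n_threads controls
    # how many CPU threads are used during inference.
    llm = Llama(model_path=model_path, n_threads=n_threads)

    # Calling the model returns an OpenAI-style completion dictionary.
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]


# Usage (requires a real converted model on disk):
# print(generate("gguf-model.gguf", "Hello, world!"))
```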
Common variations
Quantization levels (e.g., 4-bit, 8-bit) are applied with the separate llama-quantize tool after conversion. llama.cpp streams output token by token by default and supports multi-threading for faster inference.
- Pick a quantization type such as Q4_K_M (4-bit) or Q8_0 (8-bit) when running llama-quantize.
- Set the number of CPU threads with -t.
# Quantize the converted model to 4-bit
./llama-quantize gguf-model.gguf gguf-q4.gguf Q4_K_M
# Run with 8 threads; output streams as it is generated
./llama-cli -m gguf-q4.gguf -p "Explain AI" -n 256 -t 8
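To see what a quantization level buys you, a rough file-size estimate is parameters × bits-per-weight ÷ 8. A minimal sketch; the bits-per-weight figures below are approximate community estimates, not exact values from llama.cpp:

```python
# Approximate effective bits per weight for common llama.cpp quantization
# types (rough estimates; real GGUF files also carry metadata overhead).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
}


def estimated_size_gb(n_params, quant_type):
    """Estimate GGUF file size in GB for a model with n_params weights."""
    bpw = BITS_PER_WEIGHT[quant_type]
    return n_params * bpw / 8 / 1e9
```

For a 7B-parameter model this gives 14 GB at F16 versus roughly 4.2 GB at Q4_K_M, which is why lower-bit quantization is the usual choice for local inference.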
Troubleshooting
If you encounter errors loading the GGUF model, verify the conversion completed successfully and the model file path is correct. For performance issues, increase thread count or use lower-bit quantization. Ensure your CPU supports AVX2 or higher for best speed.
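One quick way to verify a conversion produced a structurally valid file is to check the GGUF header: per the GGUF specification, the file starts with the magic bytes GGUF, followed by a little-endian uint32 version and uint64 tensor and metadata counts. A minimal sketch:

```python
import struct


def check_gguf_header(path):
    """Return (version, tensor_count) if the file has a valid GGUF header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Header layout per the GGUF spec: uint32 version, uint64 tensor
        # count, uint64 metadata key/value count, all little-endian.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        return version, tensor_count
```

A truncated or failed conversion typically fails this check immediately, which narrows the problem down before you re-run llama.cpp.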
Key Takeaways
- Use the convert_hf_to_gguf.py script in llama.cpp to convert models to GGUF format.
- Run inference by passing the GGUF file to the llama.cpp CLI for efficient quantized execution.
- Adjust the quantization type and thread count to balance speed, size, and accuracy.
- Ensure your CPU supports AVX2+ instructions for optimal performance.
- Check file paths and conversion logs if model loading fails.