How to install llama.cpp
Quick answer
To install llama.cpp, clone the official GitHub repository and build the project using make. For Python integration, install the llama-cpp-python package via pip install llama-cpp-python to run local LLM inference with GGUF models.
Prerequisites
- Python 3.8+
- CMake and a C compiler (gcc or clang)
- Git
- pip (to install llama-cpp-python)
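Before building, it can help to confirm the tools above are actually on your PATH. A stdlib-only sketch (the tool names are the usual ones; gcc and clang are alternatives, so only one compiler needs to be present):

```python
import shutil

def check_build_tools(tools=("git", "cmake", "make", "python3")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

# gcc and clang are alternatives: only one C compiler needs to be present.
have_compiler = any(shutil.which(cc) for cc in ("gcc", "clang", "cc"))
missing = [t for t, path in check_build_tools().items() if path is None]

if missing or not have_compiler:
    print("Missing:", ", ".join(missing + ([] if have_compiler else ["a C compiler"])))
```

If anything is reported missing, install it with your system package manager before running make.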
Setup
First, clone the llama.cpp repository and build the native library. Then install the Python bindings for easy integration.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
pip install llama-cpp-python

output
Cloning into 'llama.cpp'...
remote: Enumerating objects: 1234, done.
remote: Counting objects: 100% (1234/1234), done.
remote: Compressing objects: 100% (800/800), done.
remote: Total 1234 (delta 400), reused 1000 (delta 300), pack-reused 0
Receiving objects: 100% (1234/1234), 5.67 MiB | 2.00 MiB/s, done.
[100%] Built target llama
Requirement already satisfied: llama-cpp-python in /usr/local/lib/python3.10/site-packages (1.0.0)
Step by step
Use the Python package to load a GGUF model and generate text locally. Replace the model_path value with the path to your downloaded GGUF model file.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)
output = llm("Hello, llama.cpp!", max_tokens=50)
print(output["choices"][0]["text"])

output
Hello, llama.cpp! This is a local inference example using the llama-cpp-python bindings.
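The llm(...) call returns an OpenAI-style completion dict, not a bare string. A sketch of its shape (the key names match what llama-cpp-python returns; the values here are illustrative):

```python
# Illustrative completion dict; values are made up, keys follow the
# OpenAI-style text_completion format that llama-cpp-python returns.
sample_output = {
    "id": "cmpl-xxxx",
    "object": "text_completion",
    "choices": [
        {"text": " This is a local inference example.",
         "index": 0, "logprobs": None, "finish_reason": "length"},
    ],
    "usage": {"prompt_tokens": 7, "completion_tokens": 50, "total_tokens": 57},
}

def completion_text(output):
    """Extract the generated text from a completion dict."""
    return output["choices"][0]["text"]

print(completion_text(sample_output))
```

The usage field is handy for tracking how much of your n_ctx context window a prompt consumes.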
Common variations
- Use llm.create_chat_completion() for chat-style interactions.
- Adjust n_ctx for the context window size.
- Run asynchronously with Python asyncio by wrapping calls in async def functions.
output = llm.create_chat_completion(messages=[{"role": "user", "content": "Tell me a joke."}])
print(output["choices"][0]["message"]["content"])

output
Why did the AI go to school? To improve its neural network!
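For the asyncio variation: inference calls are blocking, so run them in a worker thread with asyncio.to_thread (Python 3.9+) rather than calling them directly inside a coroutine. A self-contained sketch, with fake_llm standing in for a real Llama instance:

```python
import asyncio

# Stand-in for a real Llama instance so the sketch runs without a model;
# it mimics the {"choices": [{"text": ...}]} shape of a completion.
def fake_llm(prompt, max_tokens=50):
    return {"choices": [{"text": f"echo: {prompt}"}]}

async def generate(prompt):
    # Offload the blocking call so the event loop stays responsive.
    output = await asyncio.to_thread(fake_llm, prompt, max_tokens=50)
    return output["choices"][0]["text"]

result = asyncio.run(generate("Hello, llama.cpp!"))
print(result)  # echo: Hello, llama.cpp!
```

Swapping fake_llm for your Llama instance keeps the same pattern; the thread offload is what prevents long generations from stalling other coroutines.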
Troubleshooting
- If make fails, ensure you have a C compiler and CMake installed.
- For Python import errors, verify llama-cpp-python is installed in your active environment.
- Model loading errors usually mean the GGUF model path is incorrect or the model file is corrupted.
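For the model-loading case, a cheap sanity check is to look at the file's magic bytes: per the GGUF specification, every GGUF file begins with the 4 bytes b"GGUF". This won't catch all corruption, but it quickly distinguishes a wrong or truncated file (e.g. an old .bin model) from a real GGUF one:

```python
def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"GGUF"
    except OSError:
        # Missing file or bad path: the most common loading error.
        return False
```

Run it on your model_path before constructing Llama; a False result means fixing the path or re-downloading the model.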
Key Takeaways
- Clone and build the official llama.cpp repo for native performance.
- Use llama-cpp-python for easy Python integration with GGUF models.
- Adjust context size and use chat completions for conversational AI.
- Ensure dependencies like CMake and a C compiler are installed before building.
- Verify model file paths and environment setup to avoid common errors.