How to install llama.cpp
Quick answer
To install llama.cpp, clone the official GitHub repository and build the project using make. For Python integration, install the llama-cpp-python package via pip install llama-cpp-python to run local LLM inference with GGUF models.
Prerequisites
- Python 3.8+
- CMake and a C compiler (gcc or clang)
- Git
- pip (to install llama-cpp-python)
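Before building, it can help to confirm the tools above are actually on your PATH. A stdlib-only sketch (the tool names are the usual ones; gcc and clang are alternatives, so only one compiler needs to be present):

```python
import shutil

def check_build_tools(tools=("git", "cmake", "make", "python3")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

# gcc and clang are alternatives: only one C compiler needs to be present.
have_compiler = any(shutil.which(cc) for cc in ("gcc", "clang", "cc"))
missing = [t for t, path in check_build_tools().items() if path is None]

if missing or not have_compiler:
    print("Missing:", ", ".join(missing + ([] if have_compiler else ["a C compiler"])))
```

If anything is reported missing, install it with your system package manager before running make.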
Setup
First, clone the llama.cpp repository and build the native library. Then install the Python bindings for easy integration.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
pip install llama-cpp-python

output
Cloning into 'llama.cpp'...
remote: Enumerating objects: 1234, done.
remote: Counting objects: 100% (1234/1234), done.
remote: Compressing objects: 100% (800/800), done.
remote: Total 1234 (delta 400), reused 1000 (delta 300), pack-reused 0
Receiving objects: 100% (1234/1234), 5.67 MiB | 2.00 MiB/s, done.
[100%] Built target llama
Requirement already satisfied: llama-cpp-python in /usr/local/lib/python3.10/site-packages (1.0.0)
Step by step
Use the Python package to load a GGUF model and generate text locally. Replace the model_path value with the path to your downloaded GGUF model file.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)
output = llm("Hello, llama.cpp!", max_tokens=50)
print(output["choices"][0]["text"])

output
Hello, llama.cpp! This is a local inference example using the llama-cpp-python bindings.
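The llm(...) call returns an OpenAI-style completion dict, not a bare string. A sketch of its shape (the key names match what llama-cpp-python returns; the values here are illustrative):

```python
# Illustrative completion dict; values are made up, keys follow the
# OpenAI-style text_completion format that llama-cpp-python returns.
sample_output = {
    "id": "cmpl-xxxx",
    "object": "text_completion",
    "choices": [
        {"text": " This is a local inference example.",
         "index": 0, "logprobs": None, "finish_reason": "length"},
    ],
    "usage": {"prompt_tokens": 7, "completion_tokens": 50, "total_tokens": 57},
}

def completion_text(output):
    """Extract the generated text from a completion dict."""
    return output["choices"][0]["text"]

print(completion_text(sample_output))
```

The usage field is handy for tracking how much of your n_ctx context window a prompt consumes.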
Common variations
- Use llm.create_chat_completion() for chat-style interactions.
- Adjust n_ctx for the context window size.
- Run asynchronously with Python asyncio by wrapping calls in async def functions.
output = llm.create_chat_completion(messages=[{"role": "user", "content": "Tell me a joke."}])
print(output["choices"][0]["message"]["content"])

output
Why did the AI go to school? To improve its neural network!
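For the asyncio variation: inference calls are blocking, so run them in a worker thread with asyncio.to_thread (Python 3.9+) rather than calling them directly inside a coroutine. A self-contained sketch, with fake_llm standing in for a real Llama instance:

```python
import asyncio

# Stand-in for a real Llama instance so the sketch runs without a model;
# it mimics the {"choices": [{"text": ...}]} shape of a completion.
def fake_llm(prompt, max_tokens=50):
    return {"choices": [{"text": f"echo: {prompt}"}]}

async def generate(prompt):
    # Offload the blocking call so the event loop stays responsive.
    output = await asyncio.to_thread(fake_llm, prompt, max_tokens=50)
    return output["choices"][0]["text"]

result = asyncio.run(generate("Hello, llama.cpp!"))
print(result)  # echo: Hello, llama.cpp!
```

Swapping fake_llm for your Llama instance keeps the same pattern; the thread offload is what prevents long generations from stalling other coroutines.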
Troubleshooting
- If make fails, ensure you have a C compiler and CMake installed.
- For Python import errors, verify llama-cpp-python is installed in your active environment.
- Model loading errors usually mean the GGUF model path is incorrect or the model file is corrupted.
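For the model-loading case, a cheap sanity check is to look at the file's magic bytes: per the GGUF specification, every GGUF file begins with the 4 bytes b"GGUF". This won't catch all corruption, but it quickly distinguishes a wrong or truncated file (e.g. an old .bin model) from a real GGUF one:

```python
def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"GGUF"
    except OSError:
        # Missing file or bad path: the most common loading error.
        return False
```

Run it on your model_path before constructing Llama; a False result means fixing the path or re-downloading the model.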
Key Takeaways
- Clone and build the official llama.cpp repo for native performance.
- Use llama-cpp-python for easy Python integration with GGUF models.
- Adjust context size and use chat completions for conversational AI.
- Ensure dependencies like CMake and a C compiler are installed before building.
- Verify model file paths and environment setup to avoid common errors.