How-to · Beginner · 3 min read

How to run Llama with llama.cpp

Quick answer
Use the llama-cpp-python package to run Llama models locally: load a GGUF model file, then call create_chat_completion for chat-style prompts or call the Llama instance directly (__call__) for plain text completion. This lets you run Llama models efficiently on your own machine, with no cloud API required.

Prerequisites

  • Python 3.8+
  • llama-cpp-python package (pip install llama-cpp-python)
  • Llama GGUF model file downloaded locally

Setup

Install the llama-cpp-python package with pip and download a Llama model in GGUF format, for example from Hugging Face. Make sure Python 3.8 or newer is installed.

bash
pip install llama-cpp-python
output
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-win_amd64.whl (1.2 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
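If you prefer to fetch the model programmatically, here is a minimal sketch using the huggingface_hub package (install it with pip install huggingface_hub first). The repo id and filename below are illustrative examples; substitute the actual GGUF repository and file you want.

python
from huggingface_hub import hf_hub_download

# Download a quantized GGUF file into ./models
# (repo_id and filename are example values; replace with a real GGUF repo)
model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # local path you can pass to Llama(model_path=...)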

Step by step

Load the Llama model from a local GGUF file and run a chat completion with a user prompt. The example shows synchronous usage with the Llama class.

python
from llama_cpp import Llama

# Path to your downloaded Llama GGUF model file
model_path = "./models/llama-3.1-8b.Q4_K_M.gguf"

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048)

# Create a chat completion
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

response = llm.create_chat_completion(messages=messages, max_tokens=50)
print("Response:", response["choices"][0]["message"]["content"])
output
Response: I'm doing well, thank you! How can I assist you today?
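To print tokens as they are generated instead of waiting for the full reply, pass stream=True to create_chat_completion. This sketch assumes the same model file as above; streamed chunks follow the OpenAI delta format.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)

# stream=True yields incremental chunks instead of one final response
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()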

Common variations

You can also run plain text completions without the chat format by calling the model instance directly. For async or multi-client usage, run the llama.cpp server and query it through its OpenAI-compatible API.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")

# Simple text completion
output = llm("Explain the benefits of llama.cpp", max_tokens=50)
print("Text completion:", output["choices"][0]["text"])

# For async or server usage, start server:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

# Then query via OpenAI SDK with base_url="http://localhost:8080/v1"
output
Text completion: llama.cpp enables efficient local inference of Llama models by using optimized C++ code and quantized weights.
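Once the server is running, you can query it with the official OpenAI Python SDK by pointing base_url at the local endpoint. A sketch, assuming the server from the command above; the api_key value is arbitrary since the local server does not check it by default, and the model name may be ignored or mapped by the server.

python
from openai import OpenAI

# Point the SDK at the local llama.cpp server instead of the OpenAI cloud
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.1-8b",  # the local server ignores or maps this name
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)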

Troubleshooting

  • If you get FileNotFoundError, verify the model path is correct and the GGUF file is downloaded.
  • For CUDA GPU acceleration, ensure your system has a supported GPU with the CUDA toolkit installed, and install a build of llama-cpp-python compiled with CUDA support.
  • If you see ValueError: context length exceeded, increase n_ctx parameter or shorten input.
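For the CUDA case, one way to rebuild the package with GPU support is to set CMAKE_ARGS at install time. This assumes the CUDA toolkit is already installed; note that older llama-cpp-python releases used -DLLAMA_CUBLAS=on instead of -DGGML_CUDA=on.

bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

After reinstalling, pass n_gpu_layers=-1 to Llama(...) to offload all model layers to the GPU.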

Key Takeaways

  • Use llama-cpp-python to run Llama models locally with Python.
  • Download GGUF format Llama models for compatibility with llama.cpp.
  • Run chat completions with create_chat_completion or text completions by calling the Llama instance.
  • For async or multi-user setups, run the llama.cpp server and query via OpenAI-compatible API.
  • Check model path and context length to avoid common errors.
Verified 2026-04 · llama-3.1-8b, llama-3.1-8b.Q4_K_M.gguf