How-to · Beginner · 3 min read

How to run Llama with llama.cpp

Quick answer
Use the llama-cpp-python package to run Llama models locally: load a GGUF model file, then call create_chat_completion for chat-style prompts or call the Llama instance directly (__call__) for plain text completion. This lets you run Llama models efficiently on your own machine, with no cloud API required.

Prerequisites

  • Python 3.8+
  • llama-cpp-python package (pip install llama-cpp-python)
  • Llama GGUF model file downloaded locally

Setup

Install the llama-cpp-python package with pip and download a Llama model in GGUF format, for example from Hugging Face. Make sure Python 3.8 or newer is installed.

bash
pip install llama-cpp-python
output
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-win_amd64.whl (1.2 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
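If you prefer to fetch the model programmatically, here is a minimal sketch using the huggingface_hub package (install it with pip install huggingface_hub first). The repo id and filename below are illustrative examples; substitute the actual GGUF repository and file you want.

python
from huggingface_hub import hf_hub_download

# Download a quantized GGUF file into ./models
# (repo_id and filename are example values; replace with a real GGUF repo)
model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # local path you can pass to Llama(model_path=...)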

Step by step

Load the Llama model from a local GGUF file and run a chat completion with a user prompt. The example shows synchronous usage with the Llama class.

python
from llama_cpp import Llama

# Path to your downloaded Llama GGUF model file
model_path = "./models/llama-3.1-8b.Q4_K_M.gguf"

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048)

# Create a chat completion
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

response = llm.create_chat_completion(messages=messages, max_tokens=50)
print("Response:", response["choices"][0]["message"]["content"])
output
Response: I'm doing well, thank you! How can I assist you today?
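To print tokens as they are generated instead of waiting for the full reply, pass stream=True to create_chat_completion. This sketch assumes the same model file as above; streamed chunks follow the OpenAI delta format.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)

# stream=True yields incremental chunks instead of one final response
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()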

Common variations

You can also run plain text completions without the chat format by calling the model instance directly. For async or multi-client usage, run the llama.cpp server and query it through its OpenAI-compatible API.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")

# Simple text completion
output = llm("Explain the benefits of llama.cpp", max_tokens=50)
print("Text completion:", output["choices"][0]["text"])

# For async or server usage, start server:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

# Then query via OpenAI SDK with base_url="http://localhost:8080/v1"
output
Text completion: llama.cpp enables efficient local inference of Llama models by using optimized C++ code and quantized weights.
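Once the server is running, you can query it with the official OpenAI Python SDK by pointing base_url at the local endpoint. A sketch, assuming the server from the command above; the api_key value is arbitrary since the local server does not check it by default, and the model name may be ignored or mapped by the server.

python
from openai import OpenAI

# Point the SDK at the local llama.cpp server instead of the OpenAI cloud
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.1-8b",  # the local server ignores or maps this name
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)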

Troubleshooting

  • If you get FileNotFoundError, verify the model path is correct and the GGUF file is downloaded.
  • For CUDA GPU acceleration, ensure your system has a supported GPU with the CUDA toolkit installed, and install a build of llama-cpp-python compiled with CUDA support.
  • If you see ValueError: context length exceeded, increase n_ctx parameter or shorten input.
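For the CUDA case, one way to rebuild the package with GPU support is to set CMAKE_ARGS at install time. This assumes the CUDA toolkit is already installed; note that older llama-cpp-python releases used -DLLAMA_CUBLAS=on instead of -DGGML_CUDA=on.

bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

After reinstalling, pass n_gpu_layers=-1 to Llama(...) to offload all model layers to the GPU.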

Key Takeaways

  • Use llama-cpp-python to run Llama models locally with Python.
  • Download GGUF format Llama models for compatibility with llama.cpp.
  • Run chat completions with create_chat_completion or text completions by calling the Llama instance.
  • For async or multi-user setups, run the llama.cpp server and query via OpenAI-compatible API.
  • Check model path and context length to avoid common errors.
Verified 2026-04 · llama-3.1-8b, llama-3.1-8b.Q4_K_M.gguf