Code beginner · 3 min read

How to use llama-cpp-python

Direct answer
Use the llama_cpp Python package (installed as llama-cpp-python) to load GGUF models locally and generate text, either by calling a Llama instance directly with a prompt string or by using create_chat_completion for chat-style messages.

Setup

Install
bash
pip install llama-cpp-python
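By default, pip installs a CPU-only build. To enable GPU offload (used by <code>n_gpu_layers</code> below), the package must be compiled with backend-specific CMake flags; for example, for CUDA (flag per the llama-cpp-python install docs, requires the CUDA toolkit):

```bash
# Reinstall from source with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```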
Imports
python
from llama_cpp import Llama
import os

Examples

In: Prompt: "Hello, how are you?"
Out: Hello! I'm doing well, thank you. How can I assist you today?
In: Chat messages: [{"role": "user", "content": "Explain RAG."}]
Out: RAG stands for Retrieval-Augmented Generation, a technique that combines retrieval of documents with generation to improve accuracy.
In: Prompt: "" (empty prompt)
Out: Error or empty response, depending on model behavior.

Integration steps

  1. Install the llama-cpp-python package via pip.
  2. Download or prepare a GGUF format Llama model file locally.
  3. Import the Llama class from llama_cpp and instantiate it with the model path.
  4. Call the Llama instance with a prompt string or use create_chat_completion with chat messages.
  5. Extract the generated text from the response dictionary's choices field.
  6. Handle exceptions or empty outputs gracefully.
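Steps 5 and 6 can be combined into a small helper that extracts the generated text defensively, so an empty or malformed response raises a clear error instead of an IndexError (the helper name and structure are illustrative, not part of the library):

```python
def extract_text(response):
    """Pull the generated text out of a completion response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("Model returned no choices")
    text = choices[0].get("text", "")
    if not text.strip():
        raise ValueError("Model returned an empty completion")
    return text

# Works on the dict shape returned by llm(prompt, ...)
print(extract_text({"choices": [{"text": "Hello there!"}]}))
```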

Full code

python
from llama_cpp import Llama
import os

# Path to your local GGUF model file
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Simple prompt completion
prompt = "Hello, how are you?"
response = llm(prompt, max_tokens=128)
print("Completion:", response["choices"][0]["text"])

# Chat completion example
messages = [
    {"role": "user", "content": "Explain Retrieval-Augmented Generation (RAG)."}
]
chat_response = llm.create_chat_completion(messages=messages, max_tokens=150)
print("Chat Completion:", chat_response["choices"][0]["message"]["content"])
output
Completion: Hello! I'm doing well, thank you. How can I assist you today?
Chat Completion: Retrieval-Augmented Generation (RAG) is a technique that combines document retrieval with language model generation to improve accuracy and relevance.
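To hold a multi-turn conversation, append each assistant reply back onto the messages list before the next create_chat_completion call. A minimal sketch of that bookkeeping, with a stand-in reply function in place of a loaded model:

```python
def chat_turn(messages, user_text, generate_reply):
    """Append the user message, get a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_reply(messages)  # e.g. wraps llm.create_chat_completion(...)
    messages.append({"role": "assistant", "content": reply})
    return reply

history = []
echo = lambda msgs: f"You said: {msgs[-1]['content']}"  # stand-in for the model
chat_turn(history, "Hello", echo)
chat_turn(history, "Explain RAG.", echo)
# history now alternates user/assistant entries across both turns
```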

API trace

Request
json
{"model_path": "path/to/model.gguf", "prompt": "Hello, how are you?", "max_tokens": 128}
Response
json
{"choices": [{"text": "Hello! I'm doing well, thank you..."}]}
Extract: response["choices"][0]["text"]

Variants

Streaming output

Use streaming to display tokens as they are generated for better user experience with long outputs.

python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Stream tokens as they are generated by passing stream=True,
# which makes the call return a generator of partial chunks
for chunk in llm("Tell me a joke.", max_tokens=50, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
Async usage with server

Use when integrating llama-cpp-python in async web servers or concurrent applications.

python
# llama-cpp-python's calls are blocking. For async applications, either run the
# bundled OpenAI-compatible server (python -m llama_cpp.server) and query it over
# HTTP, or offload calls to a worker thread.
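The thread-offload pattern can be sketched with asyncio.to_thread; here `generate` is a stand-in for a real blocking `llm(prompt, ...)` call:

```python
import asyncio

def generate(prompt):
    # Stand-in for a blocking llm(prompt, ...) call
    return f"echo: {prompt}"

async def agenerate(prompt):
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(generate, prompt)

async def main():
    # Two generations proceed concurrently from the event loop's point of view
    results = await asyncio.gather(agenerate("one"), agenerate("two"))
    print(results)  # → ['echo: one', 'echo: two']

asyncio.run(main())
```

Note that a single Llama instance is not thread-safe for concurrent calls; serialize access with a lock or a single worker thread.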
Alternative smaller model

Use smaller models for faster inference and lower resource usage when high accuracy is not critical.

python
from llama_cpp import Llama
import os

# Substitute the path to any smaller GGUF model you have locally
model_path = os.path.expanduser("~/models/llama-3.1-4b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=5)
response = llm("Summarize AI.", max_tokens=100)
print(response["choices"][0]["text"])

Performance

Latency: ~500ms to 2s per 100 tokens on a modern GPU-enabled desktop
Cost: Free for local inference; hardware cost only
Rate limits: No API rate limits; limited by local hardware resources
  • Limit <code>max_tokens</code> to reduce latency and memory usage.
  • Use <code>n_ctx</code> to control context window size for memory efficiency.
  • Offload layers to GPU with <code>n_gpu_layers</code> for faster generation.
Approach | Latency | Cost/call | Best for
Local llama-cpp-python (CPU) | ~1-3s per 100 tokens | Free (hardware only) | Offline, privacy-sensitive use
Local llama-cpp-python (GPU) | ~500ms-1s per 100 tokens | Free (hardware only) | Faster local inference with GPU
Cloud LLM APIs (OpenAI, Anthropic) | ~200-800ms per 100 tokens | Paid per token | High accuracy, no hardware setup

Quick tip

Set <code>n_gpu_layers</code> to a positive number to offload that many layers to the GPU, or to <code>-1</code> to offload all layers; this speeds up inference substantially when a supported GPU build is installed.

Common mistake

Pointing <code>model_path</code> at a missing file or at an incompatible format (e.g. an old GGML file instead of GGUF) causes load errors or silent failures; verify the path and format before instantiating <code>Llama</code>.
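A quick pre-flight check catches both problems before the model is loaded; valid GGUF files begin with the 4-byte magic <code>b"GGUF"</code> (the helper name is illustrative):

```python
import os

def check_model_path(path):
    """Validate a GGUF model path before handing it to Llama()."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    # GGUF files start with the magic bytes b"GGUF"
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"Not a valid GGUF file: {path}")
    return path
```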

Verified 2026-04 · llama-3.1-8b.Q4_K_M.gguf, llama-3.1-4b.Q4_K_M.gguf