Code beginner · 3 min read

How to use llama-cpp-python

Direct answer
Use the llama_cpp Python package (installed as llama-cpp-python) to load GGUF models locally and generate text, either by calling a Llama instance directly with a prompt string or by using create_chat_completion for chat-style messages.

Setup

Install
bash
pip install llama-cpp-python
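By default, pip installs a CPU-only build. To enable GPU offload (used by <code>n_gpu_layers</code> below), the package must be compiled with backend-specific CMake flags; for example, for CUDA (flag per the llama-cpp-python install docs, requires the CUDA toolkit):

```bash
# Reinstall from source with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```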
Imports
python
from llama_cpp import Llama
import os

Examples

In: Prompt: "Hello, how are you?"
Out: Hello! I'm doing well, thank you. How can I assist you today?
In: Chat messages: [{"role": "user", "content": "Explain RAG."}]
Out: RAG stands for Retrieval-Augmented Generation, a technique that combines retrieval of documents with generation to improve accuracy.
In: Prompt: "" (empty prompt)
Out: Error or empty response, depending on model behavior.

Integration steps

  1. Install the llama-cpp-python package via pip.
  2. Download or prepare a GGUF format Llama model file locally.
  3. Import the Llama class from llama_cpp and instantiate it with the model path.
  4. Call the Llama instance with a prompt string or use create_chat_completion with chat messages.
  5. Extract the generated text from the response dictionary's choices field.
  6. Handle exceptions or empty outputs gracefully.
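Steps 5 and 6 can be combined into a small helper that extracts the generated text defensively, so an empty or malformed response raises a clear error instead of an IndexError (the helper name and structure are illustrative, not part of the library):

```python
def extract_text(response):
    """Pull the generated text out of a completion response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("Model returned no choices")
    text = choices[0].get("text", "")
    if not text.strip():
        raise ValueError("Model returned an empty completion")
    return text

# Works on the dict shape returned by llm(prompt, ...)
print(extract_text({"choices": [{"text": "Hello there!"}]}))
```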

Full code

python
from llama_cpp import Llama
import os

# Path to your local GGUF model file
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Simple prompt completion
prompt = "Hello, how are you?"
response = llm(prompt, max_tokens=128)
print("Completion:", response["choices"][0]["text"])

# Chat completion example
messages = [
    {"role": "user", "content": "Explain Retrieval-Augmented Generation (RAG)."}
]
chat_response = llm.create_chat_completion(messages=messages, max_tokens=150)
print("Chat Completion:", chat_response["choices"][0]["message"]["content"])
output
Completion: Hello! I'm doing well, thank you. How can I assist you today?
Chat Completion: Retrieval-Augmented Generation (RAG) is a technique that combines document retrieval with language model generation to improve accuracy and relevance.
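To hold a multi-turn conversation, append each assistant reply back onto the messages list before the next create_chat_completion call. A minimal sketch of that bookkeeping, with a stand-in reply function in place of a loaded model:

```python
def chat_turn(messages, user_text, generate_reply):
    """Append the user message, get a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_reply(messages)  # e.g. wraps llm.create_chat_completion(...)
    messages.append({"role": "assistant", "content": reply})
    return reply

history = []
echo = lambda msgs: f"You said: {msgs[-1]['content']}"  # stand-in for the model
chat_turn(history, "Hello", echo)
chat_turn(history, "Explain RAG.", echo)
# history now alternates user/assistant entries across both turns
```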

API trace

Request
json
{"model_path": "path/to/model.gguf", "prompt": "Hello, how are you?", "max_tokens": 128}
Response
json
{"choices": [{"text": "Hello! I'm doing well, thank you..."}]}
Extract: response["choices"][0]["text"]

Variants

Streaming output

Use streaming to display tokens as they are generated for better user experience with long outputs.

python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Stream tokens as they are generated by passing stream=True,
# which makes the call return a generator of partial chunks
for chunk in llm("Tell me a joke.", max_tokens=50, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
Async usage with server

Use when integrating llama-cpp-python in async web servers or concurrent applications.

python
# llama-cpp-python's calls are blocking. For async applications, either run the
# bundled OpenAI-compatible server (python -m llama_cpp.server) and query it over
# HTTP, or offload calls to a worker thread.
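The thread-offload pattern can be sketched with asyncio.to_thread; here `generate` is a stand-in for a real blocking `llm(prompt, ...)` call:

```python
import asyncio

def generate(prompt):
    # Stand-in for a blocking llm(prompt, ...) call
    return f"echo: {prompt}"

async def agenerate(prompt):
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(generate, prompt)

async def main():
    # Two generations proceed concurrently from the event loop's point of view
    results = await asyncio.gather(agenerate("one"), agenerate("two"))
    print(results)  # → ['echo: one', 'echo: two']

asyncio.run(main())
```

Note that a single Llama instance is not thread-safe for concurrent calls; serialize access with a lock or a single worker thread.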
Alternative smaller model

Use smaller models for faster inference and lower resource usage when high accuracy is not critical.

python
from llama_cpp import Llama
import os

# Substitute the path to any smaller GGUF model you have locally
model_path = os.path.expanduser("~/models/llama-3.1-4b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=5)
response = llm("Summarize AI.", max_tokens=100)
print(response["choices"][0]["text"])

Performance

Latency: ~500ms to 2s per 100 tokens on a modern GPU-enabled desktop
Cost: Free for local inference; hardware cost only
Rate limits: No API rate limits; limited by local hardware resources
  • Limit <code>max_tokens</code> to reduce latency and memory usage.
  • Use <code>n_ctx</code> to control context window size for memory efficiency.
  • Offload layers to GPU with <code>n_gpu_layers</code> for faster generation.
Approach | Latency | Cost/call | Best for
Local llama-cpp-python (CPU) | ~1-3s per 100 tokens | Free (hardware only) | Offline, privacy-sensitive use
Local llama-cpp-python (GPU) | ~500ms-1s per 100 tokens | Free (hardware only) | Faster local inference with GPU
Cloud LLM APIs (OpenAI, Anthropic) | ~200-800ms per 100 tokens | Paid per token | High accuracy, no hardware setup

Quick tip

Set <code>n_gpu_layers</code> to a positive number to offload that many layers to the GPU, or to <code>-1</code> to offload all layers; this speeds up inference substantially when a supported GPU build is installed.

Common mistake

Pointing <code>model_path</code> at a missing file or at an incompatible format (e.g. an old GGML file instead of GGUF) causes load errors or silent failures; verify the path and format before instantiating <code>Llama</code>.
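A quick pre-flight check catches both problems before the model is loaded; valid GGUF files begin with the 4-byte magic <code>b"GGUF"</code> (the helper name is illustrative):

```python
import os

def check_model_path(path):
    """Validate a GGUF model path before handing it to Llama()."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    # GGUF files start with the magic bytes b"GGUF"
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"Not a valid GGUF file: {path}")
    return path
```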

Verified 2026-04 · llama-3.1-8b.Q4_K_M.gguf, llama-3.1-4b.Q4_K_M.gguf