How to call llama.cpp from Python
Direct answer
Use the llama_cpp Python package (llama-cpp-python) to load a GGUF llama.cpp model, then call llm() for plain completions or llm.create_chat_completion() for chat-style generation, directly from Python.
Setup
Install
pip install llama-cpp-python
Imports
from llama_cpp import Llama
Examples
in: Hello, how are you?
out: Hello! I'm doing well, thank you. How can I assist you today?
in: Explain quantum computing in simple terms.
out: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
in: (empty prompt)
out: Error: No prompt provided.
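The empty-prompt error in the third example is application-level behavior, not something llama-cpp-python raises itself; a minimal guard might look like this (the generate wrapper and its error message are illustrative):

```python
def generate(llm, prompt, max_tokens=128):
    """Call the model, rejecting empty prompts up front."""
    if not prompt or not prompt.strip():
        raise ValueError("Error: No prompt provided.")
    response = llm(prompt, max_tokens=max_tokens)
    return response["choices"][0]["text"]
```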
Integration steps
- Install the llama-cpp-python package via pip.
- Download or prepare a GGUF format llama.cpp model file.
- Import the Llama class from llama_cpp.
- Initialize the Llama client with the model path.
- Call create_chat_completion() with a list of chat messages, or llm() with a prompt string.
- Extract the generated text from the response and use it in your application.
Full code
from llama_cpp import Llama
# Initialize the Llama model with the GGUF model path
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
# Simple prompt completion
prompt = "Hello, how are you?"
response = llm(prompt, max_tokens=128)
print("Completion:", response["choices"][0]["text"])
# Chat completion example
chat_response = llm.create_chat_completion(messages=[
{"role": "user", "content": "Explain quantum computing in simple terms."}
], max_tokens=128)
print("Chat Completion:", chat_response["choices"][0]["message"]["content"])
Output
Completion: Hello! I'm doing well, thank you. How can I assist you today?
Chat Completion: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
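The two response shapes above store the generated text in different places (plain completions under "text", chat completions under "message"); a small helper can normalize extraction (the extract_text name is made up for illustration):

```python
def extract_text(response):
    """Pull generated text from either a plain completion or a chat response."""
    choice = response["choices"][0]
    # Chat responses nest the text under "message"; plain completions use "text".
    if "message" in choice:
        return choice["message"]["content"]
    return choice["text"]
```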
API trace
Request
{"model_path": "./models/llama-3.1-8b.Q4_K_M.gguf", "prompt": "Hello, how are you?", "max_tokens": 128}
Response
{"choices": [{"text": "Hello! I'm doing well, thank you. How can I assist you today?"}], "usage": {"total_tokens": 45}}
Extract
response["choices"][0]["text"]
Variants
Streaming output ›
Use streaming to display tokens as they are generated for better user experience in interactive apps.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
for output in llm("Tell me a joke", max_tokens=50, stream=True):
print(output["choices"][0]["text"], end="", flush=True)
print()
Async call with llama-cpp-python (if supported) ›
Use concurrency patterns in Python to handle multiple llama.cpp calls in parallel.
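Since llama-cpp-python has no native async API, one option is to push calls onto a thread pool; a sketch of that pattern, with a stub standing in for the real Llama call (a single Llama instance is not safe for truly parallel inference, so calls are serialized behind a lock):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()

def ask(llm, prompt, max_tokens=64):
    # Serialize access: one Llama instance should not run two inferences at once.
    with _lock:
        return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]

def ask_many(llm, prompts):
    # The pool overlaps Python-side work; inference itself stays serialized.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: ask(llm, p), prompts))
```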
# Currently llama-cpp-python does not support async calls natively; use threading or multiprocessing for concurrency.
Using create_chat_completion for chat-style prompts ›
Use chat completion method when working with chat-based conversational prompts.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
chat_response = llm.create_chat_completion(messages=[
{"role": "user", "content": "Summarize the latest AI trends."}
], max_tokens=100)
print(chat_response["choices"][0]["message"]["content"])
Performance
Latency: ~500 ms to 2 s per request, depending on model size and hardware
Cost: free for local inference; hardware costs apply
Rate limits: none; throughput is bounded only by local hardware
- Limit max_tokens to reduce latency and memory usage.
- Use smaller models for faster responses on CPU.
- Cache frequent prompts and completions locally.
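The caching tip above can be sketched with functools.lru_cache over a thin wrapper (the make_cached_generate helper is illustrative; prompts must be hashable, so plain strings work):

```python
from functools import lru_cache

def make_cached_generate(llm, max_tokens=128):
    """Return a memoized generate() so repeated prompts skip inference."""
    @lru_cache(maxsize=256)
    def generate(prompt: str) -> str:
        return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
    return generate
```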
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Local llama-cpp-python | ~0.5-2s | Free (local hardware) | Offline, privacy-sensitive use |
| OpenAI GPT-4o API | ~0.8s | Paid API | High-quality cloud inference |
| Streaming llama-cpp-python | Token-by-token ~0.1s delay | Free (local hardware) | Interactive applications |
Quick tip
Use GGUF-format models with llama-cpp-python; current versions load only GGUF, so convert legacy GGML or raw Hugging Face checkpoints first.
Common mistake
Pointing Llama at a missing model file, or at a model in an incompatible (non-GGUF) format; both fail at load time.
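Both pitfalls can be caught early with a pre-flight check before constructing Llama (a minimal sketch; the check_model_path helper is made up for illustration):

```python
import os

def check_model_path(path):
    """Fail fast on a missing or non-GGUF model file before handing it to Llama()."""
    if not path.lower().endswith(".gguf"):
        raise ValueError(f"Expected a GGUF model file, got: {path}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```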