Code beginner · 3 min read

How to call llama.cpp from Python

Direct answer
Use the llama-cpp-python package to load a GGUF model and call llm() for plain text completion or llm.create_chat_completion() for chat-style generation, all directly from Python.

Setup

Install
bash
pip install llama-cpp-python
Imports
python
from llama_cpp import Llama

Examples

Input: Hello, how are you?
Output: Hello! I'm doing well, thank you. How can I assist you today?

Input: Explain quantum computing in simple terms.
Output: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.

Input: (empty prompt)
Output: Error: No prompt provided.
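The empty-prompt error above is easy to reproduce with a small guard around the completion call. A minimal sketch, where safe_complete and the llm_fn parameter are illustrative names rather than part of llama-cpp-python:

```python
def safe_complete(llm_fn, prompt, max_tokens=128):
    """Run a completion, rejecting empty prompts before touching the model."""
    if not prompt or not prompt.strip():
        raise ValueError("No prompt provided.")
    response = llm_fn(prompt, max_tokens=max_tokens)
    return response["choices"][0]["text"]
```

Pass llm itself as llm_fn in real use; validating input first avoids paying model-load and inference cost for a prompt that can only fail.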

Integration steps

  1. Install the llama-cpp-python package via pip.
  2. Download or prepare a GGUF format llama.cpp model file.
  3. Import the Llama class from llama_cpp.
  4. Initialize the Llama client with the model path.
  5. Call the create_chat_completion method with chat messages or llm() with a prompt string.
  6. Extract the generated text from the response and use it in your application.

Full code

python
from llama_cpp import Llama

# Initialize the Llama model with the GGUF model path
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")

# Simple prompt completion
prompt = "Hello, how are you?"
response = llm(prompt, max_tokens=128)
print("Completion:", response["choices"][0]["text"])

# Chat completion example
chat_response = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Explain quantum computing in simple terms."}
], max_tokens=128)
print("Chat Completion:", chat_response["choices"][0]["message"]["content"])
output
Completion: Hello! I'm doing well, thank you. How can I assist you today?
Chat Completion: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.

API trace

Request (model initialization plus call parameters, combined for illustration)
json
{"model_path": "./models/llama-3.1-8b.Q4_K_M.gguf", "prompt": "Hello, how are you?", "max_tokens": 128}
Response
json
{"choices": [{"text": "Hello! I'm doing well, thank you. How can I assist you today?"}], "usage": {"total_tokens": 45}}
Extract: response["choices"][0]["text"]
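The extraction step can be made defensive so that a malformed or empty response returns None instead of raising a KeyError; extract_text below is an illustrative helper, not a llama-cpp-python API:

```python
def extract_text(response):
    """Pull the generated text out of a completion response dict.

    Returns None instead of raising when the response has no choices,
    e.g. if generation was interrupted.
    """
    choices = response.get("choices") or []
    if not choices:
        return None
    return choices[0].get("text")
```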

Variants

Streaming output

Use streaming to display tokens as they are generated for better user experience in interactive apps.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")

for output in llm("Tell me a joke", max_tokens=50, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
print()
Async call with llama-cpp-python (if supported)

Use concurrency patterns in Python to handle multiple llama.cpp calls in parallel.

python
# Currently llama-cpp-python does not support async calls natively; use threading or multiprocessing for concurrency.
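One way to fan prompts out across threads is a thread pool with a lock serializing access to a single model instance (a single Llama instance is not safe for concurrent calls). In this sketch, generate is a stub standing in for a real llm(prompt) call:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Serialize access to the (single, non-thread-safe) model instance
# while still queuing prompts from multiple threads.
_lock = Lock()

def generate(prompt):
    # Stand-in for llm(prompt, max_tokens=...); replace with a real call.
    return f"echo: {prompt}"

def safe_generate(prompt):
    with _lock:
        return generate(prompt)

def run_batch(prompts, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe_generate, prompts))
```

With the lock in place, throughput is still one generation at a time; for true parallelism, load one Llama instance per process via multiprocessing instead.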
Using create_chat_completion for chat-style prompts

Use chat completion method when working with chat-based conversational prompts.

python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")

chat_response = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Summarize the latest AI trends."}
], max_tokens=100)
print(chat_response["choices"][0]["message"]["content"])

Performance

Latency: ~500 ms to 2 s per request, depending on model size and hardware
Cost: free for local inference; hardware costs apply
Rate limits: none; throughput is bounded by local hardware resources
  • Limit max_tokens to reduce latency and memory usage.
  • Use smaller models for faster responses on CPU.
  • Cache frequent prompts and completions locally.
Approach                   | Latency              | Cost/call             | Best for
Local llama-cpp-python     | ~0.5-2 s             | Free (local hardware) | Offline, privacy-sensitive use
OpenAI GPT-4o API          | ~0.8 s               | Paid API              | High-quality cloud inference
Streaming llama-cpp-python | ~0.1 s per token     | Free (local hardware) | Interactive applications

Quick tip

llama-cpp-python only loads GGUF-format models; convert or re-download models in other formats before loading.

Common mistake

Passing an invalid model path, or a non-GGUF file (e.g. a legacy GGML file or a raw PyTorch checkpoint), to Llama(), which fails at load time.
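A cheap way to catch this mistake early is to validate the path and extension before constructing Llama. validate_model_path is an illustrative helper, not a llama-cpp-python API:

```python
from pathlib import Path

def validate_model_path(model_path):
    """Fail fast with a clear message before handing the path to Llama()."""
    p = Path(model_path)
    if not p.is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")
    if p.suffix.lower() != ".gguf":
        raise ValueError(f"Expected a .gguf model file, got '{p.suffix or 'no extension'}'")
    return p

# Usage (assuming the model file exists):
# llm = Llama(model_path=str(validate_model_path("./models/llama-3.1-8b.Q4_K_M.gguf")))
```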

Verified 2026-04 · llama-3.1-8b.Q4_K_M.gguf