How to call llama.cpp from Python
Direct answer
Use the llama_cpp Python package (llama-cpp-python) to load a GGUF llama.cpp model, then call llm() for plain completions or llm.create_chat_completion() for chat-style generation, directly from Python.
Setup
Install
pip install llama-cpp-python
Imports
from llama_cpp import Llama
Examples
in: Hello, how are you?
out: Hello! I'm doing well, thank you. How can I assist you today?
in: Explain quantum computing in simple terms.
out: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
in: (empty prompt)
out: Error: No prompt provided.
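The empty-prompt error in the third example is application-level behavior, not something llama-cpp-python raises itself; a minimal guard might look like this (the generate wrapper and its error message are illustrative):

```python
def generate(llm, prompt, max_tokens=128):
    """Call the model, rejecting empty prompts up front."""
    if not prompt or not prompt.strip():
        raise ValueError("Error: No prompt provided.")
    response = llm(prompt, max_tokens=max_tokens)
    return response["choices"][0]["text"]
```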
Integration steps
- Install the llama-cpp-python package via pip.
- Download or prepare a GGUF format llama.cpp model file.
- Import the Llama class from llama_cpp.
- Initialize the Llama client with the model path.
- Call create_chat_completion() with a list of chat messages, or llm() with a prompt string.
- Extract the generated text from the response and use it in your application.
Full code
from llama_cpp import Llama
# Initialize the Llama model with the GGUF model path
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
# Simple prompt completion
prompt = "Hello, how are you?"
response = llm(prompt, max_tokens=128)
print("Completion:", response["choices"][0]["text"])
# Chat completion example
chat_response = llm.create_chat_completion(messages=[
{"role": "user", "content": "Explain quantum computing in simple terms."}
], max_tokens=128)
print("Chat Completion:", chat_response["choices"][0]["message"]["content"])
Output
Completion: Hello! I'm doing well, thank you. How can I assist you today?
Chat Completion: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
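The two response shapes above store the generated text in different places (plain completions under "text", chat completions under "message"); a small helper can normalize extraction (the extract_text name is made up for illustration):

```python
def extract_text(response):
    """Pull generated text from either a plain completion or a chat response."""
    choice = response["choices"][0]
    # Chat responses nest the text under "message"; plain completions use "text".
    if "message" in choice:
        return choice["message"]["content"]
    return choice["text"]
```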
API trace
Request
{"model_path": "./models/llama-3.1-8b.Q4_K_M.gguf", "prompt": "Hello, how are you?", "max_tokens": 128}
Response
{"choices": [{"text": "Hello! I'm doing well, thank you. How can I assist you today?"}], "usage": {"total_tokens": 45}}
Extract
response["choices"][0]["text"]
Variants
Streaming output ›
Use streaming to display tokens as they are generated for better user experience in interactive apps.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
for output in llm("Tell me a joke", max_tokens=50, stream=True):
print(output["choices"][0]["text"], end="", flush=True)
print()
Async call with llama-cpp-python (if supported) ›
Use concurrency patterns in Python to handle multiple llama.cpp calls in parallel.
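Since llama-cpp-python has no native async API, one option is to push calls onto a thread pool; a sketch of that pattern, with a stub standing in for the real Llama call (a single Llama instance is not safe for truly parallel inference, so calls are serialized behind a lock):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()

def ask(llm, prompt, max_tokens=64):
    # Serialize access: one Llama instance should not run two inferences at once.
    with _lock:
        return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]

def ask_many(llm, prompts):
    # The pool overlaps Python-side work; inference itself stays serialized.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: ask(llm, p), prompts))
```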
# Currently llama-cpp-python does not support async calls natively; use threading or multiprocessing for concurrency.
Using create_chat_completion for chat-style prompts ›
Use chat completion method when working with chat-based conversational prompts.
from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf")
chat_response = llm.create_chat_completion(messages=[
{"role": "user", "content": "Summarize the latest AI trends."}
], max_tokens=100)
print(chat_response["choices"][0]["message"]["content"])
Performance
Latency: ~500 ms to 2 s per request, depending on model size and hardware
Cost: free for local inference; hardware costs apply
Rate limits: none; throughput is bounded only by local hardware
- Limit max_tokens to reduce latency and memory usage.
- Use smaller models for faster responses on CPU.
- Cache frequent prompts and completions locally.
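The caching tip above can be sketched with functools.lru_cache over a thin wrapper (the make_cached_generate helper is illustrative; prompts must be hashable, so plain strings work):

```python
from functools import lru_cache

def make_cached_generate(llm, max_tokens=128):
    """Return a memoized generate() so repeated prompts skip inference."""
    @lru_cache(maxsize=256)
    def generate(prompt: str) -> str:
        return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
    return generate
```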
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Local llama-cpp-python | ~0.5-2s | Free (local hardware) | Offline, privacy-sensitive use |
| OpenAI GPT-4o API | ~0.8s | Paid API | High-quality cloud inference |
| Streaming llama-cpp-python | Token-by-token ~0.1s delay | Free (local hardware) | Interactive applications |
Quick tip
Use GGUF-format models with llama-cpp-python; current versions load only GGUF, so convert legacy GGML or raw Hugging Face checkpoints first.
Common mistake
Pointing Llama at a missing model file, or at a model in an incompatible (non-GGUF) format; both fail at load time.
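Both pitfalls can be caught early with a pre-flight check before constructing Llama (a minimal sketch; the check_model_path helper is made up for illustration):

```python
import os

def check_model_path(path):
    """Fail fast on a missing or non-GGUF model file before handing it to Llama()."""
    if not path.lower().endswith(".gguf"):
        raise ValueError(f"Expected a GGUF model file, got: {path}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```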