How to install llama-cpp-python
Direct answer
Install llama-cpp-python via <code>pip install llama-cpp-python</code> and import <code>Llama</code> from <code>llama_cpp</code> to run local LLaMA models in Python.

Setup
Install
```shell
pip install llama-cpp-python
```
Imports
```python
from llama_cpp import Llama
import os
```

Examples
In: Hello, how are you?
Out: I'm doing well, thank you! How can I assist you today?
In: Explain the benefits of local LLaMA inference.
Out: Local LLaMA inference offers privacy, low latency, and no cloud costs.
In: (empty prompt)
Out: Error: Prompt cannot be empty.
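The empty-prompt case can be caught before the model is ever called; a minimal sketch (the <code>validate_prompt</code> helper is an illustration, not part of llama-cpp-python):

```python
def validate_prompt(prompt: str) -> str:
    # Reject empty or whitespace-only prompts before calling the model
    if not prompt or not prompt.strip():
        raise ValueError("Error: Prompt cannot be empty.")
    return prompt
```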
Integration steps
- Install the llama-cpp-python package using pip.
- Import the Llama class from the llama_cpp module.
- Load your local GGUF LLaMA model file with Llama(model_path=...).
- Call the Llama instance with a prompt string to generate text.
- Extract and print the generated text from the response dictionary.
Full code
```python
from llama_cpp import Llama
import os

# Path to your local GGUF LLaMA model file
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Input prompt
prompt = "Hello, how are you?"

# Generate completion
output = llm(prompt, max_tokens=64)

# Extract generated text
text = output["choices"][0]["text"]
print("Generated text:", text)
```
Output
Generated text: I'm doing well, thank you! How can I assist you today?
API trace
Request
```json
{"model_path": "path/to/model.gguf", "prompt": "Hello, how are you?", "max_tokens": 64, "n_ctx": 2048, "n_gpu_layers": 10}
```
Response
```json
{"choices": [{"text": "I'm doing well, thank you! How can I assist you today?"}], "usage": {"tokens": 20}}
```
Extract
```python
response["choices"][0]["text"]
```

Variants
Streaming output ›
Use streaming to display tokens as they are generated for better UX on long outputs.
```python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)
prompt = "Tell me a joke."

# Print each chunk of tokens as it arrives
for output in llm.create_completion(prompt=prompt, max_tokens=64, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
print()
```

Async usage with llama-cpp-python (if supported) ›
Use async or concurrency patterns externally when running multiple llama-cpp-python calls.
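One such external pattern can be sketched with a thread pool and a lock: a single <code>Llama</code> instance is not safe to call concurrently, so the lock serializes model calls while threads still overlap other work. The helper names below are illustrative, and <code>llm</code> is assumed to be an initialized <code>Llama</code> instance:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()

def generate(llm, prompt, max_tokens=64):
    # A single Llama instance is not thread-safe: hold a lock per call
    with _lock:
        output = llm(prompt, max_tokens=max_tokens)
    return output["choices"][0]["text"]

def generate_many(llm, prompts, max_tokens=64):
    # Model calls run one at a time under the lock, but threads can still
    # overlap other work (I/O, request handling) around them
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: generate(llm, p, max_tokens), prompts))
```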
llama-cpp-python does not currently provide a native async API; use threading or multiprocessing for concurrency.

Alternative model with smaller context ›
Use smaller context and fewer GPU layers to reduce memory usage and speed up inference.
```python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=1024, n_gpu_layers=5)
prompt = "Summarize the benefits of local LLaMA inference."
output = llm(prompt, max_tokens=50)
print(output["choices"][0]["text"])
```

Performance
Latency: ~500 ms to 2 s per request, depending on model size and hardware
Cost: Free for local inference; hardware costs apply
Rate limits: No API rate limits; limited by local hardware resources
- Limit <code>max_tokens</code> to reduce latency and memory usage.
- Use smaller context window (<code>n_ctx</code>) if your task allows.
- Adjust <code>n_gpu_layers</code> to balance speed and VRAM usage.
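To check latency figures like these on your own hardware, you can wrap a call with <code>time.perf_counter</code>; a small sketch, assuming <code>llm</code> is initialized as in the full code above (the <code>timed_call</code> helper is illustrative):

```python
import time

def timed_call(llm, prompt, max_tokens=64):
    # Measure wall-clock latency of a single completion
    start = time.perf_counter()
    output = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return output["choices"][0]["text"], elapsed
```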
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~1s | Free (local) | Simple local inference |
| Streaming output | ~1s initial + streaming | Free (local) | Interactive applications |
| Reduced context model | ~500ms | Free (local) | Low-memory environments |
Quick tip
Ensure your GGUF model file is compatible and placed in an accessible path before initializing <code>Llama</code>.
Common mistake
Forgetting to specify the correct <code>model_path</code> or using an incompatible model format causes runtime errors.
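One way to surface both mistakes early is to validate the path before constructing <code>Llama</code>; the <code>load_model</code> wrapper below is an illustration, not part of the library:

```python
import os

def load_model(model_path, **kwargs):
    # Check the path first so mistakes surface as clear errors
    if not model_path.endswith(".gguf"):
        raise ValueError(f"Expected a GGUF model file, got: {model_path}")
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    # Deferred import: the path checks above run before the library loads
    from llama_cpp import Llama
    return Llama(model_path=model_path, **kwargs)
```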