Code beginner · 3 min read

How to install llama-cpp-python

Direct answer
Install llama-cpp-python via pip install llama-cpp-python and import Llama from llama_cpp to run local LLaMA models in Python.

Setup

Install
bash
pip install llama-cpp-python
Imports
python
from llama_cpp import Llama
import os

Examples

In: Hello, how are you?
Out: I'm doing well, thank you! How can I assist you today?
In: Explain the benefits of local LLaMA inference.
Out: Local LLaMA inference offers privacy, low latency, and no cloud costs.
In: (empty prompt)
Out: Error: Prompt cannot be empty.
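The empty-prompt error above can be caught before the model is ever invoked. A minimal sketch (validate_prompt is an illustrative helper, not part of the library):

```python
def validate_prompt(prompt: str) -> str:
    # Reject empty or whitespace-only prompts before they reach the model
    if not prompt or not prompt.strip():
        raise ValueError("Prompt cannot be empty.")
    return prompt.strip()
```

Calling validate_prompt(user_input) ahead of the Llama call turns a confusing generation failure into an immediate, descriptive exception.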

Integration steps

  1. Install the llama-cpp-python package using pip.
  2. Import the Llama class from the llama_cpp module.
  3. Load your local GGUF LLaMA model file with Llama(model_path=...).
  4. Call the Llama instance with a prompt string to generate text.
  5. Extract and print the generated text from the response dictionary.

Full code

python
from llama_cpp import Llama
import os

# Path to your local GGUF LLaMA model file
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Input prompt
prompt = "Hello, how are you?"

# Generate completion
output = llm(prompt, max_tokens=64)

# Extract generated text
text = output["choices"][0]["text"]

print("Generated text:", text)
output
Generated text: I'm doing well, thank you! How can I assist you today?

API trace

Request (constructor parameters model_path, n_ctx, n_gpu_layers and call parameters prompt, max_tokens shown together for illustration)
json
{"model_path": "path/to/model.gguf", "prompt": "Hello, how are you?", "max_tokens": 64, "n_ctx": 2048, "n_gpu_layers": 10}
Response
json
{"choices": [{"text": "I'm doing well, thank you! How can I assist you today?"}], "usage": {"prompt_tokens": 7, "completion_tokens": 13, "total_tokens": 20}}
Extract: response["choices"][0]["text"]
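The extraction step above can be made defensive so that a malformed or empty response fails with a clear message. A small sketch (extract_text is an illustrative helper, not part of the library):

```python
def extract_text(response: dict) -> str:
    # Pull the generated text out of a completion response dict
    choices = response.get("choices") or []
    if not choices:
        raise KeyError("response contains no choices")
    return choices[0]["text"]

response = {"choices": [{"text": "Hi there!"}], "usage": {"total_tokens": 20}}
print(extract_text(response))  # Hi there!
```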

Variants

Streaming output

Use streaming to display tokens as they are generated for better UX on long outputs.

python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

prompt = "Tell me a joke."

for output in llm.create_completion(prompt=prompt, max_tokens=64, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
print()
Async usage with llama-cpp-python (if supported)

llama-cpp-python does not expose a native async API, so handle concurrency externally with threads or processes when running multiple calls.

python
# Currently llama-cpp-python does not provide async API; use threading or multiprocessing for concurrency.
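The threading suggestion above can be sketched as follows. This is one possible pattern under the assumption that a single Llama instance should not run two generations at once, so a lock serializes access to the shared model; run_prompts and _generation_lock are illustrative names, not library API:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Serialize access to one shared model instance while a thread pool
# fans prompts out from concurrent callers.
_generation_lock = threading.Lock()

def run_prompts(llm, prompts, max_tokens=64):
    def worker(prompt):
        with _generation_lock:  # one generation at a time on the shared model
            out = llm(prompt, max_tokens=max_tokens)
        return out["choices"][0]["text"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, prompts))
```

Because the lock serializes generation, this mainly helps when prompts arrive from concurrent callers (e.g. a web handler); for true parallelism, load one Llama instance per process instead.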
Alternative model with smaller context

Use smaller context and fewer GPU layers to reduce memory usage and speed up inference.

python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=1024, n_gpu_layers=5)

prompt = "Summarize the benefits of local LLaMA inference."
output = llm(prompt, max_tokens=50)
print(output["choices"][0]["text"])

Performance

Latency: ~500 ms to 2 s per request depending on model size and hardware
Cost: free for local inference; hardware costs apply
Rate limits: no API rate limits; limited by local hardware resources
  • Limit max_tokens to reduce latency and memory usage.
  • Use a smaller context window (n_ctx) if your task allows.
  • Adjust n_gpu_layers to balance speed and VRAM usage.
Approach | Latency | Cost/call | Best for
Standard call | ~1s | Free (local) | Simple local inference
Streaming output | ~1s initial + streaming | Free (local) | Interactive applications
Reduced context model | ~500ms | Free (local) | Low-memory environments
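The latency figures above depend heavily on your hardware, so it is worth measuring on your own machine. A minimal timing wrapper (timed_call is an illustrative helper, not part of the library):

```python
import time

def timed_call(llm, prompt, **kwargs):
    # Return (generated_text, elapsed_seconds) for one completion call
    start = time.perf_counter()
    out = llm(prompt, **kwargs)
    return out["choices"][0]["text"], time.perf_counter() - start
```

Running it with different max_tokens or n_ctx settings shows directly how the tuning knobs in the bullets above trade speed for output length and context.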

Quick tip

Ensure your model file is in GGUF format and stored at an accessible path before initializing Llama.
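The tip above can be enforced in code so a bad path fails fast with a clear message instead of a cryptic loader error. A sketch (check_model_path is an illustrative helper, not part of the library):

```python
import os

def check_model_path(path: str) -> str:
    # Validate the model path before handing it to Llama(model_path=...)
    path = os.path.expanduser(path)
    if not path.endswith(".gguf"):
        raise ValueError("llama-cpp-python expects a GGUF model file")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```

Call it as Llama(model_path=check_model_path("~/models/llama-3.1-8b.Q4_K_M.gguf"), ...).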

Common mistake

Pointing model_path at the wrong file, or at a model in an incompatible (non-GGUF) format, causes a runtime error when the model loads.

Verified 2026-04 · llama-3.1-8b.Q4_K_M.gguf