How to install llama-cpp-python
Direct answer
Install llama-cpp-python via <code>pip install llama-cpp-python</code> and import <code>Llama</code> from <code>llama_cpp</code> to run local LLaMA models in Python.

Setup
Install
```shell
pip install llama-cpp-python
```
Imports
```python
from llama_cpp import Llama
import os
```

Examples
In: Hello, how are you?
Out: I'm doing well, thank you! How can I assist you today?
In: Explain the benefits of local LLaMA inference.
Out: Local LLaMA inference offers privacy, low latency, and no cloud costs.
In: (empty prompt)
Out: Error: Prompt cannot be empty.
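The empty-prompt case can be caught before the model is ever called; a minimal sketch (the <code>validate_prompt</code> helper is an illustration, not part of llama-cpp-python):

```python
def validate_prompt(prompt: str) -> str:
    # Reject empty or whitespace-only prompts before calling the model
    if not prompt or not prompt.strip():
        raise ValueError("Error: Prompt cannot be empty.")
    return prompt
```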
Integration steps
- Install the llama-cpp-python package using pip.
- Import the Llama class from the llama_cpp module.
- Load your local GGUF LLaMA model file with Llama(model_path=...).
- Call the Llama instance with a prompt string to generate text.
- Extract and print the generated text from the response dictionary.
Full code
```python
from llama_cpp import Llama
import os

# Path to your local GGUF LLaMA model file
model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")

# Initialize the Llama model
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

# Input prompt
prompt = "Hello, how are you?"

# Generate completion
output = llm(prompt, max_tokens=64)

# Extract generated text
text = output["choices"][0]["text"]
print("Generated text:", text)
```
Output
Generated text: I'm doing well, thank you! How can I assist you today?
API trace
Request
```json
{"model_path": "path/to/model.gguf", "prompt": "Hello, how are you?", "max_tokens": 64, "n_ctx": 2048, "n_gpu_layers": 10}
```
Response
```json
{"choices": [{"text": "I'm doing well, thank you! How can I assist you today?"}], "usage": {"tokens": 20}}
```
Extract
```python
response["choices"][0]["text"]
```

Variants
Streaming output ›
Use streaming to display tokens as they are generated for better UX on long outputs.
```python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)
prompt = "Tell me a joke."

# Print each chunk of tokens as it arrives
for output in llm.create_completion(prompt=prompt, max_tokens=64, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
print()
```

Async usage with llama-cpp-python (if supported) ›
Use async or concurrency patterns externally when running multiple llama-cpp-python calls.
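One such external pattern can be sketched with a thread pool and a lock: a single <code>Llama</code> instance is not safe to call concurrently, so the lock serializes model calls while threads still overlap other work. The helper names below are illustrative, and <code>llm</code> is assumed to be an initialized <code>Llama</code> instance:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()

def generate(llm, prompt, max_tokens=64):
    # A single Llama instance is not thread-safe: hold a lock per call
    with _lock:
        output = llm(prompt, max_tokens=max_tokens)
    return output["choices"][0]["text"]

def generate_many(llm, prompts, max_tokens=64):
    # Model calls run one at a time under the lock, but threads can still
    # overlap other work (I/O, request handling) around them
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: generate(llm, p, max_tokens), prompts))
```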
llama-cpp-python does not currently provide a native async API; use threading or multiprocessing for concurrency.

Alternative model with smaller context ›
Use smaller context and fewer GPU layers to reduce memory usage and speed up inference.
```python
from llama_cpp import Llama
import os

model_path = os.path.expanduser("~/models/llama-3.1-8b.Q4_K_M.gguf")
llm = Llama(model_path=model_path, n_ctx=1024, n_gpu_layers=5)
prompt = "Summarize the benefits of local LLaMA inference."
output = llm(prompt, max_tokens=50)
print(output["choices"][0]["text"])
```

Performance
Latency: ~500 ms to 2 s per request, depending on model size and hardware
Cost: Free for local inference; hardware costs apply
Rate limits: No API rate limits; limited by local hardware resources
- Limit <code>max_tokens</code> to reduce latency and memory usage.
- Use smaller context window (<code>n_ctx</code>) if your task allows.
- Adjust <code>n_gpu_layers</code> to balance speed and VRAM usage.
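To check latency figures like these on your own hardware, you can wrap a call with <code>time.perf_counter</code>; a small sketch, assuming <code>llm</code> is initialized as in the full code above (the <code>timed_call</code> helper is illustrative):

```python
import time

def timed_call(llm, prompt, max_tokens=64):
    # Measure wall-clock latency of a single completion
    start = time.perf_counter()
    output = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return output["choices"][0]["text"], elapsed
```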
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~1s | Free (local) | Simple local inference |
| Streaming output | ~1s initial + streaming | Free (local) | Interactive applications |
| Reduced context model | ~500ms | Free (local) | Low-memory environments |
Quick tip
Ensure your GGUF model file is compatible and placed in an accessible path before initializing <code>Llama</code>.
Common mistake
Forgetting to specify the correct <code>model_path</code> or using an incompatible model format causes runtime errors.
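One way to surface both mistakes early is to validate the path before constructing <code>Llama</code>; the <code>load_model</code> wrapper below is an illustration, not part of the library:

```python
import os

def load_model(model_path, **kwargs):
    # Check the path first so mistakes surface as clear errors
    if not model_path.endswith(".gguf"):
        raise ValueError(f"Expected a GGUF model file, got: {model_path}")
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    # Deferred import: the path checks above run before the library loads
    from llama_cpp import Llama
    return Llama(model_path=model_path, **kwargs)
```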