How to run llama.cpp as an API server
Quick answer
Run llama.cpp as an API server by starting its built-in HTTP server with python -m llama_cpp.server, then query it programmatically with the openai Python SDK pointed at the server's OpenAI-compatible /v1 endpoint.

Prerequisites
- Python 3.8+
- pip install 'llama-cpp-python[server]'
- A llama.cpp GGUF model file downloaded
- Basic knowledge of FastAPI (optional)
Setup
Install the llama-cpp-python package with its server extra, which provides the Python bindings plus a built-in OpenAI-compatible API server. Download a compatible GGUF model file from Hugging Face or another source. Ensure Python 3.8 or higher is installed.
pip install 'llama-cpp-python[server]'
# Download a GGUF model, e.g. llama-3.1-8b.Q4_K_M.gguf, from Hugging Face

Output:
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
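If a download looks suspect, you can sanity-check the file before loading it: every GGUF file starts with the 4-byte magic b"GGUF". A minimal sketch (the helper name is illustrative, not part of llama-cpp-python):

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    # GGUF files begin with the ASCII magic "GGUF"; anything else
    # suggests a truncated, corrupt, or mis-named download.
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == b"GGUF"
```

This only validates the header, not the full file, but it catches the common case of an HTML error page saved in place of a model.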
Step by step
Start the llama.cpp API server locally, then use the openai Python SDK to send chat completion requests; the server speaks the OpenAI chat-completions protocol.

# Start the server in a separate terminal:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

from openai import OpenAI

# Python client connecting to the running server; the server does not
# require an API key by default, so any placeholder string works
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Simple chat completion example
messages = [
    {"role": "user", "content": "Hello, llama.cpp!"}
]
response = client.chat.completions.create(
    model="llama-3.1-8b",  # a single-model server does not enforce the model name
    messages=messages,
    max_tokens=50,
)
print("Response:", response.choices[0].message.content)

Output:
Response: Hello! How can I assist you today?
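Under the hood, the server replies with OpenAI-style JSON over HTTP, so clients that call the endpoint directly (for example with curl or requests) parse the same structure. A sketch with hypothetical field values showing where the reply text lives:

```python
# Abridged, hypothetical server response in the OpenAI chat-completion format
raw_response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21},
}

def extract_reply(resp: dict) -> str:
    # The assistant's text is nested at choices[0].message.content
    return resp["choices"][0]["message"]["content"]

print(extract_reply(raw_response))  # Hello! How can I assist you today?
```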
Common variations
- Use llm(prompt) on an in-process Llama instance for simple text completions instead of chat completions.
- Run the server with different ports or models by changing the CLI args.
- Use async frameworks like FastAPI to build a custom API server around llama.cpp:
from fastapi import FastAPI
from llama_cpp import Llama

app = FastAPI()

# Load the model in-process; no separate llama_cpp.server is needed here
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)

@app.post("/generate")
async def generate(prompt: str):
    response = llm(prompt, max_tokens=100)
    return {"text": response["choices"][0]["text"]}

# Run with: uvicorn myapp:app --reload

Output:
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
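One caveat with the endpoint above: Llama inference is blocking, so calling it directly inside an async def handler stalls the event loop for the duration of generation. A common pattern is to push the call onto a thread pool; a minimal illustration, with a stand-in function in place of the real model call:

```python
import asyncio

def blocking_generate(prompt: str) -> str:
    # Stand-in for llm(prompt, ...); real inference would block for seconds
    return f"completion for: {prompt}"

async def generate(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free
    return await loop.run_in_executor(None, blocking_generate, prompt)

print(asyncio.run(generate("Hello")))  # completion for: Hello
```

Inside FastAPI you would apply the same run_in_executor pattern to the llm(...) call; note that a single Llama instance is not safe to call from multiple threads at once, so serialize access (e.g. with a lock) if you expect concurrent requests.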
Troubleshooting
- If you see connection refused errors, ensure the llama.cpp server is running on the correct port.
- Model loading errors usually mean the GGUF model path is incorrect or the file is corrupted.
- For performance issues, try reducing n_ctx or using quantized models.
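A quick way to distinguish "server not running" from other failures is to probe the server's /v1/models endpoint (part of the OpenAI-compatible API) before sending requests. A minimal sketch using only the standard library; the default URL assumes the port 8080 setup from this guide:

```python
import urllib.error
import urllib.request

def server_is_up(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    # A 200 from /v1/models means the server is listening and has a model loaded;
    # a connection error means it is not running, or the host/port is wrong.
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```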
Key Takeaways
- Use the built-in llama_cpp.server module to run llama.cpp as an OpenAI-compatible API server.
- Query the server with the openai Python SDK by pointing base_url at the server's /v1 endpoint.
- Build custom API endpoints with FastAPI that wrap an in-process Llama instance for scalable deployments.