How to run llama.cpp as an API server
Quick answer
Run llama.cpp as an API server by starting its built-in HTTP server with python -m llama_cpp.server, then query it programmatically with the openai Python SDK pointed at the server's OpenAI-compatible /v1 endpoint.

Prerequisites
- Python 3.8+
- pip install 'llama-cpp-python[server]'
- A llama.cpp GGUF model file downloaded
- Basic knowledge of FastAPI (optional)
Setup
Install the llama-cpp-python package with its server extra, which provides the Python bindings plus a built-in OpenAI-compatible API server. Download a compatible GGUF model file from Hugging Face or another source. Ensure Python 3.8 or higher is installed.
pip install 'llama-cpp-python[server]'
# Download a GGUF model, e.g. llama-3.1-8b.Q4_K_M.gguf, from Hugging Face

Output:
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
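If a download looks suspect, you can sanity-check the file before loading it: every GGUF file starts with the 4-byte magic b"GGUF". A minimal sketch (the helper name is illustrative, not part of llama-cpp-python):

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    # GGUF files begin with the ASCII magic "GGUF"; anything else
    # suggests a truncated, corrupt, or mis-named download.
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == b"GGUF"
```

This only validates the header, not the full file, but it catches the common case of an HTML error page saved in place of a model.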
Step by step
Start the llama.cpp API server locally, then use the openai Python SDK to send chat completion requests; the server speaks the OpenAI chat-completions protocol.

# Start the server in a separate terminal:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

from openai import OpenAI

# Python client connecting to the running server; the server does not
# require an API key by default, so any placeholder string works
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Simple chat completion example
messages = [
    {"role": "user", "content": "Hello, llama.cpp!"}
]
response = client.chat.completions.create(
    model="llama-3.1-8b",  # a single-model server does not enforce the model name
    messages=messages,
    max_tokens=50,
)
print("Response:", response.choices[0].message.content)

Output:
Response: Hello! How can I assist you today?
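Under the hood, the server replies with OpenAI-style JSON over HTTP, so clients that call the endpoint directly (for example with curl or requests) parse the same structure. A sketch with hypothetical field values showing where the reply text lives:

```python
# Abridged, hypothetical server response in the OpenAI chat-completion format
raw_response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21},
}

def extract_reply(resp: dict) -> str:
    # The assistant's text is nested at choices[0].message.content
    return resp["choices"][0]["message"]["content"]

print(extract_reply(raw_response))  # Hello! How can I assist you today?
```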
Common variations
- Use llm(prompt) on an in-process Llama instance for simple text completions instead of chat completions.
- Run the server with different ports or models by changing the CLI args.
- Use async frameworks like FastAPI to build a custom API server around llama.cpp:
from fastapi import FastAPI
from llama_cpp import Llama

app = FastAPI()

# Load the model in-process; no separate llama_cpp.server is needed here
llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048)

@app.post("/generate")
async def generate(prompt: str):
    response = llm(prompt, max_tokens=100)
    return {"text": response["choices"][0]["text"]}

# Run with: uvicorn myapp:app --reload

Output:
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
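One caveat with the endpoint above: Llama inference is blocking, so calling it directly inside an async def handler stalls the event loop for the duration of generation. A common pattern is to push the call onto a thread pool; a minimal illustration, with a stand-in function in place of the real model call:

```python
import asyncio

def blocking_generate(prompt: str) -> str:
    # Stand-in for llm(prompt, ...); real inference would block for seconds
    return f"completion for: {prompt}"

async def generate(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free
    return await loop.run_in_executor(None, blocking_generate, prompt)

print(asyncio.run(generate("Hello")))  # completion for: Hello
```

Inside FastAPI you would apply the same run_in_executor pattern to the llm(...) call; note that a single Llama instance is not safe to call from multiple threads at once, so serialize access (e.g. with a lock) if you expect concurrent requests.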
Troubleshooting
- If you see connection refused errors, ensure the llama.cpp server is running on the correct port.
- Model loading errors usually mean the GGUF model path is incorrect or the file is corrupted.
- For performance issues, try reducing n_ctx or using quantized models.
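A quick way to distinguish "server not running" from other failures is to probe the server's /v1/models endpoint (part of the OpenAI-compatible API) before sending requests. A minimal sketch using only the standard library; the default URL assumes the port 8080 setup from this guide:

```python
import urllib.error
import urllib.request

def server_is_up(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    # A 200 from /v1/models means the server is listening and has a model loaded;
    # a connection error means it is not running, or the host/port is wrong.
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```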
Key Takeaways
- Use the built-in llama_cpp.server module to run llama.cpp as an OpenAI-compatible API server.
- Query the server with the openai Python SDK by pointing base_url at the server's /v1 endpoint.
- Build custom API endpoints with FastAPI that wrap an in-process Llama instance for scalable deployments.