llama.cpp server endpoints
Quick answer
Run the llama_cpp.server CLI to start a local HTTP server exposing endpoints for chat and completions. Query these endpoints via HTTP or the OpenAI-compatible Python SDK by setting base_url to the server address and calling client.chat.completions.create() or client.completions.create().

Prerequisites
- Python 3.8+
- pip install "llama-cpp-python>=0.1.81"
- Basic knowledge of HTTP requests and Python
Setup llama.cpp server
Install the llama-cpp-python package and download a compatible GGUF model file. Then start the server with the CLI command specifying the model path and port.
pip install llama-cpp-python
# Download a GGUF model from Hugging Face or other source
# Example server start command:
python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

Output:
INFO: Starting llama.cpp server on http://localhost:8080
INFO: Model loaded: llama-3.1-8b.Q4_K_M.gguf
Server ready to accept requests
Step by step: Query server endpoints
Use the OpenAI Python SDK with base_url pointing to the local llama.cpp server to send chat or completion requests. The server supports OpenAI-compatible endpoints.
from openai import OpenAI

# Point the SDK at the local server. The key is a dummy value: the local
# server does not check it, but the SDK rejects a missing key.
client = OpenAI(
    api_key="sk-no-key-required",
    base_url="http://localhost:8080/v1"
)

# Chat completion example
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello from llama.cpp server!"}]
)
print("Chat response:", response.choices[0].message.content)

# Text completion example
response2 = client.completions.create(
    model="llama-3.1-8b",
    prompt="Write a haiku about AI",
    max_tokens=50
)
print("Completion response:", response2.choices[0].text)

Output:
Chat response: Hello! How can I assist you today?
Completion response: Silent circuits hum
Learning minds weave words anew
AI dreams in code
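The same endpoints can also be queried with plain HTTP and no SDK at all. Below is a minimal sketch using the requests library; build_chat_payload is a hypothetical helper of ours, and the network call is guarded under __main__ so the request body can be built and inspected without a running server.

```python
def build_chat_payload(model, user_message, max_tokens=128):
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Live call -- assumes the server from the setup step is on port 8080.
    import requests  # only needed for the actual HTTP request

    payload = build_chat_payload("llama-3.1-8b", "Hello from plain HTTP!")
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```

The response body has the same shape as the SDK objects above, so the reply text lives at choices[0].message.content.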
Common variations
- Use async calls with asyncio and await for non-blocking requests.
- Change the model parameter to match your loaded GGUF model name.
- Use streaming by setting stream=True in chat.completions.create() to receive tokens incrementally.
import asyncio
from openai import AsyncOpenAI

async def async_chat():
    # The awaitable client is AsyncOpenAI; the synchronous OpenAI client
    # cannot be awaited here.
    client = AsyncOpenAI(
        api_key="sk-no-key-required",
        base_url="http://localhost:8080/v1"
    )
    stream = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Stream a poem about spring."}],
        stream=True
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_chat())

Output:
Gentle breeze whispers
Blossoms dance in warm sunlight
Spring breathes life anew
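Streaming works with the synchronous client as well. A sketch along the same lines: join_stream is a small helper of ours (not part of the SDK) that collects the incremental tokens, and the live call under __main__ assumes the local server from the setup step.

```python
def join_stream(deltas):
    """Join streamed delta.content values into one string, skipping the
    None deltas that appear in role and finish chunks."""
    return "".join(d or "" for d in deltas)


if __name__ == "__main__":
    from openai import OpenAI  # synchronous client this time

    client = OpenAI(api_key="sk-no-key-required",
                    base_url="http://localhost:8080/v1")
    stream = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Stream a poem about spring."}],
        stream=True,
    )
    print(join_stream(chunk.choices[0].delta.content for chunk in stream))
```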
Troubleshooting tips
- If the server fails to start, verify the model path and that the GGUF model file is valid.
- Check that the port (default 8080) is free and accessible.
- For connection errors, ensure base_url matches the server address and includes /v1.
- Use verbose logging by adding the --verbose flag to the server CLI for debugging.
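A quick way to rule out the port and connectivity cases above is to probe the TCP port directly before digging into logs. A small standard-library sketch; port_is_open is our helper, and localhost:8080 is just the default from this guide:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 8080):
        print("llama.cpp server port is reachable")
    else:
        print("Nothing is listening on port 8080 -- is the server running?")
```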
Key Takeaways
- Run llama.cpp server with the CLI to expose OpenAI-compatible endpoints locally.
- Use the OpenAI Python SDK with base_url pointed at the server for easy integration.
- Support for chat, completions, streaming, and async calls enables flexible usage.
- Ensure correct model path and server port to avoid startup and connection issues.