llama.cpp server endpoints
Quick answer
Run the llama_cpp.server CLI to start a local HTTP server exposing endpoints for chat and completions. Query these endpoints via HTTP or the OpenAI-compatible Python SDK by setting base_url to the server address and calling client.chat.completions.create() or client.completions.create().

Prerequisites
- Python 3.8+
- pip install "llama-cpp-python>=0.1.81"
- Basic knowledge of HTTP requests and Python
Setup llama.cpp server
Install the llama-cpp-python package and download a compatible GGUF model file. Then start the server with the CLI command specifying the model path and port.
pip install llama-cpp-python
# Download a GGUF model from Hugging Face or other source
# Example server start command:
python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

Output:
INFO: Starting llama.cpp server on http://localhost:8080
INFO: Model loaded: llama-3.1-8b.Q4_K_M.gguf
Server ready to accept requests
Step by step: Query server endpoints
Use the OpenAI Python SDK with base_url pointing to the local llama.cpp server to send chat or completion requests. The server supports OpenAI-compatible endpoints.
from openai import OpenAI

# Point the SDK at the local server. The key is a dummy value: the local
# server does not check it, but the SDK rejects a missing key.
client = OpenAI(
    api_key="sk-no-key-required",
    base_url="http://localhost:8080/v1"
)

# Chat completion example
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello from llama.cpp server!"}]
)
print("Chat response:", response.choices[0].message.content)

# Text completion example
response2 = client.completions.create(
    model="llama-3.1-8b",
    prompt="Write a haiku about AI",
    max_tokens=50
)
print("Completion response:", response2.choices[0].text)

Output:
Chat response: Hello! How can I assist you today?
Completion response: Silent circuits hum
Learning minds weave words anew
AI dreams in code
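The same endpoints can also be queried with plain HTTP and no SDK at all. Below is a minimal sketch using the requests library; build_chat_payload is a hypothetical helper of ours, and the network call is guarded under __main__ so the request body can be built and inspected without a running server.

```python
def build_chat_payload(model, user_message, max_tokens=128):
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Live call -- assumes the server from the setup step is on port 8080.
    import requests  # only needed for the actual HTTP request

    payload = build_chat_payload("llama-3.1-8b", "Hello from plain HTTP!")
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```

The response body has the same shape as the SDK objects above, so the reply text lives at choices[0].message.content.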
Common variations
- Use async calls with asyncio and await for non-blocking requests.
- Change the model parameter to match your loaded GGUF model name.
- Use streaming by setting stream=True in chat.completions.create() to receive tokens incrementally.
import asyncio
from openai import AsyncOpenAI

async def async_chat():
    # The awaitable client is AsyncOpenAI; the synchronous OpenAI client
    # cannot be awaited here.
    client = AsyncOpenAI(
        api_key="sk-no-key-required",
        base_url="http://localhost:8080/v1"
    )
    stream = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Stream a poem about spring."}],
        stream=True
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_chat())

Output:
Gentle breeze whispers
Blossoms dance in warm sunlight
Spring breathes life anew
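Streaming works with the synchronous client as well. A sketch along the same lines: join_stream is a small helper of ours (not part of the SDK) that collects the incremental tokens, and the live call under __main__ assumes the local server from the setup step.

```python
def join_stream(deltas):
    """Join streamed delta.content values into one string, skipping the
    None deltas that appear in role and finish chunks."""
    return "".join(d or "" for d in deltas)


if __name__ == "__main__":
    from openai import OpenAI  # synchronous client this time

    client = OpenAI(api_key="sk-no-key-required",
                    base_url="http://localhost:8080/v1")
    stream = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Stream a poem about spring."}],
        stream=True,
    )
    print(join_stream(chunk.choices[0].delta.content for chunk in stream))
```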
Troubleshooting tips
- If the server fails to start, verify the model path and that the GGUF model file is valid.
- Check that the port (default 8080) is free and accessible.
- For connection errors, ensure base_url matches the server address and includes /v1.
- Use verbose logging by adding the --verbose flag to the server CLI for debugging.
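A quick way to rule out the port and connectivity cases above is to probe the TCP port directly before digging into logs. A small standard-library sketch; port_is_open is our helper, and localhost:8080 is just the default from this guide:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 8080):
        print("llama.cpp server port is reachable")
    else:
        print("Nothing is listening on port 8080 -- is the server running?")
```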
Key Takeaways
- Run llama.cpp server with the CLI to expose OpenAI-compatible endpoints locally.
- Use the OpenAI Python SDK with base_url pointed at the server for easy integration.
- Support for chat, completions, streaming, and async calls enables flexible usage.
- Ensure correct model path and server port to avoid startup and connection issues.