How-to · Beginner · 3 min read

How to query llama.cpp server from Python

Quick answer
Use the OpenAI Python SDK with base_url set to your llama.cpp server endpoint, then call client.chat.completions.create() with your messages. The request is sent to the llama.cpp server (local or remote), and the generated reply comes back in the standard OpenAI response format.

PREREQUISITES

  • Python 3.8+
  • pip install openai>=1.0
  • Running llama.cpp server (e.g. ./llama-server -m ./model.gguf --port 8080, or via llama-cpp-python: python -m llama_cpp.server --model ./model.gguf --port 8080)

Setup

Install the official OpenAI Python SDK and make sure your llama.cpp server is running locally or remotely. The native llama-server binary listens on http://localhost:8080 by default.

Install the SDK with:

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
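
Before wiring up the SDK, it can help to confirm the server is actually reachable. A minimal sketch using only the standard library, assuming the native llama-server's /health endpoint (adjust the path if your server build differs):

```python
import urllib.request
import urllib.error

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: server not reachable
        return False

print(server_is_up("http://localhost:8080"))
```

If this prints False, fix the server before debugging any SDK code.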

Step by step

Use the OpenAI SDK with base_url pointing to your llama.cpp server. Create a client, then send chat completion requests with the chat message format.

python
from openai import OpenAI

# Point the client at your llama.cpp server. llama.cpp does not check the
# API key, so any placeholder string works.
LLAMA_CPP_SERVER_URL = "http://localhost:8080/v1"

client = OpenAI(api_key="not-needed", base_url=LLAMA_CPP_SERVER_URL)

messages = [
    {"role": "user", "content": "Hello from Python to llama.cpp server!"}
]

response = client.chat.completions.create(
    model="llama-3.1-8b",  # Use your model name as configured on the server
    messages=messages
)

print("Response:", response.choices[0].message.content)
output
Response: Hello from Python to llama.cpp server! How can I assist you today?

Common variations

  • Async calls: Use AsyncOpenAI with asyncio and await client.chat.completions.create(...) for asynchronous requests.
  • Streaming: Add stream=True to receive tokens incrementally.
  • Different models: Change the model parameter to match your llama.cpp server model.
python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # AsyncOpenAI (not OpenAI) is required for awaitable calls;
    # llama.cpp ignores the API key, so a placeholder is fine
    client = AsyncOpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")
    messages = [{"role": "user", "content": "Async hello!"}]
    
    stream = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=messages,
        stream=True
    )

    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_query())
output
Async hello! How can I help you today?

Troubleshooting

  • If you get connection errors, verify your llama.cpp server is running and accessible at the specified base_url.
  • Ensure the model name matches exactly what the server expects.
  • Check firewall or network settings if connecting to a remote server.
  • Use logs from the llama.cpp server for debugging request handling.

Key Takeaways

  • Use the OpenAI Python SDK with the llama.cpp server URL as the base_url.
  • Send chat completions with messages in the OpenAI chat format to query the server.
  • For streaming, pass stream=True; for async requests, use AsyncOpenAI with await.
  • Verify server availability and model names to avoid connection or model errors.
Verified 2026-04 · llama-3.1-8b