How-to · Beginner · 3 min read

How to query llama.cpp server from Python

Quick answer
Use the OpenAI Python SDK with base_url set to your llama.cpp server endpoint, then call client.chat.completions.create() with your messages. The request is sent to the llama.cpp server (local or remote), and the generated reply comes back in the standard OpenAI response format.

PREREQUISITES

  • Python 3.8+
  • pip install openai>=1.0
  • Running llama.cpp server (e.g. ./llama-server -m ./model.gguf --port 8080, or via llama-cpp-python: python -m llama_cpp.server --model ./model.gguf --port 8080)

Setup

Install the official OpenAI Python SDK and make sure your llama.cpp server is running locally or remotely. The native llama-server binary listens on http://localhost:8080 by default.

Install the SDK with:

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
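
Before wiring up the SDK, it can help to confirm the server is actually reachable. A minimal sketch using only the standard library, assuming the native llama-server's /health endpoint (adjust the path if your server build differs):

```python
import urllib.request
import urllib.error

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: server not reachable
        return False

print(server_is_up("http://localhost:8080"))
```

If this prints False, fix the server before debugging any SDK code.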

Step by step

Use the OpenAI SDK with base_url pointing to your llama.cpp server. Create a client, then send chat completion requests with the chat message format.

python
from openai import OpenAI

# Point the client at your llama.cpp server. llama.cpp does not check the
# API key, so any placeholder string works.
LLAMA_CPP_SERVER_URL = "http://localhost:8080/v1"

client = OpenAI(api_key="not-needed", base_url=LLAMA_CPP_SERVER_URL)

messages = [
    {"role": "user", "content": "Hello from Python to llama.cpp server!"}
]

response = client.chat.completions.create(
    model="llama-3.1-8b",  # Use your model name as configured on the server
    messages=messages
)

print("Response:", response.choices[0].message.content)
output
Response: Hello from Python to llama.cpp server! How can I assist you today?

Common variations

  • Async calls: Use AsyncOpenAI with asyncio and await client.chat.completions.create(...) for asynchronous requests.
  • Streaming: Add stream=True to receive tokens incrementally.
  • Different models: Change the model parameter to match your llama.cpp server model.
python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # AsyncOpenAI (not OpenAI) is required for awaitable calls;
    # llama.cpp ignores the API key, so a placeholder is fine
    client = AsyncOpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")
    messages = [{"role": "user", "content": "Async hello!"}]
    
    stream = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=messages,
        stream=True
    )

    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_query())
output
Async hello! How can I help you today?

Troubleshooting

  • If you get connection errors, verify your llama.cpp server is running and accessible at the specified base_url.
  • Ensure the model name matches exactly what the server expects.
  • Check firewall or network settings if connecting to a remote server.
  • Use logs from the llama.cpp server for debugging request handling.

Key Takeaways

  • Use the OpenAI Python SDK with the llama.cpp server URL as the base_url.
  • Send chat completions with messages in the OpenAI chat format to query the server.
  • For streaming, pass stream=True; for async requests, use AsyncOpenAI with await.
  • Verify server availability and model names to avoid connection or model errors.
Verified 2026-04 · llama-3.1-8b