How-to · Beginner · 3 min read

llama-cpp-python OpenAI-compatible server

Quick answer
Use llama-cpp-python to start a local server that exposes an OpenAI-compatible API. Start it with python -m llama_cpp.server --model ./model.gguf --port 8080, then query it with the OpenAI SDK by setting base_url='http://localhost:8080/v1'.

Prerequisites

  • Python 3.8+
  • pip install llama-cpp-python "openai>=1.0" (quote the version specifier so the shell does not interpret the >)
  • A GGUF format Llama model file

Setup

Install the llama-cpp-python package and prepare a GGUF Llama model file. The server module in llama-cpp-python provides an OpenAI-compatible HTTP API.

bash
pip install llama-cpp-python openai
output
Collecting llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Collecting openai
Installing collected packages: openai
Successfully installed openai-1.0.0
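
If you have a supported GPU, llama-cpp-python can also be built with offload support at install time. A sketch for CUDA (the CMake flag name has changed across releases, so check the llama-cpp-python README for the flag matching your hardware backend):

```bash
# Rebuild the wheel with CUDA offload enabled.
# Older releases used -DLLAMA_CUBLAS=on instead of -DGGML_CUDA=on.
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```

When starting the server with a GPU build, pass --n_gpu_layers to control how many layers are offloaded (e.g. --n_gpu_layers -1 to offload all of them).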

Step by step

Start the llama-cpp-python server locally with your GGUF model and then query it using the OpenAI SDK with the base_url override for local inference.

python
from openai import OpenAI

# Start the server in a separate terminal:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

# Python client code to query the local server
# api_key can be any non-empty string; the local server ignores it
client = OpenAI(api_key="sk-no-key-required", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello, llama-cpp-python!"}]
)
print(response.choices[0].message.content)
output
Hello, llama-cpp-python! How can I assist you today?
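
The SDK call above is just an HTTP POST to the server's /v1/chat/completions endpoint. A stdlib-only sketch of the same request (the actual network call is left commented out because it needs the server running):

```python
import json
import urllib.request

# Build the same JSON body the OpenAI SDK sends. The model name is
# illustrative; llama-cpp-python serves whatever model it loaded at startup.
payload = {
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello, llama-cpp-python!"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

This is handy for debugging: you can see exactly what crosses the wire, or replay the same body with curl.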

Common variations

  • Use different GGUF models by changing the --model path when starting the server.
  • Run the server on a different port by changing --port.
  • Use async calls by switching to the SDK's AsyncOpenAI client and awaiting client.chat.completions.create() (the old acreate() method was removed in openai 1.0).
python
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key="sk-no-key-required", base_url="http://localhost:8080/v1")
    response = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Async hello!"}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
Async hello! How can I help you today?

Troubleshooting

  • If you get connection errors, ensure the server is running on the specified port and accessible.
  • Check that your model path is correct and the GGUF model is compatible with llama-cpp-python.
  • Use netstat or similar tools to verify the port is open.
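
The port check in the last bullet can also be done from Python with a short stdlib helper (port_open is an illustrative name, not a library function):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check whether the server from this guide is listening.
# print(port_open("localhost", 8080))
```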

Key Takeaways

  • Run the llama-cpp-python server with the --model and --port flags to expose an OpenAI-compatible API.
  • Use the OpenAI Python SDK with base_url pointing to the local server for inference.
  • Async calls and different GGUF models are supported by adjusting server and client parameters.
Verified 2026-04 · llama-3.1-8b, llama-3.1-8b.Q4_K_M.gguf