How-to · Beginner · 3 min read

llama-cpp-python OpenAI-compatible server

Quick answer
Use llama-cpp-python to start a local server that exposes an OpenAI-compatible API. Start it with python -m llama_cpp.server --model ./model.gguf --port 8080, then query it with the OpenAI SDK by setting base_url='http://localhost:8080/v1'.

Prerequisites

  • Python 3.8+
  • pip install llama-cpp-python "openai>=1.0" (quote the version specifier so the shell does not interpret the >)
  • A GGUF format Llama model file

Setup

Install the llama-cpp-python package and prepare a GGUF Llama model file. The server module in llama-cpp-python provides an OpenAI-compatible HTTP API.

bash
pip install llama-cpp-python openai
output
Collecting llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Collecting openai
Installing collected packages: openai
Successfully installed openai-1.0.0
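
If you have a supported GPU, llama-cpp-python can also be built with offload support at install time. A sketch for CUDA (the CMake flag name has changed across releases, so check the llama-cpp-python README for the flag matching your hardware backend):

```bash
# Rebuild the wheel with CUDA offload enabled.
# Older releases used -DLLAMA_CUBLAS=on instead of -DGGML_CUDA=on.
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```

When starting the server with a GPU build, pass --n_gpu_layers to control how many layers are offloaded (e.g. --n_gpu_layers -1 to offload all of them).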

Step by step

Start the llama-cpp-python server locally with your GGUF model and then query it using the OpenAI SDK with the base_url override for local inference.

python
from openai import OpenAI

# Start the server in a separate terminal:
# python -m llama_cpp.server --model ./models/llama-3.1-8b.Q4_K_M.gguf --port 8080

# Python client code to query the local server
# api_key can be any non-empty string; the local server ignores it
client = OpenAI(api_key="sk-no-key-required", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello, llama-cpp-python!"}]
)
print(response.choices[0].message.content)
output
Hello, llama-cpp-python! How can I assist you today?
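
The SDK call above is just an HTTP POST to the server's /v1/chat/completions endpoint. A stdlib-only sketch of the same request (the actual network call is left commented out because it needs the server running):

```python
import json
import urllib.request

# Build the same JSON body the OpenAI SDK sends. The model name is
# illustrative; llama-cpp-python serves whatever model it loaded at startup.
payload = {
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello, llama-cpp-python!"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

This is handy for debugging: you can see exactly what crosses the wire, or replay the same body with curl.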

Common variations

  • Use different GGUF models by changing the --model path when starting the server.
  • Run the server on a different port by changing --port.
  • Use async calls by switching to the SDK's AsyncOpenAI client and awaiting client.chat.completions.create() (the old acreate() method was removed in openai 1.0).
python
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key="sk-no-key-required", base_url="http://localhost:8080/v1")
    response = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Async hello!"}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
Async hello! How can I help you today?

Troubleshooting

  • If you get connection errors, ensure the server is running on the specified port and accessible.
  • Check that your model path is correct and the GGUF model is compatible with llama-cpp-python.
  • Use netstat or similar tools to verify the port is open.
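
The port check in the last bullet can also be done from Python with a short stdlib helper (port_open is an illustrative name, not a library function):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check whether the server from this guide is listening.
# print(port_open("localhost", 8080))
```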

Key Takeaways

  • Run the llama-cpp-python server with the --model and --port flags to expose an OpenAI-compatible API.
  • Use the OpenAI Python SDK with base_url pointing to the local server for inference.
  • Async calls and different GGUF models are supported by adjusting server and client parameters.
Verified 2026-04 · llama-3.1-8b, llama-3.1-8b.Q4_K_M.gguf