How to use vLLM's OpenAI-compatible API
Quick answer
Use the openai Python SDK with the base_url parameter pointing to your running vLLM server (e.g., http://localhost:8000/v1). Call client.chat.completions.create() with your prompt and model name to get completions from vLLM via its OpenAI-compatible API.
Prerequisites
- Python 3.8+
- An OpenAI API key (a dummy value works when querying a local vLLM server)
- pip install "openai>=1.0"
- A vLLM server running with the OpenAI-compatible API (e.g., vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000)
Setup
Install the official openai Python SDK to interact with the vLLM server's OpenAI-compatible API. Ensure you have a running vLLM server exposing the API on http://localhost:8000/v1 or your chosen endpoint.
pip install "openai>=1.0"
Step by step
This example shows how to send a chat completion request to a local vLLM server using the OpenAI-compatible API with the openai SDK.
import os
from openai import OpenAI
# Point the client at the vLLM server; the API key is ignored unless
# the server was started with --api-key
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "unused"),
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM!"}],
)
print(response.choices[0].message.content)
Output
Hello from vLLM! How can I assist you today?
Common variations
- Change
modelto any model your vLLM server supports. - Use the same
openaiSDK for embeddings or completions if your server supports those endpoints. - For async usage, use Python's
asynciowith theopenaiSDK's async client methods. - To serve the model, run:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000.
Troubleshooting
- If you get connection errors, verify that the vLLM server is running and accessible at the base_url.
- If authentication fails, note that vLLM's OpenAI-compatible API does not require a real API key by default; you can pass a dummy key.
- Check the vLLM server logs for errors if responses are empty or malformed.
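For the connection-error case, a quick stdlib-only probe of the server's /v1/models endpoint (a standard listing route on OpenAI-compatible servers) can confirm reachability before digging into logs. The default URL here is an assumption; point it at your own server.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

def check_server(base_url: str = "http://localhost:8000/v1") -> Optional[List[str]]:
    """Return the model IDs served at base_url, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        # OpenAI-compatible servers list served models under "data"
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError, ValueError) as exc:
        print(f"Server not reachable or gave a bad response at {base_url}: {exc}")
        return None

print(check_server())
```

A None result means the server is down, the port is wrong, or the response was not valid JSON; a list of model IDs confirms both connectivity and which model names your client should use.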
Key Takeaways
- Use the official openai Python SDK with base_url pointed at your vLLM server.
- Run the vLLM server with vllm serve <model> --port 8000 to expose the OpenAI-compatible API.
- Pass a dummy API key, since vLLM does not require authentication by default.
- The API call pattern matches OpenAI's chat.completions.create method exactly.
- Troubleshoot by verifying server availability and checking the server logs for errors.