How to use vLLM's OpenAI-compatible API
Quick answer
Use the openai Python SDK with the base_url parameter pointing to your running vLLM server (e.g., http://localhost:8000/v1). Call client.chat.completions.create() with your prompt and model name to get completions from vLLM via its OpenAI-compatible API.
Prerequisites
- Python 3.8+
- An OpenAI API key (a dummy value works when querying a local vLLM server)
- pip install "openai>=1.0"
- A vLLM server running with the OpenAI-compatible API (e.g., vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000)
Setup
Install the official openai Python SDK to interact with the vLLM server's OpenAI-compatible API. Ensure you have a running vLLM server exposing the API on http://localhost:8000/v1 or your chosen endpoint.
pip install "openai>=1.0"
Step by step
This example shows how to send a chat completion request to a local vLLM server using the OpenAI-compatible API with the openai SDK.
import os
from openai import OpenAI
# Point the client at the vLLM server; the API key is ignored unless
# the server was started with --api-key
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "unused"),
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM!"}],
)
print(response.choices[0].message.content)
Output
Hello from vLLM! How can I assist you today?
Common variations
- Change
modelto any model your vLLM server supports. - Use the same
openaiSDK for embeddings or completions if your server supports those endpoints. - For async usage, use Python's
asynciowith theopenaiSDK's async client methods. - To serve the model, run:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000.
Troubleshooting
- If you get connection errors, verify that the vLLM server is running and accessible at the base_url.
- If authentication fails, note that vLLM's OpenAI-compatible API does not require a real API key by default; you can pass a dummy key.
- Check the vLLM server logs for errors if responses are empty or malformed.
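For the connection-error case, a quick stdlib-only probe of the server's /v1/models endpoint (a standard listing route on OpenAI-compatible servers) can confirm reachability before digging into logs. The default URL here is an assumption; point it at your own server.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

def check_server(base_url: str = "http://localhost:8000/v1") -> Optional[List[str]]:
    """Return the model IDs served at base_url, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        # OpenAI-compatible servers list served models under "data"
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError, ValueError) as exc:
        print(f"Server not reachable or gave a bad response at {base_url}: {exc}")
        return None

print(check_server())
```

A None result means the server is down, the port is wrong, or the response was not valid JSON; a list of model IDs confirms both connectivity and which model names your client should use.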
Key Takeaways
- Use the official openai Python SDK with base_url pointed at your vLLM server.
- Run the vLLM server with vllm serve <model> --port 8000 to expose the OpenAI-compatible API.
- Pass a dummy API key, since vLLM does not require authentication by default.
- The API call pattern matches OpenAI's chat.completions.create method exactly.
- Troubleshoot by verifying server availability and checking the server logs for errors.