How-to · Beginner · 3 min read

How to use vLLM with OpenAI Python SDK

Quick answer
Use the OpenAI Python SDK with the base_url parameter pointing to your running vLLM server (e.g., http://localhost:8000/v1). Start the vLLM server via CLI, then send chat completions requests through the SDK as you would with OpenAI's API.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (can be dummy for local vLLM server)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)
  • vLLM installed and running locally

Setup

Install the openai Python package and vLLM. Run the vLLM server locally to serve the model over HTTP.

  • Install OpenAI SDK: pip install openai
  • Install vLLM: pip install vllm
  • Start the vLLM server with a compatible model:

bash
pip install openai vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Step by step

Use the OpenAI Python SDK to query the running vLLM server by setting base_url to http://localhost:8000/v1. This example sends a chat completion request to the local server.

python
import os
from openai import OpenAI

# Initialize client with base_url pointing to the local vLLM server
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "unused"),
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM!"}]
)

print(response.choices[0].message.content)
output
Hello from vLLM! How can I assist you today?
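
Under the hood, the SDK POSTs a JSON body to the server's /v1/chat/completions endpoint. A sketch of the equivalent payload (field names follow the OpenAI chat completions schema):

```python
import json

# Equivalent JSON body the SDK sends to POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from vLLM!"}],
}
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible HTTP client can send this body directly; the SDK simply wraps the request and response handling.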

Common variations

You can target any model served by your vLLM instance by changing the model parameter. For async usage, use the SDK's AsyncOpenAI client with Python's asyncio. Streaming is also supported: pass stream=True and iterate over the returned chunks.

python
import asyncio
import os

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        api_key=os.environ.get("OPENAI_API_KEY", "unused"),
        base_url="http://localhost:8000/v1",
    )
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Async hello from vLLM!"}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
Async hello from vLLM! How can I help you today?

Troubleshooting

  • If you get connection errors, ensure the vLLM server is running on localhost:8000.
  • If responses are empty or errors occur, verify the model name matches one served by vLLM.
  • The SDK requires a non-empty api_key; for a local vLLM server any placeholder string (e.g., "unused") works, unless you started the server with an API key, in which case the key must match.
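
To check the first two issues at once, you can query the server's /v1/models endpoint, which lists the model names it is serving. A stdlib-only sketch (the endpoint is part of vLLM's OpenAI-compatible API):

```python
import json
import urllib.error
import urllib.request

def list_served_models(base_url="http://localhost:8000"):
    """Return the model IDs the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
    return [m["id"] for m in data.get("data", [])]

models = list_served_models()
if models is None:
    print("Server unreachable -- is `vllm serve` running on port 8000?")
else:
    print("Served models:", models)
```

Use the exact ID from this list as the model parameter in your requests.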

Key Takeaways

  • Run the vLLM server locally with the CLI command before querying.
  • Use OpenAI SDK with base_url set to the local vLLM server endpoint.
  • Model names must match those served by your running vLLM instance.
  • Async calls are supported via the SDK's AsyncOpenAI client.
  • Ensure environment variables are set even if API key is unused locally.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct