How to stream responses with vLLM
Quick answer
To stream responses with vLLM, start a local vllm serve server and query it with the openai SDK using stream=True. This streams token-by-token output from the model over HTTP in real time.
Prerequisites
- Python 3.8+
- The OpenAI Python SDK (pip install "openai>=1.0")
- vLLM installed (pip install vllm)
- A local vLLM server running (started with the vllm serve command)
Setup local vLLM server
Install vllm and start the server locally to enable streaming over HTTP. The server listens on port 8000 by default.
pip install vllm
# Start the vLLM server with a model (e.g., llama-3.1-8B-Instruct)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Step-by-step streaming code
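Once the server is running, you can confirm it is reachable by listing the models it serves through the OpenAI-compatible /v1/models endpoint. A minimal sketch using only the standard library; the helper name list_served_models is invented here:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def list_served_models(base_url="http://localhost:8000/v1"):
    """Return the model IDs served at base_url, or an empty list
    if the server is not reachable."""
    try:
        with urlopen(f"{base_url}/models", timeout=5) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except (URLError, OSError):
        return []
```

With the server from the previous step up, this should return a list containing meta-llama/Llama-3.1-8B-Instruct.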
Use the openai Python SDK to connect to the local vLLM server and stream the response token-by-token.
from openai import OpenAI

# Connect to the local vLLM server (no real API key is needed,
# but the SDK requires a non-empty value)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Write a short poem about AI."}]

# Create a streaming chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    stream=True,
)

# Stream and print tokens as they arrive; delta.content can be None
# on the first and final chunks, so fall back to an empty string
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
Output
AI whispers in circuits bright,
Learning fast, day and night,
Dreams in code, thoughts anew,
Infinite worlds it can construe.
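The loop above prints each token and then discards it. If you also need the complete reply (for logging or further processing), collect the deltas as they arrive and join them at the end. A small sketch; accumulate_deltas is a hypothetical helper, not part of the openai SDK:

```python
def accumulate_deltas(deltas):
    """Join streamed content pieces into the full reply.

    delta.content is None on the first (role-only) and final chunks,
    so skip falsy values before joining.
    """
    return "".join(piece for piece in deltas if piece)

# In the streaming loop, append chunk.choices[0].delta.content to a list,
# then join afterward. Simulated here with literal deltas:
streamed = [None, "AI whispers ", "in circuits ", "bright", None]
full_text = accumulate_deltas(streamed)
```

This keeps the real-time printing behavior while still giving you the assembled string afterward.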
Common variations
- Use a different model by changing the model parameter in the request.
- Run the vllm serve server on a different port and update base_url accordingly.
- Make synchronous, non-streaming calls by omitting stream=True (it defaults to False).
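Because vLLM exposes an OpenAI-compatible HTTP API, these variations all reduce to changing fields in the request body, which you can also build without the SDK. A sketch of the JSON body posted to <base_url>/chat/completions; build_chat_request is a name invented here, and the default model is the one used above:

```python
import json

def build_chat_request(prompt,
                       model="meta-llama/Llama-3.1-8B-Instruct",
                       stream=False):
    """Build the JSON body for POST <base_url>/chat/completions.

    Swap `model` to target a different served model; set `stream=True`
    for token-by-token output, or leave it False for a single response.
    """
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    })
```

Changing the port only affects the URL you post to, not this body.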
Troubleshooting streaming issues
- If you see connection errors, verify the vllm serve server is running and reachable at the specified base_url.
- No real API key is required locally, but the SDK rejects a missing key, so pass any placeholder string (e.g. "EMPTY").
- Check firewall or port conflicts that may block streaming HTTP connections.
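When debugging connection errors, it helps to first check whether anything is listening on the server's port at all, before looking at HTTP-level or streaming problems. A standard-library sketch; port_open is a name invented here, and the default port matches the vllm serve example above:

```python
import socket

def port_open(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    A False result points at the server not running, a wrong port,
    or a firewall, rather than an issue in the streaming code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True but streaming still fails, the problem is at the HTTP layer (wrong base_url path, wrong model name) rather than connectivity.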
Key Takeaways
- Start the vLLM server locally with the desired model before streaming.
- Use the OpenAI SDK with stream=True and base_url pointing to the local server.
- Stream tokens in a loop to get real-time output from vLLM.
- No API key is needed for local vLLM server connections.
- Adjust model and server port as needed for your environment.