How to stream responses with vLLM
Quick answer
To stream responses with vLLM, start a local vllm serve server and query it with the openai SDK using stream=True. This streams token-by-token output from the model over HTTP in real time.
Prerequisites
- Python 3.8+
- The OpenAI Python SDK (pip install "openai>=1.0")
- vLLM installed (pip install vllm)
- A local vLLM server running (started with the vllm serve command)
Setup local vLLM server
Install vllm and start the server locally to enable streaming over HTTP. The server listens on port 8000 by default.
pip install vllm
# Start the vLLM server with a model (e.g., llama-3.1-8B-Instruct)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Step-by-step streaming code
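Once the server is running, you can confirm it is reachable by listing the models it serves through the OpenAI-compatible /v1/models endpoint. A minimal sketch using only the standard library; the helper name list_served_models is invented here:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def list_served_models(base_url="http://localhost:8000/v1"):
    """Return the model IDs served at base_url, or an empty list
    if the server is not reachable."""
    try:
        with urlopen(f"{base_url}/models", timeout=5) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except (URLError, OSError):
        return []
```

With the server from the previous step up, this should return a list containing meta-llama/Llama-3.1-8B-Instruct.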
Use the openai Python SDK to connect to the local vLLM server and stream the response token-by-token.
from openai import OpenAI

# Connect to the local vLLM server (no real API key is needed,
# but the SDK requires a non-empty value)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Write a short poem about AI."}]

# Create a streaming chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    stream=True,
)

# Stream and print tokens as they arrive; delta.content can be None
# on the first and final chunks, so fall back to an empty string
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
Output
AI whispers in circuits bright,
Learning fast, day and night,
Dreams in code, thoughts anew,
Infinite worlds it can construe.
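The loop above prints each token and then discards it. If you also need the complete reply (for logging or further processing), collect the deltas as they arrive and join them at the end. A small sketch; accumulate_deltas is a hypothetical helper, not part of the openai SDK:

```python
def accumulate_deltas(deltas):
    """Join streamed content pieces into the full reply.

    delta.content is None on the first (role-only) and final chunks,
    so skip falsy values before joining.
    """
    return "".join(piece for piece in deltas if piece)

# In the streaming loop, append chunk.choices[0].delta.content to a list,
# then join afterward. Simulated here with literal deltas:
streamed = [None, "AI whispers ", "in circuits ", "bright", None]
full_text = accumulate_deltas(streamed)
```

This keeps the real-time printing behavior while still giving you the assembled string afterward.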
Common variations
- Use a different model by changing the model parameter in the request.
- Run the vllm serve server on a different port and update base_url accordingly.
- Make synchronous, non-streaming calls by omitting stream=True (it defaults to False).
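Because vLLM exposes an OpenAI-compatible HTTP API, these variations all reduce to changing fields in the request body, which you can also build without the SDK. A sketch of the JSON body posted to <base_url>/chat/completions; build_chat_request is a name invented here, and the default model is the one used above:

```python
import json

def build_chat_request(prompt,
                       model="meta-llama/Llama-3.1-8B-Instruct",
                       stream=False):
    """Build the JSON body for POST <base_url>/chat/completions.

    Swap `model` to target a different served model; set `stream=True`
    for token-by-token output, or leave it False for a single response.
    """
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    })
```

Changing the port only affects the URL you post to, not this body.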
Troubleshooting streaming issues
- If you see connection errors, verify the vllm serve server is running and reachable at the specified base_url.
- No real API key is required locally, but the SDK rejects a missing key, so pass any placeholder string (e.g. "EMPTY").
- Check firewall or port conflicts that may block streaming HTTP connections.
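When debugging connection errors, it helps to first check whether anything is listening on the server's port at all, before looking at HTTP-level or streaming problems. A standard-library sketch; port_open is a name invented here, and the default port matches the vllm serve example above:

```python
import socket

def port_open(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    A False result points at the server not running, a wrong port,
    or a firewall, rather than an issue in the streaming code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True but streaming still fails, the problem is at the HTTP layer (wrong base_url path, wrong model name) rather than connectivity.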
Key Takeaways
- Start the vLLM server locally with the desired model before streaming.
- Use the OpenAI SDK with stream=True and base_url pointing to the local server.
- Stream tokens in a loop to get real-time output from vLLM.
- No API key is needed for local vLLM server connections.
- Adjust model and server port as needed for your environment.