How to serve Qwen with vLLM
Quick answer

Use vLLM to serve the Qwen model locally by running the `vllm serve` CLI with the Qwen model path or identifier. Then query the running server via the OpenAI SDK by setting `base_url` to `http://localhost:8000/v1` and specifying the served model name in your requests.

Prerequisites

- Python 3.8+
- `pip install vllm openai`
- Qwen model files downloaded locally or accessible
- Basic knowledge of the command line and Python
Setup
Install the vLLM and openai Python packages. Download the Qwen model files locally, or note the correct model identifier if you want vLLM to fetch the model for you.

```shell
pip install vllm openai
```

Step by step
Start the vLLM server with the Qwen model, then query it using the OpenAI-compatible Python client.
```python
from openai import OpenAI

# Step 1: Run the vLLM server in your terminal
# (replace <model_path_or_id> with your Qwen model path or identifier):
#   vllm serve <model_path_or_id> --port 8000

# Step 2: Query the running server from Python.
# vLLM does not validate the API key by default, but the SDK requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen",  # must match the served model name (the path/id passed to vllm serve, or --served-model-name)
    messages=[{"role": "user", "content": "Explain the benefits of vLLM."}]
)
print(response.choices[0].message.content)
```

Output:

```
The benefits of vLLM include efficient batching, low latency, and high throughput for serving large language models locally.
```
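The launch command in Step 1 can be extended with flags; here is a sketch of a fuller invocation, assuming an illustrative Qwen checkpoint (substitute your own path or identifier) and context length. Passing `--served-model-name qwen` is what lets clients refer to the model simply as `"qwen"`:

```shell
# Illustrative launch; the model identifier and context length are assumptions.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --served-model-name qwen \
    --max-model-len 8192
```

Without `--served-model-name`, clients must pass the full model path or identifier as the `model` field.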
Common variations
- Use a different port by changing the `--port` argument in the `vllm serve` command.
- For async Python calls, use `asyncio` with the async OpenAI client.
- Serve other Qwen variants by specifying their exact model path or identifier.
```python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # The async client uses the same create() call, awaited
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = await client.chat.completions.create(
        model="qwen",  # must match the served model name
        messages=[{"role": "user", "content": "What is vLLM?"}]
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
```

Output:

```
vLLM is a high-performance inference server optimized for large language models, enabling fast and efficient local serving.
```
Troubleshooting
- If you see connection errors, ensure the `vllm serve` process is running and accessible on the specified port.
- Check that the model path or identifier is correct and the model files are properly downloaded.
- A 404 "model not found" error from the server usually means the `model` field in your request does not match the served model name.
- Use `netstat` or similar tools to verify the port is open.
Key takeaways

- Run `vllm serve` with the Qwen model to start a local inference server.
- Query the server using the OpenAI SDK with `base_url="http://localhost:8000/v1"` and `model` set to the served model name.
- Use async calls for non-blocking queries with the OpenAI client.
- Verify model path and server port if connection issues arise.