How-to · Beginner · 3 min read

How to serve Qwen with vLLM

Quick answer
Use vLLM to serve a Qwen model locally by running the vllm serve CLI with the model's local path or Hugging Face identifier. Then query the running server via the OpenAI SDK, setting base_url to http://localhost:8000/v1 and passing the served model name in your requests. By default the served name is the same path or identifier you gave vllm serve; the --served-model-name flag lets you shorten it to something like "qwen".

PREREQUISITES

  • Python 3.8+
  • pip install vllm openai
  • Qwen model files downloaded locally or accessible
  • Basic knowledge of command line and Python

Setup

Install the vLLM and openai Python packages. Download the Qwen model files locally, or make sure you have the correct Hugging Face identifier if you want vLLM to fetch the model on first start.

bash
pip install vllm openai
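
If you prefer to pre-download the model rather than let vLLM fetch it at startup, the huggingface_hub CLI can do this. The model ID below is just an example; substitute the Qwen variant you actually want to serve.

```shell
# Pre-download a Qwen model from the Hugging Face Hub into ./qwen-model.
# huggingface-cli is installed with the huggingface_hub package.
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./qwen-model
```

You can then pass ./qwen-model to vllm serve as the model path.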

Step by step

Start the vLLM server with the Qwen model, then query it using the OpenAI-compatible Python client.

python
from openai import OpenAI

# Step 1: Start the vLLM server in a terminal (replace <model_path_or_id>
# with your Qwen model path or identifier). --served-model-name lets you
# refer to the model as "qwen" in requests:
# vllm serve <model_path_or_id> --served-model-name qwen --port 8000

# Step 2: Query the running server from Python. vLLM ignores the API key,
# but the OpenAI SDK requires one, so pass any placeholder string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen",  # must match the served model name
    messages=[{"role": "user", "content": "Explain the benefits of vLLM."}]
)

print(response.choices[0].message.content)
output
The benefits of vLLM include efficient batching, low latency, and high throughput for serving large language models locally.
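
You can also talk to the server without the Python SDK, since vLLM exposes the standard OpenAI-compatible HTTP endpoints. This sketch assumes the server from Step 1 is running on port 8000.

```shell
# List the models the server exposes; the "id" field is the name
# to pass as model= in your requests.
curl http://localhost:8000/v1/models

# Send a chat request directly over HTTP.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hi"}]}'
```

This is a quick way to confirm the server is up and to see the exact served model name before writing any Python.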

Common variations

  • Use different ports by changing the --port argument in the vllm serve command.
  • For async Python calls, use the AsyncOpenAI client with asyncio.
  • Serve other Qwen variants by specifying their exact model path or identifier.
python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # Use the async client; vLLM ignores the API key, but the SDK requires one.
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = await client.chat.completions.create(
        model="qwen",  # must match the served model name
        messages=[{"role": "user", "content": "What is vLLM?"}]
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
output
vLLM is a high-performance inference server optimized for large language models, enabling fast and efficient local serving.
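
Another common variation is streaming, where the server sends the response token by token instead of all at once. A minimal sketch, assuming a server on port 8000 and a served model name of "qwen"; the helper below only consumes the stream, and the commented-out section shows how you would wire it to a live server.

```python
def collect_stream(stream) -> str:
    """Print and concatenate the content deltas from a chat-completion stream."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

# With a live server (requires `from openai import OpenAI`):
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# stream = client.chat.completions.create(
#     model="qwen",
#     messages=[{"role": "user", "content": "What is vLLM?"}],
#     stream=True,
# )
# collect_stream(stream)
```

Streaming is useful for chat UIs, where showing partial output beats waiting for the full completion.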

Troubleshooting

  • If you see connection errors, ensure the vllm serve process is running and accessible on the specified port.
  • Check that the model path or identifier is correct and the model files are properly downloaded.
  • Use netstat, ss, or a request to http://localhost:8000/v1/models to verify the server is listening on the expected port.
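
The port check above can also be done from Python with the standard library, which is handy if you want your client code to fail fast with a clear message. A small sketch; host and port are whatever you passed to vllm serve.

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check whether the vLLM server port is reachable.
print(port_open("localhost", 8000))
```

If this prints False, the server is not up (or is bound to a different port), so SDK calls will fail with connection errors.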

Key Takeaways

  • Run vllm serve with the Qwen model to start a local inference server.
  • Query the server using the OpenAI SDK with base_url="http://localhost:8000/v1" and the served model name (e.g. model="qwen").
  • Use async calls for non-blocking queries with the OpenAI client.
  • Verify model path and server port if connection issues arise.
Verified 2026-04 · qwen