How to serve Qwen with vLLM
Quick answer

Use vLLM to serve the Qwen model locally by running the `vllm serve` CLI with the Qwen model path or identifier. Then query the running server via the OpenAI SDK by setting `base_url` to `http://localhost:8000/v1` and specifying the served model name in your requests.

Prerequisites

- Python 3.8+
- `pip install vllm openai`
- Qwen model files downloaded locally or accessible
- Basic knowledge of the command line and Python
Setup
Install the vLLM and openai Python packages. Download the Qwen model files locally, or note the correct model identifier if you want vLLM to fetch the model for you.

```shell
pip install vllm openai
```

Step by step
Start the vLLM server with the Qwen model, then query it using the OpenAI-compatible Python client.
```python
from openai import OpenAI

# Step 1: Run the vLLM server in your terminal
# (replace <model_path_or_id> with your Qwen model path or identifier):
#   vllm serve <model_path_or_id> --port 8000

# Step 2: Query the running server from Python.
# vLLM does not validate the API key by default, but the SDK requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen",  # must match the served model name (the path/id passed to vllm serve, or --served-model-name)
    messages=[{"role": "user", "content": "Explain the benefits of vLLM."}]
)
print(response.choices[0].message.content)
```

Output:

```
The benefits of vLLM include efficient batching, low latency, and high throughput for serving large language models locally.
```
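The launch command in Step 1 can be extended with flags; here is a sketch of a fuller invocation, assuming an illustrative Qwen checkpoint (substitute your own path or identifier) and context length. Passing `--served-model-name qwen` is what lets clients refer to the model simply as `"qwen"`:

```shell
# Illustrative launch; the model identifier and context length are assumptions.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --served-model-name qwen \
    --max-model-len 8192
```

Without `--served-model-name`, clients must pass the full model path or identifier as the `model` field.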
Common variations
- Use a different port by changing the `--port` argument in the `vllm serve` command.
- For async Python calls, use `asyncio` with the async OpenAI client.
- Serve other Qwen variants by specifying their exact model path or identifier.
```python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # The async client uses the same create() call, awaited
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = await client.chat.completions.create(
        model="qwen",  # must match the served model name
        messages=[{"role": "user", "content": "What is vLLM?"}]
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
```

Output:

```
vLLM is a high-performance inference server optimized for large language models, enabling fast and efficient local serving.
```
Troubleshooting
- If you see connection errors, ensure the `vllm serve` process is running and accessible on the specified port.
- Check that the model path or identifier is correct and the model files are properly downloaded.
- A 404 "model not found" error from the server usually means the `model` field in your request does not match the served model name.
- Use `netstat` or similar tools to verify the port is open.
Key takeaways

- Run `vllm serve` with the Qwen model to start a local inference server.
- Query the server using the OpenAI SDK with `base_url="http://localhost:8000/v1"` and `model` set to the served model name.
- Use async calls for non-blocking queries with the OpenAI client.
- Verify model path and server port if connection issues arise.