# How to serve a Qwen model with vLLM
## Quick answer

Use the `vllm` CLI to serve a Qwen model locally by running `vllm serve Qwen/Qwen-7B-Chat`. Then query the running server with the `openai` Python SDK by setting `base_url="http://localhost:8000/v1"` and calling `client.chat.completions.create` with your prompt.

## Prerequisites

- Python 3.8+
- `pip install vllm openai`
- Qwen model weights downloaded locally or accessible via the Hugging Face Hub
- Port 8000 available for serving
## Set up vLLM and the Qwen model

Install the `vllm` Python package and make sure the Qwen model weights are available locally or accessible via the Hugging Face Hub. The `vllm` package provides a CLI that serves models efficiently with continuous batching and GPU acceleration.
```bash
pip install vllm openai
```

## Step by step: serving and querying
Start the vLLM server for the Qwen model using the CLI, then query it with Python using the OpenAI-compatible SDK.
### Start the vLLM server (run in a terminal)

```bash
vllm serve Qwen/Qwen-7B-Chat --port 8000
```
### Python client code to query the running server

```python
from openai import OpenAI

# vLLM does not require an API key by default, but the SDK insists on one,
# so any placeholder works (unless the server was started with --api-key).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    # The model name must match the path the server was started with,
    # unless you overrode it with --served-model-name.
    model="Qwen/Qwen-7B-Chat",
    messages=[{"role": "user", "content": "Hello, Qwen!"}],
)
print(response.choices[0].message.content)
```

Example output:

```text
Hello, Qwen! How can I assist you today?
```
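The server can take a while to load the model weights, so requests sent too early will fail with connection errors. A small stdlib helper (hypothetical, not part of vLLM or the SDK) can poll the OpenAI-compatible `/v1/models` endpoint until the server answers:

```python
import time
import urllib.error
import urllib.request

def server_ready(base_url: str = "http://localhost:8000/v1",
                 timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll the server's /models endpoint until it responds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

Call `server_ready()` once after launching `vllm serve` and only start sending completions when it returns `True`.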
## Common variations

- Use a different Qwen variant, such as `Qwen/Qwen-14B-Chat`, by changing the CLI model argument.
- Run the server on a custom port with the `--port` flag.
- Use async Python calls with `asyncio` and the `openai` SDK's `AsyncOpenAI` client for concurrency.
- Integrate with other OpenAI-compatible clients by pointing their `base_url` at the vLLM server endpoint.
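The async variation above can be sketched as follows. This is a minimal sketch assuming a server on `localhost:8000` serving `Qwen/Qwen-7B-Chat`; the `ask`/`ask_all` helper names are hypothetical, while `AsyncOpenAI` and `asyncio.gather` are real APIs:

```python
import asyncio

async def ask(client, model: str, prompt: str) -> str:
    """Send one chat completion request and return the reply text."""
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def ask_all(client, model: str, prompts: list[str]) -> list[str]:
    """Fan several prompts out concurrently; vLLM batches them server-side."""
    return await asyncio.gather(*(ask(client, model, p) for p in prompts))

if __name__ == "__main__":
    # AsyncOpenAI ships with the `openai` package (pip install openai).
    from openai import AsyncOpenAI

    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    replies = asyncio.run(
        ask_all(client, "Qwen/Qwen-7B-Chat", ["Hello!", "What is vLLM?"])
    )
    for reply in replies:
        print(reply)
```

Sending prompts concurrently rather than sequentially lets vLLM's continuous batching overlap the requests, which is where most of its throughput advantage comes from.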
## Troubleshooting

- If the server fails to start, verify the Qwen model path, or check your internet connection if the weights need to be downloaded.
- Port conflicts: ensure port 8000 is free, or specify another port with `--port`.
- Timeouts or connection errors: check firewall settings and confirm the server is running.
- For GPU memory errors, reduce the batch size or maximum sequence length, or use a smaller Qwen model variant.
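For the port-conflict case, a few lines of stdlib Python can confirm whether the port is actually free before launching the server (`port_free` is a hypothetical helper, not part of vLLM):

```python
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting TCP connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. something is already listening on that port.
        return s.connect_ex((host, port)) != 0

if __name__ == "__main__":
    if not port_free(8000):
        print("Port 8000 is busy; pass --port <other> to vllm serve.")
```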
## Key takeaways

- Use the `vllm serve` CLI to launch a local Qwen model server efficiently.
- Query the running server with the OpenAI Python SDK by setting `base_url` to `http://localhost:8000/v1`.
- Adjust the model variant and server port via CLI arguments for flexibility.
- Troubleshoot common issues by checking model availability, port conflicts, and GPU resources.