# How to serve a Qwen model with vLLM
## Quick answer

Use the `vllm` CLI to serve a Qwen model locally by running `vllm serve Qwen/Qwen-7B-Chat`. Then query the running server with the `openai` Python SDK by setting `base_url="http://localhost:8000/v1"` and calling `client.chat.completions.create` with your prompt.

## Prerequisites

- Python 3.8+
- `pip install vllm openai`
- Qwen model weights downloaded locally or accessible via the Hugging Face Hub
- Port 8000 available for serving
## Set up vLLM and the Qwen model

Install the `vllm` Python package and make sure the Qwen model weights are available locally or accessible via the Hugging Face Hub. The `vllm` package provides a CLI that serves models efficiently with continuous batching and GPU acceleration.
```bash
pip install vllm openai
```

## Step by step: serving and querying
Start the vLLM server for the Qwen model using the CLI, then query it with Python using the OpenAI-compatible SDK.
### Start the vLLM server (run in a terminal)

```bash
vllm serve Qwen/Qwen-7B-Chat --port 8000
```
### Python client code to query the running server

```python
from openai import OpenAI

# vLLM does not require an API key by default, but the SDK insists on one,
# so any placeholder works (unless the server was started with --api-key).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    # The model name must match the path the server was started with,
    # unless you overrode it with --served-model-name.
    model="Qwen/Qwen-7B-Chat",
    messages=[{"role": "user", "content": "Hello, Qwen!"}],
)
print(response.choices[0].message.content)
```

Example output:

```text
Hello, Qwen! How can I assist you today?
```
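The server can take a while to load the model weights, so requests sent too early will fail with connection errors. A small stdlib helper (hypothetical, not part of vLLM or the SDK) can poll the OpenAI-compatible `/v1/models` endpoint until the server answers:

```python
import time
import urllib.error
import urllib.request

def server_ready(base_url: str = "http://localhost:8000/v1",
                 timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll the server's /models endpoint until it responds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

Call `server_ready()` once after launching `vllm serve` and only start sending completions when it returns `True`.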
## Common variations

- Use a different Qwen variant, such as `Qwen/Qwen-14B-Chat`, by changing the CLI model argument.
- Run the server on a custom port with the `--port` flag.
- Use async Python calls with `asyncio` and the `openai` SDK's `AsyncOpenAI` client for concurrency.
- Integrate with other OpenAI-compatible clients by pointing their `base_url` at the vLLM server endpoint.
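The async variation above can be sketched as follows. This is a minimal sketch assuming a server on `localhost:8000` serving `Qwen/Qwen-7B-Chat`; the `ask`/`ask_all` helper names are hypothetical, while `AsyncOpenAI` and `asyncio.gather` are real APIs:

```python
import asyncio

async def ask(client, model: str, prompt: str) -> str:
    """Send one chat completion request and return the reply text."""
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def ask_all(client, model: str, prompts: list[str]) -> list[str]:
    """Fan several prompts out concurrently; vLLM batches them server-side."""
    return await asyncio.gather(*(ask(client, model, p) for p in prompts))

if __name__ == "__main__":
    # AsyncOpenAI ships with the `openai` package (pip install openai).
    from openai import AsyncOpenAI

    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    replies = asyncio.run(
        ask_all(client, "Qwen/Qwen-7B-Chat", ["Hello!", "What is vLLM?"])
    )
    for reply in replies:
        print(reply)
```

Sending prompts concurrently rather than sequentially lets vLLM's continuous batching overlap the requests, which is where most of its throughput advantage comes from.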
## Troubleshooting

- If the server fails to start, verify the Qwen model path, or check your internet connection if the weights need to be downloaded.
- Port conflicts: ensure port 8000 is free, or specify another port with `--port`.
- Timeouts or connection errors: check firewall settings and confirm the server is running.
- For GPU memory errors, reduce the batch size or maximum sequence length, or use a smaller Qwen model variant.
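For the port-conflict case, a few lines of stdlib Python can confirm whether the port is actually free before launching the server (`port_free` is a hypothetical helper, not part of vLLM):

```python
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting TCP connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. something is already listening on that port.
        return s.connect_ex((host, port)) != 0

if __name__ == "__main__":
    if not port_free(8000):
        print("Port 8000 is busy; pass --port <other> to vllm serve.")
```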
## Key takeaways

- Use the `vllm serve` CLI to launch a local Qwen model server efficiently.
- Query the running server with the OpenAI Python SDK by setting `base_url` to `http://localhost:8000/v1`.
- Adjust the model variant and server port via CLI arguments for flexibility.
- Troubleshoot common issues by checking model availability, port conflicts, and GPU resources.