
How to run Qwen with vLLM

Quick answer
Use the vllm Python package to run a Qwen model such as Qwen/Qwen2.5-7B-Instruct locally: serve it with the vllm serve CLI command, then query it via the openai SDK with base_url="http://localhost:8000/v1". This gives you efficient, low-latency Qwen inference through familiar OpenAI-compatible API calls.

PREREQUISITES

  • Python 3.9+ (recent vLLM releases no longer support 3.8)
  • pip install vllm openai
  • Qwen model weights downloaded locally, or network access so vLLM can fetch them from Hugging Face
  • Basic knowledge of command line and Python

Setup

Install the vllm package and the openai Python SDK. Download the Qwen weights ahead of time or make sure Qwen/Qwen2.5-7B-Instruct is reachable on Hugging Face; vLLM downloads the weights automatically on first launch. The vllm server will host the model for inference.

bash
pip install vllm openai

Step by step

Start the vllm server to serve the Qwen model, then run a Python script to query it using the OpenAI-compatible API.

bash
# Step 1: Start the vLLM server with the Qwen model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

python
# Step 2: Query the running server with the OpenAI-compatible client
from openai import OpenAI

# vLLM does not validate API keys by default, so any placeholder works
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain the benefits of using vLLM with Qwen."}],
)
print(response.choices[0].message.content)
output
The benefits of using vLLM with Qwen include efficient memory usage, fast batch inference, and compatibility with OpenAI API calls, enabling seamless integration into existing workflows.

Common variations

  • Use different Qwen variants by changing the model name in the vllm serve command and the Python client.
  • Enable streaming responses by setting stream=True in the chat.completions.create call.
  • Run asynchronous queries using Python asyncio with the OpenAI SDK.
python
import asyncio

from openai import AsyncOpenAI

async def async_query():
    # Use AsyncOpenAI and await the regular create() call; the SDK has
    # no acreate() method. Any placeholder API key works with vLLM.
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    response = await client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "What is vLLM?"}],
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
output
vLLM is a high-performance inference engine designed for large language models, providing efficient batching and low latency.
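The streaming variation mentioned above can be sketched as follows. This assumes the server from the Setup step is running on localhost:8000; it will not run without it.

```python
from openai import OpenAI

# Any placeholder API key works; vLLM does not validate it by default
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# stream=True yields chunks as tokens are generated,
# instead of one final message
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Streaming is useful for interactive applications where you want to show partial output as soon as the first tokens arrive rather than waiting for the full completion.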

Troubleshooting

  • If the server fails to start, verify the model path and that you have sufficient GPU memory.
  • If you get connection errors, ensure the vllm serve process is running on port 8000 and accessible.
  • For model loading issues, confirm the Qwen model weights are correctly downloaded and compatible with vllm.
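When debugging connection errors, one quick way to rule out client-side problems is to hit the server's standard OpenAI-compatible model-listing endpoint directly (this assumes the default port from the steps above):

```shell
# A JSON response listing the hosted model confirms the vLLM server
# is up and reachable on port 8000
curl http://localhost:8000/v1/models
```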

Key Takeaways

  • Run Qwen locally with vLLM using the CLI vllm serve command.
  • Query the running server via OpenAI-compatible Python SDK with base_url="http://localhost:8000/v1".
  • Use async and streaming options for flexible inference patterns.
  • Ensure model weights and GPU resources are properly configured for smooth operation.
Verified 2026-04 · Qwen/Qwen2.5-7B-Instruct