Code Intermediate medium · 7 min

vLLM serving Qwen

What you will learn

Use vLLM to serve Qwen models with high throughput and batching optimization via a simple HTTP API.

Why this matters

Qwen models are fast, but serving them means handling concurrent requests, memory constraints, and inference optimization. vLLM handles all three for you, so you can focus on your application instead of infrastructure.

Skip if: You're running a single inference in a Python script (just load the model directly), or you need to customize the inference pipeline beyond what vLLM's hooks allow. In those cases, use transformers with your own serving code instead.

Explanation

vLLM is an inference engine optimized for LLMs. It manages batching, the KV cache, and GPU memory to serve requests quickly and efficiently. Instead of loading Qwen yourself and writing a server, you let vLLM wrap the model, handle concurrency, and expose an OpenAI-compatible API endpoint.

Mechanically: vLLM loads the model into GPU memory, listens for HTTP requests, batches multiple requests together, and computes their tokens in parallel. PagedAttention stores the KV cache in fixed-size blocks that are allocated on demand, which reduces memory fragmentation and lets sequences with a common prefix share blocks. Responses can be returned whole or streamed back as they are generated. You specify the model ID (e.g., 'Qwen/Qwen2.5-7B-Instruct'), vLLM downloads and caches it, and you get a production-ready server without writing request-handling code.

When to use: Whenever you're serving Qwen to multiple users or building a chatbot API, vLLM is a standard choice because it is typically faster and simpler than rolling your own.
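To make the PagedAttention idea concrete, here is a toy block allocator. It is a sketch of the general technique only, not vLLM's implementation: the class name `ToyPagedCache`, the block size of 16, and the pool size are all made up for illustration. The point it shows is that a sequence takes a new fixed-size block only when its last block is full, so memory tracks actual length rather than a worst-case reservation.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, not a vLLM constant)


class ToyPagedCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks (hypothetical class)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Record one more token; take a new block only when the last one is full."""
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def blocks_in_use(self):
        return self.num_blocks - len(self.free_blocks)


cache = ToyPagedCache(num_blocks=8)
for _ in range(20):           # sequence "a" holds 20 tokens -> 2 blocks
    cache.append_token("a")
for _ in range(5):            # sequence "b" holds 5 tokens -> 1 block
    cache.append_token("b")
print(cache.blocks_in_use())  # 3
cache.free("a")
print(cache.blocks_in_use())  # 1
```

By contrast, reserving room for a hypothetical 2048-token maximum up front would tie up 128 such blocks per sequence, however short the actual output turned out to be.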

Analogy

Think of vLLM as a restaurant kitchen manager: you send in multiple orders (requests), and it groups them together, cooks them as efficiently as possible, and plates them, instead of cooking one meal at a time and washing dishes between orders.

Code

Illustrative only - requires a CUDA-capable GPU with vLLM installed; no API key is needed for a local server

python

import subprocess
import sys
import time

import requests

model_id = "Qwen/Qwen2.5-1.5B-Instruct"

print(f"[1/3] Starting vLLM server with {model_id}...")
server_process = subprocess.Popen(
    [
        sys.executable,
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--model",
        model_id,
        "--port",
        "8000",
        "--dtype",
        "float16",
        "--gpu-memory-utilization",
        "0.7",
    ],
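    # Piping logs is fine for this short demo, but a long-running server can
    # block once the pipe buffer fills; send logs to a file in real use.
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
)

# A fixed sleep is fragile: load time depends on model size, disk, and GPU.
# Beyond a demo, poll /v1/models until it returns 200 instead.
print("[2/3] Waiting 8 seconds for server to initialize...")
time.sleep(8)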

if server_process.poll() is not None:
    stdout, stderr = server_process.communicate()
    print(f"ERROR: vLLM server failed to start.\nSTDOUT:\n{stdout}\nSTDERR:\n{stderr}")
    sys.exit(1)

print("[3/3] Sending inference request...\n")

try:
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": model_id,
            "messages": [{"role": "user", "content": "What is 2+2?"}],
            "temperature": 0.7,
            "max_tokens": 50,
        },
        timeout=30,
    )
    
    if response.status_code == 200:
        result = response.json()
        assistant_message = result["choices"][0]["message"]["content"]
        print("User: What is 2+2?")
        print(f"Qwen: {assistant_message}")
        print(f"\nTokens used: {result['usage']['total_tokens']}")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
finally:
    print("\nShutting down server...")
    server_process.terminate()
    try:
        server_process.wait(timeout=5)
    except subprocess.TimeoutExpired:
        server_process.kill()

Output

[1/3] Starting vLLM server with Qwen/Qwen2.5-1.5B-Instruct...
[2/3] Waiting 8 seconds for server to initialize...
[3/3] Sending inference request...

User: What is 2+2?
Qwen: 2 + 2 = 4

Tokens used: (varies by run; counts the templated prompt plus the completion)

Shutting down server...

What just happened?

The code spawned a vLLM subprocess that loaded Qwen2.5-1.5B-Instruct into GPU memory in float16 precision (a dtype choice, not quantization) and listened on port 8000. The server received a chat completion request via the OpenAI-compatible /v1/chat/completions endpoint, ran inference on your GPU, and returned the response with token usage stats. The script then shut the server down cleanly.
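The request above waits for the whole completion, but the Explanation notes that the server can also stream. As a minimal sketch, assuming the server emits OpenAI-style server-sent events (lines of the form `data: {...}` ending with `data: [DONE]`), the helper below extracts the text fragments. The function name `iter_stream_text` and the sample lines are hand-written for illustration, not captured from a real server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines (hypothetical helper)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample lines in the assumed streaming shape.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "2 + 2"}}]}',
    b'data: {"choices": [{"delta": {"content": " = 4"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # 2 + 2 = 4
```

With requests, this would be used by adding "stream": True to the JSON body, passing stream=True to requests.post, and feeding response.iter_lines() to the helper.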

Common gotcha

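vLLM needs time to load the model before it accepts requests, often tens of seconds or more depending on model size, disk speed, and GPU. If you send requests before it is ready, you'll get connection refused errors. Also, the first request is usually slower than later ones because of one-time warm-up work, so don't benchmark with a single request. And if GPU memory is tight, vLLM does not quietly spill the model to CPU: startup fails with an out-of-memory error or a complaint that no memory is left for the KV cache. In that case, lower --max-model-len, use a quantized checkpoint, or pick a smaller model. Lower --gpu-memory-utilization only if other processes need room on the same GPU.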
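Rather than sleeping for a fixed time, a script can poll the server until it answers. The sketch below uses only the standard library and checks the /v1/models endpoint; the function names, the 120-second default timeout, and the injectable `probe` argument are illustrative choices rather than part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx status


def wait_for_server(base_url, timeout_s=120.0, interval_s=1.0, probe=server_is_ready):
    """Poll until probe(base_url) succeeds; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(interval_s)
    return False


# Demo with a fake probe that succeeds on the third attempt (no server needed).
attempts = {"n": 0}


def fake_probe(_url):
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_for_server("http://localhost:8000", timeout_s=5, interval_s=0.01, probe=fake_probe))  # True
```

In the script above, a call such as wait_for_server("http://localhost:8000") would take the place of time.sleep(8), with the script exiting if it returns False.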

Error recovery

ConnectionRefusedError

The server hasn't finished starting. Increase the sleep time from 8 to 15 seconds or more, or, more robustly, use a polling loop that checks the /v1/models endpoint until it returns 200.

OutOfMemoryError (CUDA)

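Your GPU doesn't have enough VRAM. Switching to --dtype bfloat16 won't help, because bfloat16 uses the same 2 bytes per parameter as float16. Instead, lower --max-model-len to shrink the KV cache, use a quantized checkpoint (for example an AWQ or GPTQ variant), or use a smaller model like Qwen2.5-0.5B-Instruct. Running on CPU requires a CPU build of vLLM; removing --dtype float16 does not make a GPU build fall back to CPU.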

No module named 'vllm'

vLLM isn't installed in the Python environment you're running. Run 'pip install vllm'; the OpenAI-compatible server ships with the base package, so no extra is needed.

ModuleNotFoundError: No module named 'transformers'

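vLLM depends on transformers and installs it (along with torch) automatically, so this error usually means a broken or mixed environment. Reinstall vLLM in a clean virtual environment rather than adding torch by hand, which can pull in a version vLLM wasn't built against.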

Model not found / 404

The model ID is invalid or not on Hugging Face. Verify the model exists and you're using the exact repo path (e.g., Qwen/Qwen2.5-7B-Instruct, not Qwen/qwen2.5-7b).
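For the OutOfMemoryError case, a back-of-the-envelope estimate shows what a dtype or quantization change actually buys. This sketch covers weights only, using nominal bytes per parameter; it ignores the KV cache, activations, and the overhead real quantized checkpoints carry, and the 1.5e9 parameter count is a rounded assumption for Qwen2.5-1.5B.

```python
# Nominal storage per parameter; real quantized checkpoints add some overhead.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,  # same width as float16, so no memory saving
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float16", "bfloat16", "int4"):
    # 1.5e9 is a rounded parameter count (assumption for illustration).
    print(f"{dtype:>8}: {weight_memory_gib(1.5e9, dtype):.2f} GiB")
# float16 and bfloat16 both print 2.79 GiB; int4 prints 0.70 GiB
```

Whatever the weights leave free within the --gpu-memory-utilization share is what vLLM has available for the KV cache.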

Experienced dev note

In production, don't spawn vLLM as a subprocess of your app. Run it as a separate service (systemd, Docker, K8s). This decouples serving from your application logic, lets you restart the server without killing your app, and makes scaling easier. Also note that --swap-space (in GiB per GPU) reserves CPU RAM, not disk, which vLLM can use to swap out KV-cache blocks under memory pressure. Finally, check --max-model-len against your actual requests: if users send 4K-token prompts but you set --max-model-len 512, those requests are rejected with an error saying the input exceeds the model's context length. They are not truncated.
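A client can catch the --max-model-len mismatch before sending anything. The helper below is a sketch under one assumption: that the server counts prompt tokens plus the requested max_tokens against the context limit. The function name is hypothetical, and the prompt token count would come from the model's tokenizer or from the usage field of an earlier response.

```python
def check_context_budget(prompt_tokens, max_tokens, max_model_len):
    """Return (ok, message): does prompt + completion fit the context window?

    Assumes the limit applies to prompt_tokens + max_tokens (illustrative rule).
    """
    requested = prompt_tokens + max_tokens
    if requested <= max_model_len:
        return True, f"ok: {requested} of {max_model_len} tokens"
    return False, (
        f"too long: {requested} tokens requested "
        f"({prompt_tokens} prompt + {max_tokens} completion), limit {max_model_len}"
    )


print(check_context_budget(4000, 256, 512))  # the 4K-prompt case: does not fit
print(check_context_budget(300, 50, 512))    # fits
```

Failing fast like this gives the caller a clearer message than a rejected HTTP request, and makes it obvious when the server's limit needs raising.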

Check your understanding

Why does vLLM's batching make serving Qwen faster than just calling model.generate() in a loop for each request, and what happens if you set --gpu-memory-utilization too high?

Answer hint

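A correct answer explains that vLLM runs many sequences through the GPU together, adding and removing requests from the batch as they arrive and finish, and uses PagedAttention to reduce KV-cache memory fragmentation so more sequences fit at once. A model.generate() loop handles one request at a time and leaves the GPU underused. Setting --gpu-memory-utilization too high leaves no headroom for other processes or temporary allocations on the GPU, which can cause out-of-memory errors and tank throughput.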
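To observe the batching effect from the client side, requests have to arrive concurrently. The harness below is a sketch: `run_benchmark` and `fake_request` are made-up names, and the fake request only sleeps, so it demonstrates the measurement method (one warm-up call, then timed concurrent calls) rather than real vLLM throughput.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(send_request, num_requests, concurrency):
    """Time num_requests calls at the given concurrency; return requests per second."""
    send_request(0)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(send_request, range(1, num_requests + 1)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed


def fake_request(i):
    """Stand-in for an HTTP call to the server: each request 'takes' 50 ms."""
    time.sleep(0.05)
    return i


sequential_rps = run_benchmark(fake_request, num_requests=8, concurrency=1)
parallel_rps = run_benchmark(fake_request, num_requests=8, concurrency=8)
print(f"sequential: {sequential_rps:.1f} req/s, concurrent: {parallel_rps:.1f} req/s")
```

Replacing fake_request with a function that posts to /v1/chat/completions would let you compare concurrency=1 against higher values on a real server; the gap between them is what the server's batching contributes.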

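VERSION vLLM 0.4.0+ (Jan 2025) supports Qwen2.5 out of the box. Qwen2.5-Coder models work identically. Contexts beyond the default 32K window rely on YaRN rope scaling set in the model's config, not linear scaling. Use --tensor-parallel-size to shard models that don't fit on one GPU (such as the 72B variant) across several GPUs; the flag is not tied to a particular vLLM version.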

Next: optimize Qwen inference with quantization. Learn how to run Qwen2.5 in 4-bit via bitsandbytes to cut weight memory by roughly 75% compared with 16-bit while keeping quality close.
