vLLM serving Qwen
Why this matters
Qwen models are fast, but serving them to real users means handling concurrent requests, memory constraints, and inference optimization. vLLM handles most of that work for you, so you can focus on your application instead of infrastructure.
Explanation
vLLM is an inference engine optimized for LLMs. It manages batching, the KV cache, and GPU memory to serve requests with high throughput. Instead of loading Qwen yourself and writing a server, you let vLLM wrap the model, handle concurrency, and expose an OpenAI-compatible API endpoint.

Mechanically: vLLM loads the model into GPU memory, listens for HTTP requests, and batches requests continuously, so new requests join the running batch as earlier ones finish. Its PagedAttention mechanism stores the KV cache in fixed-size blocks, which reduces memory fragmentation and lets sequences that share a prefix share blocks. Responses can be streamed back token by token. You specify the model ID (e.g., 'Qwen/Qwen2.5-7B-Instruct'), vLLM downloads and caches it, and you get a working server without writing request-handling code.

When to use: whenever you're serving Qwen to multiple users or building a chatbot API, vLLM is a common choice because it is usually faster and simpler than rolling your own server.
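To make the paged KV-cache idea concrete, here is a toy sketch in Python. It is not vLLM's implementation: the class name `ToyPagedCache`, its methods, and the block size of 4 tokens are all made up for illustration. It shows only the bookkeeping, in which each sequence receives fixed-size blocks on demand and returns them to a shared pool when it finishes.

```python
class ToyPagedCache:
    """Toy bookkeeping for a paged KV cache (illustrative, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new physical block is needed only when the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedCache(num_blocks=8, block_size=4)
for _ in range(5):   # sequence "a" stores 5 tokens, so it needs 2 blocks
    cache.append_token("a")
for _ in range(3):   # sequence "b" stores 3 tokens, so it needs 1 block
    cache.append_token("b")
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]))  # 2 1
cache.free("a")
print(len(cache.free_blocks))  # 7
```

Because blocks are allocated one at a time, a sequence wastes at most one partially filled block, which is why paging lets more sequences fit in the same memory than reserving one large contiguous region per request.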
Analogy
Think of vLLM as a restaurant kitchen manager: you send in multiple orders (requests), and it groups them together, cooks as efficiently as possible, and plates them out, instead of cooking one meal at a time and washing dishes between orders.
Code
import subprocess
import sys
import time

import requests

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
base_url = "http://localhost:8000"
log_path = "vllm_server.log"

print(f"[1/3] Starting vLLM server with {model_id}...")
# Send server logs to a file: an unread PIPE can fill up and stall the server.
log_file = open(log_path, "w")
server_process = subprocess.Popen(
    [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id,
        "--port", "8000",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.7",
    ],
    stdout=log_file,
    stderr=subprocess.STDOUT,
)

try:
    print("[2/3] Waiting for server to become ready...")
    deadline = time.monotonic() + 300  # give up after 5 minutes
    while True:
        # Stop early if the server process has already exited.
        if server_process.poll() is not None:
            print(f"ERROR: vLLM server failed to start. See {log_path}.")
            sys.exit(1)
        try:
            # /health returns 200 once the server can accept requests.
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                break
        except requests.exceptions.RequestException:
            pass  # not listening yet
        if time.monotonic() > deadline:
            print(f"ERROR: server not ready after 300 seconds. See {log_path}.")
            sys.exit(1)
        time.sleep(2)

    print("[3/3] Sending inference request...\n")
    response = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": model_id,
            "messages": [{"role": "user", "content": "What is 2+2?"}],
            "temperature": 0.7,
            "max_tokens": 50,
        },
        timeout=30,
    )
    if response.status_code == 200:
        result = response.json()
        assistant_message = result["choices"][0]["message"]["content"]
        print("User: What is 2+2?")
        print(f"Qwen: {assistant_message}")
        print(f"\nTokens used: {result['usage']['total_tokens']}")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
finally:
    # Always stop the server, even if startup or the request failed.
    print("\nShutting down server...")
    server_process.terminate()
    try:
        server_process.wait(timeout=5)
    except subprocess.TimeoutExpired:
        server_process.kill()
    log_file.close()

Sample output (exact wording and token counts will vary):

[1/3] Starting vLLM server with Qwen/Qwen2.5-1.5B-Instruct...
[2/3] Waiting for server to become ready...
[3/3] Sending inference request...

User: What is 2+2?
Qwen: 2 + 2 = 4

Tokens used: 8

Shutting down server...
What just happened?
The code spawned a vLLM subprocess that loaded Qwen2.5-1.5B-Instruct into GPU memory in float16 precision (a data type, not a quantization scheme) and listened on port 8000. The script then sent a chat completion request to the OpenAI-compatible /v1/chat/completions endpoint; vLLM ran inference on your GPU and returned the response with token usage stats. Finally, the script shut the server down cleanly.
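Because the server batches whatever requests are in flight, a client only benefits if it actually sends requests concurrently. The sketch below fans prompts out over a thread pool using only the standard library. The helper names `ask` and `ask_many` are hypothetical, and the URL and model ID are simply the ones from the example above. `ask_many` takes the sending function as a parameter so it can be exercised without a running server.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"  # endpoint from the example
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"


def ask(prompt: str) -> str:
    """Send one chat completion request and return the reply text."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
    }
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def ask_many(prompts, send=ask, max_workers=8):
    """Send all prompts concurrently; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# With the server running:
#   replies = ask_many(["What is 2+2?", "Name a prime number."])
```

Threads are enough here because each worker spends its time waiting on the network; the parallel compute happens inside the server.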
Common gotcha
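vLLM needs time to start before it accepts requests: it has to load the weights and warm up, which can take anywhere from several seconds to a few minutes depending on the model, disk, and GPU (longer still if the model must be downloaded first). Requests sent before then fail with connection refused errors, so poll the server's /health endpoint rather than relying on a fixed sleep. Also, the first few requests can be slower than steady state, so don't benchmark using just one request. Finally, if GPU memory is tight, vLLM typically fails at startup with a CUDA out-of-memory error or a complaint that there is no room for the KV cache, rather than quietly spilling the model to CPU. In that case, lower --max-model-len, pick a smaller or quantized model, or adjust --gpu-memory-utilization: lower it if other processes share the GPU, raise it if vLLM has the GPU to itself.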
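When memory is tight, it helps to estimate how much room the KV cache needs. The sketch below uses the standard per-token formula (one key and one value tensor per layer). The function names are hypothetical, and the architecture numbers in the example call are placeholders rather than values taken from this lesson; read the real ones from the model's config.json.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: key + value for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """GiB of KV cache needed to hold `tokens` tokens across all sequences."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return tokens * per_token / 2**30


# Placeholder architecture (assumed, not from the lesson): 32 layers,
# 8 KV heads, head dimension 128, float16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                            # 131072 bytes per token
print(kv_cache_gib(16 * 4096, 32, 8, 128))  # 8.0 GiB for 16 sequences of 4096 tokens
```

Comparing that figure with the memory left after the weights are loaded gives a rough sense of how many concurrent long sequences a given --max-model-len can support.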
Error recovery
ConnectionRefusedError
OutOfMemoryError (CUDA)
No module named 'vllm'
ModuleNotFoundError: No module named 'transformers'
Model not found / 404
Experienced dev note
In production, don't launch vLLM as a subprocess of your app; run it as a separate service (systemd, Docker, K8s). This decouples serving from your application logic, lets you restart the server without killing your app, and makes scaling easier. Also, in vLLM versions that support it, --swap-space sets how much CPU RAM (in GiB per GPU) vLLM may use as swap for KV-cache blocks; it is host memory, not disk, so size it against the RAM you actually have. Finally, check --max-model-len against your real traffic: if users send 4K-token prompts but you set --max-model-len 512, those requests are rejected with an error saying the input is too long (typically an HTTP 400), so your client has to handle that case explicitly.
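A cheap guard against length rejections is to check the token budget before sending. The sketch below assumes you already know the prompt's token count (for example, from the model's tokenizer) and that the server requires prompt tokens plus requested output tokens to fit within max_model_len. The function names `fits_context` and `clamp_max_tokens` are hypothetical.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> bool:
    """True if the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


def clamp_max_tokens(prompt_tokens: int, wanted: int, max_model_len: int) -> int:
    """Largest completion budget (up to `wanted`) that still fits; 0 if none does."""
    return max(0, min(wanted, max_model_len - prompt_tokens))


print(fits_context(4000, 50, 512))      # False: the prompt alone is too long
print(fits_context(400, 50, 512))       # True
print(clamp_max_tokens(400, 200, 512))  # 112
```

A result of 0 from `clamp_max_tokens` means the prompt itself must be shortened or the server restarted with a larger --max-model-len.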
Check your understanding
Why does vLLM's batching make serving Qwen faster than just calling model.generate() in a loop for each request, and what happens if you set --gpu-memory-utilization too high?
Answer hint
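A correct answer explains that batching runs many sequences through each forward pass, so the cost of reading the model weights from GPU memory is shared across requests, and that PagedAttention reduces KV-cache fragmentation so more sequences fit in a batch. A per-request model.generate() loop processes one sequence at a time and leaves most of the GPU idle. Setting --gpu-memory-utilization too high leaves no headroom for other processes or allocation spikes, which can cause CUDA out-of-memory errors at startup or under load.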