Model warm-up time
Why this matters
Without warm-up, the first inference request after a server start can take 5-10x longer than a steady-state request, because one-time initialization work still runs on the request path. In production, this means the first user after a restart waits several seconds while everyone after them gets sub-100ms responses. Warm-up moves this cold-start penalty out of user-facing traffic.
Explanation
vLLM loads the model weights and reserves GPU memory while the server starts, before it accepts connections. Depending on version and configuration, some one-time work can still be deferred until requests actually run, for example just-in-time kernel compilation and other lazily built caches. This work happens once per server process. Subsequent requests reuse the compiled kernels and allocated memory, running 5-10x faster in the scenario described here.
Warm-up is the practice of sending one or more dummy inference requests immediately after starting the server, before accepting real traffic. This forces the remaining initialization to happen in a controlled way, so production requests arrive at a fully initialized system. The warm-up response itself is discarded; only the side effect of having run the request matters. For most production deployments, warm-up costs 10-30 seconds upfront and removes an unpredictable cold-start penalty from user requests.
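To see how large the cold-start effect is on your own hardware, one option is to time the first few requests after a restart. The sketch below assumes an OpenAI-compatible server reachable at `BASE_URL`; `BASE_URL`, `MODEL`, `time_request`, and `compare_latency` are illustrative names, not part of vLLM.

```shell
#!/bin/bash
# Sketch: time individual completion requests to compare cold vs. warm latency.
# BASE_URL and MODEL are assumptions; set them to match your deployment.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-your-model-name}"

# time_request PROMPT
# Prints the total request time in seconds as reported by curl (time_total).
# The prompt is embedded in JSON unescaped, so it must not contain quotes or backslashes.
time_request() {
  local prompt="$1"
  curl -s -o /dev/null -w '%{time_total}' \
    -X POST "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"$prompt\", \"max_tokens\": 10}"
}

# compare_latency N
# Sends N identical requests one after another and prints one timing line per request.
compare_latency() {
  local n="${1:-3}" i=1
  while [ "$i" -le "$n" ]; do
    echo "request $i: $(time_request "Once upon a time")s"
    i=$((i + 1))
  done
}
```

Running `compare_latency 3` right after a restart should show whether the first request is noticeably slower than the following ones; if it is not, warm-up may matter less for your setup.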
Configuration
#!/bin/bash
# Start vLLM server in the background
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-log-requests &
SERVER_PID=$!
# Wait for the server to be ready by polling its health endpoint
echo "Waiting for server to start..."
until curl -sf -o /dev/null http://localhost:8000/health; do
  # Give up if the server process exited during startup
  kill -0 "$SERVER_PID" 2>/dev/null || { echo "Server exited during startup" >&2; exit 1; }
  sleep 2
done
# Warm up the model with a dummy request; the response body is discarded
echo "Warming up model..."
curl -sf -o /dev/null -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 10,
    "temperature": 0.7
  }' \
  -w "Warmup latency: %{time_total}s\n"
echo "Model is now warm. Server ready for production traffic."
# Keep server running
wait $SERVER_PID
Why this order?
Start the server first, wait until it is listening on the port, then send the warm-up request. A request sent before the server is listening fails with a connection error. A fixed sleep only guesses at the startup time, which varies with model size and disk speed; polling the server's health endpoint until it responds is more reliable.
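The ordering above depends on knowing when the server is actually ready. A minimal sketch of a readiness wait with an upper bound follows, assuming the server answers `GET /health` once it is up (vLLM's OpenAI-compatible server exposes this path); `HEALTH_URL` and `wait_for_ready` are illustrative names.

```shell
#!/bin/bash
# Sketch: poll a health endpoint until it answers, giving up after a fixed number of tries.
# HEALTH_URL is an assumption; adjust host and port to your deployment.
HEALTH_URL="${HEALTH_URL:-http://localhost:8000/health}"

# wait_for_ready MAX_ATTEMPTS INTERVAL_SECONDS
# Returns 0 as soon as one health check succeeds, 1 if every attempt fails.
wait_for_ready() {
  local max_attempts="${1:-60}" interval="${2:-2}" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if curl -sf -o /dev/null "$HEALTH_URL"; then
      echo "Server ready after $attempt attempt(s)"
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Server not ready after $max_attempts attempts" >&2
  return 1
}
```

A startup script could call `wait_for_ready 90 2 || exit 1` before sending the warm-up request, so that a server that never comes up fails the script instead of hanging it.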
Wrong vs Right
# Wrong: sending production traffic with no warm-up phase
vllm serve meta-llama/Llama-3.1-8B-Instruct &
# ...as soon as the server is listening, real requests are routed to it
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "user query"}'
# Result: first request takes 12 seconds, subsequent requests take 0.3 seconds

# Right: explicit warm-up phase before production traffic
vllm serve meta-llama/Llama-3.1-8B-Instruct &
# Wait until the server responds to health checks
until curl -sf -o /dev/null http://localhost:8000/health; do sleep 2; done
# Send warm-up request (throw away response)
curl -s -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "test", "max_tokens": 10}' > /dev/null
echo "Warm-up complete. Ready for traffic."
# Now production requests hit a fully-initialized GPU
Tool vitals
vllm serve with --disable-log-requests, then call /v1/completions while monitoring latency
None: warm-up is handled via startup scripts or orchestration layer
curl with time measurement before and after warm-up
Integration notes
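In Kubernetes, run vLLM in a Deployment or StatefulSet with a startup probe that passes only after warm-up has completed, so the pod is not marked Ready and receives no traffic until then. In Docker Compose, run the warm-up script as a separate one-shot service that starts after the vLLM container is up (for example via depends_on with a health check). Behind a load balancer (NGINX, HAProxy), point health checks at a signal that turns healthy only after warm-up, so that only fully initialized replicas receive traffic. vLLM has no dedicated warm-up endpoint, so that signal has to come from your startup script or orchestration layer.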
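For the Kubernetes case, one way to fold warm-up into a startup probe is an exec probe command that succeeds only after a warm-up request has completed. The sketch below is hedged: it assumes `curl` is available inside the container, and `WARM_MARKER`, `BASE_URL`, `MODEL`, and `warmup_probe` are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Sketch of a probe command: succeed only once a warm-up request has gone through.
# WARM_MARKER, BASE_URL and MODEL are illustrative assumptions.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-your-model-name}"
WARM_MARKER="${WARM_MARKER:-/tmp/vllm-warm}"

warmup_probe() {
  # Already warmed in this container: report success immediately.
  [ -f "$WARM_MARKER" ] && return 0

  # Server not answering yet: fail so the orchestrator retries the probe later.
  curl -sf -o /dev/null "$BASE_URL/health" || return 1

  # Send one small warm-up request; record success only if it returns a 2xx status.
  curl -sf -o /dev/null -X POST "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"warm-up\", \"max_tokens\": 8}" \
    || return 1

  touch "$WARM_MARKER"
}
```

Called as the last line of a probe script, `warmup_probe` supplies the probe's exit status. The probe timeout has to be long enough to cover one warm-up request. The marker is kept in the container's own filesystem on the expectation that a recreated container starts without it and is warmed again. Both points are assumptions to verify against your cluster configuration.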
Migration path
If you move to a different inference server (TensorRT-LLM, Ollama, Hugging Face Text Generation Inference), check how it handles warm-up: each either has its own mechanism or doesn't need one. A vLLM warm-up script is not portable, so you'll need to rewrite the startup script for the new tool. However, the principle (eliminating cold-start latency) remains universal.
Common gotcha
Warm-up requests still consume GPU memory and must run to completion before their output is discarded. If the warm-up prompt is too long or max_tokens is too high, warm-up becomes slow and, on a tightly provisioned GPU, can itself fail with an out-of-memory error. Keep warm-up requests tiny (10-20 tokens max). Also, the warm state only lasts for the lifetime of the server process: after any restart (crash, update, redeployment), the next user hits cold-start latency again unless warm-up is repeated. Use orchestration (Kubernetes, systemd, Docker) to re-run the warm-up script automatically whenever the server restarts.
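One way to keep warm-up requests small by construction is to cap max_tokens where the request body is built. This is a minimal sketch; `WARMUP_TOKEN_CAP` and `build_warmup_payload` are hypothetical names, and the cap of 20 simply mirrors the guideline above.

```shell
#!/bin/bash
# Sketch: build a warm-up request body whose max_tokens can never exceed a cap.
# WARMUP_TOKEN_CAP is an illustrative setting, not a vLLM option.
WARMUP_TOKEN_CAP="${WARMUP_TOKEN_CAP:-20}"

# build_warmup_payload MODEL REQUESTED_MAX_TOKENS
# Prints a JSON body; values above the cap are reduced to the cap.
build_warmup_payload() {
  local model="$1" tokens="${2:-10}"
  if [ "$tokens" -gt "$WARMUP_TOKEN_CAP" ]; then
    tokens="$WARMUP_TOKEN_CAP"
  fi
  printf '{"model": "%s", "prompt": "warm-up", "max_tokens": %s}\n' "$model" "$tokens"
}
```

The output can be passed to curl with `-d "$(build_warmup_payload "$MODEL" 10)"`, so a mistyped or copied-over token count cannot turn warm-up into a heavy request.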
Team adoption
Add warm-up to your vLLM startup script as the default behavior: make it opt-out, not opt-in. Document in your runbooks that restarting the server takes up to 30 extra seconds (for warm-up) before it is ready. In on-call documentation, note that if users report 5-10x latency variance, the first thing to check is recent server restarts; the likely cause is a missing warm-up step. Test warm-up in your staging environment before rolling out to production so you know the actual warm-up time for your hardware and model.
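An opt-out default can be as small as a wrapper that runs the warm-up command unless a flag is set. The sketch below uses a hypothetical `SKIP_WARMUP` environment variable and a hypothetical `maybe_warmup` helper; neither comes from vLLM.

```shell
#!/bin/bash
# Sketch: warm-up runs by default and is skipped only when SKIP_WARMUP=1 is set.
# SKIP_WARMUP is an illustrative variable name.

# maybe_warmup COMMAND [ARGS...]
# Runs the given warm-up command and returns its exit status, unless skipping was requested.
maybe_warmup() {
  if [ "${SKIP_WARMUP:-0}" = "1" ]; then
    echo "Warm-up skipped (SKIP_WARMUP=1)"
    return 0
  fi
  "$@"
}
```

In a startup script this would wrap whatever command performs the warm-up, for example `maybe_warmup ./warmup.sh`, where `warmup.sh` stands in for your own script.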
Experienced dev note
Set --gpu-memory-utilization 0.9 when you run vllm serve (0.9 is also vLLM's default), then warm up with a realistic level of concurrency, not a single request. The OpenAI-compatible API has no batch_size parameter; vLLM forms batches from requests that are in flight at the same time. If your production workload typically has 4-8 concurrent requests, send 4 warm-up requests in parallel so the batched execution path is exercised. A single-request warm-up may leave that path cold, so your first batched requests can still be slow. Also, use --disable-log-requests during startup to keep logs clean while warm-up happens.
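A warm-up that exercises batching can be sketched by launching several requests in the background and waiting for all of them. `BASE_URL`, `MODEL`, and `warmup_concurrent` are illustrative names, and whether the requests actually share a batch depends on the server's scheduling.

```shell
#!/bin/bash
# Sketch: send N warm-up requests at the same time so the server sees concurrent load.
# BASE_URL and MODEL are assumptions; set them to match your deployment.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-your-model-name}"

# warmup_concurrent N
# Starts N requests in parallel, waits for all, and returns non-zero if any failed.
warmup_concurrent() {
  local n="${1:-4}" i=1 pids="" pid failed=0
  while [ "$i" -le "$n" ]; do
    curl -sf -o /dev/null -X POST "$BASE_URL/v1/completions" \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"warm-up request $i\", \"max_tokens\": 10}" &
    pids="$pids $!"
    i=$((i + 1))
  done
  # Collect each background request's exit status.
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$((n - failed))/$n warm-up requests succeeded"
  [ "$failed" -eq 0 ]
}
```

Choosing N close to the typical production concurrency (4 in the example above) keeps warm-up representative without making it heavy.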
Check your understanding
Why does sending a warm-up request with max_tokens=500 and a 2000-token prompt still not guarantee your next user request will be fast?
Answer hint
Warm-up initializes a specific code path. If your warm-up uses single requests but production uses batching (or vice versa), the code paths are different and the new path will still be cold.