Tool Beginner easy · 5 min concept

Model warm-up time

What you will learn

Warm up vLLM models before production traffic to eliminate first-request latency spikes.

Why this matters

Without warm-up, your first inference request can take 5-10x longer than steady state. vLLM loads the weights and allocates its KV cache during startup, but some work can still be deferred to the first real request, for example kernels that are compiled the first time they are used. In production, this means your first user waits several seconds while later users get much faster responses. Warm-up moves this cold-start penalty out of the user's path.

Skip if: you're running one-off batch jobs or testing locally, where warm-up only adds startup time, or your request volume is low enough that the first-request delay doesn't matter.

Explanation

When vLLM starts, it loads the model weights from disk into GPU memory and allocates the KV cache before the HTTP server begins accepting requests. Some initialization can still be left for the first inference request, for example kernels that are compiled on first use and code paths that have not run yet. This work happens once per server process. Subsequent requests reuse the cached kernels and allocated memory, which is where the 5-10x gap between the first request and steady state comes from.

Warm-up is the practice of sending one or more dummy inference requests immediately after the server starts, before it accepts real traffic. This forces the remaining initialization to happen at a controlled time, so production requests arrive at a fully initialized system. The warm-up response itself is discarded; only the side effect of having run the request matters. For most production deployments, warm-up takes 10-30 seconds upfront and removes an unpredictable cold-start penalty from user requests.

Configuration

bash

#!/bin/bash
# Start vLLM server in the background
vllm serve meta-llama/Llama-3.2-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-log-requests &

SERVER_PID=$!

# Wait until the server answers its health check (large models can take minutes to load)
echo "Waiting for server to start..."
until curl -sf http://localhost:8000/health > /dev/null; do
  # Stop waiting if the server process has already exited
  kill -0 "$SERVER_PID" 2>/dev/null || { echo "Server exited during startup" >&2; exit 1; }
  sleep 2
done

# Warm up the model with a dummy request; the response body is discarded
echo "Warming up model..."
curl -sS -o /dev/null -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 10,
    "temperature": 0.7
  }' \
  -w "Warmup latency: %{time_total}s\n"

echo "Model is now warm. Server ready for production traffic."

# Keep server running
wait $SERVER_PID

Why this order?

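Start the server first, wait until it is listening on the port, then send the warm-up request. A request sent before the server is listening fails with a connection error. A fixed sleep only guesses at the startup time, which varies with model size and disk speed, so polling the server's /health endpoint until it responds is the more reliable way to wait.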
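The waiting step can also be given an upper bound, so a server that never comes up does not block the script forever. The following is a minimal sketch, not vLLM tooling: the function name wait_for_health, the default URL, and the attempt and interval defaults are all assumptions chosen for illustration.

```shell
# Poll a health URL until it returns a success status or the attempts run out.
# Arguments (all hypothetical defaults): URL, max attempts, seconds between attempts.
wait_for_health() (
  url=${1:-http://localhost:8000/health}
  max_attempts=${2:-150}   # 150 attempts x 2 s is roughly 5 minutes
  interval=${3:-2}

  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides the progress meter
    if curl -sf -o /dev/null "$url"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done

  echo "server not ready after $max_attempts attempts" >&2
  return 1
)
```

A startup script would call `wait_for_health || exit 1` before sending the warm-up request, so a failed startup stops the script instead of producing a misleading "ready" message.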

Wrong vs Right

Wrong way

bash

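# Wrong: sending production traffic immediately after server start
vllm serve meta-llama/Llama-3.2-8B-Instruct &
# Server is still loading, but you immediately send real requests
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-8B-Instruct", "prompt": "user query"}'
# Result: requests are refused until the port opens; the first one that succeeds
# pays the cold-start cost (e.g. 12 seconds) while later requests take about 0.3 seconds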

Right way

bash

# Right: explicit warm-up phase before production traffic
vllm serve meta-llama/Llama-3.2-8B-Instruct &
# Wait until the server answers its health check
until curl -sf http://localhost:8000/health > /dev/null; do sleep 2; done
# Send warm-up request (throw away response)
curl -s -o /dev/null -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-8B-Instruct", "prompt": "test", "max_tokens": 10}'
echo "Warm-up complete. Ready for traffic."
# Now production requests hit a fully-initialized GPU

Tool vitals

Primary command

bash

vllm serve meta-llama/Llama-3.2-8B-Instruct --disable-log-requests
# Then send a small request to /v1/completions and watch its latency

Config file None: warm-up is handled via startup scripts or orchestration layer

Verify

bash

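# Run once right after startup and once after warm-up, then compare the two timings
curl -s -o /dev/null -w "%{time_total}s\n" -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-8B-Instruct", "prompt": "test", "max_tokens": 10}'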
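To see how quickly latency settles, it can help to time a short series of identical requests rather than a single pair. This is a rough sketch under stated assumptions: the function name time_requests and its defaults (five requests, the local URL, the model name from this lesson) are illustrative, and the numbers it prints depend entirely on your hardware and model.

```shell
# Time several identical completion requests and print one latency per line.
# Arguments (hypothetical defaults): request count, base URL, model name.
time_requests() (
  count=${1:-5}
  base_url=${2:-http://localhost:8000}
  model=${3:-meta-llama/Llama-3.2-8B-Instruct}

  i=1
  while [ "$i" -le "$count" ]; do
    # -o /dev/null drops the response body; -w prints only the total request time
    latency=$(curl -s -o /dev/null -w '%{time_total}' \
      -X POST "$base_url/v1/completions" \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"$model\", \"prompt\": \"test\", \"max_tokens\": 10}")
    echo "request $i: ${latency}s"
    i=$((i + 1))
  done
)
```

Running `time_requests 5` against a freshly started server should show whether the first line stands out from the rest. If all lines look alike, the server was probably already warm.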

Integration notes

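In Kubernetes, give the vLLM pod a startup probe that performs the warm-up, so the pod is not marked Ready until warm-up has completed. A StatefulSet is not required for this; a Deployment works the same way. In Docker Compose, run the warm-up script as a separate one-shot service that starts after the vLLM container reports healthy. Behind a load balancer (NGINX, HAProxy), make sure the health check only passes after warm-up has finished. The standard vLLM server exposes /health but, as far as its documented API goes, no dedicated warm-up endpoint, and /health can succeed before any warm-up request has run, so gate readiness on a signal that your warm-up script sets if you want to keep cold replicas out of rotation.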
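One way to build such a readiness signal is a small gate script that a startup probe or container health check executes repeatedly. The sketch below is an assumption-laden illustration, not part of vLLM: the function name warmup_gate, the marker file path /tmp/vllm-warmed, and the request parameters are all hypothetical.

```shell
# Readiness gate: succeeds only after the server is healthy AND one warm-up
# request has completed. Arguments (hypothetical defaults): base URL, model, marker file.
warmup_gate() (
  base_url=${1:-http://localhost:8000}
  model=${2:-meta-llama/Llama-3.2-8B-Instruct}
  marker=${3:-/tmp/vllm-warmed}

  # Check health first so a stale marker cannot hide an unhealthy server
  curl -sf -o /dev/null "$base_url/health" || return 1

  # Already warmed by an earlier probe run: nothing more to do
  [ -f "$marker" ] && return 0

  # Send one small warm-up request; record success only on a 2xx response
  curl -sf -o /dev/null -X POST "$base_url/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$model\", \"prompt\": \"warm-up\", \"max_tokens\": 8}" \
    || return 1

  touch "$marker"
)
```

The marker is assumed to live on the container's own filesystem, so it disappears when the container is recreated and the next start warms up again. If you run this from a Kubernetes exec probe, set the probe's timeoutSeconds above the expected warm-up latency, since the default of 1 second is likely too short for the first request.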

Migration path

If you move to a different inference server (TensorRT-LLM, Ollama, HuggingFace Text Generation Inference), each has its own warm-up mechanism or doesn't need it. vLLM warm-up is not portable: you'll need to rewrite the startup script for the new tool. However, the principle (eliminating cold-start latency) remains universal.

Common gotcha

Warm-up requests run through the same engine as real traffic: they occupy KV-cache memory and must finish before the server counts as warm. If your warm-up prompt is too long or max_tokens is too high, the warm-up itself can be slow, be rejected for exceeding the model's context length, or fail under memory pressure. Keep warm-up requests small (10-20 output tokens). Also, the warm state lasts only for the lifetime of the server process: if the server restarts (crash, update, redeployment), the warm state is lost and the next user hits cold-start latency again. Use orchestration (Kubernetes, systemd, Docker) so the warm-up step reruns automatically after every restart.

Team adoption

Add warm-up to your vLLM startup script as the default behavior: make it opt-out, not opt-in. Document in your runbooks that a server restart takes an extra 10-30 seconds (for warm-up) before it is ready. In on-call documentation, note that if users report 5-10x latency variance, the first thing to check is recent server restarts, because a missing warm-up step is a likely cause. Test warm-up in your staging environment before rolling to production so you know the exact warm-up time for your hardware and model.

Experienced dev note

Set --gpu-memory-utilization 0.9 when serving, then warm up with realistic concurrency rather than a single request. vLLM batches in-flight requests automatically, and the completions API has no batch_size parameter, so the way to exercise batching is to send several requests at once. If your production workload typically has 4-8 requests in flight, send 4 warm-up requests concurrently. A single-request warm-up may not exercise the batched code path, so your first batch of real requests can still be slow. Also, use --disable-log-requests during startup to keep logs clean while warm-up happens.
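Concurrent warm-up can be sketched with background curl processes. This is an illustration under assumptions: the function name warmup_concurrent, the default of 4 requests, and the prompt text are hypothetical, and whether the requests actually land in the same batch depends on the server's scheduler.

```shell
# Send N warm-up requests at the same time so the scheduler can batch them.
# Arguments (hypothetical defaults): request count, base URL, model name.
warmup_concurrent() (
  n=${1:-4}
  base_url=${2:-http://localhost:8000}
  model=${3:-meta-llama/Llama-3.2-8B-Instruct}

  pids=""
  i=1
  while [ "$i" -le "$n" ]; do
    # Launch each request in the background so they overlap in flight
    curl -sf -o /dev/null -X POST "$base_url/v1/completions" \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"$model\", \"prompt\": \"warm-up $i\", \"max_tokens\": 16}" &
    pids="$pids $!"
    i=$((i + 1))
  done

  # Wait for every request and count the ones that failed
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done

  echo "warm-up: $((n - failed))/$n requests succeeded"
  [ "$failed" -eq 0 ]
)
```

Calling `warmup_concurrent 4` after the health check passes returns non-zero if any request failed, which lets a startup script refuse to report the server as ready.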

Check your understanding

Why does sending a warm-up request with max_tokens=500 and a 2000-token prompt still not guarantee your next user request will be fast?

Answer hint

Warm-up initializes a specific code path. If your warm-up uses single requests but production uses batching (or vice versa), the code paths are different and the new path will still be cold.
