How to load balance vLLM servers
Quick answer
Load balance vLLM servers by deploying multiple instances behind an HTTP reverse proxy such as nginx or a managed load balancer, distributing requests with round-robin or least-connections scheduling. Query the proxy through the OpenAI-compatible API endpoint to gain high availability and scalability.

Prerequisites
- Python 3.8+
- vLLM installed (pip install vllm)
- Basic knowledge of HTTP reverse proxies (e.g., nginx)
- OpenAI Python SDK installed (pip install "openai>=1.0")
- Multiple vLLM server instances running
Set up vLLM servers
Start multiple vLLM server instances on different ports or machines to serve your model concurrently. Use the CLI command to launch each server with the OpenAI-compatible API enabled.
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8001
```

Step-by-step load balancing with nginx
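On a multi-GPU machine, each instance usually needs to be pinned to its own GPU so the servers do not contend for the same device. A minimal sketch, assuming CUDA GPUs and the `vllm` CLI on your PATH (the `build_launch` helper is illustrative, not part of vLLM):

```python
import os

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_launch(instance: int, base_port: int = 8000):
    """Build the command and environment to pin one vLLM server
    instance to its own GPU (hypothetical helper)."""
    env = dict(os.environ)
    # Each process sees exactly one GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(instance)
    cmd = ["vllm", "serve", MODEL, "--port", str(base_port + instance)]
    return cmd, env

# Instance 1 listens on port 8001 and uses GPU 1.
cmd, env = build_launch(1)
print(cmd, env["CUDA_VISIBLE_DEVICES"])
# Launch each instance with subprocess.Popen(cmd, env=env).
```

This keeps the port/GPU pairing in one place instead of hand-editing shell commands per instance.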
Configure nginx as a reverse proxy to distribute incoming requests to your vLLM servers using round-robin load balancing. This setup balances load and provides fault tolerance.
```nginx
http {
    upstream vllm_backend {
        server localhost:8000;
        server localhost:8001;
    }

    server {
        listen 8080;

        location /v1/chat/completions {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```

Query load-balanced vLLM servers in Python
Use the OpenAI Python SDK to send chat completion requests to the nginx load balancer endpoint. The proxy will distribute requests across your vLLM servers automatically.
```python
import os
from openai import OpenAI

# Point the client at the nginx proxy. vLLM ignores the API key
# unless the servers were started with --api-key.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from load balanced vLLM!"}],
)
print(response.choices[0].message.content)
```

Example output (the generated text will vary):

Hello from load balanced vLLM!
Common variations
- Use other load balancers such as HAProxy or cloud-managed solutions (AWS ALB, GCP Load Balancer).
- Implement client-side round-robin by cycling through server URLs in your application.
- Enable health checks in your proxy to avoid routing to unhealthy vLLM instances.
- Use TLS termination at the proxy for secure communication.
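The client-side round-robin variation above can be sketched in a few lines of stdlib Python, assuming the two server ports from the setup section (the helper name is ours):

```python
from itertools import cycle

# Rotate requests across the vLLM servers directly, no proxy needed.
SERVERS = ["http://localhost:8000/v1", "http://localhost:8001/v1"]
_rotation = cycle(SERVERS)

def next_base_url() -> str:
    """Return the base_url to use for the next request (round-robin)."""
    return next(_rotation)

# Pass the result as base_url when constructing the OpenAI client:
#   OpenAI(api_key="EMPTY", base_url=next_base_url())
urls = [next_base_url() for _ in range(4)]
print(urls)  # alternates between the two servers
```

This is simpler than running a proxy, but it gives up the proxy's fault tolerance: the client keeps rotating onto a server even after it goes down unless you add your own health checks.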
Troubleshooting
- If requests fail, verify each vLLM server is running and reachable on its port.
- Check nginx logs for proxy errors and adjust timeout settings if needed.
- Ensure the base_url in your client matches the proxy address and port.
- For high latency, consider scaling servers horizontally or optimizing model loading.
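For the reachability checks above, vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 when the server is up. A small stdlib probe, assuming the ports from the setup section:

```python
import urllib.request
import urllib.error

def is_healthy(base: str, timeout: float = 2.0) -> bool:
    """Return True if a vLLM server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for base in ("http://localhost:8000", "http://localhost:8001"):
    print(base, "up" if is_healthy(base) else "down")
```

The same endpoint is what you would point nginx or HAProxy health checks at in a production setup.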
Key Takeaways
- Run multiple vLLM servers on different ports or machines for concurrency.
- Use nginx or similar reverse proxies to load balance requests with round-robin.
- Query the load balancer endpoint via OpenAI-compatible API in your client.
- Enable health checks and TLS termination for production-grade setups.
- Troubleshoot by verifying server availability and proxy configuration.