How to load balance vLLM servers
Quick answer
Load balance vLLM servers by deploying multiple instances behind an HTTP reverse proxy such as nginx or a managed load balancer, distributing requests with round-robin or least-connections scheduling. Query the proxy through the OpenAI-compatible API endpoint to gain high availability and scalability.

Prerequisites
- Python 3.8+
- vLLM installed (pip install vllm)
- Basic knowledge of HTTP reverse proxies (e.g., nginx)
- OpenAI Python SDK installed (pip install "openai>=1.0")
- Multiple vLLM server instances running
Set up vLLM servers
Start multiple vLLM server instances on different ports or machines to serve your model concurrently. Use the CLI command to launch each server with the OpenAI-compatible API enabled.
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8001
```

Step-by-step load balancing with nginx
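On a multi-GPU machine, each instance usually needs to be pinned to its own GPU so the servers do not contend for the same device. A minimal sketch, assuming CUDA GPUs and the `vllm` CLI on your PATH (the `build_launch` helper is illustrative, not part of vLLM):

```python
import os

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_launch(instance: int, base_port: int = 8000):
    """Build the command and environment to pin one vLLM server
    instance to its own GPU (hypothetical helper)."""
    env = dict(os.environ)
    # Each process sees exactly one GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(instance)
    cmd = ["vllm", "serve", MODEL, "--port", str(base_port + instance)]
    return cmd, env

# Instance 1 listens on port 8001 and uses GPU 1.
cmd, env = build_launch(1)
print(cmd, env["CUDA_VISIBLE_DEVICES"])
# Launch each instance with subprocess.Popen(cmd, env=env).
```

This keeps the port/GPU pairing in one place instead of hand-editing shell commands per instance.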
Configure nginx as a reverse proxy to distribute incoming requests to your vLLM servers using round-robin load balancing. This setup balances load and provides fault tolerance.
```nginx
http {
    upstream vllm_backend {
        server localhost:8000;
        server localhost:8001;
    }

    server {
        listen 8080;

        location /v1/chat/completions {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```

Query load-balanced vLLM servers in Python
Use the OpenAI Python SDK to send chat completion requests to the nginx load balancer endpoint. The proxy will distribute requests across your vLLM servers automatically.
```python
import os
from openai import OpenAI

# Point the client at the nginx proxy. vLLM ignores the API key
# unless the servers were started with --api-key.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from load balanced vLLM!"}],
)
print(response.choices[0].message.content)
```

Example output (the generated text will vary):

Hello from load balanced vLLM!
Common variations
- Use other load balancers such as HAProxy or cloud-managed solutions (AWS ALB, GCP Load Balancer).
- Implement client-side round-robin by cycling through server URLs in your application.
- Enable health checks in your proxy to avoid routing to unhealthy vLLM instances.
- Use TLS termination at the proxy for secure communication.
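The client-side round-robin variation above can be sketched in a few lines of stdlib Python, assuming the two server ports from the setup section (the helper name is ours):

```python
from itertools import cycle

# Rotate requests across the vLLM servers directly, no proxy needed.
SERVERS = ["http://localhost:8000/v1", "http://localhost:8001/v1"]
_rotation = cycle(SERVERS)

def next_base_url() -> str:
    """Return the base_url to use for the next request (round-robin)."""
    return next(_rotation)

# Pass the result as base_url when constructing the OpenAI client:
#   OpenAI(api_key="EMPTY", base_url=next_base_url())
urls = [next_base_url() for _ in range(4)]
print(urls)  # alternates between the two servers
```

This is simpler than running a proxy, but it gives up the proxy's fault tolerance: the client keeps rotating onto a server even after it goes down unless you add your own health checks.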
Troubleshooting
- If requests fail, verify each vLLM server is running and reachable on its port.
- Check nginx logs for proxy errors and adjust timeout settings if needed.
- Ensure the base_url in your client matches the proxy address and port.
- For high latency, consider scaling servers horizontally or optimizing model loading.
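For the reachability checks above, vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 when the server is up. A small stdlib probe, assuming the ports from the setup section:

```python
import urllib.request
import urllib.error

def is_healthy(base: str, timeout: float = 2.0) -> bool:
    """Return True if a vLLM server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for base in ("http://localhost:8000", "http://localhost:8001"):
    print(base, "up" if is_healthy(base) else "down")
```

The same endpoint is what you would point nginx or HAProxy health checks at in a production setup.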
Key Takeaways
- Run multiple vLLM servers on different ports or machines for concurrency.
- Use nginx or similar reverse proxies to load balance requests with round-robin.
- Query the load balancer endpoint via OpenAI-compatible API in your client.
- Enable health checks and TLS termination for production-grade setups.
- Troubleshoot by verifying server availability and proxy configuration.