How to run vLLM on multiple GPUs
Quick answer
Run vLLM on multiple GPUs by launching the vllm serve CLI with the --tensor-parallel-size flag, or by setting the tensor_parallel_size parameter when constructing an LLM in Python. vLLM then shards the model across the GPUs (tensor parallelism), giving higher throughput and room for models too large for a single GPU.
Prerequisites
- Python 3.8+
- vLLM installed via pip (pip install vllm)
- CUDA-enabled GPUs with proper drivers
- NVIDIA NCCL installed for multi-GPU communication
Setup
Install vLLM via pip and ensure your system has CUDA-enabled GPUs with the correct drivers and NVIDIA NCCL installed for multi-GPU communication.
Run this command to install vLLM:
pip install vllm
Step by step
Use the vllm serve CLI with the --tensor-parallel-size option to shard inference across multiple GPUs. Alternatively, use the Python API and pass tensor_parallel_size to the LLM constructor.
Example CLI command:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --tensor-parallel-size 2
Output:
Starting vLLM server on port 8000 using 2 GPUs...
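Before sending requests, it helps to confirm the server is actually up. vLLM's OpenAI-compatible server exposes a /health endpoint; the sketch below is a minimal check using only the standard library (host and port assume the command above):

```python
import urllib.request
import urllib.error


def is_server_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False


print(is_server_up("http://localhost:8000/health"))
```

A False result usually means the server is still loading model weights or failed to start; check the serve process logs in that case.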
Python example to run vLLM on multiple GPUs for batch inference:
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

prompts = ["Hello, how are you?", "What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

outputs = llm.generate(prompts, sampling_params)
for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {output.outputs[0].text}")
    print("---")
Output:
Prompt 1: Hello, how are you?
Response: I'm doing well, thank you! How can I assist you today?
---
Prompt 2: What is the capital of France?
Response: The capital of France is Paris.
---
Common variations
You can launch the server with a different GPU count: to shard across 4 GPUs, pass --tensor-parallel-size 4 on the CLI or set tensor_parallel_size=4 in Python. Note that the value is fixed at startup and cannot be changed while the server is running.
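One constraint to keep in mind when varying the GPU count: the tensor-parallel size must evenly divide the model's number of attention heads, so not every value up to your GPU count is valid. The helper below is a hypothetical illustration of picking the largest usable value (32 heads is assumed here as it matches Llama-3.1-8B's configuration):

```python
def pick_tp_size(num_gpus: int, num_attention_heads: int = 32) -> int:
    """Largest tensor-parallel size that fits in num_gpus and divides the head count.

    Illustrative helper, not part of vLLM; 32 heads is an assumption
    matching Llama-3.1-8B. Check your model's config for the real value.
    """
    for tp in range(min(num_gpus, num_attention_heads), 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    return 1  # a single GPU always works


print(pick_tp_size(2))  # 2
print(pick_tp_size(3))  # 2 -- 3 does not divide 32, so one GPU sits idle
print(pick_tp_size(8))  # 8
```

If you pass an invalid size directly, vLLM will refuse to start, so it is cheaper to validate the value up front like this.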
To query a running vLLM server from Python using the OpenAI-compatible SDK:
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(response.choices[0].message.content)
Output:
Why did the scarecrow win an award? Because he was outstanding in his field!
Troubleshooting
- If you see errors related to NCCL or CUDA, verify your GPU drivers and NCCL installation.
- Ensure your GPUs have enough memory for the model size; reduce batch size or model size if out of memory.
- Check that --tensor-parallel-size (or tensor_parallel_size in Python) does not exceed the number of available GPUs.
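For the out-of-memory bullet above, a rough rule of thumb: model weights alone take about 2 bytes per parameter in fp16/bf16, split across GPUs under tensor parallelism, while the KV cache and activations add more on top. The estimator below is a hypothetical back-of-the-envelope helper, so treat its result as a lower bound:

```python
def weight_gib_per_gpu(num_params_b: float, num_gpus: int,
                       bytes_per_param: int = 2) -> float:
    """Approximate model-weight memory per GPU in GiB.

    Assumes fp16/bf16 weights (2 bytes/param) sharded evenly across GPUs.
    Ignores KV cache and activation overhead, so it is a lower bound.
    """
    total_bytes = num_params_b * 1e9 * bytes_per_param
    return total_bytes / num_gpus / (1024 ** 3)


# An 8B-parameter model in bf16 across 2 GPUs: ~7.45 GiB of weights per GPU
print(round(weight_gib_per_gpu(8, 2), 2))
```

If the estimate is close to a GPU's total memory, the model will not fit once the KV cache is allocated; use more GPUs or a smaller model.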
Key Takeaways
- Use the CLI --tensor-parallel-size flag to enable multi-GPU inference with vLLM.
- Ensure CUDA drivers and NVIDIA NCCL are properly installed for multi-GPU communication.
- Query a running multi-GPU vLLM server via the OpenAI-compatible API by setting base_url to the server endpoint.