Intermediate · 3 min read

How to run vLLM on multiple GPUs

Quick answer
Run vLLM on multiple GPUs by launching the vllm serve CLI with the --tensor-parallel-size flag, or by setting the tensor_parallel_size parameter when constructing an LLM in Python. vLLM then shards the model across the GPUs (tensor parallelism), which increases throughput and lets you serve models too large to fit on a single GPU.

PREREQUISITES

  • Python 3.8+
  • vLLM installed via pip (pip install vllm)
  • CUDA-enabled GPUs with proper drivers
  • NVIDIA NCCL installed for multi-GPU communication
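
Before launching, it can help to confirm how many GPUs are actually visible. A minimal sketch (assuming nvidia-smi is on the PATH; it returns 0 rather than failing on machines without NVIDIA tooling):

python
import shutil
import subprocess

def visible_gpu_count():
    """Count NVIDIA GPUs via nvidia-smi; returns 0 if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    return len([line for line in result.stdout.splitlines() if line.strip()])

print(f"Visible GPUs: {visible_gpu_count()}")

The tensor-parallel size you pick later must not exceed this count.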

Setup

Install vLLM via pip and ensure your system has CUDA-enabled GPUs with the correct drivers and NVIDIA NCCL installed for multi-GPU communication.

Run this command to install vLLM:

bash
pip install vllm

Step by step

Use the vllm serve CLI with the --tensor-parallel-size option to shard the model across multiple GPUs. Alternatively, pass tensor_parallel_size to LLM() in the Python API.

Example CLI command:

bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --tensor-parallel-size 2
Once the model weights are loaded onto both GPUs, the server exposes an OpenAI-compatible API on port 8000.
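
To see why sharding helps, a rough back-of-the-envelope calculation of per-GPU weight memory (the figures are illustrative assumptions and ignore KV cache and activation memory, which add more on top):

python
# Approximate per-GPU weight memory for an 8B-parameter model in bf16
# (2 bytes per parameter), sharded across tensor-parallel ranks.
PARAMS = 8e9
BYTES_PER_PARAM = 2  # bf16

for tp in (1, 2, 4):
    per_gpu_gb = PARAMS * BYTES_PER_PARAM / tp / 1e9
    print(f"tensor_parallel_size={tp}: ~{per_gpu_gb:.0f} GB of weights per GPU")

With --tensor-parallel-size 2, each GPU holds roughly half the weights, leaving more headroom for the KV cache.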

Python API

Python example to run vLLM on multiple GPUs for batch inference:

python
from vllm import LLM, SamplingParams

# Initialize the LLM, sharding the model across 2 GPUs with tensor parallelism
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

prompts = ["Hello, how are you?", "What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {output.outputs[0].text}")
    print("---")
output
Prompt 1: Hello, how are you?
Response: I'm doing well, thank you! How can I assist you today?
---
Prompt 2: What is the capital of France?
Response: The capital of France is Paris.
---

Common variations

The tensor-parallel size is fixed at launch, so to change the number of GPUs you restart the server with a new value, e.g. --tensor-parallel-size 4. The value cannot exceed the number of visible GPUs, and the model's attention head count must be divisible by it.
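For models that do not fit on one node even when sharded, vLLM also exposes a --pipeline-parallel-size option that splits the layer stack into stages. An illustrative combination (the GPU counts here are assumptions, not a recommendation):

```shell
# Illustrative only: 8 GPUs total, sharding each layer 4 ways (tensor
# parallelism) and splitting the layer stack into 2 stages (pipeline
# parallelism); the product must match the number of GPUs you use.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2
```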

To query a running vLLM server from Python using the OpenAI-compatible SDK:

python
from openai import OpenAI

# A local vLLM server does not check the API key unless you configured one,
# but the SDK requires a non-empty value, so any placeholder works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}]
)

print(response.choices[0].message.content)
output
Why did the scarecrow win an award? Because he was outstanding in his field!
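
Because the endpoint is OpenAI-compatible, you can also query it without any SDK, e.g. with curl (assuming the server from the earlier step is running on localhost:8000):

```shell
# Send a chat completion request directly to the vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Tell me a joke."}]
      }'
```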

Troubleshooting

  • If you see errors related to NCCL or CUDA, verify your GPU drivers and NCCL installation.
  • Ensure your GPUs have enough memory for the model size; reduce batch size or model size if out of memory.
  • Check that tensor_parallel_size does not exceed the number of available GPUs.
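
If you hit out-of-memory errors, two server flags are commonly tuned: --gpu-memory-utilization (the fraction of each GPU's memory vLLM may claim) and --max-model-len (the maximum context length, which bounds KV cache size). The specific values below are illustrative assumptions:

```shell
# Example tuning for a tight memory budget: cap vLLM at 85% of each GPU's
# memory and limit the context window to 8192 tokens.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
```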

Key Takeaways

  • Use the CLI --tensor-parallel-size flag (or the tensor_parallel_size parameter in Python) to enable multi-GPU inference with vLLM.
  • Ensure CUDA drivers and NVIDIA NCCL are properly installed for multi-GPU communication.
  • Query a running multi-GPU vLLM server via OpenAI-compatible API by setting base_url to the server endpoint.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct