How to run vLLM on multiple GPUs
Quick answer
Run vLLM on multiple GPUs by launching the vllm serve CLI with the --tensor-parallel-size flag, or by setting the tensor_parallel_size parameter when constructing an LLM in Python. vLLM then shards the model across the GPUs (tensor parallelism), giving higher throughput and room for models too large for a single GPU.
Prerequisites
- Python 3.8+
- vLLM installed via pip (pip install vllm)
- CUDA-enabled GPUs with proper drivers
- NVIDIA NCCL installed for multi-GPU communication
Setup
Install vLLM via pip and ensure your system has CUDA-enabled GPUs with the correct drivers and NVIDIA NCCL installed for multi-GPU communication.
Run this command to install vLLM:
pip install vllm
Step by step
Use the vllm serve CLI with the --tensor-parallel-size option to shard inference across multiple GPUs. Alternatively, use the Python API and pass tensor_parallel_size to the LLM constructor.
Example CLI command:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --tensor-parallel-size 2
Output:
Starting vLLM server on port 8000 using 2 GPUs...
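Before sending requests, it helps to confirm the server is actually up. vLLM's OpenAI-compatible server exposes a /health endpoint; the sketch below is a minimal check using only the standard library (host and port assume the command above):

```python
import urllib.request
import urllib.error


def is_server_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False


print(is_server_up("http://localhost:8000/health"))
```

A False result usually means the server is still loading model weights or failed to start; check the serve process logs in that case.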
Python example to run vLLM on multiple GPUs for batch inference:
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

prompts = ["Hello, how are you?", "What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

outputs = llm.generate(prompts, sampling_params)
for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {output.outputs[0].text}")
    print("---")
Output:
Prompt 1: Hello, how are you?
Response: I'm doing well, thank you! How can I assist you today?
---
Prompt 2: What is the capital of France?
Response: The capital of France is Paris.
---
Common variations
You can launch the server with a different GPU count: to shard across 4 GPUs, pass --tensor-parallel-size 4 on the CLI or set tensor_parallel_size=4 in Python. Note that the value is fixed at startup and cannot be changed while the server is running.
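One constraint to keep in mind when varying the GPU count: the tensor-parallel size must evenly divide the model's number of attention heads, so not every value up to your GPU count is valid. The helper below is a hypothetical illustration of picking the largest usable value (32 heads is assumed here as it matches Llama-3.1-8B's configuration):

```python
def pick_tp_size(num_gpus: int, num_attention_heads: int = 32) -> int:
    """Largest tensor-parallel size that fits in num_gpus and divides the head count.

    Illustrative helper, not part of vLLM; 32 heads is an assumption
    matching Llama-3.1-8B. Check your model's config for the real value.
    """
    for tp in range(min(num_gpus, num_attention_heads), 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    return 1  # a single GPU always works


print(pick_tp_size(2))  # 2
print(pick_tp_size(3))  # 2 -- 3 does not divide 32, so one GPU sits idle
print(pick_tp_size(8))  # 8
```

If you pass an invalid size directly, vLLM will refuse to start, so it is cheaper to validate the value up front like this.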
To query a running vLLM server from Python using the OpenAI-compatible SDK:
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(response.choices[0].message.content)
Output:
Why did the scarecrow win an award? Because he was outstanding in his field!
Troubleshooting
- If you see errors related to NCCL or CUDA, verify your GPU drivers and NCCL installation.
- Ensure your GPUs have enough memory for the model size; reduce batch size or model size if out of memory.
- Check that --tensor-parallel-size (or tensor_parallel_size in Python) does not exceed the number of available GPUs.
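For the out-of-memory bullet above, a rough rule of thumb: model weights alone take about 2 bytes per parameter in fp16/bf16, split across GPUs under tensor parallelism, while the KV cache and activations add more on top. The estimator below is a hypothetical back-of-the-envelope helper, so treat its result as a lower bound:

```python
def weight_gib_per_gpu(num_params_b: float, num_gpus: int,
                       bytes_per_param: int = 2) -> float:
    """Approximate model-weight memory per GPU in GiB.

    Assumes fp16/bf16 weights (2 bytes/param) sharded evenly across GPUs.
    Ignores KV cache and activation overhead, so it is a lower bound.
    """
    total_bytes = num_params_b * 1e9 * bytes_per_param
    return total_bytes / num_gpus / (1024 ** 3)


# An 8B-parameter model in bf16 across 2 GPUs: ~7.45 GiB of weights per GPU
print(round(weight_gib_per_gpu(8, 2), 2))
```

If the estimate is close to a GPU's total memory, the model will not fit once the KV cache is allocated; use more GPUs or a smaller model.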
Key Takeaways
- Use the CLI --tensor-parallel-size flag to enable multi-GPU inference with vLLM.
- Ensure CUDA drivers and NVIDIA NCCL are properly installed for multi-GPU communication.
- Query a running multi-GPU vLLM server via the OpenAI-compatible API by setting base_url to the server endpoint.