RuntimeError
vllm.RuntimeError: Tensor parallel size mismatch with GPU count
Stack trace
RuntimeError: Tensor parallel size mismatch with GPU count: tensor_parallel_size=4 but found 2 GPUs available
File "app.py", line 42, in <module>
llm = LLM(model="llama-3b", tensor_parallel_size=4)
File "/usr/local/lib/python3.9/site-packages/vllm/llm.py", line 123, in __init__
raise RuntimeError(f"Tensor parallel size mismatch with GPU count: tensor_parallel_size={tensor_parallel_size} but found {gpu_count} GPUs available")
Why it happens
vLLM splits the model into tensor_parallel_size shards and places each shard on its own GPU, so tensor_parallel_size cannot exceed the number of GPUs visible to the process. If it is set larger than the detected GPU count, initialization fails with this error because there are not enough devices to hold every shard.
Detection
Check the GPU count with nvidia-smi (GPUs installed in the machine) or torch.cuda.device_count() (GPUs visible to the current process) before initializing vLLM, and confirm that tensor_parallel_size does not exceed the visible count.
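The check described above can be sketched as a small pre-flight step that runs before the model is constructed. This is a minimal sketch, not part of vLLM: detect_gpu_count and validate_tensor_parallel_size are hypothetical helper names, and the only library call assumed is torch.cuda.device_count().

```python
def detect_gpu_count() -> int:
    """Return the number of CUDA devices visible to this process.

    Hypothetical helper: returns 0 if PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def validate_tensor_parallel_size(requested: int, gpu_count: int) -> None:
    """Raise ValueError if `requested` cannot be satisfied by `gpu_count` GPUs."""
    if requested < 1:
        raise ValueError(f"tensor_parallel_size must be >= 1, got {requested}")
    if requested > gpu_count:
        raise ValueError(
            f"tensor_parallel_size={requested} exceeds the {gpu_count} "
            "GPU(s) visible to this process"
        )
```

Calling validate_tensor_parallel_size(4, detect_gpu_count()) before constructing LLM turns the failure into an early, explicit check instead of an error raised partway through engine startup.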
Causes & fixes
tensor_parallel_size is set larger than the number of GPUs physically available
Reduce tensor_parallel_size so it does not exceed the GPU count detected on your system.
An environment variable such as CUDA_VISIBLE_DEVICES masks some GPUs, so fewer are visible to the process than are installed
Inspect CUDA_VISIBLE_DEVICES and adjust it so that every GPU you intend to use is exposed to the process.
Launching vLLM on a machine with fewer GPUs than the tensor_parallel_size specified in code or config
Deploy the application on a machine with at least as many GPUs as tensor_parallel_size or lower the tensor_parallel_size accordingly.
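To tell a masking problem apart from a genuine hardware shortfall, it can help to print what CUDA_VISIBLE_DEVICES currently contains. The sketch below is illustrative only: masked_device_ids is a hypothetical helper that reports the entries as written and does not check whether they refer to real devices.

```python
import os


def masked_device_ids(env=None):
    """Return the entries listed in CUDA_VISIBLE_DEVICES, or None if it is unset.

    None means no mask is applied, so every GPU the driver reports is visible.
    An empty list means the variable is set but exposes no GPUs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries may be device indices or GPU UUIDs; keep them as strings.
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


if __name__ == "__main__":
    ids = masked_device_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: no GPUs are masked")
    else:
        print(f"CUDA_VISIBLE_DEVICES exposes {len(ids)} device(s): {ids}")
```

If the list is shorter than the tensor_parallel_size you requested, the mask is the likely cause; if the variable is unset, compare tensor_parallel_size against the output of nvidia-smi instead.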
Code: broken vs fixed
Broken:
from vllm import LLM

# Requests 4 shards; raises the error above if fewer than 4 GPUs are visible
llm = LLM(model="llama-3b", tensor_parallel_size=4)
print("LLM initialized")

Fixed:
import os

# Set the mask before importing torch or vllm so it is in place when CUDA initializes
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose only GPUs 0 and 1

import torch
from vllm import LLM

gpu_count = torch.cuda.device_count()  # GPUs visible after masking
# Fixed: derive tensor_parallel_size from the detected GPU count
llm = LLM(model="llama-3b", tensor_parallel_size=gpu_count)
print("LLM initialized with tensor_parallel_size", gpu_count)
Workaround
Temporarily set tensor_parallel_size=1 to run on a single GPU, provided the model fits in that GPU's memory, until you can adjust your environment or scale your hardware.
Prevention
Detect the GPU count at runtime and derive tensor_parallel_size from it instead of hard-coding the value, and check CUDA_VISIBLE_DEVICES at startup so that masked GPUs do not go unnoticed.
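One way to apply this is to clamp the requested value to the number of visible GPUs before passing it to vLLM. The following is a sketch under stated assumptions: choose_tensor_parallel_size is a hypothetical helper, not a vLLM API, and the commented usage reuses the placeholder model name from the example above.

```python
def choose_tensor_parallel_size(requested: int, gpu_count: int) -> int:
    """Clamp `requested` to the range [1, gpu_count].

    Raises RuntimeError when no GPU is visible, since no value would work.
    """
    if gpu_count < 1:
        raise RuntimeError("No CUDA devices are visible to this process")
    return max(1, min(requested, gpu_count))


# Example usage (requires torch and vllm to be installed):
#   import torch
#   from vllm import LLM
#   tp = choose_tensor_parallel_size(4, torch.cuda.device_count())
#   llm = LLM(model="llama-3b", tensor_parallel_size=tp)
```

Clamping silently lowers the degree of parallelism, so logging the chosen value is advisable. vLLM may also reject values that do not divide the model's attention head count evenly, so the clamped value can still need adjusting for a particular model.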