Critical severity · Intermediate · Fix: 2-5 min

RuntimeError

vllm.RuntimeError: Tensor parallel size mismatch with GPU count

What this error means

vLLM raises a RuntimeError when the configured tensor_parallel_size is larger than the number of GPUs visible to the process.

Stack trace

traceback

RuntimeError: Tensor parallel size mismatch with GPU count: tensor_parallel_size=4 but found 2 GPUs available
  File "app.py", line 42, in <module>
    llm = LLM(model="llama-3b", tensor_parallel_size=4)
  File "/usr/local/lib/python3.9/site-packages/vllm/llm.py", line 123, in __init__
    raise RuntimeError(f"Tensor parallel size mismatch with GPU count: tensor_parallel_size={tensor_parallel_size} but found {gpu_count} GPUs available")

QUICK FIX

Before creating the LLM instance, set tensor_parallel_size to a value no larger than the GPU count reported by torch.cuda.device_count().

Why it happens

vLLM splits the model into tensor_parallel_size shards and places each shard on its own GPU, so the value cannot exceed the number of GPUs visible to the process. If tensor_parallel_size is larger than the detected GPU count, some shards have no device to run on and vLLM raises this error.
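To make the failure concrete, here is a simplified illustration of one-shard-per-GPU placement. It is not vLLM's internal code: `assign_shards` is a hypothetical helper, and the real placement logic is more involved.

```python
def assign_shards(tensor_parallel_size: int, gpu_ids: list[int]) -> dict[int, int]:
    """Map each tensor-parallel rank to its own GPU (hypothetical helper)."""
    if tensor_parallel_size > len(gpu_ids):
        # Same situation as the error above: some ranks would have no device.
        raise RuntimeError(
            f"tensor_parallel_size={tensor_parallel_size} "
            f"but found {len(gpu_ids)} GPUs available"
        )
    # Rank i takes the i-th visible GPU; any extra GPUs stay unused.
    return {rank: gpu_ids[rank] for rank in range(tensor_parallel_size)}


print(assign_shards(2, [0, 1]))
```

With two GPUs, a size of 2 maps cleanly, while a size of 4 leaves ranks 2 and 3 without a device and raises.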

Detection

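Check the GPU count with nvidia-smi or torch.cuda.device_count() before initializing vLLM, and confirm that tensor_parallel_size does not exceed it. nvidia-smi lists every physical GPU on the machine, while torch.cuda.device_count() counts only the GPUs the current process can see, so prefer the latter when CUDA_VISIBLE_DEVICES is set.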
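A pre-flight check along these lines can run before the LLM is constructed. This is a minimal sketch: `detect_gpu_count` and `preflight` are hypothetical helper names, and only `torch.cuda.device_count()` is a real library call.

```python
def detect_gpu_count() -> int:
    """Return the number of CUDA devices visible to this process (0 if torch is missing)."""
    try:
        import torch  # imported lazily so the check also runs where torch is absent
    except ImportError:
        return 0
    return torch.cuda.device_count()


def preflight(requested: int, available: int) -> str:
    """Describe whether `requested` tensor-parallel ranks fit on `available` GPUs."""
    if requested < 1:
        return f"invalid: tensor_parallel_size={requested} must be at least 1"
    if requested > available:
        return (
            f"too large: tensor_parallel_size={requested} "
            f"but only {available} GPU(s) visible"
        )
    return f"ok: tensor_parallel_size={requested} fits on {available} GPU(s)"
```

Calling `preflight(4, detect_gpu_count())` at startup reports the problem in plain terms before any model weights are loaded.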

Causes & fixes

tensor_parallel_size is set larger than the number of GPUs physically available

✓ Fix

Reduce tensor_parallel_size so that it is no larger than the GPU count your system actually reports.

CUDA_VISIBLE_DEVICES (or another environment setting) hides GPUs, so fewer GPUs are visible to the process than expected

✓ Fix

Check the CUDA_VISIBLE_DEVICES environment variable and adjust it so the process can see every GPU you intend to use.
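One way to inspect the mask is to parse it directly. The sketch below uses a hypothetical helper, `masked_devices`, and simplifies CUDA's behavior: it only counts the listed entries and does not check that each entry names a real device, so the result is an upper bound.

```python
import os
from typing import Optional


def masked_devices(mask: Optional[str]) -> Optional[list[str]]:
    """Return the device entries listed in a CUDA_VISIBLE_DEVICES value.

    None means the variable is unset, so no mask is applied.
    An empty list means the mask hides every GPU.
    """
    if mask is None:
        return None
    return [entry.strip() for entry in mask.split(",") if entry.strip()]


devices = masked_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
if devices is None:
    print("CUDA_VISIBLE_DEVICES is unset: no mask is applied")
else:
    print(f"CUDA_VISIBLE_DEVICES exposes at most {len(devices)} GPU(s): {devices}")
```

If the reported number is smaller than tensor_parallel_size, the mask is the likely cause.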

Launching vLLM on a machine with fewer GPUs than the tensor_parallel_size specified in code or config

✓ Fix

Deploy the application on a machine with at least as many GPUs as tensor_parallel_size, or lower tensor_parallel_size to fit the machine.

Code: broken vs fixed

Broken - triggers the error

python

from vllm import LLM

llm = LLM(model="llama-3b", tensor_parallel_size=4)  # Error if fewer than 4 GPUs available
print("LLM initialized")

Fixed - works correctly

python

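import os

# Set the mask before importing torch or vllm so CUDA picks it up when it initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # Expose GPUs 0 and 1 to this process

import torch
from vllm import LLM

gpu_count = torch.cuda.device_count()  # Counts only the GPUs left visible by the mask
if gpu_count == 0:
    raise SystemExit("No CUDA GPUs are visible to this process")

llm = LLM(model="llama-3b", tensor_parallel_size=gpu_count)  # Fixed: never exceeds the visible GPUs
print("LLM initialized with tensor_parallel_size", gpu_count)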

tensor_parallel_size now comes from torch.cuda.device_count(), so it can never exceed the number of GPUs visible to the process and vLLM can place every model shard on a device.

⚠ Workaround

Temporarily set tensor_parallel_size=1 to run on a single GPU until you can adjust your environment or scale your hardware. This only works if the model fits in the memory of one GPU.

✓ Prevention

Detect the GPU count at runtime and derive tensor_parallel_size from it instead of hard-coding a value, and check CUDA_VISIBLE_DEVICES so that a mask does not silently hide GPUs.
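A small helper can turn that rule into code. This is a sketch under stated assumptions: `pick_tensor_parallel_size` is a hypothetical name, and the optional `num_attention_heads` argument reflects the separate vLLM requirement that the model's attention head count divide evenly by the tensor parallel size. You would read that number from the model's configuration yourself.

```python
from typing import Optional


def pick_tensor_parallel_size(
    requested: int, available: int, num_attention_heads: Optional[int] = None
) -> int:
    """Return the largest usable size that is <= both `requested` and `available`."""
    if available < 1:
        raise RuntimeError("no GPUs visible to this process")
    # Clamp the request into the range [1, available].
    size = max(1, min(requested, available))
    if num_attention_heads is not None:
        # Step down until the head count splits evenly across the ranks.
        while num_attention_heads % size != 0:
            size -= 1
    return size


print(pick_tensor_parallel_size(requested=4, available=2))
print(pick_tensor_parallel_size(requested=4, available=3, num_attention_heads=32))
```

Pass `torch.cuda.device_count()` as `available` and use the result as tensor_parallel_size. Stepping down rather than failing is a design choice: it favors starting with fewer GPUs over refusing to start.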

Python 3.9+ · vllm >=0.3.0 · tested on 0.3.x

Verified 2026-04
