NVLink vs PCIe performance
Why this matters
In production, multi-GPU inference with vLLM on PCIe can bottleneck at 50-60% of theoretical throughput during distributed prefill and decode. NVLink clusters eliminate this bottleneck but cost 3-5x more. Choosing wrong wastes either compute budget or hardware investment.
Explanation
vLLM spreads a model across GPUs using tensor parallelism and pipeline parallelism. Both require GPU-to-GPU communication during every forward pass: tensor parallelism runs all-reduce operations inside each transformer layer, while pipeline parallelism passes activations between stages. A PCIe 4.0 x16 link provides ~32 GB/s per direction (~64 GB/s bidirectional); NVLink 4 (H100/H200) provides up to ~900 GB/s aggregate bidirectional bandwidth per GPU, roughly 14x more. At large batch sizes or long sequence lengths, the communication-to-computation ratio favors NVLink. vLLM's --tensor-parallel-size and --pipeline-parallel-size flags determine how much inter-GPU traffic flows. The performance gap widens with larger models (70B+) and higher concurrency (>8 concurrent requests). As a rule of thumb for cost-benefit: PCIe clusters are viable for models up to 13B with batch ≤ 256; 70B+ models generally need NVLink for sub-second latencies at production scale.
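To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch of the all-reduce traffic in one forward pass. It is a bandwidth-only model, not a measurement: it assumes a ring all-reduce, two all-reduces per transformer layer, a 70B-class shape (hidden size 8192, 80 layers, FP16 activations), and per-direction link speeds of 32 GB/s for PCIe 4.0 and 450 GB/s for NVLink 4. The function names are illustrative and are not part of vLLM.

```python
def ring_allreduce_seconds(num_bytes: float, num_gpus: int, gb_per_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce (ignores latency and overlap)."""
    # In a ring all-reduce each GPU sends and receives 2*(n-1)/n of the buffer.
    traffic = 2 * (num_gpus - 1) / num_gpus * num_bytes
    return traffic / (gb_per_s * 1e9)


def comm_seconds_per_forward(tokens, hidden_size, num_layers, tp, gb_per_s, bytes_per_elem=2):
    """Total all-reduce time for one forward pass under tensor parallelism."""
    # Assumption: two all-reduces per layer (after attention and after the MLP).
    buffer_bytes = tokens * hidden_size * bytes_per_elem
    return 2 * num_layers * ring_allreduce_seconds(buffer_bytes, tp, gb_per_s)


# Assumed shape and link speeds; see the paragraph above for the caveats.
pcie = comm_seconds_per_forward(4096, 8192, 80, tp=8, gb_per_s=32)
nvlink = comm_seconds_per_forward(4096, 8192, 80, tp=8, gb_per_s=450)
print(f"PCIe 4.0: {pcie * 1e3:.0f} ms  NVLink 4: {nvlink * 1e3:.0f} ms  ratio: {pcie / nvlink:.1f}x")
```

Under these assumptions a 4096-token batch spends over half a second on communication per forward pass on PCIe, which is why the batch-size and tensor-parallel settings discussed here matter more on PCIe than on NVLink.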
Configuration
# vllm_config.yaml: configure tensor parallelism for PCIe vs NVLink
model: meta-llama/Llama-3.1-70B-Instruct
# PCIe 4.0 cluster (8x A100 40GB)
engine_use_ray: false
tensor_parallel_size: 4 # 4 GPUs per tensor group avoids cross-switch communication
pipeline_parallel_size: 2 # split across 2 pipeline stages
# For NVLink H100 cluster, safely use all 8 GPUs in single tensor group
# tensor_parallel_size: 8
# pipeline_parallel_size: 1
gpu_memory_utilization: 0.85
max_num_seqs: 256
max_model_len: 2048
# PCIe: reduce batch size to avoid communication stalls
max_num_batched_tokens: 4096
# NVLink: can increase 2-3x without saturation
# max_num_batched_tokens: 12288
scheduler_delay_factor: 0.5 # optimize for latency on PCIe
load_format: auto
Why this order?
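Key order in the YAML file does not change vLLM's behavior; the order above is the order in which to decide the values. Choose tensor_parallel_size first, because the parallel strategy determines how much of the model each GPU holds and therefore what gpu_memory_utilization leaves over for the KV cache. The pipeline parallelism stage count follows. Batch tuning (max_num_batched_tokens) comes last because it depends on the memory and communication strategy above it.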
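The dependency between parallel strategy and memory can be sketched with simple arithmetic. This estimate assumes FP16 weights that shard evenly across tensor_parallel_size × pipeline_parallel_size GPUs, and it ignores activations and runtime overhead, so treat the numbers as an upper bound on KV-cache space rather than what vLLM will report. The function name is illustrative.

```python
def per_gpu_memory_gb(params_billion, gpu_mem_gb, tp, pp, utilization, bytes_per_param=2):
    """Return (weights_gb, kv_budget_gb) per GPU, assuming weights shard evenly."""
    # 1e9 params * bytes_per_param / 1e9 bytes-per-GB cancels out to this:
    weights_gb = params_billion * bytes_per_param / (tp * pp)
    # gpu_memory_utilization caps the share of GPU memory vLLM may use.
    budget_gb = gpu_mem_gb * utilization
    return weights_gb, budget_gb - weights_gb


# The PCIe layout from the config above: 70B model, 8x A100 40GB, TP=4, PP=2.
weights, kv = per_gpu_memory_gb(70, 40, tp=4, pp=2, utilization=0.85)
print(f"weights per GPU: {weights:.1f} GB, left for KV cache: {kv:.1f} GB")
```

Changing tensor_parallel_size or pipeline_parallel_size changes the first number, which is why gpu_memory_utilization and the batch limits are only worth tuning afterwards.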
Wrong vs Right
# WRONG: Assume PCIe can handle same tensor parallelism as NVLink
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 8 # all 8 GPUs, PCIe interconnect
gpu_memory_utilization: 0.95
max_num_batched_tokens: 16384 # NVLink-scale batch size
# Result: 60% GPU utilization, 2-3x slower throughput due to communication bottleneck
# RIGHT: Match tensor parallelism strategy to interconnect bandwidth
# PCIe 4.0 cluster
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 4 # Limit to minimize cross-switch PCIe traffic
pipeline_parallel_size: 2 # Use pipeline to hide PCIe latency
gpu_memory_utilization: 0.85
max_num_batched_tokens: 6144 # Conservative batch size
# To transition to NVLink:
# 1. Change tensor_parallel_size: 8 (use all GPUs in single group)
# 2. Change pipeline_parallel_size: 1 (unnecessary with NVLink bandwidth)
# 3. Increase max_num_batched_tokens: 12288+ (saturate NVLink fully)
Tool vitals
vllm serve meta-llama/Llama-3.1-70B-Instruct --config vllm_config.yaml --tensor-parallel-size 8 --gpu-memory-utilization 0.9
curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","prompt":"test","max_tokens":10}' | jq '.usage'
Integration notes
vLLM's tensor parallelism strategy must align with your orchestration layer. Kubernetes GPU selectors cannot guarantee NVLink placement across pods: use node affinity and guaranteed GPU topology (with NVIDIA GPU-affinity scheduler). Ray (vLLM's multi-node distributed backend) discovers the GPUs on each node via ray.init(), but it is not interconnect-aware and requires explicit --num-gpus tuning on ray start to avoid over-subscription. Always run nvidia-smi topo -m first to verify your cluster's actual interconnect before deploying.
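The topology check can be scripted so it runs before every deployment. The sketch below parses text in the layout that nvidia-smi topo -m prints (a GPU-by-GPU matrix where NV# marks NVLink and codes such as PHB or SYS mark PCIe paths). The two sample matrices are hand-written illustrations, not captured output, and the helper names are hypothetical; real output may carry extra columns or terminal formatting that needs stripping first.

```python
import re

GPU_LABEL = re.compile(r"GPU\d+")


def parse_topo_matrix(text: str) -> dict:
    """Map (gpu_a, gpu_b) -> link code from `nvidia-smi topo -m` style text."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    # GPU columns come first in the header; other columns (CPU Affinity, ...) follow.
    header = [tok for tok in lines[0] if GPU_LABEL.fullmatch(tok)]
    links = {}
    for row in lines[1:]:
        if row[0] not in header:
            continue  # skip NIC rows and legend lines
        for col, code in zip(header, row[1:1 + len(header)]):
            if col != row[0]:
                links[(row[0], col)] = code
    return links


def all_nvlink(links: dict) -> bool:
    """True only if every GPU pair is connected by an NV# link."""
    return bool(links) and all(code.startswith("NV") for code in links.values())


# Illustrative samples only; run `nvidia-smi topo -m` for the real matrix.
SAMPLE_NVLINK = """
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV18    0-31            0
GPU1    NV18     X      0-31            0
"""
SAMPLE_PCIE = """
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-31            0
GPU1    PHB      X      0-31            0
"""

print(all_nvlink(parse_topo_matrix(SAMPLE_NVLINK)))
print(all_nvlink(parse_topo_matrix(SAMPLE_PCIE)))
```

In practice the text would come from something like subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout, and a False result would select the PCIe config template.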
Migration path
If moving from PCIe to NVLink: (1) on the old cluster, reduce tensor_parallel_size from 8→4 to establish a baseline, (2) measure tokens/sec as ground truth, (3) on the new NVLink cluster, set tensor_parallel_size: 8 and gradually increase max_num_batched_tokens until throughput stops improving or GPU utilization drops, (4) compare tokens/sec at the same batch size: NVLink should show a 3-5x gain. Rolling back requires only restoring the previous config values; no code changes needed.
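Step 3 amounts to a sweep that stops once larger batches no longer pay off. A minimal sketch of that stopping rule follows; the 5% threshold and the throughput figures are hypothetical placeholders chosen for illustration, and the real numbers must come from your own tokens/sec measurements.

```python
def pick_batched_tokens(measurements: dict, min_gain: float = 0.05) -> int:
    """Return the largest setting that still improved throughput by at least min_gain."""
    ordered = sorted(measurements.items())  # ascending max_num_batched_tokens
    best_tokens, best_tps = ordered[0]
    for tokens, tps in ordered[1:]:
        if tps < best_tps * (1 + min_gain):
            break  # throughput has plateaued; stop increasing the batch limit
        best_tokens, best_tps = tokens, tps
    return best_tokens


# Hypothetical sweep results: max_num_batched_tokens -> measured tokens/sec.
sweep = {4096: 1000.0, 8192: 1600.0, 12288: 1900.0, 16384: 1920.0}
chosen = pick_batched_tokens(sweep)
print(chosen)
```

With these placeholder numbers the jump from 12288 to 16384 adds about 1%, so the sweep settles on 12288.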
Common gotcha
Setting tensor_parallel_size: 8 on a PCIe cluster will not raise an error in vLLM: it will silently run at 40-60% GPU utilization. vLLM does not adapt your parallelism settings to the interconnect topology. The server starts successfully and requests complete, but throughput stays mysteriously low. Monitor per-GPU power and clocks with nvidia-smi dmon; if any GPU is consistently under-clocked or drawing far less power than its peers, communication is likely saturated. Running VLLM_TRACE_FUNCTION=1 python -m vllm.entrypoints.openai.api_server records function calls, which can help locate time spent in inter-GPU communication; use it for debugging only, since tracing slows the server considerably.
Team adoption
Start all new vLLM clusters with a 10-minute interconnect audit: nvidia-smi topo -m and nccl-tests/build/all_reduce_perf -b 100M -e 100M -g 8 2>&1 | grep -E 'busbw|Avg bus bandwidth' (nccl-tests reports bandwidth in GB/s). Create a runbook with two config templates (PCIe and NVLink paths). In PR reviews, require a comment explaining which template applies and why. Most teams skip this and waste weeks tuning batch size on the wrong hardware.
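To record the audit result in the runbook automatically, the summary figure can be pulled out of the all_reduce_perf output. This sketch assumes the summary line has the form "# Avg bus bandwidth : <number>", as printed by the nccl-tests builds this was written against; the sample string is illustrative rather than a captured log, and the helper name is hypothetical.

```python
import re
from typing import Optional


def avg_bus_bandwidth(output: str) -> Optional[float]:
    """Extract the average bus bandwidth (GB/s) from nccl-tests output, if present."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)", output)
    return float(match.group(1)) if match else None


# Illustrative tail of an all_reduce_perf run (values are made up).
sample_output = """
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 227.513
#
"""
print(avg_bus_bandwidth(sample_output))
```

A None result means the summary line was not found, which is worth treating as a failed audit rather than silently defaulting to one template.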
Experienced dev note
The trick experienced vLLM operators use: run vllm serve --disable-log-stats --tensor-parallel-size 8 ... &, then immediately watch per-GPU power and utilization with nvidia-smi dmon -s pu. On PCIe, you'll see uneven power draw (some GPUs idle while others compute). On NVLink, all GPUs draw equal power per batch: that's the signal you've tuned correctly. In vLLM versions that expose it, set scheduler_delay_factor: 0.1 on NVLink (aggressive scheduling) and scheduler_delay_factor: 0.5 on PCIe (hide communication latency). This single flag can be worth a 15-20% throughput difference and no one documents it.
Check your understanding
You're deploying a 70B model on an 8-GPU PCIe 4.0 cluster. Your current config uses tensor_parallel_size: 8 and achieves 800 tokens/sec. You upgrade to 8-GPU NVLink and expect 4000 tokens/sec (5x). After upgrade, you measure only 1200 tokens/sec. What's the most likely cause, and what's the first config change you'd make?
Answer hint
Simply upgrading hardware doesn't trigger a better parallelism strategy. You must explicitly reconfigure tensor_parallel_size, pipeline_parallel_size, and the batching parameters to match NVLink's bandwidth. The PCIe config is still constraining you. tensor_parallel_size: 8 is already correct, and on 8 GPUs that implies pipeline_parallel_size: 1, so the first change is to increase max_num_batched_tokens from 4096→12288 to fully use NVLink. Measure again.