
NVLink vs PCIe performance

What you will learn

Understand how GPU interconnect topology affects vLLM throughput and when to upgrade from PCIe to NVLink.

Why this matters

In production, multi-GPU inference with vLLM on PCIe can bottleneck at 50-60% of theoretical throughput during distributed prefill and decode. NVLink clusters eliminate this bottleneck but cost 3-5x more. Choosing wrong wastes either compute budget or hardware investment.

Skip if: you're running single-GPU inference or batch sizes < 32; PCIe is sufficient. If your model fits on one GPU and latency < 500ms is acceptable, the upgrade cost rarely pays back.

Explanation

vLLM distributes token generation across GPUs using tensor parallelism and pipeline parallelism. Both require GPU-to-GPU communication during every forward pass: tensor parallelism all-reduces activations inside every layer, while pipeline parallelism passes activations between stages. A PCIe 4.0 x16 link provides ~32 GB/s per direction (~64 GB/s bidirectional); NVLink 4 (H100/H200) provides up to ~900 GB/s of aggregate bidirectional bandwidth per GPU, roughly 14x more. At large batch sizes or long sequence lengths, the communication-to-computation ratio favors NVLink. vLLM's --tensor-parallel-size and --pipeline-parallel-size flags determine how much inter-GPU traffic flows. The performance gap widens with larger models (70B+) and higher concurrency (>8 concurrent requests). For cost-benefit: PCIe clusters are viable for models up to 13B with batch ≤ 256; 70B+ models require NVLink for sub-second latencies at production scale.
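
To make the communication-to-computation argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a ring all-reduce, two all-reduces per transformer layer under tensor parallelism, fp16 activations, and peak per-direction link bandwidths (32 GB/s for PCIe 4.0 x16, 450 GB/s for NVLink 4). The function name and all numbers are illustrative assumptions, and real throughput depends on NCCL's algorithm choice, PCIe switch layout, and overlap with compute.

```python
def allreduce_seconds(num_tokens, hidden_size, num_layers, tp_size,
                      link_gb_per_s, bytes_per_elem=2, allreduces_per_layer=2):
    """Rough lower bound on all-reduce time for one forward pass.

    Assumes a ring all-reduce, where each GPU sends and receives
    2 * (tp_size - 1) / tp_size times the payload size.
    """
    if tp_size < 2:
        return 0.0  # a single GPU needs no all-reduce
    payload_bytes = num_tokens * hidden_size * bytes_per_elem
    wire_bytes = 2 * (tp_size - 1) / tp_size * payload_bytes
    seconds_per_allreduce = wire_bytes / (link_gb_per_s * 1e9)
    return seconds_per_allreduce * allreduces_per_layer * num_layers


if __name__ == "__main__":
    # Assumed shape of a 70B-class model: hidden size 8192, 80 layers.
    # 4096 batched tokens matches max_num_batched_tokens in the PCIe config.
    for name, bandwidth in [("PCIe 4.0 x16", 32.0), ("NVLink 4", 450.0)]:
        t = allreduce_seconds(num_tokens=4096, hidden_size=8192,
                              num_layers=80, tp_size=4,
                              link_gb_per_s=bandwidth)
        print(f"{name}: ~{t * 1000:.1f} ms of all-reduce per forward pass")
```

The estimate ignores latency and compute overlap, so treat it as a way to compare interconnects rather than to predict absolute throughput.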

Configuration

yaml

# vllm_config.yaml: configure tensor parallelism for PCIe vs NVLink

model: meta-llama/Llama-3.1-70B-Instruct

# PCIe 4.0 cluster (8x A100 40GB)
distributed_executor_backend: mp  # single-node multiprocessing; use ray for multi-node
tensor_parallel_size: 4  # 4 GPUs per tensor group avoids cross-switch communication
pipeline_parallel_size: 2  # split across 2 pipeline stages

# For NVLink H100 cluster, safely use all 8 GPUs in single tensor group
# tensor_parallel_size: 8
# pipeline_parallel_size: 1

gpu_memory_utilization: 0.85
max_num_seqs: 256
max_model_len: 2048

# PCIe: reduce batch size to avoid communication stalls
max_num_batched_tokens: 4096

# NVLink: can increase 2-3x without saturation
# max_num_batched_tokens: 12288

scheduler_delay_factor: 0.5  # optimize for latency on PCIe
load_format: auto

Why this order?

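YAML key order has no effect on how the file is parsed; the order above mirrors the order in which to make the decisions. Choose tensor_parallel_size first, because the parallel strategy determines how much of the model each GPU holds and therefore what gpu_memory_utilization is realistic. Pipeline parallelism stage count follows. Batch tuning (max_num_batched_tokens) is last because it depends on the memory and communication strategy above it.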

Wrong vs Right

Wrong way

yaml

# WRONG: Assume PCIe can handle same tensor parallelism as NVLink
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 8  # all 8 GPUs, PCIe interconnect
gpu_memory_utilization: 0.95
max_num_batched_tokens: 16384  # NVLink-scale batch size

# Result: 60% GPU utilization, 2-3x slower throughput due to communication bottleneck

Right way

yaml

# RIGHT: Match tensor parallelism strategy to interconnect bandwidth
# PCIe 4.0 cluster
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 4  # Limit to minimize cross-switch PCIe traffic
pipeline_parallel_size: 2  # Use pipeline to hide PCIe latency
gpu_memory_utilization: 0.85
max_num_batched_tokens: 6144  # Conservative batch size

# To transition to NVLink:
# 1. Change tensor_parallel_size: 8 (use all GPUs in single group)
# 2. Change pipeline_parallel_size: 1 (unnecessary with NVLink bandwidth)
# 3. Increase max_num_batched_tokens: 12288+ (saturate NVLink fully)
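
The two templates above can be captured in a small helper, sketched here in Python. The function name choose_parallelism, the pcie_tp_cap parameter, and the token budgets (taken from the configs above) are illustrative assumptions, not part of vLLM. The sketch only checks that the GPU count divides evenly; vLLM also requires the model's attention head count to be divisible by the tensor parallel size.

```python
def choose_parallelism(num_gpus, interconnect, pcie_tp_cap=4):
    """Pick tensor/pipeline parallel sizes following the two templates above.

    interconnect: "pcie" or "nvlink" (for example, from an audit with
    `nvidia-smi topo -m`).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if interconnect == "nvlink":
        tp = num_gpus            # one tensor-parallel group across all GPUs
        token_budget = 12288
    elif interconnect == "pcie":
        tp = min(pcie_tp_cap, num_gpus)
        while num_gpus % tp:     # shrink until tp divides the GPU count
            tp -= 1
        token_budget = 6144
    else:
        raise ValueError(f"unknown interconnect: {interconnect!r}")
    return {
        "tensor_parallel_size": tp,
        "pipeline_parallel_size": num_gpus // tp,
        "max_num_batched_tokens": token_budget,
    }


if __name__ == "__main__":
    print(choose_parallelism(8, "pcie"))
    print(choose_parallelism(8, "nvlink"))
```

Keeping the decision in one function makes it easy to review which template a deployment uses, and the returned keys match the YAML keys in the config file.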

Tool vitals

Primary command

bash

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8 --gpu-memory-utilization 0.9

Config file vllm_config.yaml

Verify

bash

curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","prompt":"test","max_tokens":10}' | jq '.usage'

Integration notes

vLLM's tensor parallelism strategy must align with your orchestration layer. Kubernetes GPU selectors cannot guarantee NVLink placement across pods: use node affinity together with a topology-aware GPU scheduler. Ray, which vLLM can use as its distributed executor, schedules by GPU count rather than by interconnect, so it does not guarantee NVLink-connected placement on its own; set --num-gpus explicitly when starting Ray workers (ray start --num-gpus) to avoid over-subscription. Always test with nvidia-smi topo -m first to verify your cluster's actual interconnect before deploying.
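
As a sketch of how the nvidia-smi topo -m check could be automated, the Python snippet below classifies the GPU-to-GPU matrix from captured output. The function name classify_topology and the sample text are illustrative assumptions. The parser assumes the usual layout (a header row of GPU columns, then one row per GPU with entries such as X, NV12, PIX, or SYS), and some driver versions add terminal escape codes that would need stripping first.

```python
import re

GPU_LABEL = re.compile(r"GPU\d+")


def classify_topology(topo_text):
    """Classify GPU-to-GPU links as "nvlink", "pcie", "mixed", or "unknown"."""
    num_gpus = 0
    links = []
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_LABEL.fullmatch(tokens[0]):
            continue  # skip NIC rows, legend lines, and blanks
        if num_gpus == 0:
            # The first GPU line is the column header: count the GPU columns.
            for token in tokens:
                if not GPU_LABEL.fullmatch(token):
                    break
                num_gpus += 1
            continue
        cells = tokens[1:1 + num_gpus]
        links.extend(cell for cell in cells if cell != "X")
    if not links:
        return "unknown"
    is_nvlink = [cell.startswith("NV") for cell in links]
    if all(is_nvlink):
        return "nvlink"
    if not any(is_nvlink):
        return "pcie"
    return "mixed"


if __name__ == "__main__":
    # Hypothetical two-GPU sample, not output captured from a real machine.
    sample = (
        "        GPU0    GPU1    CPU Affinity\n"
        "GPU0     X      NV12    0-31\n"
        "GPU1    NV12     X      0-31\n"
    )
    print(classify_topology(sample))
```

A "mixed" result usually means only some GPU pairs share NVLink, which is the case where limiting tensor_parallel_size to the NVLink-connected group matters most.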

Migration path
If moving from PCIe to NVLink: (1) on the old cluster, reduce tensor_parallel_size from 8 to 4 to establish a baseline; (2) measure tokens/sec as ground truth; (3) on the new NVLink cluster, set tensor_parallel_size: 8 and gradually increase max_num_batched_tokens until GPU utilization drops; (4) compare tokens/sec at the same batch size: NVLink should show a 3-5x gain. Rolling back requires only restoring the step 1 config; no code changes needed.
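
Step 4 is easy to get wrong by comparing runs at different batch sizes. A minimal Python sketch of a like-for-like comparison follows; the function name and the throughput figures are placeholders for illustration, not measurements.

```python
def speedup_at_shared_batches(pcie_tps, nvlink_tps):
    """Return {batch_size: nvlink / pcie} for batch sizes measured on both.

    Each argument maps max_num_batched_tokens to measured tokens/sec.
    """
    shared = sorted(set(pcie_tps) & set(nvlink_tps))
    return {
        batch: nvlink_tps[batch] / pcie_tps[batch]
        for batch in shared
        if pcie_tps[batch] > 0
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    pcie = {4096: 800.0, 6144: 900.0}
    nvlink = {4096: 3200.0, 12288: 5000.0}
    print(speedup_at_shared_batches(pcie, nvlink))
```

Only batch sizes measured on both clusters appear in the result, so a larger NVLink-only batch cannot inflate the reported gain.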
 
                             
Common gotcha
Setting tensor_parallel_size: 8 on a PCIe cluster with vLLM will not error: it will silently run at 40-60% GPU utilization. vLLM does not retune your parallelism settings to match the interconnect topology. The server starts successfully and requests complete, but throughput stays mysteriously low. Monitor per-GPU power and clocks with nvidia-smi dmon; if any GPU is consistently under-clocked, communication is likely saturated. For deeper debugging, VLLM_TRACE_FUNCTION=1 python -m vllm.entrypoints.openai.api_server writes a function-call trace to the logs, which can help show where time is spent waiting on inter-GPU communication; expect heavy log volume and a noticeable slowdown while it is enabled.
 
 
   
      
Team adoption

Start all new vLLM clusters with a 10-minute interconnect audit: nvidia-smi topo -m and nccl-tests/build/all_reduce_perf -b 100M -e 100M -g 8 2>&1 | grep -E 'busbw|Avg bus bandwidth' (nccl-tests reports algbw and busbw in GB/s; -g 8 assumes 8 GPUs on one node). Create a runbook with two config templates (PCIe and NVLink paths). In PR reviews, require a comment explaining which template applies and why. Most teams skip this and waste weeks tuning batch size on the wrong hardware.
 
 
Experienced dev note

The trick experienced vLLM operators use: run vllm serve --disable-log-stats --tensor-parallel-size 8 ... &, then immediately check watch -n 0.1 'nvidia-smi --query-gpu=index,power.draw,utilization.gpu --format=csv'. On PCIe, you'll see uneven power draw (some GPUs idle while others compute). On NVLink, all GPUs draw roughly equal power per batch: that's the signal you've tuned correctly. Set scheduler_delay_factor: 0.1 on NVLink (aggressive scheduling) and scheduler_delay_factor: 0.5 on PCIe (hide communication latency). This single flag is worth a 15-20% throughput difference, and it is rarely documented.
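
The "uneven power draw" signal can be turned into a number. The Python sketch below parses the output of nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits and reports the relative spread between the busiest and idlest GPU. The function name and any threshold you apply to the result are illustrative assumptions and should be calibrated on your own hardware.

```python
def power_imbalance(csv_text):
    """Return (max - min) / max of per-GPU power draw, in the range 0.0 to 1.0.

    Expects lines like "0, 285.43" as produced by
    nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits
    """
    watts = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2:
            continue
        try:
            watts.append(float(fields[1]))
        except ValueError:
            continue  # some GPUs report "[N/A]" for power.draw
    if not watts or max(watts) <= 0:
        return 0.0
    return (max(watts) - min(watts)) / max(watts)


if __name__ == "__main__":
    # Hypothetical readings, not captured from a real machine.
    uneven = "0, 300.0\n1, 150.0\n2, 290.0\n3, 160.0"
    print(f"imbalance: {power_imbalance(uneven):.2f}")
```

Sampling this several times under steady load gives a more reliable picture than a single reading, since power draw fluctuates between prefill and decode phases.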
 
Check your understanding

You're deploying a 70B model on an 8-GPU PCIe 4.0 cluster. Your current config uses tensor_parallel_size: 8 and achieves 800 tokens/sec. You upgrade to 8-GPU NVLink and expect 4000 tokens/sec (5x). After the upgrade, you measure only 1200 tokens/sec. What's the most likely cause, and what's the first config change you'd make?
Answer hint
Simply upgrading hardware doesn't trigger a better parallelism strategy. You must explicitly reconfigure tensor_parallel_size, pipeline_parallel_size, and batching parameters to match NVLink's bandwidth. The PCIe config is still constraining you. Keep tensor_parallel_size: 8 (already correct), confirm pipeline_parallel_size is 1, and increase max_num_batched_tokens from 4096 to 12288 to fully use NVLink. Measure again.
 
 
           