Comparison intermediate · 7 min read

vLLM vs TensorRT-LLM: GPU Serving Speed vs Inference Optimization

Quick pick

Use vLLM if you want fast setup with broad model support and OpenAI-compatible API. Use TensorRT-LLM if you need maximum throughput and are willing to invest in model optimization and compilation.

VERDICT

Use vLLM for production serving to multiple concurrent users with minimal setup: it achieves 1,500–2,500 tokens/sec on A100 with zero compilation overhead. Use TensorRT-LLM when you need absolute maximum throughput (3,000–4,500 tokens/sec) and can afford model compilation time and NVIDIA-specific optimization. vLLM wins on ease and flexibility; TensorRT-LLM wins on peak performance for fixed workloads.

Side-by-side comparison

Dimension	vLLM	TensorRT-LLM	Winner
Throughput (7B model, A100)	~1,500–2,500 tokens/sec	~3,000–4,500 tokens/sec	tensorrt-llm
Time to first token (7B)	~80–120ms	~50–80ms	tensorrt-llm
Setup complexity	pip install vllm: runs immediately	Requires model compilation, NVIDIA TensorRT toolkit	vllm
Model compatibility	500+ HuggingFace models (Llama, Mistral, Qwen, etc.)	Selected models (Llama, GPT, Falcon, Qwen: community support)	vllm
API compatibility	OpenAI-compatible /v1/chat/completions	HTTP API + Python SDK (not OpenAI-compatible)	vllm
Quantization support	GPTQ, AWQ, INT8, bfloat16: auto-loading	INT4, INT8, FP8: requires TensorRT conversion	vllm
Multi-GPU support	Tensor parallelism and pipeline parallelism built-in	Tensor parallelism (pipeline still experimental)	vllm
License	Apache 2.0	Apache 2.0	Tie
GPU requirement	NVIDIA CUDA 11.8+ or AMD ROCm	NVIDIA CUDA 12.0+, NVIDIA GPUs only	vllm
Customization flexibility	High: modify Python code, add custom operators easily	Low: locked into TensorRT engine, difficult to extend	vllm

Performance benchmarks

Throughput (Llama 2 7B, batch=32, A100 80GB)

vLLM ~2,000 tokens/sec

TensorRT-LLM ~3,500 tokens/sec

vLLM uses continuous batching; TensorRT-LLM uses static batch compilation. TensorRT-LLM gains 75% throughput advantage but requires pre-compilation of batch sizes.

Time to first token (7B model, single request)

vLLM ~100ms (with vLLM overhead)

TensorRT-LLM ~60ms (optimized engine)

TensorRT-LLM prefixes optimization reduces latency. vLLM prioritizes concurrent user throughput over single-request latency.

Model compilation time (7B model)

vLLM 0 seconds (no compilation)

TensorRT-LLM 5–15 minutes (one-time, per batch size)

vLLM loads models directly from HuggingFace. TensorRT-LLM requires offline compilation step before serving.

Memory footprint (7B model, fp16)

vLLM ~16GB VRAM

TensorRT-LLM ~14GB VRAM (post-optimization)

Both are comparable; TensorRT-LLM slightly smaller due to kernel fusion.

When to use each

vLLM

✓ Serving a diverse set of models (Llama, Mistral, Qwen, Phi, etc.) without model-specific tuning: vLLM's auto-loading handles quantization and architecture differences.
✓ Need OpenAI-compatible /v1/chat/completions endpoint for zero client-side changes: vLLM is a drop-in replacement for existing OpenAI integrations.
✓ Rapid iteration and A/B testing: vLLM loads new models in seconds; TensorRT-LLM requires 5–15 min recompilation.
✓ Handling variable batch sizes and unpredictable traffic patterns: vLLM's continuous batching scales elegantly; TensorRT-LLM requires static batch pre-compilation.
✓ Limited NVIDIA expertise or small team: vLLM requires zero CUDA knowledge; TensorRT-LLM assumes TensorRT proficiency.

TensorRT-LLM

✓ Fixed production workload with known batch sizes and throughput SLA: TensorRT-LLM's 3,000–4,500 tok/sec crushes vLLM's 1,500–2,500 tok/sec for latency-sensitive applications.
✓ Extreme cost efficiency at scale (e.g., serving 100k req/sec): TensorRT-LLM's throughput per GPU means fewer GPUs needed, offsetting compilation overhead.
✓ Inference-only deployments with no model updates: once compiled, TensorRT-LLM is locked and stable; no risk of model loading errors.
✓ Specialized optimization for latency: TensorRT-LLM's prefix tuning and layer fusion optimize time-to-first-token critical for chat applications.
✓ You have NVIDIA expertise in-house: TensorRT-LLM's configuration space is deep; benefits compound with TensorRT knowledge.

Common misconceptions

vLLM

✗ vLLM is slower than native CUDA inference.

✓ vLLM's continuous batching overhead is minimal (~50–100ms vs raw CUDA). For concurrent users, vLLM's batching actually wins: single requests don't block others, improving overall system throughput by 2–3x.

✗ vLLM only works with small models on a single GPU.

✓ vLLM supports tensor parallelism and pipeline parallelism across multiple GPUs. A 70B model can be sharded across 4×A100s with near-linear scaling; vLLM handles the complexity automatically.

✗ vLLM requires extensive configuration to run.

✓ vLLM works out of the box: `from vllm import LLM; llm = LLM('meta-llama/Llama-2-7b-hf')`. No CUDA code, no compilation, no tuning: it just works.

TensorRT-LLM

✗ TensorRT-LLM gives you 10x faster inference than vLLM.

✓ TensorRT-LLM is typically 1.5–2.5x faster per request, not 10x. The 75% throughput advantage only applies at high batch sizes. Single requests see modest gains (60ms vs 100ms).

✗ TensorRT-LLM works with any HuggingFace model.

✓ TensorRT-LLM requires explicit community support or custom conversion. Llama, Falcon, Qwen are supported; most niche or new models aren't. You may spend days writing conversion scripts.

✗ TensorRT-LLM compilation is a one-time cost: compile once and forget.

✓ You must pre-compile for each batch size you plan to serve. Batch=32 and batch=64 are different engines; changing batch size requires recompilation. Variable batch traffic is painful.

Code examples

Task: Load a 7B model from HuggingFace and run a single inference request.

vLLM: basic inference

python

from vllm import LLM, SamplingParams

# vLLM loads directly from HuggingFace: no compilation needed
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.9)

prompt = "What is the capital of France?"
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

vLLM loads and serves models immediately with zero compilation. The generate() call returns within 100–150ms for a 7B model on A100.

TensorRT-LLM: basic inference

python

from tensorrt_llm.runtime import ModelRunner
import tensorrt_llm

# TensorRT-LLM requires offline compilation first (done separately)
# trtllm-build --model_dir meta-llama/Llama-2-7b --output_dir ./trt_model --batch_size 1

runner = ModelRunner.from_dir(engine_dir="./trt_model", rank=0, debug_mode=False)

prompt = "What is the capital of France?"
input_ids = [6, 524, 16, 279, 4386, 310, 9244, 29973]  # pre-tokenized

output = runner.generate(input_ids, max_new_tokens=100, temperature=0.7)
print(output[0].text)

TensorRT-LLM requires offline compilation before serving (5–15 min setup). Inference is faster (60–80ms vs 100ms), but the compiled engine is locked to a specific batch size and model version.

Migration path

Switching from vLLM to TensorRT-LLM:
Compile your model offline using trtllm-build (one-time, 5–15 min).
Replace `from vllm import LLM` with `from tensorrt_llm.runtime import ModelRunner`.
Load the compiled engine: `runner = ModelRunner.from_dir('./trt_model')` instead of `llm = LLM(model='...')`.
Replace `llm.generate(prompts, sampling_params)` with `runner.generate(tokenized_input, max_new_tokens=...)`.
Pre-tokenize inputs: TensorRT-LLM doesn't include a tokenizer in the runtime (use HuggingFace transformers separately).
Hardcode batch sizes: you can't change batch size after compilation; recompile for each target batch. Reverse (TensorRT-LLM → vLLM): Remove all compilation steps, replace ModelRunner with LLM, use auto-tokenization. vLLM is simpler; TensorRT-LLM is faster but rigid.

RECOMMENDATION

Use vLLM for production unless you have a fixed, high-throughput workload (>1,000 req/sec with known batch sizes). vLLM's ease and flexibility save weeks of tuning. Use TensorRT-LLM only if you've benchmarked and confirmed that 75% extra throughput justifies 5–15 min per model compilation and per-batch-size rigidity.

Verified 2026-04 · meta-llama/Llama-2-7b-hf

Verify ↗

Community Notes

No notes yetBe the first to share a version-specific fix or tip.