vLLM vs TensorRT-LLM: GPU Serving Speed vs Inference Optimization
Use vLLM if you want fast setup with broad model support and OpenAI-compatible API. Use TensorRT-LLM if you need maximum throughput and are willing to invest in model optimization and compilation.
VERDICT
Side-by-side comparison
| Dimension | vLLM | TensorRT-LLM | Winner |
|---|---|---|---|
| Throughput (7B model, A100) | ~1,500–2,500 tokens/sec | ~3,000–4,500 tokens/sec | tensorrt-llm |
| Time to first token (7B) | ~80–120ms | ~50–80ms | tensorrt-llm |
| Setup complexity | pip install vllm: runs immediately | Requires model compilation, NVIDIA TensorRT toolkit | vllm |
| Model compatibility | 500+ HuggingFace models (Llama, Mistral, Qwen, etc.) | Selected models (Llama, GPT, Falcon, Qwen: community support) | vllm |
| API compatibility | OpenAI-compatible /v1/chat/completions | HTTP API + Python SDK (not OpenAI-compatible) | vllm |
| Quantization support | GPTQ, AWQ, INT8, bfloat16: auto-loading | INT4, INT8, FP8: requires TensorRT conversion | vllm |
| Multi-GPU support | Tensor parallelism and pipeline parallelism built-in | Tensor parallelism (pipeline still experimental) | vllm |
| License | Apache 2.0 | Apache 2.0 | Tie |
| GPU requirement | NVIDIA CUDA 11.8+ or AMD ROCm | NVIDIA CUDA 12.0+, NVIDIA GPUs only | vllm |
| Customization flexibility | High: modify Python code, add custom operators easily | Low: locked into TensorRT engine, difficult to extend | vllm |
Performance benchmarks
Throughput (Llama 2 7B, batch=32, A100 80GB)
vLLM uses continuous batching; TensorRT-LLM uses static batch compilation. TensorRT-LLM gains 75% throughput advantage but requires pre-compilation of batch sizes.
Time to first token (7B model, single request)
TensorRT-LLM prefixes optimization reduces latency. vLLM prioritizes concurrent user throughput over single-request latency.
Model compilation time (7B model)
vLLM loads models directly from HuggingFace. TensorRT-LLM requires offline compilation step before serving.
Memory footprint (7B model, fp16)
Both are comparable; TensorRT-LLM slightly smaller due to kernel fusion.
When to use each
- ✓ Serving a diverse set of models (Llama, Mistral, Qwen, Phi, etc.) without model-specific tuning: vLLM's auto-loading handles quantization and architecture differences.
- ✓ Need OpenAI-compatible /v1/chat/completions endpoint for zero client-side changes: vLLM is a drop-in replacement for existing OpenAI integrations.
- ✓ Rapid iteration and A/B testing: vLLM loads new models in seconds; TensorRT-LLM requires 5–15 min recompilation.
- ✓ Handling variable batch sizes and unpredictable traffic patterns: vLLM's continuous batching scales elegantly; TensorRT-LLM requires static batch pre-compilation.
- ✓ Limited NVIDIA expertise or small team: vLLM requires zero CUDA knowledge; TensorRT-LLM assumes TensorRT proficiency.
- ✓ Fixed production workload with known batch sizes and throughput SLA: TensorRT-LLM's 3,000–4,500 tok/sec crushes vLLM's 1,500–2,500 tok/sec for latency-sensitive applications.
- ✓ Extreme cost efficiency at scale (e.g., serving 100k req/sec): TensorRT-LLM's throughput per GPU means fewer GPUs needed, offsetting compilation overhead.
- ✓ Inference-only deployments with no model updates: once compiled, TensorRT-LLM is locked and stable; no risk of model loading errors.
- ✓ Specialized optimization for latency: TensorRT-LLM's prefix tuning and layer fusion optimize time-to-first-token critical for chat applications.
- ✓ You have NVIDIA expertise in-house: TensorRT-LLM's configuration space is deep; benefits compound with TensorRT knowledge.
Common misconceptions
vLLM
vLLM is slower than native CUDA inference.
vLLM's continuous batching overhead is minimal (~50–100ms vs raw CUDA). For concurrent users, vLLM's batching actually wins: single requests don't block others, improving overall system throughput by 2–3x.
vLLM only works with small models on a single GPU.
vLLM supports tensor parallelism and pipeline parallelism across multiple GPUs. A 70B model can be sharded across 4×A100s with near-linear scaling; vLLM handles the complexity automatically.
vLLM requires extensive configuration to run.
vLLM works out of the box: `from vllm import LLM; llm = LLM('meta-llama/Llama-2-7b-hf')`. No CUDA code, no compilation, no tuning: it just works.
TensorRT-LLM
TensorRT-LLM gives you 10x faster inference than vLLM.
TensorRT-LLM is typically 1.5–2.5x faster per request, not 10x. The 75% throughput advantage only applies at high batch sizes. Single requests see modest gains (60ms vs 100ms).
TensorRT-LLM works with any HuggingFace model.
TensorRT-LLM requires explicit community support or custom conversion. Llama, Falcon, Qwen are supported; most niche or new models aren't. You may spend days writing conversion scripts.
TensorRT-LLM compilation is a one-time cost: compile once and forget.
You must pre-compile for each batch size you plan to serve. Batch=32 and batch=64 are different engines; changing batch size requires recompilation. Variable batch traffic is painful.
Code examples
Task: Load a 7B model from HuggingFace and run a single inference request.
from vllm import LLM, SamplingParams
# vLLM loads directly from HuggingFace: no compilation needed
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.9)
prompt = "What is the capital of France?"
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
outputs = llm.generate([prompt], sampling_params)
for output in outputs:
print(output.outputs[0].text) vLLM loads and serves models immediately with zero compilation. The generate() call returns within 100–150ms for a 7B model on A100.
from tensorrt_llm.runtime import ModelRunner
import tensorrt_llm
# TensorRT-LLM requires offline compilation first (done separately)
# trtllm-build --model_dir meta-llama/Llama-2-7b --output_dir ./trt_model --batch_size 1
runner = ModelRunner.from_dir(engine_dir="./trt_model", rank=0, debug_mode=False)
prompt = "What is the capital of France?"
input_ids = [6, 524, 16, 279, 4386, 310, 9244, 29973] # pre-tokenized
output = runner.generate(input_ids, max_new_tokens=100, temperature=0.7)
print(output[0].text) TensorRT-LLM requires offline compilation before serving (5–15 min setup). Inference is faster (60–80ms vs 100ms), but the compiled engine is locked to a specific batch size and model version.
Migration path
- Switching from vLLM to TensorRT-LLM:
- Compile your model offline using trtllm-build (one-time, 5–15 min).
- Replace `from vllm import LLM` with `from tensorrt_llm.runtime import ModelRunner`.
- Load the compiled engine: `runner = ModelRunner.from_dir('./trt_model')` instead of `llm = LLM(model='...')`.
- Replace `llm.generate(prompts, sampling_params)` with `runner.generate(tokenized_input, max_new_tokens=...)`.
- Pre-tokenize inputs: TensorRT-LLM doesn't include a tokenizer in the runtime (use HuggingFace transformers separately).
- Hardcode batch sizes: you can't change batch size after compilation; recompile for each target batch. Reverse (TensorRT-LLM → vLLM): Remove all compilation steps, replace ModelRunner with LLM, use auto-tokenization. vLLM is simpler; TensorRT-LLM is faster but rigid.
RECOMMENDATION