TensorRT-LLM vs Triton: which LLM serving backend should you use?
VERDICT
Use TensorRT-LLM if you need maximum inference speed and memory efficiency for LLMs with native CUDA kernels. Use Triton if you need a general-purpose inference server supporting multiple model types, frameworks, and easy multi-model orchestration.
Side-by-side comparison
| Feature | TensorRT-LLM | Triton | Winner |
|---|---|---|---|
| Primary Use Case | LLM inference only (NVIDIA GPUs) | Multi-model, multi-framework serving | Depends on workload |
| Throughput (7B model, A100) | ~3,000-5,000 tok/s | ~1,000-2,000 tok/s (via TRT backend) | TensorRT-LLM |
| Hardware Support | NVIDIA CUDA only (A100, H100, L4, L40S) | NVIDIA, AMD, Intel, CPU (multiple backends) | Triton |
| Setup Complexity | Moderate (C++/CUDA build required) | Moderate (config-driven, no code) | Triton |
| Multi-Model Orchestration | Single model per instance (tensor parallel) | Native support via ensemble models | Triton |
| Batching Strategy | Paged attention + continuous batching | Generic batching or per-backend | TensorRT-LLM |
| Memory Efficiency (7B Q8) | ~12GB VRAM | ~15-18GB VRAM | TensorRT-LLM |
| Framework Support | PyTorch, JAX (via export) | TensorFlow, PyTorch, ONNX, custom | Triton |
| OpenAI API Compatibility | Via `trtllm-serve` (OpenAI-compatible endpoints) | Requires wrapper/adapter | Tie |
| Production Maturity | Production-ready (2023+) | Mature (2019+) | Tie |
Performance benchmarks
Throughput: llama2-7b on single A100 (batch=32, 128 tokens)
TensorRT-LLM uses paged attention KV cache and quantization-aware optimizations. Triton result assumes TensorRT backend enabled; generic framework results are 50-70% lower.
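To make the "paged attention KV cache" idea concrete, here is a minimal sketch of block-based allocation. It is a toy model, not TensorRT-LLM's implementation: the class `PagedKVCache`, its fields, and the block size of 16 tokens are all hypothetical. The point is that memory is reserved in fixed-size blocks as a sequence grows, and the blocks return to a shared pool when the sequence finishes.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator: tracks which fixed-size blocks each sequence owns."""
    num_blocks: int
    block_size: int  # tokens per block (hypothetical value chosen by the caller)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> list of block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: int) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real server would queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):   # sequence 0 grows to 40 tokens and needs 3 blocks
    cache.append_token(0)
for _ in range(10):   # sequence 1 grows to 10 tokens and needs 1 block
    cache.append_token(1)
blocks_in_use = cache.num_blocks - len(cache.free_blocks)
cache.free(0)         # sequence 0 finishes; its blocks become reusable
```

Because a sequence only ever wastes part of its last block, short and long requests can share one pool without reserving the maximum context length up front.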
Time-to-first-token: llama2-7b (single request)
TensorRT-LLM's fused kernels and optimized memory layout reduce latency. Triton adds marshalling overhead across process boundaries.
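If you want to check time-to-first-token on your own deployment, a small timing harness is usually enough. The sketch below measures any iterator of tokens. `fake_token_stream` is a hypothetical stand-in that sleeps to mimic prefill and decode, so swap in your real streaming client; the delays are made-up values, not measurements of either server.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: sleeps to mimic prefill and decode."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, total_s, token_count) for any token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return (ttft if ttft is not None else float("nan")), total, count


ttft, total, count = measure_stream(fake_token_stream(5, prefill_s=0.05, per_token_s=0.01))
print(f"TTFT={ttft * 1000:.1f} ms, total={total * 1000:.1f} ms, tokens={count}")
```

Measuring from the client side includes network and serialization overhead, which is what matters when comparing an in-process runtime against a server reached over HTTP or gRPC.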
Memory per instance: llama2-7b (FP8 quantized)
TensorRT-LLM supports INT8/FP8 with automatic KV cache optimization. Triton's memory usage depends on backend framework selection.
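A back-of-the-envelope estimate helps sanity-check memory figures like these. The sketch below uses the published Llama 2 7B shape (32 layers, 32 KV heads, head dimension 128, roughly 6.74B parameters). The batch of 32 requests at 256 tokens each is an assumed workload, and the estimate ignores activations and runtime workspace, so real usage will be higher.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights at a given precision."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int) -> int:
    """KV cache size: one key and one value vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Llama 2 7B shape; the workload (32 requests x 256 tokens) is an assumption.
n_params = 6.74e9
tokens_in_flight = 32 * 256

weights_fp16_gib = weight_bytes(n_params, 2) / GIB
weights_fp8_gib = weight_bytes(n_params, 1) / GIB
kv_fp16_gib = kv_cache_bytes(32, 32, 128, tokens_in_flight, 2) / GIB
kv_fp8_gib = kv_cache_bytes(32, 32, 128, tokens_in_flight, 1) / GIB

print(f"weights: FP16 {weights_fp16_gib:.1f} GiB, FP8 {weights_fp8_gib:.1f} GiB")
print(f"KV cache: FP16 {kv_fp16_gib:.1f} GiB, FP8 {kv_fp8_gib:.1f} GiB")
```

At FP16 each token costs 512 KiB of KV cache for this model shape, so the cache, not the weights, tends to dominate as batch size and context length grow.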
Setup time: from model to serving API
TensorRT-LLM requires a compilation step and the CUDA toolkit. Triton uses a declarative `config.pbtxt` file (protobuf text format) per model, with no compilation step of its own.
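Triton's per-model configuration is a protobuf text file named `config.pbtxt`, stored in a model repository laid out as `<repository>/<model_name>/<version>/`. The sketch below writes that layout to a temporary directory. The tensor names mirror the client example later in this article, while the `python` backend value and the helper `write_model_repo` are illustrative assumptions; the files expected under the version directory depend on the backend you actually use.

```python
import tempfile
from pathlib import Path
from string import Template

# Illustrative config; the backend value and tensor names are assumptions.
CONFIG_TEMPLATE = Template("""\
name: "$name"
backend: "python"
max_batch_size: $max_batch_size
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
""")


def write_model_repo(root: Path, model_name: str, max_batch_size: int = 32) -> Path:
    """Create <root>/<model_name>/1/ and write config.pbtxt next to it."""
    model_dir = root / model_name
    (model_dir / "1").mkdir(parents=True, exist_ok=True)  # version directory
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(
        CONFIG_TEMPLATE.substitute(name=model_name, max_batch_size=max_batch_size)
    )
    return config_path


repo_root = Path(tempfile.mkdtemp())
config_path = write_model_repo(repo_root, "llama2_7b")
print(config_path.read_text())
```

With `max_batch_size` greater than zero, Triton treats the batch dimension as implicit, so `dims` describes a single request's shape.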
When to use each
TensorRT-LLM
- ✓ You are serving only large language models (Llama, Mistral, Qwen, etc.) and need maximum throughput per GPU: TensorRT-LLM's specialized kernels (paged attention, fused operations) deliver 3-10x higher tokens/sec than generic servers.
- ✓ You have memory constraints and need sub-10GB footprint for 7B models: TensorRT-LLM's FP8 quantization and KV cache optimization reduce memory usage by 40-50% versus framework-native inference.
- ✓ You're building a cost-sensitive SaaS where inference dollars directly impact margins: TensorRT-LLM's efficiency means fewer GPUs needed, reducing capex and power consumption.
- ✓ You need tensor parallelism across 2-4 GPUs on a single node for larger models (13B-70B): TensorRT-LLM has native support for fast all-reduce and overlapped communication.
- ✓ You want an OpenAI-compatible API served directly from TensorRT-LLM: the `trtllm-serve` command exposes OpenAI-style endpoints such as /v1/chat/completions. Note that vLLM is a separate inference engine and does not run on TensorRT-LLM.
Triton
- ✓ You're serving multiple model types (LLMs + computer vision + audio classification) on the same cluster: Triton's unified platform handles ensemble inference and cross-framework orchestration without custom code.
- ✓ You need to serve models from TensorFlow, JAX, ONNX, or custom C++ backends alongside PyTorch: Triton abstracts framework differences, letting teams deploy without rebuilding infrastructure for each new model type.
- ✓ Your team prefers declarative configuration over writing code: Triton's model repository, with one `config.pbtxt` per model, is infrastructure-as-code friendly and integrates naturally with Kubernetes deployments.
- ✓ You're running on a heterogeneous cluster (NVIDIA, AMD, Intel, CPU instances): Triton's pluggable backend architecture supports multiple hardware vendors without special compilation.
- ✓ You need advanced features like A/B testing, shadow serving, or dynamic model loading without restarting: Triton's model control API and ensemble feature set support complex serving patterns.
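As an illustration of dynamic model loading, Triton's model repository extension exposes HTTP endpoints of the form `POST /v2/repository/models/<name>/load` and `/unload`, available when the server runs in explicit model control mode. The sketch below only builds the request, so it runs without a server. The helper name `model_control_request` is hypothetical, and the host, port, and model name are placeholders.

```python
import json
import urllib.request


def model_control_request(base_url: str, model_name: str, action: str) -> urllib.request.Request:
    """Build (but do not send) a load/unload request for Triton's repository API."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/{action}"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = model_control_request("http://localhost:8000", "llama2_7b", "load")
print(req.get_method(), req.full_url)
# Against a running server, send it with: urllib.request.urlopen(req)
```

The official `tritonclient` package wraps the same endpoints, so in practice you would likely call its client methods rather than build requests by hand.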
Common misconceptions
TensorRT-LLM
TensorRT-LLM is a drop-in replacement for vLLM: I can just swap the engine and get 10x speedup.
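TensorRT-LLM's engine workflow requires converting the checkpoint to the TensorRT-LLM format and compiling it into a TensorRT engine first (30+ min per model); ONNX is not part of this path. With that workflow you can't simply point it at arbitrary HuggingFace checkpoints the way you can with vLLM, and only supported architectures will build. The speedup is real, but setup time and model format constraints are non-trivial.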
TensorRT-LLM works on any NVIDIA GPU (RTX 4090, V100, A100).
TensorRT-LLM's paged attention and specialized kernels are optimized for Ampere and newer GPUs (A100, H100, L4, L40S). RTX 40-series works but lacks optimal fusions. Older Volta/Turing GPUs fall back to slower kernels. Check the hardware compatibility matrix before committing.
If I use TensorRT-LLM, I get automatic quantization and FP8 support.
Quantization must be baked into the engine during compilation. Either the model was trained with quantization-aware training (QAT), or you apply post-training quantization (PTQ) using calibration data. This is not automatic: it requires experimenting with calibration datasets and checking for accuracy drift.
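To see why the calibration dataset matters, consider a toy symmetric INT8 quantizer with absmax calibration. This is a plain-Python illustration of the general PTQ idea, not TensorRT-LLM's quantization code; all function names and sample values are made up for the example.

```python
from typing import List, Sequence


def calibrate_scale(calibration: Sequence[float], qmax: int = 127) -> float:
    """Absmax calibration: map the largest observed magnitude to qmax."""
    amax = max(abs(x) for x in calibration)
    return amax / qmax if amax > 0 else 1.0


def quantize(values: Sequence[float], scale: float, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clip to the INT8 range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q: Sequence[int], scale: float) -> List[float]:
    return [x * scale for x in q]


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


calibration = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]   # values seen during calibration
scale = calibrate_scale(calibration)

in_range = [0.1, -0.7, 0.9]    # resembles the calibration data
outlier = [0.1, -0.7, 3.0]     # 3.0 was never seen, so it gets clipped

err_in = mean_abs_error(in_range, dequantize(quantize(in_range, scale), scale))
err_out = mean_abs_error(outlier, dequantize(quantize(outlier, scale), scale))
print(f"in-range error: {err_in:.4f}, with outlier: {err_out:.4f}")
```

Values inside the calibrated range lose at most half a quantization step, while values outside it are clipped. That clipping is the accuracy drift you are checking for when calibration data does not resemble production traffic.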
Triton
Triton Inference Server is specifically designed for LLM serving, like TensorRT-LLM.
Triton is a general-purpose inference server built for latency-critical ML serving. Its default batching and scheduling assume small, uniform-latency models (image classification, NLP tagging). For LLMs with token-by-token generation, you need either Triton's decoupled mode plus custom batching logic, or the TensorRT-LLM backend, which handles in-flight batching itself.
If I put a model in Triton, it automatically scales to 1000s of concurrent users.
Triton's default batching works well for request-in/response-out workflows. For streaming LLM responses (token-by-token), you must enable the decoupled API and manage streaming batches yourself. Without this, you'll see high latency and poor concurrency at scale.
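The gap between request-level batching and the iteration-level (continuous) batching used by LLM-specific runtimes can be shown with a small simulation. This is a toy model under simplifying assumptions: one decode step produces one token for every active sequence, and the sequence lengths are invented. It does not model either server's actual scheduler.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes; other slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished sequence's slot is refilled from the queue on the next step."""
    queue = deque(lengths)
    active: List[int] = []   # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [r - 1 for r in active if r > 1]  # drop sequences that just finished
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

With mixed output lengths, static batches are held hostage by their longest member, while continuous batching lets short requests finish and free their slots for waiting work.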
Triton's Python backend is just as fast as C++/CUDA backends.
The Python backend adds 50-200ms of per-request overhead due to the GIL and Python-C++ marshalling. For high-throughput LLM serving, always use C++, ONNX, or TensorRT backends. The Python backend is fine for orchestration logic, not inference computation.
Code examples
Task: Load a quantized Llama2-7B model and generate 50 tokens from a prompt using native TensorRT-LLM API.
```python
import torch
from tensorrt_llm.runtime import ModelRunner

# TensorRT-LLM requires a pre-compiled engine directory (./llama2_7b_engine/).
runner = ModelRunner.from_dir(
    engine_dir='./llama2_7b_engine',
    lora_dir=None,
    rank=0,  # single-GPU run: rank 0 maps to GPU 0
)

prompt = 'What is machine learning?'
# Placeholder token IDs; in practice, encode `prompt` with the model's tokenizer.
batch_input_ids = [torch.tensor([1, 450, 1001, 9632], dtype=torch.int32)]

# TensorRT-LLM uses native CUDA kernels with paged attention.
output = runner.generate(
    batch_input_ids,  # list of 1-D token tensors, one per request
    max_new_tokens=50,
    end_id=2,  # Llama 2 end-of-sequence token ID
    pad_id=2,
    top_k=40,
    top_p=0.9,
    temperature=0.8,
)
# Output is indexed as [request, beam, token]; print the first beam of the first request.
print(output[0, 0].tolist())
```

TensorRT-LLM runs the model as compiled CUDA kernels instead of PyTorch operators, which is why it's 3-10x faster. The trade-off: you must pre-compile engines, and inputs and outputs are passed as torch tensors.
```python
import numpy as np
import tritonclient.http as httpclient

# Triton server running on localhost:8000 (tritonserver --model-repository ./models)
client = httpclient.InferenceServerClient(url='localhost:8000')

prompt = 'What is machine learning?'
# Placeholder token IDs; in practice, encode `prompt` with the model's tokenizer.
input_ids = np.array([[1, 450, 1001, 9632]], dtype=np.int32)
max_tokens = np.array([[50]], dtype=np.int32)

# Tensor names, shapes, and dtypes must match the model's config.pbtxt.
inputs = [
    httpclient.InferInput('input_ids', list(input_ids.shape), 'INT32'),
    httpclient.InferInput('max_tokens', list(max_tokens.shape), 'INT32'),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(max_tokens)

# Triton abstracts the backend (PyTorch, TensorRT, ONNX) via config.pbtxt.
# This is a blocking request/response call; streaming LLM output needs a decoupled model.
response = client.infer(
    model_name='llama2_7b',
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput('output_ids')],
)
output = response.as_numpy('output_ids')
print(output)
```

Triton adds an abstraction layer (gRPC/HTTP marshalling) over the inference backend. This generality supports multi-framework serving but introduces latency overhead. The model config drives behavior, not code.
Migration path
- Switching from Triton (generic backend) to TensorRT-LLM:
  - Convert your Llama model from the HuggingFace format to a TensorRT-LLM checkpoint (not ONNX) with the `convert_checkpoint.py` script from the TensorRT-LLM Llama example (its path varies by release): `python convert_checkpoint.py --model_dir ./llama2-7b-hf --output_dir ./llama2_ckpt --dtype float16`.
  - Compile to a TRT engine: `trtllm-build --checkpoint_dir ./llama2_ckpt --output_dir ./llama2_engine --max_batch_size 32`.
  - Replace Triton client code: instead of `httpclient.infer()`, use `ModelRunner.from_dir()` and `runner.generate()`.
  - Update config: swap from Triton's model_repository/config.pbtxt to TensorRT-LLM's engine directory.
- Switching from TensorRT-LLM to Triton (add multi-model support):
  - Copy your compiled TRT engine into Triton's model_repository structure.
  - Write config.pbtxt declaring inputs/outputs and backend type.
  - Replace the `ModelRunner` Python client with `httpclient.InferenceServerClient`.
  - Gain multi-model orchestration and framework flexibility, but accept 2-3x lower single-model throughput.
RECOMMENDATION