ONNX vs TensorRT comparison
VERDICT
| Tool | Key strength | Pricing | API access | Best for |
|---|---|---|---|---|
| ONNX | Model interoperability and portability | Free, open-source | Python, C++, C#, Java | Cross-framework model exchange |
| TensorRT | GPU-accelerated inference optimization | Free with NVIDIA GPUs | Python, C++ | High-performance NVIDIA GPU inference |
| ONNX Runtime | Cross-platform inference engine | Free, open-source | Python, C++, Java, .NET | Running ONNX models efficiently on CPU/GPU |
| TensorRT with ONNX | Optimized execution of ONNX models on NVIDIA GPUs | Free with NVIDIA GPUs | Python, C++ | Maximizing NVIDIA GPU inference speed |
Key differences
ONNX is a model format standard designed for interoperability between ML frameworks like PyTorch, TensorFlow, and others. It defines a common graph representation for models.
TensorRT is an NVIDIA SDK focused on optimizing and deploying deep learning models for inference on NVIDIA GPUs, providing layer fusion, precision calibration, and kernel auto-tuning.
While ONNX enables portability, TensorRT targets performance optimization on specific hardware.
Side-by-side example: Exporting and running an ONNX model
This example exports a PyTorch model to ONNX format and runs inference using ONNX Runtime.
```python
import torch
import onnxruntime as ort

# Define a simple PyTorch model that doubles its input
class SimpleModel(torch.nn.Module):
    def forward(self, x):
        return x * 2

model = SimpleModel()
model.eval()

# Dummy input used to trace the graph during export
x = torch.randn(1, 3)

# Export to ONNX
onnx_path = "simple_model.onnx"
torch.onnx.export(model, x, onnx_path, input_names=["input"], output_names=["output"])

# Run inference with ONNX Runtime
ort_session = ort.InferenceSession(onnx_path)
outputs = ort_session.run(None, {"input": x.numpy()})
print("ONNX Runtime output:", outputs[0])  # equals the input tensor doubled
```
TensorRT equivalent: Optimizing and running an ONNX model
This example loads the same ONNX model into TensorRT for GPU-accelerated inference.
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Parse the ONNX model into a TensorRT network; the ONNX parser requires
# an explicit-batch network
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("simple_model.onnx", "rb") as model_file:
    if not parser.parse(model_file.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Build the engine (TensorRT 8.x API; the older max_workspace_size and
# build_cuda_engine calls are removed in recent releases)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
serialized_engine = builder.build_serialized_network(network, config)
engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()

# Host buffers and matching device allocations
input_data = np.random.randn(1, 3).astype(np.float32)
output = np.empty((1, 3), dtype=np.float32)
d_input = cuda.mem_alloc(input_data.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
stream = cuda.Stream()

# Copy input to the GPU, run inference, copy the result back
cuda.memcpy_htod_async(d_input, input_data, stream)
context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
print("TensorRT output:", output)  # matches ONNX Runtime: input doubled, computed on the GPU
```
When to use each
Use ONNX when you need to export models from various frameworks and run them on multiple platforms or runtimes.
Use TensorRT when deploying models on NVIDIA GPUs where maximum inference speed and efficiency are critical.
| Scenario | Recommended tool |
|---|---|
| Cross-framework model sharing | ONNX |
| CPU or non-NVIDIA GPU inference | ONNX Runtime |
| High-performance NVIDIA GPU inference | TensorRT |
| Optimizing ONNX models for NVIDIA GPUs | TensorRT with ONNX |
Pricing and access
| Option | Free | Paid | API access |
|---|---|---|---|
| ONNX | Yes, fully open-source | No | Yes, via multiple language bindings |
| ONNX Runtime | Yes, open-source | No | Yes, Python/C++/Java/.NET APIs |
| TensorRT | Yes, free to use (proprietary NVIDIA SDK) | No | Yes, Python and C++ APIs |
| TensorRT with ONNX | Yes | No | Yes, via TensorRT's ONNX parser |
Key Takeaways
- ONNX standardizes model format for interoperability across ML frameworks.
- TensorRT specializes in optimizing and accelerating inference on NVIDIA GPUs.
- Use ONNX Runtime for flexible, cross-platform inference including CPU and GPU.
- Combine ONNX export with TensorRT for best NVIDIA GPU performance.
- Both tools are free to use; ONNX and ONNX Runtime are open-source, while TensorRT is a free but proprietary NVIDIA SDK, and all offer APIs for integration.