Verifying ONNX output
Why this matters
ONNX export converts your model into a static graph, dropping PyTorch's dynamic features, and the runtime's operator implementations may differ numerically from PyTorch's. Verification catches silent failures: your model runs, produces different answers, and you never know. This is especially critical in inference pipelines where ONNX is used for serving.
Explanation
What it is: ONNX verification compares the numerical outputs of your original PyTorch model against those of the exported ONNX model running under ONNX Runtime. Both receive identical inputs, so their outputs should be numerically close, but they aren't always.
How it works: You generate test inputs and run them through both the PyTorch model (in eval mode, with gradients disabled) and the ONNX model (via onnxruntime.InferenceSession). You then compute the numerical differences: maximum absolute error, maximum relative error, or an allclose check with explicit tolerances. Small differences are expected, because the two runtimes use different kernels and may fuse operations or accumulate floating-point sums in a different order. Larger, silent precision loss tends to appear at opset boundaries: for example, where an operator's definition changes between opset versions, or where an implementation uses an approximation for a transcendental function.
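The comparison step described above can be sketched as a small NumPy helper. This is a minimal illustration, not a standard API: the name `compare_outputs`, its default tolerances, and the stand-in arrays are assumptions chosen to mirror the thresholds used later in this lesson.

```python
import numpy as np

def compare_outputs(reference, candidate, atol=1e-4, rtol=1e-3):
    """Summarize how far `candidate` drifts from `reference` (hypothetical helper)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    abs_err = np.abs(reference - candidate)
    # Small epsilon keeps the division finite when the reference is exactly zero.
    rel_err = abs_err / (np.abs(reference) + 1e-8)
    return {
        "max_abs": float(abs_err.max()),
        "max_rel": float(rel_err.max()),
        # np.allclose(a, b) tests |a - b| <= atol + rtol * |b|, so the
        # reference goes second to scale the tolerance by its magnitude.
        "passed": bool(np.allclose(candidate, reference, atol=atol, rtol=rtol)),
    }

# Stand-in arrays; in practice these would be the PyTorch and ONNX Runtime outputs.
reference = np.array([[0.5, -1.25, 3.0]], dtype=np.float32)
close = reference + 1e-6
far = reference + 1e-2
print(compare_outputs(reference, close))
print(compare_outputs(reference, far))
```

Returning the raw maxima alongside the pass/fail flag makes it easier to see how close a passing model is to the tolerance, rather than only whether it crossed it.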
When to use it: Always before deploying an ONNX model to inference. Test with representative data and edge cases (extreme values, batch sizes different from training, variable sequence lengths if applicable). Misaligned outputs here will cause runtime surprises in production.
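One way to assemble the edge cases mentioned above is a small input generator. This is a sketch under assumptions: the function name `make_test_inputs`, the scale factors `1e4` and `1e-6`, and the batch sizes are illustrative choices, and the feature dimension of 10 simply matches the example model used in this lesson.

```python
import numpy as np

def make_test_inputs(feature_dim=10, batch_sizes=(1, 2, 32), seed=0):
    """Build float32 inputs: typical values plus a few edge cases (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    inputs = []
    for batch in batch_sizes:
        typical = rng.standard_normal((batch, feature_dim))
        inputs.append(("typical", typical))
        inputs.append(("large", typical * 1e4))    # extreme magnitudes
        inputs.append(("tiny", typical * 1e-6))    # values near zero
        inputs.append(("zeros", np.zeros((batch, feature_dim))))
    # Cast once at the end so every case matches a float32 export.
    return [(name, arr.astype(np.float32)) for name, arr in inputs]

cases = make_test_inputs()
for name, arr in cases[:4]:
    print(name, arr.shape, arr.dtype)
```

Random data like this only complements representative samples from your real input distribution; it does not replace them.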
Analogy
It's like translating a recipe from one language to another. The ingredients list translates, the steps translate, but if you don't cook it both ways and taste the results side by side, you might serve something that looks right but tastes wrong.
Code
import torch
import torch.nn as nn
import onnx
import onnxruntime as ort
import numpy as np
from pathlib import Path

# A small two-layer network used as the model under test.
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 3)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNet()
model.eval()
dummy_input = torch.randn(1, 10)
onnx_path = "model.onnx"
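# Mark the batch dimension as dynamic; otherwise the graph is fixed to the
# dummy input's batch size of 1 and ONNX Runtime rejects the larger batches below.
torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    input_names=["input"],
    output_names=["output"],
    opset_version=14,
    do_constant_folding=True,
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)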
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
# Several batch sizes, not just the one used for export.
test_inputs = [
    torch.randn(1, 10),
    torch.randn(2, 10),
    torch.randn(4, 10),
]
max_abs_error = 0.0
max_rel_error = 0.0
with torch.no_grad():
    for i, test_input in enumerate(test_inputs):
        # Run the same input through both runtimes.
        pytorch_output = model(test_input).numpy()
        onnx_input = {"input": test_input.numpy().astype(np.float32)}
        onnx_output = sess.run(None, onnx_input)[0]

        # Worst-case element-wise errors for this input.
        abs_error = np.abs(pytorch_output - onnx_output).max()
        rel_error = np.abs((pytorch_output - onnx_output) / (np.abs(pytorch_output) + 1e-8)).max()
        max_abs_error = max(max_abs_error, abs_error)
        max_rel_error = max(max_rel_error, rel_error)
        print(f"Test {i+1} | Shape: {test_input.shape} | Abs Error: {abs_error:.2e} | Rel Error: {rel_error:.2e}")

        if not np.allclose(pytorch_output, onnx_output, atol=1e-4, rtol=1e-3):
            print(f" ⚠️ DIVERGENCE DETECTED at test {i+1}")
            print(f" PyTorch sample: {pytorch_output[0, :3]}")
            print(f" ONNX sample: {onnx_output[0, :3]}")

print(f"\nMax absolute error across all tests: {max_abs_error:.2e}")
print(f"Max relative error across all tests: {max_rel_error:.2e}")
if max_abs_error < 1e-4 and max_rel_error < 1e-3:
    print("✓ Outputs verified: safe for production deployment")
else:
    print("✗ Outputs diverged: investigate opset or numerical issues")

Path(onnx_path).unlink()
print("\nONNX model cleaned up.")

Output:
Test 1 | Shape: torch.Size([1, 10]) | Abs Error: 2.98e-06 | Rel Error: 1.45e-05
Test 2 | Shape: torch.Size([2, 10]) | Abs Error: 4.21e-06 | Rel Error: 2.11e-05
Test 3 | Shape: torch.Size([4, 10]) | Abs Error: 3.89e-06 | Rel Error: 1.98e-05

Max absolute error across all tests: 4.21e-06
Max relative error across all tests: 2.11e-05
✓ Outputs verified: safe for production deployment

ONNX model cleaned up.
What just happened?
We created a simple PyTorch model, exported it to ONNX with opset 14, then ran several test inputs with different batch sizes through both the original PyTorch model and the ONNX Runtime session. For each test, we computed the maximum absolute and relative errors between the outputs. The errors stayed below our thresholds (absolute 1e-4, relative 1e-3), so the model was cleared for deployment. The ONNX file was deleted at the end.
Common gotcha
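Mixing float64 and float32 between the two sides causes type mismatches that are easy to miss. NumPy creates float64 arrays by default, and ONNX Runtime does not cast inputs for you: feeding a float64 array to a model exported with float32 inputs is rejected with an invalid-argument error. If you cast only the ONNX input while PyTorch still runs in float64, the precision loss happens before the comparison and shows up as spurious error. Always convert test inputs to the dtype the model was exported with, typically float32, and feed the same values to both sides. Also, a model left in `model.train()` mode behaves differently (dropout is active, batchnorm uses batch statistics), so both the exported graph and the PyTorch reference outputs can change. Always call `model.eval()` before exporting and before comparing.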
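A minimal sketch of making the input cast explicit rather than implicit. The helper name `cast_for_session` and the partial `ORT_TO_NUMPY` table are assumptions for illustration; with a real session, the type string would typically come from `sess.get_inputs()[0].type`, which reports values such as `tensor(float)`.

```python
import numpy as np

# Partial, illustrative mapping from ONNX Runtime type strings to NumPy dtypes.
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(double)": np.float64,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
}

def cast_for_session(array, ort_type):
    """Cast `array` to the dtype the model input declares (hypothetical helper)."""
    expected = ORT_TO_NUMPY.get(ort_type)
    if expected is None:
        raise ValueError(f"unhandled ONNX type: {ort_type}")
    array = np.asarray(array)
    if array.dtype != expected:
        # Make the conversion visible instead of letting it happen silently.
        print(f"casting {array.dtype} -> {np.dtype(expected)}")
    return array.astype(expected, copy=False)

x = np.random.default_rng(0).standard_normal((2, 10))  # float64 by default
x32 = cast_for_session(x, "tensor(float)")
print(x32.dtype)
```

Feeding the cast array to both the PyTorch model and the ONNX session keeps the comparison about the export itself, not about a dtype difference in the test harness.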
Error recovery
RuntimeError: Could not find an implementation for Div(13) node with name
AssertionError from np.allclose()
ValueError: input_name not found in onnx model
Experienced dev note
The most expensive ONNX failure is the one you never catch. A model that silently produces 5% different outputs will break downstream ML pipelines, and you'll chase data issues for weeks. Always verify before merging. Relative error is also more honest than absolute error for regression tasks, where output magnitudes vary widely and a fixed absolute tolerance is too loose for small values and too strict for large ones. And batch size matters: test several. ONNX Runtime may apply different optimizations at different batch sizes, so you might pass verification at batch=1 but fail at batch=128.
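The batch-size advice can be sketched as a sweep over two callables. Everything here is a stand-in: `sweep_batch_sizes` is a hypothetical helper, and the float64 and float16 NumPy functions merely imitate a reference model and a lower-precision runtime. In practice the two callables would wrap the PyTorch model and the ONNX Runtime session.

```python
import numpy as np

def sweep_batch_sizes(reference_fn, candidate_fn, feature_dim, batch_sizes, seed=0):
    """Return {batch_size: max absolute error} between two callables."""
    rng = np.random.default_rng(seed)
    report = {}
    for batch in batch_sizes:
        x = rng.standard_normal((batch, feature_dim)).astype(np.float32)
        ref = np.asarray(reference_fn(x), dtype=np.float64)
        out = np.asarray(candidate_fn(x), dtype=np.float64)
        report[batch] = float(np.abs(ref - out).max())
    return report

# Stand-ins: the same linear layer evaluated in float64 and in float16.
weights = np.random.default_rng(1).standard_normal((10, 3))
reference_fn = lambda x: x.astype(np.float64) @ weights
candidate_fn = lambda x: x.astype(np.float16) @ weights.astype(np.float16)

report = sweep_batch_sizes(reference_fn, candidate_fn, 10, (1, 8, 128))
for batch, err in report.items():
    print(f"batch={batch:<4d} max abs error={err:.2e}")
```

Reporting the error per batch size, rather than one global maximum, shows whether divergence grows with batch size or is confined to a single configuration.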
Check your understanding
If your verification passes with atol=1e-4 at batch size 1, could the same model fail at batch size 32 with a different provider (e.g., switching from CPU to CUDA)? What would you check first, and why?
Answer hint
A correct answer recognizes that (1) batch size can expose numerical issues in batchnorm or layer norm operators, (2) different execution providers (CUDA, TensorRT) use different implementations and have different precision, and (3) you must test with the actual provider you'll use in production: CPU verification is not sufficient if you're deploying on CUDA.