How to · Intermediate · 3 min read

How to optimize an ONNX model

Quick answer
Use the onnxruntime Python package together with onnxruntime-tools to apply graph optimizations and quantization to your ONNX model. Techniques such as operator fusion, constant folding, and INT8 quantization shrink the model file and speed up inference.

PREREQUISITES

  • Python 3.8+
  • pip install onnx onnxruntime onnxruntime-tools

Setup

Install the required packages. onnxruntime and onnxruntime-tools supply the graph-optimization and quantization utilities used below.

bash
pip install onnx onnxruntime onnxruntime-tools

Step by step

This example loads an existing ONNX model, applies graph optimizations, performs dynamic quantization, and saves the optimized model.

python
import onnx
# Note: this guide installs onnxruntime-tools; in newer onnxruntime
# releases the same optimizer also ships as onnxruntime.transformers.optimizer.
from onnxruntime_tools import optimizer
from onnxruntime.quantization import quantize_dynamic, QuantType

# Load and validate your ONNX model
model_path = "model.onnx"
model = onnx.load(model_path)
onnx.checker.check_model(model)

# Apply graph optimizations (operator fusion, constant folding, ...);
# pass the model_type that matches your architecture
optimized_model = optimizer.optimize_model(model_path, model_type='bert')
optimized_model.save_model_to_file("optimized_model.onnx")

# Apply dynamic quantization to reduce model size and improve CPU speed
quantized_model_path = "quantized_model.onnx"
quantize_dynamic(
    model_input="optimized_model.onnx",
    model_output=quantized_model_path,
    weight_type=QuantType.QInt8,
)

print("Optimized model saved to optimized_model.onnx")
print(f"Quantized model saved to {quantized_model_path}")
output
Optimized model saved to optimized_model.onnx
Quantized model saved to quantized_model.onnx
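To confirm the quantization actually paid off, compare the two files on disk. A small stdlib helper sketch (the function name `quantization_savings` is ours, not part of onnxruntime):

```python
import os

def quantization_savings(original_path, quantized_path):
    """Return (original_bytes, quantized_bytes, percent_saved)
    for two model files on disk."""
    orig = os.path.getsize(original_path)
    quant = os.path.getsize(quantized_path)
    return orig, quant, 100.0 * (1 - quant / orig)
```

With INT8 weights replacing FP32, savings of roughly 70-75% on the weight-dominated portion of the file are typical, though the exact figure depends on the model.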

Common variations

  • Use optimizer.optimize_model with a model_type that matches your architecture, such as bert, gpt2, or bert_tf.
  • For static quantization, use quantize_static with calibration data for better accuracy.
  • Use onnxruntime.InferenceSession with providers=['CUDAExecutionProvider', 'CPUExecutionProvider'] to leverage GPU acceleration after optimization, falling back to CPU where CUDA is unavailable.
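Static quantization needs a calibration data reader that feeds representative inputs through the model. A minimal sketch of the interface quantize_static expects — the class name, input name, and shape below are placeholders, and random data is a stand-in for the real samples you should use:

```python
import numpy as np

class RandomCalibrationReader:
    """Mimics the onnxruntime.quantization.CalibrationDataReader
    interface: get_next() returns one input-feed dict per batch,
    then None once the calibration data is exhausted."""

    def __init__(self, input_name, shape, num_batches=8):
        self.input_name = input_name
        self.shape = shape
        self.remaining = num_batches

    def get_next(self):
        if self.remaining == 0:
            return None  # signals quantize_static that calibration is done
        self.remaining -= 1
        return {self.input_name: np.random.rand(*self.shape).astype(np.float32)}

# Usage with static quantization (requires onnxruntime and the model file):
# from onnxruntime.quantization import quantize_static
# quantize_static("optimized_model.onnx", "static_quantized.onnx",
#                 RandomCalibrationReader("input", (1, 3, 224, 224)))
```

In practice, replace the random batches with a few hundred real samples from your validation set; calibration quality directly determines the accuracy of the statically quantized model.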

Troubleshooting

  • If you see errors loading the optimized model, verify the model_type passed to optimizer.optimize_model matches your model.
  • Quantization may reduce accuracy; test the quantized model thoroughly.
  • Ensure your onnxruntime and onnxruntime-tools versions are compatible; note that in recent onnxruntime releases the optimizer has moved into the main package under onnxruntime.transformers.
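Testing the quantized model thoroughly starts with comparing its outputs against the original on the same inputs. A small helper sketch (the function name is ours; the session calls in the comment assume your real input feed):

```python
import numpy as np

def worst_abs_error(reference, candidate):
    """Largest element-wise absolute deviation between two model outputs."""
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    return float(np.max(np.abs(ref - cand)))

# Typical use (requires onnxruntime and both model files):
# import onnxruntime as ort
# feeds = {"input": sample}  # your real input feed
# ref = ort.InferenceSession("model.onnx").run(None, feeds)[0]
# cand = ort.InferenceSession("quantized_model.onnx").run(None, feeds)[0]
# print("worst abs error:", worst_abs_error(ref, cand))
```

What counts as acceptable error is task-specific: for classification, also check that top-1 predictions agree on a held-out sample, not just raw output deltas.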

Key Takeaways

  • Use onnxruntime-tools optimizer to apply graph-level optimizations for faster inference.
  • Dynamic quantization with quantize_dynamic reduces model size and speeds up CPU inference.
  • Match model_type in optimizer to your model architecture for best results.
  • Test optimized and quantized models to ensure accuracy is acceptable.
  • Leverage GPU providers in onnxruntime for additional speed gains.
Verified 2026-04 · onnxruntime, onnxruntime-tools