How to · Intermediate · 3 min read

How to optimize an ONNX model

Quick answer
Use the onnxruntime Python package together with onnxruntime-tools to apply graph optimizations and quantization to your ONNX model. Techniques such as operator fusion, constant folding, and INT8 quantization shrink the model file and speed up inference.

PREREQUISITES

  • Python 3.8+
  • pip install onnx onnxruntime onnxruntime-tools

Setup

Install the required packages. onnxruntime and onnxruntime-tools supply the graph-optimization and quantization utilities used below.

bash
pip install onnx onnxruntime onnxruntime-tools

Step by step

This example loads an existing ONNX model, applies graph optimizations, performs dynamic quantization, and saves the optimized model.

python
import onnx
# Note: this guide installs onnxruntime-tools; in newer onnxruntime
# releases the same optimizer also ships as onnxruntime.transformers.optimizer.
from onnxruntime_tools import optimizer
from onnxruntime.quantization import quantize_dynamic, QuantType

# Load and validate your ONNX model
model_path = "model.onnx"
model = onnx.load(model_path)
onnx.checker.check_model(model)

# Apply graph optimizations (operator fusion, constant folding, ...);
# pass the model_type that matches your architecture
optimized_model = optimizer.optimize_model(model_path, model_type='bert')
optimized_model.save_model_to_file("optimized_model.onnx")

# Apply dynamic quantization to reduce model size and improve CPU speed
quantized_model_path = "quantized_model.onnx"
quantize_dynamic(
    model_input="optimized_model.onnx",
    model_output=quantized_model_path,
    weight_type=QuantType.QInt8,
)

print("Optimized model saved to optimized_model.onnx")
print(f"Quantized model saved to {quantized_model_path}")
output
Optimized model saved to optimized_model.onnx
Quantized model saved to quantized_model.onnx
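To confirm the quantization actually paid off, compare the two files on disk. A small stdlib helper sketch (the function name `quantization_savings` is ours, not part of onnxruntime):

```python
import os

def quantization_savings(original_path, quantized_path):
    """Return (original_bytes, quantized_bytes, percent_saved)
    for two model files on disk."""
    orig = os.path.getsize(original_path)
    quant = os.path.getsize(quantized_path)
    return orig, quant, 100.0 * (1 - quant / orig)
```

With INT8 weights replacing FP32, savings of roughly 70-75% on the weight-dominated portion of the file are typical, though the exact figure depends on the model.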

Common variations

  • Use optimizer.optimize_model with a model_type that matches your architecture, such as bert, gpt2, or bert_tf.
  • For static quantization, use quantize_static with calibration data for better accuracy.
  • Use onnxruntime.InferenceSession with providers=['CUDAExecutionProvider', 'CPUExecutionProvider'] to leverage GPU acceleration after optimization, falling back to CPU where CUDA is unavailable.
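Static quantization needs a calibration data reader that feeds representative inputs through the model. A minimal sketch of the interface quantize_static expects — the class name, input name, and shape below are placeholders, and random data is a stand-in for the real samples you should use:

```python
import numpy as np

class RandomCalibrationReader:
    """Mimics the onnxruntime.quantization.CalibrationDataReader
    interface: get_next() returns one input-feed dict per batch,
    then None once the calibration data is exhausted."""

    def __init__(self, input_name, shape, num_batches=8):
        self.input_name = input_name
        self.shape = shape
        self.remaining = num_batches

    def get_next(self):
        if self.remaining == 0:
            return None  # signals quantize_static that calibration is done
        self.remaining -= 1
        return {self.input_name: np.random.rand(*self.shape).astype(np.float32)}

# Usage with static quantization (requires onnxruntime and the model file):
# from onnxruntime.quantization import quantize_static
# quantize_static("optimized_model.onnx", "static_quantized.onnx",
#                 RandomCalibrationReader("input", (1, 3, 224, 224)))
```

In practice, replace the random batches with a few hundred real samples from your validation set; calibration quality directly determines the accuracy of the statically quantized model.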

Troubleshooting

  • If you see errors loading the optimized model, verify the model_type passed to optimizer.optimize_model matches your model.
  • Quantization may reduce accuracy; test the quantized model thoroughly.
  • Ensure your onnxruntime and onnxruntime-tools versions are compatible; note that in recent onnxruntime releases the optimizer has moved into the main package under onnxruntime.transformers.
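Testing the quantized model thoroughly starts with comparing its outputs against the original on the same inputs. A small helper sketch (the function name is ours; the session calls in the comment assume your real input feed):

```python
import numpy as np

def worst_abs_error(reference, candidate):
    """Largest element-wise absolute deviation between two model outputs."""
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    return float(np.max(np.abs(ref - cand)))

# Typical use (requires onnxruntime and both model files):
# import onnxruntime as ort
# feeds = {"input": sample}  # your real input feed
# ref = ort.InferenceSession("model.onnx").run(None, feeds)[0]
# cand = ort.InferenceSession("quantized_model.onnx").run(None, feeds)[0]
# print("worst abs error:", worst_abs_error(ref, cand))
```

What counts as acceptable error is task-specific: for classification, also check that top-1 predictions agree on a held-out sample, not just raw output deltas.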

Key Takeaways

  • Use onnxruntime-tools optimizer to apply graph-level optimizations for faster inference.
  • Dynamic quantization with quantize_dynamic reduces model size and speeds up CPU inference.
  • Match model_type in optimizer to your model architecture for best results.
  • Test optimized and quantized models to ensure accuracy is acceptable.
  • Leverage GPU providers in onnxruntime for additional speed gains.
Verified 2026-04 · onnxruntime, onnxruntime-tools