Intermediate · 4 min read

How to quantize with ONNX Runtime

Quick answer
Use the onnxruntime.quantization Python package to apply post-training quantization on ONNX models. The quantize_dynamic or quantize_static functions convert model weights to lower precision formats like INT8, improving inference speed and reducing model size.

PREREQUISITES

  • Python 3.8+
  • pip install onnxruntime onnx (the quantization APIs ship with the main onnxruntime package)
  • An existing ONNX model file

Setup

Install the required packages for ONNX Runtime quantization and prepare your environment.

bash
pip install onnxruntime onnx

Step by step

This example shows how to perform dynamic quantization on an ONNX model using quantize_dynamic. Dynamic quantization converts weights to INT8 while keeping activations in float, which requires no calibration data.

python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to your original ONNX model
model_fp32 = "model.onnx"
# Path to save the quantized model
model_int8 = "model_quantized.onnx"

# Perform dynamic quantization on the model
quantize_dynamic(
    model_input=model_fp32,
    model_output=model_int8,
    weight_type=QuantType.QInt8  # INT8 quantization for weights
)

print(f"Quantized model saved to {model_int8}")
output
Quantized model saved to model_quantized.onnx

Common variations

Besides dynamic quantization, ONNX Runtime supports static quantization, which quantizes activations as well and therefore needs representative calibration data. Use quantize_static when you want fully integer inference for lower latency; the calibration data is what keeps the accuracy loss small.

Example for static quantization:

python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)

class MyCalibrationDataReader(CalibrationDataReader):
    def __init__(self, num_batches=10):
        # Replace the random arrays with real samples from your dataset;
        # the key must match the model's input name, and the shape its input shape.
        self.batches = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return the next input dict, or None when calibration data is exhausted
        return next(self.batches, None)

calibration_data_reader = MyCalibrationDataReader()

quantize_static(
    model_input="model.onnx",
    model_output="model_static_quantized.onnx",
    calibration_data_reader=calibration_data_reader,
    quant_format=QuantFormat.QOperator,  # QuantFormat (not QuantType); QDQ is the other option
    weight_type=QuantType.QInt8
)

Troubleshooting

  • If you get errors about missing calibration data during static quantization, ensure your CalibrationDataReader yields input batches and returns None once the data is exhausted.
  • For unsupported operators during quantization, check ONNX Runtime's operator support and consider fallback to dynamic quantization.
  • Verify your ONNX model is valid with onnx.checker.check_model() before quantization.

Key Takeaways

  • Use quantize_dynamic for quick INT8 weight quantization without calibration data.
  • Use quantize_static with calibration data to quantize activations as well, enabling fully integer inference.
  • The quantization APIs live in the onnxruntime package itself, under onnxruntime.quantization.
  • Validate your ONNX model before quantization to avoid runtime errors.
  • Static quantization requires a CalibrationDataReader to feed representative inputs.
Verified 2026-04 · ONNX Runtime quantization