How-to · Intermediate · 3 min read

How to use ONNX Runtime quantization

Quick answer
Use the onnxruntime.quantization module to apply post-training quantization to ONNX models, which reduces model size and can improve inference speed. The quantize_dynamic and quantize_static APIs enable post-training quantization with minimal code changes.

PREREQUISITES

  • Python 3.8+
  • pip install onnxruntime onnx
  • An existing ONNX model file

Setup

Install the required packages with pip. You need onnxruntime for inference and its built-in quantization APIs (under onnxruntime.quantization), and onnx for loading and inspecting models. The separate onnxruntime-tools package is not required for quantization.

bash
pip install onnxruntime onnx

Step by step

This example shows how to perform dynamic quantization on an ONNX model to reduce its size and speed up inference. Dynamic quantization converts the weights to int8 ahead of time and quantizes activations on the fly at inference time, so no calibration data is needed.

python
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to your original ONNX model
model_fp32 = "model.onnx"
# Path to save the quantized model
model_quant = "model_quant.onnx"

# Perform dynamic quantization
quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8  # Quantize weights to int8
)

# Load and sanity-check the quantized model
quantized_model = onnx.load(model_quant)
onnx.checker.check_model(quantized_model)
print(f"Quantized model saved to {model_quant}")
output
Quantized model saved to model_quant.onnx

Common variations

You can also use static quantization, which requires calibration data to compute activation ranges ahead of time and generally preserves accuracy better. For static quantization, use quantize_static with a calibration data reader. You can also choose the weight type, QInt8 or QUInt8, depending on your hardware support.

python
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat, QuantType

class DummyCalibrationDataReader(CalibrationDataReader):
    """Skeleton reader: get_next must return a dict mapping model input
    names to numpy arrays, and None once all samples are consumed."""
    def get_next(self):
        # Return {"input_name": np.ndarray} for each calibration sample
        return None

# Example usage for static quantization (requires real calibration data)
# quantize_static(
#     model_input=model_fp32,
#     model_output="model_static_quant.onnx",
#     calibration_data_reader=DummyCalibrationDataReader(),
#     quant_format=QuantFormat.QOperator,  # quant_format takes QuantFormat, not QuantType
#     weight_type=QuantType.QInt8,
# )

Troubleshooting

  • If you see errors about missing calibration data during static quantization, ensure your CalibrationDataReader properly yields input samples.
  • If the quantized model accuracy drops significantly, try different quantization types or use static quantization with representative calibration data.
  • Check ONNX Runtime version compatibility; the quantization API has changed between releases, so match the documentation to your installed version.

Key Takeaways

  • Use quantize_dynamic for quick post-training quantization without calibration data.
  • Static quantization with calibration data yields better accuracy but requires more setup.
  • Choose quantization types based on your hardware and accuracy needs.
  • Always verify the quantized model's accuracy and performance after quantization.
Verified 2026-04