How to quantize an ONNX model
Quick answer
Use the onnxruntime Python package's quantize_dynamic or quantize_static functions to quantize an ONNX model. Quantization reduces model size and speeds up inference by converting weights to a lower precision such as INT8, without retraining.
Prerequisites
Python 3.8+
pip install onnx onnxruntime onnxruntime-tools
Setup
Install the required packages using pip. You need onnx for model manipulation and onnxruntime for runtime support, including its onnxruntime.quantization module, which provides the quantization functions; onnxruntime-tools adds optional optimization utilities.
pip install onnx onnxruntime onnxruntime-tools
Step by step
This example performs dynamic quantization on an ONNX model: weights are converted to INT8 ahead of time, while activation quantization parameters are computed on the fly at runtime, so no calibration data is needed. It is the simplest and fastest quantization method.
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to your original ONNX model
model_fp32 = "model.onnx"
# Path to save the quantized model
model_quant = "model_quant.onnx"

# Perform dynamic quantization
quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,  # Quantize weights to INT8
)

print(f"Quantized model saved to {model_quant}")
Output
Quantized model saved to model_quant.onnx
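To confirm the quantization worked, you can compare file sizes and run the quantized model through an InferenceSession. This is a minimal sketch assuming the paths above; the input shape (1, 3, 224, 224) is a placeholder to replace with your model's actual input shape.
import os
import numpy as np
import onnxruntime as ort

model_fp32 = "model.onnx"
model_quant = "model_quant.onnx"

# INT8 weights typically shrink the file to roughly a quarter of the FP32 size
print(f"FP32 size: {os.path.getsize(model_fp32) / 1e6:.1f} MB")
print(f"INT8 size: {os.path.getsize(model_quant) / 1e6:.1f} MB")

# Load the quantized model and run it once to confirm it executes
session = ort.InferenceSession(model_quant)
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
outputs = session.run(None, {input_name: dummy_input})
print(f"Output shape: {outputs[0].shape}")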
Common variations
For better accuracy, use static quantization, which requires calibration data to compute activation ranges. This involves running the model on representative sample inputs to collect activation statistics before quantizing.
Example for static quantization:
from onnxruntime.quantization import (
    quantize_static,
    CalibrationDataReader,
    QuantFormat,
    QuantType,
)
import onnx
import numpy as np

# Feeds calibration samples to the quantizer; use representative real
# inputs in practice rather than random data
class DummyDataReader(CalibrationDataReader):
    def __init__(self, input_name):
        self.data = [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}]
        self.enum_data = iter(self.data)

    def get_next(self):
        return next(self.enum_data, None)

model_fp32 = "model.onnx"
model_quant_static = "model_quant_static.onnx"

# Load the model to get the input name
model = onnx.load(model_fp32)
input_name = model.graph.input[0].name

# Create the calibration data reader
calibration_data_reader = DummyDataReader(input_name)

# Run static quantization
quantize_static(
    model_input=model_fp32,
    model_output=model_quant_static,
    calibration_data_reader=calibration_data_reader,
    quant_format=QuantFormat.QOperator,  # Use the QOperator format
    weight_type=QuantType.QInt8,
)

print(f"Static quantized model saved to {model_quant_static}")
Output
Static quantized model saved to model_quant_static.onnx
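Recent onnxruntime releases default quant_format to QuantFormat.QDQ, which inserts QuantizeLinear/DequantizeLinear node pairs around tensors instead of replacing operators, and is the format the project generally recommends. As a sketch, the same call with the QDQ format, reusing the reader class above (a fresh instance is needed because the previous one's iterator is exhausted):
# Same call as above, but producing a QDQ-format model
quantize_static(
    model_input=model_fp32,
    model_output="model_quant_qdq.onnx",
    calibration_data_reader=DummyDataReader(input_name),  # fresh reader
    quant_format=QuantFormat.QDQ,  # QuantizeLinear/DequantizeLinear pairs
    weight_type=QuantType.QInt8,
)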
Troubleshooting
- If you get errors about missing calibration data for static quantization, ensure your CalibrationDataReader correctly yields input samples.
- If quantized model accuracy drops significantly, try static quantization with representative calibration data instead of dynamic quantization.
- Check the ONNX model's opset version; the quantization tools require opset >= 11. Older models can be upgraded, as shown in the sketch after this list.
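A quick way to check and upgrade the opset is onnx's built-in version converter. This is a sketch assuming the model.onnx path from the earlier examples; pick any target opset of 11 or higher that your tooling supports:
import onnx
from onnx import version_converter

model = onnx.load("model.onnx")
print(f"Current opset: {model.opset_import[0].version}")

# Upgrade models older than opset 11 so the quantization tools accept them
if model.opset_import[0].version < 11:
    upgraded = version_converter.convert_version(model, 13)
    onnx.save(upgraded, "model_opset13.onnx")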
Key Takeaways
- Use onnxruntime.quantization.quantize_dynamic for fast weight-only quantization without calibration data.
- Use quantize_static with a CalibrationDataReader for better accuracy via activation calibration.
- Always verify your ONNX model's opset version is compatible with the quantization tools (usually opset 11+).