How to use ONNX Runtime quantization
Quick answer
Use onnxruntime to apply post-training quantization to ONNX models, which reduces model size and improves inference speed. The quantize_dynamic and quantize_static APIs in the onnxruntime.quantization module enable quantization with minimal code changes.
Prerequisites
- Python 3.8+
- pip install onnxruntime onnx
- An existing ONNX model file
Setup
Install the required packages using pip: onnxruntime for inference (it also ships the onnxruntime.quantization module) and onnx for model manipulation. The separate onnxruntime-tools package provides extra graph optimizations for transformer models but is not required for quantization.
pip install onnxruntime onnx
Step by step
This example shows how to perform dynamic quantization on an ONNX model to reduce its size and speed up inference. Dynamic quantization converts weights to integers ahead of time and computes quantization parameters for activations on the fly at runtime.
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to your original ONNX model
model_fp32 = "model.onnx"

# Path to save the quantized model
model_quant = "model_quant.onnx"

# Perform dynamic quantization
quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,  # Quantize weights to int8
)

# Load and check the quantized model
quantized_model = onnx.load(model_quant)
onnx.checker.check_model(quantized_model)
print(f"Quantized model saved to {model_quant}")
Output
Quantized model saved to model_quant.onnx
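To confirm that quantization actually shrank the file, compare the two files on disk. The helper below is a minimal sketch; the commented usage assumes the model.onnx and model_quant.onnx paths from the example above exist.

```python
import os

def size_reduction_pct(original_path, quantized_path):
    """Percentage by which the quantized file is smaller than the original."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    return 100.0 * (1 - quantized / original)

# Example usage (assumes the files from the steps above exist):
# print(f"Size reduction: {size_reduction_pct('model.onnx', 'model_quant.onnx'):.1f}%")
```

Int8 dynamic quantization typically reduces an fp32 model's size by roughly 4x, since each weight goes from 4 bytes to 1.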
Common variations
You can also use static quantization, which requires calibration data to quantize both weights and activations more precisely. For static quantization, use quantize_static with a calibration data reader. Additionally, you can choose between quantization types such as QInt8 and QUInt8 depending on your hardware support.
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat

class DummyCalibrationDataReader(CalibrationDataReader):
    def get_next(self):
        # Return a dict mapping model input names to numpy arrays,
        # or None once the calibration data is exhausted
        return None

# Example usage for static quantization (requires real calibration data)
# quantize_static(
#     model_input=model_fp32,
#     model_output="model_static_quant.onnx",
#     calibration_data_reader=DummyCalibrationDataReader(),
#     quant_format=QuantFormat.QOperator,  # note: QuantFormat, not QuantType
# )
Troubleshooting
- If you see errors about missing calibration data during static quantization, ensure your CalibrationDataReader properly yields input samples.
- If the quantized model's accuracy drops significantly, try a different quantization type or use static quantization with representative calibration data.
- Check your ONNX Runtime version; the quantization APIs have evolved across releases, and some options are only available in recent versions.
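A working calibration data reader usually just walks a list of pre-processed samples and returns None when they run out. The sketch below is illustrative: the input name "input" and the sample shape are placeholders you must match to your model, and the stand-in base class exists only so the sketch runs even without onnxruntime installed.

```python
import numpy as np

try:
    from onnxruntime.quantization import CalibrationDataReader
except ImportError:
    # Minimal stand-in with the same interface, so the sketch runs standalone
    class CalibrationDataReader:
        pass

class ListCalibrationDataReader(CalibrationDataReader):
    """Feeds a fixed list of samples; get_next returns None when exhausted."""
    def __init__(self, input_name, samples):
        self.input_name = input_name
        self.iterator = iter(samples)

    def get_next(self):
        sample = next(self.iterator, None)
        if sample is None:
            return None
        return {self.input_name: sample}

# Hypothetical input name and shape -- adjust both to your model
samples = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(4)]
reader = ListCalibrationDataReader("input", samples)
```

Each call to get_next returns one {input_name: array} dict; quantize_static drains the reader during calibration.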
Key Takeaways
- Use quantize_dynamic for quick post-training quantization without calibration data.
- Static quantization with calibration data yields better accuracy but requires more setup.
- Choose quantization types based on your hardware and accuracy needs.
- Always verify the quantized model's accuracy and performance after quantization.
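The last takeaway can be made concrete with a quick numerical check: run the same input through both the fp32 and the quantized model and compare outputs. The comparison helper below is generic; the session setup is commented out because it needs the model files from the steps above, and the input name and shape are placeholders.

```python
import numpy as np

def max_abs_diff(reference, candidate):
    """Largest elementwise deviation between two output tensors."""
    return float(np.max(np.abs(np.asarray(reference) - np.asarray(candidate))))

# Hedged usage sketch (requires onnxruntime and the model files from above):
# import onnxruntime as ort
# x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
# ref = ort.InferenceSession("model.onnx").run(None, {"input": x})[0]
# quant = ort.InferenceSession("model_quant.onnx").run(None, {"input": x})[0]
# print("max abs diff:", max_abs_diff(ref, quant))
```

A small deviation is expected after int8 quantization; a large one suggests trying a different weight_type or switching to static quantization with representative calibration data.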