How to use optimum with ONNX
Quick answer
Use the optimum library to export Hugging Face Transformers models to ONNX format and run optimized inference. Optimum provides utilities to convert, optimize, and deploy models with ONNX Runtime for faster execution.

Prerequisites

- Python 3.8+
- pip install "optimum[onnxruntime]>=1.13.0"
- pip install "transformers>=4.30.0"
- pip install onnxruntime
Setup
Install the required packages to use optimum with ONNX and ONNX Runtime. This includes optimum[onnxruntime], transformers, and onnxruntime.
pip install "optimum[onnxruntime]" transformers onnxruntime

Step by step
Export a Hugging Face Transformer model to ONNX format using optimum, then run inference with ONNX Runtime.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
# Load tokenizer and model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model and export it to ONNX on the fly; export=True replaces the
# deprecated from_transformers=True argument in recent optimum releases
model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
# Prepare input
inputs = tokenizer("Optimum with ONNX is fast!", return_tensors="pt")
# Run inference
outputs = model(**inputs)
logits = outputs.logits
# Print predicted class
predicted_class = logits.argmax(-1).item()
print(f"Predicted class: {predicted_class}")

Output
Predicted class: 1
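For SST-2, class index 1 corresponds to a positive sentiment, and the raw logits can be turned into probabilities with a softmax. A minimal sketch in plain Python (the logit values below are illustrative stand-ins, not taken from an actual run):

```python
import math

def softmax(values):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the two SST-2 classes (NEGATIVE, POSITIVE)
logits = [-4.2, 4.6]
probs = softmax(logits)
labels = ["NEGATIVE", "POSITIVE"]
predicted = labels[probs.index(max(probs))]
print(predicted, max(probs))
```

In a real run you would pass outputs.logits[0].tolist() into the same function instead of the hard-coded values.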
Common variations
You can also export models manually with the optimum.exporters.onnx module (or the optimum-cli export onnx command) when you need custom control over the export. Other task classes, such as question answering, follow the same pattern.
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForQuestionAnswering.from_pretrained(model_name, export=True)
inputs = tokenizer("Where is Optimum used?", "Optimum supports ONNX Runtime.", return_tensors="pt")
outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(f"Start logits shape: {start_logits.shape}")
print(f"End logits shape: {end_logits.shape}")

Output

Start logits shape: torch.Size([1, 16])
End logits shape: torch.Size([1, 16])
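The start/end logits become an answer by picking the most likely start position and then the most likely end position at or after it. A minimal greedy decoding sketch with illustrative logit values (a real run would use outputs.start_logits and outputs.end_logits together with the tokenizer):

```python
# Greedy answer-span decoding from start/end logits.
# The values below are illustrative; real code would use
# outputs.start_logits[0].tolist() and outputs.end_logits[0].tolist().
start_logits = [0.1, 0.3, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 5.1, 0.4, 0.2, 0.1, 0.0, 0.1, 0.2]
end_logits   = [0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0, 0.1, 0.2, 0.3, 0.1, 4.8, 0.2, 0.1, 0.0, 0.1]

# Best start position, then best end position at or after the start
start = max(range(len(start_logits)), key=start_logits.__getitem__)
end = max(range(start, len(end_logits)), key=end_logits.__getitem__)

# The answer text would then be tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(f"answer span: tokens {start}..{end}")
```

This greedy approach is the simplest option; production pipelines typically also score top-k start/end pairs and filter out spans that fall inside the question.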
Troubleshooting
- If you see errors loading ONNX models, ensure onnxruntime is installed and compatible with your Python version.
- Use export=True (formerly from_transformers=True) to convert Hugging Face models automatically if ONNX files are missing.
- Keep input tensors on CPU unless you have explicitly configured a GPU execution provider such as CUDAExecutionProvider.
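When debugging CPU vs GPU issues, it helps to check which execution providers your onnxruntime build actually exposes. A small diagnostic sketch:

```python
# Diagnostic: confirm onnxruntime imports and list its execution providers.
# CPUExecutionProvider is always present; CUDAExecutionProvider appears only
# with the onnxruntime-gpu package and a working CUDA installation.
try:
    import onnxruntime as ort
    print("onnxruntime version:", ort.__version__)
    print("available providers:", ort.get_available_providers())
except ImportError:
    print("onnxruntime is not installed; run: pip install onnxruntime")
```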
Key Takeaways
- Use optimum to export and run Hugging Face models in ONNX format for faster inference.
- The ORTModelFor* classes wrap ONNX Runtime models with familiar Hugging Face APIs.
- Install optimum[onnxruntime] to enable ONNX support; it pulls in onnxruntime as a dependency.
- You can convert models on the fly with export=True (formerly from_transformers=True) if ONNX files are not pre-exported.
- ONNX Runtime supports CPU and GPU execution for optimized performance.