How-to · Beginner · 3 min read

How to use optimum with ONNX

Quick answer
Use the optimum library to export Hugging Face Transformers models to ONNX format and run optimized inference. Optimum provides utilities to convert, optimize, and deploy models with ONNX Runtime for faster execution.

Prerequisites

  • Python 3.8+
  • pip install "optimum[onnxruntime]>=1.13.0" (this extra also pulls in onnxruntime)
  • pip install "transformers>=4.30.0"

Setup

Install the required packages. The optimum[onnxruntime] extra pulls in onnxruntime automatically, so one command is enough; quote the bracketed name so the shell does not try to expand it.

bash
pip install "optimum[onnxruntime]" transformers

Step by step

Export a Hugging Face Transformer model to ONNX format using optimum, then run inference with ONNX Runtime.

python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Load tokenizer and model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Export the model to ONNX on the fly and load it with ONNX Runtime
model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)

# Prepare input
inputs = tokenizer("Optimum with ONNX is fast!", return_tensors="pt")

# Run inference
outputs = model(**inputs)
logits = outputs.logits

# Print predicted class
predicted_class = logits.argmax(-1).item()
print(f"Predicted class: {predicted_class}")
output
Predicted class: 1
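
The raw class id can be mapped to a human-readable label. For this SST-2 model, index 0 is NEGATIVE and index 1 is POSITIVE (the mapping is also available as model.config.id2label); the logits below are illustrative stand-ins for the model output above.

```python
import torch

# Illustrative logits standing in for outputs.logits above
logits = torch.tensor([[-2.1, 2.3]])

# SST-2 label mapping (also available as model.config.id2label)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(-1).item()
print(f"{id2label[pred]} ({probs[0, pred]:.2f})")
```
output
POSITIVE (0.99)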

Common variations

You can also export models ahead of time with the optimum-cli export onnx command (backed by the optimum.exporters.onnx module) for custom optimization, then load the saved files directly. Other task types, such as question answering, follow the same pattern.

python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForQuestionAnswering.from_pretrained(model_name, export=True)

inputs = tokenizer("Where is Optimum used?", "Optimum supports ONNX Runtime.", return_tensors="pt")
outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(f"Start logits shape: {start_logits.shape}")
print(f"End logits shape: {end_logits.shape}")
output
Start logits shape: torch.Size([1, 16])
End logits shape: torch.Size([1, 16])
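
To turn those logits into an answer, take the argmax of the start and end logits and decode the tokens between them. The logits below are illustrative stand-ins matching the shapes printed above; with the real model you would decode from inputs["input_ids"] as noted in the comment.

```python
import torch

# Illustrative start/end logits for a 16-token sequence,
# matching the shapes printed above
start_logits = torch.full((1, 16), -1.0)
end_logits = torch.full((1, 16), -1.0)
start_logits[0, 12] = 5.0  # pretend token 12 starts the answer
end_logits[0, 14] = 5.0    # pretend token 14 ends it

start = start_logits.argmax(-1).item()
end = end_logits.argmax(-1).item()

# With the real model: tokenizer.decode(inputs["input_ids"][0, start:end + 1])
print(f"Answer span: tokens {start}..{end}")
```
output
Answer span: tokens 12..14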

Troubleshooting

  • If you see errors loading ONNX models, ensure onnxruntime is installed and compatible with your Python version.
  • Pass export=True to convert Hugging Face models automatically when no ONNX files are present.
  • ONNX Runtime runs on CPU by default; for GPU inference, install onnxruntime-gpu and pass provider="CUDAExecutionProvider" when loading the model.

Key Takeaways

  • Use optimum to export and run Hugging Face models in ONNX format for faster inference.
  • The ORTModelFor* classes wrap ONNX Runtime models with familiar Hugging Face APIs.
  • Install optimum[onnxruntime] and onnxruntime to enable ONNX support.
  • You can convert models on the fly with export=True if ONNX files are not pre-exported.
  • ONNX Runtime supports CPU and GPU execution for optimized performance.
Verified 2026-04 · distilbert-base-uncased-finetuned-sst-2-english, distilbert-base-uncased-distilled-squad