How to use optimum with ONNX
Quick answer
Use the optimum library to export Hugging Face Transformers models to ONNX format and run optimized inference. Optimum provides utilities to convert, optimize, and deploy models with ONNX Runtime for faster execution.

Prerequisites

- Python 3.8+
- pip install "optimum[onnxruntime]>=1.13.0"
- pip install "transformers>=4.30.0"
- pip install onnxruntime
Setup
Install the required packages to use optimum with ONNX and ONNX Runtime. This includes optimum[onnxruntime], transformers, and onnxruntime.
pip install "optimum[onnxruntime]" transformers onnxruntime

Step by step
Export a Hugging Face Transformer model to ONNX format using optimum, then run inference with ONNX Runtime.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
# Load tokenizer and model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model and export it to ONNX on the fly; export=True replaces the
# deprecated from_transformers=True argument in recent optimum releases
model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
# Prepare input
inputs = tokenizer("Optimum with ONNX is fast!", return_tensors="pt")
# Run inference
outputs = model(**inputs)
logits = outputs.logits
# Print predicted class
predicted_class = logits.argmax(-1).item()
print(f"Predicted class: {predicted_class}")

Output
Predicted class: 1
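For SST-2, class index 1 corresponds to a positive sentiment, and the raw logits can be turned into probabilities with a softmax. A minimal sketch in plain Python (the logit values below are illustrative stand-ins, not taken from an actual run):

```python
import math

def softmax(values):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the two SST-2 classes (NEGATIVE, POSITIVE)
logits = [-4.2, 4.6]
probs = softmax(logits)
labels = ["NEGATIVE", "POSITIVE"]
predicted = labels[probs.index(max(probs))]
print(predicted, max(probs))
```

In a real run you would pass outputs.logits[0].tolist() into the same function instead of the hard-coded values.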
Common variations
You can also export models manually with the optimum.exporters.onnx module (or the optimum-cli export onnx command) when you need custom control over the export. Other task classes, such as question answering, follow the same pattern.
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForQuestionAnswering.from_pretrained(model_name, export=True)
inputs = tokenizer("Where is Optimum used?", "Optimum supports ONNX Runtime.", return_tensors="pt")
outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(f"Start logits shape: {start_logits.shape}")
print(f"End logits shape: {end_logits.shape}")

Output

Start logits shape: torch.Size([1, 16])
End logits shape: torch.Size([1, 16])
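The start/end logits become an answer by picking the most likely start position and then the most likely end position at or after it. A minimal greedy decoding sketch with illustrative logit values (a real run would use outputs.start_logits and outputs.end_logits together with the tokenizer):

```python
# Greedy answer-span decoding from start/end logits.
# The values below are illustrative; real code would use
# outputs.start_logits[0].tolist() and outputs.end_logits[0].tolist().
start_logits = [0.1, 0.3, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 5.1, 0.4, 0.2, 0.1, 0.0, 0.1, 0.2]
end_logits   = [0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0, 0.1, 0.2, 0.3, 0.1, 4.8, 0.2, 0.1, 0.0, 0.1]

# Best start position, then best end position at or after the start
start = max(range(len(start_logits)), key=start_logits.__getitem__)
end = max(range(start, len(end_logits)), key=end_logits.__getitem__)

# The answer text would then be tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(f"answer span: tokens {start}..{end}")
```

This greedy approach is the simplest option; production pipelines typically also score top-k start/end pairs and filter out spans that fall inside the question.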
Troubleshooting
- If you see errors loading ONNX models, ensure onnxruntime is installed and compatible with your Python version.
- Use export=True (formerly from_transformers=True) to convert Hugging Face models automatically if ONNX files are missing.
- Keep input tensors on CPU unless you have explicitly configured a GPU execution provider such as CUDAExecutionProvider.
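When debugging CPU vs GPU issues, it helps to check which execution providers your onnxruntime build actually exposes. A small diagnostic sketch:

```python
# Diagnostic: confirm onnxruntime imports and list its execution providers.
# CPUExecutionProvider is always present; CUDAExecutionProvider appears only
# with the onnxruntime-gpu package and a working CUDA installation.
try:
    import onnxruntime as ort
    print("onnxruntime version:", ort.__version__)
    print("available providers:", ort.get_available_providers())
except ImportError:
    print("onnxruntime is not installed; run: pip install onnxruntime")
```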
Key Takeaways
- Use optimum to export and run Hugging Face models in ONNX format for faster inference.
- The ORTModelFor* classes wrap ONNX Runtime models with familiar Hugging Face APIs.
- Install optimum[onnxruntime] to enable ONNX support; it pulls in onnxruntime as a dependency.
- You can convert models on the fly with export=True (formerly from_transformers=True) if ONNX files are not pre-exported.
- ONNX Runtime supports CPU and GPU execution for optimized performance.