How to deploy an ONNX model in production
Quick answer
Deploy an ONNX model in production by using ONNX Runtime for efficient inference. Load the .onnx model file, prepare input tensors, and run inference with onnxruntime.InferenceSession in your production environment.

Prerequisites
- Python 3.8+
- pip install onnxruntime
- An exported ONNX model file (.onnx)
Setup
Install the onnxruntime package, which provides a high-performance runtime for executing ONNX models. Ensure you have your .onnx model file ready for deployment.
pip install onnxruntime

Step by step
This example shows how to load an ONNX model and run inference using onnxruntime. Replace model.onnx with your model path and prepare input data accordingly.
import onnxruntime as ort
import numpy as np
# Load the ONNX model
session = ort.InferenceSession("model.onnx")
# Get model input name
input_name = session.get_inputs()[0].name
# Prepare dummy input data (adjust shape and dtype to your model)
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
# Run inference
outputs = session.run(None, {input_name: input_data})
# Print output shape
print("Output shape:", outputs[0].shape)

Output:
Output shape: (1, 1000)
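In practice, the dummy tensor above is replaced by real preprocessed data. As a minimal sketch, assuming an image-classification model that expects NCHW float32 input with the common ImageNet normalization constants (check the preprocessing your model was actually trained with):

```python
import numpy as np

def preprocess(image_hwc: np.ndarray) -> np.ndarray:
    # ImageNet mean/std are an assumption -- substitute your model's values.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = image_hwc.astype(np.float32) / 255.0  # uint8 -> [0, 1]
    x = (x - mean) / std                      # per-channel normalization
    x = x.transpose(2, 0, 1)                  # HWC -> CHW
    return x[np.newaxis, ...]                 # add batch dim -> NCHW

raw = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a decoded image
tensor = preprocess(raw)
print(tensor.shape, tensor.dtype)              # (1, 3, 224, 224) float32
```

The resulting tensor can be passed directly as the input_data in the example above.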
Common variations
- Use onnxruntime.InferenceSession with the providers parameter to specify hardware acceleration like CUDA or TensorRT.
- Batch inputs for higher throughput in production.
- Integrate with web frameworks (e.g., FastAPI, Flask) for serving predictions as APIs.
import onnxruntime as ort
import numpy as np
# Use GPU if available
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# Example: batch input
batch_input = np.random.rand(8, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: batch_input})
print("Batch output shape:", outputs[0].shape)

Output:
Batch output shape: (8, 1000)
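Batching usually means collecting single-sample tensors from separate requests, running them through the session in one call, and splitting the results back out per request. A minimal sketch of that micro-batching pattern, where the hypothetical run_batch stands in for session.run so the snippet is runnable without a model file:

```python
import numpy as np

def run_batch(batch: np.ndarray) -> np.ndarray:
    # Placeholder for session.run(None, {input_name: batch})[0];
    # returns fake (N, 1000) logits so the sketch runs without a model.
    return np.zeros((batch.shape[0], 1000), dtype=np.float32)

# Eight single-sample requests arriving separately
requests = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(8)]

batch = np.concatenate(requests, axis=0)       # (8, 3, 224, 224)
logits = run_batch(batch)                      # one inference call
per_request = np.split(logits, len(requests))  # back to 8 x (1, 1000)
print(len(per_request), per_request[0].shape)  # 8 (1, 1000)
```

One inference call over a batch amortizes per-call overhead, which is where the throughput gain comes from.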
Troubleshooting
- If you see onnxruntime.capi.onnxruntime_pybind11_state.Fail errors, verify your input shapes and data types match the model's expected inputs.
- For performance issues, enable hardware acceleration providers like CUDA or TensorRT.
- Check ONNX model compatibility with your onnxruntime version; upgrade if needed.
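Most Fail errors come from shape or dtype mismatches, which can be caught before calling session.run. A hypothetical pre-flight check, where in real use expected_shape and expected_dtype would come from session.get_inputs()[0] (None marks a dynamic dimension):

```python
import numpy as np

def validate_input(tensor, expected_shape, expected_dtype):
    # Returns an error message, or None if the tensor matches.
    if tensor.dtype != expected_dtype:
        return f"dtype mismatch: got {tensor.dtype}, expected {expected_dtype}"
    if len(tensor.shape) != len(expected_shape):
        return f"rank mismatch: got {tensor.shape}, expected {expected_shape}"
    for got, want in zip(tensor.shape, expected_shape):
        if want is not None and got != want:  # None = dynamic dim, skip
            return f"shape mismatch: got {tensor.shape}, expected {expected_shape}"
    return None

x = np.random.rand(1, 3, 224, 224)  # float64: wrong dtype on purpose
print(validate_input(x, (None, 3, 224, 224), np.dtype(np.float32)))
```

Running this check on each request and returning the message as an API error gives callers a clearer diagnosis than the raw runtime exception.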
Key Takeaways
- Use onnxruntime for fast, production-ready ONNX model inference.
- Prepare input tensors matching the model's expected shape and dtype.
- Enable hardware acceleration providers for better performance in production.
- Batch inputs to improve throughput when serving multiple requests.
- Integrate onnxruntime inference into web APIs for scalable deployment.