Intermediate · 4 min read

How to deploy an ONNX model in production

Quick answer
Deploy an ONNX model in production with ONNX Runtime: load the .onnx file into an onnxruntime.InferenceSession, prepare input tensors that match the model's expected shapes and dtypes, and call run() to get predictions.

PREREQUISITES

  • Python 3.8+
  • pip install onnxruntime
  • An exported ONNX model file (.onnx)

Setup

Install the onnxruntime package, which provides a high-performance runtime for executing ONNX models. Ensure you have your .onnx model file ready for deployment.

bash
pip install onnxruntime

Step by step

This example shows how to load an ONNX model and run inference with onnxruntime. Replace model.onnx with your model path and adjust the input shape and dtype to match your model; the output shape shown assumes a typical 1000-class image classifier.

python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Get model input name
input_name = session.get_inputs()[0].name

# Prepare dummy input data (adjust shape and dtype to your model)
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run inference
outputs = session.run(None, {input_name: input_data})

# Print output shape
print("Output shape:", outputs[0].shape)
output
Output shape: (1, 1000)
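The raw arrays returned by session.run are typically logits. For a classifier like the (1, 1000) example above, a softmax plus argmax turns them into probabilities and a predicted class. This post-processing sketch is generic NumPy, not part of the onnxruntime API, and uses a tiny 3-class array for illustration:

```python
import numpy as np

def postprocess(logits):
    """Convert raw logits of shape (batch, num_classes) to per-row
    probabilities and the top predicted class index."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Tiny 3-class example instead of 1000 classes
logits = np.array([[0.1, 2.0, 0.3]], dtype=np.float32)
probs, top = postprocess(logits)
print("Top class:", top[0])
```

In a real pipeline you would apply this to outputs[0] from session.run and map the class index back to a label.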

Common variations

  • Use onnxruntime.InferenceSession with the providers parameter to enable hardware acceleration such as CUDA or TensorRT.
  • Batch inputs for higher throughput in production.
  • Integrate with web frameworks (e.g., FastAPI, Flask) for serving predictions as APIs.
python
import onnxruntime as ort
import numpy as np

# Prefer the CUDA provider, falling back to CPU if it is unavailable
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Example: batch input
batch_input = np.random.rand(8, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: batch_input})
print("Batch output shape:", outputs[0].shape)
output
Batch output shape: (8, 1000)
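The batching idea above can be sketched without any serving framework: collect per-request arrays into fixed-size batches, then pass each batch through a single session.run call. The helper below is pure NumPy and illustrative only (stack_requests is not an onnxruntime API):

```python
import numpy as np

def stack_requests(requests, batch_size=8):
    """Group per-request arrays, each shaped (3, 224, 224), into batches.

    Returns a list of float32 arrays shaped (n, 3, 224, 224); the last
    batch may hold fewer than batch_size items.
    """
    batches = []
    for start in range(0, len(requests), batch_size):
        chunk = requests[start:start + batch_size]
        batches.append(np.stack(chunk).astype(np.float32))
    return batches

# 10 queued requests become one batch of 8 and one of 2
requests = [np.random.rand(3, 224, 224) for _ in range(10)]
batches = stack_requests(requests)
print([b.shape for b in batches])
```

Each batch would then be passed as {input_name: batch} to session.run, amortizing per-call overhead across requests.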

Troubleshooting

  • If you see onnxruntime.capi.onnxruntime_pybind11_state.Fail errors, verify your input shapes and data types match the model's expected inputs.
  • For performance issues, enable hardware acceleration providers like CUDA or TensorRT.
  • Check that the model's ONNX opset version is supported by your installed onnxruntime; upgrade onnxruntime if needed.
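Many of the shape and dtype Fail errors above can be caught before session.run with a small pre-flight check. In this sketch, validate_input is a hypothetical helper and the expected shape is hard-coded for illustration; in practice you would read it from session.get_inputs()[0], where dynamic axes appear as non-integer entries:

```python
import numpy as np

def validate_input(arr, expected_shape, expected_dtype=np.float32):
    """Raise ValueError if arr does not match the model's declared input.

    A None in expected_shape marks a dynamic axis (e.g. batch size).
    """
    if arr.dtype != expected_dtype:
        raise ValueError(f"dtype {arr.dtype} != expected {expected_dtype}")
    if arr.ndim != len(expected_shape):
        raise ValueError(f"rank {arr.ndim} != expected {len(expected_shape)}")
    for actual, expected in zip(arr.shape, expected_shape):
        if expected is not None and actual != expected:
            raise ValueError(f"shape {arr.shape} != expected {expected_shape}")

# NCHW input passes; channels-last is rejected before it reaches the runtime
validate_input(np.zeros((4, 3, 224, 224), np.float32), (None, 3, 224, 224))
try:
    validate_input(np.zeros((4, 224, 224, 3), np.float32), (None, 3, 224, 224))
except ValueError as err:
    print("Rejected:", err)
```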

Key Takeaways

  • Use onnxruntime for fast, production-ready ONNX model inference.
  • Prepare input tensors matching the model's expected shape and dtype.
  • Enable hardware acceleration providers for better performance in production.
  • Batch inputs to improve throughput when serving multiple requests.
  • Integrate onnxruntime inference in web APIs for scalable deployment.
Verified 2026-04 · onnxruntime.InferenceSession