
ONNX model deployment on edge devices

Quick answer
Use ONNX Runtime to deploy ONNX models on edge devices for optimized inference. Convert your trained model to ONNX format, then run it with onnxruntime on the device, leveraging hardware acceleration when available.

PREREQUISITES

  • Python 3.8+
  • pip install onnx onnxruntime onnxruntime-tools
  • Trained model exported to ONNX format

Setup

Install the necessary Python packages for ONNX model deployment on edge devices. onnxruntime provides a lightweight runtime optimized for various hardware accelerators.

bash
pip install onnx onnxruntime onnxruntime-tools

Step by step

This example demonstrates loading an ONNX model and running inference on an edge device using onnxruntime. It assumes you have an ONNX model file model.onnx.

python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Inspect the model's expected input name, shape, and element type
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
input_type = session.get_inputs()[0].type  # e.g. 'tensor(float)'

# Create a random input tensor; dynamic dimensions (named or None) are
# replaced with 1, and float32 matches the common 'tensor(float)' type
input_data = np.random.rand(*[dim if isinstance(dim, int) else 1 for dim in input_shape]).astype(np.float32)

# Run inference
outputs = session.run(None, {input_name: input_data})

print("Model output:", outputs[0])
output
Model output: [[...]]  # numpy array output from the model

Common variations

  • Use onnxruntime with hardware acceleration providers like CUDAExecutionProvider or OpenVINOExecutionProvider for better performance on supported edge devices.
  • Convert PyTorch or TensorFlow models to ONNX using torch.onnx.export() or the tf2onnx converter (python -m tf2onnx.convert).
  • Use onnxruntime-tools to optimize ONNX models for edge deployment.
python
import onnxruntime as ort
import numpy as np

# Request CUDA first, falling back to CPU if it is unavailable
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)

# Prepare dummy input data matching the model's input shape
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
input_data = np.random.rand(*[dim if isinstance(dim, int) else 1 for dim in input_shape]).astype(np.float32)

# Run inference as before
outputs = session.run(None, {input_name: input_data})
print("Output with CUDA:", outputs[0])
output
Output with CUDA: [[...]]

Troubleshooting

  • If you get RuntimeError: Unable to load model, verify the ONNX model file path and format.
  • For shape mismatch errors, confirm input tensor shapes match the model's expected input.
  • If performance is slow, check if hardware acceleration providers are enabled and compatible with your device.
  • Use onnxruntime-tools to optimize the model graph for edge devices.

Key Takeaways

  • Use onnxruntime to run ONNX models efficiently on edge devices with hardware acceleration support.
  • Convert your trained models to ONNX format for compatibility and optimized inference.
  • Optimize ONNX models using onnxruntime-tools before deployment to improve speed and reduce size.
Verified 2026-04 · onnxruntime, CUDAExecutionProvider, OpenVINOExecutionProvider