Code beginner · 3 min read

How to run inference with ONNX Runtime in Python

Direct answer
Use the onnxruntime Python package to load an ONNX model and run inference by creating an InferenceSession and calling run() with input data.

Setup

Install
```bash
pip install onnxruntime numpy
```
Imports
```python
import onnxruntime as ort
import numpy as np
```

Examples

Input:  np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
Output: [[0.1, 0.9]] (example softmax probabilities)
Input:  np.array([[0.5, -1.2, 3.3]], dtype=np.float32)
Output: [[0.7, 0.3]] (example classification scores)
Input:  np.array([[0, 0, 0]], dtype=np.float32)
Output: [[0.5, 0.5]] (neutral prediction example)

Integration steps

  1. Install the onnxruntime and numpy packages.
  2. Load your ONNX model file with ort.InferenceSession.
  3. Prepare input data as a NumPy array matching the model's input shape and type.
  4. Run inference by calling session.run(output_names, {input_name: input_data}).
  5. Extract and use the output from the returned list.
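Step 3 is where most problems occur: ONNX models typically expect `float32` tensors with an explicit batch dimension. A minimal NumPy-only sketch of preparing a single sample (the 3-feature shape here is just an illustration; check your model's actual shape with `session.get_inputs()`):

```python
import numpy as np

# Raw features as plain Python floats (NumPy would default these to float64)
sample = [1.0, 2.0, 3.0]

# Convert to float32 and add a leading batch dimension: (3,) -> (1, 3)
input_data = np.asarray(sample, dtype=np.float32)[np.newaxis, :]

print(input_data.shape)  # (1, 3)
print(input_data.dtype)  # float32
```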

Full code

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
model_path = "model.onnx"
session = ort.InferenceSession(model_path)

# Prepare input data (example: batch size 1, 3 features)
input_name = session.get_inputs()[0].name
input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)

# Run inference (passing None requests all model outputs)
outputs = session.run(None, {input_name: input_data})

# Print output
print("Output:", outputs[0])
```

Output:
```
Output: [[0.1 0.9]]
```

API trace

Request
```json
{"input_name": [[1.0, 2.0, 3.0]]}
```
Response
```json
[[0.1, 0.9]]
```
Extract: outputs[0]

Variants

Run inference with multiple inputs

Use when your ONNX model requires multiple input tensors.

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_multi_input.onnx")

# Feed every input by name; the dict keys must match the model's input names
input_names = [inp.name for inp in session.get_inputs()]
input_data = {
    input_names[0]: np.array([[1.0, 2.0]], dtype=np.float32),
    input_names[1]: np.array([[3.0]], dtype=np.float32),
}

outputs = session.run(None, input_data)
print("Outputs:", outputs)
```
Run inference asynchronously

Use this approach to integrate ONNX Runtime calls into asynchronous Python applications. Note that `session.run()` is blocking, and ONNX Runtime's `run_async()` method is callback-based rather than awaitable, so the simplest way to keep the event loop responsive is to offload the blocking call to a thread pool.

```python
import onnxruntime as ort
import numpy as np
import asyncio

async def async_inference():
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name
    input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)

    # Offload the blocking run() call to a thread pool so the
    # event loop stays free while inference executes
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None, lambda: session.run(None, {input_name: input_data})
    )
    print("Async output:", outputs[0])

asyncio.run(async_inference())
```
Use GPU execution provider

Use when you have a compatible GPU and want faster inference. This requires the onnxruntime-gpu package (`pip install onnxruntime-gpu`) instead of the CPU-only onnxruntime package. Listing CPUExecutionProvider last gives you a fallback when CUDA is unavailable.

```python
import onnxruntime as ort
import numpy as np

# Create session with the CUDA provider for GPU acceleration,
# falling back to CPU if CUDA is not available
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
outputs = session.run(None, {input_name: input_data})
print("GPU output:", outputs[0])

Performance

Latency: ~10–50 ms per inference on CPU for typical small models
Cost: free for local inference; cloud costs depend on hosting environment
Rate limits: none for the local runtime; depends on your cloud provider if deployed
  • Batch inputs to reduce overhead per inference call.
  • Use GPU execution provider for large models to reduce latency.
  • Optimize your ONNX model with tools like ONNX Runtime's quantization.
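The batching tip can be sketched with plain NumPy: stack individual samples along a new leading axis and score them in a single `run()` call (the `run()` call is shown as a comment since it needs a loaded session, and assumes the model accepts a dynamic batch dimension):

```python
import numpy as np

# Three individual samples, each with 3 features
samples = [
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([0.5, -1.2, 3.3], dtype=np.float32),
    np.array([0.0, 0.0, 0.0], dtype=np.float32),
]

# Stack into one (3, 3) batch so a single call scores all samples
batch = np.stack(samples)
print(batch.shape)  # (3, 3)

# outputs = session.run(None, {input_name: batch})  # one call, three results
```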
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| CPU inference | ~10–50 ms | Free (local) | General purpose, no GPU required |
| GPU inference | ~1–10 ms | Free (local GPU) | High throughput, low latency |
| Async inference | ~10–50 ms | Free (local) | Integrating with async Python apps |

Quick tip

Always check your model's input names and shapes with `session.get_inputs()` before running inference.

Common mistake

Passing input data with an incorrect shape or dtype causes runtime errors or invalid outputs.
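A quick NumPy-only guard against this mistake, assuming a hypothetical model that expects a float32 tensor of shape (1, 3):

```python
import numpy as np

expected_shape = (1, 3)       # assumed, taken from session.get_inputs()[0].shape
raw = np.array([[1, 2, 3]])   # integer dtype: would be rejected by a float32 input

# Coerce the dtype and verify the shape before calling session.run()
input_data = raw.astype(np.float32)
assert input_data.shape == expected_shape, (
    f"expected {expected_shape}, got {input_data.shape}"
)
print(input_data.dtype)  # float32
```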

Verified 2026-04