Code beginner · 3 min read

How to run inference with ONNX Runtime in Python

Direct answer
Use the onnxruntime Python package to load an ONNX model and run inference by creating an InferenceSession and calling run() with input data.

Setup

Install
```bash
pip install onnxruntime numpy
```
Imports
```python
import onnxruntime as ort
import numpy as np
```

Examples

Input:  np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
Output: [[0.1, 0.9]] (example softmax probabilities)
Input:  np.array([[0.5, -1.2, 3.3]], dtype=np.float32)
Output: [[0.7, 0.3]] (example classification scores)
Input:  np.array([[0, 0, 0]], dtype=np.float32)
Output: [[0.5, 0.5]] (neutral prediction example)

Integration steps

  1. Install the onnxruntime and numpy packages.
  2. Load your ONNX model file with ort.InferenceSession.
  3. Prepare input data as a NumPy array matching the model's input shape and type.
  4. Run inference by calling session.run(output_names, {input_name: input_data}).
  5. Extract and use the output from the returned list.
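Step 3 is where most problems occur: ONNX models typically expect `float32` tensors with an explicit batch dimension. A minimal NumPy-only sketch of preparing a single sample (the 3-feature shape here is just an illustration; check your model's actual shape with `session.get_inputs()`):

```python
import numpy as np

# Raw features as plain Python floats (NumPy would default these to float64)
sample = [1.0, 2.0, 3.0]

# Convert to float32 and add a leading batch dimension: (3,) -> (1, 3)
input_data = np.asarray(sample, dtype=np.float32)[np.newaxis, :]

print(input_data.shape)  # (1, 3)
print(input_data.dtype)  # float32
```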

Full code

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
model_path = "model.onnx"
session = ort.InferenceSession(model_path)

# Prepare input data (example: batch size 1, 3 features)
input_name = session.get_inputs()[0].name
input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)

# Run inference (passing None requests all model outputs)
outputs = session.run(None, {input_name: input_data})

# Print output
print("Output:", outputs[0])
```

Output:
```
Output: [[0.1 0.9]]
```

API trace

Request
```json
{"input_name": [[1.0, 2.0, 3.0]]}
```
Response
```json
[[0.1, 0.9]]
```
Extract: outputs[0]

Variants

Run inference with multiple inputs

Use when your ONNX model requires multiple input tensors.

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_multi_input.onnx")

# Feed every input by name; the dict keys must match the model's input names
input_names = [inp.name for inp in session.get_inputs()]
input_data = {
    input_names[0]: np.array([[1.0, 2.0]], dtype=np.float32),
    input_names[1]: np.array([[3.0]], dtype=np.float32),
}

outputs = session.run(None, input_data)
print("Outputs:", outputs)
```
Run inference asynchronously

Use this approach to integrate ONNX Runtime calls into asynchronous Python applications. Note that `session.run()` is blocking, and ONNX Runtime's `run_async()` method is callback-based rather than awaitable, so the simplest way to keep the event loop responsive is to offload the blocking call to a thread pool.

```python
import onnxruntime as ort
import numpy as np
import asyncio

async def async_inference():
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name
    input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)

    # Offload the blocking run() call to a thread pool so the
    # event loop stays free while inference executes
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None, lambda: session.run(None, {input_name: input_data})
    )
    print("Async output:", outputs[0])

asyncio.run(async_inference())
```
Use GPU execution provider

Use when you have a compatible GPU and want faster inference. This requires the onnxruntime-gpu package (`pip install onnxruntime-gpu`) instead of the CPU-only onnxruntime package. Listing CPUExecutionProvider last gives you a fallback when CUDA is unavailable.

```python
import onnxruntime as ort
import numpy as np

# Create session with the CUDA provider for GPU acceleration,
# falling back to CPU if CUDA is not available
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
input_data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
outputs = session.run(None, {input_name: input_data})
print("GPU output:", outputs[0])

Performance

Latency: ~10–50 ms per inference on CPU for typical small models
Cost: free for local inference; cloud costs depend on hosting environment
Rate limits: none for the local runtime; depends on your cloud provider if deployed
  • Batch inputs to reduce overhead per inference call.
  • Use GPU execution provider for large models to reduce latency.
  • Optimize your ONNX model with tools like ONNX Runtime's quantization.
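The batching tip can be sketched with plain NumPy: stack individual samples along a new leading axis and score them in a single `run()` call (the `run()` call is shown as a comment since it needs a loaded session, and assumes the model accepts a dynamic batch dimension):

```python
import numpy as np

# Three individual samples, each with 3 features
samples = [
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([0.5, -1.2, 3.3], dtype=np.float32),
    np.array([0.0, 0.0, 0.0], dtype=np.float32),
]

# Stack into one (3, 3) batch so a single call scores all samples
batch = np.stack(samples)
print(batch.shape)  # (3, 3)

# outputs = session.run(None, {input_name: batch})  # one call, three results
```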
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| CPU inference | ~10–50 ms | Free (local) | General purpose, no GPU required |
| GPU inference | ~1–10 ms | Free (local GPU) | High throughput, low latency |
| Async inference | ~10–50 ms | Free (local) | Integrating with async Python apps |

Quick tip

Always check your model's input names and shapes with `session.get_inputs()` before running inference.

Common mistake

Passing input data with an incorrect shape or dtype causes runtime errors or invalid outputs.
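A quick NumPy-only guard against this mistake, assuming a hypothetical model that expects a float32 tensor of shape (1, 3):

```python
import numpy as np

expected_shape = (1, 3)       # assumed, taken from session.get_inputs()[0].shape
raw = np.array([[1, 2, 3]])   # integer dtype: would be rejected by a float32 input

# Coerce the dtype and verify the shape before calling session.run()
input_data = raw.astype(np.float32)
assert input_data.shape == expected_shape, (
    f"expected {expected_shape}, got {input_data.shape}"
)
print(input_data.dtype)  # float32
```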

Verified 2026-04