How to use ONNX Runtime GPU
Quick answer
Use onnxruntime with the CUDAExecutionProvider to run ONNX models on GPU. Install onnxruntime-gpu, load your model with InferenceSession specifying providers=["CUDAExecutionProvider"], and run inference for accelerated performance.

Prerequisites

- Python 3.8+
- NVIDIA GPU with CUDA 11.1 or higher
- pip install onnxruntime-gpu
Setup
Install the GPU-enabled ONNX Runtime package and verify your CUDA environment is properly configured.
pip install onnxruntime-gpu

Step by step
This example loads an ONNX model and runs inference on GPU using onnxruntime.
import onnxruntime as ort
import numpy as np
# Load ONNX model with CUDA execution provider
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
# Prepare dummy input matching model input shape and type
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
input_type = session.get_inputs()[0].type
# Example: create random input tensor (float32)
input_data = np.random.randn(*[dim if isinstance(dim, int) else 1 for dim in input_shape]).astype(np.float32)
# Run inference
outputs = session.run(None, {input_name: input_data})
print("Output shape:", [output.shape for output in outputs])

output
Output shape: [(1, 1000)]
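The list comprehension above substitutes 1 for any symbolic (dynamic) dimension in the model's input shape. That logic can be factored into a small helper so the substituted batch size is explicit; concretize_shape is a hypothetical name, not part of the onnxruntime API:

```python
import numpy as np

def concretize_shape(shape, dynamic_dim=1):
    """Replace symbolic dimensions (strings like 'batch', or None)
    with a concrete value so a dummy tensor can be allocated."""
    return [d if isinstance(d, int) else dynamic_dim for d in shape]

# Example: a model input declared as ['batch', 3, 224, 224]
shape = concretize_shape(["batch", 3, 224, 224], dynamic_dim=2)
dummy = np.random.randn(*shape).astype(np.float32)
print(shape)        # [2, 3, 224, 224]
print(dummy.shape)  # (2, 3, 224, 224)
```

The same helper works for any input reported by session.get_inputs(), since onnxruntime returns dynamic axes as strings or None rather than integers.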
Common variations
- Use providers=["CPUExecutionProvider"] to run on CPU instead.
- For async inference, use session.run_async() (available in recent versions); it takes a completion callback rather than returning an awaitable.
- Specify multiple providers so execution falls back if the GPU is unavailable.
import onnxruntime as ort
import numpy as np
# Fallback to CPU if GPU unavailable
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# Async inference example: run_async takes output names, inputs,
# a completion callback, and a user_data object (it does not return an awaitable)
def on_complete(outputs, user_data, err):
    if err:
        raise RuntimeError(err)
    print("Async output shape:", [output.shape for output in outputs])

input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
session.run_async(None, {input_name: input_data}, on_complete, None)

output
Async output shape: [(1, 1000)]
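The fallback provider list above can also be filtered against what the installed build actually reports, so the session never requests a provider that is missing. A minimal sketch; select_providers is a hypothetical helper, and in practice the available list would come from ort.get_available_providers():

```python
def select_providers(preferred, available):
    """Keep preferred providers that are actually available,
    falling back to CPU-only if none of them are."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]

# On a GPU build, both providers are reported as available
print(select_providers(preferred, ["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']

# On a CPU-only build, only the CPU provider is reported
print(select_providers(preferred, ["CPUExecutionProvider"]))
# ['CPUExecutionProvider']
```

Filtering up front keeps the intent explicit: ONNX Runtime already falls back down the providers list, but a pre-filtered list avoids warnings about unavailable providers at session creation.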
Troubleshooting
- If you get a CUDA error, verify that your NVIDIA driver and CUDA toolkit versions match the ONNX Runtime GPU requirements.
- Ensure your GPU supports CUDA 11.1 or higher.
- Use ort.get_available_providers() to check if CUDAExecutionProvider is available.
- If CUDAExecutionProvider is missing, reinstall onnxruntime-gpu and check your CUDA installation.
import onnxruntime as ort
print("Available providers:", ort.get_available_providers())

output
Available providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
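When GPU execution is mandatory, it can help to fail fast at startup instead of silently falling back to CPU. A minimal sketch, assuming the provider list comes from ort.get_available_providers(); require_cuda is a hypothetical name:

```python
def require_cuda(available_providers):
    """Raise immediately if the CUDA provider is not available."""
    if "CUDAExecutionProvider" not in available_providers:
        raise RuntimeError(
            "CUDAExecutionProvider not available; "
            "reinstall onnxruntime-gpu and check the CUDA installation."
        )

# Passes silently on a GPU build...
require_cuda(["CUDAExecutionProvider", "CPUExecutionProvider"])

# ...and raises on a CPU-only build
try:
    require_cuda(["CPUExecutionProvider"])
except RuntimeError as e:
    print("startup check failed:", e)
```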
Key Takeaways
- Install onnxruntime-gpu to enable GPU acceleration for ONNX models.
- Specify providers=["CUDAExecutionProvider"] when creating an InferenceSession to run on GPU.
- Verify CUDA and NVIDIA driver compatibility if GPU execution fails.
- Use ort.get_available_providers() to confirm GPU provider availability.
- Async inference and provider fallback improve flexibility in deployment.