How to run an LLM with ONNX Runtime
Quick answer
Use onnxruntime to load and run large language models exported to ONNX format. Load the model with InferenceSession, prepare input tensors, and run inference with session.run() to get model outputs efficiently on CPU or GPU.

Prerequisites
- Python 3.8+
- pip install onnxruntime onnx numpy transformers torch
- A pre-exported LLM ONNX model file
Setup
Install the required Python packages for ONNX Runtime and model handling. You need onnxruntime for inference, transformers and torch for tokenization and model export, and numpy for tensor manipulation.
pip install onnxruntime onnx transformers torch numpy

Step by step
This example shows how to load a GPT-2 model exported to ONNX, tokenize input text, run inference with ONNX Runtime, and decode the output tokens.
import onnxruntime as ort
import numpy as np
from transformers import GPT2Tokenizer
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Load ONNX model (CPU execution provider shown; see variations for GPU)
session = ort.InferenceSession('gpt2.onnx', providers=['CPUExecutionProvider'])
# Prepare input text
input_text = "Hello, ONNX Runtime!"
inputs = tokenizer(input_text, return_tensors='np')
# ONNX Runtime expects input as numpy arrays
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# Run inference
outputs = session.run(None, {'input_ids': input_ids, 'attention_mask': attention_mask})
# The output is logits; get predicted token ids
logits = outputs[0]
predicted_token_id = np.argmax(logits[:, -1, :], axis=-1)
# Decode predicted token
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Input: {input_text}")
print(f"Predicted next token: {predicted_token}")

Output
Input: Hello, ONNX Runtime!
Predicted next token: world
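The single-step argmax above extends naturally to multi-token generation. Below is a minimal greedy-decoding sketch; the next_token_id helper is our own name, not an onnxruntime API, and the commented loop assumes the session, input_ids, and attention_mask from the example above:

```python
import numpy as np

def next_token_id(logits: np.ndarray) -> np.ndarray:
    """Greedy pick: the highest-scoring token at the last sequence position."""
    return np.argmax(logits[:, -1, :], axis=-1)

# Sketch of a generation loop (assumes `session`, `input_ids`, and
# `attention_mask` from the step-by-step example above):
# for _ in range(max_new_tokens):
#     logits = session.run(None, {'input_ids': input_ids,
#                                 'attention_mask': attention_mask})[0]
#     tok = next_token_id(logits)  # shape: (batch,)
#     input_ids = np.concatenate([input_ids, tok[:, None]], axis=1)
#     attention_mask = np.concatenate(
#         [attention_mask, np.ones_like(tok)[:, None]], axis=1)
```

Note that re-running the full sequence each step is the simplest approach; production setups typically export the model with past key/value inputs to avoid recomputing earlier positions.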
Common variations
- Use the onnxruntime-gpu package for GPU acceleration.
- Export other transformer models (e.g., GPT-J, LLaMA) to ONNX using transformers.onnx or torch.onnx.export.
- Run inference asynchronously with session.run_async() for improved throughput.
- Batch multiple inputs by concatenating input tensors along the batch dimension.
Troubleshooting
- If you get shape mismatch errors, verify input tensor shapes match the ONNX model's input signature.
- For performance issues, ensure you use the GPU-enabled ONNX Runtime and optimize the ONNX model with onnxruntime-tools.
- If tokenization output keys differ, inspect the tokenizer output and map inputs accordingly.
- Check ONNX model opset version compatibility with your ONNX Runtime version.
Key Takeaways
- Use onnxruntime.InferenceSession to load and run ONNX LLM models efficiently.
- Prepare inputs as numpy arrays matching the model's input names and shapes.
- GPU acceleration requires installing onnxruntime-gpu and compatible hardware.
- Export models carefully with correct opset versions to avoid runtime errors.
- Batching inputs and async inference improve throughput for production use.