
How to run an LLM with ONNX Runtime

Quick answer
Use the onnxruntime package to run large language models exported to ONNX format: load the model with InferenceSession, prepare the inputs as NumPy arrays, and call session.run() to get the model outputs efficiently on CPU or GPU.

PREREQUISITES

  • Python 3.8+
  • pip install onnxruntime onnx numpy transformers torch
  • Pre-exported LLM ONNX model file

Setup

Install the required Python packages. You need onnxruntime for inference, transformers and torch for tokenization and model export, and numpy for tensor manipulation.

bash
pip install onnxruntime onnx transformers torch numpy

Step by step

This example shows how to load a GPT-2 model exported to ONNX, tokenize input text, run inference with ONNX Runtime, and decode the output tokens.

python
import onnxruntime as ort
import numpy as np
from transformers import GPT2Tokenizer

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load ONNX model
session = ort.InferenceSession('gpt2.onnx')

# Prepare input text
input_text = "Hello, ONNX Runtime!"
inputs = tokenizer(input_text, return_tensors='np')

# ONNX Runtime expects input as numpy arrays
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Run inference. The input names must match the exported model's
# signature; check session.get_inputs() if unsure.
outputs = session.run(None, {'input_ids': input_ids, 'attention_mask': attention_mask})

# The output is logits; get predicted token ids
logits = outputs[0]
predicted_token_id = np.argmax(logits[:, -1, :], axis=-1)

# Decode predicted token
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Input: {input_text}")
print(f"Predicted next token: {predicted_token}")
output
Input: Hello, ONNX Runtime!
Predicted next token:  world
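The example above predicts only a single token. To generate longer text, you feed each predicted token back into the model. The greedy_decode function below is a hypothetical helper sketching that loop; run_model stands in for a small wrapper around session.run() with your model's input names:

```python
import numpy as np

def greedy_decode(run_model, input_ids, max_new_tokens, eos_id=None):
    """Append the argmax token one step at a time.

    run_model: callable taking an int64 array of shape [1, seq] and
               returning logits of shape [1, seq, vocab] (e.g. a thin
               wrapper around session.run).
    """
    ids = np.array(input_ids, dtype=np.int64).reshape(1, -1)
    for _ in range(max_new_tokens):
        logits = run_model(ids)
        next_id = int(np.argmax(logits[0, -1]))
        ids = np.concatenate([ids, [[next_id]]], axis=1)
        if eos_id is not None and next_id == eos_id:
            break
    return ids[0].tolist()

# With the session from the main example, run_model could be:
# run_model = lambda ids: session.run(
#     None, {'input_ids': ids, 'attention_mask': np.ones_like(ids)})[0]
# token_ids = greedy_decode(run_model, inputs['input_ids'][0], 20)
# print(tokenizer.decode(token_ids))
```

Greedy argmax decoding is the simplest strategy; sampling or beam search would replace the np.argmax line. Re-running the full sequence each step is also wasteful; production setups export the model with past key/value inputs to cache attention state.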

Common variations

  • Use the onnxruntime-gpu package and pass providers=['CUDAExecutionProvider'] to InferenceSession for GPU acceleration.
  • Export other transformer models (e.g., GPT-J, LLaMA) to ONNX with the optimum library, transformers.onnx, or torch.onnx.export.
  • Run inference asynchronously with session.run_async() (available in recent ONNX Runtime releases) for improved throughput.
  • Batch multiple inputs by padding them to a common length and stacking them along the batch dimension.
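For the batching variation, sequences of different lengths must be padded before they can be stacked. pad_batch below is a hypothetical helper that left-pads (the usual convention for decoder-only models like GPT-2) and builds the matching attention mask:

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Left-pad variable-length token id lists into a single batch.

    Returns int64 arrays input_ids and attention_mask of shape
    [batch, max_len], with 1s in the mask over real tokens.
    """
    max_len = max(len(s) for s in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        input_ids[i, -len(seq):] = seq
        attention_mask[i, -len(seq):] = 1
    return input_ids, attention_mask
```

For the GPU variation, pass providers=['CUDAExecutionProvider', 'CPUExecutionProvider'] to InferenceSession after installing onnxruntime-gpu; ONNX Runtime falls back to the CPU provider if CUDA is unavailable.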

Troubleshooting

  • If you get shape mismatch errors, verify input tensor shapes match the ONNX model's input signature.
  • For performance issues, use the GPU-enabled ONNX Runtime where available and enable graph optimizations via SessionOptions.graph_optimization_level or the onnxruntime.transformers model optimizer (the older onnxruntime-tools package is deprecated).
  • If tokenization output keys differ, inspect tokenizer output and map inputs accordingly.
  • Check ONNX model opset version compatibility with your ONNX Runtime version.

Key Takeaways

  • Use onnxruntime.InferenceSession to load and run ONNX LLM models efficiently.
  • Prepare inputs as numpy arrays matching the model's input names and shapes.
  • GPU acceleration requires installing onnxruntime-gpu and compatible hardware.
  • Export models carefully with correct opset versions to avoid runtime errors.
  • Batching inputs and async inference improve throughput for production use.
Verified 2026-04 · gpt2, GPT-J, LLaMA