How to run an LLM with ONNX Runtime
Quick answer
Use onnxruntime to load and run large language models exported to ONNX format. Load the model with InferenceSession, prepare input tensors, and run inference with session.run() to get model outputs efficiently on CPU or GPU.

Prerequisites
- Python 3.8+
- pip install onnxruntime onnx numpy transformers torch
- A pre-exported LLM ONNX model file
Setup
Install the required Python packages for ONNX Runtime and model handling. You need onnxruntime for inference, transformers and torch for tokenization and model export, and numpy for tensor manipulation.
pip install onnxruntime onnx transformers torch numpy

Step by step
This example shows how to load a GPT-2 model exported to ONNX, tokenize input text, run inference with ONNX Runtime, and decode the output tokens.
import onnxruntime as ort
import numpy as np
from transformers import GPT2Tokenizer
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Load ONNX model (CPU execution provider shown; see variations for GPU)
session = ort.InferenceSession('gpt2.onnx', providers=['CPUExecutionProvider'])
# Prepare input text
input_text = "Hello, ONNX Runtime!"
inputs = tokenizer(input_text, return_tensors='np')
# ONNX Runtime expects input as numpy arrays
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# Run inference
outputs = session.run(None, {'input_ids': input_ids, 'attention_mask': attention_mask})
# The output is logits; get predicted token ids
logits = outputs[0]
predicted_token_id = np.argmax(logits[:, -1, :], axis=-1)
# Decode predicted token
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Input: {input_text}")
print(f"Predicted next token: {predicted_token}")

Output
Input: Hello, ONNX Runtime!
Predicted next token: world
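The single-step argmax above extends naturally to multi-token generation. Below is a minimal greedy-decoding sketch; the next_token_id helper is our own name, not an onnxruntime API, and the commented loop assumes the session, input_ids, and attention_mask from the example above:

```python
import numpy as np

def next_token_id(logits: np.ndarray) -> np.ndarray:
    """Greedy pick: the highest-scoring token at the last sequence position."""
    return np.argmax(logits[:, -1, :], axis=-1)

# Sketch of a generation loop (assumes `session`, `input_ids`, and
# `attention_mask` from the step-by-step example above):
# for _ in range(max_new_tokens):
#     logits = session.run(None, {'input_ids': input_ids,
#                                 'attention_mask': attention_mask})[0]
#     tok = next_token_id(logits)  # shape: (batch,)
#     input_ids = np.concatenate([input_ids, tok[:, None]], axis=1)
#     attention_mask = np.concatenate(
#         [attention_mask, np.ones_like(tok)[:, None]], axis=1)
```

Note that re-running the full sequence each step is the simplest approach; production setups typically export the model with past key/value inputs to avoid recomputing earlier positions.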
Common variations
- Use the onnxruntime-gpu package for GPU acceleration.
- Export other transformer models (e.g., GPT-J, LLaMA) to ONNX using transformers.onnx or torch.onnx.export.
- Run inference asynchronously with session.run_async() for improved throughput.
- Batch multiple inputs by concatenating input tensors along the batch dimension.
Troubleshooting
- If you get shape mismatch errors, verify input tensor shapes match the ONNX model's input signature.
- For performance issues, ensure you use the GPU-enabled ONNX Runtime and optimize the ONNX model with onnxruntime-tools.
- If tokenization output keys differ, inspect the tokenizer output and map inputs accordingly.
- Check ONNX model opset version compatibility with your ONNX Runtime version.
Key Takeaways
- Use onnxruntime.InferenceSession to load and run ONNX LLM models efficiently.
- Prepare inputs as numpy arrays matching the model's input names and shapes.
- GPU acceleration requires installing onnxruntime-gpu and compatible hardware.
- Export models carefully with correct opset versions to avoid runtime errors.
- Batching inputs and async inference improve throughput for production use.