How-to · Beginner · 3 min read

How to use ONNX with FastAPI

Quick answer
Use onnxruntime to load and run your ONNX model, and integrate it with FastAPI to create a REST API endpoint for inference. This gives you an efficient, scalable serving layer, with optional asynchronous request handling.

Prerequisites

  • Python 3.8+
  • pip install fastapi uvicorn onnxruntime numpy

Setup

Install the required packages to run ONNX models and serve them with FastAPI. Use uvicorn as the ASGI server for running the API.

bash
pip install fastapi uvicorn onnxruntime numpy

Step by step

This example demonstrates loading an ONNX model with onnxruntime and serving it via a FastAPI endpoint. The API accepts JSON input, runs inference, and returns the prediction.

python
import os
from typing import List, Union

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Define input data schema: a flat feature vector or a batch of vectors
class ModelInput(BaseModel):
    data: Union[List[float], List[List[float]]]

app = FastAPI()

# Load ONNX model
model_path = "model.onnx"
if not os.path.exists(model_path):
    raise FileNotFoundError(f"ONNX model not found at {model_path}")
session = ort.InferenceSession(model_path)

# Get model input name
input_name = session.get_inputs()[0].name

@app.post("/predict")
def predict(payload: ModelInput):  # avoid shadowing the built-in `input`
    try:
        # Convert input list to numpy array
        input_array = np.array(payload.data, dtype=np.float32)
        # Add batch dimension if needed
        if input_array.ndim == 1:
            input_array = input_array.reshape(1, -1)
        # Run inference
        outputs = session.run(None, {input_name: input_array})
        # Return first output as list
        return {"prediction": outputs[0].tolist()}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

# To run the server:
# uvicorn filename:app --reload

Common variations

FastAPI endpoints can also be declared async, but note that onnxruntime's session.run() is a blocking call: invoking it directly inside an async def stalls the event loop. Offload it to a worker thread instead (FastAPI provides run_in_threadpool for this). For batch inference, accept a list of inputs in a single request. If compatible hardware is available, list CUDAExecutionProvider ahead of CPUExecutionProvider so onnxruntime uses the GPU and falls back to CPU otherwise.

python
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()
# Prefer the GPU provider; onnxruntime falls back to CPU if CUDA is unavailable
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

@app.post("/predict_async")
async def predict_async(data: list):
    input_array = np.array(data, dtype=np.float32)
    if input_array.ndim == 1:
        input_array = input_array.reshape(1, -1)
    # Run the blocking inference call in a worker thread
    outputs = await run_in_threadpool(session.run, None, {input_name: input_array})
    return {"prediction": outputs[0].tolist()}

Troubleshooting

  • If you get FileNotFoundError, verify the model.onnx path is correct relative to the working directory.
  • If inference fails, check that the input array's shape and dtype match the model's expected input (inspect session.get_inputs()).
  • For performance issues, enable GPU execution providers in onnxruntime if compatible hardware is available.
  • Use uvicorn --reload during development for live code reloads.
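For the shape-mismatch case, it helps to compare the incoming array's shape against what the session reports. `shape_matches` below is a hypothetical helper (not part of onnxruntime) that treats dynamic dimensions, which onnxruntime reports as `None` or symbolic strings like `'batch'`, as wildcards:

```python
def shape_matches(actual, expected):
    """Check an array shape against a model input shape, where dynamic
    dimensions appear as None or strings (e.g. 'batch', 'N')."""
    if len(actual) != len(expected):
        return False
    return all(
        exp is None or isinstance(exp, str) or act == exp
        for act, exp in zip(actual, expected)
    )

# e.g. session.get_inputs()[0].shape might be ['batch', 4]
print(shape_matches((1, 4), ["batch", 4]))  # True
print(shape_matches((4,), ["batch", 4]))    # False
```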

Key takeaways

  • Use onnxruntime to load and run ONNX models efficiently in Python.
  • Integrate onnxruntime with FastAPI to serve models as REST APIs.
  • Define Pydantic models for input validation in FastAPI endpoints.
  • Enable GPU execution providers in onnxruntime for faster inference if hardware supports it.
  • Run uvicorn with --reload for development convenience.
Verified 2026-04 · onnxruntime, FastAPI