Code Beginner easy · 6 min

FastAPI vs BentoML vs TorchServe

What you will learn

FastAPI is a general-purpose web framework; BentoML and TorchServe are ML-specific deployment platforms that trade flexibility for built-in model management.

Why this matters

You'll choose the wrong tool if you don't understand what each solves: FastAPI requires you to build model loading and batching yourself, while BentoML packages everything, and TorchServe is PyTorch-only. Picking wrong means extra work or locked-in infrastructure.

Skip if: Don't use BentoML or TorchServe if you're deploying a single model once and never changing it, or if you need custom inference logic (A/B testing, feature engineering, post-processing): FastAPI's simplicity wins. Don't use FastAPI if you're deploying dozens of models that need versioning, canary rollouts, and GPU management without writing infrastructure code.

Explanation

What each tool is: FastAPI is a Python web framework for building APIs. BentoML is a model packaging and serving platform that handles model versioning, containerization, and deployment; its HTTP layer is built on Starlette, the same ASGI toolkit FastAPI uses, not on FastAPI itself. TorchServe is a PyTorch-native inference server, developed by AWS and Meta, that manages model lifecycle, batching, and metrics for PyTorch models specifically.

How they differ mechanically: FastAPI requires you to write the code that loads a model, handles batching, and manages dependencies. BentoML wraps your model in a standardized service definition and generates versioning, OpenAPI docs, and deployment configs from it. TorchServe takes a model archive (a .mar file built from your .pt weights with torch-model-archiver) plus a config file. With one of its built-in handlers you write little or no serving code, though custom pre- or post-processing needs a Python handler. It provides batching, model management, and metrics out of the box.

When to use each: Use FastAPI if you need full control or are serving non-ML endpoints alongside models. Use BentoML if you have 1-10 models that evolve frequently and want MLOps features without Kubernetes complexity. Use TorchServe if you're PyTorch-only and want production-grade batching and metrics with little or no serving code.

Analogy

FastAPI is a blank canvas (you paint the model serving). BentoML is a paint-by-numbers kit (follow the template, get ML features). TorchServe is a vending machine (drop in a packaged PyTorch model, it works).

Code

Runs as a plain script, assuming fastapi, numpy, and uvicorn are installed. To serve real HTTP requests, start the app with uvicorn instead.

python

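from contextlib import asynccontextmanager

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Module-level holder for the loaded model; None until load_model() runs
model_state = None

def load_model():
    """Load the toy 'model' (a weight vector) into module-level state."""
    global model_state
    model_state = {'weights': np.array([0.5, 0.3, 0.2])}
    print(f"Model loaded: {model_state}")

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once when an ASGI server starts the app, before any request
    load_model()
    yield

app = FastAPI(lifespan=lifespan)

class PredictionRequest(BaseModel):
    features: list[float]

@app.post('/predict')
async def predict(request: PredictionRequest):
    if model_state is None:
        # Report a proper HTTP error instead of a 200 response with an error body
        raise HTTPException(status_code=503, detail='Model not loaded')
    features = np.array(request.features)
    # float() converts the NumPy scalar to a JSON-serializable built-in type
    prediction = float(np.dot(features, model_state['weights']))
    return {'prediction': prediction, 'model_version': 'fastapi-v1'}

if __name__ == '__main__':
    import asyncio
    # Direct script run: no server starts the app, so load the model explicitly
    load_model()
    test_request = PredictionRequest(features=[1.0, 2.0, 3.0])
    result = asyncio.run(predict(test_request))
    print(f"Request result: {result}")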

Output

Model loaded: {'weights': array([0.5, 0.3, 0.2])}
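Request result: {'prediction': 1.7, 'model_version': 'fastapi-v1'}

(1.0 × 0.5 + 2.0 × 0.3 + 3.0 × 0.2 = 1.7; the last printed digits may differ slightly because of floating-point rounding, for example 1.7000000000000002.)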

What just happened?

The code defined a FastAPI app that manually loads a numpy model on startup, then exposes a /predict endpoint that takes features and returns a dot product. You own every part: loading, versioning, input validation, output format. If you wanted model versioning, you'd write it yourself. If you wanted batching, you'd add a queue. If you wanted metrics, you'd hook in Prometheus manually.
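The "add a queue" remark can be made concrete. What follows is a minimal micro-batching sketch using only the standard library, not a production design: predict_batch, batch_worker, and the batch size and wait limits are hypothetical names and values, and the pure-Python dot product stands in for a real vectorized model call.

```python
import asyncio

WEIGHTS = [0.5, 0.3, 0.2]  # same toy weights as the example above


def predict_batch(batch):
    """Score a whole batch in one call (stand-in for a vectorized model call)."""
    return [sum(x * w for x, w in zip(features, WEIGHTS)) for features in batch]


async def batch_worker(queue, max_batch_size=8, max_wait_s=0.01):
    """Group queued requests into batches and resolve each caller's future."""
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]  # wait until at least one request arrives
        deadline = loop.time() + max_wait_s
        while len(items) < max_batch_size:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = predict_batch([features for features, _ in items])
        for (_, future), result in zip(items, results):
            future.set_result(result)


async def predict(queue, features):
    """What an endpoint would call: enqueue, then wait for the batched result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((features, future))
    return await future


async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(
        predict(queue, [1.0, 2.0, 3.0]),
        predict(queue, [0.0, 0.0, 0.0]),
    )
    worker.cancel()  # stop the background worker once the demo is done
    return results
```

Calling asyncio.run(demo()) sends two concurrent requests through one worker. In a FastAPI app the worker task would typically be started in the startup hook, and this is the kind of plumbing BentoML and TorchServe provide for you.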

Common gotcha

The biggest mistake: developers assume FastAPI serves models automatically. It doesn't. The model lives in memory (model_state), and you have to ensure it's loaded before requests arrive, handle concurrent requests safely, and implement anything beyond basic inference yourself. Concurrency cuts two ways: CPU-bound inference inside an async def endpoint blocks the event loop for every other request, while plain def endpoints run in a thread pool, where a model that isn't thread-safe needs a lock. With BentoML, you'd decorate a service class and get versioning, containerization, and a bento.yaml for free.
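One way to handle the concurrency concern is sketched below, under stated assumptions: blocking_predict is a hypothetical stand-in for slow, non-thread-safe inference (it just sleeps and sums), and the heartbeat task exists only to show that the event loop keeps running while inference happens in a worker thread. It needs Python 3.9+ for asyncio.to_thread.

```python
import asyncio
import threading
import time

# Serializes access to a model that is assumed not to be thread-safe
_model_lock = threading.Lock()


def blocking_predict(features):
    """Hypothetical CPU-bound inference; the sleep simulates the work."""
    with _model_lock:
        time.sleep(0.05)
        return sum(features)


async def predict(features):
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_predict, features)


async def demo():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run during inference
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    result = await predict([1.0, 2.0, 3.0])
    hb.cancel()
    return result, ticks
```

If predict called blocking_predict directly instead of going through asyncio.to_thread, the heartbeat would not tick at all while inference runs, which is what other requests would experience as a stalled server.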

Error recovery

'Model not loaded' error when /predict is called (model_state is None)

The startup hook didn't run. FastAPI runs startup logic only when an ASGI server such as uvicorn starts the app, not when you execute the script directly or call predict() yourself. Call load_model() explicitly in the __main__ block before testing, as the example does.

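TypeError: Object of type ndarray is not JSON serializable

NumPy arrays and NumPy scalars (such as numpy.float64) are not built-in Python types, so JSON encoding of the response can fail on them; the exact exception text varies with the FastAPI version. Convert to built-in types first: float(...) for a scalar, as in float(np.dot(...)), or .tolist() for an array.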

RuntimeWarning: coroutine 'predict' was never awaited

You called an async function (predict) like a regular function. That returns a coroutine object instead of the result, and the function body never runs. In a plain script, wrap the call in asyncio.run(); inside another async function, await it.
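The difference between calling and awaiting an async function is easy to see in isolation. This is a small standard-library sketch; the predict function here is a simplified stand-in for the endpoint, not the one from the example.

```python
import asyncio
import inspect


async def predict(features):
    """Simplified stand-in for the async endpoint function."""
    return {'prediction': sum(features)}


def call_without_await():
    # Calling an async function only creates a coroutine object; nothing runs yet
    coro = predict([1.0, 2.0])
    is_coroutine = inspect.iscoroutine(coro)
    coro.close()  # close it explicitly to avoid the "never awaited" warning
    return is_coroutine


def call_with_asyncio_run():
    # asyncio.run() starts an event loop, runs the coroutine, returns its result
    return asyncio.run(predict([1.0, 2.0]))
```

Here call_without_await() reports that it received a coroutine rather than a prediction, while call_with_asyncio_run() returns the actual result dictionary.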

Experienced dev note

Here's the real tension: FastAPI forces you to own the model lifecycle (loading, caching, versioning, canary deployments). This sounds like extra work, but it's actually an advantage because you'll naturally write better code: you'll see exactly where the model loads, where it's cached, and what happens on failure. BentoML hides this in a framework, which feels faster until you need to debug why model updates don't roll out correctly. For your first deployment, use FastAPI. Once you have 5+ models and DevOps complexity, BentoML saves time. TorchServe is powerful, but if you're not PyTorch-only, it's overkill.
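To make "owning the lifecycle" concrete, here is a minimal sketch of explicit loading, caching, and versioning with only the standard library. The version names, the weights table, and the LOAD_CALLS list are hypothetical and exist purely for illustration; a real loader would read model files from disk or a registry.

```python
from functools import lru_cache

LOAD_CALLS = []  # records each real load, for demonstration only

# Hypothetical weight table standing in for model files on disk
_WEIGHTS_BY_VERSION = {
    'v1': [0.5, 0.3, 0.2],
    'v2': [0.4, 0.4, 0.2],
}


@lru_cache(maxsize=None)
def get_model(version):
    """Load a model version once; later calls return the cached object."""
    LOAD_CALLS.append(version)
    return {'version': version, 'weights': _WEIGHTS_BY_VERSION[version]}


def predict(features, version='v1'):
    """Score features with the requested version, loading it on first use."""
    model = get_model(version)
    score = sum(x * w for x, w in zip(features, model['weights']))
    return {'prediction': score, 'model_version': model['version']}
```

Every load is visible in one place: repeated predict calls for the same version reuse the cached model, and an unknown version fails loudly with a KeyError instead of silently serving the wrong model.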

Check your understanding

You've written a FastAPI endpoint that serves a scikit-learn model. Your colleague wants to serve two different versions of the model (v1 and v2) and route 90% of traffic to v1 and 10% to v2. Which framework (FastAPI, BentoML, or TorchServe) would require you to write the least code to implement this, and why?

Show answer hint

The expected answer picks BentoML and explains why: with FastAPI you write the version registry and the routing yourself; BentoML gives you model versioning as part of its service template, so mostly the traffic split remains to configure; TorchServe targets PyTorch models, so it is a poor fit for a scikit-learn model in the first place. The key insight is that ML-specific platforms exist because DevOps patterns repeat across ML teams.
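For a sense of scale, the routing decision you would hand-write in the FastAPI case is small. Below is a minimal sketch of a weighted version picker using only the standard library; make_router is a hypothetical helper, and the fixed seed is there only to make the demonstration repeatable.

```python
import random


def make_router(weights, seed=None):
    """Return a function that picks a model version according to traffic weights."""
    rng = random.Random(seed)
    versions = list(weights)
    shares = [weights[v] for v in versions]

    def choose_version():
        return rng.choices(versions, weights=shares, k=1)[0]

    return choose_version


# Route roughly 90% of requests to v1 and 10% to v2
choose = make_router({'v1': 0.9, 'v2': 0.1}, seed=0)
counts = {'v1': 0, 'v2': 0}
for _ in range(10_000):
    counts[choose()] += 1
```

The picker is the easy part. The code you would still own around it is loading both versions, logging which version served each request, and changing the weights without a redeploy.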

VERSION FastAPI 0.115.x (current stable as of April 2026) treats @app.on_event('startup') and @app.on_event('shutdown') as deprecated: use a lifespan context manager passed as FastAPI(lifespan=...) instead. Both work in 0.115.x, but the lifespan pattern is preferred. BentoML 1.2.x+ requires Python 3.9+; TorchServe 0.12.x+ requires Java 11+.

Next, learn how to structure model loading and caching in FastAPI endpoints to avoid reloading on every request: this is where most FastAPI ML services break in production.
