FastAPI vs BentoML vs TorchServe
Why this matters
You'll choose the wrong tool if you don't understand what each one solves. FastAPI leaves model loading and batching to you, BentoML packages the model together with its serving code and dependencies, and TorchServe targets PyTorch models only. Picking the wrong one means extra work or infrastructure you're locked into.
Explanation
What each tool is: FastAPI is a Python web framework for building APIs. BentoML is a model packaging and serving framework that handles model versioning, containerization, and deployment; it is built on Starlette, the ASGI toolkit that FastAPI also builds on, rather than on FastAPI itself. TorchServe is a PyTorch-native inference server, developed by AWS and Meta, that manages model lifecycle, batching, and metrics for PyTorch models specifically.
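How they differ mechanically: FastAPI requires you to manually write code to load a model, handle batching, and manage dependencies. BentoML wraps your model in a standardized service definition and generates versioning, OpenAPI docs, and deployment configs from it. TorchServe takes a model archive (a .mar file built with torch-model-archiver from a serialized model such as a .pt file, plus a handler) and a config file. For common tasks the built-in handlers mean you write little or no Python, and the server handles batching, model management, and metrics out of the box. Quantization is not part of that: you would apply it in PyTorch before packaging the model.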
When to use each: Use FastAPI if you need full control or are serving non-ML endpoints alongside models. Use BentoML if you have 1-10 models that evolve frequently and want MLOps features without Kubernetes complexity. Use TorchServe if you're PyTorch-only and want production-grade batching and metrics with little or no serving code of your own.
Analogy
FastAPI is a blank canvas (you paint the model serving). BentoML is a paint-by-numbers kit (follow the template, get ML features). TorchServe is a vending machine (drop in a packaged PyTorch model, it serves it).
Code
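from contextlib import asynccontextmanager

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

model_state = None  # module-level cache for the loaded model


def load_model():
    """Load the toy 'model' (a weight vector) into memory."""
    global model_state
    model_state = {'weights': np.array([0.5, 0.3, 0.2])}
    print(f"Model loaded: {model_state}")


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once before the server accepts requests
    # (replaces the deprecated @app.on_event('startup') hook).
    load_model()
    yield


app = FastAPI(lifespan=lifespan)


class PredictionRequest(BaseModel):
    features: list[float]


@app.post('/predict')
async def predict(request: PredictionRequest):
    if model_state is None:
        # 503 rather than a 200 response that carries an error body
        raise HTTPException(status_code=503, detail='Model not loaded')
    features = np.array(request.features)
    # Convert to a plain float (NumPy scalars are not JSON serializable) and
    # round so floating-point noise does not leak into the response.
    prediction = round(float(np.dot(features, model_state['weights'])), 6)
    return {'prediction': prediction, 'model_version': 'fastapi-v1'}


if __name__ == '__main__':
    # Smoke test that calls the handler directly, without starting a server.
    # To serve over HTTP, run e.g. `uvicorn main:app` (assuming the file is main.py).
    import asyncio

    load_model()
    test_request = PredictionRequest(features=[1.0, 2.0, 3.0])
    result = asyncio.run(predict(test_request))
    print(f"Request result: {result}")

Output
Model loaded: {'weights': array([0.5, 0.3, 0.2])}
Request result: {'prediction': 1.7, 'model_version': 'fastapi-v1'}
What just happened?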
The code defined a FastAPI app that manually loads a numpy model on startup, then exposes a /predict endpoint that takes features and returns a dot product. You own every part: loading, versioning, input validation, output format. If you wanted model versioning, you'd write it yourself. If you wanted batching, you'd add a queue. If you wanted metrics, you'd hook in Prometheus manually.
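To make the "you'd add a queue" remark concrete, here is a minimal micro-batching sketch using only the standard library. It is an illustration under assumptions, not how any of the three tools implements batching: the names `batch_worker`, `predict_batch`, `MAX_BATCH`, and `MAX_WAIT_S` are hypothetical, and the plain-Python dot product stands in for a real vectorized model call.

```python
import asyncio

WEIGHTS = [0.5, 0.3, 0.2]  # same toy weights as the example above
MAX_BATCH = 8              # hypothetical cap on batch size
MAX_WAIT_S = 0.01          # hypothetical wait for more requests to arrive


def predict_batch(batch):
    """Score a whole batch in one call (stand-in for a vectorized model)."""
    return [sum(x * w for x, w in zip(features, WEIGHTS)) for features in batch]


async def batch_worker(queue):
    """Group queued requests into batches and resolve each caller's future."""
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]  # block until the first request arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = predict_batch([features for features, _ in items])
        for (_, future), result in zip(items, results):
            future.set_result(result)


async def predict(queue, features):
    """What an endpoint would call: enqueue, then await the batched result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((features, future))
    return await future


async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(
        predict(queue, [1.0, 0.0, 0.0]),
        predict(queue, [0.0, 1.0, 0.0]),
        predict(queue, [0.0, 0.0, 1.0]),
    )
    worker.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is latency for throughput: each request may wait up to `MAX_WAIT_S` so that the model is called once per batch instead of once per request.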
Common gotcha
The biggest mistake: developers assume FastAPI serves models automatically. It doesn't. The model lives in memory (model_state) and you have to ensure it's loaded before requests arrive, handle concurrent requests safely (FastAPI is async but your model might not be thread-safe), and manually implement anything beyond basic inference. With BentoML, you'd decorate a function and get versioning, containerization, and a bento.yaml for free.
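One way to handle the concurrency concern above is to serialize access to the model with a lock and run the blocking call in a worker thread, so the event loop keeps accepting requests. The sketch below uses only the standard library; `ToyModel`, `guarded_predict`, and `predict_endpoint` are hypothetical names, and whether a lock is needed at all depends on the real model library.

```python
import asyncio
import threading


class ToyModel:
    """Stand-in for a model whose predict() is not safe to call concurrently."""

    def __init__(self):
        self.weights = [0.5, 0.3, 0.2]
        self.calls = 0  # mutable state that unsynchronized calls could corrupt

    def predict(self, features):
        self.calls += 1
        return sum(x * w for x, w in zip(features, self.weights))


model = ToyModel()
model_lock = threading.Lock()  # only one caller inside predict() at a time


def guarded_predict(features):
    """Blocking call, serialized by the lock."""
    with model_lock:
        return model.predict(features)


async def predict_endpoint(features):
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(guarded_predict, features)


async def demo():
    requests = [[1.0, 0.0, 0.0]] * 20
    results = await asyncio.gather(*(predict_endpoint(f) for f in requests))
    return results, model.calls
```

Calling `asyncio.run(demo())` sends 20 concurrent requests through the guarded path. The lock trades parallelism for safety; if the model is known to be thread-safe, it can be dropped and only the `asyncio.to_thread` call kept.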
Error recovery
RuntimeError: 'model_state' is None when /predict is called
TypeError: 'numpy.ndarray' is not JSON serializable
asyncio.InvalidStateError when testing
Experienced dev note
Here's the real tension: FastAPI forces you to own the model lifecycle (loading, caching, versioning, canary deployments). This sounds like extra work, but it's actually an advantage because you'll naturally write better code: you'll see exactly where the model loads, where it's cached, and what happens on failure. BentoML hides this in a framework, which feels faster until you need to debug why model updates don't roll out correctly. For your first deployment, use FastAPI. Once you have 5+ models and DevOps complexity, BentoML saves time. TorchServe is powerful, but if you're not PyTorch-only, it's overkill.
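As a rough illustration of what "owning the lifecycle" looks like, the sketch below keeps loading, caching, and failure handling in one visible place. `ModelRegistry`, `promote`, and `toy_loader` are hypothetical names invented for this example, and a real loader would read model files from disk or object storage rather than a dictionary.

```python
class ModelRegistry:
    """Hypothetical in-process registry: explicit load, cache, and fallback."""

    def __init__(self, loader):
        self._loader = loader  # callable: version -> model object
        self._cache = {}       # version -> loaded model
        self.active = None     # version currently serving traffic

    def load(self, version):
        """Load a version once and keep it cached."""
        if version not in self._cache:
            self._cache[version] = self._loader(version)
        return self._cache[version]

    def promote(self, version):
        """Switch traffic to `version`; stay on the old one if loading fails."""
        try:
            self.load(version)
        except Exception as exc:
            print(f"Load of {version} failed ({exc!r}); staying on {self.active}")
            return self.active
        self.active = version
        return self.active

    def predict(self, features):
        if self.active is None:
            raise RuntimeError("No model has been promoted yet")
        weights = self._cache[self.active]
        return sum(x * w for x, w in zip(features, weights))


def toy_loader(version):
    """Stand-in loader; raises KeyError for versions that do not exist."""
    store = {"v1": [0.5, 0.3, 0.2], "v2": [1.0, 0.0, 0.0]}
    return store[version]


if __name__ == "__main__":
    registry = ModelRegistry(toy_loader)
    registry.promote("v1")
    registry.promote("v9")  # fails to load, so v1 keeps serving
    print(registry.active, registry.predict([1.0, 0.0, 0.0]))
```

Because the failure path is ordinary code, a failed update is easy to trace: the registry logs the error and the previous version keeps serving.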
Check your understanding
You've written a FastAPI endpoint that serves a scikit-learn model. Your colleague wants to serve two different versions of the model (v1 and v2) and route 90% of traffic to v1 and 10% to v2. Which framework (FastAPI, BentoML, or TorchServe) would require you to write the least code to implement this, and why?
Answer hint
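The expected answer is BentoML, with the reason why: its model store keeps versioned copies of the model, so one service can load v1 and v2 side by side and only the routing decision is left to write. With FastAPI you have the tools, but you code both the versioning and the routing yourself. TorchServe is aimed at PyTorch models and has no easy traffic splitting, so it is a poor fit for a scikit-learn model. In practice the 90/10 split itself is often handled by the deployment layer (a load balancer or service mesh) rather than by the serving framework. The key insight is that ML-specific platforms exist because DevOps patterns repeat across ML teams.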
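For comparison, the hand-written routing that the FastAPI option implies can be sketched in a few lines. This is an illustration under assumptions: `MODELS`, `TRAFFIC`, and `pick_version` are hypothetical names, and the lambdas stand in for two loaded scikit-learn models.

```python
import random
from collections import Counter

# Stand-ins for two loaded model versions (hypothetical).
MODELS = {
    "v1": lambda features: sum(features),
    "v2": lambda features: 2 * sum(features),
}
TRAFFIC = {"v1": 0.9, "v2": 0.1}  # desired share of requests per version


def pick_version(rng=random):
    """Choose a model version at random, weighted by TRAFFIC."""
    versions = list(TRAFFIC)
    weights = list(TRAFFIC.values())
    return rng.choices(versions, weights=weights, k=1)[0]


def predict(features, rng=random):
    """Route one request and report which version answered it."""
    version = pick_version(rng)
    return {"prediction": MODELS[version](features), "model_version": version}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same split
    counts = Counter(pick_version(rng) for _ in range(10_000))
    print(counts)
```

The routing is short; the work a platform saves is everything around it, such as storing both versions, reporting metrics per version, and rolling back.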