Async endpoints for concurrent inference
Why this matters
ML inference endpoints often spend much of their time waiting on I/O: a remote model server, a database, or other external calls. Async endpoints let a single FastAPI worker keep dozens of such requests in flight instead of handling one at a time, which is critical for production APIs serving real traffic.
Explanation
What it is: An async endpoint is a FastAPI route defined with async def instead of regular def. It allows the server to pause a request while it waits for I/O (a call to a remote model server, database queries, external API calls) and handle other requests in the meantime.
How it works mechanically: When a request arrives, FastAPI calls your async function, which produces a coroutine that the event loop runs. While the function awaits a slow operation (like await model.predict(), assuming predict is itself an async function), the event loop is free to handle the next incoming request. Once the first request's I/O finishes, execution resumes right after the await. This multiplexing means one worker thread can juggle many in-flight requests.
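The multiplexing described above can be seen without FastAPI at all. The following is a minimal sketch using only the standard library; fake_io is a hypothetical helper, and asyncio.sleep stands in for a real non-blocking wait such as a network call.

```python
import asyncio
import time


async def fake_io(name: str, delay: float) -> str:
    # asyncio.sleep stands in for a non-blocking wait (network call, DB query).
    await asyncio.sleep(delay)
    return name


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Three coroutines are in flight at once on a single event loop.
    results = await asyncio.gather(
        fake_io("a", 0.3), fake_io("b", 0.3), fake_io("c", 0.3)
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    # The waits overlap, so elapsed is close to 0.3s rather than 0.9s.
    print(results, f"{elapsed:.2f}s")
```

Because each coroutine yields control at its await, the three waits overlap and the total time is roughly that of the longest one.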
When to use it: Use async whenever your endpoint calls external services, queries a database, or performs any I/O-bound operation. For pure CPU-intensive model inference without I/O, async adds negligible benefit.
Analogy
Think of a restaurant cashier. A synchronous endpoint is like serving one customer completely before starting the next, even while that customer is just waiting for their card to be processed (I/O). An async endpoint is like taking customer 1's order, handing it off to the kitchen, and immediately serving customer 2 while customer 1's food cooks. One person handles more customers because they are never idle while waiting.
Code
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()


class PredictRequest(BaseModel):
    text: str


class PredictResponse(BaseModel):
    label: str
    confidence: float


async def simulate_model_inference(text: str):
    # Simulate a slow, non-blocking inference call (e.g. a remote model server).
    await asyncio.sleep(2)
    return {"label": "positive" if len(text) > 10 else "negative", "confidence": 0.95}


@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    # The event loop can serve other requests while this await is pending.
    result = await simulate_model_inference(request.text)
    return PredictResponse(**result)


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000, workers=1)

The script runs without error: the server starts and listens on http://127.0.0.1:8000. Send POST requests to /predict with {"text": "your input"} to receive predictions.
What just happened?
The code defined an async endpoint that accepts text, awaits a 2-second simulated model inference, and returns a prediction object. When you send multiple concurrent requests to this endpoint, FastAPI handles them concurrently: each one pauses at the await point, letting other requests make progress. A single uvicorn worker overlaps their waits instead of processing them one after another.
Common gotcha
The biggest mistake is forgetting that async def alone doesn't make your code concurrent: control only returns to the event loop at an await. If you call a synchronous, blocking function (like a regular model.predict()) directly inside an async endpoint, it blocks the entire worker until it returns, and you lose all concurrency benefits. Either await a genuinely non-blocking operation, or use asyncio.to_thread() to move the blocking call off the event loop.
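To make the gotcha concrete, here is a small sketch comparing a blocking call made directly inside a coroutine with the same call offloaded via asyncio.to_thread() (available in Python 3.9+). The names blocking_predict, handler_blocking, handler_offloaded, and timed are hypothetical, and time.sleep is only a stand-in for a synchronous model.predict().

```python
import asyncio
import time


def blocking_predict(text: str) -> str:
    # Hypothetical stand-in for a synchronous model.predict() call.
    time.sleep(0.2)
    return text.upper()


async def handler_blocking(text: str) -> str:
    # Calls the sync function directly: the event loop is stuck until it returns.
    return blocking_predict(text)


async def handler_offloaded(text: str) -> str:
    # Runs the sync function in a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_predict, text)


async def timed(handler, n: int = 4) -> float:
    # Launch n "requests" at once and measure the total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"req{i}") for i in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"blocking:  {asyncio.run(timed(handler_blocking)):.2f}s")
    print(f"offloaded: {asyncio.run(timed(handler_offloaded)):.2f}s")
```

The blocking version takes roughly four times as long because the calls run back to back. Note that time.sleep releases the GIL, so the threads here overlap fully; pure-Python CPU work in threads would keep the event loop responsive but would not run in parallel.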
Error recovery
RuntimeError: no running event loop
TypeError: coroutine object is not awaitable
Experienced dev note
Async in FastAPI is a multiplexing trick, not parallelism: within a worker, all async endpoints run on a single event-loop thread. True parallelism requires multiple workers (via the --workers flag in uvicorn). A single async worker is great for I/O-bound APIs, but if you have 100 concurrent requests to a CPU-bound model, one worker will still process them one at a time. Use uvicorn app:app --workers 4 to spawn 4 OS-level worker processes, each with its own event loop. Each worker can multiplex I/O independently.
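The single-thread claim is easy to check with plain asyncio. In this sketch (which_thread is a hypothetical helper), five coroutines run concurrently and each reports the thread it resumed on.

```python
import asyncio
import threading


async def which_thread(delay: float) -> int:
    await asyncio.sleep(delay)
    # Report the identifier of the thread this coroutine resumed on.
    return threading.get_ident()


async def main() -> set[int]:
    # Five coroutines overlap their waits on the same event loop.
    ids = await asyncio.gather(*(which_thread(0.05) for _ in range(5)))
    return set(ids)


if __name__ == "__main__":
    # All coroutines share the event loop's thread, so there is one distinct id.
    print(len(asyncio.run(main())))  # prints 1
```

Concurrency here comes from interleaving at await points, not from extra threads, which is why CPU-bound work needs additional worker processes.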
Check your understanding
If your ML model inference is a synchronous call that takes 1 second of pure CPU time (no I/O) and you invoke it directly inside an async endpoint, how many seconds in total will a single worker need to process 10 concurrent requests? Explain why.
Answer hint
The answer involves recognizing that async doesn't parallelize CPU work: it only pauses for I/O. A synchronous 1-second CPU task blocks the entire event loop regardless of async/await. The correct answer is ~10 seconds (serial execution), not 1 second. The key insight is: async def + CPU-bound work = no concurrency benefit.