Code Beginner easy · 5 min

Async endpoints for concurrent inference

What you will learn

Use async functions in FastAPI to handle multiple inference requests simultaneously without blocking.

Why this matters

ML inference endpoints often spend most of their time waiting on I/O: a remote model server, a database, or an external API. Async endpoints let a single FastAPI worker keep dozens of such requests in flight instead of handling them one at a time, which is critical for production APIs serving real traffic.

Skip if: Your endpoint does only CPU-heavy synchronous work with no I/O or database calls. async/await adds nothing there, and a blocking call inside async def stalls every other request on that worker. Use a regular def endpoint instead: FastAPI runs it in a thread pool, so it does not block the event loop.

Explanation

What it is: An async endpoint is a FastAPI route defined with async def instead of plain def. It allows the server to pause execution while waiting for I/O (a call to a remote model server, database queries, external API calls) and handle other requests in the meantime.

How it works mechanically: When a request arrives, FastAPI calls your async function, which produces a coroutine that runs on the event loop. While the coroutine awaits a slow operation (like await model.predict(), assuming predict is itself an async call), the event loop isn't blocked: it picks up the next incoming request. Once the first request's I/O finishes, execution resumes where it left off. This multiplexing means one worker thread can juggle many in-flight requests.
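
The multiplexing described above can be seen without FastAPI at all. The following is a minimal sketch using only asyncio; fake_inference and handle_many are hypothetical names, and asyncio.sleep stands in for whatever awaitable I/O a real endpoint would perform.

```python
import asyncio
import time


async def fake_inference(request_id: int, delay: float = 0.2) -> int:
    # asyncio.sleep stands in for awaitable I/O (e.g. a call to a remote model server).
    await asyncio.sleep(delay)
    return request_id


async def handle_many(n: int, delay: float = 0.2) -> tuple[list[int], float]:
    start = time.perf_counter()
    # gather schedules every coroutine on the same event loop; each one
    # yields control at its await, so the waits overlap.
    results = await asyncio.gather(*(fake_inference(i, delay) for i in range(n)))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(handle_many(5))
    # elapsed should be close to one delay, not five of them
    print(results, f"{elapsed:.2f}s")
```

All five "requests" run on one thread, yet the total time stays near a single delay because every coroutine is waiting at the same time.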

When to use it: Use async whenever your endpoint calls external services, queries a database, or performs any I/O-bound operation. For pure CPU-intensive model inference without I/O, async adds negligible benefit.

Analogy

Think of a restaurant cashier. A synchronous endpoint is like processing one customer completely, then the next: even if that customer is just waiting for their card to be swiped (I/O). An async endpoint is like processing customer 1's order, then handing them off to the kitchen, and immediately serving customer 2 while customer 1's food cooks. One person handles more customers because they're not idle waiting.

Code

python

import asyncio
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

async def simulate_model_inference(text: str) -> dict:
    # Stand-in for awaitable I/O, such as a request to a remote model server.
    await asyncio.sleep(2)
    return {"label": "positive" if len(text) > 10 else "negative", "confidence": 0.95}

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    result = await simulate_model_inference(request.text)
    return PredictResponse(**result)

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000, workers=1)

Output

The script prints no predictions on its own. Uvicorn logs its startup messages, then the server listens on http://127.0.0.1:8000. Send POST requests to /predict with {"text": "your input"} to receive predictions.

What just happened?

The code defined an async endpoint that accepts text, awaits a 2-second simulated model inference, and returns a prediction object. When you send multiple requests at once, FastAPI handles them concurrently (not in parallel): each one pauses at the await point, letting other requests run. A single uvicorn worker keeps all of them in flight instead of processing them sequentially.
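
One rough way to check this from the client side is to fire several requests at once and time them. The sketch below uses only the standard library and assumes the server from the Code section is running on 127.0.0.1:8000; post_predict and time_concurrent are hypothetical helper names, and the client uses threads only so the requests overlap.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

URL = "http://127.0.0.1:8000/predict"  # host and port from the example server


def post_predict(text: str) -> dict:
    # Send one JSON POST request and decode the JSON response.
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def time_concurrent(call: Callable[[str], dict], texts: list[str]) -> tuple[list[dict], float]:
    # Issue all calls at once from separate client threads and time the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max(1, len(texts))) as pool:
        results = list(pool.map(call, texts))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    texts = ["a fairly long sentence", "short", "another long input"]
    results, elapsed = time_concurrent(post_predict, texts)
    print(results)
    print(f"{elapsed:.1f}s")
```

With the 2-second simulated delay, three overlapping requests should finish in roughly 2 seconds rather than 6 if the endpoint is handling them concurrently.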

Common gotcha

The biggest mistake is assuming that async def alone makes your code concurrent: the event loop can only switch to another request at an await. If you call a synchronous, blocking function (like a regular model.predict()) directly inside an async endpoint, nothing is awaited while it runs, so you block the entire worker and lose all concurrency benefits. Either await a genuinely asynchronous operation, or use asyncio.to_thread() (Python 3.9+) to move the blocking call off the event loop.
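
The difference is easy to measure. In this minimal sketch, blocking_predict is a hypothetical stand-in for a synchronous model.predict(), with time.sleep simulating the blocking work; the two handlers differ only in how they call it.

```python
import asyncio
import time


def blocking_predict(text: str, delay: float = 0.2) -> str:
    # Stand-in for a synchronous model.predict(); time.sleep blocks the calling thread.
    time.sleep(delay)
    return "positive" if len(text) > 10 else "negative"


async def handler_blocking(text: str) -> str:
    # Direct call: the event loop is held until blocking_predict returns.
    return blocking_predict(text)


async def handler_offloaded(text: str) -> str:
    # asyncio.to_thread runs the call in a worker thread and awaits its result,
    # so the event loop stays free for other requests.
    return await asyncio.to_thread(blocking_predict, text)


async def timed(handler, n: int = 3) -> float:
    # Run n handler calls "concurrently" and return the elapsed wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(handler("some input text") for _ in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"blocking:  {asyncio.run(timed(handler_blocking)):.2f}s")   # about n * delay
    print(f"offloaded: {asyncio.run(timed(handler_offloaded)):.2f}s")  # about one delay
```

Note that time.sleep releases the GIL, so this models blocking I/O or native code that releases the GIL. Pure-Python CPU work offloaded to threads still competes for the GIL, so it keeps the event loop responsive but does not finish faster.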

Error recovery

RuntimeError: no running event loop

Code that needs a running event loop (for example asyncio.create_task() or asyncio.get_running_loop()) was called from plain synchronous code, such as the Python shell. Run the coroutine with asyncio.run(predict(...)) or test via HTTP requests to the running server instead.

RuntimeWarning: coroutine 'simulate_model_inference' was never awaited

You called an async function but forgot to await it, so result is a coroutine object rather than a dict, and PredictResponse(**result) then fails with a TypeError. Change result = simulate_model_inference(...) to result = await simulate_model_inference(...).
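
A small sketch of what the missing await actually produces, reusing the simulate_model_inference name from the example with a shortened delay; without_await and with_await are hypothetical helpers for the demo.

```python
import asyncio
import inspect


async def simulate_model_inference(text: str) -> dict:
    await asyncio.sleep(0.01)  # shortened delay for the demo
    return {"label": "positive" if len(text) > 10 else "negative", "confidence": 0.95}


async def without_await(text: str) -> bool:
    result = simulate_model_inference(text)  # missing await: a coroutine object, not a dict
    is_coroutine = inspect.iscoroutine(result)
    result.close()  # close it explicitly so Python does not warn it was never awaited
    return is_coroutine


async def with_await(text: str) -> dict:
    # With await, the coroutine runs to completion and yields the dict.
    return await simulate_model_inference(text)


if __name__ == "__main__":
    print(asyncio.run(without_await("hello world!")))  # True
    print(asyncio.run(with_await("hello world!")))
```

The same asyncio.run() call is also how you would exercise an async function from a script or shell, where no event loop is running yet.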

Experienced dev note

Async in FastAPI is a multiplexing trick, not parallelism: within a worker, all coroutines run on one event-loop thread. True parallelism requires multiple workers (via the --workers flag in uvicorn). A single async worker is great for I/O-bound APIs, but if you have 100 concurrent requests to a CPU-bound model, one worker will still queue them. Use uvicorn app:app --workers 4 (assuming the code lives in app.py) to spawn 4 OS-level worker processes, each with its own event loop. Each worker can multiplex I/O independently.

Check your understanding

If you have a synchronous ML model inference that takes 1 second of pure CPU time (no I/O), and you call it directly inside an async def endpoint, how many seconds in total will a single worker need to process 10 concurrent requests? Explain why.

Show answer hint

The answer involves recognizing that async doesn't parallelize CPU work: it only pauses for I/O. A synchronous 1-second CPU task blocks the entire event loop regardless of async/await. The correct answer is ~10 seconds (serial execution), not 1 second. The key insight is: async def + CPU-bound work = no concurrency benefit.

VERSION FastAPI has supported native async def endpoints since its early releases, so the example needs no special wrappers; 0.100.0 is notable mainly for adding Pydantic v2 support. asyncio.to_thread() requires Python 3.9 or later.

Next, learn how to add request validation using Pydantic models to enforce type safety and prevent bad data from crashing your inference endpoint.
