gunicorn + uvicorn workers
Why this matters
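By default, Uvicorn runs a single process with one event loop. It can interleave many requests that are waiting on I/O, but blocking or CPU-bound work such as model inference occupies the process and stalls every other request. In production, your ML inference API will receive concurrent requests, so you need multiple worker processes to prevent bottlenecks and timeouts.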
Explanation
Gunicorn is a production-grade application server and process manager: it runs multiple worker processes, each hosting an instance of your FastAPI app. Uvicorn is a lightweight ASGI server, and Gunicorn can use Uvicorn's worker class so that each worker speaks ASGI.

Mechanically: Gunicorn's master process binds the listening socket and forks N worker processes on startup. Each worker runs the complete FastAPI application independently. The master does not handle requests itself; it supervises the workers and restarts any that crash or hang. The workers all accept connections from the shared socket, and the operating system hands each new connection to one of them, so load spreads across workers without an explicit balancing step. When a worker finishes a request, it is free to accept the next one.

When to use it: Always in production. Use 2–4 workers per CPU core as a starting range (e.g., 4 to 8 workers on a 2-core machine). For ML APIs, measure which worker count gives the best throughput without exhausting memory.
Analogy
Think of Gunicorn as a restaurant manager and each Uvicorn worker as a server. The manager (Gunicorn) doesn't take orders themselves: they put the servers on the floor, keep an eye on them, and replace any who stop responding, while whichever server is free picks up the next customer at the door. If you only had one server, you'd have a line out the door. Multiple servers let you handle many customers in parallel.
Code
import os
import time

import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.post("/predict")
async def predict(data: dict):
    """Simulates a 2-second ML inference."""
    worker_pid = os.getpid()
    # time.sleep blocks this worker's event loop, as CPU-bound inference would
    time.sleep(2)
    return {
        "prediction": 0.87,
        "processed_by_worker_pid": worker_pid,
        "input": data,
    }


@app.get("/health")
async def health():
    return {"status": "ok"}


if __name__ == "__main__":
    # "app:app" assumes this file is saved as app.py
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        workers=4,
    )

INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Started server process [12345]
INFO: Started server process [12346]
INFO: Started server process [12347]
INFO: Started server process [12348]
INFO: Waiting for application startup.
What just happened?
Uvicorn started 4 worker processes (PIDs 12345–12348), each running the complete FastAPI app. When you send concurrent requests to /predict, they are spread across the workers, which all accept connections from the same listening socket. Gunicorn is not involved in this example: Uvicorn's own workers option does the process management. The response includes the worker's PID, so you can see which worker handled each request. Each worker runs independently in its own process: that's why you see 4 startup messages.
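To see the PIDs vary in practice, you can fire several requests at once and count which worker answered each. The following is a minimal sketch using only the standard library; the helper names (post_predict, tally_workers, fire_concurrent) and the request_id payload are illustrative assumptions, not part of the example app.

```python
import json
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def post_predict(url: str, payload: dict) -> dict:
    """Send one POST request with a JSON body and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


def tally_workers(responses: list) -> Counter:
    """Count how many responses each worker PID produced."""
    return Counter(r["processed_by_worker_pid"] for r in responses)


def fire_concurrent(url: str, n: int) -> Counter:
    """Send n requests at the same time from n threads and tally the worker PIDs."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        responses = list(
            pool.map(lambda i: post_predict(url, {"request_id": i}), range(n))
        )
    return tally_workers(responses)
```

With the server from the example running locally, calling fire_concurrent("http://localhost:8000/predict", 8) should return a Counter keyed by worker PID. The split between workers is decided by the operating system, so it is often uneven and can change from run to run.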
Common gotcha
Developers often pick a worker count (say, workers=64 on a 2-core machine) without relating it to CPU count or available memory. Too many workers waste memory and cause context-switching overhead. Too few workers under-utilize your machine. Start with (CPU cores × 2) + 1 as a baseline, then monitor. Also: if your ML model loads a large file in the module scope (outside a function), it gets loaded once per worker, so multiply the model's memory footprint by the worker count.
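The baseline formula and the per-worker memory cost can be combined into a rough starting estimate. This is a sketch under stated assumptions: the function names are hypothetical, and the 0.5 GB per-worker overhead is a placeholder to replace with a measured value.

```python
def baseline_workers(cpu_cores: int) -> int:
    """Baseline from the text: (CPU cores x 2) + 1."""
    return cpu_cores * 2 + 1


def memory_capped_workers(
    cpu_cores: int,
    model_gb: float,
    available_gb: float,
    overhead_gb: float = 0.5,  # assumed per-worker overhead; measure your own
) -> int:
    """Take the CPU baseline, then cap it by how many model copies fit in RAM."""
    by_cpu = baseline_workers(cpu_cores)
    # Each worker holds its own copy of the model plus interpreter overhead.
    by_memory = int(available_gb // (model_gb + overhead_gb))
    # Never go below one worker, even if the estimate says nothing fits.
    return max(1, min(by_cpu, by_memory))


# 4 cores suggest 9 workers, but a 2 GB model in 8 GB of RAM allows only 3.
print(memory_capped_workers(cpu_cores=4, model_gb=2.0, available_gb=8.0))
```

Treat the result as a first guess to validate under load, not a final setting.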
Error recovery
OSError: [Errno 48] Address already in use
RuntimeError: cannot release un-acquired lock
MemoryError after adding workers
Experienced dev note
Gunicorn + Uvicorn is the standard for FastAPI in production, but don't blindly copy configs from tutorials. Profile your actual API: measure inference time, request rate, and memory per model. A 10-second model inference doesn't need 16 workers: you'll be memory-bound, not CPU-bound. Use monitoring tools (New Relic, Datadog, Prometheus) to track worker utilization. Also: Uvicorn's built-in `workers` parameter is convenient, but production deployments often use Gunicorn as the parent process with `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app` for better resource control and graceful reloading.
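The same command-line flags can live in a Gunicorn config file, which is itself plain Python. The sketch below is one possible starting point, not a recommended production config: the timeout and recycling values are illustrative assumptions to tune against your own measurements.

```python
# gunicorn.conf.py
import multiprocessing

# Address and port the master process binds to.
bind = "0.0.0.0:8000"

# Baseline of (CPU cores x 2) + 1; lower it if each worker loads a large model.
workers = multiprocessing.cpu_count() * 2 + 1

# Run each worker as a Uvicorn ASGI worker (same as -k on the command line).
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds a worker may stay silent before the master kills and restarts it.
# Illustrative value; it should comfortably exceed your slowest inference.
timeout = 60

# Seconds a worker gets to finish in-flight requests on restart or shutdown.
graceful_timeout = 30

# Recycle each worker after this many requests to contain slow memory growth.
# The jitter staggers restarts so workers do not all recycle at once.
max_requests = 1000
max_requests_jitter = 100
```

Start the server with `gunicorn -c gunicorn.conf.py app:app`, where app:app again assumes the application object is named app in app.py.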
Check your understanding
You have a 4-core machine and your ML model takes 3 seconds to run inference. You expect 20 requests per second. Would you set workers=8, workers=12, or workers=4? Explain your reasoning in terms of concurrency, not just CPU cores.
Answer hint
A correct answer considers that with 3-second inference, each worker is busy for 3 seconds per request. At 20 requests/sec × 3 sec/request, about 60 requests are in flight at any moment. If each worker handles one inference at a time, even 12 workers complete at most 12 ÷ 3 = 4 requests/sec, so requests queue under all three options. This requires thinking about request queuing time, not just CPU utilization.
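The arithmetic in the hint can be checked with a few lines. This sketch assumes each worker processes exactly one inference at a time, as the blocking example endpoint does; the function names are hypothetical.

```python
import math


def workers_needed(requests_per_sec: float, seconds_per_request: float) -> int:
    """Requests in flight at steady state: arrival rate times service time."""
    return math.ceil(requests_per_sec * seconds_per_request)


def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Requests per second that a pool of one-at-a-time workers can complete."""
    return workers / seconds_per_request


print(workers_needed(20, 3))  # 60

# Compare each option from the question against the 20 requests/sec target.
for option in (4, 8, 12):
    print(option, "workers ->", round(max_throughput(option, 3), 2), "requests/sec")
```

Whether 60 single-request workers are practical on a 4-core machine is a separate question of memory and CPU, which is why the exercise asks for reasoning rather than a single number.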