Code Beginner easy · 5 min

gunicorn + uvicorn workers

What you will learn

Run multiple FastAPI worker processes behind Gunicorn to handle concurrent requests in production.

Why this matters

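By default, Uvicorn runs a single process with one event loop on one CPU core. That process can interleave many requests that are waiting on I/O, but a blocking or CPU-bound call such as model inference stalls every other request in the same process. In production, your ML inference API will receive multiple concurrent requests: you need worker processes to prevent bottlenecks and timeouts.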

Skip if: You don't need this for local development, testing, or when you're deliberately limiting concurrency. Don't use this if your inference model requires strict single-process state (rare, but possible with certain GPU libraries that don't support multiprocessing).

Explanation

Gunicorn is a production-grade process manager that runs multiple worker processes, each running an instance of your FastAPI app. Uvicorn is a lightweight ASGI server that also provides a Gunicorn-compatible worker class, so Gunicorn can supervise the processes while Uvicorn serves the app inside each one.

Mechanically: Gunicorn's master process opens the listening socket and forks N worker processes on startup. Each worker runs the complete FastAPI application independently. The master does not handle requests itself: all workers accept connections on the shared socket, and the operating system decides which worker gets each one. The master's job is supervision: it restarts workers that crash or time out and handles signals for reloads and shutdown. When a worker finishes a request, it becomes available for the next one.

When to use it: In production. Use (2 × CPU cores) + 1 workers as a starting point (e.g., 5 workers on a 2-core machine). For ML APIs, measure which number of workers gives the best throughput without exhausting memory.

Analogy

Think of Gunicorn as a restaurant manager and each Uvicorn worker as a server. The manager (Gunicorn) doesn't take orders themselves: they hire the servers, keep an eye on them, and replace any who drop out, while each arriving customer goes to whichever server is free. If you only had one server, you'd have a line out the door. Multiple servers let you handle many customers in parallel.

Code

python

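import os
import time

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.post("/predict")
async def predict(data: dict):
    """Simulates a 2-second ML inference."""
    worker_pid = os.getpid()
    # time.sleep() deliberately blocks this worker's event loop, the way
    # CPU-bound inference would, so each worker serves one request at a time.
    time.sleep(2)
    return {
        "prediction": 0.87,
        "processed_by_worker_pid": worker_pid,
        "input": data
    }

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    # "app:app" means "the `app` object in app.py": save this file as app.py.
    # workers > 1 requires the import-string form, not the app object itself.
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        workers=4
    )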

Output

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started server process [12345]
INFO:     Started server process [12346]
INFO:     Started server process [12347]
INFO:     Started server process [12348]
INFO:     Waiting for application startup.

What just happened?

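Uvicorn started 4 worker processes (PIDs 12345–12348), each running the complete FastAPI app. Note that this example uses Uvicorn's built-in `workers` option, so Gunicorn is not involved yet. When you send concurrent requests to /predict, the workers all listen on the same socket and the operating system hands each incoming connection to one of them. The response includes the worker's PID, so you can see which worker handled each request. Because time.sleep(2) blocks a worker's event loop, each worker handles one request at a time here, so 4 workers serve up to 4 requests in parallel. Each worker runs independently in its own process: that's why you see 4 startup messages.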
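To see the distribution across workers for yourself, a small client can fire several requests at once and tally the PIDs that come back. This is a minimal sketch using only the standard library; the URL, the helper names (`call_predict`, `count_by_worker`, `main`), and the request count are assumptions chosen for illustration, and it expects the server from the Code section to be running locally.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Assumed address: matches the host/port used in the lesson's example server.
PREDICT_URL = "http://127.0.0.1:8000/predict"


def call_predict(payload: dict) -> dict:
    """POST one JSON payload to /predict and return the decoded response."""
    req = request.Request(
        PREDICT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def count_by_worker(responses: list[dict]) -> Counter:
    """Count how many responses each worker PID produced."""
    return Counter(r["processed_by_worker_pid"] for r in responses)


def main(n_requests: int = 8) -> None:
    """Send n_requests concurrently and print a per-worker tally."""
    payloads = [{"request_id": i} for i in range(n_requests)]
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        responses = list(pool.map(call_predict, payloads))
    for pid, handled in count_by_worker(responses).items():
        print(f"worker {pid} handled {handled} request(s)")
```

Call `main()` while the server is up. With 4 workers and 8 requests you would expect to see more than one PID in the tally, though the split is decided by the operating system and is not guaranteed to be even.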

Common gotcha

Developers often set workers=4 on a 1-core machine or workers=64 on a 2-core machine without understanding the relationship to CPU count. Too many workers waste memory and cause context-switching overhead. Too few workers under-utilize your machine. Start with (CPU cores × 2) + 1 as a baseline, then monitor. Also: if your ML model loads a large file in the module scope (outside a function), it gets loaded once per worker: multiply memory by the worker count.
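The two constraints above (CPU baseline and per-worker model memory) can be combined into one rough starting number. The helper below is a sketch, not a tuning rule: `suggested_workers` is a hypothetical name, and it assumes one full model copy per worker while ignoring each process's own overhead.

```python
def suggested_workers(cpu_cores: int, model_gb: float, available_gb: float) -> int:
    """Return a starting worker count: the (cores * 2) + 1 baseline,
    capped by how many model copies fit in the available memory.

    Assumes model_gb > 0 and one model copy per worker.
    """
    cpu_baseline = cpu_cores * 2 + 1
    memory_cap = int(available_gb // model_gb)
    # Never suggest fewer than one worker, even if memory looks too tight.
    return max(1, min(cpu_baseline, memory_cap))


print(suggested_workers(cpu_cores=2, model_gb=2.0, available_gb=8.0))   # 4 (memory-limited)
print(suggested_workers(cpu_cores=1, model_gb=0.5, available_gb=16.0))  # 3 (CPU baseline)
```

Treat the result as the first value to try under load, then adjust based on measured throughput and memory.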

Error recovery

OSError: [Errno 48] Address already in use

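Another process is listening on port 8000. (The error number is platform-specific: 48 on macOS, 98 on Linux.) Stop the listener with `lsof -ti tcp:8000 -sTCP:LISTEN | xargs kill`, adding `-9` to `kill` only if the process ignores the normal signal, or change the port in uvicorn.run().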
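If you would rather check from Python before starting the server, a connection attempt tells you whether anything is already accepting on the port. This is a small sketch; `port_in_use` is a hypothetical helper name and the default host is an assumption.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` returning True before you launch means another process already owns the port.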

RuntimeError: cannot release un-acquired lock

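Your code is releasing a lock that it does not currently hold (this is the message Python's threading.RLock raises in that case). With multiple workers it often appears when code assumes in-memory state, including locks, is shared between workers. It is not: each worker process has its own copy. Pair every acquire with exactly one release by using a `with` block, and use a process-safe store such as Redis for state that must be shared across workers instead of in-memory locks.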
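The failure is easy to reproduce in isolation, which helps when reading a traceback. The sketch below uses hypothetical helper names and a module-level `threading.RLock`; keep in mind this lock only protects threads inside one worker process and does nothing across workers.

```python
import threading

# One lock per process: every worker gets its own independent copy.
lock = threading.RLock()


def release_without_acquire() -> str:
    """Release a lock that is not held and return the resulting error text."""
    try:
        lock.release()
    except RuntimeError as exc:
        return str(exc)
    return "no error"


def safe_increment(counter: dict) -> None:
    """The with-statement pairs the acquire with exactly one release."""
    with lock:
        counter["value"] += 1
```

Calling `release_without_acquire()` returns the error text instead of crashing, while `safe_increment` shows the pattern that avoids the mismatch in the first place.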

MemoryError after adding workers

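Each worker is a full Python process: your model is loaded into memory once per worker. If your model is 2 GB and you have 4 workers, you need at least 8 GB for the model copies alone, plus each process's own overhead. On Linux the symptom may be workers killed by the kernel's out-of-memory killer rather than a Python MemoryError. Reduce workers or use a model server (e.g., TensorFlow Serving) instead.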

Experienced dev note

Gunicorn + Uvicorn is the standard for FastAPI in production, but don't blindly copy configs from tutorials. Profile your actual API: measure inference time, request rate, and memory per model. A 10-second model inference is not fixed by running 16 workers: each worker holds its own copy of the model, so you will likely run out of memory before you run out of CPU. Use monitoring tools (New Relic, Datadog, Prometheus) to track worker utilization. Also: Uvicorn's built-in `workers` parameter is convenient, but production deployments often use Gunicorn as the parent process with `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app` for better resource control and graceful reloading.
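The same Gunicorn command can be kept in a config file, which is easier to review and version than a long command line. Gunicorn config files are plain Python; the sketch below assumes the file is saved as `gunicorn.conf.py` next to `app.py`, and the port and timeout values are placeholders to adjust, not recommendations.

```python
# gunicorn.conf.py
# A minimal sketch of a Gunicorn config for a FastAPI app served by Uvicorn workers.
import multiprocessing

# Address and port to listen on (assumed to match the lesson's example).
bind = "0.0.0.0:8000"

# Baseline from the lesson: (CPU cores * 2) + 1. Lower it if memory is tight.
workers = multiprocessing.cpu_count() * 2 + 1

# Equivalent to "-k uvicorn.workers.UvicornWorker" on the command line.
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds a worker may stay silent before Gunicorn kills and restarts it.
# Placeholder value: size it to your slowest expected inference.
timeout = 60

# Seconds workers get to finish in-flight requests during a restart.
graceful_timeout = 30
```

With this file in the working directory, `gunicorn app:app` should pick it up without extra flags; `app:app` again assumes the module is named `app.py`.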

Check your understanding

You have a 4-core machine and your ML model takes 3 seconds to run inference. You expect 20 requests per second. Would you set workers=8, workers=12, or workers=4? Explain your reasoning in terms of concurrency, not just CPU cores.

Show answer hint

A correct answer considers that with 3-second inference, each worker is busy for 3 seconds per request. At 20 requests/sec × 3 sec/request, about 60 requests are in flight at any moment. If each worker handles one inference at a time, none of the three options comes close to 60, so requests will queue whichever one you pick. This requires thinking about request queuing time, not just CPU utilization.
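The arithmetic in the hint can be checked directly. This sketch assumes each worker handles exactly one blocking inference at a time; the function names are hypothetical.

```python
def required_concurrency(requests_per_sec: float, seconds_per_request: float) -> float:
    """Average number of requests in flight (arrival rate x time per request)."""
    return requests_per_sec * seconds_per_request


def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Requests per second the pool can sustain with one request per worker."""
    return workers / seconds_per_request


print(required_concurrency(20, 3))  # 60
for n in (4, 8, 12):
    print(n, "workers sustain about", round(max_throughput(n, 3), 2), "requests/sec")
```

Even 12 workers sustain only 4 requests per second under this assumption, far below the 20 arriving, which is why the answer has to address queuing rather than core count alone.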

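VERSION The `workers` parameter belongs to uvicorn.run() and is supported in Uvicorn 0.30.x. FastAPI 0.115.x has no worker setting of its own and runs unchanged under multiple workers. There are no breaking changes in this version range for worker management. Note that the Uvicorn 0.30.0 release notes mark the `uvicorn.workers` module as deprecated in favor of the separate `uvicorn-worker` package, so check the release notes before relying on `uvicorn.workers.UvicornWorker` in a new deployment.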

Once you've scaled with workers, learn how to structure your FastAPI app for testability and modularity with dependency injection: this becomes critical when running multiple worker instances.
