How-to · Beginner · 3 min read

How to use Gunicorn with FastAPI for LLM serving

Quick answer
Run your FastAPI app behind Gunicorn with an ASGI worker class (uvicorn.workers.UvicornWorker). Gunicorn manages a pool of worker processes, so the app can handle many requests concurrently while each endpoint calls the LLM API synchronously or asynchronously.

Prerequisites

  • Python 3.8+
  • OpenAI API key or other LLM API key
  • pip install fastapi uvicorn gunicorn openai

Setup

Install the required packages with pip, then export your LLM API key as an environment variable.

bash
pip install fastapi uvicorn gunicorn openai
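The app reads the key from the environment at startup, so export it before launching Gunicorn. The value below is a placeholder; substitute your real key:

```shell
# Placeholder value -- replace with your actual API key
export OPENAI_API_KEY="sk-..."
```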

Step by step

Create a simple FastAPI app that calls an LLM through the openai SDK, then run it with Gunicorn using the UvicornWorker class for ASGI support.

python
# main.py
import os
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(request: PromptRequest):
    # A plain `def` endpoint runs in FastAPI's threadpool, so this
    # blocking OpenAI client call does not stall the event loop.
    # (With `async def`, a synchronous call here would block every
    # request handled by the same worker.)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": request.prompt}]
    )
    return {"response": response.choices[0].message.content}

# To run with Gunicorn (assumes this file is named main.py):
# gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app

# This command starts 4 worker processes that handle requests concurrently.
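Instead of passing flags on the command line, Gunicorn can load settings from a config file. A hypothetical gunicorn.conf.py sketch, sizing the worker pool from CPU count with the common (2 × cores) + 1 rule of thumb:

```python
# gunicorn.conf.py -- illustrative sketch; tune values for your hardware
import multiprocessing

# Common rule of thumb: (2 x CPU cores) + 1 worker processes
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"
timeout = 120  # LLM calls can be slow; raise Gunicorn's 30 s default
```

Launch with `gunicorn -c gunicorn.conf.py main:app` and Gunicorn picks these up automatically.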

Common variations

  • Use async calls to the LLM API for better concurrency.
  • Adjust the Gunicorn worker count (-w) to your CPU cores; a common rule of thumb is (2 x cores) + 1.
  • Use different models like gpt-4.1 or claude-3-5-haiku-20241022 by adjusting the model parameter.
  • For streaming responses, integrate WebSocket or Server-Sent Events with FastAPI.

Troubleshooting

  • If Gunicorn workers fail to start, ensure uvicorn is installed and the worker class is set to uvicorn.workers.UvicornWorker.
  • For environment variable issues, verify OPENAI_API_KEY is set in the shell before running Gunicorn.
  • If you see slow responses, increase the number of workers or use async calls to the LLM API.
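A quick local smoke test helps rule the issues above in or out. This assumes the app file is main.py and port 8000 is free:

```shell
# Start Gunicorn in the background, then POST a test prompt
gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app --daemon
curl -s -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello"}'
```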

Key Takeaways

  • Use Gunicorn with the UvicornWorker class to serve FastAPI apps in production.
  • Set the number of Gunicorn workers based on your server's CPU cores for optimal concurrency.
  • Call the LLM API asynchronously in FastAPI endpoints to maximize throughput.
  • Always load API keys from environment variables for security and flexibility.
  • Test your deployment locally with Gunicorn before production rollout.
Verified 2026-04 · gpt-4o-mini, gpt-4.1, claude-3-5-haiku-20241022