Intermediate · 4 min read

How to scale a FastAPI LLM app horizontally

Quick answer
To scale a FastAPI LLM app horizontally, deploy multiple stateless FastAPI instances behind a load balancer, typically managed by an orchestrator such as Kubernetes or Docker Swarm. Keep all state outside the instances (e.g., in Redis) and inject API keys via environment variables so that any replica can serve any request to OpenAI or another LLM provider.

PREREQUISITES

  • Python 3.8+
  • FastAPI
  • Uvicorn or Gunicorn
  • Docker
  • OpenAI API key
  • pip install fastapi uvicorn openai

Setup

Install required packages and set environment variables for your FastAPI LLM app.

bash
pip install fastapi uvicorn openai

# Set your OpenAI API key in the environment (replace the placeholder)
export OPENAI_API_KEY="your-api-key"

Step by step

Create a stateless FastAPI app that calls the OpenAI API. Deploy multiple instances behind a load balancer to handle horizontal scaling.

python
import os
from fastapi import FastAPI
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
# Use the async client so a pending LLM call does not block the event loop
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate_text(req: GenerateRequest):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": req.prompt}],
    )
    return {"response": response.choices[0].message.content}

# Run one instance with: uvicorn main:app --host 0.0.0.0 --port 8000
# Or multiple workers per node: gunicorn main:app -k uvicorn.workers.UvicornWorker -w 4

Common variations

Use Docker to containerize your app and deploy multiple containers with orchestration tools like Kubernetes or Docker Swarm. Use Redis or another external cache for shared state if needed.
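The shared-state idea can be sketched with a counter that every replica reads and writes through the same store. The `FakeRedis` class below is an illustrative stand-in (a plain dict); in production you would pass a real `redis.Redis` connection, which exposes the same `incr`/`get` methods.

```python
class FakeRedis:
    """Illustrative stand-in for redis.Redis; swap in a real client in production."""
    def __init__(self):
        self._data = {}

    def incr(self, key):
        # Atomic in real Redis; good enough for a single-process sketch here
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]

    def get(self, key):
        return self._data.get(key)

def record_request(store, user_id):
    """Count requests per user in shared storage so any replica sees the same total."""
    return store.incr(f"requests:{user_id}")

store = FakeRedis()  # production: redis.Redis(host="redis", port=6379)
record_request(store, "alice")
record_request(store, "alice")
print(store.get("requests:alice"))  # → 2
```

Because the count lives outside the FastAPI process, it survives restarts and stays consistent no matter which replica the load balancer picks.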

dockerfile
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

yaml
# Kubernetes Deployment running three replicas of the container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-llm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fastapi-llm
  template:
    metadata:
      labels:
        app: fastapi-llm
    spec:
      containers:
      - name: fastapi-llm
        image: your-docker-image:latest
        ports:
        - containerPort: 8000

Use a LoadBalancer service to distribute traffic across the replicas.
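A minimal Service manifest for that load balancer might look like the following; the name and selector assume the Deployment shown above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fastapi-llm
spec:
  type: LoadBalancer
  selector:
    app: fastapi-llm
  ports:
  - port: 80
    targetPort: 8000
```

Kubernetes routes traffic arriving on port 80 to port 8000 on any healthy replica matching the selector.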

Troubleshooting

  • If you see rate limit errors from the LLM API, implement exponential backoff and retry logic.
  • If requests are slow, check your load balancer and scale replicas up.
  • Ensure environment variables like OPENAI_API_KEY are set correctly on all instances.
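The backoff advice above can be sketched with a small retry helper. `TransientError` is a stand-in for whatever rate-limit exception your client raises (for the OpenAI SDK that would be `openai.RateLimitError`).

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a rate-limit error (e.g. openai.RateLimitError)."""

def call_with_backoff(fn, max_retries=5, base=0.5, cap=30.0):
    """Call fn(), retrying on TransientError with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herd

# Example: a call that fails twice before succeeding
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

print(call_with_backoff(flaky, base=0.01))  # → ok
```

The jitter matters once you have many replicas: without it, all instances that hit a rate limit at the same moment would retry at the same moment too.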

Key Takeaways

  • Deploy multiple stateless FastAPI instances behind a load balancer for horizontal scaling.
  • Use container orchestration tools like Kubernetes or Docker Swarm to manage replicas.
  • Keep API keys and state external to instances to maintain statelessness.
  • Implement retry and backoff to handle LLM API rate limits gracefully.
  • Monitor and scale replicas based on traffic and latency metrics.
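The last takeaway can be automated in Kubernetes with a HorizontalPodAutoscaler. A minimal sketch targeting the Deployment from earlier on CPU utilization (the names and thresholds here are assumptions, not prescriptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-llm
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

For LLM workloads, which are often I/O-bound rather than CPU-bound, a custom metric such as in-flight requests or p95 latency may track load better than CPU.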
Verified 2026-04 · gpt-4o