
How to add rate limiting to FastAPI LLM endpoint

Quick answer
Add rate limiting to your FastAPI LLM endpoint by integrating middleware such as slowapi or fastapi-limiter to control request frequency. This prevents RateLimitError responses from the AI provider and keeps the service stable.
Error type: api_error

Quick fix: add exponential backoff retry logic around your API call so transient RateLimitError responses are handled automatically.
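The retry schedule described above (wait base × 2^attempt between attempts) can be sketched as a small helper; the 30-second cap is an assumption added here, not part of the original fix:

```python
def backoff_delays(retries=3, base=1.0, cap=30.0):
    """Yield the exponential backoff delays used between retry attempts.

    The delay before retry n is base * 2**n, capped at `cap` seconds
    (the cap is an assumed safety limit to keep waits bounded).
    """
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))
```

With the defaults this yields 1.0, 2.0, and 4.0 seconds, matching the `backoff_in_seconds * 2 ** attempt` expression used in the full fix below.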

Why this happens

When your FastAPI endpoint calls an LLM API (e.g., OpenAI or Anthropic) too frequently, the provider enforces rate limits to prevent abuse. If your service does not throttle incoming requests, those limits are exceeded and the API returns RateLimitError, causing request failures.

Example of broken code with no rate limiting:

```python
from fastapi import FastAPI
from openai import OpenAI
import os

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.post("/generate")
async def generate(prompt: str):
    # Every request goes straight to the provider; a burst of traffic
    # quickly triggers RateLimitError.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return {"text": response.choices[0].message.content}
```

The fix

Use a rate limiting middleware like slowapi to restrict the number of requests per client IP. This prevents overwhelming the LLM API and avoids RateLimitError. Additionally, implement exponential backoff retries on the API call to handle transient rate limits gracefully.

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from openai import OpenAI, RateLimitError
import os
import asyncio

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_middleware(SlowAPIMiddleware)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    # An exception handler must return a response; returning an
    # HTTPException object here would not produce a 429.
    return JSONResponse(status_code=429, content={"detail": "Too many requests"})

async def call_llm_with_retries(prompt: str, retries: int = 3, backoff_in_seconds: float = 1.0):
    for attempt in range(retries):
        try:
            # The OpenAI client is synchronous; run it in a worker thread
            # so it does not block the event loop.
            response = await asyncio.to_thread(
                client.chat.completions.create,
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < retries - 1:
                # Exponential backoff: 1s, 2s, 4s, ...
                await asyncio.sleep(backoff_in_seconds * 2 ** attempt)
            else:
                raise

@app.post("/generate")
@limiter.limit("5/minute")  # limit to 5 requests per minute per client IP
async def generate(request: Request):
    data = await request.json()
    prompt = data.get("prompt", "")
    text = await call_llm_with_retries(prompt)
    return {"text": text}
```

Preventing it in production

  • Use rate limiting middleware to control request rates per user or IP.
  • Implement exponential backoff retries on API calls to handle transient rate limits.
  • Validate and throttle user inputs to avoid unnecessary API calls.
  • Monitor API usage and set alerts for approaching limits.
  • Consider caching frequent responses to reduce API calls.
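The caching point above can be sketched as a minimal in-memory TTL cache keyed by prompt. The TTL, size limit, and eviction policy here are assumptions for illustration; a production service might use Redis or a dedicated cache layer instead:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry (illustrative sketch)."""

    def __init__(self, ttl_seconds=300, max_entries=1024):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily drop expired entries
            return None
        return value

    def set(self, key, value):
        # Evict the oldest insertion when full (simple FIFO policy).
        if len(self._store) >= self.max_entries and key not in self._store:
            self._store.pop(next(iter(self._store)))
        self._store[key] = (time.monotonic() + self.ttl, value)
```

In the endpoint, check `cache.get(prompt)` before calling the LLM and `cache.set(prompt, text)` after, so repeated identical prompts never count against the provider's rate limit.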

Key Takeaways

  • Use slowapi or similar middleware to enforce request rate limits in FastAPI.
  • Implement exponential backoff retries around LLM API calls to handle transient RateLimitError.
  • Monitor and throttle user inputs to prevent excessive API usage and ensure stable service.
Verified 2026-04 · gpt-4o