
How to implement rate limiting for AI features

Quick answer
Implement rate limiting for AI features by catching RateLimitError exceptions from API calls and applying exponential backoff retry logic. Use client-side counters or token buckets to throttle requests and avoid exceeding API quotas.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

Rate limiting errors occur when your application sends too many requests to an AI API within a short time, exceeding the provider's quota or concurrency limits. For example, calling client.chat.completions.create() rapidly without delays can trigger a RateLimitError. The API responds with HTTP 429 status and an error message indicating the limit was exceeded.

Example broken code that triggers rate limiting:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

for _ in range(100):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
output
openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.

The fix

Wrap your API calls in a retry loop with exponential backoff to handle RateLimitError. This delays retries progressively, reducing request bursts. Additionally, implement client-side rate limiting using counters or token buckets to throttle requests before hitting the API.

This example uses time.sleep() for backoff and catches the rate limit error to retry:

python
from openai import OpenAI, RateLimitError
import os
import time

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

max_retries = 5

for _ in range(100):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}]
            )
            print(response.choices[0].message.content)
            break  # success, exit retry loop
        except RateLimitError:
            if attempt == max_retries - 1:
                continue  # final attempt failed; skip the sleep and give up
            wait_time = 2 ** attempt  # exponential backoff: 1, 2, 4, 8 seconds
            print(f"Rate limit hit, retrying in {wait_time}s...")
            time.sleep(wait_time)
    else:
        print("Max retries exceeded, skipping request.")
output
Hello
Hello
Rate limit hit, retrying in 1s...
Hello
Hello
...
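The retry loop reacts to rate limits after they happen; the client-side throttling mentioned above prevents them proactively. Here is a minimal token-bucket sketch (the `TokenBucket` class and its parameters are illustrative, not part of the OpenAI SDK): call `acquire()` before each API request, and it blocks until a request "token" is available.

```python
import time

class TokenBucket:
    """Minimal token bucket: allows bursts up to `capacity` requests,
    refilled continuously at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough for the next token to accrue.
            time.sleep((1.0 - self.tokens) / self.rate)

# Usage: throttle the API loop to roughly 2 requests/second.
bucket = TokenBucket(rate=2.0, capacity=5)
# bucket.acquire()  # call before each client.chat.completions.create(...)
```

Because the bucket allows a small burst before throttling kicks in, occasional spikes still go through immediately while the sustained rate stays under your quota.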

Preventing it in production

  • Use client-side rate limiting algorithms like token bucket or leaky bucket to control request rate before calling the API.
  • Implement exponential backoff with jitter to avoid synchronized retries causing spikes.
  • Monitor API usage and quotas via provider dashboards or API responses to adjust limits dynamically.
  • Use circuit breakers or fallback logic to degrade gracefully when limits are hit.
  • Cache frequent responses to reduce unnecessary API calls.
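The jitter recommendation above can be sketched as a small helper (`backoff_with_jitter` is a hypothetical function with illustrative defaults): instead of waiting exactly 2 ** attempt seconds, each client waits a random fraction of that cap, so clients that hit the limit at the same moment do not all retry at the same moment again.

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a "full jitter" wait time, uniform in [0, min(cap, base * 2**attempt)].

    The cap bounds the worst-case wait; the randomization spreads retries
    from many clients over the whole backoff window.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# In the retry loop, replace time.sleep(2 ** attempt) with:
# time.sleep(backoff_with_jitter(attempt))
```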

Key Takeaways

  • Catch RateLimitError and retry with exponential backoff to handle API limits gracefully.
  • Implement client-side throttling to prevent hitting API rate limits proactively.
  • Monitor usage and apply fallback strategies to maintain app reliability under load.
Verified 2026-04 · gpt-4o