Debug Fix beginner · 3 min read

How to use AI APIs with rate limiting

Quick answer
When calling AI APIs with models such as gpt-4o or claude-3-5-sonnet-20241022, you may get a RateLimitError if you exceed your request quota. To handle it, wrap your API calls in retry logic with exponential backoff: the client waits progressively longer between attempts, so temporary limits can clear without crashing your application.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

AI APIs enforce rate limits to prevent abuse and ensure fair usage. If your application sends too many requests too quickly, the API responds with HTTP 429, which the SDK raises as RateLimitError. This typically happens in tight loops or high-concurrency code that has no delay or retry logic.

Example broken code that triggers rate limiting:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

for i in range(100):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
output
openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.

The fix

Wrap each API call in a retry loop that catches RateLimitError and waits before trying again, doubling the wait after each failed attempt (exponential backoff). This spreads out request bursts and gives the limit time to reset.

Example fixed code with retries:

python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

max_retries = 5

for i in range(100):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}]
            )
            print(response.choices[0].message.content)
            break  # success, exit retry loop
        except RateLimitError:
            # Catch the SDK's exception class directly; other errors propagate.
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error instead of skipping
            wait_time = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, 8s
            print(f"Rate limit hit, retrying in {wait_time}s...")
            time.sleep(wait_time)
output
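(model reply to "Hello")
Rate limit hit, retrying in 1s...   (printed only when a request is throttled)
... (100 replies in total; the script raises only if a request fails 5 times in a row)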

Preventing it in production

  • Implement robust retry with exponential backoff and jitter to avoid synchronized retries.
  • Monitor API usage and set alerts for approaching rate limits.
  • Use client-side rate limiting to throttle requests proactively.
  • Consider fallback models or cached responses when limits are hit.
  • Batch requests if supported to reduce call frequency.
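
The first bullet can be sketched as a small reusable helper. This is a minimal illustration, not the OpenAI SDK's own retry mechanism: the function name `retry_with_backoff`, its parameters, and the "full jitter" policy (sleep a random time between 0 and the current exponential cap) are assumptions chosen for the example. The `sleep` and `rng` arguments exist only so the helper can be exercised without real waiting.

```python
import random
import time


def retry_with_backoff(call, retry_on=(Exception,), max_retries=5,
                       base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Run call(); on a retryable error, wait with full jitter and retry.

    Hypothetical helper for illustration. `retry_on` is a tuple of
    exception classes that should trigger a retry.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            # Exponential cap: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random delay in [0, cap) so that clients
            # throttled at the same moment do not all retry together.
            sleep(cap * rng())
```

In real use you would likely pass the SDK's exception class, for example `retry_with_backoff(lambda: client.chat.completions.create(...), retry_on=(RateLimitError,))`, so that unrelated errors are not retried.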

Key Takeaways

  • Always implement retry logic with exponential backoff to handle RateLimitError gracefully.
  • Monitor and throttle your request rate proactively to avoid hitting API limits.
  • Use fallback strategies like caching or alternative models to maintain app availability under rate limits.
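
Proactive throttling can be sketched with a sliding-window limiter that blocks before a request would exceed a self-imposed budget. Everything here is an assumption for illustration: the class name `SlidingWindowLimiter`, the `max_calls` and `period` values you pick, and the single-threaded design. The provider's actual limits for your account are not known to this code, and `max_calls` is assumed to be at least 1.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls per `period` seconds (client-side).

    Hypothetical helper; not thread-safe. `clock` and `sleep` are
    injectable so the limiter can be tested without real waiting.
    """

    def __init__(self, max_calls, period,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # start times of calls still in the window

    def acquire(self):
        """Block until one more call fits in the current window."""
        while True:
            now = self.clock()
            # Forget calls that have left the window.
            while self.stamps and now - self.stamps[0] >= self.period:
                self.stamps.popleft()
            if len(self.stamps) < self.max_calls:
                self.stamps.append(now)
                return
            # Window is full: wait until the oldest call expires.
            self.sleep(self.period - (now - self.stamps[0]))
```

A typical pattern would be to create one limiter, such as `SlidingWindowLimiter(max_calls=50, period=60)`, and call `limiter.acquire()` immediately before each API request. Retry logic is still worth keeping, since the server-side limit may be stricter than the local budget.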
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022