How to use AI APIs with rate limiting
Quick answer
When using AI APIs such as gpt-4o or claude-3-5-sonnet-20241022, you may encounter a RateLimitError if you exceed your request quota. To handle this, wrap your API calls in retry logic with exponential backoff, which automatically pauses and retries after a delay instead of failing outright.

Error type: api_error
Quick fix: add exponential-backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens
AI APIs enforce rate limits to prevent abuse and ensure fair usage. If your application sends too many requests too quickly, the API returns a RateLimitError. This often happens in loops or high-concurrency scenarios without delay or retry logic.
Example broken code that triggers rate limiting:
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

for i in range(100):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)

Output:
openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.
The fix
Wrap your API calls in retry logic with exponential backoff: catch RateLimitError and retry after a wait that doubles on each attempt. This smooths out request bursts and respects the API's limits.
Example fixed code with retries:
import time
import os
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
max_retries = 5

for i in range(100):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}]
            )
            print(response.choices[0].message.content)
            break  # success, exit retry loop
        except RateLimitError:
            wait_time = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, 8s, 16s
            print(f"Rate limit hit, retrying in {wait_time}s...")
            time.sleep(wait_time)
    else:
        # the retry loop never hit break: all attempts were rate-limited
        raise RuntimeError("Max retries exceeded for rate-limited request")

Output:
Hello Hello ... (repeated 100 times without error)
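The inline retry loop above can be factored into a reusable decorator so every API call site doesn't repeat it. Below is a minimal sketch using only the standard library; `with_backoff` and the exception tuple are names introduced here for illustration — in real code you would pass the SDK's RateLimitError as `retry_on`.

```python
import functools
import time

def with_backoff(max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Retry the wrapped function with exponential backoff on the given exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of retries: surface the error to the caller
                    time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
        return wrapper
    return decorator
```

Applied to the example above, you would decorate a small wrapper function, e.g. `@with_backoff(max_retries=5, retry_on=(RateLimitError,))` on a `def ask(prompt): ...` that makes the chat completion call.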
Preventing it in production
- Implement robust retry with exponential backoff and jitter to avoid synchronized retries.
- Monitor API usage and set alerts for approaching rate limits.
- Use client-side rate limiting to throttle requests proactively.
- Consider fallback models or cached responses when limits are hit.
- Batch requests if supported to reduce call frequency.
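The first and third points — jitter and proactive client-side throttling — can be combined in a small helper. The sketch below is a minimal, single-threaded illustration (the `Throttle` class name and its parameters are assumptions, not part of any SDK): it spaces calls at least `1/rate` seconds apart and adds a little random jitter so many clients retrying at once don't synchronize.

```python
import random
import time

class Throttle:
    """Client-side throttle: allow at most `rate` calls per second (single-threaded sketch)."""

    def __init__(self, rate):
        self.min_interval = 1.0 / rate  # minimum seconds between calls
        self.last_call = 0.0

    def wait(self):
        now = time.monotonic()
        elapsed = now - self.last_call
        if elapsed < self.min_interval:
            # sleep just long enough, plus up to 10% jitter to desynchronize clients
            time.sleep(self.min_interval - elapsed + random.uniform(0, 0.1 * self.min_interval))
        self.last_call = time.monotonic()
```

Usage: call `throttle.wait()` immediately before each API request, so your client stays under the quota instead of relying on retries after a 429.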
Key Takeaways
- Always implement retry logic with exponential backoff to handle RateLimitError gracefully.
- Monitor and throttle your request rate proactively to avoid hitting API limits.
- Use fallback strategies like caching or alternative models to maintain app availability under rate limits.