How to handle LLM API rate limits (RateLimitError)
Why this happens
LLM API providers enforce rate limits to prevent abuse and ensure fair resource allocation. When your application sends requests too quickly, or exceeds its allowed requests (or tokens) per minute, the API returns a RateLimitError. The error surfaces as an HTTP 429 status with a message like "Rate limit exceeded." For example, calling the OpenAI chat.completions.create method in a tight loop with no delays triggers it.
Example of problematic code that triggers rate limiting:
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# 100 back-to-back requests with no delay -- likely to trip the rate limit
for _ in range(100):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
```

The resulting error:

```
openai.RateLimitError: Rate limit exceeded
```
The fix
Implement exponential backoff retry logic to automatically pause and retry requests when a RateLimitError occurs. This approach respects the API's rate limits and reduces failed calls. Additionally, you can throttle requests by adding delays or using a rate limiter library.
Below is a corrected example using Python with exponential backoff and jitter to handle rate limits gracefully:
```python
import os
import random
import time

import openai
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
max_retries = 5

for _ in range(100):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}]
            )
            print(response.choices[0].message.content)
            break  # Success, exit retry loop
        except openai.RateLimitError:
            # Exponential backoff: 1s, 2s, 4s, ... plus up to 1s of jitter
            sleep_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit, retrying in {sleep_time:.2f} seconds...")
            time.sleep(sleep_time)
    else:
        # Retries exhausted without a successful call
        raise RuntimeError("Still rate limited after max retries")
```

Sample output:

```
Hello
Hello
Rate limit hit, retrying in 2.73 seconds...
Hello
...
```
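The retry loop above can also be factored into a reusable decorator, so every API call site gets the same policy without repeating the boilerplate. This is a sketch: `with_backoff` is a helper name introduced here, not part of the OpenAI SDK.

```python
import functools
import random
import time


def with_backoff(max_retries=5, base=1.0, retry_on=(Exception,)):
    """Retry the wrapped function with exponential backoff plus jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of retries: surface the error to the caller
                    # Delay doubles each attempt; jitter spreads out retries
                    # from many clients so they don't collide again in sync.
                    time.sleep(base * (2 ** attempt) + random.uniform(0, base))
        return wrapper
    return decorator
```

Applied to the example above, you would decorate your call site with `@with_backoff(retry_on=(openai.RateLimitError,))` and drop the inner retry loop entirely.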
Preventing it in production
To avoid rate limit errors in production AI applications, implement these best practices:
- Client-side throttling: Limit the number of requests per second/minute using token buckets or leaky bucket algorithms.
- Exponential backoff retries: Automatically retry failed requests with increasing delays and jitter to reduce collision.
- Request batching: Combine multiple prompts or queries into a single API call if supported.
- Monitoring and alerting: Track API usage and error rates to detect and respond to rate limit issues early.
- Fallback strategies: Use cached responses or degrade gracefully when limits are hit.
These measures ensure stable and efficient use of LLM APIs without overwhelming provider infrastructure.
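The client-side throttling bullet above can be sketched as a simple token bucket. The `TokenBucket` class below is illustrative, not a specific library's API; in production you might prefer an established rate-limiter package.

```python
import threading
import time


class TokenBucket:
    """Token-bucket limiter: allows `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens in proportion to elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)


# Allow at most 2 requests per second, with bursts of up to 5
bucket = TokenBucket(rate=2, capacity=5)
```

Call `bucket.acquire()` immediately before each API request; the call returns at once while tokens remain and blocks just long enough to stay under the limit otherwise.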
Key Takeaways
- Implement exponential backoff with jitter to handle RateLimitError gracefully.
- Throttle API calls client-side to stay within provider rate limits and avoid errors.
- Monitor API usage and errors to proactively prevent rate limiting in production.