RateLimitError
openai.RateLimitError (HTTP 429)
Stack trace
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached', 'type': 'requests', 'code': 'rate_limit_exceeded'}}
  File "main.py", line 42, in run_query
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
  File "/usr/local/lib/python3.9/site-packages/openai/openai.py", line 123, in create
    raise RateLimitError(error_message)
Why it happens
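The OpenAI API enforces rate limits, counted in requests and tokens per minute, that depend on the model and the usage tier. When LlamaIndex's query engine sends too many requests in a short time, the API responds with HTTP 429, which the OpenAI Python SDK raises as openai.RateLimitError. The limits exist to prevent abuse and ensure fair usage.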
Detection
Monitor API call responses for HTTP 429 status codes and catch openai.RateLimitError exceptions to detect rate limiting before the application crashes.
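One way to make detection concrete is a small wrapper that counts 429 failures around any API call. This is only a sketch: `RateLimitMonitor` and the logger name are hypothetical, and it assumes the raised exception exposes the HTTP status as a `status_code` attribute, as the status-error classes in the OpenAI Python SDK do.

```python
import logging

logger = logging.getLogger("rate_limit_monitor")  # hypothetical logger name


class RateLimitMonitor:
    """Counts rate-limit failures seen around any callable that hits the API."""

    def __init__(self):
        self.calls = 0
        self.rate_limited = 0

    def call(self, fn, *args, **kwargs):
        """Run fn, record a 429 if one occurs, and re-raise the exception."""
        self.calls += 1
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Assumption: the exception carries the HTTP status as `status_code`.
            if getattr(exc, "status_code", None) == 429:
                self.rate_limited += 1
                logger.warning(
                    "HTTP 429 received (%d of %d calls rate limited)",
                    self.rate_limited,
                    self.calls,
                )
            raise
```

Usage would look like `monitor.call(client.chat.completions.create, model="gpt-4o-mini", messages=messages)`. The exception is still re-raised, so retry logic elsewhere keeps working while the counters feed alerts or dashboards.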
Causes & fixes
Cause: The query engine sends too many requests concurrently or in rapid succession to the OpenAI API.
Fix: Implement request throttling or exponential backoff retry logic in the query engine to reduce request frequency.
Cause: The account is on a low usage tier whose rate limits are easily exceeded.
Fix: Upgrade to a higher-tier OpenAI plan with increased rate limits or request quota.
Cause: The LlamaIndex query engine code has no retry mechanism for RateLimitError.
Fix: Catch RateLimitError and retry the failed request automatically after a delay.
Cause: Multiple parallel processes or threads share the same API key, so their combined traffic exceeds the rate limit.
Fix: Coordinate API usage across processes, or use separate API keys to distribute the request load (this helps only if the keys do not share one quota).
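The throttling fix can be sketched as a minimum-interval gate that every thread passes through before calling the API. `MinIntervalThrottle` is a hypothetical helper, not part of LlamaIndex or the OpenAI SDK, and the interval is an assumption you would derive from your own requests-per-minute limit.

```python
import threading
import time


class MinIntervalThrottle:
    """Blocks callers so that calls start at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Reserve the next free slot, sleep until it arrives, return seconds waited."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Book the slot after this one while still holding the lock.
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if delay > 0:
            self._sleep(delay)   # sleep outside the lock so other threads can queue
        return delay
```

For example, `throttle = MinIntervalThrottle(60 / 500)` followed by `throttle.wait()` before each request would pace calls to roughly 500 per minute. This covers threads within one process only; separate processes would need a shared coordinator such as a queue or an external rate limiter.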
Code: broken vs fixed
Broken:
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Hello"}]
# No error handling: raises RateLimitError when the rate limit is exceeded
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

Fixed:
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Hello"}]

# Retry up to 5 times with exponential backoff (1, 2, 4, 8, 16 seconds)
for attempt in range(5):
    try:
        response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        print(response)
        break
    except RateLimitError:
        wait_time = 2 ** attempt
        print(f"Rate limit hit, retrying in {wait_time} seconds...")
        time.sleep(wait_time)
else:
    # Runs only if the loop never reached `break`, i.e. every attempt failed
    print("Failed after retries due to rate limit.")
Workaround
Wrap OpenAI calls in try/except RateLimitError and add a fixed-delay retry loop as a temporary way to reduce request frequency.
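A fixed-delay retry loop can be factored into a small helper so it is easy to remove once a proper fix is in place. `retry_fixed_delay` is a hypothetical name, and the attempt count and delay are placeholder values rather than recommendations.

```python
import time


def retry_fixed_delay(fn, retry_on, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fn(); on `retry_on`, wait `delay` seconds and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
```

With the OpenAI SDK this might be called as `retry_fixed_delay(lambda: client.chat.completions.create(model="gpt-4o-mini", messages=messages), RateLimitError)`. A fixed delay is simpler than exponential backoff but recovers more slowly under sustained load, which is why it suits a stopgap.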
Prevention
Architect your system to limit request concurrency, pace calls using the rate limit headers (x-ratelimit-*) returned with each API response, and choose an API plan whose limits match your expected usage.
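Pacing from rate limit headers could look like the sketch below. It assumes the response carries `x-ratelimit-remaining-requests` and `x-ratelimit-reset-requests`, and that the reset value is a duration string such as "1s" or "6m0s"; check both assumptions against the headers you actually receive. The function names are hypothetical.

```python
import re

# Seconds per unit for duration strings such as "6m0s" or "250ms".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so it is not misread as minutes followed by a stray "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_reset(value):
    """Convert a duration string into seconds."""
    parts = _PART.findall(value)
    if not parts:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def seconds_to_wait(headers):
    """Return how long to pause before the next request, based on response headers."""
    remaining = headers.get("x-ratelimit-remaining-requests")
    reset = headers.get("x-ratelimit-reset-requests")
    if remaining is None or reset is None:
        return 0.0  # headers missing: no basis for pacing
    # Pause only when the request budget for the current window is used up.
    return parse_reset(reset) if int(remaining) <= 0 else 0.0
```

In the OpenAI Python SDK, the headers should be reachable through the raw-response interface, for example `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` alongside `.parse()`. The same approach extends to the token-based headers if token limits are the bottleneck.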