High severity HTTP 429 intermediate · Fix: 5-10 min

RateLimitError

openai.RateLimitError (HTTP 429)

What this error means

Requests sent by a LlamaIndex query engine to the OpenAI API exceed the allowed rate. The API answers with HTTP 429, and the OpenAI Python client raises openai.RateLimitError.

Stack trace

traceback

Traceback (most recent call last):
  File "main.py", line 42, in run_query
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
  File "/usr/local/lib/python3.9/site-packages/openai/openai.py", line 123, in create
    raise RateLimitError(error_message)
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached', 'type': 'requests', 'code': 'rate_limit_exceeded'}}

QUICK FIX

Wrap OpenAI calls in a try/except that catches openai.RateLimitError, and retry with exponential backoff.

Why it happens

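The OpenAI API enforces rate limits for each model. These limits are measured mainly in requests per minute (RPM) and tokens per minute (TPM). They apply at the organization and project level rather than per API key, and their size depends on the account's usage tier. When LlamaIndex's query engine sends more requests or tokens in a short window than that budget allows, the API rejects the excess with HTTP 429 until the window resets.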

Detection

Catch openai.RateLimitError (the client's exception for HTTP 429) around every API call, and log its error code and response headers. This way rate limiting is detected and handled instead of crashing the application.
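The sketch below shows what such a detection step might look like. describe_rate_limit is a hypothetical helper, not part of the openai package. It reads attributes that openai>=1.0 status errors expose (status_code, code, and response.headers), and it uses getattr so that test doubles also work. It assumes two things: that a retry-after header may or may not be present, and that a 429 whose code is insufficient_quota signals exhausted billing or quota rather than a transient limit.

```python
def describe_rate_limit(exc: BaseException) -> dict:
    """Summarise a 429 error for logging (hypothetical helper).

    openai>=1.0 APIStatusError subclasses such as RateLimitError expose
    .status_code, .code and .response (an httpx.Response with .headers).
    getattr keeps this usable on test doubles or partial objects.
    """
    response = getattr(exc, "response", None)
    headers = getattr(response, "headers", None) or {}
    code = getattr(exc, "code", None)

    retry_after = None
    raw = headers.get("retry-after")  # assumption: may be absent
    if raw is not None:
        try:
            retry_after = float(raw)
        except ValueError:
            retry_after = None  # e.g. an HTTP-date value; not parsed here

    return {
        "status": getattr(exc, "status_code", None),
        "code": code,
        # "insufficient_quota" also arrives as HTTP 429, but backoff will not fix it.
        "retryable": code != "insufficient_quota",
        "retry_after_s": retry_after,
        "remaining_requests": headers.get("x-ratelimit-remaining-requests"),
    }


# Usage (requires openai>=1.0):
#   try:
#       client.chat.completions.create(model="gpt-4o-mini", messages=messages)
#   except RateLimitError as e:
#       print(describe_rate_limit(e))
```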

Causes & fixes

The query engine sends too many requests to the OpenAI API, either concurrently or in rapid succession.

✓ Fix

Throttle requests on the client side, or add exponential backoff retries, so that the query engine sends requests less often.
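One way to throttle is to space calls evenly from a per-minute budget. For example, 300 RPM works out to one call every 0.2 seconds. The sketch below is a minimal, thread-safe pacer under that assumption. MinIntervalThrottle is a hypothetical class, and the rpm value is a budget you pick below your account's limit. The class does not read the limit from the API, and it paces requests only, not tokens (TPM).

```python
import threading
import time


class MinIntervalThrottle:
    """Client-side pacer: spaces calls at least 60 / rpm seconds apart.

    rpm is a budget you choose below your account's request limit; it is not
    read from the API. clock and sleep are injectable for testing.
    """

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = None  # earliest time the next call may start

    def wait(self):
        """Block until this caller's slot arrives; return the seconds slept."""
        with self._lock:
            now = self._clock()
            start = now if self._next_slot is None else max(now, self._next_slot)
            # Reserve the following slot before releasing the lock, so concurrent
            # callers queue up one interval apart instead of all firing at once.
            self._next_slot = start + self.interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay


# Usage (hypothetical budget):
#   throttle = MinIntervalThrottle(rpm=300)
#   throttle.wait()
#   client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```

The lock is held only while a slot is reserved, not during the sleep. This lets other threads claim later slots while earlier ones are still waiting.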

The account is on a low OpenAI usage tier whose rate limits are easily exceeded.

✓ Fix

Upgrade to a higher-tier OpenAI plan with increased rate limits or request quota.

No retry mechanism on RateLimitError in the LlamaIndex query engine code.

✓ Fix

Add retry logic with delays on catching RateLimitError to automatically retry failed requests.
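A reusable form of that retry logic is a decorator that applies capped exponential backoff with "full jitter". Full jitter means each wait is a random time between zero and the current backoff cap, which keeps parallel workers from retrying in lockstep. The sketch below is one possible version. retry_with_backoff and its parameters are hypothetical names, and the default delays are illustrative assumptions, not values recommended by OpenAI.

```python
import functools
import random
import time


def retry_with_backoff(retry_on, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, rng=None):
    """Decorator: retry on the given exception types with capped, jittered backoff.

    retry_on is a tuple of exception classes, e.g. (openai.RateLimitError,).
    sleep and rng are injectable so the behaviour can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng if rng is not None else random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: re-raise the last error
                    # Full jitter: wait a random time up to the capped exponential delay.
                    cap = min(max_delay, base_delay * 2 ** attempt)
                    sleep(rng.uniform(0, cap))
        return wrapper
    return decorator


# Usage (requires openai>=1.0 and a configured client):
#   from openai import RateLimitError
#   @retry_with_backoff((RateLimitError,))
#   def ask(messages):
#       return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```

The openai>=1.0 client already retries 429 responses on its own (twice by default), so stacking this decorator on top multiplies the total number of attempts. Constructing the client with OpenAI(max_retries=0) leaves the decorator as the only retry layer.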

Multiple processes or threads share one API key, and their combined traffic exceeds the rate limit.

✓ Fix

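Coordinate API usage across threads and processes, for example with a shared concurrency cap or a single queue that owns all API calls. Separate API keys do not raise the ceiling by themselves, because OpenAI applies rate limits at the organization and project level rather than per key.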
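Within a single process, a shared semaphore is one simple way to cap how many requests are in flight at once, whichever thread issues them. The sketch below assumes a hypothetical cap of 4. MAX_IN_FLIGHT and limited_call are illustrative names. A threading semaphore only coordinates threads inside one process. With several processes, you would still need to split the budget between them or route calls through one worker.

```python
import threading

# Hypothetical cap on simultaneous API requests for this process; pick it
# from your own rate-limit budget.
MAX_IN_FLIGHT = 4
_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def limited_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) while holding one of MAX_IN_FLIGHT slots.

    Threads beyond the cap block here until a slot frees up, so every
    thread in the process shares one concurrency budget.
    """
    with _in_flight:
        return fn(*args, **kwargs)


# Usage: route every API call through the same limiter, whatever thread it runs on:
#   limited_call(client.chat.completions.create, model="gpt-4o-mini", messages=messages)
```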

Code: broken vs fixed

Broken - triggers the error

python

from openai import OpenAI

client = OpenAI()

messages = [{"role": "user", "content": "Hello"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)  # Raises RateLimitError if too many calls

Fixed - works correctly

python

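import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Hello"}]
max_attempts = 5

for attempt in range(max_attempts):
    try:
        response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        print(response.choices[0].message.content)
        break
    except RateLimitError:
        # Skip the sleep after the final attempt; nothing follows it
        if attempt < max_attempts - 1:
            wait_time = 2 ** attempt  # 1, 2, 4, 8 seconds
            print(f"Rate limit hit, retrying in {wait_time} seconds...")
            time.sleep(wait_time)
else:
    # Runs only if the loop never hit `break`, i.e. every attempt was rate-limited
    print("Failed after retries due to rate limit.")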

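The try/except catches RateLimitError and retries with exponential backoff, waiting 1, 2, 4, then 8 seconds between the five attempts. It does not sleep after the final failure. As a result, rate limits are handled gracefully instead of crashing the program.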


Workaround

As a stopgap, wrap OpenAI calls in try/except RateLimitError and retry after a fixed delay. This spaces out retries temporarily, but it does not lower the overall request rate the way throttling does.


Prevention

Design the system to limit request concurrency, pace calls using the x-ratelimit-* response headers, and move to an OpenAI usage tier that matches your traffic.
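OpenAI documents response headers such as x-ratelimit-remaining-requests and x-ratelimit-reset-requests. The sketch below turns them into a wait time before the next call. It assumes the reset value is a duration string like "1s", "6m0s" or "20ms", as in OpenAI's documentation examples, so check this against the headers you actually receive. parse_reset and pacing_delay are hypothetical helpers. The pacing rule, which spreads the remaining requests evenly until the reset, is a simple heuristic rather than OpenAI guidance.

```python
import re

# Seconds per unit for duration strings such as "6m0s" or "20ms" (assumed format).
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_reset(value):
    """Convert a reset duration like '1s', '6m0s' or '20ms' to seconds."""
    parts = _PART.findall(value)
    # Reject strings that contain anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unrecognised duration: {value!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)


def pacing_delay(headers):
    """Heuristic: seconds to wait before the next request.

    Spreads the remaining request budget evenly over the time until reset;
    waits for the full reset once the budget is exhausted.
    """
    remaining = int(headers.get("x-ratelimit-remaining-requests", "1"))
    reset_s = parse_reset(headers.get("x-ratelimit-reset-requests", "0s"))
    if remaining <= 0:
        return reset_s
    return reset_s / remaining


# Usage (openai>=1.0): with_raw_response exposes the HTTP headers.
#   raw = client.chat.completions.with_raw_response.create(model="gpt-4o-mini", messages=messages)
#   completion = raw.parse()
#   time.sleep(pacing_delay(raw.headers))
```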

Python 3.9+ · llama-index >=0.5.0 · tested on 0.5.x

Verified 2026-04

