High severity · HTTP 429 · Intermediate · Fix: 2-5 min

RateLimitError (HTTP 429)

What this error means
Azure OpenAI responds with HTTP 429 when token usage exceeds the allowed tokens per minute quota. The openai Python client raises this response as RateLimitError.

Stack trace

traceback
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached', 'type': 'requests', 'code': 'rate_limit_exceeded'}}

QUICK FIX
Catch RateLimitError and retry with exponential backoff, so the request is retried automatically after a wait.

Why it happens

Azure OpenAI enforces strict token usage limits per minute to manage resource allocation. When your application sends requests that cumulatively exceed the allowed tokens per minute quota, the service responds with a 429 RateLimitError to throttle usage.

Detection

Monitor API responses for HTTP 429 status codes and catch RateLimitError exceptions to detect when token rate limits are exceeded before the application crashes.
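
As a rough illustration of the detection step, the sketch below checks a status code for throttling and works out how long to wait from the response headers. The helper names `is_rate_limited` and `retry_delay_seconds` are hypothetical, and the presence of `retry-after-ms` or `retry-after` headers on the 429 response is an assumption; when neither is usable the function falls back to a default.

```python
from typing import Mapping


def is_rate_limited(status_code: int) -> bool:
    """True when an HTTP status code signals throttling."""
    return status_code == 429


def retry_delay_seconds(headers: Mapping[str, str], default: float = 10.0) -> float:
    """Return a wait time in seconds derived from response headers.

    Assumption: the 429 response may carry `retry-after-ms` or
    `retry-after`; if neither is present or parseable, use `default`.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    millis = lowered.get("retry-after-ms")
    if millis is not None:
        try:
            return max(0.0, float(millis) / 1000.0)
        except ValueError:
            pass  # fall through to the next header

    seconds = lowered.get("retry-after")
    if seconds is not None:
        try:
            return max(0.0, float(seconds))
        except ValueError:
            pass  # e.g. an HTTP-date value, which this sketch does not parse

    return default
```

With the openai 1.x client, a caught RateLimitError should expose `status_code` and `response.headers`, which could be passed to these helpers; treat that wiring as a suggestion to verify against your installed version.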

Causes & fixes

1

Sending too many tokens in a short time, so that requests exceed Azure OpenAI's tokens per minute quota

✓ Fix

Reduce the frequency of requests or the size of prompts and completions to stay within the tokens per minute limit.

2

Multiple parallel requests cumulatively exceeding the token rate limit

✓ Fix

Implement request queuing or rate limiting in your client to serialize or throttle requests to Azure OpenAI.

3

Using a subscription tier with a low tokens per minute limit without adjusting usage accordingly

✓ Fix

Move to a higher tier or request a larger tokens per minute quota for your Azure OpenAI resource, or optimize token usage to fit the current limit.
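
The first two fixes above can be sketched as a client-side token budget that makes callers wait until their request fits. This is a minimal token-bucket sketch, not an official mechanism: the class name `TokenBudget`, the `tokens_per_minute` value, and the per-request token estimates are all assumptions you would replace with figures for your own deployment.

```python
import threading
import time
from typing import Callable


class TokenBudget:
    """Client-side tokens per minute budget (token-bucket style)."""

    def __init__(
        self,
        tokens_per_minute: int,  # assumption: set to your deployment's quota
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = float(tokens_per_minute)
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last refill.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )

    def acquire(self, estimated_tokens: int) -> float:
        """Block until `estimated_tokens` fit in the budget; return seconds waited."""
        if estimated_tokens > self.capacity:
            raise ValueError("request is larger than the per-minute budget")
        waited = 0.0
        with self._lock:  # parallel callers queue up here, one at a time
            self._refill()
            # Small epsilon guards against floating-point rounding.
            while self.available + 1e-9 < estimated_tokens:
                deficit = estimated_tokens - self.available
                pause = deficit / self.refill_per_second
                self._sleep(pause)
                waited += pause
                self._refill()
            self.available = max(0.0, self.available - estimated_tokens)
        return waited
```

Calling `budget.acquire(estimated_tokens)` before each request keeps the combined rate of all threads sharing the budget under the configured limit. Holding the lock while sleeping is a deliberate simplification that serializes callers.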

Code: broken vs fixed

Broken - triggers the error
python
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)  # This may raise RateLimitError if tokens per minute exceeded
print(response)
Fixed - works correctly
python
import os
from openai import OpenAI, RateLimitError
import time

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )  # Added try/except to catch RateLimitError
    print(response)
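except RateLimitError:
    print("Rate limit exceeded, retrying after delay...")
    time.sleep(10)  # Wait before retrying
    # Single retry: a second RateLimitError raised here is not caught
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response)

# Note: API key must be set in environment variable OPENAI_API_KEY
# For an Azure OpenAI resource, openai 1.x also provides the AzureOpenAI client,
# where model is the name of your deployment

Added a try/except block that catches RateLimitError and retries once after a fixed delay, so a single rate limit hit no longer crashes the application immediately.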

Workaround

Catch RateLimitError exceptions and implement exponential backoff retries with delays to avoid immediate failure when token limits are hit.
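
A minimal sketch of that workaround follows, assuming a generic helper rather than anything built into the library. The name `call_with_backoff` and its defaults (5 attempts, 1 second base delay, 60 second cap) are hypothetical; the exception type to retry on is passed in, so `openai.RateLimitError` can be supplied by the caller.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on `retry_on` with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ...
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out clients throttled at the same moment.
            sleep(random.uniform(0, delay))
    raise RuntimeError("max_attempts must be at least 1")


# Hypothetical wiring with the openai 1.x client (not executed here):
# from openai import RateLimitError
# response = call_with_backoff(
#     lambda: client.chat.completions.create(
#         model="gpt-4o-mini",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```

The 1.x client also accepts a `max_retries` option that covers simple cases; an explicit helper like this is mainly useful when you want control over attempt counts, delays, or logging.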

Prevention

Implement client-side rate limiting and batching to keep token usage within Azure OpenAI quotas, and monitor usage metrics to upgrade plans proactively.
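
For the monitoring part, one possible approach is to track tokens consumed over the trailing 60 seconds and compare them with the quota. The class name `UsageWindow` and the quota value are assumptions for illustration; the token counts would come from whatever usage figures your responses report.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class UsageWindow:
    """Track tokens consumed over the trailing 60 seconds."""

    def __init__(
        self,
        quota_tokens_per_minute: int,  # assumption: your deployment's quota
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.quota = quota_tokens_per_minute
        self._clock = clock
        self._events: Deque[Tuple[float, int]] = deque()

    def record(self, total_tokens: int) -> None:
        """Store the token count of one completed request."""
        self._events.append((self._clock(), total_tokens))

    def used(self) -> int:
        """Tokens recorded within the last 60 seconds."""
        cutoff = self._clock() - 60.0
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(tokens for _, tokens in self._events)

    def utilisation(self) -> float:
        """Fraction of the quota used in the current window."""
        return self.used() / self.quota
```

With the 1.x client, `response.usage.total_tokens` is a likely source for `record`, though `usage` can be absent in some modes such as streaming. A utilisation that regularly approaches 1.0 suggests it is time to reduce load or raise the quota.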

Python 3.9+ · openai >=1.0.0 · tested on 1.x
Verified 2026-04