Debug Fix intermediate · 3 min read

Azure OpenAI rate limiting best practices

Quick answer
Azure OpenAI enforces rate limits that can cause RateLimitError when exceeded. Implement exponential backoff retry logic around your API calls using the AzureOpenAI client to handle these errors gracefully and maintain robust application performance.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

Azure OpenAI enforces per-deployment quotas on requests per minute (RPM) and tokens per minute (TPM). When your application sends requests too quickly or exceeds its allowed quota, the API responds with a 429 HTTP status code and the client raises a RateLimitError indicating the limit was exceeded.

Example of triggering code without retry handling:

python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
output
Traceback (most recent call last):
  ...
openai.RateLimitError: Error code: 429 - Rate limit exceeded

The fix

Wrap your Azure OpenAI API calls with exponential backoff retry logic to automatically handle RateLimitError. This approach retries the request after increasing delays, reducing the chance of repeated failures and respecting the service limits.

The example below uses time.sleep() with exponential backoff and jitter for robustness.

python
import os
import random
import time
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

def call_azure_openai_with_retry(messages, max_retries=5):
    delay = 1  # initial delay in seconds
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
                messages=messages
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # sleep for the backoff delay plus random jitter, so many
            # clients hitting the limit at once do not retry in lockstep
            time.sleep(delay + random.uniform(0, 1))
            delay *= 2  # exponential backoff

# Usage
messages = [{"role": "user", "content": "Hello"}]
result = call_azure_openai_with_retry(messages)
print(result)
output
Hello! How can I assist you today?
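The same pattern generalizes to a reusable decorator, so every API-calling function shares one retry policy. The sketch below is a stdlib-only illustration (the `retry_on` name and its parameters are ours, not part of the openai SDK):

```python
import functools
import random
import time

def retry_on(exc_type, max_retries=5, base_delay=1.0):
    """Retry the wrapped function on exc_type, doubling the delay each time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == max_retries - 1:
                        raise  # out of retries; surface the error
                    # backoff delay with multiplicative jitter to avoid
                    # synchronized retries across clients
                    time.sleep(delay * random.uniform(1.0, 1.5))
                    delay *= 2
        return wrapper
    return decorator
```

Applied as `@retry_on(RateLimitError)` on a function that calls `client.chat.completions.create`, this gives the same behavior as `call_azure_openai_with_retry` without repeating the loop in every call site.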

Preventing it in production

  • Implement robust retry logic with exponential backoff and jitter to avoid synchronized retries.
  • Monitor your usage and set alerts for approaching rate limits via Azure Portal or telemetry.
  • Use client-side rate limiting to throttle requests proactively.
  • Consider scaling your Azure OpenAI resource or deploying multiple endpoints if your workload requires higher throughput.
  • Cache frequent responses when possible to reduce API calls.
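The client-side throttling bullet above can be sketched as a small token-bucket limiter (the `RateLimiter` class is illustrative, not part of any SDK): call `acquire()` before each API request so your process never exceeds its configured request rate, rather than relying on 429 responses to slow you down.

```python
import threading
import time

class RateLimiter:
    """Token bucket: allows at most `rate` requests per `per` seconds."""

    def __init__(self, rate, per=60.0):
        self.capacity = rate
        self.tokens = float(rate)
        self.fill_rate = rate / per  # tokens replenished per second
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # top up tokens based on elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.fill_rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.fill_rate
            time.sleep(wait)

# e.g. for a deployment with a 60 RPM quota:
limiter = RateLimiter(rate=60, per=60.0)
# limiter.acquire()  # call before each client.chat.completions.create(...)
```

Set `rate` slightly below your deployment's actual RPM quota to leave headroom for retries and other clients sharing the same resource.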

Key Takeaways

  • Always implement exponential backoff retries to handle Azure OpenAI rate limits gracefully.
  • Monitor API usage and proactively throttle requests to prevent hitting limits.
  • Scaling and caching strategies reduce the risk of rate limiting in production.
Verified 2026-04 · gpt-4o, AzureOpenAI