
How to handle LLM downtime in production

Quick answer
Handle LLM downtime in production by implementing retry logic with exponential backoff around your API calls and using fallback mechanisms such as cached responses or alternative models. Monitoring and alerting on API errors or latency also help maintain service reliability.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

LLM downtime occurs due to service outages, rate limits, or network issues affecting API availability. For example, calling client.chat.completions.create() without retries can fail with RateLimitError or APITimeoutError, causing your app to crash or hang.

Typical error output includes HTTP 429 (Too Many Requests) or 503 (Service Unavailable) responses.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Broken code: no retry handling
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
output
openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details.', 'type': 'insufficient_quota', ...}}

The fix

Wrap your API calls with retry logic using exponential backoff to handle transient errors like rate limits or timeouts. This prevents immediate failure and allows the request to succeed after a delay.

The example below uses tenacity for retries and catches RateLimitError and APITimeoutError, the exceptions raised by the openai v1.x SDK for these failures. This lets your app recover gracefully from temporary downtime.

python
from openai import OpenAI
import os
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
from openai import RateLimitError, APITimeoutError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@retry(
    wait=wait_exponential(min=1, max=10),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError))
)
def call_llm(messages):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

try:
    result = call_llm([{"role": "user", "content": "Hello"}])
    print(result)
except Exception as e:
    print(f"LLM call failed after retries: {e}")
output
Hello
# or after retries if transient errors occur
# LLM call failed after retries: Error code: 429 - You exceeded your current quota, please check your plan and billing details.
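If you would rather not add tenacity as a dependency, the same retry-with-backoff pattern can be hand-rolled with the standard library. This is a minimal sketch, not part of the OpenAI SDK; the function name and default delays are illustrative choices:

```python
import random
import time


def retry_with_backoff(fn, retryable=(Exception,), max_attempts=5,
                       base_delay=1.0, max_delay=10.0):
    """Call fn(), retrying on the given exception types with exponential backoff.

    Illustrative helper: doubles the delay each attempt, caps it at max_delay,
    and adds random jitter so many clients recovering from the same outage
    don't retry in lockstep (the "thundering herd" problem).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

You could then wrap the call as retry_with_backoff(lambda: call_llm(messages), retryable=(RateLimitError, APITimeoutError)). The jitter matters in production: without it, every client that failed at the same moment retries at the same moment too.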

Preventing it in production

  • Implement retries with exponential backoff for transient API errors.
  • Use fallback strategies such as cached responses, simpler local models, or alternative LLM providers to maintain service continuity.
  • Monitor API error rates, latency, and usage quotas with alerting to detect downtime early.
  • Validate inputs and outputs to avoid unnecessary API calls that may trigger rate limits.
  • Design your system to degrade gracefully, informing users when AI features are temporarily unavailable.
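The fallback and graceful-degradation points above can be sketched as a small wrapper that tries the live model first, then a cached answer, then a static degraded-mode message. All names here are illustrative assumptions, and a real cache would track freshness and eviction:

```python
def answer_with_fallback(prompt, primary, cache):
    """Return an answer for prompt, degrading gracefully if the LLM is down.

    primary: a callable that performs the (retrying) LLM call, e.g. call_llm.
    cache: a dict-like store of previous answers keyed by prompt.
    """
    try:
        reply = primary(prompt)
        cache[prompt] = reply  # keep the cache warm for future outages
        return reply
    except Exception:
        if prompt in cache:
            # Serve a stale-but-useful cached reply rather than failing.
            return cache[prompt]
        # Last resort: tell the user plainly instead of surfacing a stack trace.
        return "The AI assistant is temporarily unavailable. Please try again shortly."
```

In practice the `primary` callable would be the retrying call_llm() wrapper from the fix above, so retries handle brief blips and the cache handles longer outages.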

Key Takeaways

  • Use exponential backoff retries to handle transient LLM API errors automatically.
  • Implement fallback mechanisms like caching or alternative models to maintain uptime.
  • Monitor API usage and errors to detect and respond to downtime quickly.
Verified 2026-04 · gpt-4o