
How to handle LLM downtime in production

Quick answer
Handle LLM downtime in production by implementing retry logic with exponential backoff around your API calls and using fallback mechanisms such as cached responses or alternative models. Monitoring and alerting on API errors or latency also help maintain service reliability.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

LLM downtime occurs due to service outages, rate limits, or network issues affecting API availability. For example, calling client.chat.completions.create() without retries can fail with RateLimitError or APITimeoutError, causing your app to crash or hang.

Typical error output includes HTTP 429 (Too Many Requests) or 503 (Service Unavailable) responses.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Broken code: no retry handling
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
output
openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details.', 'type': 'insufficient_quota', ...}}

The fix

Wrap your API calls with retry logic using exponential backoff to handle transient errors like rate limits or timeouts. This prevents immediate failure and allows the request to succeed after a delay.

The example below uses tenacity for retries and catches RateLimitError and APITimeoutError, the exceptions raised by the openai v1.x SDK for these failures. This lets your app recover gracefully from temporary downtime.

python
from openai import OpenAI
import os
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
from openai import RateLimitError, APITimeoutError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@retry(
    wait=wait_exponential(min=1, max=10),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError))
)
def call_llm(messages):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

try:
    result = call_llm([{"role": "user", "content": "Hello"}])
    print(result)
except Exception as e:
    print(f"LLM call failed after retries: {e}")
output
Hello
# or after retries if transient errors occur
# LLM call failed after retries: Error code: 429 - You exceeded your current quota, please check your plan and billing details.
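If you would rather not add tenacity as a dependency, the same retry-with-backoff pattern can be hand-rolled with the standard library. This is a minimal sketch, not part of the OpenAI SDK; the function name and default delays are illustrative choices:

```python
import random
import time


def retry_with_backoff(fn, retryable=(Exception,), max_attempts=5,
                       base_delay=1.0, max_delay=10.0):
    """Call fn(), retrying on the given exception types with exponential backoff.

    Illustrative helper: doubles the delay each attempt, caps it at max_delay,
    and adds random jitter so many clients recovering from the same outage
    don't retry in lockstep (the "thundering herd" problem).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

You could then wrap the call as retry_with_backoff(lambda: call_llm(messages), retryable=(RateLimitError, APITimeoutError)). The jitter matters in production: without it, every client that failed at the same moment retries at the same moment too.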

Preventing it in production

  • Implement retries with exponential backoff for transient API errors.
  • Use fallback strategies such as cached responses, simpler local models, or alternative LLM providers to maintain service continuity.
  • Monitor API error rates, latency, and usage quotas with alerting to detect downtime early.
  • Validate inputs and outputs to avoid unnecessary API calls that may trigger rate limits.
  • Design your system to degrade gracefully, informing users when AI features are temporarily unavailable.
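The fallback and graceful-degradation points above can be sketched as a small wrapper that tries the live model first, then a cached answer, then a static degraded-mode message. All names here are illustrative assumptions, and a real cache would track freshness and eviction:

```python
def answer_with_fallback(prompt, primary, cache):
    """Return an answer for prompt, degrading gracefully if the LLM is down.

    primary: a callable that performs the (retrying) LLM call, e.g. call_llm.
    cache: a dict-like store of previous answers keyed by prompt.
    """
    try:
        reply = primary(prompt)
        cache[prompt] = reply  # keep the cache warm for future outages
        return reply
    except Exception:
        if prompt in cache:
            # Serve a stale-but-useful cached reply rather than failing.
            return cache[prompt]
        # Last resort: tell the user plainly instead of surfacing a stack trace.
        return "The AI assistant is temporarily unavailable. Please try again shortly."
```

In practice the `primary` callable would be the retrying call_llm() wrapper from the fix above, so retries handle brief blips and the cache handles longer outages.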

Key Takeaways

  • Use exponential backoff retries to handle transient LLM API errors automatically.
  • Implement fallback mechanisms like caching or alternative models to maintain uptime.
  • Monitor API usage and errors to detect and respond to downtime quickly.
Verified 2026-04 · gpt-4o