High severity · Intermediate · Fix: 5-10 min

LoadBalancingCooldownError

litellm.errors.LoadBalancingCooldownError

What this error means

LiteLLM raises LoadBalancingCooldownError when a request arrives while its load balancer's cooldown period is still active. The request is rejected rather than forwarded, which protects the backend from overload.

Stack trace

traceback

litellm.errors.LoadBalancingCooldownError: Request rejected due to load balancing cooldown period. Please retry after cooldown expires.

QUICK FIX

Catch LoadBalancingCooldownError on the client and retry with exponential backoff, so the request is sent again once the cooldown has expired.

Why it happens

LiteLLM's internal load balancer enforces a cooldown period between requests to avoid overloading backend resources. When requests come in too quickly, this error is raised to throttle traffic and maintain stability.

Detection

Monitor for LoadBalancingCooldownError exceptions in your request handling code and log the timestamps to detect frequent cooldown triggers before user impact.
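One way to act on this is to record each occurrence with a timestamp and warn when too many fall inside a short window. The sketch below is a minimal illustration only: `CooldownMonitor`, the 60-second window, and the threshold of 5 are assumptions chosen for the example, not LiteLLM settings.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("cooldown_monitor")


class CooldownMonitor:
    """Hypothetical helper: counts cooldown errors in a sliding time window."""

    def __init__(self, window_seconds=60.0, threshold=5, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.clock = clock
        self.events = deque()  # timestamps of recent cooldown errors

    def record(self):
        """Record one cooldown error; return True if the threshold is reached."""
        now = self.clock()
        self.events.append(now)
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        frequent = len(self.events) >= self.threshold
        if frequent:
            logger.warning(
                "%d cooldown errors in the last %.0f s",
                len(self.events), self.window_seconds,
            )
        return frequent
```

Calling `record()` from the `except` block that catches the error would give an early signal that traffic needs to be slowed down.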

Causes & fixes

Sending requests to LiteLLM too rapidly without respecting cooldown intervals

✓ Fix

Implement client-side rate limiting or exponential backoff to space out requests and respect the cooldown period.
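A generic backoff helper might look like the sketch below. It is deliberately independent of LiteLLM: the exception class defined here is a local stand-in so the example runs on its own, and `call_with_backoff` with its default delays is a hypothetical helper, not part of any library.

```python
import random
import time


class LoadBalancingCooldownError(Exception):
    """Local stand-in for the library's exception (assumption for this sketch)."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(LoadBalancingCooldownError,), sleep=time.sleep):
    """Call func(), retrying on cooldown errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay doubles on each attempt (1 s, 2 s, 4 s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In real code, pass the exception class from your installed LiteLLM version as `retry_on` and wrap the request in a callable, for example `call_with_backoff(lambda: client.generate('Hello world'))`.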

Multiple concurrent requests exceeding LiteLLM's load balancing capacity

✓ Fix

Serialize or queue requests on the client side to reduce concurrency and avoid triggering cooldown.
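As a rough illustration of limiting concurrency, the sketch below funnels calls through a semaphore so that only a fixed number run at once and the rest wait their turn. `ConcurrencyLimiter` is a hypothetical name, and the right value for `max_concurrent` depends on your deployment; the default of 1 simply serializes all requests.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical helper: allows at most max_concurrent calls in flight."""

    def __init__(self, max_concurrent=1):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Blocks until a slot is free, so extra callers queue up here.
        with self._slots:
            return func(*args, **kwargs)
```

Each worker thread would then call something like `limiter.run(client.generate, 'Hello world')` instead of calling the client directly.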

Using an outdated LiteLLM client version that mishandles cooldown signals

✓ Fix

Upgrade to the latest LiteLLM client version, which handles cooldown errors and retries correctly.

Code: broken vs fixed

Broken - triggers the error

python

from litellm import LiteLLM

client = LiteLLM(api_key='mykey')
response = client.generate('Hello world')  # triggers LoadBalancingCooldownError

Fixed - works correctly

python

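import os
import time

from litellm import LiteLLM, LoadBalancingCooldownError

client = LiteLLM(api_key=os.environ['LITELLM_API_KEY'])

max_attempts = 4
for attempt in range(max_attempts):
    try:
        response = client.generate('Hello world')
        break  # success: stop retrying
    except LoadBalancingCooldownError:
        if attempt == max_attempts - 1:
            raise  # give up after the final attempt
        time.sleep(2 * 2 ** attempt)  # back off: 2 s, 4 s, 8 s

print(response)

The request now runs inside a bounded retry loop that catches LoadBalancingCooldownError and sleeps with exponential backoff before trying again. A short cooldown no longer causes an immediate failure, and a persistent one still surfaces after the last attempt.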

⚠ Workaround

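Catch LoadBalancingCooldownError and, if the error message includes a cooldown duration, parse it out and sleep for that long before retrying the request. The sample message shown above contains no duration, so fall back to a fixed delay when none is found.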
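The parsing step could be sketched as follows. The message format is an assumption: the pattern looks for a number followed by a seconds unit (for example "retry after 2.5 seconds"), which is not a documented LiteLLM format, and `cooldown_seconds` is a hypothetical helper. When nothing matches, it returns a default delay.

```python
import re

# Assumed message shape, e.g. "... retry after 2.5 seconds"; not a documented format.
_DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:seconds?|secs?|s)\b", re.IGNORECASE)


def cooldown_seconds(error, default=2.0):
    """Return the cooldown duration found in the error text, else default."""
    match = _DURATION_RE.search(str(error))
    return float(match.group(1)) if match else default
```

Inside the `except` block, `time.sleep(cooldown_seconds(exc))` would then wait for the reported duration, or for the default when the message carries none.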

✓ Prevention

Implement client-side rate limiting and exponential backoff strategies to space out requests and avoid triggering LiteLLM's load balancing cooldown.
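For the rate-limiting half of this advice, a minimal sketch is a limiter that enforces a minimum gap between request starts. `MinIntervalLimiter` and the 0.5-second interval are illustrative assumptions; the actual cooldown length is not stated here, so tune the interval against your own deployment.

```python
import threading
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforces a minimum gap between request starts."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request may start; return the time waited."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before sleeping so callers queue in order.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            self._sleep(delay)
        return delay
```

Calling `limiter.wait()` immediately before each request spaces calls out across threads, which complements backoff: the limiter reduces how often the cooldown is triggered, and backoff handles the cases that still slip through.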

Python 3.9+ · litellm >=0.1.0 · tested on 0.2.0

Verified 2026-04
