Fix Cerebras rate limit error
Quick answer
A RateLimitError from the Cerebras API occurs when you exceed the allowed request rate. Wrap your API calls, made here through the OpenAI SDK, in retry logic with exponential backoff so that rate-limited requests are retried automatically instead of failing.
Error type: api_error
Why this happens
The RateLimitError occurs when your application sends requests to the Cerebras API faster than the allowed rate limit. This can happen during bursts of traffic or rapid loops without delay. The error message typically looks like:
    openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.
Example of problematic code without retry logic:
    from openai import OpenAI
    import os

    # Point the OpenAI SDK at the Cerebras endpoint.
    client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

    # A single unguarded call: any RateLimitError propagates and stops the program.
    response = client.chat.completions.create(
        model="llama3.3-70b",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
Output:
    openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.
The fix
Wrap your Cerebras API calls with exponential backoff retry logic to handle RateLimitError. This retries the request after increasing delays, preventing immediate failure and respecting rate limits.
Example fixed code using time.sleep and catching RateLimitError:
    from openai import OpenAI
    import os
    import time
    from openai import RateLimitError

    client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

    max_retries = 5
    retry_delay = 1  # initial delay in seconds

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="llama3.3-70b",
                messages=[{"role": "user", "content": "Hello"}]
            )
            print(response.choices[0].message.content)
            break  # success, exit loop
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # re-raise after max retries
            time.sleep(retry_delay)
            retry_delay *= 2  # exponential backoff
Output:
    Hello! How can I assist you today?
Preventing it in production
- Implement robust retry logic with exponential backoff and jitter to avoid synchronized retries.
- Monitor your API usage and rate limits to adjust request frequency proactively.
- Use circuit breakers or fallback mechanisms to degrade gracefully when limits are hit.
- Cache frequent responses to reduce unnecessary API calls.
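The first bullet above mentions jitter, which the fixed example does not include. The following is a minimal sketch of one common variant ("full jitter"), where each wait is a random fraction of an exponentially growing, capped ceiling. The helper name retry_with_jitter and all of its parameters are hypothetical, chosen for illustration; it uses only the standard library and takes the exception type to retry on as an argument, so it is not tied to any particular SDK.

```python
import random
import time


def retry_with_jitter(call, retry_on, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Run call() and retry on `retry_on` exceptions using full-jitter backoff.

    Hypothetical helper: `sleep` and `rng` are injectable so the behavior
    can be checked without real waiting. Assumes max_retries >= 1.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling: base_delay * 2^attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(ceiling * rng())
```

With the client from the fixed example, a call might look like retry_with_jitter(lambda: client.chat.completions.create(model="llama3.3-70b", messages=[{"role": "user", "content": "Hello"}]), retry_on=RateLimitError). The cap (max_delay) is an assumed value and should be tuned to how long your application can afford to wait.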
Key Takeaways
- Use exponential backoff retry logic to handle Cerebras RateLimitError gracefully.
- Always get your API key from environment variables to avoid authentication issues.
- Monitor and limit request rates proactively to prevent hitting rate limits in production.
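One way to limit request rates proactively is to pace calls on the client side so bursts never reach the API. The sketch below enforces a minimum interval between consecutive requests. The class name MinIntervalThrottle and the requests_per_minute value are hypothetical; the real limit depends on your Cerebras plan and should be taken from your account's documented limits. The sketch is single-threaded and would need a lock to be shared across threads.

```python
import time


class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least `interval` seconds apart.

    Hypothetical helper: `clock` and `sleep` are injectable for testing.
    """

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been sent yet

    def wait(self):
        """Sleep, if needed, until the next request is allowed."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        # Reserve the next slot one interval after this request.
        self._next_allowed = now + self.interval
```

Calling throttle.wait() immediately before each API request spaces the requests out. Pacing reduces how often the limit is hit, but it does not replace retry logic, since other clients sharing the same key can still exhaust the quota.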