
Fix Cerebras rate limit error

Quick answer
A RateLimitError from the Cerebras API means your application has exceeded the allowed request rate (HTTP 429). Wrap your API calls, made here through the OpenAI SDK, in retry logic with exponential backoff so that a temporary rate limit does not fail the request outright.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

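The RateLimitError occurs when your application sends requests to the Cerebras API faster than its rate limit allows, typically during traffic bursts or in tight loops with no delay between calls. With version 1.x of the OpenAI Python SDK (the version that provides the OpenAI client class used below), the exception is raised as openai.RateLimitError. The message text comes from the server, so the exact wording may differ from this example:

openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.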

Example of problematic code without retry logic:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
output
openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.

The fix

Wrap your Cerebras API calls with exponential backoff retry logic to handle RateLimitError. This retries the request after increasing delays, preventing immediate failure and respecting rate limits.

Example fixed code using time.sleep and catching RateLimitError:

python
import os
import time

from openai import OpenAI, RateLimitError

# The OpenAI SDK client is pointed at the Cerebras endpoint via base_url.
client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

max_retries = 5  # total attempts, including the first call
retry_delay = 1  # initial delay in seconds

for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(
            model="llama3.3-70b",
            messages=[{"role": "user", "content": "Hello"}]
        )
        print(response.choices[0].message.content)
        break  # success, exit loop
    except RateLimitError:
        if attempt == max_retries - 1:
            raise  # out of attempts: surface the error to the caller
        time.sleep(retry_delay)
        retry_delay *= 2  # exponential backoff: double the wait each time
output
Hello! How can I assist you today?
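
To make the backoff arithmetic concrete, the following sketch computes the waits that the retry loop above would use in the worst case. The helper name backoff_schedule and its parameters are hypothetical, introduced only for illustration; they are not part of the OpenAI SDK or the Cerebras API.

```python
def backoff_schedule(max_retries: int, initial_delay: float, factor: float = 2.0) -> list[float]:
    """Return the sleep durations between attempts, in seconds.

    With max_retries total attempts there are at most max_retries - 1 sleeps,
    because the final failure is re-raised instead of being slept on.
    """
    delays = []
    delay = initial_delay
    for _ in range(max_retries - 1):
        delays.append(delay)
        delay *= factor  # same doubling step as the retry loop
    return delays


schedule = backoff_schedule(max_retries=5, initial_delay=1.0)
print(schedule)       # [1.0, 2.0, 4.0, 8.0]
print(sum(schedule))  # 15.0
```

With these settings, a request that is rate limited on every attempt waits 15 seconds in total before the error is re-raised, which is worth keeping in mind when choosing max_retries for latency-sensitive code.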

Preventing it in production

  • Implement robust retry logic with exponential backoff and jitter to avoid synchronized retries.
  • Monitor your API usage and rate limits to adjust request frequency proactively.
  • Use circuit breakers or fallback mechanisms to degrade gracefully when limits are hit.
  • Cache frequent responses to reduce unnecessary API calls.
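
The first bullet can be sketched as a small helper that applies capped exponential backoff with "full jitter": each wait is a random value between zero and the current backoff ceiling, so many clients that were limited at the same moment do not all retry at the same moment. This is a minimal sketch under stated assumptions, not a definitive implementation: retry_with_jitter, its parameters, and FakeRateLimitError are hypothetical names used for illustration, and the demo uses a fake failing function rather than a real API call.

```python
import random
import time


def retry_with_jitter(call, retry_on, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Run call() and retry on retry_on using capped backoff with full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Backoff ceiling doubles per attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling.
            sleep(rng() * ceiling)


# Demo with a stand-in exception and a function that fails twice, then succeeds.
class FakeRateLimitError(Exception):
    pass


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError()
    return "ok"


slept = []  # record the waits instead of actually sleeping
result = retry_with_jitter(flaky, FakeRateLimitError, sleep=slept.append)
print(result, len(slept))  # ok 2
```

In the article's setting, the same helper could be called with a lambda wrapping client.chat.completions.create(...) as call and the SDK's RateLimitError as retry_on. Making sleep and rng injectable is a design choice that keeps the helper testable without real delays.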

Key Takeaways

  • Use exponential backoff retry logic to handle Cerebras RateLimitError gracefully.
  • Read your API key from an environment variable rather than hardcoding it, so the key stays out of source control.
  • Monitor and limit request rates proactively to prevent hitting rate limits in production.
Verified 2026-04 · llama3.3-70b