RateLimitError
cerebras.RateLimitError (HTTP 429)
Stack trace
cerebras.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded', 'type': 'requests', 'code': 'rate_limit_exceeded'}}
Why it happens
Cerebras enforces strict rate limits on API requests to prevent abuse and ensure fair usage. When your application sends more requests than allowed within a time window, the API responds with HTTP 429, which the client raises as cerebras.RateLimitError to signal that you must slow down.
Detection
Monitor API responses for HTTP 429 status codes, or catch cerebras.RateLimitError exceptions, so that rate-limit breaches are detected and handled instead of crashing your app.
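One way to make rate-limit hits visible is to count and log them as they occur. The sketch below is a minimal illustration and not part of any Cerebras library: RateLimitError is a local stand-in for cerebras.RateLimitError, and RateLimitMonitor and flaky_call are hypothetical names introduced here.

```python
import logging

logger = logging.getLogger("rate_limit_monitor")


class RateLimitError(Exception):
    """Local stand-in for cerebras.RateLimitError (assumption, for illustration)."""


class RateLimitMonitor:
    """Counts rate-limit errors so they can be logged or alerted on."""

    def __init__(self, error_type=RateLimitError):
        self.error_type = error_type
        self.total_calls = 0
        self.rate_limited = 0

    def call(self, func, *args, **kwargs):
        """Run func, record a rate-limit error if it raises, then re-raise."""
        self.total_calls += 1
        try:
            return func(*args, **kwargs)
        except self.error_type:
            self.rate_limited += 1
            logger.warning(
                "Rate limit hit (%d of %d calls)", self.rate_limited, self.total_calls
            )
            raise


def flaky_call(should_fail):
    """Hypothetical API call that fails with a rate-limit error on demand."""
    if should_fail:
        raise RateLimitError("Error code: 429")
    return "ok"
```

In a real application, error_type would be the exception class exported by the client library, and the counters could feed whatever metrics or alerting system is already in place.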
Causes & fixes
Sending too many requests in a short period, exceeding the Cerebras API limits
Implement exponential backoff and retry logic with delays between requests to stay within rate limits.
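A minimal sketch of exponential backoff with jitter is shown here. It makes several assumptions: RateLimitError is a local stand-in for cerebras.RateLimitError, call_with_backoff is a hypothetical helper name, and the retry count and delays are illustrative values rather than documented Cerebras limits.

```python
import random
import time


class RateLimitError(Exception):
    """Local stand-in for cerebras.RateLimitError (assumption, for illustration)."""


def call_with_backoff(func, *, max_retries=5, base_delay=1.0, max_delay=30.0,
                      error_type=RateLimitError, sleep=time.sleep):
    """Call func(), retrying on error_type with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except error_type:
            if attempt == max_retries:
                raise  # Out of retries: let the caller handle the error.
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the client from this page's code sample, a call might look like call_with_backoff(lambda: client.call_model('model-name', input_data), error_type=RateLimitError). The sleep parameter exists only so the delay can be replaced in tests.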
Parallel or concurrent requests exceeding the allowed quota
Throttle concurrency by limiting the number of simultaneous API calls using a queue or semaphore.
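The semaphore approach can be sketched as follows. This is an illustration under stated assumptions: MAX_CONCURRENT_CALLS is an arbitrary value (the real limit depends on your quota), and throttled and run_all are hypothetical helper names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_CALLS = 2  # Assumed limit; set it from your actual quota.
_api_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)


def throttled(func, *args, **kwargs):
    """Run func while holding one of the limited API slots."""
    with _api_slots:
        return func(*args, **kwargs)


def run_all(func, inputs, workers=8):
    """Submit every input to a thread pool; the semaphore caps in-flight calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: throttled(func, item), inputs))
```

A pool with max_workers set to the limit would also cap concurrency, but a shared semaphore keeps the cap in one place when several pools or threads call the API.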
Using an API key with low rate limits, or sharing one key across services, so that the quota is exhausted
Request a higher rate limit from Cerebras support or use dedicated API keys per service to distribute load.
Code: broken vs fixed
# Broken: hardcoded API key and no handling of RateLimitError
from cerebras import CerebrasClient

client = CerebrasClient(api_key='hardcoded_key')
response = client.call_model('model-name', input_data)  # This line triggers RateLimitError
print(response)

# Fixed: API key read from an environment variable, one retry after a delay
import os
import time

from cerebras import CerebrasClient, RateLimitError

client = CerebrasClient(api_key=os.environ['CEREBRAS_API_KEY'])
try:
    response = client.call_model('model-name', input_data)
    print(response)
except RateLimitError:
    print('Rate limit exceeded, retrying after delay...')
    time.sleep(10)  # Wait before retrying
    response = client.call_model('model-name', input_data)
    print(response)
Workaround
Wrap API calls in try/except RateLimitError, wait a fixed delay (e.g., 10 seconds), then retry the request. This handles rate limits only temporarily: a single retry after a fixed delay can fail again if the limit has not yet reset.
Prevention
Implement client-side rate limiting with exponential backoff and concurrency control, and request higher quotas from Cerebras if needed to avoid hitting limits.
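Client-side rate limiting can be sketched with a sliding-window limiter that blocks before each call. This is an illustration under stated assumptions: SlidingWindowLimiter is a hypothetical class, and max_calls and period must be chosen from your own quota, since this page does not state the actual Cerebras limits.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._timestamps = deque()  # Start times of recent calls.
        self._lock = threading.Lock()

    def acquire(self):
        """Block until another call is allowed, then record it."""
        with self._lock:
            while True:
                now = self._clock()
                # Drop calls that have fallen out of the window.
                while self._timestamps and now - self._timestamps[0] >= self.period:
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_calls:
                    self._timestamps.append(now)
                    return
                # Wait until the oldest recorded call leaves the window.
                self._sleep(self.period - (now - self._timestamps[0]))
```

Calling limiter.acquire() immediately before each API request keeps the request rate under the configured ceiling. Combining it with retry-on-429 logic covers the cases where the server-side limit is lower than assumed.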