Rate limit errors: quota exceeded
Why this matters
Rate limit errors are the most common production failure in API integrations. Without proper handling, your application crashes instead of gracefully retrying. Understanding quota structure prevents surprise billing and service interruptions.
Explanation
What happens: The Gemini API allows a maximum number of requests per minute depending on your billing tier (free tier: 60 requests/minute, paid: 360+ requests/minute). When you exceed this quota, the API returns HTTP 429 (Too Many Requests) with a Retry-After header indicating how long to wait before retrying.

How it works: The Google Generative AI SDK does not automatically retry on 429 errors: it raises a google.api_core.exceptions.ResourceExhausted exception. Your code must catch this and implement exponential backoff (wait 1 second, then 2, then 4, etc.) before retrying. The Retry-After header tells you the exact minimum wait time.

When to use it: Always implement this pattern in production. For batch processing, use delays between requests or queue your requests to respect the rate limit upfront.
Request code
import google.generativeai as genai
import os
import time
from google.api_core.exceptions import ResourceExhausted

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

def call_with_backoff(prompt, max_retries=3):
    # Try up to max_retries times, doubling the wait after each 429.
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except ResourceExhausted as e:
            # Last attempt failed: re-raise so the caller sees the rate limit error.
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited. Waiting {wait_time}s (attempt {attempt + 1}/{max_retries} failed)")
            time.sleep(wait_time)

result = call_with_backoff("Explain quantum computing in 100 words")
print(result)

Authentication
Rate limits are enforced per API key. Set your Google API key as an environment variable: export GOOGLE_API_KEY="your-api-key-here". The example reads the key from the environment and passes it to the SDK with genai.configure(api_key=os.environ['GOOGLE_API_KEY']). Free tier keys have lower limits than paid keys; verify your billing setup if you hit limits immediately.
Response shape
| Field | Description |
|---|---|
| text | The generated text response from the model |
| usage_metadata | Token counts for the request, including total_token_count |
| finish_reason | STOP (completed normally) or MAX_TOKENS (hit length limit) |
Field guide
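text: The actual generated content. This is what you display or process.
usage_metadata.total_token_count: Critical for cost tracking. Multiply by the per-token rate ($0.075 per million tokens is the figure used here for Gemini 2.0 Flash) to estimate billing. Input and output tokens are usually priced differently, so treat the result as a rough estimate and check the current pricing page.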
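The multiplication described above can be sketched as a small helper. This is a rough estimate only: the function name `estimate_cost_usd` and the sample token counts are made up for illustration, and the default rate is the $0.075 per million tokens quoted in this section, applied as one flat rate to every token.

```python
def estimate_cost_usd(total_tokens: int, usd_per_million_tokens: float = 0.075) -> float:
    """Rough cost estimate: one flat rate applied to every token.

    The default rate is the figure quoted in this section; it is an
    assumption, not a value read from the API.
    """
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical token counts, as they might be collected from
# response.usage_metadata.total_token_count after each request.
token_counts = [750, 1200, 430]
total_tokens = sum(token_counts)

print(f"{total_tokens} tokens, about ${estimate_cost_usd(total_tokens):.6f}")
```

Summing the counts first and converting once keeps the rate in a single place, which makes it easy to swap in a different price when the model or tier changes.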
Setup trap
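The exponential backoff loop looks simple, but many developers hardcode a fixed 2-second sleep instead of checking the Retry-After header. If Retry-After says to wait 5 seconds, a request sent after 2 seconds will fail again. Use the header when it is available and fall back to exponential backoff otherwise, for example: retry_after = int(e.response.headers.get('Retry-After', 2 ** attempt)). Guard that lookup: depending on the transport, e.response may be None or may not expose headers, in which case use 2 ** attempt.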
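One way to keep the "header first, backoff second" rule in a single place is a small pure function. The sketch below is an assumption-laden illustration: `backoff_delay` is a hypothetical helper, it only understands the delay-in-seconds form of Retry-After, and it deliberately does not show how to extract the header from the exception, since that depends on the transport in use.

```python
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Return how many seconds to wait before the next retry.

    Prefers a server-provided Retry-After value (seconds form) and falls
    back to exponential backoff (1s, 2s, 4s, ...) when the value is
    missing or cannot be parsed as a number.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            # For example an HTTP-date value; not handled in this sketch.
            pass
    return float(2 ** attempt)


print(backoff_delay(0))        # no header: exponential backoff
print(backoff_delay(1, "5"))   # header present: use the server's value
```

Inside a retry loop, the result would be passed to time.sleep in place of the fixed 2 ** attempt wait.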
Cost
On the free tier, 60 requests/minute = 3,600 requests/hour. With Gemini 2.0 Flash at ~750 tokens per request, that's roughly 2.7M tokens/hour. Exceeding quota doesn't cost extra, but hitting the limit stops your requests, and there is no per-request overage fee you can pay to keep going. The only way to raise the limit is to upgrade your plan. Do not distribute load across multiple API keys to get around it: that is against the terms of service.
Rate limits
Rate limits are the #1 cause of production failures with the Gemini API. The free tier is especially tight at 60 req/min. If you hit limits within seconds of starting, your quota tier is too low for your workload. Check your billing page at console.cloud.google.com and consider upgrading to a paid plan.
Common gotcha
Developers catch Exception broadly and lose the original error. If you catch Exception, you mask the 429 error and retry on unrelated failures (like invalid input), wasting time and quota. Always catch google.api_core.exceptions.ResourceExhausted specifically.
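The difference is easy to see in a small, self-contained sketch. To keep it runnable without the SDK installed, `RateLimited` below is a stand-in for google.api_core.exceptions.ResourceExhausted, and `retry_on`, `flaky`, and `bad_input` are hypothetical names used only for illustration.

```python
import time


class RateLimited(Exception):
    """Stand-in for google.api_core.exceptions.ResourceExhausted."""


def retry_on(exc_type, fn, max_retries=3, sleep=time.sleep):
    """Call fn, retrying with exponential backoff only when exc_type is raised."""
    for attempt in range(max_retries):
        try:
            return fn()
        except exc_type:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)


calls = {"flaky": 0, "bad": 0}


def flaky():
    # Fails with a rate limit error twice, then succeeds.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RateLimited("429")
    return "ok"


def bad_input():
    # Fails for a reason that retrying cannot fix.
    calls["bad"] += 1
    raise ValueError("invalid prompt")


no_wait = lambda seconds: None  # skip real sleeping in this demo

result = retry_on(RateLimited, flaky, sleep=no_wait)

try:
    retry_on(RateLimited, bad_input, sleep=no_wait)
except ValueError:
    pass  # propagated immediately, without retries

print(result, calls)
```

Because the except clause names only the rate limit exception, the ValueError passes straight through after a single call instead of consuming retries and quota.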
Error recovery
google.api_core.exceptions.ResourceExhausted
google.auth.exceptions.DefaultCredentialsError
ValueError

Experienced dev note
Senior teams implement a request queue with built-in rate limiting instead of relying on catch-retry. Use a library like tenacity or backoff to declaratively specify retry strategy. Also: monitor your usage_metadata.total_token_count across requests: you can predict quota exhaustion before it happens by tracking tokens/second and comparing to your tier limit. This saves money and prevents production surprises.
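As a minimal sketch of limiting requests up front rather than reacting to 429s, the class below keeps a sliding window of send times and blocks when the window is full. It uses only the standard library; `MinuteRateLimiter` is a hypothetical name, the 60-per-minute figure is the free tier number quoted in this section, and a production queue would also need thread safety, which is omitted here.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Allow at most max_calls requests in any rolling window of `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.sent = deque()     # timestamps of requests inside the current window

    def _drop_expired(self, now):
        while self.sent and now - self.sent[0] >= self.period:
            self.sent.popleft()

    def acquire(self):
        """Block until a request may be sent; return the seconds waited."""
        now = self.clock()
        self._drop_expired(now)
        waited = 0.0
        if len(self.sent) >= self.max_calls:
            # Wait until the oldest request leaves the window.
            waited = self.period - (now - self.sent[0])
            self.sleep(waited)
            now = self.clock()
            self._drop_expired(now)
        self.sent.append(now)
        return waited


limiter = MinuteRateLimiter(max_calls=60)
for prompt in ["first", "second", "third"]:
    limiter.acquire()  # returns immediately while under the limit
    # model.generate_content(prompt) would go here
    print("sending", prompt)
```

Calling limiter.acquire() before every request keeps the client under the quota, so the catch-and-retry path becomes a safety net rather than the main control mechanism.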
Check your understanding
If your code hits a 429 error and you retry immediately (after 1ms), why will it fail again?
Answer hint
The rate limit is enforced per minute, not per second. Retrying immediately sends another request into the same minute window, which is still over quota. You must wait until the next minute boundary or respect the Retry-After header duration.
The SDK raises google.api_core.exceptions.ResourceExhausted for rate limits. Older 0.7.x versions used different exception names. Always pin your SDK version: pip install google-generativeai==0.8.1.