API Beginner easy · 5 min

Rate limit errors: quota exceeded

What you will learn

The Gemini API enforces rate limits on requests per minute; exceeding them returns a 429 error that you must handle with exponential backoff.

Why this matters

Rate limit errors are the most common production failure in API integrations. Without proper handling, your application crashes instead of gracefully retrying. Understanding quota structure prevents surprise billing and service interruptions.

Skip if: You're building a local-only prototype or testing with a single request; rate limit handling is unnecessary there. The moment you move to production or batch processing, it becomes critical, so do not put it off for 'later': it will bite you.

Explanation

What happens: The Gemini API caps the number of requests per minute according to your billing tier (for example, free tier: 60 requests/minute; paid: 360+ requests/minute; exact limits vary by model and change over time, so check your own quota). When you exceed the quota, the API returns HTTP 429 (Too Many Requests), and the response may carry a Retry-After header indicating how long to wait before retrying.

How it works: The Google Generative AI SDK does not automatically retry on 429 errors: it raises a google.api_core.exceptions.ResourceExhausted exception. Your code must catch this and implement exponential backoff (wait 1 second, then 2, then 4, etc.) before retrying. When a Retry-After value is present, treat it as the minimum wait time.

When to use it: Always implement this pattern in production. For batch processing, add delays between requests or queue your requests so that you respect the rate limit upfront instead of reacting to 429s.
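The "delays between requests" idea for batch processing can be sketched as follows. This is a minimal illustration, not part of the SDK: run_paced and min_interval_seconds are hypothetical names, and the call argument stands in for whatever function sends one request.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def min_interval_seconds(requests_per_minute: int) -> float:
    """Smallest gap between requests that stays within a per-minute limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


def run_paced(
    items: Iterable[T],
    call: Callable[[T], R],
    requests_per_minute: int,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not wait
) -> List[R]:
    """Apply `call` to each item, pausing between calls to respect the limit."""
    interval = min_interval_seconds(requests_per_minute)
    results: List[R] = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(interval)  # space requests evenly instead of bursting
        results.append(call(item))
    return results
```

Pacing like this reduces how often you hit a 429, but it does not replace the retry logic: other clients sharing the same quota can still push you over the limit.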

Request code

python

import google.generativeai as genai
import os
import time
from google.api_core.exceptions import ResourceExhausted

# Read the key from the environment rather than hardcoding it
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

def call_with_backoff(prompt, max_retries=3):
    """Call the model, retrying with exponential backoff on rate limit errors."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except ResourceExhausted:
            # Out of attempts: re-raise so the caller sees the original error
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited. Waiting {wait_time}s before attempt {attempt + 2}/{max_retries}")
            time.sleep(wait_time)

result = call_with_backoff("Explain quantum computing in 100 words")
print(result)

Authentication

Rate limits are tied to the account and billing tier behind your API key. Set your Google API key as an environment variable: export GOOGLE_API_KEY="your-api-key-here". Your code then passes it to the SDK by calling genai.configure(api_key=os.environ['GOOGLE_API_KEY']). Free tier keys have lower limits than paid keys: verify your billing setup if you hit limits immediately.

Response shape

Field	Description
`text`	The generated text response from the model
`usage_metadata`	Token usage for the request, including `total_token_count`
`finish_reason`	STOP (completed normally) or MAX_TOKENS (hit length limit)

Field guide

text

The actual generated content: this is what you display or process

usage_metadata.total_token_count

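Critical for cost tracking; multiply by your model's price per token to estimate billing. This guide uses $0.075 per million tokens for Gemini 2.0 Flash as a working figure; input and output tokens are usually priced differently, so confirm against the current pricing page.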
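As a rough illustration of that multiplication, the sketch below sums token counts collected from several responses and applies a flat per-million price. The function name estimate_cost_usd is hypothetical, and the default price is simply the figure quoted in this guide, not an authoritative rate.

```python
from typing import Iterable


def estimate_cost_usd(
    token_counts: Iterable[int],
    usd_per_million_tokens: float = 0.075,  # assumption: the figure quoted in this guide
) -> float:
    """Estimate spend from a series of usage_metadata.total_token_count values."""
    total_tokens = sum(token_counts)
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Token counts you might have collected from three responses
counts = [750, 1200, 50]
print(f"Estimated cost: ${estimate_cost_usd(counts):.6f}")
```

A flat rate is a simplification: if input and output tokens are billed at different prices, track them separately and apply each price to its own count.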

Setup trap

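The exponential backoff loop looks simple, but a common mistake is to hardcode a fixed 2-second sleep and ignore the Retry-After header. If Retry-After says to wait 5 seconds, a retry after 2 seconds will fail again. Use the header when it is available, for example: retry_after = int(e.response.headers.get('Retry-After', 2 ** attempt)). This assumes the exception exposes an HTTP response with headers, which depends on the transport the SDK uses; when it does not, fall back to 2 ** attempt.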
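One way to keep that decision safe is to isolate it in a small helper that takes the raw header value (or None) and the attempt number. This is a sketch under assumptions: choose_wait_seconds is a hypothetical name, how you obtain the header from the exception is left out because it depends on the transport, and only the numeric form of Retry-After is handled.

```python
import math
from typing import Optional


def choose_wait_seconds(retry_after: Optional[str], attempt: int) -> float:
    """Prefer a server-provided delay; otherwise fall back to 2 ** attempt."""
    fallback = float(2 ** attempt)
    if retry_after is None:
        return fallback
    try:
        server_delay = float(retry_after.strip())
    except ValueError:
        # Retry-After can also be an HTTP date; this sketch does not parse that form
        return fallback
    if not math.isfinite(server_delay) or server_delay < 0:
        return fallback
    return server_delay


print(choose_wait_seconds("5", attempt=1))   # server asked for 5 seconds
print(choose_wait_seconds(None, attempt=2))  # no header, so use 2 ** 2
```

Keeping the parsing in one place means a malformed or missing header degrades to plain exponential backoff instead of raising inside your retry loop.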

Cost

On the free tier, 60 requests/minute = 3,600 requests/hour. With Gemini 2.0 Flash at ~750 tokens per request, that's roughly 2.7M tokens/hour. Exceeding quota doesn't cost extra, but requests are rejected until the window resets. The only legitimate way to raise the limit is to upgrade your plan. Do not distribute load across multiple API keys to get around it: that is against the terms of service.
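The arithmetic above is easy to re-run for your own tier and request size. The sketch below uses a hypothetical helper, hourly_throughput, with the numbers from this section as inputs.

```python
from typing import Tuple


def hourly_throughput(requests_per_minute: int, tokens_per_request: int) -> Tuple[int, int]:
    """Return (requests per hour, tokens per hour) at a sustained request rate."""
    requests_per_hour = requests_per_minute * 60
    tokens_per_hour = requests_per_hour * tokens_per_request
    return requests_per_hour, tokens_per_hour


requests_per_hour, tokens_per_hour = hourly_throughput(60, 750)
print(requests_per_hour)  # 3600
print(tokens_per_hour)    # 2700000
```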

Rate limits

Rate limits are the #1 cause of production failures with the Gemini API. The free tier is especially tight at 60 req/min. If you hit limits within seconds of starting, your quota tier is probably too low for your workload. Check your billing page at console.cloud.google.com and consider upgrading to a paid plan.

Common gotcha

Developers catch Exception broadly and lose the original error. If you catch Exception, you mask the 429 error and retry on unrelated failures (like invalid input), wasting time and quota. Always catch google.api_core.exceptions.ResourceExhausted specifically.
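The difference is easy to see in a retry helper that only retries the exception types you name. This is an illustrative sketch: retry_on is a hypothetical helper, and QuotaExceeded is a stand-in for google.api_core.exceptions.ResourceExhausted so the example runs without the SDK.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

R = TypeVar("R")


class QuotaExceeded(Exception):
    """Hypothetical stand-in for google.api_core.exceptions.ResourceExhausted."""


def retry_on(
    call: Callable[[], R],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not wait
) -> R:
    """Retry `call` only for the listed exception types; let everything else propagate."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

With retryable=(QuotaExceeded,), a ValueError from bad input escapes on the first call instead of being retried; passing (Exception,) would reproduce the gotcha described above.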

Error recovery

google.api_core.exceptions.ResourceExhausted

Raised when you exceed the rate limit. Implement exponential backoff with jitter: wait_time = (2 ** attempt) + random.uniform(0, 1). If a Retry-After value is available, honor it before falling back to the computed wait.
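The jitter formula can be wrapped in a small function. This is a minimal sketch: backoff_with_jitter is a hypothetical name, and the max_wait cap is an added assumption to stop the delay from growing without bound, not something the API requires.

```python
import random
from typing import Optional


def backoff_with_jitter(
    attempt: int,
    max_wait: float = 60.0,  # assumption: cap so late attempts do not wait for minutes
    rng: Optional[random.Random] = None,  # pass a seeded Random for reproducible tests
) -> float:
    """Return (2 ** attempt) plus up to one second of random jitter, capped at max_wait."""
    jitter = rng.uniform(0, 1) if rng is not None else random.uniform(0, 1)
    return min((2 ** attempt) + jitter, max_wait)


for attempt in range(4):
    print(f"attempt {attempt}: wait {backoff_with_jitter(attempt):.2f}s")
```

The jitter matters when many clients are rate limited at the same moment: without it they all retry on the same schedule and collide again.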

google.auth.exceptions.DefaultCredentialsError

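API key not set or not found by the SDK. Ensure the GOOGLE_API_KEY environment variable is exported and contains a valid key from console.cloud.google.com. Note that the request code in this guide reads os.environ['GOOGLE_API_KEY'] directly, so a missing variable raises KeyError before the SDK is called.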

ValueError

Raised for invalid input (e.g., empty prompt). This is NOT rate limiting: do not retry. Fix the input and re-run.

Experienced dev note

Senior teams implement a request queue with built-in rate limiting instead of relying only on catch-and-retry. Use a library like tenacity or backoff to specify the retry strategy declaratively. Also monitor usage_metadata.total_token_count across requests: by tracking tokens/second and comparing it to your tier limit, you can predict quota exhaustion before it happens and throttle early.
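A client-side limiter of the kind described here can be sketched with the standard library alone. This is one possible design under assumptions, not a drop-in component: SlidingWindowLimiter is a hypothetical class, it is not thread-safe, and it assumes the server counts requests over a rolling window of `period` seconds.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` acquisitions in any rolling `period` seconds."""

    def __init__(
        self,
        max_calls: int,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_calls <= 0:
            raise ValueError("max_calls must be positive")
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        """Block until one more request fits inside the current window."""
        while True:
            now = self._clock()
            # Forget requests that have aged out of the window
            while self._sent and now - self._sent[0] >= self.period:
                self._sent.popleft()
            if len(self._sent) < self.max_calls:
                self._sent.append(now)
                return
            # Window is full: wait until the oldest request ages out, then re-check
            self._sleep(self.period - (now - self._sent[0]))
```

Call limiter.acquire() immediately before each API request. Keep the backoff handler in place as well, since the server's accounting may not match your local window exactly.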

Check your understanding

If your code hits a 429 error and you retry immediately (after 1ms), why will it fail again?

Answer hint

The rate limit is enforced per minute, not per second. Retrying immediately sends another request into the same minute window, which is still over quota. You must wait until the next minute boundary or respect the Retry-After header duration.

Version note: google-generativeai 0.8.x uses google.api_core.exceptions.ResourceExhausted for rate limits. Older 0.7.x versions used different exception names. Always pin your SDK version: pip install google-generativeai==0.8.1.
