API Advanced medium · 5 min

Quota increase requests

What you will learn

Request higher rate limits and quota tiers for Gemini API production deployments through the Google Cloud Console.

Why this matters

Free-tier quotas (60 requests per minute for most models) will throttle production applications; understanding the quota system and increase process prevents runtime failures and unexpected latency spikes in production.

Skip if: You're prototyping locally or your application traffic never exceeds 60 requests/minute; quota increases aren't necessary in either case. To absorb occasional spikes, use batch processing or client-side rate limiting instead of requesting a higher quota.

Explanation

What it does: The Gemini API enforces rate limits and daily quotas by default. Free tier gets 60 requests per minute (RPM) for most models. To exceed these limits, you request a quota increase through the Google Cloud Console: this is not an API call, but a form submission in your GCP project.

How it works: Google's quota system tracks requests per project, per model, and per minute or day. When you hit a limit, the google.generativeai SDK raises google.api_core.exceptions.ResourceExhausted. Quota increase requests are evaluated based on your billing account status, project history, and requested tier. Approved increases take effect immediately or within minutes; you do not need to restart your application.

When to use it: Request quota increases before production launch if your expected traffic exceeds 60 RPM, or after monitoring reveals you're consistently hitting limits. Premium tier and higher request limits require a paid Google Cloud billing account.

Request code

python

import os
import time

import google.generativeai as genai
from google.api_core.exceptions import ResourceExhausted

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

def send_request_with_retry(prompt, max_retries=3):
    """Call the model, backing off exponentially when the quota is exhausted."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except ResourceExhausted:
            # Quota hit. Any other exception propagates unchanged.
            if attempt == max_retries - 1:
                break  # no attempts left, so don't sleep again
            wait_time = (2 ** attempt) + 1  # 2s, 3s, 5s, ...
            print(f'Rate limited on attempt {attempt + 1}/{max_retries}. Waiting {wait_time}s')
            time.sleep(wait_time)
    raise RuntimeError('Max retries exceeded due to rate limit')

result = send_request_with_retry('What is machine learning?')
print(result)

Authentication

Access quotas through Google Cloud Console:

1) Navigate to your GCP project.
2) Go to APIs & Services → Quotas.
3) Search for 'Generative AI'.
4) Select the model (gemini-2.0-flash, gemini-2.5-pro, etc.).
5) Click 'Edit Quotas' and specify your requested RPM or daily limit.
6) Submit with a use case description.

Approval typically takes hours to days for legitimate production use.

Response shape

Field	Description
`text`	Generated content string
`error_message`	Raised as exception, not in response (ResourceExhausted if quota hit)

Field guide

text

The actual generated response: if you receive this, the request succeeded before hitting quota

ResourceExhausted_exception

Raised when the current request exceeds your approved quota tier; inspect the error message for current usage metrics

Setup trap

A common mistake is submitting a quota increase request without enabling a paid Google Cloud billing account. The request will be silently rejected or capped at free-tier limits. Enable billing (via Billing → Create Billing Account) before submitting a request for production-tier quotas (>100 RPM).

Cost

Quota increases themselves are free, but higher-tier quotas imply higher usage, which incurs API charges. Gemini 2.0 Flash costs approximately $0.075 per 1M input tokens and $0.30 per 1M output tokens (April 2026). A 600 RPM increase could cost $200-500/month depending on token consumption. Monitor your usage in the Cloud Console Billing section.
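
To see how a figure in that range can arise, here is a rough back-of-the-envelope calculation using the per-token prices quoted above. The workload numbers (average utilization of the quota and tokens per request) are illustrative assumptions, not measurements, and `monthly_cost` is a hypothetical helper.

```python
# Prices quoted in the text above (USD per 1M tokens, Gemini 2.0 Flash).
INPUT_PRICE_PER_M = 0.075
OUTPUT_PRICE_PER_M = 0.30

def monthly_cost(avg_rpm, input_tokens, output_tokens, days=30):
    """Estimate monthly spend for a steady average request rate."""
    requests = avg_rpm * 60 * 24 * days
    input_cost = requests * input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = requests * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost

# Assumed workload: a 600 RPM quota that is 10% utilized on average
# (60 RPM sustained), with 500 input and 300 output tokens per request.
estimate = monthly_cost(avg_rpm=60, input_tokens=500, output_tokens=300)
print(f'Estimated monthly cost: ${estimate:,.2f}')
```

Under these assumptions the estimate comes to about $330 per month. Output tokens dominate the bill, so response length matters more than request count once traffic is steady.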

Rate limits

Rate limits apply per minute and per day. The per-minute limit (RPM) is the one most commonly hit in production. If your application needs 200 or more requests per second, you may hit an unpublished cap even after a quota increase; contact Google Cloud support about specialized tier access.
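
A minimal sketch of tracking per-minute usage on the client side, so a service can tell how close it is to its RPM limit before the server starts rejecting requests. The class name `RpmTracker` and its parameters are illustrative, not part of any SDK. The count is local to one process, so it only approximates what Google enforces per project.

```python
import time
from collections import deque

class RpmTracker:
    """Counts requests in a sliding 60-second window (hypothetical helper)."""

    def __init__(self, limit_rpm=60, window_seconds=60.0, clock=time.monotonic):
        self.limit_rpm = limit_rpm
        self.window_seconds = window_seconds
        self.clock = clock
        self._timestamps = deque()

    def _evict_old(self, now):
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def record(self):
        """Call once for every request sent."""
        now = self.clock()
        self._evict_old(now)
        self._timestamps.append(now)

    def usage_fraction(self):
        """Requests in the last window divided by the limit."""
        self._evict_old(self.clock())
        return len(self._timestamps) / self.limit_rpm

tracker = RpmTracker(limit_rpm=60)
for _ in range(45):
    tracker.record()
print(f'Using {tracker.usage_fraction():.0%} of the per-minute limit')
```

Calling `record()` next to each API call and logging `usage_fraction()` gives an early warning signal; with several processes sharing one project, the per-process fractions have to be summed to compare against the project limit.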

Common gotcha

Developers often assume quota increases are automatic based on billing account age: they are not. You must explicitly request increases through the Console. Free-tier accounts cannot request increases above 60 RPM for Gemini 2.0 Flash; you must enable paid billing first (even if no charges accrue).

Error recovery

google.api_core.exceptions.ResourceExhausted

Your current request exceeded your approved quota. Retry with exponential backoff (for example, 2^attempt + 1 seconds, as in the request code) or request a quota increase. Check the Cloud Console Quotas page to compare current usage with the approved limit.
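
The backoff idea can be sketched independently of the SDK, which makes it easy to test without network access. In this sketch `backoff_delays`, `call_with_backoff`, and `FakeQuotaError` are hypothetical names; the random jitter and the 60-second cap are additions not mentioned above, included so that many clients do not retry in lockstep. In real use you would pass `google.api_core.exceptions.ResourceExhausted` as `retry_on`.

```python
import random
import time

def backoff_delays(max_retries, base=2.0, cap=60.0, jitter=0.5):
    """Yield one wait time per retry: base**attempt (capped) plus random jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base ** attempt)
        yield delay + random.uniform(0, jitter)

def call_with_backoff(func, retry_on, max_retries=3, sleep=time.sleep):
    """Call func(); on a retry_on exception, wait and try again."""
    for delay in backoff_delays(max_retries):
        try:
            return func()
        except retry_on:
            sleep(delay)
    return func()  # final attempt; any exception propagates to the caller

class FakeQuotaError(Exception):
    """Stand-in for google.api_core.exceptions.ResourceExhausted."""

calls = {'n': 0}

def flaky():
    # Fails twice with a quota error, then succeeds.
    calls['n'] += 1
    if calls['n'] < 3:
        raise FakeQuotaError('quota exceeded')
    return 'ok'

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, FakeQuotaError, sleep=waits.append)
print(result, len(waits))
```

Injecting `sleep` keeps the demo instant; leaving it at the default makes the function wait for real between attempts.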

google.api_core.exceptions.PermissionDenied

Your API key or GCP project lacks permission to use the API. Enable it: APIs & Services → Library → search 'Generative Language API' (the service behind the Gemini API) → Enable. This is separate from quota and must be done before requesting increases.

Request limit (60 RPM) silently enforced

Your quota increase request was not approved, or the billing account is not active. Check the Quotas page: if it still shows 60 RPM, verify the billing account status in the Billing tab, then resubmit the request.

Experienced dev note

Quota limits apply per project, not per API key, so creating extra API keys in the same project gains nothing. Spinning up new GCP projects to bypass limits will also fail, because Google tracks usage at the billing account level. Instead, use Gemini's batch processing API if it is available for your use case, or implement client-side token bucket rate limiting to smooth out traffic spikes without requesting higher quotas. Most production outages from quota hits happen because teams didn't monitor usage trends; set up a Cloud Monitoring alert that fires when usage exceeds 70% of the approved quota.
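
A minimal sketch of client-side token bucket rate limiting, assuming a single-threaded caller. `TokenBucket` is a hypothetical helper, not part of the Gemini SDK, and the rate and capacity values are illustrative. Tokens refill continuously at the configured rate; each request spends one, and a burst can use at most `capacity` tokens at once.

```python
import time

class TokenBucket:
    """Client-side token bucket (hypothetical helper, not thread-safe)."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def acquire(self):
        """Block until a token is available, then take it."""
        while not self.try_acquire():
            # Sleep just long enough for the missing fraction of a token.
            self.sleep((1 - self.tokens) / self.rate_per_second)

bucket = TokenBucket(rate_per_minute=60, capacity=5)
allowed = sum(bucket.try_acquire() for _ in range(8))
print(f'{allowed} of 8 burst requests allowed immediately')
```

Calling `bucket.acquire()` before each API request keeps sustained traffic at or below `rate_per_minute` while still letting short bursts through. A multi-threaded service would need a lock around the refill and the token decrement.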

Check your understanding

Your production service is hitting the 60 RPM limit at 8am daily. You submit a quota increase request to 300 RPM and it's approved at 9am. At 7:55am the next day, your service still fails. What's the most likely reason, and how would you verify it?

Answer hint

The limit is enforced server-side, so an approved increase applies to new requests without any change on the client. If the service still fails, the increase is probably not in effect for the requests it sends, or the billing account status has changed. Verify: 1) the Cloud Console Quotas page shows 300 RPM approved for the project and model the service actually calls, 2) the billing account is active (not suspended), 3) if both check out, restart the service to rule out stale client connections. A restart isn't strictly needed, but it removes one source of confusion.

VERSION google-generativeai 0.8.x uses the same quota system as 0.7.x. Quota limits are enforced at the Google Cloud API Gateway level, not in the SDK: upgrading the library will not change your quota. The ResourceExhausted exception structure and error messages are stable as of April 2026.
