Quota increase requests
Why this matters
Free-tier quotas (60 requests per minute for most models) will throttle production applications. Understanding the quota system and the increase process prevents runtime failures and unexpected latency spikes.
Explanation
What it does: The Gemini API enforces rate limits and daily quotas by default. The free tier allows 60 requests per minute (RPM) for most models. To exceed these limits, you request a quota increase through the Google Cloud Console. This is not an API call but a form submission in your GCP project.
How it works: Google's quota system tracks requests per project, per model, and per minute or day. When you hit the limit, google.generativeai raises google.api_core.exceptions.ResourceExhausted. Quota increases are evaluated based on your billing account status, project history, and requested tier. Approved increases take effect immediately or within minutes; because limits are enforced server-side, you do not need to restart your application.
When to use it: Request quota increases before production launch if your expected traffic exceeds 60 RPM, or after monitoring reveals you're consistently hitting limits. Premium tier and higher request limits require a paid Google Cloud billing account.
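One way to check whether expected traffic exceeds the limit is to measure peak requests per minute from your own request logs. The sketch below is illustrative only: `peak_rpm`, `needs_quota_increase`, and the `FREE_TIER_RPM` constant are hypothetical names, and it assumes you already have a list of request timestamps as `datetime` objects.

```python
from collections import Counter
from datetime import datetime

FREE_TIER_RPM = 60  # figure quoted in this section; confirm your project's actual limit


def peak_rpm(timestamps):
    """Return the highest number of requests seen in any calendar minute."""
    # Truncate each timestamp to the start of its minute, then count per minute.
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return max(per_minute.values(), default=0)


def needs_quota_increase(timestamps, limit_rpm=FREE_TIER_RPM):
    """True if any single minute in the log exceeded the limit."""
    return peak_rpm(timestamps) > limit_rpm
```

Calendar-minute buckets can understate a burst that straddles two minutes, so treat a result close to the limit as a reason to request more headroom.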
Request code
import os
import time

import google.generativeai as genai
from google.api_core import exceptions

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

def send_request_with_retry(prompt, max_retries=3):
    """Call the model, backing off exponentially when the quota is exhausted."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except exceptions.ResourceExhausted:
            # Quota hit: wait 2s, 3s, 5s, ... before the next attempt.
            # Any other exception propagates to the caller unchanged.
            wait_time = (2 ** attempt) + 1
            print(f'Rate limited. Waiting {wait_time}s before retry {attempt + 1}/{max_retries}')
            time.sleep(wait_time)
    raise RuntimeError('Max retries exceeded due to rate limit')

result = send_request_with_retry('What is machine learning?')
print(result)

Authentication
Access quotas through the Google Cloud Console:
1) Navigate to your GCP project.
2) Go to APIs & Services → Quotas.
3) Filter for the Generative Language API (the service behind the Gemini API).
4) Select the model (gemini-2.0-flash, gemini-2.5-pro, etc.).
5) Click 'Edit Quotas' and specify your requested RPM or daily limit.
6) Submit with a use case description.
Approval typically takes hours to days for legitimate production use.
Response shape
| Field | Description |
|---|---|
| text | Generated content string |
| error_message | Raised as exception, not in response (ResourceExhausted if quota hit) |
Field guide
text: The generated response. If you receive this, the request succeeded without hitting the quota.
ResourceExhausted exception: Raised when the current request exceeds your approved quota tier. Inspect the error message for details of the quota that was exceeded.
Setup trap
A common mistake is submitting a quota increase request without enabling a paid Google Cloud billing account. The request will be silently rejected or capped at free-tier limits. Enable billing (via Billing → Create Billing Account) before submitting a request for production-tier quotas (>100 RPM).
Cost
Quota increases themselves are free, but higher-tier quotas imply higher usage, which incurs API charges. Gemini 2.0 Flash costs approximately $0.075 per 1M input tokens and $0.30 per 1M output tokens (April 2026). A 600 RPM increase could cost $200-500/month depending on token consumption. Monitor your usage in the Cloud Console Billing section.
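To see how token consumption drives the bill, the arithmetic can be sketched as below. The prices are the approximate figures quoted above; the workload (a sustained average of 60 RPM, 500 input tokens and 200 output tokens per request) and the `monthly_cost` helper are hypothetical, chosen only to illustrate the calculation.

```python
INPUT_PRICE_PER_M = 0.075   # USD per 1M input tokens (approximate figure quoted above)
OUTPUT_PRICE_PER_M = 0.30   # USD per 1M output tokens (approximate figure quoted above)


def monthly_cost(avg_rpm, input_tokens, output_tokens, days=30):
    """Estimate monthly spend in USD for a sustained average request rate."""
    requests = avg_rpm * 60 * 24 * days  # requests per month
    input_cost = requests * input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = requests * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical workload: 60 RPM on average, 500 input and 200 output tokens per request.
print(f"${monthly_cost(60, 500, 200):.2f} per month")
```

Under these assumptions the estimate is $252.72 per month: 2,592,000 requests produce 1,296M input tokens ($97.20) and 518.4M output tokens ($155.52). Real traffic is rarely sustained at a constant rate, so substitute your own measured averages.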
Rate limits
Rate limits apply per minute and per day. The per-minute limit (RPM) is the one most commonly hit in production. If your application needs 200+ concurrent requests per second, you may hit an undocumented cap even after a quota increase; contact Google Cloud support about specialized tier access.
Common gotcha
Developers often assume quota increases are automatic based on billing account age: they are not. You must explicitly request increases through the Console. Free-tier accounts cannot request increases above 60 RPM for Gemini 2.0 Flash; you must enable paid billing first (even if no charges accrue).
Error recovery
google.api_core.exceptions.ResourceExhausted
google.api_core.exceptions.PermissionDenied
Request limit (60 RPM) silently enforced
Experienced dev note
Quota limits are per project, not per API key, so additional keys in the same project share one limit. Spinning up new GCP projects to bypass limits will also fail, because Google tracks usage at the billing account level. Instead, use Gemini's batch processing API if available for your use case, or implement client-side token bucket rate limiting to smooth out traffic spikes without requesting higher quotas. Most production outages from quota hits happen because teams didn't monitor usage trends; set up a Cloud Monitoring alert that fires when usage exceeds 70% of your approved quota.
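The client-side token bucket mentioned above can be sketched as follows. This is a minimal illustration, not part of the Gemini SDK: the `TokenBucket` class and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the limiter can be tested without real waiting.

```python
import threading
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, then sustain `rate_per_minute`."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(capacity)
        self.tokens = float(capacity)       # start with a full bucket
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            self._refill()
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    def acquire(self):
        """Block until a token is available, then take it."""
        while True:
            with self._lock:
                self._refill()
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                # Time needed for the missing fraction of a token to accumulate.
                wait = (1.0 - self.tokens) / self.rate
            self._sleep(wait)  # sleep outside the lock so other threads can proceed
```

A typical use would be to create one shared bucket, for example `TokenBucket(rate_per_minute=50, capacity=10)`, and call `bucket.acquire()` immediately before each `model.generate_content(prompt)` call. Setting the rate slightly below the approved quota leaves headroom for retries.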
Check your understanding
Your production service is hitting the 60 RPM limit at 8am daily. You submit a quota increase request to 300 RPM and it's approved at 9am. At 7:55am the next day, your service still fails. What's the most likely reason, and how would you verify it?
Answer hint
An approved increase applies to all new requests and is enforced server-side, so a restart isn't strictly needed. The most likely causes are stale state on the client side or a change in billing account status. Verify: 1) the Cloud Console Quotas page shows 300 RPM approved, 2) the billing account is active (not suspended), 3) restarting the service (or the Python process) clears any cached limit state and stale client connections, which sometimes cause confusion.