QuotaExceeded
google.api_core.exceptions.ResourceExhausted: QuotaExceeded
Stack trace
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'Vertex AI Gemini model requests' and limit 'Requests per minute' of service 'aiplatform.googleapis.com' for consumer 'projects/your-project'.
Why it happens
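Google Cloud enforces quota limits on Vertex AI Gemini model usage to prevent abuse and ensure fair resource distribution. When your project exceeds one of these limits (in the trace above, requests per minute), the API responds with HTTP 429 (RESOURCE_EXHAUSTED), which the Python client raises as google.api_core.exceptions.ResourceExhausted. To recover, either reduce your request rate or request a higher quota.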
Detection
Watch API responses for HTTP 429 status codes (raised as ResourceExhausted exceptions in Python), and compare your request counts with the limits shown on your project's Quotas page in the Google Cloud Console so that you can detect approaching limits before failures occur.
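The in-process half of this detection can be sketched as a small rolling-window counter. This is a minimal illustration, not part of any Google library: QuotaMonitor, warn_ratio, and the limit value are hypothetical, and the real per-minute limit has to be copied from your project's Quotas page.

```python
import time
from collections import deque
from typing import Callable, Deque


class QuotaMonitor:
    """Tracks requests in a rolling 60-second window against an assumed limit."""

    def __init__(self, limit_per_minute: int, warn_ratio: float = 0.8,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit_per_minute    # assumption: value taken from the Quotas page
        self.warn_ratio = warn_ratio     # hypothetical warning threshold (80% of the limit)
        self.clock = clock               # injectable so the class can be tested without waiting
        self._sent: Deque[float] = deque()  # timestamps of recent requests
        self.exhausted_count = 0         # how many ResourceExhausted errors were observed

    def _prune(self) -> None:
        # Drop timestamps that are 60 seconds old or older.
        cutoff = self.clock() - 60.0
        while self._sent and self._sent[0] <= cutoff:
            self._sent.popleft()

    def record_request(self) -> None:
        """Call once per request sent to the model."""
        self._sent.append(self.clock())

    def record_exhausted(self) -> None:
        """Call from an `except ResourceExhausted` handler."""
        self.exhausted_count += 1

    def utilization(self) -> float:
        """Fraction of the assumed per-minute limit used in the last 60 seconds."""
        self._prune()
        return len(self._sent) / self.limit

    def near_limit(self) -> bool:
        return self.utilization() >= self.warn_ratio
```

A caller would invoke record_request() before each model call and check near_limit() to decide whether to log a warning or slow down. The counter only sees traffic from its own process, so it complements the Console dashboard rather than replacing it.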
Causes & fixes
Cause: Too many requests are sent to the Gemini model in a short time, exceeding the per-minute quota.
Fix: Implement request rate limiting or retries with exponential backoff in your client to stay within the quota.
Cause: Your Google Cloud project has a low default quota for Gemini model usage.
Fix: Request a quota increase on the Quotas page of the Google Cloud Console for the aiplatform.googleapis.com service.
Cause: Multiple services or users in your project collectively exceed the quota.
Fix: Coordinate usage across teams or services and spread requests out to avoid bursts that exceed the quota.
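The client-side rate limiting mentioned in the first fix could look like the token bucket below. It is a sketch under stated assumptions: TokenBucket is a hypothetical helper, the rate should be set somewhat below your actual quota, and the class is neither thread-safe nor shared across processes, so it does not by itself solve the multi-service case.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows short bursts up to `capacity` while holding the average rate at `rate_per_minute`."""

    def __init__(self, rate_per_minute: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(capacity)
        self.tokens = float(capacity)       # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of waiting."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def acquire(self) -> None:
        """Block until one token is available, then take it."""
        while not self.try_acquire():
            # Sleep roughly as long as it takes to accrue the missing fraction of a token.
            self.sleep(max((1.0 - self.tokens) / self.rate, 0.001))
```

Calling acquire() immediately before each model request keeps a single process under the configured rate. When several services share one project quota, each would need its own share of the limit or a shared limiter backed by an external store.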
Code: broken vs fixed
# Broken: a single unguarded call; a 429 response propagates as ResourceExhausted.
from google.cloud import aiplatform

client = aiplatform.gapic.PredictionServiceClient()
response = client.predict(
    endpoint='projects/your-project/locations/us-central1/endpoints/123456789',
    instances=[{'input': 'test'}],
)  # This may raise ResourceExhausted when the quota is exceeded

# Fixed: retry with exponential backoff when the quota is exhausted.
import os
import time
from google.cloud import aiplatform
from google.api_core.exceptions import ResourceExhausted

client = aiplatform.gapic.PredictionServiceClient()
max_retries = 5
for attempt in range(max_retries):
    try:
        response = client.predict(endpoint=os.environ['VERTEX_ENDPOINT'], instances=[{'input': 'test'}])
        print(response)
        break
    except ResourceExhausted:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt  # 1, 2, 4, 8 seconds
            print(f"Quota exceeded, retrying in {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            raise  # out of retries: surface the error to the caller
Workaround
Catch ResourceExhausted exceptions and implement client-side rate limiting or delay retries to reduce request frequency temporarily.
Prevention
Architect your system to monitor quota usage proactively and implement exponential backoff with jitter on retries; request quota increases early if usage grows.
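Exponential backoff with jitter can be sketched as follows. This is a minimal illustration using the "full jitter" variant (a uniform random delay between zero and the capped exponential value). backoff_delay and call_with_backoff are hypothetical helpers, and the base, cap, and attempt count are assumed defaults rather than recommended values. The google.api_core.retry module provides a Retry class that may be a better fit in production code.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on the given exception types with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the original error
            sleep(backoff_delay(attempt))
```

With the Vertex AI client this might be invoked as call_with_backoff(lambda: client.predict(...), (ResourceExhausted,)). The random component spreads retries from many clients over time, so they do not all hit the quota again at the same instant.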