Fix Vertex AI quota exceeded error

Quick answer
A quota exceeded error in Vertex AI occurs when your API calls surpass the allowed quota limits set by Google Cloud. To fix this, implement exponential backoff retry logic around your API calls and monitor your quota usage in the Google Cloud Console to avoid hitting limits.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle quota exceeded errors automatically.

Why this happens

The quota exceeded error occurs when your application sends more requests to Vertex AI than your Google Cloud project’s quota allows. This can happen if you exceed daily, per-minute, or per-user limits configured in the Google Cloud Console. Typical error output looks like:

google.api_core.exceptions.ResourceExhausted: 8 RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'Requests' and limit 'Requests per minute'

Example of code triggering this error without retries:

python
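import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")
# Gemini models are loaded through GenerativeModel, not TextGenerationModel
model = GenerativeModel("gemini-2.0-flash")

# No retry handling: once the quota is exhausted, this call raises ResourceExhausted
response = model.generate_content("Hello")
print(response.text)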
output
google.api_core.exceptions.ResourceExhausted: 8 RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'Requests' and limit 'Requests per minute'

The fix

Implement exponential backoff retry logic to automatically retry requests after a delay when a quota error occurs. Spacing out the retries reduces request bursts and gives a per-minute quota window time to reset. If retries are triggered often, check your usage on the Quotas page in the Google Cloud Console and request a quota increase.

Example with retry using google.api_core.retry.Retry:

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel
from google.api_core import exceptions, retry

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

# Define exponential backoff retry for quota errors only;
# other exceptions are raised immediately instead of being retried
quota_retry = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ResourceExhausted),
    initial=1.0,     # initial delay in seconds
    maximum=30.0,    # max delay between attempts in seconds
    multiplier=2.0,  # exponential multiplier
    timeout=60.0,    # total time budget in seconds (named `deadline` in older google-api-core releases)
)

@quota_retry
def generate_text_with_retry(prompt: str) -> str:
    response = model.generate_content(prompt)
    return response.text

try:
    text = generate_text_with_retry("Hello")
    print(text)
except Exception as e:
    # Reached when the time budget runs out or a non-quota error occurs
    print(f"Failed after retries: {e}")
output
(model-generated reply to the prompt; exact text varies)
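
To make the retry parameters concrete, the sketch below computes the nominal delay schedule implied by an initial delay of 1 second, a multiplier of 2, and a maximum delay of 30 seconds. It is an illustration only: backoff_delays is a hypothetical helper, not part of google.api_core, and the library also randomizes each delay (jitter), so real waits will differ from these numbers.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float, multiplier: float, maximum: float) -> Iterator[float]:
    """Yield nominal exponential backoff delays, capped at `maximum` seconds.

    Hypothetical helper for illustration; jitter is deliberately left out.
    """
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= multiplier


# First six nominal delays for the settings used in the retry example.
schedule = list(islice(backoff_delays(initial=1.0, multiplier=2.0, maximum=30.0), 6))
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The first five nominal delays add up to 31 seconds, so with a 60-second total budget only a handful of attempts fit before the retry gives up. If your quota resets per minute, a longer total budget may be worth considering.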

Preventing it in production

  • Use exponential backoff retries for all Vertex AI API calls to gracefully handle quota limits.
  • Monitor your quota usage in the Google Cloud Console Quotas page and request quota increases if needed.
  • Implement rate limiting in your application to avoid bursts that exceed per-minute or per-second quotas.
  • Cache frequent responses to reduce unnecessary API calls.
  • Use Google Cloud’s Quota Monitoring and alerts to get notified before hitting limits.
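
The rate limiting and caching points above can be sketched in a few lines of standard-library Python. This is a minimal, single-threaded illustration under stated assumptions: SlidingWindowLimiter, call_model, and cached_generate are hypothetical names, call_model stands in for the real Vertex AI call, and the limit of 60 calls per 60 seconds is a placeholder rather than your project's actual quota.

```python
import time
from collections import deque
from functools import lru_cache
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def acquire(self) -> None:
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the real Vertex AI request."""
    return f"response to: {prompt}"


# Placeholder limit; set this below the quota shown for your project.
limiter = SlidingWindowLimiter(max_calls=60, period=60.0)


@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    """Return a cached reply for repeated prompts; otherwise call the model."""
    limiter.acquire()  # only cache misses consume rate-limit budget
    return call_model(prompt)


print(cached_generate("Hello"))
print(cached_generate("Hello"))  # served from the cache, no second call
```

Caching model output only makes sense when returning the same reply for the same prompt is acceptable. A multi-threaded or multi-process service would also need a lock or a shared limiter, which this sketch omits.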

Key Takeaways

  • Implement exponential backoff retries to handle quota exceeded errors automatically.
  • Monitor and request quota increases in Google Cloud Console to avoid hitting limits.
  • Use rate limiting and caching to reduce unnecessary API calls and bursts.
Verified 2026-04 · gemini-2.0-flash