API Advanced hard · 8 min

Per-project budget alerts

What you will learn

Monitor token usage and costs across multiple Gemini API projects by querying quota metrics and setting up alerts before bills spike.

Why this matters

Large teams running multiple Gemini projects often discover runaway costs after the fact. Usage and quota data let you detect overspending in near real time and put guardrails in place before a project exceeds its budget.

Skip if: You have only one small project, you are on flat-rate pricing, or your organization doesn't enforce per-project cost tracking. In those cases manual quota checking is overkill. Use this when you have 3+ projects, variable traffic, or strict cost accountability.

Explanation

What it does: Each Gemini API response reports token counts in usage_metadata, and Google Cloud's Service Usage API (service_usage_v1.ServiceUsageClient) reports the state of the generativelanguage.googleapis.com service for a project. Together they let you track token consumption per project without waiting for billing reports.

How it works: Every generate_content() response carries usage_metadata for that request. The Service Usage client authenticates with your service account credentials and looks up the service by resource name (projects/PROJECT_ID/services/generativelanguage.googleapis.com). Quota limits apply over time windows (per minute or per day, depending on quota type), so you record usage after each request, store it in a database or logging system, and aggregate over the same windows to detect trends and trigger alerts when usage crosses thresholds.
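The store-and-alert loop described above could be sketched roughly as follows. This is a minimal in-memory illustration, not part of any Google API: `UsageTracker`, `daily_token_budget`, and `alert_fraction` are hypothetical names, and a production version would persist totals and reset them per window.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Accumulates token counts per project and flags threshold crossings."""
    # Hypothetical per-project daily token budgets; not part of any Google API.
    daily_token_budget: dict
    alert_fraction: float = 0.8  # alert once 80% of the budget is used
    totals: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, project_id: str, prompt_tokens: int, output_tokens: int) -> bool:
        """Add one request's tokens; return True if the project is at or over its alert line."""
        self.totals[project_id] += prompt_tokens + output_tokens
        budget = self.daily_token_budget.get(project_id)
        if budget is None:
            return False  # no budget configured, so nothing to alert on
        return self.totals[project_id] >= budget * self.alert_fraction


tracker = UsageTracker(daily_token_budget={"project-a": 10_000})
first = tracker.record("project-a", prompt_tokens=3_000, output_tokens=1_000)
second = tracker.record("project-a", prompt_tokens=3_000, output_tokens=1_500)
print(first, second)
```

In practice the counts passed to `record()` would come from each response's usage_metadata, and a `True` return would trigger a Slack or PagerDuty notification.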

When to use it: Implement quota monitoring in production when you need sub-hourly cost visibility, when you're testing different models or prompt strategies, or when you want to auto-scale request rates based on remaining budget. Wire alerts into Slack, PagerDuty, or your monitoring stack.

Request code

python

import os

from google.cloud import service_usage_v1
import google.generativeai as genai

project_id = os.environ.get('GOOGLE_CLOUD_PROJECT')
api_key = os.environ.get('GOOGLE_API_KEY')

genai.configure(api_key=api_key)

model = genai.GenerativeModel('gemini-2.0-flash')

response = model.generate_content('What is machine learning in one sentence?')

# Token counts for this request (the SDK names output tokens candidates_token_count).
usage = response.usage_metadata
print(f'Prompt tokens: {usage.prompt_token_count}')
print(f'Output tokens: {usage.candidates_token_count}')
print(f'Total tokens: {usage.prompt_token_count + usage.candidates_token_count}')

# Look up the Generative Language service for this project.
# This call uses Application Default Credentials, not the API key.
quota_client = service_usage_v1.ServiceUsageClient()
quota_name = f'projects/{project_id}/services/generativelanguage.googleapis.com'

try:
    service_response = quota_client.get_service(request={'name': quota_name})
    print(f'Service: {service_response.config.name}, state: {service_response.state.name}')
except Exception as e:
    print(f'Error querying service: {e}')

# Prices in USD per million tokens, as quoted in the Cost section.
cost_per_input_million = 0.00075
cost_per_output_million = 0.003

estimated_input_cost = (usage.prompt_token_count / 1_000_000) * cost_per_input_million
estimated_output_cost = (usage.candidates_token_count / 1_000_000) * cost_per_output_million
total_estimated_cost = estimated_input_cost + estimated_output_cost

print(f'Estimated cost for this request: ${total_estimated_cost:.6f}')

Authentication

The example uses two credentials. The Gemini call uses an API key read from GOOGLE_API_KEY. The Service Usage lookup uses a Google Cloud service account in a project with generativelanguage.googleapis.com enabled, and that account needs the serviceusage.services.get IAM permission. Enable the API: gcloud services enable generativelanguage.googleapis.com. Authenticate the service account by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of its JSON key file.

Response shape

Field	Description
`usage_metadata.prompt_token_count`	Integer: input tokens consumed
`usage_metadata.candidates_token_count`	Integer: output tokens generated
`usage_metadata.cached_content_token_count`	Integer: prompt tokens served from cache (optional, only if context caching is used)
`service_response.config.name`	String: GCP service name
`service_response.config.apis`	Array: APIs exposed by the service

Field guide

cached_content_token_count

Often zero, but critical to monitor: cache hits cost 90% less than fresh tokens. If this stays zero while you expect cache hits, your caching strategy isn't working.

prompt_token_count

Usually 5-10% of total cost; watch for runaway system prompt sizes in batch operations.

candidates_token_count

Often 10-15x the prompt cost; this is where budget disasters happen with streaming or long-form outputs.
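To check these rules of thumb against your own traffic, two ratios are handy: the share of prompt tokens served from cache, and the share of a request's cost that comes from output tokens. The sketch below assumes the cached count is a subset of the prompt count and, for simplicity, ignores the cache discount when computing cost share. The function names are illustrative, and the prices are the per-million rates quoted in the Cost section.

```python
def cache_hit_ratio(prompt_tokens, cached_tokens):
    """Fraction of prompt tokens served from cache (0.0 when there is no prompt)."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens


def output_cost_share(prompt_tokens, output_tokens, input_price, output_price):
    """Fraction of a request's cost attributable to output tokens."""
    input_cost = prompt_tokens * input_price
    output_cost = output_tokens * output_price
    total = input_cost + output_cost
    return output_cost / total if total else 0.0


ratio = cache_hit_ratio(prompt_tokens=2_000, cached_tokens=1_500)
share = output_cost_share(1_000, 500, input_price=0.00075, output_price=0.003)
print(f"cache hit ratio: {ratio:.2f}, output share of cost: {share:.2f}")
```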

Setup trap

Enabling the Generative Language API in Google Cloud Console is separate from configuring the Python SDK. You must run gcloud services enable generativelanguage.googleapis.com in your project, then ensure your service account has a role that includes serviceusage.services.get, such as roles/serviceusage.serviceUsageViewer. Skipping the API enablement returns a 403 error reporting the service as disabled, which is cryptic when you're just trying to check usage.

Cost

As of April 2026, Gemini 2.0 Flash input tokens cost $0.00075 per million and output tokens cost $0.003 per million; check the current pricing page before budgeting, since rates change. At those rates, a single request with 1000 input tokens and 500 output tokens costs roughly $0.00000225. At scale, 10 million daily requests averaging 1500 tokens each (1000 in, 500 out) cost ~$22.50/day or ~$675/month. Per-project monitoring catches the moment a runaway agent doubles this overnight.
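Working through the arithmetic with the quoted per-million rates makes the scale easier to see. This is a small sketch; `request_cost` is an illustrative helper, and the 30-day month is an assumption.

```python
# Prices in USD per million tokens, copied from the paragraph above.
INPUT_PRICE_PER_MILLION = 0.00075
OUTPUT_PRICE_PER_MILLION = 0.003


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_MILLION
            + output_tokens * OUTPUT_PRICE_PER_MILLION) / 1_000_000


per_request = request_cost(1000, 500)
per_day = per_request * 10_000_000   # 10 million requests per day
per_month = per_day * 30             # assumes a 30-day month
print(f"per request: ${per_request:.8f}")
print(f"per day: ${per_day:.2f}")
print(f"per month: ${per_month:.2f}")
```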

Rate limits

Quota queries themselves have a limit of 60 requests per minute per project. If you poll every second in a monitoring loop across 100 projects, you'll hit that limit. Use exponential backoff on failures and space quota checks 30-60 seconds apart, or use Cloud Monitoring dashboards instead of synchronous polling.
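A generic retry helper with exponential backoff and jitter might look like the sketch below. `call_with_backoff` and `flaky_quota_check` are hypothetical names, and the `sleep` parameter is injectable so the demo records delays instead of actually waiting.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call fn(); on failure wait base_delay * 2**attempt (capped, with jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries


# Demo with a stand-in check that fails twice before succeeding.
attempts = {"n": 0}


def flaky_quota_check():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []
result = call_with_backoff(flaky_quota_check, sleep=delays.append)
print(result, len(delays))
```

In real use you would narrow `retry_on` to the rate-limit exception your client raises, so that permission or configuration errors fail immediately.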

Common gotcha

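The usage_metadata totals are complete only after generate_content() finishes: while a response is still streaming you cannot rely on them. If you're streaming responses, consume the whole stream first, then read usage_metadata from the final chunk or the resolved response. Don't sum the values across chunks; the counts reported on chunks are cumulative, so summing overcounts. Many teams miss this and think their streaming calls aren't being metered.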
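One way to capture usage for a streamed response is to keep the last usage_metadata seen while iterating. The sketch below assumes the final chunk carries the totals, which is worth verifying against your SDK version. It uses stand-in chunk objects so it runs without network access; `final_usage` is an illustrative helper.

```python
from types import SimpleNamespace


def final_usage(chunks):
    """Return the usage_metadata from the last chunk that has one, or None."""
    last = None
    for chunk in chunks:
        meta = getattr(chunk, "usage_metadata", None)
        if meta is not None:
            last = meta
    return last


# Stand-in chunks; a real stream would come from
# model.generate_content(prompt, stream=True).
fake_stream = [
    SimpleNamespace(text="Machine ", usage_metadata=None),
    SimpleNamespace(
        text="learning...",
        usage_metadata=SimpleNamespace(prompt_token_count=9, candidates_token_count=12),
    ),
]
usage = final_usage(fake_stream)
print(usage.prompt_token_count, usage.candidates_token_count)
```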

Error recovery

PermissionDenied: generativelanguage.googleapis.com

Your service account lacks permission to read the service (serviceusage.services.get). Grant a role that includes it, for example: gcloud projects add-iam-policy-binding PROJECT_ID --member=serviceAccount:ACCOUNT_EMAIL --role=roles/serviceusage.serviceUsageViewer

ServiceNotEnabledError

The API is disabled in your GCP project. Run: gcloud services enable generativelanguage.googleapis.com --project=PROJECT_ID

usage_metadata is None

The response was malformed or the request failed silently. Check that the model name is correct (gemini-2.0-flash, not gemini-pro) and that the API key is valid.

QuotaExceeded

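You've hit your project's quota for that model. The Python client typically surfaces this as google.api_core.exceptions.ResourceExhausted (HTTP 429). Implement exponential backoff or request a quota increase via the GCP console.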

Experienced dev note

Don't poll quota in your request path: that adds latency and extra API calls. Instead, log usage metrics to BigQuery after every request, then query BigQuery for trends and alert conditions. This decouples monitoring from production request latency. Also, cached_content_token_count is where the biggest savings show up: prioritize context caching for repeated prompts; a 90% discount on input tokens adds up fast at scale.
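A rough sketch of the log-after-every-request pattern: build one JSON row per request and append it to a sink outside the request path. The column names, `usage_row`, and `log_usage` are hypothetical. The sink here is any object with `write()`; loading the rows into BigQuery (for example with the google-cloud-bigquery client's `insert_rows_json`) is left as a separate batch step.

```python
import io
import json
from datetime import datetime, timezone


def usage_row(project_id, model, prompt_tokens, output_tokens, cached_tokens=0, now=None):
    """Build one JSON-serializable row describing a single request's usage."""
    now = now or datetime.now(timezone.utc)
    return {
        "ts": now.isoformat(),
        "project_id": project_id,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "cached_tokens": cached_tokens,
    }


def log_usage(row, sink):
    """Append the row as one JSON line; sink is any object with write()."""
    sink.write(json.dumps(row) + "\n")


# Demo: write one row to an in-memory buffer instead of a real log.
buffer = io.StringIO()
row = usage_row("project-a", "gemini-2.0-flash", prompt_tokens=120, output_tokens=40,
                now=datetime(2026, 4, 1, tzinfo=timezone.utc))
log_usage(row, buffer)
print(buffer.getvalue().strip())
```

Keeping the row builder separate from the sink makes it easy to swap the buffer for a file, a logging handler, or a queue without touching request code.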

Check your understanding

If your team's daily traffic doubles overnight but your daily cost stays roughly flat, what are two possible explanations, and how would you distinguish between them using your logged usage metadata?

Answer hint

One possibility: the cache hit rate increased (cached_content_token_count rises, so total cost doesn't double). Another: requests shifted to a cheaper model or to shorter outputs. Check both the cached token metric and the model distribution in your logs to confirm which one applies.

VERSION google-generativeai 0.8.x includes usage_metadata in responses by default. Older versions (0.3-0.5) did not expose this reliably. Querying the service with a service account requires google-cloud-service-usage >= 1.14.0. Install dependencies (quote the specifiers so the shell does not treat > as a redirect): pip install "google-generativeai>=0.8.0" "google-cloud-service-usage>=1.14.0"
