API Advanced medium · 6 min

Token counting before sending

What you will learn

Use <code>count_tokens()</code> to measure a prompt's input token usage before calling <code>generate_content()</code>, enabling cost prediction and context validation.

Why this matters

Token counting prevents overspend surprises in production by letting you validate prompt size and estimate costs before committing an API call. In real systems, you often need to know token count to decide whether to include a document chunk, truncate history, or reject a request that exceeds your context window.

Skip if: you're prototyping with fixed, small prompts or your cost variance is negligible. Don't use token counting as a substitute for actual request timeouts or hard context limits: counting is metadata, not enforcement.

Explanation

What it does: The count_tokens() method returns the number of input tokens a prompt will consume before you send it. This covers your prompt and any multimodal content. It does not predict output tokens: output length is only known after generation, so treat max_output_tokens as the upper bound when budgeting.

How it works: count_tokens() is itself an API call. The prompt is tokenized server-side with the tokenizer of the model you selected, so the count matches what generate_content() would see for the same content. The response carries the count in total_tokens. Unlike character- or word-based heuristics, the result is deterministic for a given model.

When to use it: Before sending multi-turn conversations, before including large documents or context, or before routing requests to different models based on cost. This is standard practice in production chatbot backends, RAG systems, and any system where token count drives business logic.
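As a rough illustration of letting a token count drive a truncation decision, the sketch below drops the oldest turns of a conversation until the rest fits a budget. The names `trim_history` and `fake_count` are hypothetical, and the word-count function is only a stand-in so the example runs offline. In a real system you might pass something like `lambda text: model.count_tokens(text).total_tokens` instead. Summing per-turn counts is an approximation, since the full request may count slightly differently from its parts.

```python
from typing import Callable, List


def trim_history(turns: List[str], budget: int,
                 count_fn: Callable[[str], int]) -> List[str]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so recent context survives truncation.
    for turn in reversed(turns):
        cost = count_fn(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def fake_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


history = [
    "hello there",
    "how can I help you today",
    "summarize my last order please",
]
trimmed = trim_history(history, budget=11, count_fn=fake_count)
print(trimmed)
```

Injecting the counter as a function keeps the budgeting logic testable without network access, and makes it easy to swap in a cached counter later.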

Request code

python

import google.generativeai as genai
import os

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

prompt = """Analyze this customer feedback and extract sentiment, key issues, and recommendations.

Customer feedback:
The product is intuitive but the onboarding flow takes 15 minutes. 
We'd like a faster path for returning users. Support was helpful but slow (48hr response).
"""

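# count_tokens() reports input tokens only: the prompt plus any other content you pass.
response = model.count_tokens(prompt)
input_tokens = response.total_tokens
print(f"Input tokens: {input_tokens}")

# Output length is unknown until generation finishes. max_output_tokens caps it,
# so use the cap as the worst case when budgeting.
from google.generativeai.types import GenerationConfig

max_output_tokens = 500
config = GenerationConfig(max_output_tokens=max_output_tokens)
print(f"Output tokens (upper bound): {max_output_tokens}")

# Placeholder per-token input rate, not a real price: look up the current rate for your model.
INPUT_PRICE_PER_TOKEN = 0.0000005

if input_tokens > 10000:
    print("Warning: prompt exceeds 10k tokens, consider truncating")
else:
    print(f"Safe to send. Input cost estimate: ~${input_tokens * INPUT_PRICE_PER_TOKEN:.4f}")
    # To enforce the output cap, pass the config when generating:
    # model.generate_content(prompt, generation_config=config)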

Authentication

Set your Google API key as an environment variable: export GOOGLE_API_KEY='your-key-here'. The example reads it from the environment and passes it explicitly with genai.configure(api_key=os.environ['GOOGLE_API_KEY']). Verify setup by running python -c "import os; print('GOOGLE_API_KEY' in os.environ)" before running token counting code.

Response shape

Field	Description
`total_tokens`	integer: number of input tokens in the content passed to count_tokens(); it does not include an output estimate
`usage_metadata.prompt_token_count`	integer: input tokens counted for a completed request; returned by generate_content(), not by count_tokens()
`usage_metadata.candidates_token_count`	integer: output tokens actually generated; returned by generate_content() only

Field guide

total_tokens

The primary field. Use this to decide whether to send the request or truncate the prompt.

usage_metadata.prompt_token_count

Not part of the count_tokens() response. Read it from the generate_content() response after the call to confirm the input count you predicted.

usage_metadata.candidates_token_count

Hidden gotcha: output tokens are only known after generation. Before the call, the only bound you have is max_output_tokens, and actual output is often shorter. Use max_output_tokens for upper-bound estimation only.
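To make the before/after relationship concrete, here is a minimal sketch that compares a count taken before sending with the usage reported after generation. `token_report` is a hypothetical helper, and the two `SimpleNamespace` objects with made-up numbers are stand-ins for the real responses so the example runs without an API key. It assumes the generation response exposes `usage_metadata.prompt_token_count` and `usage_metadata.candidates_token_count`.

```python
from types import SimpleNamespace


def token_report(count_response, generate_response) -> dict:
    """Compare the pre-send input count with usage reported after generation."""
    predicted = count_response.total_tokens
    usage = generate_response.usage_metadata
    return {
        "predicted_input": predicted,
        "actual_input": usage.prompt_token_count,
        "actual_output": usage.candidates_token_count,
        # Non-zero drift suggests the counted content differed from what was sent.
        "input_drift": usage.prompt_token_count - predicted,
    }


# Stand-in objects with illustrative numbers (assumption): in real code these
# would come from model.count_tokens(...) and model.generate_content(...).
fake_count_response = SimpleNamespace(total_tokens=58)
fake_generate_response = SimpleNamespace(
    usage_metadata=SimpleNamespace(
        prompt_token_count=58,
        candidates_token_count=212,
        total_token_count=270,
    )
)

report = token_report(fake_count_response, fake_generate_response)
print(report)
```

Logging a report like this for a sample of requests is one way to check that your pre-send counts track what the service actually records.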

Setup trap

The count_tokens() response shape changed between google-generativeai 0.7.x and 0.8.x. If you're upgrading, .total_tokens is the field to rely on; don't assume other fields such as .prompt_tokens are populated. Always check your version: pip show google-generativeai | grep Version.

Cost

Calling <code>count_tokens()</code> itself is free: it doesn't charge input tokens. But in high-volume systems (>100k tokens/day), recounting unchanged text means redundant network calls. Cache counts for content that does not change between requests. Example: a chatbot with a fixed system prompt should count it once, not on every request.
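One way to avoid recounting static text is a small in-memory cache keyed by model name and text. The sketch below is an assumption-laden illustration: `CachedCounter` and `fake_count` are hypothetical names, and the word-count function stands in for a real call such as `model.count_tokens(text).total_tokens`.

```python
from typing import Callable, Dict, Tuple


class CachedCounter:
    """Memoize token counts per (model name, text) pair."""

    def __init__(self, count_fn: Callable[[str], int], model_name: str) -> None:
        self._count_fn = count_fn
        self._model_name = model_name
        self._cache: Dict[Tuple[str, str], int] = {}
        self.remote_calls = 0  # how many times the underlying counter ran

    def count(self, text: str) -> int:
        # Keying by model name means a model change never reuses a stale count.
        key = (self._model_name, text)
        if key not in self._cache:
            self._cache[key] = self._count_fn(text)
            self.remote_calls += 1
        return self._cache[key]


def fake_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


counter = CachedCounter(fake_count, model_name="gemini-2.0-flash")
system_prompt = "You are a concise support assistant"
first = counter.count(system_prompt)
second = counter.count(system_prompt)  # served from the cache
print(first, second, counter.remote_calls)
```

The cache is unbounded, which is fine for a handful of static prompts but would need an eviction policy if you fed it per-user content.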

Rate limits

Google doesn't explicitly rate-limit token counting, but it shares quota with generation calls. If you hit generation limits, token counting will also be throttled. Design your system to batch count operations when possible.
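If counting calls do get throttled, retrying with exponential backoff is a common mitigation. The sketch below is generic and hedged: `count_with_backoff`, `Throttled`, and `flaky_count` are hypothetical names used for the demo. With the real SDK you would likely pass the throttling exception your client raises (for example `google.api_core.exceptions.ResourceExhausted`) as `retry_on`. The `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable, List, Tuple, Type


def count_with_backoff(count_fn: Callable[[str], int], text: str,
                       retry_on: Tuple[Type[BaseException], ...],
                       max_attempts: int = 4, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call count_fn, retrying on the given exceptions with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return count_fn(text)
        except retry_on:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


class Throttled(Exception):
    """Stand-in for a quota or rate-limit error (assumption)."""


calls = {"n": 0}


def flaky_count(text: str) -> int:
    """Fails twice with Throttled, then returns a word count."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled("quota exceeded")
    return len(text.split())


delays: List[float] = []
result = count_with_backoff(flaky_count, "one two three",
                            retry_on=(Throttled,), sleep=delays.append)
print(result, delays)
```

Because counting shares quota with generation, it may be worth capping retries tightly so a throttled count does not delay the request more than skipping the check would.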

Common gotcha

Developers often call count_tokens(prompt) once during development, hardcode the result, and assume it stays accurate forever. But token counts can vary between models and API versions. In production, count dynamic content immediately before sending, and key any cached counts for static content by model so they are recomputed when the model changes.

Error recovery

google.api_core.exceptions.InvalidArgument

Prompt contains invalid characters or exceeds model's hard context limit. Fix: validate UTF-8 encoding and check prompt length before calling count_tokens().

google.api_core.exceptions.Unauthenticated

GOOGLE_API_KEY not set or expired. Fix: verify <code>export GOOGLE_API_KEY='your-actual-key'</code> and regenerate key in Google AI Studio if stale.

google.api_core.exceptions.PermissionDenied

API key lacks permission for this model. Fix: ensure your project has the Generative Language API enabled in Google Cloud Console and your key is from that project.

AttributeError on response.total_tokens

Using wrong response object or outdated google-generativeai version. Fix: upgrade with <code>pip install --upgrade google-generativeai</code> and use response from count_tokens(), not generate_content().

Experienced dev note

Token counting is your first line of defense for cost control in LLM systems. Smart teams use it not just to predict cost, but to implement intelligent chunking: count tokens for each document before adding it to context, skip chunks that would exceed the budget, and return partial results rather than overflow errors. Also: token counts are deterministic for a given model, so you can pre-compute counts for static content (system prompts, documentation snippets) at startup and cache them until the model or SDK version changes: this can save hundreds of API calls daily in production.
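The chunking idea can be sketched as a greedy selection under a budget. This is a minimal illustration, not a definitive implementation: `select_chunks` and `fake_count` are hypothetical names, `fixed_tokens` stands for the pre-computed count of static content such as the system prompt, and the word-count function replaces a real `count_tokens()` call so the example runs offline.

```python
from typing import Callable, List, Tuple


def select_chunks(chunks: List[str], budget: int, fixed_tokens: int,
                  count_fn: Callable[[str], int]) -> Tuple[List[str], int]:
    """Greedily add chunks in order, skipping any that would exceed the budget."""
    selected: List[str] = []
    used = fixed_tokens  # the static content is always part of the request
    for chunk in chunks:
        cost = count_fn(chunk)
        if used + cost > budget:
            continue  # skip this chunk; a smaller later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used


def fake_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


chunks = ["a b c", "d e f g h i j k", "l m"]
selected, used = select_chunks(chunks, budget=10, fixed_tokens=4,
                               count_fn=fake_count)
print(selected, used)
```

Skipping rather than stopping at the first oversized chunk favors filling the budget. If chunk order carries meaning, breaking out of the loop instead would preserve contiguity at the cost of unused budget.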

Check your understanding

You have a system that processes customer documents in chunks. Some chunks are 500 tokens, some are 2000 tokens. Your budget allows 15k input tokens per request. How would you use count_tokens() to ensure you never exceed budget, and why is caching the count of the system prompt alone not sufficient?

Answer hint

The answer involves: (1) counting the system prompt once and caching it, (2) checking each chunk's count and only adding it if total doesn't exceed 15k, (3) understanding that total_tokens includes system prompt + user prompt + all chunks, so you must sum them dynamically based on which chunks fit.

VERSION google-generativeai 0.8.x response structure is stable. Token counting is available for all Gemini models (2.0-flash, 2.5-pro, etc.). If using 0.7.x or earlier, upgrade immediately: older versions have inconsistent count_tokens() behavior.
