Token counting before sending
Why this matters
Token counting prevents overspend surprises in production by letting you validate prompt size and estimate costs before committing an API call. In real systems, you often need to know token count to decide whether to include a document chunk, truncate history, or reject a request that exceeds your context window.
Explanation
What it does: The count_tokens() method returns the number of input tokens in a prompt before you send it, covering the text and any multimodal content. It measures input only and does not predict how long the response will be; for output, treat max_output_tokens in your generation config as the ceiling.
How it works: The count is computed server-side by the API with the target model's tokenizer, so it reflects how the model will tokenize the actual request. Unlike heuristic approximations (for example, characters divided by four), the result is deterministic for a given model and input.
When to use it: Before sending multi-turn conversations, before including large documents or context, or before routing requests to different models based on cost. This is standard practice in production chatbot backends, RAG systems, and any system where token count drives business logic.
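The send, truncate, or reject decision described in this section can be sketched as a small pre-send gate. This is a minimal sketch under stated assumptions, not part of the SDK: `Decision`, `gate_request`, and `word_count` are hypothetical names, and `count_fn` stands in for whatever returns a token count in your system (for example, a thin wrapper around `model.count_tokens(...).total_tokens`).

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    SEND = "send"
    TRUNCATE = "truncate"
    REJECT = "reject"


def gate_request(
    prompt: str,
    count_fn: Callable[[str], int],
    soft_limit: int,
    hard_limit: int,
) -> Decision:
    """Classify a prompt by token count before any generation call is made."""
    tokens = count_fn(prompt)
    if tokens > hard_limit:
        return Decision.REJECT    # too large to be worth trimming
    if tokens > soft_limit:
        return Decision.TRUNCATE  # caller should trim history or drop chunks
    return Decision.SEND


# Hypothetical stand-in counter for local experiments: one token per
# whitespace-separated word. A real system would call the API instead.
def word_count(text: str) -> int:
    return len(text.split())
```

Injecting the counter as a parameter keeps the gate testable offline; the two thresholds are illustrative and would come from your own context and cost budgets.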
Request code
import os

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

# Read the API key from the environment (raises KeyError if it is missing).
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

prompt = """Analyze this customer feedback and extract sentiment, key issues, and recommendations.
Customer feedback:
The product is intuitive but the onboarding flow takes 15 minutes.
We'd like a faster path for returning users. Support was helpful but slow (48hr response).
"""

# Count the input tokens for the prompt alone.
response = model.count_tokens(prompt)
print(f"Input tokens: {response.total_tokens}")

# Count again with the generation config attached. The result is still an
# input count; max_output_tokens is the ceiling for the output side.
config = GenerationConfig(max_output_tokens=500)
response_with_config = model.count_tokens(
    prompt,
    generation_config=config
)
print(f"Input tokens: {response_with_config.total_tokens}")
print(f"Output ceiling: {config.max_output_tokens}")

# Gate the request on size. The per-token price below is illustrative only.
if response.total_tokens > 10000:
    print("Warning: prompt exceeds 10k tokens, consider truncating")
else:
    print(f"Safe to send. Total cost estimate: ~${response.total_tokens * 0.0000005:.4f}")

Authentication
Set your Google API key as an environment variable: export GOOGLE_API_KEY='your-key-here'. The request code reads it explicitly with genai.configure(api_key=os.environ['GOOGLE_API_KEY']), which raises KeyError if the variable is unset. Verify setup by running python -c "import os; print('GOOGLE_API_KEY' in os.environ)" before running token counting code.
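If you would rather fail with a readable message than a bare KeyError, the check above can be folded into a small helper. This is a sketch only: `require_api_key` is a hypothetical name, not an SDK function, and the optional `env` parameter exists purely so the helper can be tested without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "GOOGLE_API_KEY",
) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before counting tokens")
    return key
```

The returned value could then be passed to `genai.configure(api_key=...)` at startup, so a missing key is caught once rather than on the first request.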
Response shape
| Field | Description |
|---|---|
| total_tokens | integer: number of tokens in the counted input (the prompt plus any other content passed in) |
| prompt_tokens | integer: input token count only (may not be present in the response; do not rely on it) |
| candidates_tokens | integer: output tokens (may not be present; count_tokens() measures input, so use max_output_tokens as the output ceiling) |
Field guide
total_tokens The primary field. Use this to decide whether to send the request or truncate the prompt.
prompt_tokens Rarely exposed separately. If it is absent, use total_tokens, which for count_tokens() is the input count.
candidates_tokens Hidden gotcha: any output figure you have before sending is a ceiling, not a measurement. Actual output may use far fewer tokens than max_output_tokens depending on model behavior. Use it for upper-bound estimation only.
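The upper-bound reasoning in this field guide comes down to simple arithmetic: the counted input plus the output ceiling. The sketch below assumes you budget input and output together; `within_budget` and `worst_case_cost` are hypothetical helpers, and the prices in the usage note are placeholders rather than real rates.

```python
def within_budget(input_tokens: int, max_output_tokens: int, total_budget: int) -> bool:
    """True if input plus the output ceiling fits a self-imposed token budget."""
    return input_tokens + max_output_tokens <= total_budget


def worst_case_cost(
    input_tokens: int,
    max_output_tokens: int,
    input_price_per_token: float,
    output_price_per_token: float,
) -> float:
    """Upper bound on request cost, assuming the model uses its full output ceiling."""
    return (
        input_tokens * input_price_per_token
        + max_output_tokens * output_price_per_token
    )
```

For example, `worst_case_cost(1000, 500, 0.0000005, 0.000002)` prices 1,000 input tokens and a 500-token ceiling at placeholder rates; the real bill will usually be lower because most responses stop before the ceiling.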
Setup trap
The count_tokens() response shape changed between google-generativeai 0.7.x and 0.8.x. If you're upgrading, the old .total_tokens field still works but .prompt_tokens may not be populated as expected. Always check your version: pip show google-generativeai | grep Version.
Cost
Calling count_tokens() itself is free: it is not billed as input tokens. But in high-frequency systems (>100k tokens/day), you're doing redundant network calls. Cache counts only if your prompt is deterministic. Example: a chatbot with a fixed system prompt should count that prompt once, not on every request.
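One way to avoid the redundant calls described above is a small memoizing wrapper. This is a sketch under assumptions: `TokenCountCache` is a hypothetical class, and `count_fn` is whatever function performs the real count in your system. Keying the cache on the model name as well as the text means a model switch never reuses a stale count.

```python
from typing import Callable, Dict, Tuple


class TokenCountCache:
    """Memoize token counts per (model name, text) pair."""

    def __init__(self, count_fn: Callable[[str, str], int]) -> None:
        self._count_fn = count_fn
        self._cache: Dict[Tuple[str, str], int] = {}
        self.misses = 0  # number of times the underlying counter was called

    def count(self, model_name: str, text: str) -> int:
        key = (model_name, text)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._count_fn(model_name, text)
        return self._cache[key]

    def clear(self) -> None:
        """Drop all cached counts, e.g. after an SDK or model upgrade."""
        self._cache.clear()
```

This only pays off for text that repeats exactly, such as a fixed system prompt; per-request user content will miss the cache every time.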
Rate limits
Google doesn't explicitly rate-limit token counting, but it shares quota with generation calls. If you hit generation limits, token counting will also be throttled. Design your system to batch count operations when possible.
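If counting does get throttled, a retry with exponential backoff is a common complement to batching. Note that this sketch shows backoff, not batching, and it is generic: `call_with_backoff` is a hypothetical helper, and which exception type signals throttling in your setup is an assumption you should verify (with this SDK it is likely `google.api_core.exceptions.ResourceExhausted`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff on the given exception types."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")
```

A call site might look like `call_with_backoff(lambda: model.count_tokens(prompt), (ResourceExhausted,))`. The `sleep` parameter is injectable so the delay schedule can be tested without waiting.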
Common gotcha
Developers often call count_tokens(prompt) once, assume it's accurate forever, and hardcode the count. But token counts can vary between API versions and models. For prompts that change per request, count immediately before sending in production; for static content, treat any stored count as valid only for the model and SDK version it was measured with.
Error recovery
google.api_core.exceptions.InvalidArgument
google.api_core.exceptions.Unauthenticated
google.api_core.exceptions.PermissionDenied
AttributeError on response.total_tokens
Experienced dev note
Token counting is your first line of defense for cost control in LLM systems. Smart teams use it not just to predict cost, but to implement intelligent chunking: count tokens for each document before adding it to context, skip chunks that would exceed the budget, and return partial results rather than overflow errors. Also, token counts are deterministic for a given model, so you can pre-compute counts for static content (system prompts, documentation snippets) at startup and cache them until the model or SDK version changes. This can save hundreds of API calls daily in production.
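The chunking strategy in this note can be sketched as a greedy packer. This is a minimal sketch, not a definitive implementation: `pack_chunks` is a hypothetical helper, `count_fn` stands in for a real token counter, and `fixed_tokens` is where a precomputed system-prompt count would go.

```python
from typing import Callable, List, Sequence, Tuple


def pack_chunks(
    chunks: Sequence[str],
    count_fn: Callable[[str], int],
    budget: int,
    fixed_tokens: int = 0,
) -> Tuple[List[str], int]:
    """Greedily keep chunks, in order, while the running total stays within budget."""
    selected: List[str] = []
    used = fixed_tokens  # e.g. a cached system-prompt count
    for chunk in chunks:
        cost = count_fn(chunk)
        if used + cost > budget:
            continue  # skip this chunk; a smaller later one may still fit
        selected.append(chunk)
        used += cost
    return selected, used
```

Summing per-chunk counts is an approximation of the assembled prompt's size, since separators and message framing can add a few tokens. A final count on the assembled prompt remains the authoritative check before sending.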
Check your understanding
You have a system that processes customer documents in chunks. Some chunks are 500 tokens, some are 2000 tokens. Your budget allows 15k input tokens per request. How would you use count_tokens() to ensure you never exceed budget, and why is caching the count of the system prompt alone not sufficient?
Answer hint
The answer involves: (1) counting the system prompt once and caching it, (2) checking each chunk's count and only adding it if total doesn't exceed 15k, (3) understanding that total_tokens includes system prompt + user prompt + all chunks, so you must sum them dynamically based on which chunks fit.