What context caching provides: cost reduction
Why this matters
When you process the same large documents, system prompts, or knowledge bases across multiple requests, you pay the full token cost each time. Context caching stores that context server-side once and charges 10% of the normal input price for the cached tokens on every later request that reuses it. For a 128K-token context queried 100 times, that cuts the input cost of the context by roughly 89%. At production scale, with multiple users running concurrent queries, this is the difference between prohibitively expensive and sustainable operations.
Explanation
What context caching does: The Gemini API lets you create a cache explicitly. You upload the reusable context (usually your system prompt plus a large document) as a cached-content object with a time-to-live (TTL); this guide uses 5 minutes. Google stores it server-side until the TTL expires. Requests that reference the cache are charged the discounted cached-token rate for that context, plus the normal rate for any new tokens you add.
How it works: Creating the cache returns a handle with a server-assigned name and an expiry time. You build a model from that handle and send only the new query on each request. Each response's usage_metadata reports prompt_token_count (cached plus new tokens) and cached_content_token_count (the tokens served from the cache, billed at 1/10th the normal rate under the pricing in the Cost section). A cache is addressed by its name and its contents cannot be edited, so changing the context by even a single space means creating a new cache.
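Since even a one-character change means the cached copy no longer matches your context, it can help to fingerprint the context on the client and compare fingerprints before reusing a cache. The sketch below is only an illustration built on Python's hashlib: `context_fingerprint` is a hypothetical helper, and the digest is a local bookkeeping key, not the identifier the service itself uses.

```python
import hashlib


def context_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the exact bytes of the context (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "SECTION 3: COMPENSATION\nCompany shall pay Service Provider $50,000 monthly."
with_extra_space = original + " "

# Identical text always yields the same fingerprint.
same = context_fingerprint(original) == context_fingerprint(original)
# A single trailing space yields a different one.
changed = context_fingerprint(original) != context_fingerprint(with_extra_space)
print(same, changed)
```

Storing the fingerprint next to the cache handle gives a cheap way to decide whether an existing cache still corresponds to the text you are about to query.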
When to use it: Use it for retrieval-augmented generation (RAG) where the knowledge base is fixed but queries vary, for document-analysis workflows where users ask multiple questions about the same PDF, and for few-shot prompting where your in-context examples are large. A 5-minute TTL suits interactive sessions but not batch jobs that outlast it.
Request code
import datetime
import os

import google.generativeai as genai
from google.generativeai import caching

# Read the key from the environment instead of hardcoding it in source.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Explicit caching needs a versioned model name; use one your key can cache with.
MODEL_NAME = "models/gemini-2.0-flash-001"

# Short excerpt standing in for the full 50-page contract. The real context must
# meet the model's minimum cacheable size, or cache creation is rejected.
large_context = """You are an expert legal document analyst. Here is a 50-page contract:
SECTION 1: PARTIES
This Agreement is entered into between Acme Corp ("Company") and ServicePro Inc ("Service Provider").
SECTION 2: SCOPE OF WORK
Service Provider shall deliver cloud infrastructure management services including 24/7 monitoring, patching, and incident response.
SECTION 3: COMPENSATION
Company shall pay Service Provider $50,000 monthly, due within 30 days of invoice.
SECTION 4: TERM
This Agreement shall commence on 2024-01-01 and continue for 24 months unless terminated for cause.
SECTION 5: LIABILITY
Neither party shall be liable for indirect, incidental, or consequential damages exceeding total fees paid in the preceding 12 months.
SECTION 6: TERMINATION
Either party may terminate with 60 days written notice. Immediate termination permitted for material breach uncured within 15 days.
SECTION 7: CONFIDENTIALITY
All proprietary information shared under this Agreement remains confidential for 3 years post-termination.
SECTION 8: GOVERNING LAW
This Agreement shall be governed by Delaware law."""

query_1 = "What is the monthly compensation and payment terms?"
query_2 = "What are the termination conditions and notice period?"

# USD per token, taken from the Cost section of this guide.
INPUT_PRICE = 0.075 / 1_000_000
CACHED_PRICE = 0.0075 / 1_000_000


def print_usage(label, response):
    """Print the token accounting for one response."""
    usage = response.usage_metadata
    print(f"\nUsage Stats ({label}):")
    print(f"  Prompt tokens (cached + new): {usage.prompt_token_count}")
    print(f"  Cached tokens: {usage.cached_content_token_count}")
    print(f"  Output tokens: {usage.candidates_token_count}")


print("\n=== CACHE CREATION ===")
# Upload the context once, with the 5-minute TTL used throughout this guide.
cache = caching.CachedContent.create(
    model=MODEL_NAME,
    contents=[large_context],
    ttl=datetime.timedelta(minutes=5),
)
print(f"Cache name: {cache.name}")
print(f"Expires at: {cache.expire_time}")

# Bind a model to the cache; each request then carries only the new query.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

print("\n=== FIRST REQUEST (CACHE HIT) ===")
response_1 = model.generate_content(query_1)
print(f"Response 1: {response_1.text[:200]}...")
print_usage("Request 1", response_1)

print("\n=== SECOND REQUEST (CACHE HIT) ===")
response_2 = model.generate_content(query_2)
print(f"Response 2: {response_2.text[:200]}...")
print_usage("Request 2", response_2)

print("\n=== COST ANALYSIS ===")
usage_2 = response_2.usage_metadata
cached_tokens = usage_2.cached_content_token_count
new_tokens = usage_2.prompt_token_count - cached_tokens
# Creation is billed once at the full input rate; reads are billed at the cached rate.
cache_creation_price = cached_tokens * INPUT_PRICE
with_cache = cached_tokens * CACHED_PRICE + new_tokens * INPUT_PRICE
without_cache = usage_2.prompt_token_count * INPUT_PRICE
savings = without_cache - with_cache
print(f"Cache creation cost: ${cache_creation_price:.6f}")
print(f"Second request with cache: ${with_cache:.6f}")
print(f"Second request without cache: ${without_cache:.6f}")
print(f"Savings on request 2: ${savings:.6f} ({savings / without_cache * 100:.1f}%)")
Authentication
Set your API key before making requests:
```python
import os
import google.generativeai as genai

# Placeholder value; in practice, export GOOGLE_API_KEY in your shell instead of hardcoding it.
os.environ['GOOGLE_API_KEY'] = 'your-api-key-here'
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
```
Obtain your key from Google AI Studio (https://aistudio.google.com/app/apikeys). The key must have access to the Gemini API.
Response shape
| Field | Description |
|---|---|
| text | The model's text response to your query |
| usage_metadata | Token-usage object for the request; its fields are described in the field guide below |
Field guide
cached_content_token_count The number of prompt tokens served from the cache. This is the hidden win: same tokens, 90% cost savings. If it is zero or missing, the request did not go through the cache. Creating the cache is a separate call, billed once at this many tokens × input_price.
prompt_token_count Developers miss this: it includes BOTH cached + new tokens. Don't use it alone to calculate cache savings: subtract cached_content_token_count to get the tokens billed at the full input rate.
candidates_token_count The output tokens generated by the model. Caching does not change their price.
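The split between cached and newly sent tokens can be computed from two integers in the usage data. The following is a minimal sketch: `split_prompt_tokens` is a hypothetical helper that takes plain numbers, so it makes no assumption about the exact field names in your SDK version, and the token counts are made up.

```python
def split_prompt_tokens(prompt_tokens: int, cached_tokens: int) -> dict:
    """Separate a total prompt count into its cached and uncached parts."""
    if cached_tokens < 0 or cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    hit_ratio = cached_tokens / prompt_tokens if prompt_tokens else 0.0
    return {"cached": cached_tokens, "uncached": uncached, "hit_ratio": hit_ratio}


# Illustrative numbers: a 50,000-token cached document plus a 25-token question.
usage = split_prompt_tokens(prompt_tokens=50_025, cached_tokens=50_000)
print(usage["uncached"])  # 25
```

A hit ratio near zero on a request that was supposed to reuse the context is a quick signal that the cache was not used.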
Setup trap
Cache the reusable context, not the query. If the question is stored in the cache along with the document, every new question needs a new cache and nothing is reused. Structure it as [cached context] + [new query on each request]. The system instruction also belongs in the cache: a request that uses cached content cannot supply its own system instruction or tools.
Cost
Pricing (as of April 2026): Gemini 2.0 Flash input = $0.075/1M tokens. Cache creation = full price. Cache read = $0.0075/1M tokens (90% discount). Example: 100K token context cached, then queried 50 times. Cost = (100K × $0.075/1M) + 50 × (100K × $0.0075/1M) = $0.0075 + $0.0375 = $0.045 total. Without cache: 50 × (100K × $0.075/1M) = $0.375. Savings = $0.33 per 50 queries, about 88%. Any storage charge for keeping the cache alive comes on top of this.
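The arithmetic can be checked with a few lines of Python. This sketch assumes the per-million prices quoted above, one full-price pass to create the cache, and one discounted read of the context per query; `estimate_costs` and its parameter names are illustrative, and storage charges and per-query new tokens are ignored.

```python
def estimate_costs(context_tokens, queries, input_price_per_m, cached_price_per_m):
    """Compare input cost with and without a cache, in USD (illustrative model)."""
    full_pass = context_tokens * input_price_per_m / 1_000_000
    cached_pass = context_tokens * cached_price_per_m / 1_000_000
    with_cache = full_pass + queries * cached_pass   # create once, then read per query
    without_cache = queries * full_pass              # resend the context every time
    saved = without_cache - with_cache
    return with_cache, without_cache, saved / without_cache


with_cache, without_cache, fraction = estimate_costs(100_000, 50, 0.075, 0.0075)
print(f"with cache: ${with_cache:.5f}")
print(f"without cache: ${without_cache:.5f}")
print(f"saved: {fraction:.0%}")
```

Changing `queries` shows how quickly the one-time creation cost is amortized: with a single query the cache costs more than it saves, and the saving approaches 90% as the query count grows.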
Rate limits
Cache creation has a quota separate from the standard API rate limits, so you can hit rate limits on standard requests while cache creation still succeeds. If you exceed your tier's quota for cache operations (typically unlimited for paid tiers), new cache creations are rejected with a 429 error, but reads from existing caches still work.
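If cache creation is rejected for quota reasons, a common response is to retry with exponential backoff. The sketch below is generic rather than SDK-specific: `QuotaExceeded` is a stand-in exception defined here for illustration (with Google's client libraries a 429 typically surfaces as a resource-exhausted error), and `create_with_backoff` is a hypothetical wrapper around whatever call creates your cache.

```python
import time


class QuotaExceeded(Exception):
    """Stand-in for the SDK's 429 error (illustrative, not a real SDK class)."""


def create_with_backoff(create, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `create()` and retry on QuotaExceeded, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return create()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a fake creator that fails twice, then succeeds; waits are recorded, not slept.
calls = {"n": 0}
waits = []


def flaky_create():
    calls["n"] += 1
    if calls["n"] < 3:
        raise QuotaExceeded("429")
    return "cache-handle"


result = create_with_backoff(flaky_create, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the wrapper testable; in production the default `time.sleep` applies the real delays.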
Common gotcha
A cache holds exactly the content you uploaded, including whitespace, line breaks, and case, and it cannot be edited afterwards. If you regenerate the document or add a space, the old cache no longer represents your context and you need to create a new one. Keep the context in a variable and the cache handle alongside it, rather than regenerating either inline. Also, the 5-minute clock starts when the cache is created, not at the most recent request. If request 3 arrives 6 minutes after creation, the cache has expired and you have to create a new one.
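The timing rule can be made concrete with a small check. This sketch assumes the expiry is fixed at creation time plus the TTL and is not extended by later requests; `is_cache_live` is a hypothetical helper and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone


def is_cache_live(created_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """True while `now` is still before the expiry (creation time plus TTL)."""
    return now < created_at + ttl


created = datetime(2026, 4, 1, 14, 0, tzinfo=timezone.utc)
ttl = timedelta(minutes=5)

print(is_cache_live(created, ttl, created + timedelta(minutes=4)))  # True
print(is_cache_live(created, ttl, created + timedelta(minutes=6)))  # False
```

Running a check like this before each request lets the application recreate the cache proactively instead of discovering the expiry through a failed call.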
Error recovery
InvalidArgument: cache_control not supported
code: 3 (INVALID_ARGUMENT) message: "Request exceeds maximum total size"
code: 13 (INTERNAL) on second request
Cached token count is 0 on second request
AuthenticationError
Experienced dev note
At scale, caching becomes a load-distribution lever, not just a cost lever. If you're serving 1000 users asking questions about the same knowledge base, one cache serves all concurrent users without cache contention: Google handles the bucketing server-side. The real play: cache your carefully tuned few-shot examples once, then iterate on your prompt logic without re-uploading context. A 5-minute TTL is aggressive for batch workflows but well suited to interactive chat. For long-running sessions, extend the cache before it expires, for example by updating its TTL every 4 minutes, rather than letting it lapse and re-uploading a large knowledge base.
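One way to let many users share a cache for the same knowledge base is a small registry keyed by a fingerprint of the context. The sketch below is a hypothetical in-process design, not part of any SDK: `CacheRegistry`, `create_cache`, and the string handles are placeholders for whatever call actually creates the cache in your code, and the class is not thread-safe as written.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class CacheRegistry:
    """Maps a context fingerprint to a cache handle and its expiry (hypothetical design)."""

    def __init__(self, create_cache: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._create_cache = create_cache  # placeholder for the real cache-creation call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, context: str) -> str:
        key = hashlib.sha256(context.encode("utf-8")).hexdigest()
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now < entry[1]:
            return entry[0]  # same content, still live: reuse the handle
        handle = self._create_cache(context)  # new content or expired: create
        self._entries[key] = (handle, now + self._ttl)
        return handle


# Demo with a fake factory that only counts how many caches were created.
created = []


def fake_create(context: str) -> str:
    created.append(context)
    return f"cache-{len(created)}"


registry = CacheRegistry(fake_create, ttl_seconds=300)
a = registry.get("knowledge base v1")  # user A
b = registry.get("knowledge base v1")  # user B, same content
c = registry.get("a different PDF")    # user C, different content
print(a == b, a != c, len(created))
```

Keying on content rather than on the user is the design choice that matters here: identical contexts resolve to one handle, and different contexts never collide.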
Check your understanding
You're building a RAG system where users upload a 50K token PDF and ask 10 follow-up questions. The cache TTL is 5 minutes. If user A uploads at 2:00pm and asks questions until 2:04pm, then user B uploads a different PDF at 2:05pm and asks questions until 2:09pm, do both users benefit from each other's caches? Why or why not?
Show answer hint
A cache holds one specific context. User B's PDF is different content, so it gets its own cache and there is zero overlap. But within each user's session, every question after the cache is created reuses it. The key insight: cache isolation is per-content, not per-user, so users with identical contexts can share a cache, but different contexts don't interfere.