
What context caching provides: cost reduction

What you will learn

Context caching in the Gemini API cuts the price of repeated input tokens by up to 90% for large contexts that are reused across requests, making multi-turn document analysis and long-context retrieval economically viable at scale.

Why this matters

When you process the same large documents, system prompts, or knowledge bases across multiple requests, you pay the full input-token price each time. Context caching stores that context server-side once and bills later requests that reuse it at a discounted cached-token rate (10% of the input price at the rates quoted in the Cost section). For a 128K-token context queried 100 times at $0.075 per 1M input tokens, that cuts input cost from about $0.96 to about $0.10, a saving of roughly $0.86. The absolute numbers per document are small, but the ratio holds as contexts, user counts, and model prices grow, and at production scale with many concurrent users it is the difference between prohibitively expensive and sustainable operations.

Skip if: Your requests are one-off, the context changes on every request, or the shared context is below the API's minimum cacheable size (at least 1,024 tokens, and higher for some models). In those cases standard pricing is cheaper or caching is unavailable. Don't cache user-specific PII or sensitive data you shouldn't store server-side: caching persists the context on Google's infrastructure for the TTL duration.

Explanation

What context caching does: With explicit caching you upload the reusable part of a prompt (usually a system instruction plus a large document) once, as a CachedContent resource with a time-to-live (TTL). Google stores it on their servers until the TTL expires. Requests that reference the cache are billed at the cached-token rate for the stored content plus the normal input rate for the new tokens you add. The Gemini API has no cache_control request parameter and no type="ephemeral" setting.

How it works: caching.CachedContent.create() uploads the content and returns a handle whose name identifies the cache. Build a model from that handle with GenerativeModel.from_cached_content() and send only the new query; the cached content is prepended server-side. The response's usage_metadata reports cached_content_token_count, the number of prompt tokens served from the cache. Requests reference the cache by name rather than resending the text, so there is no content hash to match. Cached content cannot be edited: to change even one character, create a new cache.

When to use it: Use for retrieval-augmented generation (RAG) where the knowledge base is fixed but queries vary. Use for document analysis workflows where users ask multiple questions about the same PDF. Use for few-shot prompting where your in-context examples are large. A short TTL such as 5 minutes suits interactive sessions; batch jobs need a TTL that covers the whole run, at a correspondingly higher storage cost.

Request code

```python
import datetime
import os

import google.generativeai as genai
from google.generativeai import caching

# Read the key from the environment; never hard-code it in source.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Explicit caching needs a fixed-version model name (note the -001 suffix).
MODEL_NAME = "models/gemini-2.0-flash-001"

system_instruction = "You are an expert legal document analyst."

# Abbreviated excerpt for readability. A real cached document must meet the
# model's minimum cacheable token count, or cache creation is rejected.
large_context = """Here is a 50-page contract:

SECTION 1: PARTIES
This Agreement is entered into between Acme Corp ("Company") and ServicePro Inc ("Service Provider").

SECTION 2: SCOPE OF WORK
Service Provider shall deliver cloud infrastructure management services including 24/7 monitoring, patching, and incident response.

SECTION 3: COMPENSATION
Company shall pay Service Provider $50,000 monthly, due within 30 days of invoice.

SECTION 4: TERM
This Agreement shall commence on 2024-01-01 and continue for 24 months unless terminated for cause.

SECTION 5: LIABILITY
Neither party shall be liable for indirect, incidental, or consequential damages exceeding total fees paid in the preceding 12 months.

SECTION 6: TERMINATION
Either party may terminate with 60 days written notice. Immediate termination permitted for material breach uncured within 15 days.

SECTION 7: CONFIDENTIALITY
All proprietary information shared under this Agreement remains confidential for 3 years post-termination.

SECTION 8: GOVERNING LAW
This Agreement shall be governed by Delaware law."""

query_1 = "What is the monthly compensation and payment terms?"
query_2 = "What are the termination conditions and notice period?"

# Upload the shared context once. The TTL controls how long it is stored.
cache = caching.CachedContent.create(
    model=MODEL_NAME,
    display_name="contract-analysis",
    system_instruction=system_instruction,
    contents=[{"role": "user", "parts": [{"text": large_context}]}],
    ttl=datetime.timedelta(minutes=5),
)

# Requests through this model have the cached content prepended server-side.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)


def print_usage(label, response):
    """Print the token accounting for one response."""
    usage = response.usage_metadata
    print(f"\nUsage Stats ({label}):")
    print(f"  Prompt tokens (cached + new): {usage.prompt_token_count}")
    print(f"  Cached tokens: {usage.cached_content_token_count}")
    print(f"  Output tokens: {usage.candidates_token_count}")


print("\n=== FIRST REQUEST ===")
response_1 = model.generate_content(query_1)
print(f"Response 1: {response_1.text[:200]}...")
print_usage("Request 1", response_1)

print("\n=== SECOND REQUEST (SAME CACHE) ===")
response_2 = model.generate_content(query_2)
print(f"Response 2: {response_2.text[:200]}...")
print_usage("Request 2", response_2)

print("\n=== COST ANALYSIS (REQUEST 2, INPUT TOKENS ONLY) ===")
INPUT_PRICE = 0.075 / 1_000_000    # USD per token, rate quoted on this page
CACHED_PRICE = 0.0075 / 1_000_000  # USD per cached token, rate quoted on this page

usage_2 = response_2.usage_metadata
cached = usage_2.cached_content_token_count
uncached = usage_2.prompt_token_count - cached  # prompt count includes cached tokens
with_cache = cached * CACHED_PRICE + uncached * INPUT_PRICE
without_cache = usage_2.prompt_token_count * INPUT_PRICE

print(f"Second request with cache: ${with_cache:.6f}")
print(f"Second request without cache: ${without_cache:.6f}")
print(f"Savings on request 2: ${without_cache - with_cache:.6f} ({(without_cache - with_cache) / without_cache * 100:.1f}%)")
# Storage for the TTL window is billed separately and is not included above.

cache.delete()  # Stop storage charges once the session is over.
```

Authentication

Set your API key before making requests:

```python
import os
import google.generativeai as genai

# Export GOOGLE_API_KEY in your shell rather than hard-coding it in source.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
```

Obtain your key from Google AI Studio (https://aistudio.google.com/app/apikey). Caching does not require a separate permission on the key, though availability can depend on your billing tier.

Response shape

Field	Description
`text`	The model's text response to your query
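`usage_metadata`	Token accounting for the request: `prompt_token_count`, `cached_content_token_count`, `candidates_token_count`, `total_token_count`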

Field guide

cached_content_token_count

Number of prompt tokens served from the cache. This is the hidden win: these tokens are billed at the cached rate instead of the full input rate. If it is zero or absent, the request did not use a cache.

prompt_token_count

Developers miss this: it includes BOTH cached + new tokens. Don't use it alone to calculate cache savings: subtract cached_content_token_count to get the tokens billed at the full input rate.

candidates_token_count

Output tokens. Caching does not change their price.
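The split described above can be sketched in a few lines. This is a minimal illustration, assuming the usage object exposes prompt_token_count and cached_content_token_count the way the google-generativeai SDK's usage_metadata does. The helper names, the stand-in object, and its numbers are invented for the example, and storage fees are ignored.

```python
from types import SimpleNamespace


def split_prompt_tokens(usage):
    """Return (cached, uncached) prompt tokens from a usage_metadata-like object."""
    prompt = getattr(usage, "prompt_token_count", 0) or 0
    cached = getattr(usage, "cached_content_token_count", 0) or 0
    # prompt_token_count already includes the cached tokens, so subtract them
    # to get the tokens billed at the normal input rate.
    return cached, max(prompt - cached, 0)


def input_cost_usd(usage, input_price_per_m, cached_price_per_m):
    """Input-side cost of one request in USD, ignoring cache storage fees."""
    cached, uncached = split_prompt_tokens(usage)
    return (uncached * input_price_per_m + cached * cached_price_per_m) / 1_000_000


# Stand-in for response.usage_metadata; the numbers are made up for illustration.
example_usage = SimpleNamespace(
    prompt_token_count=100_040,
    cached_content_token_count=100_000,
    candidates_token_count=120,
)

cached_tokens, uncached_tokens = split_prompt_tokens(example_usage)
example_cost = input_cost_usd(example_usage, 0.075, 0.0075)
print(cached_tokens, uncached_tokens)
print(f"${example_cost:.6f}")
```

Using getattr with a default keeps the helper working on responses that never touched a cache, where the cached count may be absent or zero.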

Setup trap

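Caching is not switched on by a flag on a message. Create the cache first with caching.CachedContent.create(), then build the model with GenerativeModel.from_cached_content(); a model built with plain GenerativeModel(...) does not reference your cache, even if you resend the same text. A cache_control field on a message part is Anthropic's prompt-caching syntax and is not part of the Gemini request schema. Put the system instruction and document in the cache, and send only the query in each request.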

Cost

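Pricing quoted here (as of April 2026): Gemini 2.0 Flash input = $0.075/1M tokens. Cache creation = full price. Cache read = $0.0075/1M tokens (90% discount). Example: 100K token context cached, queried 50 times. Cost = (100K × $0.075/1M) + (100K × $0.0075/1M × 49) = $0.0075 + $0.03675 = $0.04425 total. Without cache: 50 × $0.0075 = $0.375. Savings = $0.33075 per 50 queries (about 88%). Cache storage is billed separately, by token count and storage time, and the discount can differ by model, so verify the rates on the current pricing page.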
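To rerun this arithmetic for your own context size and query count, a small calculator helps. The sketch below follows the accounting used in the example on this page: the first request pays full price and the remaining requests are cache reads. The function name and the optional storage parameters are assumptions for illustration, not published rates; set them from the pricing page if you want storage included.

```python
def caching_costs(context_tokens, n_requests, input_price_per_m, cached_price_per_m,
                  storage_price_per_m_hour=0.0, storage_hours=0.0):
    """Compare input-side cost (USD) of n_requests with and without a cache.

    Only the shared context is counted; per-query tokens and output tokens
    cost the same either way and are left out.
    """
    millions = context_tokens / 1_000_000
    without_cache = millions * input_price_per_m * n_requests
    # First request pays full price; the remaining n - 1 are cache reads.
    with_cache = (millions * input_price_per_m
                  + millions * cached_price_per_m * (n_requests - 1))
    # Optional storage fee (hypothetical rate; defaults to zero).
    with_cache += millions * storage_price_per_m_hour * storage_hours
    return {
        "without_cache": without_cache,
        "with_cache": with_cache,
        "savings": without_cache - with_cache,
    }


# The example from this page: 100K tokens, 50 queries, $0.075 and $0.0075 per 1M.
result = caching_costs(100_000, 50, 0.075, 0.0075)
for name, value in result.items():
    print(f"{name}: ${value:.5f}")
```

With storage left at zero this reproduces the figures in the example; a non-zero storage rate shows how quickly a long TTL eats into the savings when the cache is queried only a few times.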

Rate limits

Cache creation may be counted against a quota separate from standard API rate limits, so you can hit rate limits on standard requests while cache creation still succeeds. If you exceed your tier's quota for cache operations, new cache creations are rejected with a 429 error, while requests against existing caches keep working. Quotas vary by tier and model, so confirm the current numbers in the rate-limit documentation.

Common gotcha

Cached content is immutable: to change the document, even by one character, create a new cache and rebuild the model from it. Keep a reference to the cache (or its name) instead of recreating it per request, otherwise you pay the creation cost every time. Also, the TTL clock starts when the cache is created and is not reset by requests that use it. With a 5-minute TTL, a request sent 6 minutes after creation fails because the cache is gone, however recently it was last used. Extend the lifetime with cache.update(ttl=...) before it expires, or create a new cache.
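Because expiry is easy to misjudge, it can help to compute it explicitly. The sketch below assumes the cache expires at creation time plus TTL and that using the cache does not move that deadline. The helper names and the 60-second safety margin are arbitrary choices for illustration.

```python
import datetime


def seconds_until_expiry(created_at, ttl, now):
    """Seconds left before a cache created at `created_at` with `ttl` expires."""
    return (created_at + ttl - now).total_seconds()


def should_extend(created_at, ttl, now, safety_margin=datetime.timedelta(seconds=60)):
    """True when the cache is still alive but inside the safety margin."""
    remaining = seconds_until_expiry(created_at, ttl, now)
    return 0 < remaining <= safety_margin.total_seconds()


created = datetime.datetime(2025, 1, 1, 14, 0, 0, tzinfo=datetime.timezone.utc)
ttl = datetime.timedelta(minutes=5)

# Four and a half minutes in: 30 seconds left, so extend now.
check_1 = created + datetime.timedelta(minutes=4, seconds=30)
print(seconds_until_expiry(created, ttl, check_1), should_extend(created, ttl, check_1))

# Six minutes in: already expired, so a new cache has to be created instead.
check_2 = created + datetime.timedelta(minutes=6)
print(seconds_until_expiry(created, ttl, check_2), should_extend(created, ttl, check_2))
```

In the google-generativeai SDK, a true result from should_extend would be the moment to call cache.update(ttl=...) on the CachedContent handle. A negative remaining time means the handle is dead and only a fresh cache will do.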

Error recovery

InvalidArgument: cache_control not supported

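Two likely causes. First, cache_control is not a Gemini request field: remove it and reference a cache created with caching.CachedContent.create() instead. Second, the model may not support caching: use a fixed-version model name such as gemini-2.0-flash-001 or gemini-1.5-pro-001. The original gemini-pro does not support it.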

code: 3 (INVALID_ARGUMENT) message: "Request exceeds maximum total size"

Your cached context plus the new query exceeds the model's context window. Caching does not raise that limit: the cached tokens still count toward it on every request. If your document is larger, split it into sections and cache them separately.

code: 13 (INTERNAL) on second request

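The most likely cause is that the cache expired (for example, more than 5 minutes passed with a 5-minute TTL) or was deleted, so the name your model references no longer exists. Create a fresh cache, rebuild the model with from_cached_content(), and resend the query. If the cache is still alive, treat the error as transient and retry with backoff.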

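cached_content_token_count is 0 on second request

Cache miss: the request did not go through the cache. Check that the model was built with GenerativeModel.from_cached_content() rather than GenerativeModel(...), and that you are sending only the query instead of resending the document. This is the most common silent failure. Log the cache name and its expiry time alongside each request to debug.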

AuthenticationError

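Your API key is missing, invalid, or tied to a project where the Gemini API is not enabled. There is no separate cache permission. Check that GOOGLE_API_KEY is set in the environment that runs your code, or generate a new key in Google AI Studio.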
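For the expired-cache case above, one recovery pattern is to rebuild the cache once and retry. The sketch below is SDK-agnostic: the function, the CacheGone exception, and the fake callables are all hypothetical stand-ins. In real code, ask would wrap a generate_content call, recreate_cache would wrap cache creation, and is_cache_gone would inspect whatever error your SDK version raises for a missing cache.

```python
def ask_with_recreate(ask, recreate_cache, cache, is_cache_gone):
    """Call ask(cache); if the cache is gone, rebuild it once and retry.

    ask:            callable(cache) -> response
    recreate_cache: callable() -> new cache handle
    is_cache_gone:  callable(exc) -> bool, decides whether an error means
                    the cache expired or was deleted
    Returns (response, cache) so the caller can keep the fresh handle.
    """
    try:
        return ask(cache), cache
    except Exception as exc:
        if not is_cache_gone(exc):
            raise  # Unrelated failure: do not hide it behind a retry.
        fresh = recreate_cache()
        return ask(fresh), fresh


class CacheGone(Exception):
    """Hypothetical stand-in for the SDK error raised for a missing cache."""


live_caches = {"cache-2"}  # "cache-1" has expired in this simulation.


def fake_ask(cache_name):
    if cache_name not in live_caches:
        raise CacheGone(cache_name)
    return f"answer via {cache_name}"


response, cache_name = ask_with_recreate(
    ask=fake_ask,
    recreate_cache=lambda: "cache-2",
    cache="cache-1",
    is_cache_gone=lambda exc: isinstance(exc, CacheGone),
)
print(response, cache_name)
```

Retrying only once is deliberate: if the freshly created cache also fails, the problem is not expiry, and the second exception should surface to the caller.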

Experienced dev note

At scale, caching becomes a load distribution lever, not just a cost lever. If you're serving 1000 users asking questions about the same knowledge base, one cache can serve all of them concurrently, since every request simply references the same cache by name. The real play: cache your curated few-shot examples once, then iterate on your prompt logic without re-uploading context. A 5-minute TTL is perfect for interactive chat but too short for batch workflows, so set the TTL to match the job. For long-running sessions, extend the cache with cache.update(ttl=...) before it expires rather than sending no-op queries, which cost tokens and do not reset the timer.
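Sharing one cache across many users needs some bookkeeping on your side, so that identical content is uploaded once. A minimal sketch of such a registry follows. The class, its method, and the fake creator are hypothetical; a real create_cache callable would wrap the SDK's cache creation and return its handle or name. The sketch is single-process, not thread-safe, and does not track expiry.

```python
import hashlib


class CacheRegistry:
    """Maps a content fingerprint to a cache handle so callers share one cache."""

    def __init__(self, create_cache):
        self._create_cache = create_cache  # callable(content) -> cache handle
        self._by_digest = {}

    def get_or_create(self, content):
        """Return the existing handle for this exact content, or create one."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in self._by_digest:
            self._by_digest[digest] = self._create_cache(content)
        return self._by_digest[digest]


created = []  # Records every upload the fake creator performs.


def fake_create_cache(content):
    created.append(content)
    return f"cache-{len(created)}"


registry = CacheRegistry(fake_create_cache)
handle_a = registry.get_or_create("knowledge base v1")  # first user: uploads
handle_b = registry.get_or_create("knowledge base v1")  # second user: reuses
handle_c = registry.get_or_create("knowledge base v2")  # different content: uploads
print(handle_a, handle_b, handle_c, len(created))
```

The hash here is only a local lookup key for deciding which existing cache to reference. In production you would also store each cache's expiry time and drop entries that have lapsed.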

Check your understanding

You're building a RAG system where users upload a 50K token PDF and ask 10 follow-up questions. The cache TTL is 5 minutes. If user A uploads at 2:00pm and asks questions until 2:04pm, then user B uploads a different PDF at 2:05pm and asks questions until 2:09pm, do both users benefit from each other's caches? Why or why not?

Answer hint

Each cache holds one specific piece of content and is referenced by name. User B's PDF is different content in a different cache, so there is zero overlap. But within each user's session, every question reuses that user's cache as long as it falls inside the TTL. The key insight: isolation is per-cache, not per-user, so requests that reference the same cache share the benefit, and different caches don't interfere.

VERSION Context caching is exposed through the caching module of the google-generativeai Python SDK (google.generativeai.caching). It requires a fixed-version model name such as gemini-2.0-flash-001 or gemini-1.5-pro-001; the original gemini-pro is not supported. TTL is set when the cache is created, defaults to one hour, and can be extended later with cache.update().
