API Advanced hard · 8 min

Cache TTL and Extension

What you will learn
Control how long cached prompts persist and extend their lifetime to avoid recomputation costs on large context windows.

Why this matters

Cached context in the Gemini API can represent many thousands of tokens. If you do not manage TTL extension, your cache can expire mid-conversation, and you pay full price to resend a 128K-token context when a 30-minute extension would have prevented it.

Skip if: You can rely on the TTL set at creation, with no extension, when your context is small (<1000 tokens), your session is truly one-off with no follow-up requests, or you're building a stateless REST endpoint where cache persistence across requests adds complexity. For chat applications or multi-turn agents, cache extension is almost always worth the minimal overhead.

Explanation

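Cache TTL basics: When you create a CachedContent resource, the Gemini API stores the prompt prefix (system instruction + initial context) for the TTL you request; if you set none, the documented default is 1 hour. The example in this lesson sets 5 minutes explicitly. Requests that reference the cache within that window are billed for the cached tokens at a reduced rate. This lesson uses 10% of the normal input price for its arithmetic; the actual discount depends on the model, and cache storage is billed separately for the time the cache exists. After the TTL expires the cache is removed, and you pay full input price to send the context again.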

How extension works: The cache carries two timestamps: create_time and expire_time. To extend the lifetime, call update() on the cached content with either ttl=<duration> or expire_time=<timezone-aware datetime>. A ttl is counted from the moment of the update; it is not added to the previous expiry. This is a separate API call that processes no prompt tokens, and it should happen before the inference request that depends on the cache. The extension is explicit: the cache does not auto-renew just because you used it.
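To make the timing concrete, here is a minimal pure-Python model of the expiry arithmetic. It makes no API calls. The helper name expiry_after_update is hypothetical, and the model assumes that a TTL passed to update() is measured from the time of the update.

```python
from datetime import datetime, timedelta, timezone


def expiry_after_update(update_time: datetime, ttl: timedelta) -> datetime:
    """Model of update(ttl=...): new expiry = time of the update + ttl."""
    return update_time + ttl


created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
first_expiry = created + timedelta(minutes=5)  # TTL chosen at creation

# Extend at minute 3 with a 10-minute TTL: the old 5 minutes are replaced,
# not topped up, so the cache now expires at minute 13 (not minute 15).
extended = expiry_after_update(created + timedelta(minutes=3), timedelta(minutes=10))

print(first_expiry.time(), extended.time())  # 12:05:00 12:13:00
```

A request arriving at minute 7 would miss the original window but fall inside the extended one.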

When to use: Extend cache TTL in multi-turn conversations, RAG systems where the same documents are queried multiple times, or any workflow where you've invested tokens upfront and expect follow-ups within hours. Set a conservative TTL (5–60 minutes) for development and a longer one (1–24 hours) for production agents that stay warm, keeping in mind that storage is billed for as long as the cache exists.

Request code

python
import datetime
import os

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

# Step 1: Create cached content with an explicit 5-minute TTL.
# The 'x' * 50000 filler stands in for real document text. The API enforces
# a minimum token count for cached content, and some models may need a
# version-pinned name (for example a '-001' suffix) to support caching.
cached_content = caching.CachedContent.create(
    model='gemini-2.0-flash',
    display_name='large_document_cache',
    system_instruction='You are an expert analyst.',
    contents=[
        genai.protos.Content(
            role='user',
            parts=[
                genai.protos.Part(
                    text='Here is a 100K token research paper: ' + 'x' * 50000
                )
            ]
        )
    ],
    ttl=datetime.timedelta(minutes=5),
)

print(f'Cache created: {cached_content.name}')
print(f'Expires at: {cached_content.expire_time}')

# Step 2: Build a model bound to the cache, then make the first request
model = genai.GenerativeModel.from_cached_content(cached_content=cached_content)
response1 = model.generate_content(
    'Summarize the first 3 key points from the paper.'
)
print(f'Cached tokens: {response1.usage_metadata.cached_content_token_count}')
print(f'First response (cache hit): {response1.text[:100]}')

# Step 3: Extend the TTL before it expires (must be done explicitly).
# The new TTL is counted from the time of this call.
cached_content.update(ttl=datetime.timedelta(hours=2))
print(f'Cache extended to: {cached_content.expire_time}')

# Step 4: Make second inference within new TTL window
response2 = model.generate_content(
    'What are the statistical findings?'
)
print(f'Second response (cache still valid): {response2.text[:100]}')

# Optional: Delete cache early if no longer needed (stops storage charges)
cached_content.delete()
print('Cache deleted')

Authentication

Ensure your Google API key is set: `export GOOGLE_API_KEY='your-key'`. The project that owns the key must have the Generative Language API enabled. Cache operations use the same credentials as standard requests: no additional setup required.

Response shape

Field            Description
name             Unique cache identifier (e.g., 'cachedContents/abc123...')
model            Model used for this cache (e.g., 'models/gemini-2.0-flash')
display_name     User-provided label for the cache
usage_metadata   Token accounting for the cached content (e.g., total_token_count)
create_time      Timestamp when the cache was created
expire_time      Timestamp when the cache will expire
ttl              Input only: the lifetime you request on create or update; responses report expire_time instead

Field guide

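usage_metadata.cached_content_token_count

This is the money-saving metric, reported on each generate_content response. It counts the prompt tokens that were served from the cache and are therefore billed at the discounted rate rather than the full input rate. If it is 0, your cache was not used: check expire_time.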
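As a rough sketch, the cached share of a prompt can be computed from the usage metadata. This assumes the response exposes prompt_token_count (which includes the cached tokens) and cached_content_token_count; the helper cached_fraction and the stand-in object are hypothetical and exist only so the example runs offline.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens served from the cache, between 0.0 and 1.0."""
    prompt = getattr(usage, 'prompt_token_count', 0) or 0
    cached = getattr(usage, 'cached_content_token_count', 0) or 0
    return cached / prompt if prompt else 0.0


# Stand-in for response.usage_metadata; real values come from the API.
usage = SimpleNamespace(prompt_token_count=100_500,
                        cached_content_token_count=100_000)
print(f'{cached_fraction(usage):.1%} of the prompt came from the cache')
```

A value near zero on a follow-up request is a quick signal that the cache was missing or not referenced.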

expire_time

The field developers overlook: once this is in the past the cache is gone, and update() cannot revive it. The failure may not surface as a clear error in every SDK version, so always compare expire_time against the current UTC time before extending.

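ttl

Input only. You pass it to create() or update() as a duration measured from the time of the call; it is not a "seconds remaining" counter. To decide whether extension is needed now or later, subtract the current UTC time from expire_time.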

Setup trap

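The most common mistake: assigning a new value to an attribute on the cached_content object and then calling update() with no arguments. The SDK takes the new lifetime as a keyword argument to update(), not from attributes you set locally: use update(ttl=datetime.timedelta(hours=2)) or update(expire_time=<timezone-aware datetime>), never both. The library accepts ttl as a datetime.timedelta or as a plain number of seconds; prefer timedelta so the unit is unambiguous.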

Cost

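Cached input tokens are billed at a discount. Taking this lesson's working figure of 10% of the normal input price (the real discount depends on the model) and 1 credit = 1K input tokens: a 10K-token prompt costs 10 credits for creation, then 1 credit per reuse. If you have a 100K-token document, cache creation = 100 credits, then 10 credits per follow-up vs. 100 per non-cached request. With the cache kept alive for 4 hours and 6 follow-ups, you save 6 × 90 = 540 credits per session, before subtracting the storage charge that accrues for every hour the cache exists. The dollar value depends on the model's per-token price.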
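The arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: cached tokens cost 10% of the normal input price, 1 credit equals 1K input tokens, and storage is passed in as a separate figure because its rate is model-specific. The function name followup_savings and all parameters are illustrative.

```python
def followup_savings(doc_tokens: int, followups: int, cached_rate: float = 0.10,
                     tokens_per_credit: int = 1_000,
                     storage_credits: float = 0.0) -> float:
    """Credits saved on follow-up requests by reusing a cached document.

    Compares resending the full document on every follow-up with paying
    the discounted cached rate, then subtracts any storage charge.
    """
    full = doc_tokens / tokens_per_credit              # one uncached send
    saved = followups * full * (1.0 - cached_rate)     # discount per reuse
    return round(saved - storage_credits, 2)


print(followup_savings(100_000, 6))  # 540.0
```

Passing a non-zero storage_credits shows how a long TTL with few follow-ups can erase the saving.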

Rate limits

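Cache creation and update operations are ordinary API requests with no prompt tokens, so they count against request-rate quotas rather than token quotas. This lesson cites a 'writes per minute' limit of about 60/min as typical; confirm the real figure on your project's quota page. If you're extending caches for thousands of concurrent sessions, you may hit request limits before token limits. Issue the updates asynchronously and stagger extension times so they do not arrive in bursts.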
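One simple way to stagger extensions is to add random jitter to each session's refresh delay. The sketch below is illustrative: jittered_delay is a hypothetical helper, and it only shortens the delay so that a refresh never lands later than planned.

```python
import random
from datetime import timedelta
from typing import Optional


def jittered_delay(base: timedelta, spread: float = 0.2,
                   rng: Optional[random.Random] = None) -> timedelta:
    """Shorten `base` by a random amount of up to `spread` (a fraction of base)."""
    rng = rng or random.Random()
    return base * rng.uniform(1.0 - spread, 1.0)


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [jittered_delay(timedelta(minutes=4), rng=rng) for _ in range(3)]
print([round(d.total_seconds()) for d in delays])
```

With a 5-minute TTL and a 4-minute base delay, each session refreshes somewhere between 3.2 and 4 minutes after its last update, which spreads the calls across the minute.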

Common gotcha

Calling update() on a cached_content object does not first re-fetch the object from the API: any decision you base on its expire_time uses local state. If you're working with multiple processes or hold stale local copies, you may extend too late, or try to extend a cache that another process already deleted. In production systems, fetch a fresh copy via CachedContent.get(name) before deciding whether to extend.

Error recovery

InvalidArgument: Incorrect time value
The expire_time you passed is in the past or before create_time. Compute it as datetime.now(timezone.utc) + timedelta(...), not from a hardcoded past timestamp, or pass ttl=timedelta(...) instead of an absolute time.
NotFound: Could not find resource
The cache was already deleted, or its TTL expired and it was auto-removed. Recover by creating a new CachedContent instead of extending.
PermissionDenied
The credentials are not allowed to modify this cache. Verify that the key belongs to the project that created the cache and that the Generative Language API is enabled for it.
FailedPrecondition: Cache is expired
The expiry time has passed. Call get() first to refresh state before extending, since local state may be stale; if the cache is gone, create a new one.
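The recovery path for a missing cache can be sketched as a small wrapper. This is illustrative only: extend_or_recreate is a hypothetical helper, and the exception type is injected so the example runs offline. With the real SDK the type to catch would likely be google.api_core.exceptions.NotFound, which is an assumption to verify in your environment.

```python
from datetime import timedelta
from typing import Callable, Type


def extend_or_recreate(cache, ttl: timedelta, recreate: Callable[[], object],
                       not_found: Type[Exception] = LookupError):
    """Try to extend `cache`; if the service reports it missing, build a new one.

    `recreate` is a zero-argument callable that creates a fresh cache and
    pays the full creation cost again. LookupError is only an offline
    default for the "cache missing" exception type.
    """
    try:
        cache.update(ttl=ttl)
        return cache
    except not_found:
        return recreate()


class GoneCache:
    """Offline stand-in for a cache that has already expired server-side."""
    def update(self, *, ttl):
        raise LookupError('cachedContents/abc123 not found')


result = extend_or_recreate(GoneCache(), timedelta(hours=2),
                            recreate=lambda: 'new-cache')
print(result)  # new-cache
```

Other errors are deliberately left to propagate, since retrying on InvalidArgument or PermissionDenied would only repeat the failure.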

Experienced dev note

Cache extension is cheap (no prompt tokens) but requires an explicit API call. Build a helper: `def maybe_extend_cache(cc, min_ttl_minutes=5): if (cc.expire_time - datetime.now(timezone.utc)).total_seconds() < min_ttl_minutes * 60: cc.update(ttl=timedelta(hours=2))`. Call this once per conversation turn before inference. This small function prevents most avoidable cache misses in production and costs no tokens.
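Written out as a runnable sketch, the helper might look like the following. It assumes the cache object exposes a timezone-aware expire_time and an update(ttl=...) method; FakeCache is a hypothetical offline stand-in, and the extra check for an already-expired cache is an addition to the one-liner above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def maybe_extend_cache(cc, min_ttl_minutes: int = 5,
                       extend_by: timedelta = timedelta(hours=2),
                       now: Optional[datetime] = None) -> bool:
    """Extend `cc` when less than `min_ttl_minutes` remain. True if extended."""
    now = now or datetime.now(timezone.utc)
    remaining = (cc.expire_time - now).total_seconds()
    if 0 < remaining < min_ttl_minutes * 60:
        cc.update(ttl=extend_by)
        return True
    return False  # plenty of time left, or already expired (recreate instead)


class FakeCache:
    """Offline stand-in exposing the two members the helper touches."""
    def __init__(self, expire_time: datetime):
        self.expire_time = expire_time

    def update(self, *, ttl: timedelta) -> None:
        self.expire_time = datetime.now(timezone.utc) + ttl


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
soon = FakeCache(now + timedelta(minutes=2))
later = FakeCache(now + timedelta(minutes=30))
print(maybe_extend_cache(soon, now=now), maybe_extend_cache(later, now=now))  # True False
```

Using timezone-aware datetimes throughout avoids the TypeError that arises when a naive datetime.utcnow() value is subtracted from an aware timestamp.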

Check your understanding

You have a cached 50K-token document created with a 5-minute TTL. Your first follow-up arrives 3 minutes later (cache hit, pays 0.1x). Your second arrives 7 minutes after cache creation (after the 5-minute TTL has passed, so the cache is gone and you pay 1x). How would you have prevented the second miss with a single update() call, and when exactly should you make that call?

Show answer hint

You must call update() to extend expire_time *before* the 5-minute TTL runs out, for example right around the first follow-up at minute 3, with a new TTL long enough to cover the second request. Waiting until after expiration to extend is too late.

VERSION The examples in this lesson target google-generativeai 0.8.x, where CachedContent.create() and update() accept ttl as a datetime.timedelta or a number of seconds, and update() also accepts an absolute expire_time. Older releases that predate the caching module do not provide CachedContent; upgrade before following this lesson.
