API Intermediate medium · 6 min

What prompt caching does: 90% cost reduction

What you will learn

Prompt caching stores repeated context in Anthropic's cache, reducing the per-token cost of that context by 90% after the first request.

Why this matters

If you're sending the same large context (system prompt, documents, examples) repeatedly, caching can drop your API costs dramatically. A 30K-token cached context costs 1.25x the normal input rate on first use, then 0.1x per token on subsequent requests: that's the difference between a dollar and a dime for the same context.

Skip if: (1) your context changes frequently (every change forces a fresh cache write at 1.25x, so caching becomes a liability), (2) the first request is latency-critical (a cache write can add latency, roughly 500-800 ms by this guide's estimate), or (3) you're batch-processing one-off requests with unique contexts. Use it for chat applications, document-based Q&A, or multi-turn interactions with stable system prompts.

Explanation

What it does: Prompt caching pre-computes and stores parts of your request (system prompts, documents, code files) on Anthropic's servers. Subsequent requests reuse that cached content instead of re-processing it, dropping the per-token cost of the cached portion from the standard rate to 10% of the standard rate.

How it works: You mark content blocks (system text, message content, tool definitions) with cache_control={"type": "ephemeral"}. On the first request, everything up to and including the marked block is written to the cache and charged at 1.25x the normal input rate. The cache entry lives for 5 minutes by default, and each cache hit refreshes that window. Any request in the window whose prefix matches exactly skips reprocessing: cached tokens are billed at 0.1x, while new input tokens and output tokens are billed at the standard rate.

When to use it: Chat systems with a fixed system prompt, RAG applications loading the same documents repeatedly, and code analysis tools that reference the same codebase across queries. The minimum cacheable length is 1024 tokens on many models and higher on some, so a shorter prompt is processed without caching even if you mark it.
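The minimum-length rule can be checked before you add cache_control. The following is a minimal sketch, assuming the SDK's `client.messages.count_tokens` method; the helper name `is_cacheable` is hypothetical, and the 1024 default simply mirrors the figure above and may need raising for your model.

```python
def is_cacheable(client, model, system_text, min_tokens=1024):
    """Return True if system_text appears long enough to be cached.

    `client` is an anthropic.Anthropic instance (or any object exposing the
    same messages.count_tokens method). `min_tokens` is an assumption taken
    from the text above; adjust it for the model you use.
    """
    count = client.messages.count_tokens(
        model=model,
        system=system_text,
        # count_tokens needs at least one message; this short placeholder adds
        # a few tokens, so treat the result as a slight overestimate.
        messages=[{'role': 'user', 'content': 'placeholder'}],
    )
    return count.input_tokens >= min_tokens
```

Because the placeholder message inflates the count slightly, a prompt that only just passes this check may still fall short of the real threshold.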

Request code

python

import os
from anthropic import Anthropic

# The SDK would also pick up ANTHROPIC_API_KEY by itself; passing it is just explicit.
client = Anthropic(api_key=os.environ.get('ANTHROPIC_API_KEY'))

# Stand-in for a large, stable context. The filler is repeated enough times to
# clear the minimum cacheable length; a real app would put reference text here.
large_system_prompt = '''You are a legal document analyzer. You have access to the complete United States Tax Code (Title 26), Internal Revenue Code Section 1-9834, spanning over 54,000 lines. Your task is to answer questions about tax law with precision and cite relevant sections.''' + '\n\n' + ('Relevant sections: ' * 2000)

response = client.messages.create(
    model='claude-opus-4-6',
    max_tokens=512,
    system=[
        {
            'type': 'text',
            'text': large_system_prompt,
            # Marks the end of the prefix to cache.
            'cache_control': {'type': 'ephemeral'}
        }
    ],
    messages=[
        {
            'role': 'user',
            'content': 'What is the tax treatment of spousal lifetime access trusts (SLATs)?'
        }
    ]
)

print(f'Response: {response.content[0].text}')
print(f'Input tokens (uncached): {response.usage.input_tokens}')
print(f'Input tokens (written to cache): {response.usage.cache_creation_input_tokens}')
print(f'Input tokens (read from cache): {response.usage.cache_read_input_tokens}')
print(f'Output tokens: {response.usage.output_tokens}')

Authentication

Set the ANTHROPIC_API_KEY environment variable before instantiating the client. The SDK reads it at init time: export ANTHROPIC_API_KEY='sk-ant-...'

Response shape

Field	Description
`content`	List of content blocks; the first is the text response
`content[0].text`	The model's text response
`usage.input_tokens`	Non-cached input tokens processed
`usage.cache_creation_input_tokens`	Tokens written to cache (first request, or any request after the cache expired or the content changed)
`usage.cache_read_input_tokens`	Tokens read from cache (requests that hit an existing cache entry)
`usage.output_tokens`	Tokens in the response

Field guide

cache_creation_input_tokens

Non-zero whenever a cache entry is written: on the first request, and again after the cache expires or the cached content changes. These tokens are charged at 1.25x the normal rate.

cache_read_input_tokens

The hidden savings field: shows how many tokens were reused from cache at 0.1x cost. This is where your 90% discount comes from.

usage.input_tokens

Input tokens that were neither read from nor written to the cache (everything after the last cache breakpoint). Always charged at the standard rate.
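The three usage fields together tell you whether a request wrote to the cache, read from it, or missed it entirely. As a rough sketch (the helper name `describe_cache_usage` and its status labels are hypothetical, not part of the SDK), you could classify each response like this:

```python
def describe_cache_usage(usage):
    """Classify a response's usage as a cache 'write', 'hit', 'hit+write', or 'miss'.

    Accepts any object exposing the three usage attributes; missing or None
    values are treated as zero.
    """
    written = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    read = getattr(usage, 'cache_read_input_tokens', 0) or 0
    uncached = getattr(usage, 'input_tokens', 0) or 0

    if read and written:
        status = 'hit+write'  # reused a cached prefix and cached new content after it
    elif read:
        status = 'hit'
    elif written:
        status = 'write'
    else:
        status = 'miss'       # nothing cached, e.g. the prompt was below the minimum
    return {'status': status, 'written': written, 'read': read, 'uncached': uncached}
```

Logging `describe_cache_usage(response.usage)['status']` on every call is a cheap way to notice when a prompt you expected to hit the cache keeps reporting 'write' or 'miss'.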

Setup trap

Cache writes can add latency to the first request (this guide's estimate is 500-800 ms; measure it for your own prompts). If the first request seems slow while later ones are fast, that's likely the cache being written. Plan for this in latency budgets. Also, cache expiration is handled server-side: each cache hit resets the 5-minute window, but you can't query the remaining TTL or manually expire an entry.

Cost

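Example: 30,000-token system prompt + 500-token user question, priced at $15 per million input tokens (the Opus rate these figures assume) and ignoring output tokens. First request: (30,000 × 1.25) + (500 × 1) = 38,000 token-equivalents ≈ $0.57. Second request (within 5 min): (30,000 × 0.1) + (500 × 1) = 3,500 token-equivalents ≈ $0.05. For a 100-query session: the first query costs $0.57 and the next 99 cost about $0.0525 each, for roughly $5.77 total. Without caching, 100 queries × 30,500 tokens ≈ $45.75. That's about 87% savings on input cost, approaching 90% as the cached share of the prompt grows.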
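To rerun this arithmetic for your own prompt sizes, a small calculator helps. This is a sketch under stated assumptions: the function name `session_cost` is hypothetical, the $15 per million input tokens default is an assumed price rather than a quoted one, output tokens are ignored, and every request after the first is assumed to hit the cache.

```python
def session_cost(cached_tokens, fresh_tokens, requests, price_per_mtok=15.0,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Return (cached_cost, uncached_cost) in dollars, input tokens only.

    Assumes requests >= 1, one cache write on the first request, and a
    cache hit on every later request.
    """
    per_token = price_per_mtok / 1_000_000
    first = cached_tokens * write_multiplier + fresh_tokens   # cache write
    later = cached_tokens * read_multiplier + fresh_tokens    # cache read
    cached_cost = (first + later * (requests - 1)) * per_token
    uncached_cost = (cached_tokens + fresh_tokens) * requests * per_token
    return cached_cost, uncached_cost


with_cache, without_cache = session_cost(30_000, 500, 100)
savings = 1 - with_cache / without_cache
print(f'cached: ${with_cache:.2f}  uncached: ${without_cache:.2f}  savings: {savings:.0%}')
```

Calling `session_cost(30_000, 500, 1)` shows the flip side: for a single request the cached path is the more expensive one, because the write premium is never recovered.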

Rate limits

Cache writes count toward your input-token rate limits like any other input. Cache reads are treated more leniently: on many models, tokens read from cache do not count toward the input-tokens-per-minute limit. If you're hitting input-token limits, moving repeated context into the cache can let more queries through. Check the rate-limit documentation for your model, since the treatment varies.

Common gotcha

You'll mark the context with cache_control, but the cache only activates if the EXACT same content, byte for byte and in the same position, appears in subsequent requests within 5 minutes. If you regenerate your system prompt (different whitespace, reordered fields, or even different examples), the cache misses and you start fresh. Developers often think the cache is broken when they've actually modified the prompt slightly.
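A quick way to catch accidental prompt drift is to fingerprint the exact text you send and compare it across requests. This is a local debugging sketch, not a description of how the API keys its cache; the helper name `prompt_fingerprint` is hypothetical.

```python
import hashlib


def prompt_fingerprint(text):
    """Return a short SHA-256 hex digest of the prompt's exact UTF-8 bytes."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()[:16]


a = prompt_fingerprint('You are a legal analyst.\nCite sections.')
b = prompt_fingerprint('You are a legal analyst.\n Cite sections.')  # one extra space
print(a == b)  # False: a single added space changes the fingerprint
```

Logging the fingerprint next to the usage fields makes it obvious whether a cache miss lines up with a changed prompt.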

Error recovery

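BadRequestError (invalid_request_error): 'cache_control' is not supported for model

The model you requested does not support prompt caching. The Python SDK surfaces HTTP 400 responses as anthropic.BadRequestError, not InvalidRequestError. Claude 3.5 Sonnet (including claude-3-5-sonnet-20241022) does support caching, so this usually points to an older model such as claude-3-sonnet-20240229. Switch to claude-opus-4-6 or claude-sonnet-4-6.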

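BadRequestError (invalid_request_error): 'cache_control' can only be used with 'system' or 'text' type blocks

The exact wording varies, but the cause is a cache_control marker on a block that can't carry one. Text blocks, system blocks, tool definitions, images, documents, and tool results can all be marked; thinking blocks and empty text blocks cannot. Move the marker to the nearest supported block.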

RateLimitError after cache miss

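Your cached content expired or changed, so the full context counted as uncached input and pushed you over the limit. Retry with exponential backoff, honoring the retry-after header if present. The first successful request rewrites the cache, and later requests read from it again.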
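For the rate-limit case, retrying with exponential backoff is a common recovery pattern. The sketch below is generic rather than SDK-specific: `call_with_backoff` and its parameters are hypothetical names, and the exception type to retry on is passed in by the caller. The SDK also has its own built-in retry behavior, so treat this as an illustration of the pattern.

```python
import random
import time


def call_with_backoff(send, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on `retry_on` exceptions with exponential backoff.

    `sleep` is injectable so the delays can be observed or skipped in tests.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1s, 2s, 4s, ... plus up to 25% jitter so that parallel
            # clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

With the client from the request code, usage would look like `call_with_backoff(lambda: client.messages.create(...), retry_on=anthropic.RateLimitError)`.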

Experienced dev note

Prompt caching is a hidden leverage point for RAG systems. When the same retrieved documents serve many queries, you cache that retrieval context once, then append only the new user query. A 10,000-token document cached across 1,000 queries saves about 90% of that document's input cost: at $15 per million input tokens, roughly $135 of the $150 you'd pay uncached, provided the queries arrive close enough together to keep the cache warm. The real win: at scale, caching lets you run complex reasoning pipelines (multi-turn planning, code execution summaries) without fearing the token bill. Set it and forget it for your fixed system prompts.
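The "cache the context, append the query" pattern comes down to block order: the stable document goes first with the cache marker, and the changing question follows it. Here is a minimal sketch, assuming a hypothetical helper named `build_cached_request` that returns keyword arguments for `client.messages.create`:

```python
def build_cached_request(document_text, question, model, max_tokens=512):
    """Return kwargs for client.messages.create with the document marked for caching."""
    return {
        'model': model,
        'max_tokens': max_tokens,
        'messages': [
            {
                'role': 'user',
                'content': [
                    # Stable prefix: must be identical on every request to be reused.
                    {
                        'type': 'text',
                        'text': document_text,
                        'cache_control': {'type': 'ephemeral'},
                    },
                    # Variable suffix: sits after the cache breakpoint, so changing
                    # it does not affect the cached prefix.
                    {'type': 'text', 'text': question},
                ],
            }
        ],
    }
```

A call would look like `client.messages.create(**build_cached_request(doc, 'What does section 3 require?', model='claude-opus-4-6'))`, reusing the same `doc` string for every query.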

Check your understanding

You have a chat system where users ask questions about a 50KB internal knowledge base. You want to cache the knowledge base. Your system prompt changes once per day. Should you put the knowledge base in cache_control, the system prompt in cache_control, both, or neither? What happens if you cache both and the system prompt updates after 2 minutes?

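Answer hint

Cache both. A prompt that changes once a day is stable relative to a 5-minute cache window, so each change costs only one extra cache write. If the system prompt updates after 2 minutes, requests carrying the new prompt no longer match the old prefix: they miss the cache and write a new entry at 1.25x, and the old entry simply expires unused. Because the system prompt precedes the messages in the cached prefix, a knowledge base placed in the messages is rewritten too. You never get stale instructions, since the model always sees exactly what you sent. Caching is best for stable, large, reused content; content that changes per request (user-specific data, the current question) should come after the last cache breakpoint.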

VERSION This guide targets anthropic SDK 0.90.0+. If you're on an older version, upgrade with `pip install --upgrade anthropic`. SDK version and model support are separate issues: an outdated SDK may not accept the cache_control parameter at all, while a model without caching support (for example claude-3-sonnet-20240229) is rejected by the API regardless of SDK version.
