What prompt caching does: 90% cost reduction
Why this matters
If you're sending the same large context (system prompt, documents, examples) repeatedly, caching can drop your API costs dramatically. A 30K-token cached context costs 1.25x on first use, then 0.1x per token on subsequent requests: that's the difference between a $10 bill and a dime for the same context.
Explanation
What it does: Prompt caching pre-computes and stores parts of your request (system prompts, documents, code files) on Anthropic's servers. Subsequent requests reuse that cached content instead of re-processing it, dropping the per-token cost from standard rate to 10% of standard rate.
How it works: You mark messages or text blocks with cache_control={"type": "ephemeral"}. On the first request, those tokens are cached (charged at 1.25x normal rate). The cache persists for 5 minutes. Every subsequent request in that 5-minute window that includes the same cached content skips reprocessing: you only pay for new tokens (non-cached input and output).
When to use it: Chat systems with a fixed system prompt, RAG applications loading the same documents repeatedly, and code analysis tools that reference the same codebase across queries. The minimum cache size is 1024 tokens, so don't cache small prompts.
Request code
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ.get('ANTHROPIC_API_KEY'))
largeSystemPrompt = '''You are a legal document analyzer. You have access to the complete United States Tax Code (Title 26), Internal Revenue Code Section 1-9834, spanning over 54,000 lines. Your task is to answer questions about tax law with precision and cite relevant sections.''' + '\n\n' + ('Relevant sections: ' * 100)
response = client.messages.create(
model='claude-opus-4-6',
max_tokens=512,
system=[
{
'type': 'text',
'text': largeSystemPrompt,
'cache_control': {'type': 'ephemeral'}
}
],
messages=[
{
'role': 'user',
'content': 'What is the tax treatment of spousal lifetime access trusts (SLATs)?'
}
]
)
print(f'Response: {response.content[0].text}')
print(f'Input tokens (cached): {response.usage.cache_creation_input_tokens}')
print(f'Input tokens (read from cache): {response.usage.cache_read_input_tokens}')
print(f'Output tokens: {response.usage.output_tokens}') Authentication
Set ANTHROPIC_API_KEY environment variable before instantiating the client. The SDK reads this at init time: export ANTHROPIC_API_KEY='sk-ant-...'
Response shape
| Field | Description |
|---|---|
content | List of content blocks, first is text response |
content[0].text | The model's text response |
usage.input_tokens | Non-cached input tokens processed |
usage.cache_creation_input_tokens | Tokens written to cache (first request only) |
usage.cache_read_input_tokens | Tokens read from cache (subsequent requests) |
usage.output_tokens | Tokens in the response |
Field guide
cache_creation_input_tokens Non-zero only on first request. These tokens are charged at 1.25x the normal rate.
cache_read_input_tokens The hidden savings field: shows how many tokens were reused from cache at 0.1x cost. This is where your 90% discount comes from.
usage.input_tokens New tokens that weren't cached. Always charged at standard rate.
Setup trap
Cache writes add 500-800ms latency to the first request. If you're expecting instant responses and the first request hangs, that's the cache being written. Plan for this in latency budgets. Also, the 5-minute cache expiration is server-side: you can't query remaining TTL or manually expire it.
Cost
Example: 30,000-token system prompt + 500-token user question. First request: (30,000 × 1.25) + (500 × 1) = 37,750 token-equivalents at standard rates ≈ $0.56 with Opus pricing. Second request (within 5 min): (30,000 × 0.1) + (500 × 1) = 3,500 token-equivalents ≈ $0.05. For a 100-query session: first query costs $0.56, next 99 cost $0.05 each = $5.51 total instead of $56. That's 90% savings.
Rate limits
Cache writes consume quota, but cache reads do not trigger standard rate limits as aggressively. If you're hitting rate limits, switching to cached requests can let the same number of queries through because cached reads count as lower-weight operations internally.
Common gotcha
You'll mark the context with cache_control, but the cache only activates if the EXACT same content appears in subsequent requests within 5 minutes. If you regenerate your system prompt (different whitespace, reordered fields, or even different examples), the cache misses and you start fresh. Developers often think the cache is broken when they've actually modified the prompt slightly.
Error recovery
InvalidRequestError: 'cache_control' is not supported for modelInvalidRequestError: 'cache_control' can only be used with 'system' or 'text' type blocksRateLimitError after cache missExperienced dev note
Prompt caching is a hidden leverage point for RAG systems. Instead of re-embedding and re-ranking documents every query, you cache the full retrieval context once, then append only the new user query. A 10,000-token document cached across 1,000 queries saves you ~$9 in input costs alone. The real win: at scale, caching lets you run complex reasoning pipelines (multi-turn planning, code execution summaries) without fearing the token bill. Set it and forget it for your fixed system prompts.
Check your understanding
You have a chat system where users ask questions about a 50KB internal knowledge base. You want to cache the knowledge base. Your system prompt changes once per day. Should you put the knowledge base in cache_control, the system prompt in cache_control, both, or neither? What happens if you cache both and the system prompt updates after 2 minutes?
Show answer hint
Cache the knowledge base, not the system prompt. If the system prompt changes mid-session, the cache with the old prompt becomes useless: you'd have a stale cache serving wrong instructions. Caching is best for stable, large, reused content. Volatile content (system prompts, user-specific data) should stay outside the cache.