System prompt caching pattern
Why this matters
System prompts often contain lengthy instructions, examples, or context that don't change between API calls. Caching them avoids reprocessing identical text, cuts the cost of the cached input tokens by up to 90%, and can shorten time-to-first-token, most noticeably for long prompts.
Explanation
System prompt caching stores a reusable prompt prefix on Anthropic's infrastructure. When you include cache_control={"type": "ephemeral"} in your system prompt block, Anthropic caches the prompt up to and including that block and reuses it across requests. Subsequent calls with an identical prefix skip reprocessing: you pay only 10% of the normal input price for the cached tokens and get faster responses. The request that writes the cache is billed at 125% of the normal input price for those tokens.
Under the hood: on the first request, the API computes a hash of your prompt up to the cache breakpoint and stores the processed prefix under that key. If a follow-up request produces the same hash while the entry is still alive (5 minutes by default, refreshed each time the entry is read), the cached prefix is reused instead of being processed again. This is transparent to you: the response format is the same, but latency and input token billing drop. Prompts shorter than the model's minimum cacheable length (1,024 tokens or more, depending on the model) are processed normally and are not cached.
Use this pattern when: You have a fixed system prompt (e.g., "You are a legal document reviewer" followed by detailed instructions) that appears in 10+ requests per session, or when batch processing with identical instructions. Ideal for chatbots with stable personalities, code reviewers, or document classification pipelines. Cache entries live for 5 minutes by default, and every cache hit restarts that timer, so steady traffic keeps the cache warm.
Request code
```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# System prompt to cache: must be identical across requests to reuse cache.
# This prompt is shortened for readability. Prompts below the model's minimum
# cacheable length (1,024 tokens or more, depending on the model) are not
# cached, and both cache fields printed below will report 0.
system_prompt = """You are an expert legal document reviewer. Your task is to:
1. Identify contract risks and red flags
2. Highlight missing clauses or ambiguous language
3. Provide specific remediation suggestions
4. Rate overall risk (low/medium/high)
Always cite the specific clause or section when flagging issues.
Be concise but thorough. Assume the reader has legal background."""

# First request: cache miss, full processing, cache entry is written
response_1 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this NDA: [contract text here]",
        }
    ],
)
print("First request usage:")
print(f"Input tokens: {response_1.usage.input_tokens}")
print(f"Cache creation tokens: {response_1.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response_1.usage.cache_read_input_tokens}")
print(f"Response: {response_1.content[0].text}\n")

# Second request: same system prompt while the cache entry is alive, should hit cache
response_2 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this employment agreement: [different contract]",
        }
    ],
)
print("Second request usage (should show cache_read_input_tokens):")
print(f"Input tokens: {response_2.usage.input_tokens}")
print(f"Cache creation tokens: {response_2.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response_2.usage.cache_read_input_tokens}")
print(f"Response: {response_2.content[0].text}")
```
Authentication
Set your Anthropic API key before instantiating the client:
```bash
export ANTHROPIC_API_KEY="your-key-from-console.anthropic.com"
```
The Python SDK reads this variable at client initialization. No additional auth headers are needed: the SDK handles them.
Response shape
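| Field | Example value |
|---|---|
| id | msg_1234567890abcdef |
| type | message |
| role | assistant |
| content | array of content blocks, e.g. one text block |
| model | claude-opus-4-6 |
| stop_reason | end_turn |
| stop_sequence | null |
| usage | object with input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens |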
Field guide
usage.cache_creation_input_tokens Tokens written to cache on this request. Non-zero when a cache entry is created: on the first request with a new system prompt, or on the first request after the previous entry expired. Billed at 125% of normal input cost for the default 5-minute cache.
usage.cache_read_input_tokens Tokens read from cache on this request. Non-zero on cache hits. Billed at 10% of normal input cost: the real savings metric.
usage.input_tokens Input tokens that were neither read from nor written to the cache (the part of the prompt after the cache breakpoint). Always present. Total input for a request is the sum of this field and the two cache fields.
Monitoring tip: cache_read_input_tokens is the field that proves your cache was hit. Zero means a cache miss (the system prompt changed, the cache entry expired, or the prompt was too short to cache). Experienced devs monitor it per request to validate their caching strategy.
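The field descriptions above can be turned into a small per-request check. This is a minimal sketch, not part of the SDK: Usage is a hypothetical stand-in for the usage object on a response, and cache_status and total_input_tokens are illustrative helper names. It assumes the three input fields are additive and that a missing or None cache field means 0.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for response.usage (illustration only)."""
    input_tokens: int
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def cache_status(usage) -> str:
    """Classify one response as 'hit', 'write', 'partial', or 'uncached'."""
    # Treat a missing or None field as 0 so the check never raises.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read and written:
        return "partial"  # part of the prefix was reused, a new part was stored
    if read:
        return "hit"
    if written:
        return "write"
    return "uncached"


def total_input_tokens(usage) -> int:
    """Sum the uncached, cache-written, and cache-read input tokens."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return usage.input_tokens + read + written


first = Usage(input_tokens=40, cache_creation_input_tokens=1500)
second = Usage(input_tokens=45, cache_read_input_tokens=1500)
print(cache_status(first), total_input_tokens(first))
print(cache_status(second), total_input_tokens(second))
```

With a real response, the same helpers can be called as cache_status(response.usage), since they only read attributes by name.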
Setup trap
Cache entries live for 5 minutes by default, counted from the last request that wrote or read them. If you test caching by running the same request 10 minutes apart with nothing in between, the entry expires and you'll see cache_read_input_tokens as 0 again, making you think caching didn't work. Test within a single script execution or rapid iterations to verify the pattern.
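The timing behaviour can be explored locally without calling the API. The following is a toy simulation under stated assumptions, not Anthropic's implementation: ToyPromptCache is a hypothetical class, the key is a SHA-256 of the prompt text, and the lifetime is 5 minutes restarted on every write or read.

```python
import hashlib

TTL_SECONDS = 5 * 60  # assumed default cache lifetime


class ToyPromptCache:
    """Local simulation only; it models the expiry timing, nothing else."""

    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._expires_at = {}  # prompt hash -> expiry time in seconds

    def lookup(self, prompt, now):
        """Return 'hit' or 'miss' for a request arriving at time `now` (seconds)."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        expires = self._expires_at.get(key)
        hit = expires is not None and now < expires
        # Both a write (miss) and a read (hit) restart the lifetime from `now`.
        self._expires_at[key] = now + self.ttl
        return "hit" if hit else "miss"


cache = ToyPromptCache()
prompt = "You are an expert legal document reviewer."

# Requests at minute 0, 4, 6 and 17.
results = [cache.lookup(prompt, now=minute * 60) for minute in (0, 4, 6, 17)]
print(results)  # ['miss', 'hit', 'hit', 'miss']
```

The request at minute 6 still hits because the read at minute 4 restarted the timer; the request at minute 17 misses because nothing touched the entry for more than 5 minutes.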
Cost
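Cache writes cost 125% of the normal input price (for the default 5-minute cache) and cache reads cost 10%. Example, assuming an illustrative base price of $3 per million input tokens: a 2,000-token system prompt costs $0.0060 to send uncached, $0.0075 to write to the cache, and $0.0006 on each cache hit. Caching pays for itself on the second request: two cached requests cost $0.0081 versus $0.0120 uncached. For 100 requests per hour with the same system prompt and the cache kept warm, the prompt costs about $0.07 instead of $0.60, roughly 89% less. Actual rates vary by model.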
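The arithmetic can be checked with a few lines of code. This is a rough sketch under stated assumptions: caching_costs is a hypothetical helper, the base price of $3 per million input tokens is illustrative only, a cache write is assumed to be billed at 1.25x and a cache read at 0.10x the base price, and all requests are assumed to arrive while the cache entry is alive (one write, then only reads).

```python
def caching_costs(prompt_tokens, requests, price_per_mtok=3.00,
                  write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for the system prompt only.

    All default prices and multipliers are assumptions; replace them with the
    rates that apply to your model.
    """
    base = prompt_tokens * price_per_mtok / 1_000_000  # one uncached send
    uncached = base * requests
    # One cache write on the first request, cache reads on all later ones.
    cached = base * write_multiplier + base * read_multiplier * (requests - 1)
    return uncached, cached


for n in (1, 2, 100):
    uncached, cached = caching_costs(prompt_tokens=2000, requests=n)
    print(f"{n:>3} requests: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions a single request is slightly more expensive with caching (the write premium), and caching is cheaper from the second request onward.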
Rate limits
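Prompt caching does not raise your rate limits: you are still limited by requests per minute and tokens per minute. For most current models, however, tokens read from the cache do not count toward the input-tokens-per-minute limit, so a high cache hit rate reduces TPM pressure and can allow higher throughput. Check the rate limits documentation for your model, since some older models do count cached reads.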
Common gotcha
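The system prompt must be byte-for-byte identical across requests to hit cache. A trailing space, a different quote character, or reordered list items changes the hash and causes a cache miss. Define the prompt once as a constant and keep anything that varies per request (timestamps, user names, request IDs) out of the cached block; f-strings and concatenation are only safe when every interpolated value is itself constant. Many developers end up debugging phantom cache misses caused by whitespace differences.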
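One way to catch this kind of drift is to fingerprint the prompt and compare it across deployments or requests. The sketch below uses only the standard library; prompt_fingerprint and first_difference are hypothetical helper names, and the SHA-256 fingerprint is a local debugging aid, not the hash the API uses.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Short, stable fingerprint of the exact bytes that would be sent."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def first_difference(a: str, b: str):
    """Return the index of the first differing character, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one string is a prefix of the other
    return None


SYSTEM_PROMPT = "You are an expert legal document reviewer."
drifted = SYSTEM_PROMPT + " "  # trailing space, invisible in most logs

print(prompt_fingerprint(SYSTEM_PROMPT) == prompt_fingerprint(drifted))  # False
print(first_difference(SYSTEM_PROMPT, drifted))  # 42
```

Logging the fingerprint next to cache_read_input_tokens makes it easy to tell whether a miss was caused by a changed prompt or by an expired entry.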
Error recovery
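InvalidRequestError: 'cache_control' not supported for this model
The selected model does not support prompt caching. Switch to a model that does, or remove cache_control from the request.
RateLimitError after enabling caching
Caching does not raise rate limits. Retry with backoff and reduce request volume as you would without caching.
Cache seems not to work (cache_read_input_tokens always 0)
Check that the system prompt is byte-for-byte identical between requests, that requests arrive within the cache lifetime, and that the prompt meets the model's minimum cacheable length.
Experienced dev note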
Prompt caching is a silent efficiency multiplier for production systems. The savings scale with prompt length and request volume: a long system prompt reused across thousands of requests per day is billed at a tenth of the normal input price on every hit, with no latency penalty on those hits. But the real win is architectural: it makes sense to move heavy, stable context (company docs, coding standards, style guides) into the system prompt instead of duplicating it in user messages. This reduces per-request token bloat, and because every hit refreshes the entry, steady traffic keeps the cache hot. Also, monitor cache_read_input_tokens in production logs: a sudden drop to zero often signals a code push that changed your system prompt, costing money until you notice.
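The monitoring advice above could be implemented as a rolling hit-rate check. This is a minimal sketch with hypothetical names (CacheHitMonitor, window, min_hit_rate); the window size and threshold are arbitrary starting points, and how the alert is delivered is left to your logging or alerting stack.

```python
from collections import deque


class CacheHitMonitor:
    """Track the cache hit rate over the most recent requests."""

    def __init__(self, window=50, min_hit_rate=0.5):
        self.samples = deque(maxlen=window)  # True for a hit, False for a miss
        self.min_hit_rate = min_hit_rate

    def record(self, cache_read_input_tokens):
        """Record one response; None or 0 counts as a miss."""
        self.samples.append(bool(cache_read_input_tokens))

    @property
    def hit_rate(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_alert(self):
        """Alert only once the window is full and the hit rate is too low."""
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.hit_rate < self.min_hit_rate


monitor = CacheHitMonitor(window=4, min_hit_rate=0.5)
for tokens in (1500, 1500, 1500, 1500):  # healthy traffic: every request hits
    monitor.record(tokens)
print(monitor.hit_rate, monitor.should_alert())

for tokens in (0, 0, 0):  # e.g. a deploy changed the system prompt
    monitor.record(tokens)
print(monitor.hit_rate, monitor.should_alert())
```

In production, record would be called with response.usage.cache_read_input_tokens after each request.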
Check your understanding
If you have a system prompt cached at 10:00 AM and make identical requests at 10:04 AM (cache hit) and 10:06 AM, what will cache_read_input_tokens show for the 10:06 request, and why?
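Answer hint
It will be non-zero: the 10:06 request is a cache hit. The 5-minute lifetime restarts on every successful read, so the 10:04 hit extended the entry until 10:09. Had there been no request at 10:04, the entry would have expired at 10:05 and the 10:06 request would have been a miss that rewrites the cache.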