API Intermediate medium · 6 min

System prompt caching pattern

What you will learn

Use prompt caching to store large system prompts and context on Anthropic's servers, reducing latency and cost for repeated requests with the same instructions.

Why this matters

System prompts often contain lengthy instructions, examples, or context that don't change between API calls. Caching them avoids re-processing identical text, cuts the cost of the cached input tokens by up to 90% on cache hits, and speeds up time-to-first-token by skipping redundant processing.

Skip if: you make one-off requests, your system prompt changes per user or per request, or your prompt falls below the minimum cacheable length (at least 1,024 tokens, and more on some models), in which case nothing is cached. Also skip it if your API calls are already fast enough and cost isn't a constraint.

Explanation

System prompt caching stores an unchanging prompt prefix on Anthropic's infrastructure. When you include "cache_control": {"type": "ephemeral"} on a system prompt block, Anthropic hashes the prompt up to and including that block, stores the processed prefix in a cache, and reuses it across requests. Subsequent calls with an identical prefix skip reprocessing: you pay only 10% of the normal input price for the cached tokens and get faster responses. The request that writes the cache is billed at a premium (25% over the normal input price for the default 5-minute cache).

Under the hood: Anthropic's API computes a hash of the prompt prefix on the first request. If the same hash appears in a follow-up request within the cache lifetime (5 minutes by default, restarted each time the cache is read), the already-processed prefix is reused instead of being processed again. This is transparent to you: the response format is the same, but latency and input token billing drop significantly.
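To make the lookup concrete, here is a toy local model of a hash-keyed cache whose lifetime restarts on every use. It is only a sketch of the behavior described above: the class name ToyPrefixCache, the SHA-256 key, and the in-memory dict are illustrative assumptions, not Anthropic's actual implementation.

```python
import hashlib

TTL_SECONDS = 5 * 60  # default cache lifetime


class ToyPrefixCache:
    """Hypothetical in-memory stand-in for a server-side prompt cache."""

    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._expires_at = {}  # prompt hash -> expiry time in seconds

    def lookup(self, prompt, now):
        """Return "hit" or "miss" for a request arriving at time `now` (seconds)."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        expiry = self._expires_at.get(key)
        status = "hit" if expiry is not None and now < expiry else "miss"
        # Both a write (miss) and a read (hit) restart the lifetime.
        self._expires_at[key] = now + self.ttl
        return status


cache = ToyPrefixCache()
prompt = "You are an expert legal document reviewer."
# Requests at minute 0, 4, 6, and 12
timeline = [cache.lookup(prompt, now=minute * 60) for minute in (0, 4, 6, 12)]
print(timeline)  # ['miss', 'hit', 'hit', 'miss']
```

The request at minute 6 still hits because the read at minute 4 restarted the clock; the request at minute 12 misses because more than 5 minutes passed since the last use.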

Use this pattern when: you have a fixed system prompt (e.g., "You are a legal document reviewer" followed by detailed guidelines) that appears in 10+ requests per session, or when batch processing with identical instructions. Ideal for chatbots with stable personalities, code reviewers, or document classification pipelines. Cache entries live for 5 minutes by default, and each hit restarts that clock; keep requests flowing within that window to maximize savings.

Request code

```python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# System prompt to cache: must be identical across requests to reuse cache.
# Note: this short prompt is a placeholder. Prompts below the model's minimum
# cacheable length (at least 1,024 tokens) are processed without caching, so
# in practice append your guidelines, examples, or reference material here.
system_prompt = """You are an expert legal document reviewer. Your task is to:
1. Identify contract risks and red flags
2. Highlight missing clauses or ambiguous language
3. Provide specific remediation suggestions
4. Rate overall risk (low/medium/high)

Always cite the specific clause or section when flagging issues.
Be concise but thorough. Assume the reader has legal background."""

# First request: cache miss, full processing, prefix written to cache
response_1 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this NDA: [contract text here]"
        }
    ]
)

print("First request usage:")
print(f"Input tokens: {response_1.usage.input_tokens}")
print(f"Cache creation tokens: {response_1.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response_1.usage.cache_read_input_tokens}")
print(f"Response: {response_1.content[0].text}\n")

# Second request: same system prompt within 5 min, should hit cache
response_2 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review this employment agreement: [different contract]"
        }
    ]
)

print("Second request usage (should show cache_read_input_tokens):")
print(f"Input tokens: {response_2.usage.input_tokens}")
print(f"Cache creation tokens: {response_2.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response_2.usage.cache_read_input_tokens}")
print(f"Response: {response_2.content[0].text}")
```

Authentication

Set your Anthropic API key before instantiating the client:

```bash
export ANTHROPIC_API_KEY="your-key-from-console.anthropic.com"
```

The Python SDK reads this variable at client initialization. No additional auth headers are needed: the SDK handles them.

Response shape

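Field	Example value
`id`	msg_1234567890abcdef
`type`	message
`role`	assistant
`content`	list of content blocks, e.g. [{"type": "text", "text": "..."}]
`model`	claude-opus-4-6
`stop_reason`	end_turn
`stop_sequence`	null
`usage`	object with input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens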

Field guide

usage.cache_creation_input_tokens

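Tokens written to cache on this request. Non-zero when the request creates a new cache entry: the first request with a new system prompt, or the first one after the previous entry expired. Billed at 125% of the normal input price for the default 5-minute cache.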

usage.cache_read_input_tokens

Tokens read from cache on this request. Non-zero on cache hits. Billed at 10% of the normal input price; this is the real savings metric.

usage.input_tokens

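Input tokens that were neither read from nor written to cache (typically the user message that follows the cached system prompt). Always present. Total input for a request is the sum of input_tokens, cache_creation_input_tokens, and cache_read_input_tokens.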

cache_read_input_tokens

Hidden gem: this field proves your cache was hit. Zero means a cache miss (the system prompt changed, the cache entry expired, or the prompt is below the minimum cacheable length). Experienced devs monitor this per request to validate their caching strategy.
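As a minimal sketch of that per-request check, the helper below classifies a usage object by its two cache fields. The function name cache_status and the SimpleNamespace stand-ins are hypothetical, and the token counts are made up; with the SDK you would pass response.usage instead.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a usage block as 'hit', 'write', or 'uncached'."""
    # getattr with a default tolerates usage objects that omit the cache fields
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "uncached"


# Stand-ins mimicking response.usage from a first and a second request
first = SimpleNamespace(
    input_tokens=42, cache_creation_input_tokens=2048, cache_read_input_tokens=0
)
second = SimpleNamespace(
    input_tokens=45, cache_creation_input_tokens=0, cache_read_input_tokens=2048
)
print(cache_status(first), cache_status(second))  # write hit
```

A result of "uncached" on every request usually means the prompt never qualified for caching in the first place.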

Setup trap

Cache entries live for 5 minutes by default, counted from the last request that used them. If you test caching by running the same request 10 minutes apart, the entry will have expired and you'll see cache_read_input_tokens as 0 again, making you think caching didn't work. Test within a single script execution or in rapid iterations to verify the pattern.

Cost

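Cache writes cost 125% of the normal input price (for the default 5-minute cache); cache reads cost 10%. Example, assuming an illustrative price of $3 per million input tokens: a 2,000-token system prompt costs $0.0060 per request uncached, $0.0075 on the request that writes the cache, and $0.0006 on each cache hit. Caching is already cheaper by the second request ($0.0081 vs $0.0120 in total); after that, every hit saves 90% of the prompt's cost. For 100 requests in an hour with the cache kept warm, the system prompt costs about $0.07 instead of $0.60.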
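The arithmetic can be sketched as a small calculator. The function name caching_cost is hypothetical, and the defaults are assumptions stated for illustration: $3 per million input tokens, a 1.25x multiplier for cache writes, a 0.10x multiplier for cache reads, and a cache that stays warm after a single write.

```python
def caching_cost(prompt_tokens, requests, price_per_mtok=3.00,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for the system prompt alone."""
    base = prompt_tokens * price_per_mtok / 1_000_000  # one uncached pass
    uncached = base * requests
    # One cache write, then every later request reads from the warm cache.
    cached = base * write_multiplier + base * read_multiplier * (requests - 1)
    return uncached, cached


for n in (1, 2, 100):
    uncached, cached = caching_cost(prompt_tokens=2_000, requests=n)
    print(f"{n:>3} requests: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

With these assumed numbers, a single request is slightly more expensive with caching, and the cached total drops below the uncached total from the second request onward. Substitute your model's actual prices before relying on the output.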

Rate limits

Prompt caching does not change your rate limits: you're still limited by requests per minute and tokens per minute. On most current models, however, tokens read from cache do not count toward the input-tokens-per-minute limit, so a high cache hit rate can allow noticeably higher throughput. Check the rate limits documentation for your model, since some older models count cached reads.

Common gotcha

The cached prefix must be byte-for-byte identical across requests to hit the cache. A trailing space, a different quote style, or reordered list items changes the hash and causes a cache miss. Build the prompt from a constant string and keep per-request values (timestamps, user names, IDs) out of it; interpolating them with f-strings or concatenation is the usual culprit. Many developers debug phantom cache misses caused by whitespace differences.
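One way to catch this kind of drift is to log a fingerprint of the exact string you send. The sketch below uses SHA-256 from the standard library; the helper name prompt_fingerprint is hypothetical, and the digest is a local debugging aid, not the hash the server computes.

```python
import hashlib


def prompt_fingerprint(prompt):
    """Short SHA-256 digest of the exact bytes sent as the system prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


SYSTEM_PROMPT = "You are an expert legal document reviewer."
with_trailing_space = SYSTEM_PROMPT + " "

same = prompt_fingerprint(SYSTEM_PROMPT) == prompt_fingerprint(SYSTEM_PROMPT)
drifted = prompt_fingerprint(SYSTEM_PROMPT) == prompt_fingerprint(with_trailing_space)
print(same)     # True
print(drifted)  # False
```

Logging the fingerprint next to each request makes it easy to see whether two requests that should have shared a cache entry actually sent the same bytes.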

Error recovery

InvalidRequestError: 'cache_control' not supported for this model

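Prompt caching is available only on models that support it, which includes the claude-opus-4-6 and claude-sonnet-4-6 models used here. If you see this error, check your model string against the list of supported models in the prompt caching documentation and update it. In the Python SDK, a rejected request like this surfaces as anthropic.BadRequestError (HTTP 400, error type invalid_request_error).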

RateLimitError after enabling caching

Unlikely, but if rate-limit errors spike, verify you're not creating cache breakpoints you don't need. Each cache_control marker defines its own cached prefix, and a request may carry up to four of them: don't accidentally add one to both the system prompt and the first message unless you intend to cache both.
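A quick way to audit a request before sending it is to count the cache_control markers it carries. The helper below is a hypothetical sketch that inspects only the system and messages parameters in the shapes used in this guide (it ignores tool definitions).

```python
def count_cache_breakpoints(system, messages):
    """Count blocks carrying a cache_control marker in system and message content."""
    blocks = list(system) if isinstance(system, list) else []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain string content cannot carry a marker
            blocks.extend(content)
    return sum(
        1 for block in blocks
        if isinstance(block, dict) and "cache_control" in block
    )


system = [{
    "type": "text",
    "text": "You are an expert legal document reviewer.",
    "cache_control": {"type": "ephemeral"},
}]
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Review this NDA: [contract text here]",
        "cache_control": {"type": "ephemeral"},  # probably unintended
    }],
}]
print(count_cache_breakpoints(system, messages))  # 2
```

For the pattern in this guide the expected count is 1: a single marker on the system prompt block.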

Cache seems not to work (cache_read_input_tokens always 0)

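Check timing: the cache expires 5 minutes after the last request that used it. Verify the system prompt string is absolutely identical (no trailing whitespace, same quotes); use repr(system_prompt) to debug whitespace. Also confirm the prompt meets the model's minimum cacheable length (at least 1,024 tokens), since shorter prompts are processed without caching.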

Experienced dev note

Prompt caching is a silent efficiency multiplier for production systems. Savings scale with prompt size and request volume: at an illustrative $3 per million input tokens, a 5,000-token system prompt served 1,000 times a day costs about $15/day uncached versus roughly $1.50/day in cache reads when the cache stays warm, plus the occasional cache write, and latency improves rather than suffers. But the real win is architectural: it makes sense to move heavy, stable context (company docs, coding standards, style guides) into the system prompt instead of duplicating it in user messages. This reduces per-request token bloat and lets every request share one cached prefix. Also, monitor cache_read_input_tokens in production logs: a sudden drop to zero often signals a code push that changed your system prompt, costing money until you notice.
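The monitoring idea can be sketched as a hit-rate check over a window of logged usage records. The dict layout, the function names, and the 0.5 alert threshold are assumptions for illustration; adapt them to however your logs store the usage fields.

```python
def cache_hit_rate(usages):
    """Fraction of all input tokens in the window that were served from cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(
        u.get("input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        for u in usages
    )
    return read / total if total else 0.0


def should_alert(usages, threshold=0.5):
    """Hypothetical alert rule: flag a window whose hit rate falls below threshold."""
    return cache_hit_rate(usages) < threshold


write = {"input_tokens": 50, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0}
hit = {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}

healthy_window = [write] + [hit] * 9   # one write, then a warm cache
after_bad_deploy = [write] * 10        # prompt changes on every request

print(round(cache_hit_rate(healthy_window), 3), should_alert(healthy_window))
print(round(cache_hit_rate(after_bad_deploy), 3), should_alert(after_bad_deploy))
```

A window made entirely of cache writes, as in the second case, is the signature of a prompt that changes on every request.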

Check your understanding

If you have a system prompt cached at 10:00 AM and make identical requests at 10:04 AM (cache hit) and 10:06 AM, what will cache_read_input_tokens show for the 10:06 request, and why?

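Answer hint

The 10:06 request is a cache hit, so cache_read_input_tokens is non-zero. The 5-minute lifetime restarts on every use, so the hit at 10:04 extends the entry to 10:09. Without the 10:04 request, the entry would have expired at 10:05 and the 10:06 request would show 0.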

VERSION Prompt caching was introduced in anthropic SDK 0.28.0 (late 2024) and is stable as of 0.94.x (April 2026). The ephemeral cache type is standard; no migration needed if upgrading from earlier versions that didn't support caching.
