API Intermediate medium · 5 min

How to verify cache hits: usage.cache_read_input_tokens

What you will learn

Use the <code>cache_read_input_tokens</code> field in the API response to confirm that your cached prompt prefix was actually reused, so repeated input is billed at the discounted cache-read rate instead of the full input rate.

Why this matters

Prompt caching can cut the cost of repeated input tokens by up to 90%, but you won't know if it's working unless you inspect the usage metrics. Without checking <code>cache_read_input_tokens</code>, you might pay for cache misses thinking you're getting cache hits, wasting your budget on duplicate token processing.

Skip if: you're making one-off requests with no repetition, or your prompts change on every request. Prompt caching won't help in those cases, so focus on reducing token count through better prompt engineering instead. Also skip this if your requests are spaced further apart than the cache lifetime: the cached prefix will expire before it can be reused.

Explanation

What it does: The cache_read_input_tokens field in the usage object of the API response tells you exactly how many input tokens were served from cache on that request. A non-zero value means the cache worked; zero means this was a cache miss.

How it works: Caching is opt-in. You mark the end of a reusable prefix (tools, system prompt, and message history up to that point) with a cache_control breakpoint. When a later request starts with the same prefix, the API reads it from cache instead of processing it again. The response splits input across three usage fields: cache_read_input_tokens (read from cache), cache_creation_input_tokens (written to cache on this request), and input_tokens (everything after the last breakpoint, processed normally). Cost-wise, cache reads are billed at 10% of the base input token rate, so cache hits directly reduce your bill.

When to use it: Log cache_read_input_tokens whenever cost matters: in production systems, long-running chatbots, or batch processing. If you see cache_read_input_tokens == 0 on a request that should have hit cache, investigate whether cache_control is actually set, whether the prefix is identical, whether the cache has expired (5-minute TTL by default), or whether the prefix is shorter than the model's minimum cacheable length (1,024 tokens or more, depending on the model).

Request code

python

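import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get('ANTHROPIC_API_KEY'))

# Stand-in for real long-lived context (style guide, docs, etc.). Prefixes shorter
# than the model's minimum cacheable length (1,024 tokens or more) are not cached.
guidelines = "Rule: flag unused variables, unclear names, and missing tests.\n" * 600
system_prompt = "You are a code review expert. Always be concise.\n\n" + guidelines
user_message = "Review this Python function for bugs: def add(a, b): return a + b"


def send_review_request():
    """Send the same request each time so the cached prefix can be reused."""
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        # cache_control marks the end of the prefix to cache.
        # Without it, nothing is cached and cache_read_input_tokens stays 0.
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ],
    )


# First request: expect a cache write (cache_creation_input_tokens > 0).
response = send_review_request()

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"\nResponse: {response.content[0].text}")

# Second request with an identical prefix: expect a cache read.
response2 = send_review_request()

print("\n--- Second request (same prompt) ---")
print(f"Input tokens: {response2.usage.input_tokens}")
print(f"Cache read tokens: {response2.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response2.usage.cache_creation_input_tokens}")
print(f"Output tokens: {response2.usage.output_tokens}")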

Authentication

Set the ANTHROPIC_API_KEY environment variable before running this code. Get your key from https://console.anthropic.com/account/keys. The SDK reads this at client instantiation time.

Response shape

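Field	Example value
`id`	msg_xxxxxxxxxx
`type`	message
`role`	assistant
`content`	array of content blocks, e.g. [{"type": "text", "text": "..."}]
`model`	claude-opus-4-6
`stop_reason`	end_turn
`stop_sequence`	null
`usage`	object with input_tokens, cache_creation_input_tokens, cache_read_input_tokens, output_tokens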

Field guide

input_tokens

Input tokens that were neither read from cache nor written to it (everything after the last cache breakpoint), billed at the full input rate

cache_read_input_tokens

Tokens served from cache: this is what you check to verify a cache hit. Non-zero = cache worked

cache_creation_input_tokens

Tokens written to the cache on this request. Non-zero on the first request with a given prefix, and again whenever the cache has expired or the prefix has changed

output_tokens

Tokens generated in the response, always charged at full rate

Cost

Tokens read from cache cost 10% of the base input token rate (vs. the full input rate for newly processed tokens). On a request reading 100K tokens from cache, you pay for ~10K token equivalents instead of 100K. Over 1000 repeated requests, this saves ~90% on that portion of your bill. Cache writes are billed at a premium over the base input rate (25% for the default 5-minute cache), so a prefix that is written but never read costs more than not caching at all. Monitor <code>cache_read_input_tokens</code> in production: if it's always 0, your cache strategy isn't working and you're leaving money on the table.
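
The arithmetic above can be sketched as a small helper. This is a rough estimate, not a billing tool: the 0.10 read multiplier comes from the text, the 1.25 write multiplier assumes the default 5-minute cache, and the $5 per million tokens base price is a hypothetical value chosen for illustration. Check current pricing for your model before relying on the numbers.

```python
# Multipliers relative to the base input price.
# 0.10 for cache reads is stated in this guide; 1.25 for cache writes is an
# assumption (default 5-minute cache) that you should verify against pricing.
CACHE_READ_MULTIPLIER = 0.10
CACHE_WRITE_MULTIPLIER = 1.25


def input_cost_usd(input_tokens, cache_read_tokens, cache_creation_tokens,
                   base_price_per_mtok):
    """Estimate the input-side cost of one request in USD."""
    weighted_tokens = (
        input_tokens
        + cache_read_tokens * CACHE_READ_MULTIPLIER
        + cache_creation_tokens * CACHE_WRITE_MULTIPLIER
    )
    return weighted_tokens * base_price_per_mtok / 1_000_000


# Hypothetical base price: $5 per million input tokens.
hit = input_cost_usd(50, 100_000, 0, base_price_per_mtok=5.0)
miss = input_cost_usd(100_050, 0, 0, base_price_per_mtok=5.0)
print(f"cache hit: ${hit:.4f}")
print(f"uncached:  ${miss:.4f}")
```

Pass the three token counts straight from response.usage to estimate what each request actually cost on the input side.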

Rate limits

For most current models, tokens read from cache do not count toward the input-tokens-per-minute rate limit; only <code>input_tokens</code> and <code>cache_creation_input_tokens</code> do. A first request that creates cache (high <code>cache_creation_input_tokens</code>) therefore uses quota the same as a normal request, while subsequent cache hits count as much lighter requests toward your rate limit. Some older models count cache reads as well, so check the rate-limit documentation for the model you use.

Common gotcha

Many developers assume a cache hit happened if input_tokens is low. But input_tokens only counts tokens after the last cache breakpoint, so it is just as low on the request that writes the cache as on one that reads it. You must explicitly check cache_read_input_tokens > 0 to confirm cache reuse. On the first request with a prompt, expect cache_creation_input_tokens to be high and cache_read_input_tokens to be 0. Only on subsequent requests with an identical prefix should cache_read_input_tokens spike.
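
A minimal sketch of that check, assuming a usage object with the field names shown in the field guide. The function name and the outcome labels are hypothetical, and the SimpleNamespace objects stand in for response.usage so the snippet runs without an API call.

```python
from types import SimpleNamespace


def classify_cache_outcome(usage):
    """Return 'hit', 'write', 'hit+write', or 'none' for a usage object."""
    # Fields may be missing or None on some responses, so default to 0.
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    created = getattr(usage, "cache_creation_input_tokens", None) or 0
    if read > 0 and created > 0:
        return "hit+write"  # reused a prefix and wrote additional cache
    if read > 0:
        return "hit"
    if created > 0:
        return "write"      # cache populated, nothing reused yet
    return "none"           # no caching took place on this request


# Stand-in usage objects; in real code pass response.usage instead.
first = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=2048,
                        cache_read_input_tokens=0, output_tokens=40)
second = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=0,
                         cache_read_input_tokens=2048, output_tokens=38)
print(classify_cache_outcome(first))
print(classify_cache_outcome(second))
```

Note that input_tokens is 12 in both stand-ins: only the cache fields distinguish the write from the hit.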

Error recovery

cache_read_input_tokens is missing or None in usage

This usually means the SDK version you have installed predates the cache-related usage fields, or the model you are calling does not support prompt caching. Upgrade the anthropic package and use a current model such as claude-opus-4-6 or claude-sonnet-4-6.

cache_read_input_tokens is always 0 even for identical requests

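Check that the request sets cache_control and that your system prompt and message history are byte-identical across requests up to the breakpoint. Small differences in whitespace, formatting, or message order will cause cache misses. Also verify that the cache hasn't expired (5-minute TTL by default) and that the prefix meets the model's minimum cacheable length (1,024 tokens or more, depending on the model).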
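
One way to spot accidental differences between requests is to log a fingerprint of what you send. The sketch below is a local debugging aid under stated assumptions: prefix_fingerprint is a hypothetical helper, and the hash is not how the API computes its cache keys. It only tells you whether two requests from your own code serialize identically.

```python
import hashlib
import json


def prefix_fingerprint(system, messages):
    """Hash the exact system prompt and messages a request would send."""
    # No key sorting or whitespace stripping on purpose: changes in ordering
    # or spacing should produce a different fingerprint.
    payload = json.dumps({"system": system, "messages": messages},
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


messages = [{"role": "user", "content": "Review this function."}]
a = prefix_fingerprint("You are a code review expert.", messages)
b = prefix_fingerprint("You are a code review expert. ", messages)  # trailing space
print(a, b, a == b)
```

Log the fingerprint next to cache_read_input_tokens: if the fingerprint changes between requests that should be identical, the miss is caused by your prompt construction rather than by expiry.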

Experienced dev note

The real power move: log cache_read_input_tokens / (input_tokens + cache_creation_input_tokens + cache_read_input_tokens) as your cache hit rate in production. Track this metric over time: if it's below 50% despite repetitive requests, your prompt structure is changing when it shouldn't be, or you need to redesign how you build the system prompt. Also: each cache hit refreshes the TTL (5 minutes by default), so frequent access keeps the cache alive. Cold customers whose requests arrive further apart than the TTL will never see cache savings: design accordingly.
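
A minimal sketch of such a metric, assuming usage objects with the field names from the field guide. CacheHitTracker is a hypothetical class, and its denominator includes cache_creation_input_tokens so that cache writes count against the hit rate. The SimpleNamespace objects stand in for response.usage.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheHitTracker:
    """Accumulates token counts across requests to compute a hit rate."""
    cached: int = 0
    total: int = 0

    def record(self, usage):
        # Fields may be missing or None, so default each to 0.
        read = getattr(usage, "cache_read_input_tokens", None) or 0
        created = getattr(usage, "cache_creation_input_tokens", None) or 0
        fresh = getattr(usage, "input_tokens", None) or 0
        self.cached += read
        self.total += read + created + fresh

    @property
    def hit_rate(self):
        # Share of all input tokens that were served from cache.
        return self.cached / self.total if self.total else 0.0


tracker = CacheHitTracker()
# Stand-in usage objects: one cache write followed by one cache hit.
tracker.record(SimpleNamespace(input_tokens=10, cache_creation_input_tokens=1990,
                               cache_read_input_tokens=0))
tracker.record(SimpleNamespace(input_tokens=10, cache_creation_input_tokens=0,
                               cache_read_input_tokens=1990))
print(f"hit rate: {tracker.hit_rate:.2%}")
```

In production you would call tracker.record(response.usage) after every request and export hit_rate to your metrics system per prompt template or per customer.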

Check your understanding

You send the same code review request twice. The first response shows input_tokens: 5, cache_creation_input_tokens: 1500, cache_read_input_tokens: 0, and output_tokens: 45. The second response shows input_tokens: 5, cache_read_input_tokens: 1500, and output_tokens: 42. Why did output tokens decrease, and what does that tell you about how caching actually works?

Answer hint

Caching only changes how the input prefix is processed and billed: it does not store or replay responses. The model generates a fresh output on every request, so output tokens can vary with sampling, independent of cache hits. The key insight: cache hit confirmation (non-zero <code>cache_read_input_tokens</code>) doesn't guarantee identical outputs or token counts: it only guarantees you weren't charged full price for those input tokens.

VERSION The examples here assume anthropic 0.94.x+ with Claude Sonnet 4.6 (claude-sonnet-4-6) or Claude Opus 4.6 (claude-opus-4-6). Much older SDK versions may not expose the cache-related usage fields. Always pin your model version explicitly.
