How to verify cache hits: usage.cache_read_input_tokens
Why this matters
Prompt caching can reduce your API costs by up to 90% for repeated requests, but you won't know if it's working unless you inspect the usage metrics. Without checking cache_read_input_tokens, you might be paying for cache misses while assuming you're getting cache hits, wasting your budget on duplicate token processing.
Explanation
What it does: The cache_read_input_tokens field in the usage object of the API response tells you exactly how many input tokens were served from cache on that request. A non-zero value means the cache worked; zero means this was a cache miss.
How it works: Caching is opt-in. You mark the end of a reusable prefix (for example, the system prompt) with a cache_control breakpoint. When a later request starts with exactly the same prefix (same system prompt + message history up to that point), Claude's servers look it up in the cache. If found, those tokens are counted as read from cache instead of processed as new input. The response reports input_tokens (tokens after the last cache breakpoint, processed at full price), cache_creation_input_tokens (tokens written to the cache), and cache_read_input_tokens (tokens reused). Cost-wise, cached reads are billed at 10% of the input token rate, so confirming cache hits directly impacts your bill.
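When to use it: Log cache_read_input_tokens whenever cost matters: in production systems, long-running chatbots, or batch processing. If you see cache_read_input_tokens == 0 on a request that should have hit cache, check whether the request sets cache_control at all, whether the cached prefix is byte-for-byte identical, whether the cache entry has expired (the default TTL is 5 minutes, refreshed on each hit), and whether the prefix meets the model's minimum cacheable length (1,024 tokens on many models; shorter prompts are processed without caching).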
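The check described above can be sketched as a small helper. This is a minimal sketch, not part of the SDK: classify_cache_usage is a hypothetical name, and SimpleNamespace stands in for response.usage so the example runs without an API call.

```python
from types import SimpleNamespace


def classify_cache_usage(usage) -> str:
    """Label what the cache did on one request (hypothetical helper)."""
    # Cache fields can be missing or None, so normalize them to 0.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0 and created > 0:
        return "hit+write"  # reused a prefix and cached new tokens after it
    if read > 0:
        return "hit"
    if created > 0:
        return "write"  # first request, or the entry expired and was rebuilt
    return "miss"  # nothing cached: check the breakpoint, prefix, TTL, length


# Stand-in for response.usage; the numbers are illustrative only.
sample = SimpleNamespace(cache_read_input_tokens=0,
                         cache_creation_input_tokens=2048)
print(classify_cache_usage(sample))  # write
```

In production you would pass response.usage directly and log the returned label alongside the request ID.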
Request code
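import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Caching is opt-in: cache_control marks the end of the reusable prefix.
# NOTE: this placeholder prompt is far too short to be cached. Substitute a
# system prompt longer than your model's minimum cacheable length, or both
# cache fields below will stay at 0.
system_prompt = "You are a code review expert. Always be concise."
system_blocks = [
    {
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }
]
user_message = "Review this Python function for bugs: def add(a, b): return a + b"

# First request: expected to write the cache (cache_creation_input_tokens > 0).
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=256,
    system=system_blocks,
    messages=[{"role": "user", "content": user_message}],
)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"\nResponse: {response.content[0].text}")

# Second request with the same prefix: expected to read from the cache.
response2 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=256,
    system=system_blocks,
    messages=[{"role": "user", "content": user_message}],
)
print("\n--- Second request (same prompt) ---")
print(f"Input tokens: {response2.usage.input_tokens}")
print(f"Cache read tokens: {response2.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response2.usage.cache_creation_input_tokens}")
print(f"Output tokens: {response2.usage.output_tokens}")
Authentication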
Set the ANTHROPIC_API_KEY environment variable before running this code. Get your key from https://console.anthropic.com/account/keys. The SDK reads this at client instantiation time.
Response shape
| Field | Example value |
|---|---|
| id | msg_xxxxxxxxxx |
| type | message |
| role | assistant |
| content | array of content blocks (for example, one text block) |
| model | claude-opus-4-6 |
| stop_reason | end_turn |
| stop_sequence | null |
| usage | object holding the token counts described in the Field guide |
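To make the nested fields concrete, here is a sketch that parses a response body and reads the cache counter from its usage object. Every value in the JSON is made up for illustration; only the field names come from the table and the field guide.

```python
import json

# Illustrative response body; all values are invented for this example.
raw = """
{
  "id": "msg_xxxxxxxxxx",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Looks correct."}],
  "model": "claude-opus-4-6",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 21,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 2048,
    "output_tokens": 42
  }
}
"""

body = json.loads(raw)
usage = body["usage"]

# A non-zero cache_read_input_tokens is the signal for a cache hit.
cache_hit = usage.get("cache_read_input_tokens", 0) > 0
print(f"cache hit: {cache_hit}, tokens read: {usage['cache_read_input_tokens']}")
```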
Field guide
| Field | Meaning |
|---|---|
| input_tokens | Tokens newly processed in this request at the full input rate (neither read from nor written to the cache) |
| cache_read_input_tokens | Tokens served from cache: this is what you check to verify a cache hit. Non-zero = cache worked |
| cache_creation_input_tokens | Tokens written to the cache on this request. Non-zero on the first request with a prompt, and again whenever the entry has expired and is rebuilt |
| output_tokens | Tokens generated in the response, always charged at the full output rate |
Cost
Cached tokens cost 10% of the input token rate (vs. the full input rate for newly processed tokens). On a request reading 100K tokens from cache, you pay for ~10K token equivalents instead of 100K. Over 1,000 repeated requests, this saves ~90% on that portion of your bill. Writing to the cache is not free, though: tokens counted in cache_creation_input_tokens are billed at a premium over the normal input rate, so caching pays off only when the prefix is actually reused. Monitor cache_read_input_tokens in production: if it's always 0, your cache strategy isn't working and you're leaving money on the table.
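The arithmetic above can be checked with a short sketch. The function names are hypothetical, the 10% multiplier is the figure quoted in the text, and cache-write charges and output tokens are deliberately left out to keep the example focused on reads.

```python
# From the text: cached reads bill at 10% of the input token rate.
CACHE_READ_MULTIPLIER = 0.10


def billed_input_equivalents(input_tokens: int, cache_read_tokens: int) -> float:
    """Input-side cost in full-price token equivalents (hypothetical helper)."""
    return input_tokens + cache_read_tokens * CACHE_READ_MULTIPLIER


def savings_fraction(input_tokens: int, cache_read_tokens: int) -> float:
    """Fraction saved versus processing every input token at full price."""
    full_price = input_tokens + cache_read_tokens
    if full_price == 0:
        return 0.0
    billed = billed_input_equivalents(input_tokens, cache_read_tokens)
    return 1 - billed / full_price


# The 100K-token example from the text.
print(f"{billed_input_equivalents(0, 100_000):.0f}")  # 10000
print(f"{savings_fraction(0, 100_000):.0%}")          # 90%
```

Passing a non-zero input_tokens shows how uncached tokens after the breakpoint dilute the saving, which is why the 90% figure applies only to the cached portion.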
Rate limits
Cache writes count toward your input-tokens-per-minute rate limit like ordinary input, so a first request that creates cache (high cache_creation_input_tokens) uses quota the same as a normal request. Cache reads are treated much more lightly: on most current models, tokens reported in cache_read_input_tokens do not count toward the input-tokens-per-minute limit at all, so a high hit rate raises your effective throughput. Check the rate-limit documentation for your model, because some older models do count cached reads.
Common gotcha
Many developers assume a cache hit happened if input_tokens is low. But input_tokens only counts tokens that were neither read from nor written to the cache, so it is just as low on the request that creates the cache entry. You must explicitly check cache_read_input_tokens > 0 to confirm cache reuse. On the first request with a prompt, expect cache_creation_input_tokens to be high and cache_read_input_tokens to be 0. Only on subsequent identical requests should cache_read_input_tokens spike.
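The first-request/second-request pattern can be turned into an explicit check. This is a sketch under stated assumptions: second_request_hit_cache is a hypothetical helper, and the SimpleNamespace objects stand in for the usage fields of two consecutive responses with illustrative numbers.

```python
from types import SimpleNamespace


def second_request_hit_cache(first_usage, second_usage) -> bool:
    """True if the first request wrote a cache entry and the second read it back.

    Hypothetical helper. It returns False when the first request already hit
    a warm cache, because nothing was written on that request.
    """
    written = getattr(first_usage, "cache_creation_input_tokens", 0) or 0
    read_back = getattr(second_usage, "cache_read_input_tokens", 0) or 0
    return written > 0 and read_back >= written


# input_tokens is identical on both requests, so it says nothing about
# whether the cache was used.
first = SimpleNamespace(input_tokens=12,
                        cache_creation_input_tokens=2048,
                        cache_read_input_tokens=0)
second = SimpleNamespace(input_tokens=12,
                         cache_creation_input_tokens=0,
                         cache_read_input_tokens=2048)
print(second_request_hit_cache(first, second))  # True
```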
Error recovery
InvalidRequestError: 'cache_read_input_tokens' not in usage
cache_read_input_tokens is always 0 even for identical requests
Experienced dev note
The real power move: log cache_read_input_tokens / (input_tokens + cache_creation_input_tokens + cache_read_input_tokens) as your cache hit rate in production. Track this metric over time: if it's below 50% despite repetitive requests, your prompt prefix is changing when it shouldn't be, or you need to redesign how you build the system prompt. Also: every cache hit resets the entry's TTL (5 minutes by default), so frequent access keeps the cache alive. Customers whose requests arrive less often than the TTL will rarely see cache savings: design accordingly.
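A hit-rate metric like the one described here could be accumulated as follows. This is a minimal sketch: CacheHitTracker is a hypothetical class, the token counts are illustrative, and it puts cache writes in the denominator alongside uncached and cached-read tokens so that a write-heavy workload does not overstate the rate.

```python
from dataclasses import dataclass


@dataclass
class CacheHitTracker:
    """Running totals of input-side token counts (hypothetical helper)."""
    uncached: int = 0
    written: int = 0
    read: int = 0

    def record(self, input_tokens: int, cache_creation: int, cache_read: int) -> None:
        """Add one response's usage counts to the totals."""
        self.uncached += input_tokens
        self.written += cache_creation
        self.read += cache_read

    def hit_rate(self) -> float:
        """Share of all input tokens that were served from cache."""
        total = self.uncached + self.written + self.read
        return self.read / total if total else 0.0


tracker = CacheHitTracker()
tracker.record(input_tokens=12, cache_creation=2048, cache_read=0)  # cold start
tracker.record(input_tokens=12, cache_creation=0, cache_read=2048)  # warm
tracker.record(input_tokens=12, cache_creation=0, cache_read=2048)  # warm
print(f"hit rate: {tracker.hit_rate():.1%}")
```

Emitting hit_rate() to your metrics system per customer or per prompt template makes it easier to spot which prefixes are changing unexpectedly.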
Check your understanding
You send the same code review request twice. The first response shows input_tokens: 12, cache_creation_input_tokens: 2048, cache_read_input_tokens: 0, and output_tokens: 45. The second response shows input_tokens: 12, cache_read_input_tokens: 2048, and output_tokens: 42. Why did output tokens decrease, and what does that tell you about how caching actually works?
Answer hint
Caching changes how the input prefix is processed, not how the response is generated. A cache hit lets the server reuse the already-processed prefix instead of processing it again, but the output is still sampled fresh on every request, so its length can vary from run to run (especially at non-zero temperature). The key insight: cache hit confirmation (non-zero cache_read_input_tokens) doesn't guarantee identical outputs or token counts: it only guarantees you weren't charged full price for those input tokens.