Cache hit indicators in usage
Why this matters
Cache hits reduce token costs by 90% on cached input and give you measurable insight into whether your caching strategy is working. Without tracking usage metrics, you can't tell if expensive system prompts or long context windows are actually being cached.
Explanation
The OpenAI API returns three token usage fields in every chat completion response: input_tokens, cache_creation_input_tokens, and cache_read_input_tokens. These tell you exactly how many tokens were written to cache (on first request) vs. read from cache (on subsequent requests). When tokens are read from cache, they cost 90% less than standard input tokens: typically $0.30 per million instead of $3 per million for GPT-4o.
Under the hood, the API hashes the combined system message and messages array. If that hash matches a previous request within the 5-minute TTL, those tokens are served from cache instead of reprocessed. The response shows the breakdown so you can calculate actual cost and prove ROI on caching infrastructure.
Use this pattern when you have stable input components (long system prompts, company context, RAG documents) that repeat across requests. Track cache_read_input_tokens to measure your cache hit rate and validate that your strategy is working before scaling to thousands of requests.
Request code
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
# First request: creates cache entry
response_1 = client.chat.completions.create(
model='gpt-4o',
messages=[
{
'role': 'system',
'content': 'You are a financial analyst. Analyze quarterly earnings reports and identify risks.'
},
{
'role': 'user',
'content': 'What are the key risks in Apple Q3 2024?'
}
]
)
print('First request (creates cache):')
print(f' Input tokens: {response_1.usage.input_tokens}')
print(f' Cache creation tokens: {response_1.usage.cache_creation_input_tokens}')
print(f' Cache read tokens: {response_1.usage.cache_read_input_tokens}')
print(f' Output tokens: {response_1.usage.output_tokens}')
# Second request: same system prompt and prefix, hits cache
response_2 = client.chat.completions.create(
model='gpt-4o',
messages=[
{
'role': 'system',
'content': 'You are a financial analyst. Analyze quarterly earnings reports and identify risks.'
},
{
'role': 'user',
'content': 'What are the key risks in Microsoft Q3 2024?'
}
]
)
print('\nSecond request (cache hit):')
print(f' Input tokens: {response_2.usage.input_tokens}')
print(f' Cache creation tokens: {response_2.usage.cache_creation_input_tokens}')
print(f' Cache read tokens: {response_2.usage.cache_read_input_tokens}')
print(f' Output tokens: {response_2.usage.output_tokens}')
# Calculate actual cost
cache_read_cost = response_2.usage.cache_read_input_tokens * (0.30 / 1_000_000)
standard_cost = response_2.usage.input_tokens * (3.00 / 1_000_000)
savings = standard_cost - cache_read_cost
print(f'\nSecond request savings: ${savings:.6f} (cached {response_2.usage.cache_read_input_tokens} tokens at 90% discount)') Authentication
Set your OpenAI API key as an environment variable before running: export OPENAI_API_KEY='sk-...'. The OpenAI SDK reads this automatically at client instantiation.
Response shape
| Field | Description |
|---|---|
usage | [object Object] |
choices | [object Object] |
Field guide
cache_creation_input_tokens Non-zero only on first request with a given input. Represents tokens cached for future reuse. You pay standard rate for these.
cache_read_input_tokens Tokens reused from a previous request. These cost 90% less. This is the hidden ROI field: high values mean your caching is working.
input_tokens Regular input tokens that were NOT cached. This might be user-specific content that differs per request.
cache_hit_rate Not returned directly: calculate as (cache_read_input_tokens / (input_tokens + cache_read_input_tokens)) × 100 to measure caching efficiency.
Setup trap
Cache only works if the exact same prefix matches. Changing even one whitespace character in your system prompt will create a new cache entry. Many developers add timestamps or random IDs to messages thinking it's fine: it breaks caching silently. Use structured caching: keep system/context static, vary only the user message at the end.
Cost
Cached tokens cost $0.30 per million for GPT-4o vs. $3.00 per million standard. A 2,000-token system prompt cached across 1,000 requests saves $5.40. For longer contexts (10,000 tokens cached 100× per day), savings exceed $150/month.
Rate limits
Cache reads do not consume rate limit tokens faster than standard requests. However, cache writes (cache_creation_input_tokens) are processed immediately, so monitor for bursts when introducing a new cached context.
Common gotcha
The API counts input_tokens + cache_read_input_tokens separately. Many developers add them together and think they're double-counting. They're not: input_tokens is non-cached content (like the changing user query), and cache_read_input_tokens is the cached portion (like the system prompt). Only the sum represents total input size.
Error recovery
AuthenticationErrorRateLimitErrorInvalidRequestError - cache not supportedExperienced dev note
Cache metrics are your early warning system for cost overruns. If cache_read_input_tokens stays at 0 across hundreds of requests, your caching strategy is broken: usually because system prompts are templated with unique IDs or timestamps. Before scaling, validate cache hit rates in staging. A 50%+ cache hit rate on a large system prompt pays for your infrastructure. Also: cache works across API clients in the same account within the 5-minute window, so multiple services can share cached contexts: this is a hidden multi-tenant cost optimization.
Check your understanding
You're building a RAG system where each request includes a 3,000-token knowledge base as context, followed by a unique user question. In the first request, cache_creation_input_tokens is 3,000 and input_tokens is 3,200 (3,000 KB + 200 question). In the second request, what should cache_read_input_tokens be, and why?
Show answer hint
The knowledge base is cached (3,000 tokens), but the user question is new content, so it's not served from cache. Think about what portion of the total input is reused vs. unique per request.