Cache hit indicators in usage

What you will learn

Detect and measure prompt cache hits using the usage object in OpenAI API responses to optimize cost and latency. Caching applies to input (prompt) tokens only; completion tokens are never cached.

Why this matters

Cached input tokens are billed at a discount (50% for gpt-4o when prompt caching launched; the rate varies by model), and the usage object gives you measurable insight into whether your caching strategy is working. Without tracking usage metrics, you can't tell whether expensive system prompts or long contexts are actually being served from cache.

Skip if: you're making one-off API calls or prototyping. Cache hits only matter when you repeat the same input prefix (system prompt, shared context) across multiple requests while the cache entry is still warm (typically 5 to 10 minutes of inactivity).

Explanation

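The OpenAI API returns a usage object in every chat completion response with three top-level counts: prompt_tokens, completion_tokens, and total_tokens. The cache hit indicator sits one level down, in usage.prompt_tokens_details.cached_tokens, which reports how many of the prompt tokens were read from cache. There is no separate cache-write field: the first request simply reports cached_tokens as 0, and later requests that share the prefix report a non-zero value. Cached tokens are billed at a reduced rate. For gpt-4o the published list prices have been $1.25 per million cached input tokens against $2.50 per million standard input tokens, a 50% discount; other models use different discounts, so confirm against the current pricing page.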

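Under the hood, caching is automatic and works on the prompt prefix, not on the request as a whole. The API looks for the longest prefix of your prompt that matches a recent request; prompts shorter than 1,024 tokens are never cached, and hits grow in 128-token increments beyond that. Entries are typically evicted after 5 to 10 minutes of inactivity. The response shows the breakdown so you can calculate actual cost and confirm that your prompt layout is paying off.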
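As a rough illustration of prefix-based caching, the sketch below estimates an upper bound on how many tokens a request could read from cache. It assumes the documented behavior that caching starts at a 1,024-token prefix and grows in 128-token steps; estimate_cacheable_tokens is a hypothetical helper, not part of the SDK, and the real figure is whatever the API reports in cached_tokens.

```python
MIN_CACHEABLE_TOKENS = 1024  # assumed minimum prefix length for caching
CACHE_INCREMENT = 128        # assumed step size for cache hits


def estimate_cacheable_tokens(shared_prefix_tokens: int) -> int:
    """Upper bound on cached tokens for a prefix shared with an earlier request."""
    if shared_prefix_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    extra_steps = (shared_prefix_tokens - MIN_CACHEABLE_TOKENS) // CACHE_INCREMENT
    return MIN_CACHEABLE_TOKENS + extra_steps * CACHE_INCREMENT


print(estimate_cacheable_tokens(900))   # 0: below the minimum
print(estimate_cacheable_tokens(2000))  # 1920: 1024 + 7 * 128
```

Under these assumptions, a short system prompt like the one in the request code would never produce a cache hit, however often it repeats.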

Use this pattern when you have stable input components (long system prompts, company context, RAG documents) that repeat across requests. Track cached_tokens to measure your cache hit rate and validate that your strategy is working before scaling to thousands of requests.

Request code

python

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# Caching only applies to prompts of 1,024+ tokens, so a real system prompt
# must be far longer than this one-liner for cached_tokens to be non-zero.
SYSTEM_PROMPT = 'You are a financial analyst. Analyze quarterly earnings reports and identify risks.'


def ask(question):
    """Send one question behind the shared system prompt."""
    return client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': question},
        ],
    )


def print_usage(label, usage):
    """Print token counts for one response and return the cached token count."""
    details = usage.prompt_tokens_details  # may be None on older responses
    cached = (details.cached_tokens or 0) if details else 0
    print(label)
    print(f'  Prompt tokens: {usage.prompt_tokens}')
    print(f'  Cached tokens: {cached}')
    print(f'  Completion tokens: {usage.completion_tokens}')
    return cached


# First request: warms the cache (cached_tokens is expected to be 0)
response_1 = ask('What are the key risks in Apple Q3 2024?')
print_usage('First request (warms cache):', response_1.usage)

# Second request: same system prompt prefix, eligible for a cache hit
response_2 = ask('What are the key risks in Microsoft Q3 2024?')
cached = print_usage('\nSecond request (possible cache hit):', response_2.usage)

# Estimate savings. Rates are assumptions: check the current pricing page.
STANDARD_RATE = 2.50 / 1_000_000  # USD per uncached input token (gpt-4o)
CACHED_RATE = 1.25 / 1_000_000    # USD per cached input token (gpt-4o)
savings = cached * (STANDARD_RATE - CACHED_RATE)
print(f'\nSecond request savings: ${savings:.6f} ({cached} cached tokens)')

Authentication

Set your OpenAI API key as an environment variable before running: export OPENAI_API_KEY='sk-...'. The OpenAI SDK reads this automatically at client instantiation.

Response shape

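Field	Description
`usage`	Token accounting object: prompt_tokens, completion_tokens, total_tokens, and prompt_tokens_details.cached_tokens
`choices`	List of generated completions; each entry holds a message and a finish_reason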

Field guide

prompt_tokens

Total input tokens for the request, cached and uncached combined. You are billed for all of them, with the cached portion at the discounted rate.

prompt_tokens_details.cached_tokens

The portion of prompt_tokens that was read from cache. It is 0 on the first request with a given prefix and on any prompt shorter than 1,024 tokens. High values mean your caching is working.

completion_tokens

Output tokens generated by the model. These are never cached and are always billed at the standard output rate.

cache_hit_rate

Not returned directly: calculate as (cached_tokens / prompt_tokens) × 100 to measure caching efficiency.
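A minimal sketch of the hit-rate calculation follows. It assumes the usage object exposes prompt_tokens and prompt_tokens_details.cached_tokens, as the OpenAI chat completions API does; cache_hit_rate is a hypothetical helper, and the SimpleNamespace object stands in for response.usage with made-up numbers.

```python
from types import SimpleNamespace


def cache_hit_rate(usage) -> float:
    """Return the percentage of prompt tokens served from cache (0.0 to 100.0)."""
    details = getattr(usage, 'prompt_tokens_details', None)
    cached = (getattr(details, 'cached_tokens', None) or 0) if details else 0
    if not usage.prompt_tokens:
        return 0.0
    return 100.0 * cached / usage.prompt_tokens


# Stand-in for response.usage; the numbers are illustrative only.
example_usage = SimpleNamespace(
    prompt_tokens=2048,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1536),
)
print(f'{cache_hit_rate(example_usage):.1f}%')  # 75.0%
```

The guards matter in practice: prompt_tokens_details can be absent, in which case the helper reports 0% instead of raising.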

Setup trap

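Cache only works if the exact same prefix matches, and only for prompts of at least 1,024 tokens. Changing even one whitespace character in your system prompt invalidates the cached prefix from that point onward. Many developers add timestamps or random IDs near the start of the messages thinking it's harmless: it breaks caching silently, with no error and cached_tokens stuck at 0. Structure prompts for caching: keep system prompt and shared context static and first, and put anything that varies per request at the end.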
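One way to keep the prefix stable is to build messages from a constant and push every per-request value to the end. The sketch below is an illustration only: build_messages, STATIC_SYSTEM_PROMPT, and the request_id format are hypothetical names, and the prompt shown is far shorter than a real cacheable prompt would need to be.

```python
STATIC_SYSTEM_PROMPT = 'You are a financial analyst. Analyze quarterly earnings reports and identify risks.'


def build_messages(question: str, request_id: str) -> list[dict]:
    """Keep the system prompt byte-identical; put per-request data last."""
    return [
        {'role': 'system', 'content': STATIC_SYSTEM_PROMPT},
        # Volatile values (IDs, timestamps) go after the shared prefix.
        {'role': 'user', 'content': f'{question}\n\n[request_id: {request_id}]'},
    ]


first = build_messages('What are the key risks in Apple Q3 2024?', 'req-001')
second = build_messages('What are the key risks in Microsoft Q3 2024?', 'req-002')
print(first[0] == second[0])  # True: the leading message is identical
```

Formatting the ID into the system prompt instead would change the very first tokens of every request and prevent any prefix match.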

Cost

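Cached input tokens for gpt-4o have been listed at $1.25 per million vs. $2.50 per million standard; confirm current rates on the pricing page. At those rates, a 2,000-token system prompt served from cache across 1,000 requests (2 million cached tokens) saves about $2.50. For longer contexts (10,000 tokens cached 100× per day, roughly 30 million cached tokens per month), savings come to about $37.50/month.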
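The arithmetic can be sketched as a small function so you can plug in your own volumes. The rates passed in below are assumed gpt-4o list prices and should be replaced with current figures from the pricing page; projected_savings is a hypothetical helper.

```python
def projected_savings(
    cached_tokens_per_request: int,
    cache_hits: int,
    standard_per_million: float,
    cached_per_million: float,
) -> float:
    """Return USD saved by reading tokens from cache instead of reprocessing them."""
    total_cached_tokens = cached_tokens_per_request * cache_hits
    discount_per_million = standard_per_million - cached_per_million
    return total_cached_tokens * discount_per_million / 1_000_000


# Assumed rates: $2.50 standard and $1.25 cached per million input tokens.
print(projected_savings(2_000, 1_000, 2.50, 1.25))   # 2.5
print(projected_savings(10_000, 3_000, 2.50, 1.25))  # 37.5
```

This treats every counted request as a full cache hit, so it is an upper bound: the first request after each cache expiry pays the standard rate.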

Rate limits

Prompt caching does not change rate limits. Cached tokens still count toward your tokens-per-minute (TPM) limit exactly like uncached ones, so caching lowers cost and latency but not rate limit consumption. Plan capacity around total prompt_tokens, not the uncached remainder.

Common gotcha

cached_tokens is a subset of prompt_tokens, not an addition to it. Many developers add the two together and end up double-counting the cached portion. prompt_tokens already represents the total input size; the uncached part (like the changing user query) is prompt_tokens minus cached_tokens, and the cached part (like the system prompt) is cached_tokens.
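A short sketch of that accounting, assuming OpenAI's convention in which cached_tokens is reported as part of prompt_tokens. The helper name split_prompt_tokens and the numbers are illustrative.

```python
def split_prompt_tokens(prompt_tokens: int, cached_tokens: int) -> tuple[int, int]:
    """Return (uncached, cached) input tokens for one response."""
    if cached_tokens > prompt_tokens:
        raise ValueError('cached_tokens cannot exceed prompt_tokens')
    return prompt_tokens - cached_tokens, cached_tokens


uncached, cached = split_prompt_tokens(prompt_tokens=1500, cached_tokens=1280)
print(uncached, cached)   # 220 1280
print(uncached + cached)  # 1500: the total input is prompt_tokens itself
```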

Error recovery

AuthenticationError

API key missing or invalid. Verify OPENAI_API_KEY is set and non-empty before instantiating OpenAI().

RateLimitError

You've exceeded token limits. Cache helps, but still respect rate limits. Check response.usage after each request to track cumulative consumption.

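No error, but cached_tokens stays 0

Prompt caching is automatic, so the API does not raise an error when a model or prompt is ineligible: cached_tokens is simply 0. Ensure you're using gpt-4o or a newer model and that the shared prefix is at least 1,024 tokens. Older models such as gpt-4-turbo and gpt-3.5-turbo don't support caching.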

Experienced dev note

Cache metrics are your early warning system for cost overruns. If cached_tokens stays at 0 across hundreds of requests, your caching strategy is broken: usually because system prompts are templated with unique IDs or timestamps, or because the shared prefix is under 1,024 tokens. Before scaling, validate cache hit rates in staging. Also: caches are scoped to your organization rather than to a single client, so multiple services that send the same prefix can benefit from each other's warm cache while the entry is still live.
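A minimal sketch of such an early warning, assuming you feed it prompt_tokens and cached_tokens from each response. CacheStats and its 100-request threshold are hypothetical choices, not an SDK feature.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of prompt and cached tokens across requests."""
    requests: int = 0
    prompt_tokens: int = 0
    cached_tokens: int = 0

    def record(self, prompt_tokens: int, cached_tokens: int) -> None:
        self.requests += 1
        self.prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens

    @property
    def hit_rate(self) -> float:
        """Fraction of all prompt tokens served from cache (0.0 to 1.0)."""
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

    def looks_broken(self, min_requests: int = 100) -> bool:
        """True when many requests have been seen but none hit the cache."""
        return self.requests >= min_requests and self.cached_tokens == 0


stats = CacheStats()
for _ in range(100):
    stats.record(prompt_tokens=2000, cached_tokens=0)
print(stats.looks_broken())  # True: 100 requests, zero cached tokens
```

In staging you might call record() after every response and alert when looks_broken() turns True or hit_rate drops below a level you choose.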

Check your understanding

You're building a RAG system where each request includes a 3,000-token knowledge base as context, followed by a unique user question. In the first request, cached_tokens is 0 and prompt_tokens is 3,200 (3,000 KB + 200 question). In the second request, roughly what should cached_tokens be, and why?

Answer hint

The knowledge base is cached (3,000 tokens), but the user question is new content, so it's not served from cache. Think about what portion of the total input is reused vs. unique per request.

VERSION Prompt caching was introduced in the OpenAI API in October 2024. Use a current release of the openai Python package: SDK versions that predate the feature do not expose prompt_tokens_details on the usage object.
