Cache savings calculation
Why this matters
Prompt caching can cut the cost of cached input tokens by up to 90%, but only if you understand the cache economics. Knowing when you break even, how many requests actually hit the cache, and what your real savings are keeps you from misconfiguring an expensive system or missing an optimization opportunity.
Explanation
What prompt caching does: Prompt caching stores the processed form of a repeated prompt prefix (your system prompt, context, or document sections) on the provider's infrastructure, keyed by a hash of its content. Subsequent requests with an identical prefix reuse that cached computation: cached tokens are billed at 10% of the normal input rate, while the request that writes the cache costs slightly more than an uncached one.
How it works: On the first request you pay the full input token cost plus a small cache-write overhead, and the response reports cache_creation_input_tokens. On repeat requests with the same prefix you pay only 10% of the normal input rate for the cached tokens, and the response reports cache_read_input_tokens. The economics depend entirely on your hit rate: if you never repeat a prefix, you pay the write overhead and get nothing back. If you cache a 10K-token document and request it 100 times, the 99 cache hits save roughly 890K tokens' worth of input cost (99 × 10,000 × 0.9).
When to use it: Caching pays off for (1) systems that reuse large documents or context blocks across many requests, (2) multi-turn conversations with fixed system prompts, (3) batch operations on the same knowledge base, and (4) RAG systems that serve the same documents repeatedly. Avoid it for one-off requests or highly variable inputs.
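To make the hit-rate dependence concrete, here is a minimal sketch of the expected input cost of one request as a function of hit rate. The function name and the 1.25 write multiplier are assumptions for illustration (the text only says a cache write costs "slightly more"); the 10% read rate and the $0.003 per 1K input rate are the figures used elsewhere on this page.

```python
def expected_input_cost(prefix_tokens, hit_rate, input_rate_per_1k=0.003,
                        read_multiplier=0.10, write_multiplier=1.25):
    """Expected input cost (USD) of one request for the cached prefix.

    hit_rate is the fraction of requests served from the cache; every miss
    is assumed to rewrite the cache and pay the write surcharge.
    write_multiplier=1.25 is an illustrative assumption, not a quoted price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    base = prefix_tokens * input_rate_per_1k / 1000
    hit_cost = base * read_multiplier
    miss_cost = base * write_multiplier
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost


if __name__ == "__main__":
    uncached = 10_000 * 0.003 / 1000  # cost of the prefix with no caching at all
    for rate in (0.0, 0.5, 0.99):
        cost = expected_input_cost(10_000, rate)
        print(f"hit rate {rate:.0%}: ${cost:.5f} per request "
              f"(uncached: ${uncached:.5f})")
```

At a hit rate of 0 the cached configuration costs more than the uncached one, which is the "you pay the overhead and get nothing back" case.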
Request code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
# First request: cache write
response_1 = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': 'You are a legal document analyzer. Respond concisely.',
                    'cache_control': {'type': 'ephemeral'}
                },
                {
                    'type': 'text',
                    # Large cached context (10,000 characters, not 10,000 tokens)
                    'text': 'DOCUMENT: ' + ('X' * 10000),
                    'cache_control': {'type': 'ephemeral'}
                },
                {
                    'type': 'text',
                    'text': 'Question: What is the main subject?'
                }
            ]
        }
    ]
)
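# Rates used throughout: $0.003 per 1K input tokens, $0.06 per 1K output tokens.
# getattr guards against usage objects that do not expose the cache fields.
cache_creation = getattr(response_1.usage, 'cache_creation_input_tokens', None) or 0
first_input = response_1.usage.prompt_tokens
# ASSUMPTION: cache writes carry a 25% surcharge on the written tokens. The text
# only says "slightly more", so substitute your provider's published multiplier.
first_cost = (
    first_input * 0.003 / 1000 +
    cache_creation * 0.003 / 1000 * 0.25 +
    response_1.usage.completion_tokens * 0.06 / 1000
)
print('First request (cache write):')
print(f' Input tokens: {first_input}')
print(f' Cache creation: {cache_creation}')
print(f' Cost: ${first_cost:.4f}')
print()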
# Second request: cache read (identical prefix)
response_2 = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': 'You are a legal document analyzer. Respond concisely.',
                    'cache_control': {'type': 'ephemeral'}
                },
                {
                    'type': 'text',
                    'text': 'DOCUMENT: ' + ('X' * 10000),
                    'cache_control': {'type': 'ephemeral'}
                },
                {
                    'type': 'text',
                    'text': 'Question: Who are the parties involved?'
                }
            ]
        }
    ]
)
# getattr guards against usage objects that do not expose the cache fields.
cache_read = getattr(response_2.usage, 'cache_read_input_tokens', None) or 0
second_input = response_2.usage.prompt_tokens
# Cache-read tokens cost 10% of the normal input rate
cached_token_cost = cache_read * 0.003 / 1000 * 0.1
uncached_token_cost = (second_input - cache_read) * 0.003 / 1000
second_cost = (
    cached_token_cost +
    uncached_token_cost +
    response_2.usage.completion_tokens * 0.06 / 1000
)
print('Second request (cache hit):')
print(f' Cache read: {cache_read}')
print(f' New input tokens: {second_input - cache_read}')
print(f' Cost: ${second_cost:.4f}')
print()
# Savings calculation
total_completion_tokens = (
    response_1.usage.completion_tokens + response_2.usage.completion_tokens
)
without_cache_cost = (
    (first_input + second_input) * 0.003 / 1000 +
    total_completion_tokens * 0.06 / 1000
)
with_cache_cost = first_cost + second_cost
savings = without_cache_cost - with_cache_cost
savings_percent = (savings / without_cache_cost) * 100
print('Total cost comparison (2 requests):')
print(f' Without cache: ${without_cache_cost:.4f}')
print(f' With cache: ${with_cache_cost:.4f}')
print(f' Savings: ${savings:.4f} ({savings_percent:.1f}%)')
print()
# Break-even analysis: how many cache hits recover the cache-write overhead?
base_first_cost = (
    first_input * 0.003 / 1000 +
    response_1.usage.completion_tokens * 0.06 / 1000
)
overhead = first_cost - base_first_cost
savings_per_hit = (cache_read * 0.003 / 1000) - (cache_read * 0.003 / 1000 * 0.1)
if savings_per_hit > 0:
    # Smallest whole number of hits whose savings strictly exceed the overhead
    breakeven_hits = int(overhead / savings_per_hit) + 1
    print('Break-even analysis:')
    print(f' Cache write overhead: ${overhead:.4f}')
    print(f' Savings per cache hit: ${savings_per_hit:.4f}')
    print(f' Break-even after cache hits: {breakeven_hits}')
else:
    print(f'No savings per hit (cache read tokens = {cache_read}, check cache control setup)')
Authentication
Prompt caching requires an OpenAI API key with access to GPT-4o models (or later). Set the key via an environment variable (export OPENAI_API_KEY='sk-...') or pass it explicitly to the client. No additional permissions are needed: caching is enabled automatically for all eligible models.
Response shape
| Field | Description |
|---|---|
| usage.prompt_tokens | Total input tokens sent (cached + non-cached) |
| usage.cache_creation_input_tokens | Tokens written to cache on the first request (only present if cache_control is set) |
| usage.cache_read_input_tokens | Tokens read from cache on subsequent requests (charged at 10% of the input rate) |
| usage.completion_tokens | Output tokens (full rate, no caching) |
| choices[0].message.content | The actual response text |
Field guide
cache_creation_input_tokens: On the first request with cache_control set, this shows how many tokens were written to the cache. This is your up-front investment for repeated queries.
cache_read_input_tokens: The field that confirms your savings. It tells you exactly how many tokens were served from the cache. Multiply it by 0.003 / 1000 * 0.1 to get the cost of the cached tokens. If it is 0 on a repeat request, your prefix did not match the cached one (hash mismatch).
prompt_tokens: The sum of cache_read_input_tokens and the new, uncached tokens. Don't confuse it with cache_read_input_tokens: prompt_tokens is the full input count, but its two portions are billed differently, with the uncached portion at the full rate and the cached portion at 10% of it.
cache_creation_input_tokens presence: If this field is absent or 0, cache_control wasn't applied or the model doesn't support it. Check your model version (gpt-4o-2024-11-20 or later is needed for reliable caching).
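As a sketch of how these fields combine into a per-request input cost, the helper below takes the three counts as plain integers. The function name and the write_surcharge parameter are hypothetical, and it assumes, as the table above states, that prompt_tokens already includes the cached tokens. Completion tokens are left out.

```python
def input_cost_breakdown(prompt_tokens, cache_read_tokens=0, cache_creation_tokens=0,
                         input_rate_per_1k=0.003, read_multiplier=0.10,
                         write_surcharge=0.25):
    """Split one request's input cost (USD) into uncached, cached, and write parts.

    Assumes prompt_tokens is the total input count (cached + uncached).
    write_surcharge=0.25 is an illustrative assumption for the extra cost
    of tokens written to the cache.
    """
    if cache_read_tokens > prompt_tokens:
        raise ValueError("cache_read_tokens cannot exceed prompt_tokens")
    rate = input_rate_per_1k / 1000
    uncached = (prompt_tokens - cache_read_tokens) * rate
    cached = cache_read_tokens * rate * read_multiplier
    write_extra = cache_creation_tokens * rate * write_surcharge
    return {
        "uncached": uncached,
        "cached": cached,
        "write_extra": write_extra,
        "total": uncached + cached + write_extra,
    }


if __name__ == "__main__":
    # A repeat request: 10,050 input tokens, 10,000 of them read from cache.
    print(input_cost_breakdown(10_050, cache_read_tokens=10_000))
```

Keeping the three parts separate makes it easy to see whether a request was dominated by uncached input, cache reads, or the write surcharge.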
Setup trap
Ephemeral caches persist for 5 minutes. If your test sends requests more than 5 minutes apart, the cache expires and cache_read_input_tokens drops back to 0, which makes caching look broken when it is not. For testing, send requests within 60 seconds of each other. In production, keep traffic on a cached prefix frequent enough to stay inside the TTL, and check your provider's documentation for longer-lived cache options before relying on one.
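One way to sanity-check a test schedule against the 5-minute window is to simulate it offline. The sketch below marks which requests would land inside the TTL. The function name is hypothetical, and whether a cache hit resets the timer is exposed as a parameter because the text does not say.

```python
def expected_cache_hits(request_times, ttl_seconds=300, refresh_on_hit=True):
    """Return a list of booleans: True where a request should hit the cache.

    request_times are seconds since the first request, in ascending order.
    The first request always writes the cache. refresh_on_hit=True assumes
    each hit restarts the TTL clock; set it to False if your provider counts
    from the original write only.
    """
    hits = []
    cache_written_at = None
    for t in request_times:
        alive = cache_written_at is not None and (t - cache_written_at) <= ttl_seconds
        hits.append(alive)
        if not alive or refresh_on_hit:
            # A miss rewrites the cache; a hit optionally refreshes it.
            cache_written_at = t
    return hits


if __name__ == "__main__":
    # Requests at 0 s, 30 s, 6 min 40 s, and 7 min after the first one.
    print(expected_cache_hits([0, 30, 400, 420]))
```

A False anywhere after the first entry marks a request that would pay the write cost again rather than the discounted read rate.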
Cost
Real numbers (as of April 2026, GPT-4o pricing): input tokens cost $0.003 per 1K tokens, and cache-read tokens cost 90% less, $0.0003 per 1K tokens. If you cache a 10,000-token document and query it 100 times: without cache = 100 × 10,000 × $0.003/1K = $3.00. With cache = (1 × 10,000 × $0.003/1K for the first, cache-writing request) + (99 × 10,000 × $0.0003/1K) = $0.03 + $0.297 ≈ $0.33. Savings: about $2.67 per 100 queries. This estimate prices the first request at the normal rate, so any cache-write surcharge reduces the savings slightly. The math only works if your cache actually hits: verify that cache_read_input_tokens > 0 before counting on savings.
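The arithmetic above can be checked with a few lines of Python. This sketch reproduces the 100-query scenario and adds an optional write_multiplier parameter to show how a write surcharge would shift the result. That parameter and the function name are assumptions for illustration.

```python
def scenario_costs(doc_tokens, queries, input_rate_per_1k=0.003,
                   read_rate_per_1k=0.0003, write_multiplier=1.0):
    """Return (without_cache, with_cache, savings) in USD for the cached document.

    Counts only the cached document's input tokens; the appended questions
    and the output tokens cost the same either way. write_multiplier=1.0
    prices the first request at the normal rate; other values are hypothetical.
    """
    if queries < 1:
        raise ValueError("queries must be at least 1")
    without_cache = queries * doc_tokens * input_rate_per_1k / 1000
    first_request = doc_tokens * input_rate_per_1k / 1000 * write_multiplier
    cache_hits = (queries - 1) * doc_tokens * read_rate_per_1k / 1000
    with_cache = first_request + cache_hits
    return without_cache, with_cache, without_cache - with_cache


if __name__ == "__main__":
    for multiplier in (1.0, 1.25):
        without, with_, saved = scenario_costs(10_000, 100, write_multiplier=multiplier)
        print(f"write x{multiplier}: without=${without:.4f} "
              f"with=${with_:.4f} saved=${saved:.4f}")
```

With write_multiplier=1.0 the function returns roughly $3.00, $0.327, and $2.673; a hypothetical 1.25 multiplier adds $0.0075 to the cached total.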
Rate limits
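Prompt caching doesn't change your rate limits, but it can reduce billed token consumption significantly. If you are hitting tokens-per-minute limits and cached tokens are discounted or excluded when counted against them, caching effectively multiplies your request capacity: at a 10% rate, 100 fully cached requests consume about as many tokens as 10 uncached ones. Confirm how cached tokens are counted before planning capacity around this.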
Common gotcha
The most common mistake is to assume that a cache_read_input_tokens of 0 on the second request means caching failed silently. In reality it usually means the prefix hash did not match, typically because whitespace, punctuation, or even invisible characters changed in the cached content. The content must be identical byte for byte; even a trailing space breaks the cache.
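When a repeat request reports zero cache reads, comparing the two prefixes byte for byte usually finds the culprit. The helper below is a minimal sketch using only the standard library. Its name is hypothetical, and the SHA-256 digest is just a convenient local fingerprint, not a claim about how the provider hashes prefixes.

```python
import hashlib


def describe_prefix_mismatch(first, second):
    """Compare two prompt prefixes as UTF-8 bytes.

    Returns None if they are identical, otherwise a short description of
    the first byte offset at which they differ.
    """
    a, b = first.encode("utf-8"), second.encode("utf-8")
    if a == b:
        return None
    # Walk both byte strings until the first difference; if one is a prefix
    # of the other, the difference starts where the shorter one ends.
    offset = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                  min(len(a), len(b)))
    context_a = a[max(0, offset - 10):offset + 10]
    context_b = b[max(0, offset - 10):offset + 10]
    return (f"differ at byte {offset}: {context_a!r} vs {context_b!r} "
            f"(sha256 {hashlib.sha256(a).hexdigest()[:12]} vs "
            f"{hashlib.sha256(b).hexdigest()[:12]})")


if __name__ == "__main__":
    doc = "DOCUMENT: lease agreement"
    print(describe_prefix_mismatch(doc, doc))          # identical prefixes
    print(describe_prefix_mismatch(doc, doc + " "))    # trailing space added
```

Printing the surrounding bytes with repr makes invisible characters such as zero-width spaces or non-breaking spaces show up as escape sequences.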
Error recovery
InvalidRequestError: 'cache_control' not supported
cache_read_input_tokens is always 0
RateLimitError after enabling cache
CacheLimitError: exceeds cache size
Experienced dev note
Cache savings calculations hide one complexity: cache-write overhead. Your first request with a 10K-token cached prompt doesn't just cost the normal token price; it pays slightly more for the cache write. The break-even point, where cumulative savings exceed that overhead, is typically 3–5 requests for small caches and 20+ for huge documents. Calculate it before deploying caching, so that a first request that costs more (or runs slower) than the rest doesn't surprise you in production. Also, the cache is regional to OpenAI's infrastructure; don't assume it persists across API retries if you are load-balancing across endpoints.
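Break-even can be computed directly from the write and read multipliers. The sketch below assumes that both the write surcharge and the read discount scale linearly with the number of cached tokens; under that assumption the token count cancels out, so the result depends only on the ratio of overhead to per-hit savings. The function name and every example multiplier other than the 10% read rate are illustrative.

```python
import math


def break_even_hits(write_multiplier, read_multiplier=0.10):
    """Smallest number of cache hits whose savings cover the write overhead.

    Costs are expressed relative to the normal input price of the cached
    tokens: a write costs write_multiplier, a hit costs read_multiplier.
    Both multipliers are assumptions to replace with your provider's rates.
    """
    if not 0 <= read_multiplier < 1:
        raise ValueError("read_multiplier must be in [0, 1) for hits to save money")
    overhead = max(write_multiplier - 1.0, 0.0)   # extra paid on the write
    savings_per_hit = 1.0 - read_multiplier       # saved on every hit
    return math.ceil(overhead / savings_per_hit)


if __name__ == "__main__":
    for w in (1.0, 1.25, 2.0, 4.0):
        print(f"write x{w}: break-even after {break_even_hits(w)} cache hit(s)")
```

A return value of 0 means there is no write overhead to recover, so the first hit is already pure savings. If your pricing includes a fixed per-write cost that does not scale with token count, this linear model no longer applies.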
Check your understanding
You have a RAG system that caches a 5000-token document, which gets queried 1000 times per month with different questions appended. The first request costs $0.015 with cache overhead. Subsequent requests with cache hits cost $0.0015. What is your break-even point (minimum requests needed to justify enabling cache), and what is your monthly savings at 1000 queries?
Show answer hint
Break-even: divide cache overhead by per-hit savings. Overhead is (first_request_cost_with_cache - first_request_cost_without_cache). Per-hit savings is (normal_cost_for_5K_tokens × 0.9). Monthly savings = (999 × per_hit_savings) - overhead.