API Advanced hard · 8 min

Cache savings calculation

What you will learn

Calculate actual token savings and cost reduction from prompt caching by measuring cache hit rates and comparing cached vs. uncached request costs.

Why this matters

Prompt caching can cut the cost of cached input tokens by up to 90%, but only if you understand the cache economics. Knowing when you break even, how many requests actually hit the cache, and what your real savings are keeps you from misconfiguring an expensive system or missing an optimization opportunity.

Skip if: your prompts are highly dynamic (cache misses dominate), you only make single isolated requests (no reuse patterns), or your workload is latency-critical and cache writes add unacceptable overhead. Caching is an optimization for repeated patterns, not a universal win.

Explanation

What prompt caching does: Prompt caching stores the already-processed form of a prompt prefix (system prompt, context, or document sections) on the provider's infrastructure, keyed by the exact prefix content. Subsequent requests with an identical prefix reuse that cached computation. In the pricing model used in this lesson, cached tokens are billed at 10% of the normal input rate, and the request that writes the cache costs slightly more than an uncached one. The discount and the write surcharge vary by provider and model, so confirm both against current pricing.

How it works: On your first request, you pay the full input token cost plus a small cache-write overhead, and the response reports cache_creation_input_tokens. On repeat requests with the same prefix, you pay only 10% of the normal input rate for the cached tokens, and the response reports cache_read_input_tokens. Note that cache_control, cache_creation_input_tokens, and cache_read_input_tokens are the names used by Anthropic's API; OpenAI's own endpoint caches long prefixes automatically and reports hits in usage.prompt_tokens_details.cached_tokens. The economics depend entirely on your hit rate: if you never repeat a prefix, you only pay extra. If you cache a 10K-token document and request it 100 times, the 99 hits save roughly 890K tokens' worth of input cost (99 × 10,000 × 0.9).

When to use it: Caching pays off for (1) systems that reuse large documents or context blocks across many requests, (2) multi-turn conversations with fixed system prompts, (3) batch operations on the same knowledge base, and (4) RAG systems serving the same documents repeatedly. Avoid it for one-off requests or highly variable inputs.

Request code

python

import math
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# Illustrative rates used throughout this lesson, in USD per token.
# Confirm against your provider's current pricing before relying on them.
INPUT_RATE = 0.003 / 1000
OUTPUT_RATE = 0.06 / 1000
CACHE_READ_MULTIPLIER = 0.1  # cached tokens billed at 10% of the input rate
# Assumption: a 25% surcharge on tokens written to the cache (Anthropic's
# 5-minute cache pricing). Use 0.0 where cache writes carry no surcharge.
CACHE_WRITE_PREMIUM = 0.25

SYSTEM_TEXT = 'You are a legal document analyzer. Respond concisely.'
# 10,000 characters, not 10,000 tokens; the real token count comes back in usage.
DOCUMENT_TEXT = 'DOCUMENT: ' + ('X' * 10000)


def ask(question):
    """Send the fixed prefix plus one question and return the response."""
    return client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {
                'role': 'user',
                'content': [
                    # cache_control marks explicit cache breakpoints (Anthropic-style).
                    # Endpoints that cache automatically may ignore or reject this
                    # key; remove it if the request fails validation.
                    {'type': 'text', 'text': SYSTEM_TEXT,
                     'cache_control': {'type': 'ephemeral'}},
                    {'type': 'text', 'text': DOCUMENT_TEXT,
                     'cache_control': {'type': 'ephemeral'}},
                    {'type': 'text', 'text': question},
                ],
            }
        ],
    )


def cache_counts(usage):
    """Return (cache_creation, cache_read) token counts, defaulting to 0.

    Tries the cache_* field names first, then falls back to
    usage.prompt_tokens_details.cached_tokens for the read count.
    """
    created = getattr(usage, 'cache_creation_input_tokens', None) or 0
    read = getattr(usage, 'cache_read_input_tokens', None)
    if read is None:
        details = getattr(usage, 'prompt_tokens_details', None)
        read = getattr(details, 'cached_tokens', None)
    return created, read or 0


def request_cost(usage):
    """Cost of one request: uncached input + cache write + cache read + output."""
    created, read = cache_counts(usage)
    # prompt_tokens is treated as the total input count (cached + uncached).
    plain = max(usage.prompt_tokens - created - read, 0)
    return (
        plain * INPUT_RATE
        + created * INPUT_RATE * (1 + CACHE_WRITE_PREMIUM)
        + read * INPUT_RATE * CACHE_READ_MULTIPLIER
        + usage.completion_tokens * OUTPUT_RATE
    )


def uncached_cost(usage):
    """What the same request would cost with caching disabled."""
    return usage.prompt_tokens * INPUT_RATE + usage.completion_tokens * OUTPUT_RATE


# First request: cache write
response_1 = ask('Question: What is the main subject?')
cache_creation, _ = cache_counts(response_1.usage)
first_cost = request_cost(response_1.usage)

print('First request (cache write):')
print(f'  Input tokens: {response_1.usage.prompt_tokens}')
print(f'  Cache creation: {cache_creation}')
print(f'  Cost: ${first_cost:.4f}')
print()

# Second request: cache read (identical prefix, different question)
response_2 = ask('Question: Who are the parties involved?')
_, cache_read = cache_counts(response_2.usage)
second_cost = request_cost(response_2.usage)

print('Second request (cache hit):')
print(f'  Cache read: {cache_read}')
print(f'  New input tokens: {response_2.usage.prompt_tokens - cache_read}')
print(f'  Cost: ${second_cost:.4f}')
print()

# Savings calculation
without_cache_cost = uncached_cost(response_1.usage) + uncached_cost(response_2.usage)
with_cache_cost = first_cost + second_cost
savings = without_cache_cost - with_cache_cost
savings_percent = (savings / without_cache_cost) * 100

print('Total cost comparison (2 requests):')
print(f'  Without cache: ${without_cache_cost:.4f}')
print(f'  With cache: ${with_cache_cost:.4f}')
print(f'  Savings: ${savings:.4f} ({savings_percent:.1f}%)')
print()

# Break-even analysis: how many cache hits recover the write overhead?
# The original version subtracted first_cost from an identical expression,
# so the overhead was always 0; here it is the write surcharge actually paid.
overhead = max(first_cost - uncached_cost(response_1.usage), 0.0)
savings_per_hit = cache_read * INPUT_RATE * (1 - CACHE_READ_MULTIPLIER)
if savings_per_hit > 0:
    hits_needed = math.ceil(overhead / savings_per_hit)
    print('Break-even analysis:')
    print(f'  Cache write overhead: ${overhead:.4f}')
    print(f'  Savings per cache hit: ${savings_per_hit:.4f}')
    print(f'  Break-even at request: #{hits_needed + 1}')
else:
    print(f'No savings per hit (cache read tokens = {cache_read}, check cache control setup)')

Authentication

Prompt caching requires an OpenAI API key with access to GPT-4o or later models. Set the key via an environment variable (export OPENAI_API_KEY='sk-...') or pass it explicitly to the client. No additional permissions are needed: on OpenAI's API, caching is applied automatically to eligible models once the prompt is long enough.

Response shape

Field	Description
`usage.prompt_tokens`	Total input tokens sent (includes cached + non-cached)
`usage.cache_creation_input_tokens`	Tokens written to cache on first request (only present if cache_control is set)
`usage.cache_read_input_tokens`	Tokens read from cache on subsequent requests (charged at 10% rate)
`usage.completion_tokens`	Output tokens (full rate, no caching)
`choices[0].message.content`	The actual response text
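
To turn the usage fields above into a hit rate, one option is to collect the usage block of every response and aggregate it. The sketch below is a minimal illustration, not a library API: summarize_cache_usage is a hypothetical helper, and it assumes each record is a plain dict that carries prompt_tokens and, when the cache was read, cache_read_input_tokens.

```python
def summarize_cache_usage(usages):
    """Aggregate per-request usage dicts into request-level and token-level hit rates.

    Each dict is assumed to have 'prompt_tokens' and, optionally,
    'cache_read_input_tokens' (missing or None is treated as 0).
    """
    usages = list(usages)
    total_prompt = sum(u['prompt_tokens'] for u in usages)
    reads = [u.get('cache_read_input_tokens') or 0 for u in usages]
    requests_with_hit = sum(1 for r in reads if r > 0)
    return {
        'requests': len(usages),
        'request_hit_rate': requests_with_hit / len(usages) if usages else 0.0,
        'token_hit_rate': sum(reads) / total_prompt if total_prompt else 0.0,
    }


sample = [
    {'prompt_tokens': 10050, 'cache_creation_input_tokens': 10000},  # cache write
    {'prompt_tokens': 10060, 'cache_read_input_tokens': 10000},      # hit
    {'prompt_tokens': 10040, 'cache_read_input_tokens': 10000},      # hit
]
summary = summarize_cache_usage(sample)
print(summary)
```

The token-level rate is usually the more useful of the two for cost estimates, because it weights each hit by how many tokens it actually saved.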

Field guide

cache_creation_input_tokens

On the first request with cache_control set, this shows how many tokens entered the cache. This is your baseline investment for repeated queries.

cache_read_input_tokens

This field tells you exactly how many input tokens were served from the cache. Multiply it by 0.003 / 1000 * 0.1 to get the cost of the cached tokens at this lesson's rates. If it is 0 on a repeat request, the cached prefix did not match, the cache entry expired, or the prefix was shorter than the provider's minimum cacheable length.

prompt_tokens

This is the sum of cache_read_input_tokens and the new, uncached tokens. Don't confuse the two: prompt_tokens is the full input count, but billing is split. The cached portion is charged at 10% of the input rate, and only the uncached remainder is charged at the full rate.

cache_creation_input_tokens presence

If this field is absent or 0, either cache_control wasn't applied, the prefix was below the minimum cacheable length, or the API you are calling doesn't report cache writes under this name. Check the provider's list of models that support caching before assuming a setup error.

Setup trap

With ephemeral caching, an entry lives for about 5 minutes after it was last used. If your test sends requests more than 5 minutes apart, the cache expires and you'll see cache_read_input_tokens = 0 again, which looks as if caching were broken. For testing, send requests within 60 seconds of each other. In production, each hit refreshes the lifetime, so steady traffic keeps an entry warm. There is no indefinite cache, though some providers offer a longer TTL (Anthropic, for example, has a 1-hour option at a higher write price).
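
To see how request spacing interacts with the expiry window, it can help to replay a list of request times offline. The sketch below is a simplified model, not provider behavior: count_cache_hits is a hypothetical helper, and it assumes a single shared prefix, a 300-second TTL, and that every request (write or hit) restarts the TTL.

```python
def count_cache_hits(timestamps, ttl_seconds=300):
    """Count how many requests would find a warm cache entry.

    timestamps: request times in seconds for requests sharing one prefix.
    Assumption: every request restarts the TTL, and a request arriving
    exactly at the TTL boundary still counts as a hit.
    """
    hits = 0
    last_seen = None
    for t in sorted(timestamps):
        if last_seen is not None and t - last_seen <= ttl_seconds:
            hits += 1
        last_seen = t
    return hits


print(count_cache_hits([0, 30, 60]))          # requests 30 s apart
print(count_cache_hits([0, 400, 800]))        # requests 400 s apart
print(count_cache_hits([0, 200, 400, 1000]))  # last request arrives after expiry
```

Feeding real request logs through a model like this gives a rough expected hit count to compare against the cache-read counts the API reports.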

Cost

Rates used in this lesson (stated as GPT-4o pricing as of April 2026; confirm against the current pricing page): input tokens cost $0.003 per 1K tokens, and cache-read tokens cost 90% less, $0.0003 per 1K tokens. If you cache a 10,000-token document and query it 100 times:

Without cache: 100 × 10,000 × $0.003/1K = $3.00
With cache: (1 × 10,000 × $0.003/1K for the first request) + (99 × 10,000 × $0.0003/1K) = $0.03 + $0.297 = $0.327
Savings: $2.67 per 100 queries (about 89%), before any cache-write surcharge

The math only works if your cache actually hits: verify cache_read_input_tokens > 0 before counting on savings.
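
The same arithmetic can be wrapped in a small function so it is easy to rerun with other document sizes or rates. This is a sketch under this lesson's pricing assumptions ($0.003 per 1K input tokens, cache reads at 10% of that); prefix_input_cost and its write_premium parameter are hypothetical names, and the 0.25 surcharge in the last call is an assumed value for illustration.

```python
def prefix_input_cost(n_requests, cached_tokens, input_rate_per_1k=0.003,
                      read_multiplier=0.1, write_premium=0.0, use_cache=True):
    """Input cost in USD of sending the same prefix n_requests times.

    Only the shared prefix is counted; per-request questions and output
    tokens are ignored. write_premium is the fractional surcharge on the
    first (cache-writing) request, e.g. 0.25 for 25%.
    """
    if n_requests <= 0:
        return 0.0
    full = cached_tokens * input_rate_per_1k / 1000
    if not use_cache:
        return n_requests * full
    first = full * (1 + write_premium)
    hits = (n_requests - 1) * full * read_multiplier
    return first + hits


without_cache = prefix_input_cost(100, 10_000, use_cache=False)
with_cache = prefix_input_cost(100, 10_000)
with_surcharge = prefix_input_cost(100, 10_000, write_premium=0.25)
print(f'without cache:  ${without_cache:.4f}')
print(f'with cache:     ${with_cache:.4f}')
print(f'with surcharge: ${with_surcharge:.4f}')
```

Even with an assumed 25% write surcharge, the total moves by less than a cent in this scenario, because the surcharge is paid once while the discount applies to 99 requests.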

Rate limits

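Prompt caching doesn't change your rate limits. Whether it frees up token-per-minute headroom depends on the provider: OpenAI documents that cached tokens still count toward TPM limits, while Anthropic excludes cache reads from input-token limits on most current models. Where cache reads are excluded, caching multiplies request capacity; otherwise the tenfold difference (100 cached requests ≈ 10 uncached requests) applies to input cost, not to rate-limit consumption.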

Common gotcha

The most common mistake: developers see cache_read_input_tokens = 0 on the second request and assume caching failed silently. Usually the prefix simply didn't match, because whitespace, punctuation, or even invisible characters changed in the cached content. The content must be identical byte for byte; even a trailing space breaks the cache.

Error recovery

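BadRequestError (InvalidRequestError before SDK 1.0): 'cache_control' not supported

The model or endpoint you're calling doesn't accept explicit cache_control markers. Switch to a model that supports caching, or remove cache_control if the endpoint caches automatically.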

cache_read_input_tokens is always 0

Your cached prefix is not matching byte-for-byte. Check for whitespace, newlines, or encoding differences. Use repr(text) to inspect exact characters.
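
Beyond eyeballing repr(text), a byte-level comparison can point at the first position where two prefixes diverge. The helper below is a hypothetical utility written for this lesson; it assumes both prompts are Python strings that are sent as UTF-8.

```python
def first_mismatch(a: str, b: str, context: int = 20):
    """Return None if a and b are byte-identical in UTF-8.

    Otherwise return (byte_index, snippet_a, snippet_b), where the snippets
    are the bytes around the first differing position.
    """
    bytes_a, bytes_b = a.encode('utf-8'), b.encode('utf-8')
    if bytes_a == bytes_b:
        return None
    shared = min(len(bytes_a), len(bytes_b))
    # If one string is a strict prefix of the other, the mismatch is at its end.
    index = next((i for i in range(shared) if bytes_a[i] != bytes_b[i]), shared)
    start = max(0, index - context)
    return index, bytes_a[start:index + context], bytes_b[start:index + context]


print(first_mismatch('Respond concisely.', 'Respond concisely.'))
print(first_mismatch('abc', 'abc '))      # trailing space
print(first_mismatch('a\nb', 'a\r\nb'))   # LF vs CRLF line ending
```

Run this on the prefix of a request that hit the cache and one that missed; a non-None result shows where the content stopped being identical.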

RateLimitError after enabling cache

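Caching lowers the cost per request, not the number of requests you send. If cheaper requests led you to send more of them, you can hit the requests-per-minute limit even though token usage looks fine. Check which limit the error names; if it is truly the request-rate limit (not TPM), add backoff or ask OpenAI for a higher limit.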

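Request rejected: cached context too large

Cached content still has to fit in the model's context window (128K tokens for GPT-4o), and an oversized prompt is rejected as an ordinary bad-request error rather than a cache-specific one. Split the context into smaller documents and cache each one separately, or trim the document before sending.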

Experienced dev note

Cache savings calculations have a hidden complexity: cache-write overhead. Your first request with a 10K cached prompt doesn't cost just the normal token price: it pays a surcharge for the write. Because both the surcharge and the per-hit saving scale with the number of cached tokens, the break-even point depends on the rates, not on the size of the cache: with a 25% write surcharge and a 90% read discount, the first cache hit already recovers the overhead. What size does change is how much you lose if the hit never comes (the entry expires or the prefix changes first), so estimate your hit rate before deploying. Cache writes may also add latency to the first request. Finally, caches are local to the provider's serving infrastructure; don't assume an entry is shared across regions or endpoints if you load-balance or retry elsewhere.
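
One way to sanity-check a break-even estimate is to compute it directly from the two rates. The sketch below assumes the write surcharge and the read discount both apply to the same cached tokens, so the token count cancels out; break_even_hits is a hypothetical helper, and the rate values in the calls are assumed examples rather than quoted prices.

```python
import math


def break_even_hits(write_premium: float, read_discount: float) -> int:
    """Smallest number of cache hits whose savings cover the write surcharge.

    write_premium: extra cost of the cache-writing request as a fraction of
                   the normal input cost (0.25 means 25% more).
    read_discount: fraction saved on each cache hit (0.9 means 90% off).
    """
    if read_discount <= 0:
        raise ValueError('read_discount must be positive for caching to pay off')
    if write_premium <= 0:
        return 0
    return math.ceil(write_premium / read_discount)


print(break_even_hits(0.25, 0.9))  # 25% surcharge, 90% discount
print(break_even_hits(1.0, 0.9))   # 100% surcharge, 90% discount
print(break_even_hits(0.0, 0.5))   # no surcharge, 50% discount
```

Under these assumptions the result does not depend on document size; what a larger document changes is the absolute amount at risk if the expected hits never arrive.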

Check your understanding

You have a RAG system that caches a 5000-token document, which is queried 1000 times per month with a different question appended each time. Without caching, each request costs $0.015 in input tokens. With caching, the first request costs $0.01875 (cache-write overhead included) and each subsequent cache hit costs $0.0015. What is your break-even point (the minimum number of requests needed to justify enabling the cache), and what are your monthly savings at 1000 queries?

Show answer hint

Break-even: divide the cache overhead by the per-hit savings and round up to get the number of hits needed. Overhead is (first_request_cost_with_cache - first_request_cost_without_cache). Per-hit savings is (normal_cost_for_5K_tokens × 0.9). Monthly savings = (999 × per_hit_savings) - overhead, assuming every later request hits the cache.

VERSION Prompt caching is a server-side feature; OpenAI announced it in October 2024, and it needs no SDK-specific setup. A recent SDK matters only for reading the cache-related usage fields as typed attributes. If you're on an older version, upgrade: pip install --upgrade openai.
