Beginner to Intermediate · 3 min read

How to use caching to reduce LLM costs

Quick answer
Cache and reuse LLM responses for repeated prompts so you never pay for the same completion twice. Put a cache layer in front of the API, keyed by a prompt hash (or an embedding for semantically similar prompts), and check it before each call so cached prompts skip the model entirely.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable to authenticate requests.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
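Next, export your API key as an environment variable so the client can pick it up. A minimal sketch for a POSIX shell, using a placeholder key:

```shell
# Replace the placeholder with your real key from the OpenAI dashboard.
export OPENAI_API_KEY="sk-your-key-here"
```

On Windows, use `setx OPENAI_API_KEY "sk-your-key-here"` in a command prompt instead.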

Step by step

This example demonstrates a simple in-memory cache using a Python dictionary keyed by prompt text. It checks the cache before calling the LLM API, returning cached results to save costs.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Exact-match cache: keys are raw prompt strings, so any change to the
# prompt (or to the model/parameters) counts as a miss.
cache = {}

def get_completion(prompt: str) -> str:
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss - calling LLM")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    cache[prompt] = text  # store the response for future identical prompts
    return text

if __name__ == "__main__":
    prompt = "Explain caching for LLM cost reduction"
    print(get_completion(prompt))
    print(get_completion(prompt))  # This call uses cache
output
Cache miss - calling LLM
Caching is a technique to store and reuse previous LLM responses, reducing API calls and costs.
Cache hit
Caching is a technique to store and reuse previous LLM responses, reducing API calls and costs.

Common variations

  • Use persistent caches like Redis or disk-based key-value stores for long-term caching.
  • Cache based on prompt embeddings or hashes to handle semantically similar prompts.
  • Implement async caching for non-blocking LLM calls.
  • Use different models or SDKs (Anthropic, Mistral) with the same caching logic.
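The first two variations can be combined: hash the model and prompt together into a stable key, then store responses in a disk-backed store that survives restarts. A minimal sketch using the standard library's shelve module; `call_llm` is a hypothetical callable wrapping whichever SDK you use:

```python
import hashlib
import json
import shelve

def cache_key(model: str, prompt: str) -> str:
    # Hash model + prompt together so different models never share entries.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(prompt: str, call_llm, model: str = "gpt-4o-mini",
                      path: str = "llm_cache") -> str:
    """Disk-backed cache; call_llm(prompt) -> str wraps your SDK of choice."""
    key = cache_key(model, prompt)
    with shelve.open(path) as db:
        if key in db:
            return db[key]          # hit: no API call, no cost
        text = call_llm(prompt)     # miss: pay for one completion
        db[key] = text
        return text
```

Swapping shelve for Redis only changes the storage calls; the keying logic stays the same, which is what lets the pattern carry across SDKs.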

Troubleshooting

  • If cached results seem outdated, implement cache invalidation policies (time-to-live or manual refresh).
  • Ensure cache keys uniquely represent prompt variations to avoid incorrect reuse.
  • Watch for memory leaks in in-memory caches by limiting size or using LRU caches.

Key Takeaways

  • Cache LLM responses keyed by prompt to avoid redundant API calls and reduce costs.
  • Use persistent or distributed caches for scalable, long-term caching beyond in-memory.
  • Implement cache invalidation to keep responses fresh and relevant.
  • Consider semantic caching with embeddings for similar prompt reuse.
  • Adapt caching strategies to your SDK and model choice for best integration.
Verified 2026-04 · gpt-4o-mini