How to use caching to reduce LLM costs
Quick answer
Use caching to store and reuse LLM responses for repeated or similar prompts, reducing redundant API calls and lowering costs. Implement a cache layer keyed by prompt hashes or embeddings so previous outputs can be retrieved without invoking the model again.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable to authenticate requests.
```shell
pip install openai
```

Output:

```
Collecting openai
Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example demonstrates a simple in-memory cache using a Python dictionary keyed by prompt text. It checks the cache before calling the LLM API, returning cached results to save costs.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache: exact prompt text -> cached response
cache = {}

def get_completion(prompt: str) -> str:
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss - calling LLM")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    cache[prompt] = text  # store the response for future reuse
    return text

if __name__ == "__main__":
    prompt = "Explain caching for LLM cost reduction"
    print(get_completion(prompt))  # first call hits the API
    print(get_completion(prompt))  # second call is served from the cache
```

Output:

```
Cache miss - calling LLM
Caching is a technique to store and reuse previous LLM responses, reducing API calls and costs.
Cache hit
Caching is a technique to store and reuse previous LLM responses, reducing API calls and costs.
```
Common variations
- Use persistent caches like Redis or disk-based key-value stores for long-term caching.
- Cache based on prompt embeddings or hashes to handle semantically similar prompts.
- Implement async caching for non-blocking LLM calls.
- Use different models or SDKs (Anthropic, Mistral) with the same caching logic.
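The hash-based variation above can be sketched with only the standard library: build a deterministic cache key from the model name plus a normalized prompt, so entries survive restarts when stored on disk. The `llm_cache.json` path, the whitespace/case normalization, and the helper names are assumptions for this example, not part of any SDK:

```python
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("llm_cache.json")  # hypothetical on-disk cache location

def cache_key(model: str, prompt: str) -> str:
    # Normalize whitespace and case so trivially different prompts share a key.
    # (Assumption: case differences should not miss the cache for your use case.)
    normalized = " ".join(prompt.split()).lower()
    payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def load_cache() -> dict:
    # Read the persisted cache, or start empty on first run.
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    return {}

def save_cache(cache: dict) -> None:
    # Persist the whole cache; fine for small caches, use Redis at scale.
    CACHE_PATH.write_text(json.dumps(cache))
```

Swapping this for Redis mostly means replacing `load_cache`/`save_cache` with per-key GET/SET calls against the same hashed keys.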
Troubleshooting
- If cached results seem outdated, implement cache invalidation policies (time-to-live or manual refresh).
- Ensure cache keys uniquely represent prompt variations to avoid incorrect reuse.
- Prevent unbounded memory growth in in-memory caches by limiting size or using an LRU eviction policy.
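The TTL and size-limit points above can be combined in one small structure. This is an illustrative sketch using only the standard library; the class name, defaults, and eviction details are choices made for the example:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny LRU cache with a per-entry time-to-live (illustrative sketch)."""

    def __init__(self, max_size: int = 128, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        # Maps key -> (insertion timestamp, value); order tracks recency.
        self._store: "OrderedDict" = OrderedDict()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: invalidate stale response
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

Replacing the plain `cache = {}` dictionary with an instance of this class caps memory use and keeps responses fresh without any external dependency.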
Key takeaways
- Cache LLM responses keyed by prompt to avoid redundant API calls and reduce costs.
- Use persistent or distributed caches for scalable, long-term caching beyond in-memory.
- Implement cache invalidation to keep responses fresh and relevant.
- Consider semantic caching with embeddings for similar prompt reuse.
- Adapt caching strategies to your SDK and model choice for best integration.
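To illustrate the semantic-caching takeaway: instead of exact-match keys, store an embedding per prompt and reuse a response when a new prompt's embedding is close enough. The `embed_fn` argument stands in for a real embeddings call, and the 0.9 cosine threshold is an arbitrary example value you would tune:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Reuse a stored response when a new prompt's embedding is similar enough."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn    # e.g. a call to an embeddings API
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, prompt: str):
        query = self.embed_fn(prompt)
        best, best_sim = None, 0.0
        for emb, response in self.entries:  # linear scan; use a vector DB at scale
            sim = cosine(query, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def set(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed_fn(prompt), response))
```

The linear scan is fine for small caches; at scale you would back the same interface with a vector store.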