How-to · Intermediate · 3 min read

Caching strategies for AI products

Quick answer
Use caching in AI products to store and reuse expensive LLM outputs or embedding vectors, reducing latency and API costs. Common strategies include response caching, embedding caching, and hybrid caches with expiration to balance freshness and performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)
  • Basic understanding of caching concepts

Setup

Install the openai Python SDK and set your API key as an environment variable to interact with LLMs. Use a caching backend such as Redis, or an in-memory dictionary for demonstration.

bash
pip install openai redis
output
Collecting openai
Collecting redis
Successfully installed openai redis

Step by step

This example demonstrates caching LLM chat completions using an in-memory dictionary cache to avoid repeated API calls for the same prompt.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache
cache = {}

def get_cached_response(prompt: str) -> str:
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss - calling API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    cache[prompt] = text
    return text

if __name__ == "__main__":
    prompt = "Explain caching strategies for AI products"
    print(get_cached_response(prompt))
    print(get_cached_response(prompt))
output
Cache miss - calling API
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
Cache hit
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
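The dictionary cache above never expires entries, so stale answers live forever. A time-based variant can be sketched as follows; the `TTLCache` class name and the TTL values here are illustrative choices, not part of the OpenAI SDK.

```python
import time

class TTLCache:
    """Dictionary-backed cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl=0.5)
cache.set("prompt", "cached answer")
print(cache.get("prompt"))  # fresh entry is returned
time.sleep(0.6)
print(cache.get("prompt"))  # expired entry returns None
```

Swapping the bare dictionary in `get_cached_response` for an instance like this adds expiration without changing the API-call logic.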

Common variations

Variations include using persistent caches like Redis or Memcached for distributed systems, caching embedding vectors to speed up similarity search, and implementing cache expiration or invalidation to maintain freshness.

  • Async caching with asyncio and async Redis clients
  • Streaming LLM responses with partial caching
  • Using different models like gpt-4o-mini or claude-3-5-sonnet-20241022
python
import os
import asyncio
import redis.asyncio as redis_async  # the redis package's asyncio API replaces the deprecated aioredis
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_cached_response_async(prompt: str) -> str:
    r = redis_async.from_url("redis://localhost")  # from_url returns a client directly; no await needed
    cached = await r.get(prompt)
    if cached:
        print("Async cache hit")
        return cached.decode()
    print("Async cache miss - calling API")
    # AsyncOpenAI uses the same create method as the sync client, awaited
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    await r.set(prompt, text, ex=3600)  # Cache expires in 1 hour
    return text

async def main():
    prompt = "Explain caching strategies for AI products"
    print(await get_cached_response_async(prompt))
    print(await get_cached_response_async(prompt))

if __name__ == "__main__":
    asyncio.run(main())
output
Async cache miss - calling API
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
Async cache hit
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.

Troubleshooting

  • If you see stale or outdated responses, implement cache expiration or manual invalidation.
  • Cache size can grow large; use eviction policies or persistent stores like Redis.
  • For embeddings, ensure consistent input preprocessing to avoid cache misses.
  • Check API rate limits to avoid unexpected failures during cache misses.
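For the unbounded-growth issue above, a bounded in-process cache can evict least-recently-used entries once it is full. A minimal sketch using `OrderedDict` (the `LRUCache` name and maxsize default are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, maxsize: int = 256):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def set(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # drop the oldest entry

cache = LRUCache(maxsize=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a" so it stays recent
cache.set("c", 3)      # evicts "b", the least recently used
print(cache.get("b"))  # None
```

In production, Redis offers the same behavior server-side via its maxmemory eviction policies, without holding the cache in application memory.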

Key Takeaways

  • Cache LLM outputs and embeddings to reduce latency and API costs effectively.
  • Use persistent caches like Redis for scalable AI product deployments.
  • Implement cache expiration to balance freshness and performance.
  • Async caching enables efficient handling of concurrent AI requests.
  • Consistent input preprocessing is critical for cache hit accuracy.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022