How-to · Intermediate · 3 min read

Caching strategies for AI products

Quick answer
Use caching in AI products to store and reuse expensive LLM outputs or embedding vectors, reducing latency and API costs. Common strategies include response caching, embedding caching, and hybrid caches with expiration to balance freshness and performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)
  • Basic understanding of caching concepts

Setup

Install the openai Python SDK and set your API key as an environment variable to interact with LLMs. Use a caching backend such as Redis, or an in-memory dictionary for demonstration.

bash
pip install openai redis
output
Collecting openai
Collecting redis
Successfully installed openai redis

Step by step

This example demonstrates caching LLM chat completions using an in-memory dictionary cache to avoid repeated API calls for the same prompt.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache
cache = {}

def get_cached_response(prompt: str) -> str:
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss - calling API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    cache[prompt] = text
    return text

if __name__ == "__main__":
    prompt = "Explain caching strategies for AI products"
    print(get_cached_response(prompt))
    print(get_cached_response(prompt))
output
Cache miss - calling API
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
Cache hit
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
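The dictionary cache above never expires entries, so stale answers live forever. A time-based variant can be sketched as follows; the `TTLCache` class name and the TTL values here are illustrative choices, not part of the OpenAI SDK.

```python
import time

class TTLCache:
    """Dictionary-backed cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl=0.5)
cache.set("prompt", "cached answer")
print(cache.get("prompt"))  # fresh entry is returned
time.sleep(0.6)
print(cache.get("prompt"))  # expired entry returns None
```

Swapping the bare dictionary in `get_cached_response` for an instance like this adds expiration without changing the API-call logic.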

Common variations

Variations include using persistent caches like Redis or Memcached for distributed systems, caching embedding vectors to speed up similarity search, and implementing cache expiration or invalidation to maintain freshness.

  • Async caching with asyncio and async Redis clients
  • Streaming LLM responses with partial caching
  • Using different models like gpt-4o-mini or claude-3-5-sonnet-20241022
python
import os
import asyncio
import redis.asyncio as redis_async  # the redis package's asyncio API replaces the deprecated aioredis
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_cached_response_async(prompt: str) -> str:
    r = redis_async.from_url("redis://localhost")  # from_url returns a client directly; no await needed
    cached = await r.get(prompt)
    if cached:
        print("Async cache hit")
        return cached.decode()
    print("Async cache miss - calling API")
    # AsyncOpenAI uses the same create method as the sync client, awaited
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    await r.set(prompt, text, ex=3600)  # Cache expires in 1 hour
    return text

async def main():
    prompt = "Explain caching strategies for AI products"
    print(await get_cached_response_async(prompt))
    print(await get_cached_response_async(prompt))

if __name__ == "__main__":
    asyncio.run(main())
output
Async cache miss - calling API
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.
Async cache hit
Caching strategies for AI products involve storing and reusing expensive LLM outputs or embeddings to reduce latency and cost. Common methods include response caching, embedding caching, and hybrid caches with expiration.

Troubleshooting

  • If you see stale or outdated responses, implement cache expiration or manual invalidation.
  • Cache size can grow large; use eviction policies or persistent stores like Redis.
  • For embeddings, ensure consistent input preprocessing to avoid cache misses.
  • Check API rate limits to avoid unexpected failures during cache misses.
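For the unbounded-growth issue above, a bounded in-process cache can evict least-recently-used entries once it is full. A minimal sketch using `OrderedDict` (the `LRUCache` name and maxsize default are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, maxsize: int = 256):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def set(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # drop the oldest entry

cache = LRUCache(maxsize=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a" so it stays recent
cache.set("c", 3)      # evicts "b", the least recently used
print(cache.get("b"))  # None
```

In production, Redis offers the same behavior server-side via its maxmemory eviction policies, without holding the cache in application memory.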

Key Takeaways

  • Cache LLM outputs and embeddings to reduce latency and API costs effectively.
  • Use persistent caches like Redis for scalable AI product deployments.
  • Implement cache expiration to balance freshness and performance.
  • Async caching enables efficient handling of concurrent AI requests.
  • Consistent input preprocessing is critical for cache hit accuracy.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022