How-to · Beginner · 3 min read

How to cache LLM responses to reduce costs

Quick answer
Use a caching layer to store and reuse LLM API responses so identical requests are never sent twice. Key each entry by a hash of the input and store it in an in-memory dictionary, a Redis instance, or a local file to cut costs and improve response times.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version specifier so the shell does not treat > as redirection)
  • pip install redis (optional for Redis caching)

Setup

Install the required Python packages and set your environment variable for the OpenAI API key.

  • Install OpenAI SDK: pip install "openai>=1.0"
  • Optionally install Redis client: pip install redis
  • Set your API key in the environment: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
bash
pip install openai redis

Step by step

This example demonstrates caching LLM responses in a simple Python dictionary keyed by a hash of the prompt. This avoids repeated API calls for the same input, reducing costs.

python
import os
import hashlib
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache dictionary
cache = {}

def get_cache_key(prompt: str) -> str:
    # Create a hash key from the prompt string
    return hashlib.sha256(prompt.encode('utf-8')).hexdigest()

def get_llm_response(prompt: str) -> str:
    key = get_cache_key(prompt)
    if key in cache:
        print("Cache hit")
        return cache[key]
    print("Cache miss - calling LLM API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text

# Example usage
prompt_text = "Explain caching LLM responses to reduce costs."
result = get_llm_response(prompt_text)
print("LLM response:\n", result)

# Calling again to demonstrate cache hit
result2 = get_llm_response(prompt_text)
print("Cached response:\n", result2)
output
Cache miss - calling LLM API
LLM response:
 <explanation text>
Cache hit
Cached response:
 <same explanation text>

Common variations

You can extend caching with persistent stores like Redis or SQLite so entries survive across sessions and can be shared between processes. Async clients can be cached the same way: await the response, then store it. Responses from other models (e.g., claude-3-5-sonnet-20241022) work with the same keying strategy; if you query more than one model, include the model name in the key so different models never share entries.

python
import os
import hashlib
import redis
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Connect to Redis (make sure Redis server is running locally or remotely)
r = redis.Redis(host='localhost', port=6379, db=0)

def get_cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode('utf-8')).hexdigest()

def get_llm_response_redis(prompt: str) -> str:
    key = get_cache_key(prompt)
    cached = r.get(key)
    if cached:
        print("Redis cache hit")
        return cached.decode('utf-8')
    print("Redis cache miss - calling LLM API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    r.set(key, text, ex=86400)  # optional TTL: entry expires after 24 hours
    return text

# Example usage
prompt_text = "Explain caching LLM responses to reduce costs."
print(get_llm_response_redis(prompt_text))
output
Redis cache miss - calling LLM API
<explanation text>
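As a sketch of the SQLite variation, here is a small persistent cache (the SQLiteCache class, its table name, and the injected call_api parameter are illustrative assumptions, not a library API). It also folds the model name into the key, so responses from different models never collide:

```python
import hashlib
import sqlite3


def get_cache_key(model: str, prompt: str) -> str:
    # Include the model in the key so different models get separate entries
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class SQLiteCache:
    """Persistent cache backed by a local SQLite file."""

    def __init__(self, path: str = "llm_cache.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM llm_cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str):
        self.conn.execute(
            "INSERT OR REPLACE INTO llm_cache (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.conn.commit()


def get_llm_response_sqlite(prompt: str, model: str, cache: SQLiteCache, call_api) -> str:
    # call_api is whatever function actually hits the LLM endpoint;
    # injecting it keeps the caching logic testable without network access.
    key = get_cache_key(model, prompt)
    cached = cache.get(key)
    if cached is not None:
        return cached
    text = call_api(model, prompt)
    cache.set(key, text)
    return text
```

Because the cache lives in a file rather than process memory, entries survive restarts; pass ":memory:" as the path if you only want the dictionary-style behavior.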

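The async variation can be sketched as below. The call_api parameter stands in for an async client call (e.g., wrapping AsyncOpenAI's chat.completions.create); the per-key lock is an extra guard, so concurrent requests for the same prompt trigger only one API call instead of racing past the cache check together:

```python
import asyncio
import hashlib

_cache = {}   # key -> response text
_locks = {}   # key -> asyncio.Lock, one per distinct prompt


async def get_llm_response_async(prompt: str, call_api) -> str:
    # call_api is an async function taking the prompt and returning the text;
    # it is injected here so the caching logic can run without network access.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    lock = _locks.setdefault(key, asyncio.Lock())
    async with lock:
        # Re-check inside the lock: a concurrent task may have filled the
        # cache while we were waiting to acquire it.
        if key in _cache:
            return _cache[key]
        text = await call_api(prompt)
        _cache[key] = text
        return text
```

Without the lock, two simultaneous requests for the same prompt would both see a cache miss and both pay for an API call.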
Troubleshooting

  • If you see repeated API calls despite caching, verify your cache key generation logic is consistent.
  • For Redis caching, ensure the Redis server is running and accessible.
  • Watch out for cache size growth; implement eviction policies or TTL (time-to-live) for cache entries.
  • Check API rate limits if you get errors unrelated to caching.
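The TTL point above can be sketched with a small in-memory wrapper (the TTLCache name and its methods are illustrative, not a library API); with Redis, the same effect comes from passing ex= to r.set:

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: evict it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Expired entries are evicted lazily on read, which keeps the implementation simple; a background sweep would be needed if memory from never-reread keys becomes a concern.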

Key Takeaways

  • Cache LLM responses keyed by a hash of the prompt to avoid redundant API calls.
  • Use persistent caches like Redis for cross-session and multi-user caching.
  • Implement cache eviction or TTL to manage storage and keep responses fresh.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022