How-to · Beginner · 3 min read

How to cache LLM responses to reduce costs

Quick answer
Use a caching layer to store and reuse LLM API responses so identical requests are never sent twice. Key each entry by a hash of the input and store it in an in-memory dictionary, a Redis instance, or a local file to cut costs and improve response times.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version specifier so the shell does not treat > as redirection)
  • pip install redis (optional for Redis caching)

Setup

Install the required Python packages and set your environment variable for the OpenAI API key.

  • Install OpenAI SDK: pip install "openai>=1.0"
  • Optionally install Redis client: pip install redis
  • Set your API key in the environment: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
bash
pip install openai redis

Step by step

This example demonstrates caching LLM responses in a simple Python dictionary keyed by a hash of the prompt. This avoids repeated API calls for the same input, reducing costs.

python
import os
import hashlib
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache dictionary
cache = {}

def get_cache_key(prompt: str) -> str:
    # Create a hash key from the prompt string
    return hashlib.sha256(prompt.encode('utf-8')).hexdigest()

def get_llm_response(prompt: str) -> str:
    key = get_cache_key(prompt)
    if key in cache:
        print("Cache hit")
        return cache[key]
    print("Cache miss - calling LLM API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text

# Example usage
prompt_text = "Explain caching LLM responses to reduce costs."
result = get_llm_response(prompt_text)
print("LLM response:\n", result)

# Calling again to demonstrate cache hit
result2 = get_llm_response(prompt_text)
print("Cached response:\n", result2)
output
Cache miss - calling LLM API
LLM response:
 <explanation text>
Cache hit
Cached response:
 <same explanation text>

Common variations

You can extend caching with persistent stores like Redis or SQLite so entries survive across sessions and can be shared between processes. Async clients can be cached the same way: await the response, then store it. Responses from other models (e.g., claude-3-5-sonnet-20241022) work with the same keying strategy; if you query more than one model, include the model name in the key so different models never share entries.

python
import os
import hashlib
import redis
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Connect to Redis (make sure Redis server is running locally or remotely)
r = redis.Redis(host='localhost', port=6379, db=0)

def get_cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode('utf-8')).hexdigest()

def get_llm_response_redis(prompt: str) -> str:
    key = get_cache_key(prompt)
    cached = r.get(key)
    if cached:
        print("Redis cache hit")
        return cached.decode('utf-8')
    print("Redis cache miss - calling LLM API")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    r.set(key, text, ex=86400)  # optional TTL: entry expires after 24 hours
    return text

# Example usage
prompt_text = "Explain caching LLM responses to reduce costs."
print(get_llm_response_redis(prompt_text))
output
Redis cache miss - calling LLM API
<explanation text>
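As a sketch of the SQLite variation, here is a small persistent cache (the SQLiteCache class, its table name, and the injected call_api parameter are illustrative assumptions, not a library API). It also folds the model name into the key, so responses from different models never collide:

```python
import hashlib
import sqlite3


def get_cache_key(model: str, prompt: str) -> str:
    # Include the model in the key so different models get separate entries
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class SQLiteCache:
    """Persistent cache backed by a local SQLite file."""

    def __init__(self, path: str = "llm_cache.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM llm_cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str):
        self.conn.execute(
            "INSERT OR REPLACE INTO llm_cache (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.conn.commit()


def get_llm_response_sqlite(prompt: str, model: str, cache: SQLiteCache, call_api) -> str:
    # call_api is whatever function actually hits the LLM endpoint;
    # injecting it keeps the caching logic testable without network access.
    key = get_cache_key(model, prompt)
    cached = cache.get(key)
    if cached is not None:
        return cached
    text = call_api(model, prompt)
    cache.set(key, text)
    return text
```

Because the cache lives in a file rather than process memory, entries survive restarts; pass ":memory:" as the path if you only want the dictionary-style behavior.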

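The async variation can be sketched as below. The call_api parameter stands in for an async client call (e.g., wrapping AsyncOpenAI's chat.completions.create); the per-key lock is an extra guard, so concurrent requests for the same prompt trigger only one API call instead of racing past the cache check together:

```python
import asyncio
import hashlib

_cache = {}   # key -> response text
_locks = {}   # key -> asyncio.Lock, one per distinct prompt


async def get_llm_response_async(prompt: str, call_api) -> str:
    # call_api is an async function taking the prompt and returning the text;
    # it is injected here so the caching logic can run without network access.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    lock = _locks.setdefault(key, asyncio.Lock())
    async with lock:
        # Re-check inside the lock: a concurrent task may have filled the
        # cache while we were waiting to acquire it.
        if key in _cache:
            return _cache[key]
        text = await call_api(prompt)
        _cache[key] = text
        return text
```

Without the lock, two simultaneous requests for the same prompt would both see a cache miss and both pay for an API call.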
Troubleshooting

  • If you see repeated API calls despite caching, verify your cache key generation logic is consistent.
  • For Redis caching, ensure the Redis server is running and accessible.
  • Watch out for cache size growth; implement eviction policies or TTL (time-to-live) for cache entries.
  • Check API rate limits if you get errors unrelated to caching.
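The TTL point above can be sketched with a small in-memory wrapper (the TTLCache name and its methods are illustrative, not a library API); with Redis, the same effect comes from passing ex= to r.set:

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: evict it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Expired entries are evicted lazily on read, which keeps the implementation simple; a background sweep would be needed if memory from never-reread keys becomes a concern.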

Key Takeaways

  • Cache LLM responses keyed by a hash of the prompt to avoid redundant API calls.
  • Use persistent caches like Redis for cross-session and multi-user caching.
  • Implement cache eviction or TTL to manage storage and keep responses fresh.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022