When caching breaks things: non-deterministic outputs
Why this matters
Production systems cache LLM responses to cut cost and latency. If you cache a model's non-deterministic output and the entry never expires, you silently lock in a stale or incorrect response: users keep getting it until the cache is cleared. Nothing raises an error, so the failure is silent.
Explanation
What it is: LangChain's in-memory and Redis caching assumes the same input (prompt + model + parameters) always produces the same output. That holds only approximately. Settings such as temperature and top_p are part of the input, but the random sample they produce is not, and neither are changes on the provider's side, such as updated model weights behind the same model name. Cache a response once and you serve the exact same text until the entry is evicted, even if the model has been updated or the output was supposed to vary.
How it works mechanically: When you enable caching in LangChain (via InMemoryCache, RedisCache, or another cache backend), the framework builds a cache key from the prompt text plus a string describing the model and its configuration (model name, temperature, and so on). On the first call it stores the response under that key. On the next identical input, it skips the LLM call and returns the stored response. The problem: if your chain uses temperature=0.7 (non-deterministic), the cache returns the *same* sampled response from call 1, defeating the point of sampling diversity.
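The lookup described above can be sketched in plain Python. This is a toy model of the idea, not LangChain's implementation: `ToyCache`, `fake_llm`, and `cached_call` are hypothetical names, and the "model" is a stub that picks a random name when temperature > 0.

```python
import random


class ToyCache:
    """Minimal stand-in for an LLM cache: exact-match lookup on the request."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(prompt, model, **params):
        # The key captures the request only. The random sample the
        # model draws is not part of it.
        return (prompt, model, tuple(sorted(params.items())))

    def lookup(self, key):
        return self._store.get(key)

    def update(self, key, value):
        self._store[key] = value


calls = {"count": 0}  # counts how often the stub model is really called


def fake_llm(prompt, temperature):
    """Hypothetical model stub: random name when temperature > 0."""
    calls["count"] += 1
    names = ["Quantum", "Nexus", "Lumen", "Orbit"]
    return random.choice(names) if temperature > 0 else names[0]


def cached_call(cache, prompt, model, temperature):
    key = cache.make_key(prompt, model, temperature=temperature)
    hit = cache.lookup(key)
    if hit is not None:
        return hit  # replayed: no new sample is drawn
    result = fake_llm(prompt, temperature)
    cache.update(key, result)
    return result


cache = ToyCache()
first = cached_call(cache, "Name a startup.", "toy-model", 0.8)
second = cached_call(cache, "Name a startup.", "toy-model", 0.8)
third = cached_call(cache, "Name a startup.", "toy-model", 0.9)

print(first == second)  # True: the second call replays the first sample
print(calls["count"])   # 2: only the first and third calls reached the stub
```

Changing the temperature produces a different key and therefore a fresh call, but repeating the same request never does, no matter how random the model is supposed to be.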
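When to use it: Cache only chains whose output is meant to be repeatable: temperature=0, so that sampling is effectively switched off. Good candidates are classification, extraction, and deterministic reasoning. Even at temperature=0 a hosted model is not guaranteed to be bit-for-bit deterministic, but repeats are close enough that replaying one is harmless. Do not cache creative writing, brainstorming, or any chain where "different every time" is the feature.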
Analogy
It's like recording a live radio interview once and replaying that recording every time someone asks the same question. If the interviewee (the model) later changes their opinion, or the listener expects a fresh answer each time, the replay breaks the contract.
Code
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Set up caching globally: every model in this process now uses the cache
set_llm_cache(InMemoryCache())

prompt = ChatPromptTemplate.from_template(
    "Generate a creative name for a startup. Keep it one word only."
)

# Problem: caching with temperature > 0
llm_nondeterministic = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.8,
    api_key="sk-test",  # placeholder: supply a real key to run this
)
chain = prompt | llm_nondeterministic

print("Call 1 (will hit LLM):")
result1 = chain.invoke({})
print(f"Result 1: {result1.content}")

print("\nCall 2 (will hit cache: same input):")
result2 = chain.invoke({})
print(f"Result 2: {result2.content}")

print("\n--- Same response? (this is the problem) ---")
print(f"Identical: {result1.content == result2.content}")

print("\n--- Now with deterministic model ---")
# Solution: disable caching OR use temperature=0.
# The cache is switched off here, so both calls below reach the model.
set_llm_cache(None)
llm_deterministic = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,  # Deterministic in practice
    api_key="sk-test",  # placeholder: supply a real key to run this
)
chain_safe = prompt | llm_deterministic

print("\nDeterministic Call 1:")
result3 = chain_safe.invoke({})
print(f"Result 3: {result3.content}")

# The output repeats without a cache, so caching this chain would be safe
print("\nDeterministic Call 2 (safe to cache):")
result4 = chain_safe.invoke({})
print(f"Result 4: {result4.content}")
print(f"\nIdentical (expected): {result3.content == result4.content}")

Example output:
Call 1 (will hit LLM):
Result 1: Quantum

Call 2 (will hit cache: same input):
Result 2: Quantum

--- Same response? (this is the problem) ---
Identical: True

--- Now with deterministic model ---

Deterministic Call 1:
Result 3: Nexus

Deterministic Call 2 (safe to cache):
Result 4: Nexus

Identical (expected): True
What just happened?
The first chain, with temperature=0.8, hit the LLM once and got a random response ("Quantum"). When we called it again with identical input, the cache returned the exact same string instead of sampling a new word. This violates the intent of temperature > 0. The second chain uses temperature=0.0 and runs with the cache switched off: both calls reach the model and still return "Nexus", which is why caching that chain would be safe. The two "Identical: True" lines look the same but mean different things. The second reflects a repeatable model. The first is a replay, and nothing in the output or logs tells you so.
Common gotcha
Developers enable global caching with `set_llm_cache(InMemoryCache())` (imported from `langchain_core.globals`) once at app startup, then forget it exists. Later, they add a chain with `temperature=0.7` for creative output, run it twice, and get the same response both times. They assume the model is broken or that they're calling the wrong LLM. They don't suspect the cache because there's no error and no warning, just wrong behavior. The cache is invisible.
Error recovery
Same output every call with temperature > 0
Cache hit but model was updated
Experienced dev note
The real trap is mixing two mental models: (1) LLMs are pure functions (same input = same output, which is roughly true only at temperature=0), and (2) LLMs are probabilistic (temperature > 0 means different outputs). Caching assumes model 1. If you need model 2, caching silently breaks it. The fix: make caching an explicit opt-in per model, not a global setting. LangChain chat models accept a `cache` argument, so pass `cache=False` to the models that need randomness, or reserve caching for deterministic tasks only (classification, extraction, structured output). In production, this is why observability matters: log whether a response came from the cache or the LLM so you can detect when "randomness" has become "replay".
Check your understanding
You have a multi-step chain: step 1 extracts entities from a user message (temperature=0), step 2 generates a creative follow-up question (temperature=0.8). Should you enable global caching for the whole chain? Why or why not?
Answer hint
A correct answer recognizes that you cannot cache the whole chain because step 2 requires non-determinism. You can cache step 1 alone (deterministic extraction is repeatable, so a replayed result is still correct), but step 2 should bypass the cache. In practice, either keep the global cache and exclude step 2's model from it, or disable global caching and cache only step 1 by wrapping it separately.
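One way to picture the answer is the sketch below, which uses stub functions in place of the two model calls. The names `extract_entities`, `follow_up_question`, and `pipeline` are hypothetical, and a counter imitates the fresh sample a temperature=0.8 call would produce so the example stays reproducible.

```python
from functools import lru_cache
import itertools

_variant = itertools.count(1)   # imitates a new sample on every call
extract_calls = {"n": 0}        # counts real executions of step 1


@lru_cache(maxsize=1024)
def extract_entities(message: str) -> tuple:
    """Step 1 stub (temperature=0): repeatable, so memoizing is safe."""
    extract_calls["n"] += 1
    return tuple(word for word in message.split() if word.istitle())


def follow_up_question(entities: tuple) -> str:
    """Step 2 stub (temperature=0.8): deliberately not cached."""
    return f"Variant {next(_variant)}: tell me more about {', '.join(entities)}?"


def pipeline(message: str) -> str:
    return follow_up_question(extract_entities(message))


a = pipeline("Alice flew to Paris yesterday")
b = pipeline("Alice flew to Paris yesterday")

print(extract_calls["n"])  # 1: step 1 ran once, then was served from cache
print(a != b)              # True: step 2 produced a new question each time
```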