Code Advanced hard · 8 min

When caching breaks things: non-deterministic outputs

What you will learn
LLM caching assumes identical inputs always produce identical outputs, but temperature, randomness, and model updates violate that assumption.

Why this matters

Production systems cache LLM responses to cut cost and latency. If you cache a non-deterministic output and the entry never expires, you lock in that one response: users keep getting the same stale or incorrect answer until the cache is cleared. This is a silent failure, not an error.

Skip caching when: (1) temperature > 0 and you need actual randomness on every call, (2) the model version changes and you need new behavior immediately, (3) you're fine-tuning or evaluating model drift, (4) the output depends on external state (current date, user data, API state) that changes between cache hits.

Explanation

What it is: LangChain's in-memory and Redis caches assume that the same input (prompt + model + parameters) always produces the same output. But LLM outputs depend on factors beyond the prompt: temperature, top_p sampling, model weights, and nondeterminism on the provider's side. Cache a response once and you'll serve the exact same text for as long as the entry lives, even if the model was updated or the behavior should be non-deterministic.

How it works mechanically: When you enable caching in LangChain (via InMemoryCache, RedisCache, or providers' built-in caching), the framework builds a cache key from the prompt text and the serialized model configuration (model name plus parameters such as temperature). On the second identical input, it skips the LLM call and returns the cached response. The problem: if your chain uses temperature=0.7 (non-deterministic), the cache returns the *same* random response from call 1, defeating the entire point of sampling diversity.
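
The lookup described above can be sketched without any LLM at all. This is a minimal toy, not LangChain's implementation: `ToyLLMCache`, `generate`, and `fake_sampler` are hypothetical names, and the random sampler only stands in for a model running at temperature > 0.

```python
import random


class ToyLLMCache:
    """Hypothetical stand-in for an LLM cache keyed on prompt + model settings."""

    def __init__(self):
        self._store = {}
        self.llm_calls = 0  # counts how often the "model" really ran

    def generate(self, prompt, model, temperature, sample_fn):
        # The key covers only the prompt and the model configuration.
        key = (prompt, model, temperature)
        if key in self._store:
            return self._store[key]  # cache hit: sample_fn is never called
        self.llm_calls += 1
        text = sample_fn(prompt)
        self._store[key] = text
        return text


def fake_sampler(prompt):
    # Stands in for a temperature > 0 model: a fresh random pick per real call.
    return random.choice(["Quantum", "Nexus", "Lumen", "Vertex"])


cache = ToyLLMCache()
first = cache.generate("Name a startup.", "gpt-4o-mini", 0.8, fake_sampler)
second = cache.generate("Name a startup.", "gpt-4o-mini", 0.8, fake_sampler)

print(first == second)  # True: the second call replays the first sample
print(cache.llm_calls)  # 1: the sampler ran only once
```

Changing any part of the key (for example the temperature) produces a miss, but repeating the same settings never does, which is exactly why the replay goes unnoticed.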

When to use it: Cache only deterministic chains: temperature=0, top_p=1.0, no sampling. Cache classification, extraction, or deterministic reasoning. Do not cache creative writing, brainstorming, or any chain where "different every time" is the feature.

Analogy

It's like recording a live radio interview once and replaying the exact same recording every time someone asks the same question. If the interview subject (model) changes their opinion later, or if the listener expects a different answer because of randomness, the replay breaks the contract.

Code

Illustrative only - not runnable without a valid API key
python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Set up caching globally: every LLM call in this process now goes through the cache
set_llm_cache(InMemoryCache())

prompt = ChatPromptTemplate.from_template(
    "Generate a creative name for a startup. Keep it one word only."
)

# Problem: caching with temperature > 0
llm_nondeterministic = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.8,
    api_key="sk-test"  # placeholder: replace with a real key
)

chain = prompt | llm_nondeterministic

print("Call 1 (will hit LLM):")
result1 = chain.invoke({})
print(f"Result 1: {result1.content}")

print("\nCall 2 (will hit cache: same input):")
result2 = chain.invoke({})
print(f"Result 2: {result2.content}")

print("\n--- Same response? (this is the problem) ---")
print(f"Identical: {result1.content == result2.content}")

print("\n--- Now with deterministic model ---")

# Solution: disable caching OR use temperature=0
# Passing None removes the global cache, so the calls below always reach the LLM
set_llm_cache(None)

llm_deterministic = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,  # Deterministic
    api_key="sk-test"
)

chain_safe = prompt | llm_deterministic

print("\nDeterministic Call 1:")
result3 = chain_safe.invoke({})
print(f"Result 3: {result3.content}")

print("\nDeterministic Call 2 (safe to cache):")
result4 = chain_safe.invoke({})
print(f"Result 4: {result4.content}")

print(f"\nIdentical (expected): {result3.content == result4.content}")
Output
Call 1 (will hit LLM):
Result 1: Quantum

Call 2 (will hit cache: same input):
Result 2: Quantum

--- Same response? (this is the problem) ---
Identical: True

--- Now with deterministic model ---

Deterministic Call 1:
Result 3: Nexus

Deterministic Call 2 (safe to cache):
Result 4: Nexus

Identical (expected): True

What just happened?

The first chain with temperature=0.8 hit the LLM once and got a random response ("Quantum"). When we called it again with identical input, the cache returned the exact same string instead of sampling a new random word. This violates the intent of temperature > 0. The second chain runs with the cache disabled and temperature=0.0, so both calls reach the LLM and still return "Nexus": at temperature 0 the output is effectively deterministic (providers do not strictly guarantee it), so caching it would not change what users see. The code demonstrates that caching is safe for deterministic models, but silently breaks non-deterministic ones without raising an error.

Common gotcha

Developers enable global caching with `set_llm_cache(InMemoryCache())` once in their app startup, then forget it exists. Later, they add a chain with `temperature=0.7` for creative output, run it twice, and get the same response both times. They assume the model is broken or they're calling the wrong LLM. They don't suspect the cache because there's no error, no warning, just wrong behavior. The cache is invisible.

Error recovery

Same output every call with temperature > 0
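Disable caching for that model by passing `cache=False` to its constructor (calling `set_llm_cache(None)` instead turns caching off globally, for every chain). The cache key already includes the temperature, but that only separates different settings: repeated calls with the same settings still share one entry. If some reuse is acceptable, use a TTL-based cache (for example `RedisCache` with `ttl=300`, 5 minutes) so a cached sample expires instead of being replayed indefinitely.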
Cache hit but model was updated
If your model version changes (e.g., gpt-4o → gpt-4o-2024-11-20), the cache key includes the model name, so it's a cache miss (correct behavior). But if you use a model alias like 'gpt-4' and OpenAI silently repoints it to a newer snapshot, the key is unchanged and the cache still returns the old model's output. Solution: pin exact model versions in production and invalidate the cache after model upgrades.
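
Both recovery steps (expiring entries and keying on a pinned model version) can be sketched in a few lines. This is an illustrative in-process toy, not the Redis-backed cache: `TTLCache` and `cache_key` are hypothetical names, and the injectable `clock` exists only so the expiry can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Hypothetical in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: the caller must make a fresh LLM call
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


def cache_key(prompt, pinned_model):
    # Keying on the pinned snapshot name turns a model upgrade into a cache miss.
    return (pinned_model, prompt)


# Demonstration with a fake clock so no real waiting is needed.
now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])

old_key = cache_key("Name a startup.", "gpt-4o-mini-2024-07-18")
cache.set(old_key, "Quantum")
print(cache.get(old_key))  # Quantum (still fresh)

now[0] = 301.0
print(cache.get(old_key))  # None (entry expired after 300 seconds)
```

A short TTL does not restore per-call randomness; it only bounds how long one sample can be replayed.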

Experienced dev note

The real trap is mixing two mental models: (1) LLMs are stateless APIs (true: same input = same output IF temperature=0), and (2) LLMs are probabilistic (true: temperature > 0 means different outputs). Caching assumes model 1. If you need model 2, caching silently breaks it. The fix: make caching an explicit opt-in per chain, not a global setting. Pass `cache=False` to the models behind chains that need randomness, or reserve caching for deterministic tasks only (classification, extraction, structured output). In production, this is why observability matters: log whether a response came from cache or LLM so you can detect when 'randomness' has become 'replay'.

Check your understanding

You have a multi-step chain: step 1 extracts entities from a user message (temperature=0), step 2 generates a creative follow-up question (temperature=0.8). Should you enable global caching for the whole chain? Why or why not?

Answer hint

A correct answer recognizes that you should not cache the whole chain, because step 2 requires non-determinism. Step 1 is safe to cache: deterministic extraction returns the same result for the same message, so a cache hit saves a call without changing behavior. Step 2 should bypass the cache. In practice, disable global caching and attach a cache only to the step 1 model.

VERSION In langchain < 0.3.0, caching was managed via `from langchain.cache import InMemoryCache` (deprecated). In langchain-core >= 0.3.0 (current), import `set_llm_cache` from `langchain_core.globals` and `InMemoryCache` from `langchain_core.caches`. The behavior is identical, but the import path changed: old code breaks at import time.
NEXT

How to implement per-prompt caching instead of global caching: selectively cache only the deterministic parts of your chain while letting non-deterministic steps run every time.
