How to use caching in LiteLLM
Quick answer
Use LiteLLM's built-in caching by assigning a Cache object to litellm.cache and passing caching=True on your completion calls, or wrap calls yourself with a caching decorator. Responses are stored keyed by the request (prompt, model, and parameters), so repeated identical queries skip the model call and return much faster.
PREREQUISITES
- Python 3.8+
- LiteLLM (pip install litellm)
Setup
Install LiteLLM via pip and import it. LiteLLM is a client library that routes requests to model providers, so whether an API key is required depends on the provider you use. The examples below assume a local Ollama server hosting llama3.2, which needs no key.
pip install litellm

Step by step
Enable caching by assigning a Cache object to litellm.cache, then pass caching=True on each completion call. Call the model with the same prompt twice to see caching in action.
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable LiteLLM's in-memory cache globally
litellm.cache = Cache()

prompt = "What is caching in LiteLLM?"
messages = [{"role": "user", "content": prompt}]

# First inference call (cache miss)
response1 = completion(model="ollama/llama3.2", messages=messages, caching=True)
print("First response:", response1.choices[0].message.content)

# Second call with the identical request (cache hit)
response2 = completion(model="ollama/llama3.2", messages=messages, caching=True)
print("Second response (cached):", response2.choices[0].message.content)

Output
First response: Caching in LiteLLM stores previous outputs to speed up repeated queries.
Second response (cached): Caching in LiteLLM stores previous outputs to speed up repeated queries.
Common variations
You can also implement manual caching with Python decorators such as functools.lru_cache when you want finer control. Caching works across the models LiteLLM supports, and because the request (including the model name) forms the cache key, the same prompt sent to different models such as llama3.1 or llama3.3 is cached separately. For asynchronous usage, call acompletion inside async functions and manage the cache accordingly.
from functools import lru_cache
from litellm import completion

@lru_cache(maxsize=32)
def cached_generate(prompt: str) -> str:
    response = completion(model="ollama/llama3.2",
                          messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

prompt = "Explain caching in LiteLLM."
print(cached_generate(prompt))
print(cached_generate(prompt))  # second call served from lru_cache

Output
The model's answer prints twice; the second call returns instantly from the in-process lru_cache without contacting the model.
Troubleshooting
- If caching does not seem to work, ensure the prompt, model, and call parameters are exactly the same, including whitespace in the prompt.
- Reset LiteLLM's in-memory cache by reassigning litellm.cache = Cache() (or disable caching with litellm.cache = None); lru_cache-decorated functions expose cache_clear().
- For large models, caching may increase memory usage; monitor resource consumption accordingly.
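For the decorator approach, functools provides cache_info() and cache_clear() to inspect and reset the cache without restarting the session. The sketch below uses fake_generate, a hypothetical stand-in for a model call, so it runs offline:

```python
from functools import lru_cache

calls = {"count": 0}  # track how often the underlying function really runs

@lru_cache(maxsize=32)
def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call
    calls["count"] += 1
    return f"answer to: {prompt}"

fake_generate("hi")
fake_generate("hi")                # cache hit: underlying function not re-run
print(fake_generate.cache_info())  # shows 1 hit, 1 miss
fake_generate.cache_clear()        # empty the cache (also resets statistics)
fake_generate("hi")                # cache miss again after clearing
print(calls["count"])              # 2 real calls in total
```

Note that cache_clear() resets the hit/miss statistics along with the stored entries, so cache_info() reports fresh counts afterwards.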
Key Takeaways
- Enable caching in LiteLLM by setting litellm.cache = Cache() and passing caching=True on completion calls.
- Repeated identical prompts return cached results instantly, reducing latency.
- Manual caching with decorators like lru_cache offers flexible control.
- Caching increases memory usage; monitor your environment accordingly.
- Ensure prompt strings match exactly to hit the cache.