How to use caching in LiteLLM
Quick answer
Use LiteLLM's built-in caching by assigning a Cache object to litellm.cache and passing caching=True on your completion calls, or wrap calls yourself with a caching decorator. Responses are stored keyed by the request (prompt, model, and parameters), so repeated identical queries skip the model call and return much faster.
PREREQUISITES
- Python 3.8+
- LiteLLM (pip install litellm)
Setup
Install LiteLLM via pip and import it. LiteLLM is a client library that routes requests to model providers, so whether an API key is required depends on the provider you use. The examples below assume a local Ollama server hosting llama3.2, which needs no key.
pip install litellm

Step by step
Enable caching by assigning a Cache object to litellm.cache, then pass caching=True on each completion call. Call the model with the same prompt twice to see caching in action.
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable LiteLLM's in-memory cache globally
litellm.cache = Cache()

prompt = "What is caching in LiteLLM?"
messages = [{"role": "user", "content": prompt}]

# First inference call (cache miss)
response1 = completion(model="ollama/llama3.2", messages=messages, caching=True)
print("First response:", response1.choices[0].message.content)

# Second call with the identical request (cache hit)
response2 = completion(model="ollama/llama3.2", messages=messages, caching=True)
print("Second response (cached):", response2.choices[0].message.content)

Output
First response: Caching in LiteLLM stores previous outputs to speed up repeated queries.
Second response (cached): Caching in LiteLLM stores previous outputs to speed up repeated queries.
Common variations
You can also implement manual caching with Python decorators such as functools.lru_cache when you want finer control. Caching works across the models LiteLLM supports, and because the request (including the model name) forms the cache key, the same prompt sent to different models such as llama3.1 or llama3.3 is cached separately. For asynchronous usage, call acompletion inside async functions and manage the cache accordingly.
from functools import lru_cache
from litellm import completion

@lru_cache(maxsize=32)
def cached_generate(prompt: str) -> str:
    response = completion(model="ollama/llama3.2",
                          messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

prompt = "Explain caching in LiteLLM."
print(cached_generate(prompt))
print(cached_generate(prompt))  # second call served from lru_cache

Output
The model's answer prints twice; the second call returns instantly from the in-process lru_cache without contacting the model.
Troubleshooting
- If caching does not seem to work, ensure the prompt, model, and call parameters are exactly the same, including whitespace in the prompt.
- Reset LiteLLM's in-memory cache by reassigning litellm.cache = Cache() (or disable caching with litellm.cache = None); lru_cache-decorated functions expose cache_clear().
- For large models, caching may increase memory usage; monitor resource consumption accordingly.
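For the decorator approach, functools provides cache_info() and cache_clear() to inspect and reset the cache without restarting the session. The sketch below uses fake_generate, a hypothetical stand-in for a model call, so it runs offline:

```python
from functools import lru_cache

calls = {"count": 0}  # track how often the underlying function really runs

@lru_cache(maxsize=32)
def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call
    calls["count"] += 1
    return f"answer to: {prompt}"

fake_generate("hi")
fake_generate("hi")                # cache hit: underlying function not re-run
print(fake_generate.cache_info())  # shows 1 hit, 1 miss
fake_generate.cache_clear()        # empty the cache (also resets statistics)
fake_generate("hi")                # cache miss again after clearing
print(calls["count"])              # 2 real calls in total
```

Note that cache_clear() resets the hit/miss statistics along with the stored entries, so cache_info() reports fresh counts afterwards.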
Key Takeaways
- Enable caching in LiteLLM by setting litellm.cache = Cache() and passing caching=True on completion calls.
- Repeated identical prompts return cached results instantly, reducing latency.
- Manual caching with decorators like lru_cache offers flexible control.
- Caching increases memory usage; monitor your environment accordingly.
- Ensure prompt strings match exactly to hit the cache.