How to add request caching to FastAPI LLM endpoint
Quick answer
Use an in-memory cache such as Python's functools.lru_cache, or an external cache such as Redis, to store LLM responses keyed by the request parameters in your FastAPI endpoint. This avoids repeated calls to the LLM API for identical inputs, improving response time and reducing API usage.

Prerequisites
- Python 3.8+
- FastAPI
- OpenAI API key (free tier works)
- pip install fastapi uvicorn openai redis
Setup
Install required packages and set your environment variable for the OpenAI API key.
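The code below reads the key from the environment, so export it before starting the app. A minimal sketch for POSIX shells (Linux/macOS); the key value is a placeholder you must replace:

```shell
# Make the OpenAI API key available to the app for this shell session.
# Replace the placeholder with your real key.
export OPENAI_API_KEY="sk-your-key-here"
```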
- Install FastAPI, Uvicorn, the OpenAI SDK, and the Redis client:
pip install fastapi uvicorn openai redis

Step by step
This example shows a FastAPI endpoint that caches LLM responses in Redis, keyed by a hash of the user prompt, to avoid redundant API calls.

```python
import os
import hashlib

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from redis import Redis
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Connect to Redis (default localhost:6379)
redis_client = Redis(host="localhost", port=6379, db=0, decode_responses=True)

class PromptRequest(BaseModel):
    prompt: str

CACHE_TTL_SECONDS = 3600  # Cache expiry time

def cache_key(prompt: str) -> str:
    # Create a deterministic, collision-resistant key for the prompt
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

@app.post("/generate")
async def generate_text(request: PromptRequest):
    key = cache_key(request.prompt)
    cached = redis_client.get(key)
    if cached:
        return {"text": cached, "cached": True}
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": request.prompt}],
        )
        text = response.choices[0].message.content
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    # Cache the response with a TTL so stale entries expire
    redis_client.setex(key, CACHE_TTL_SECONDS, text)
    return {"text": text, "cached": False}
```

Common variations
- Use functools.lru_cache for simple in-memory caching if your app runs on a single instance.
- Use an async Redis client (redis.asyncio, the successor to the deprecated aioredis package) for fully async endpoints.
- Switch models by changing the model parameter in the client.chat.completions.create call.
For the single-instance case, a minimal lru_cache version looks like this:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

@app.post("/generate_lru")
def generate_lru(request: PromptRequest):
    text = cached_generate(request.prompt)
    return {"text": text}
```

Troubleshooting
- If the Redis connection fails, ensure the Redis server is running and accessible at the configured host and port.
- Cache misses are normal on first requests; verify caching by repeating a call with the same prompt and checking the cached flag in the response.
- Handle API errors with try/except so a failed LLM call returns an HTTP error instead of crashing the endpoint.
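The hit/miss behavior can also be checked in isolation, without Redis or an API key. In this sketch a plain dict stands in for Redis and fake_llm stands in for the OpenAI call (both are hypothetical stand-ins, not part of the example above):

```python
import hashlib

# Stand-ins for Redis and the LLM call, to demonstrate hit/miss behavior
cache = {}

def fake_llm(prompt: str) -> str:
    return f"response to: {prompt}"

def cache_key(prompt: str) -> str:
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

def generate(prompt: str) -> dict:
    key = cache_key(prompt)
    if key in cache:
        return {"text": cache[key], "cached": True}
    text = fake_llm(prompt)
    cache[key] = text
    return {"text": text, "cached": False}

first = generate("hello")   # miss: calls the (fake) LLM and stores the result
second = generate("hello")  # hit: served from the cache
print(first["cached"], second["cached"])  # False True
```

The same two-call pattern against the real endpoint (identical prompt, compare the cached flag) confirms the Redis path works.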
Key Takeaways
- Use Redis or in-memory caches to store LLM responses keyed by prompt to reduce redundant API calls.
- Cache keys should be deterministic and collision-resistant, e.g., SHA-256 hash of the prompt.
- Set cache expiration to balance freshness and cost savings.
- Handle API errors and cache misses gracefully in your FastAPI endpoint.
- Async Redis clients enable fully async FastAPI endpoints for better performance.
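As a quick check of the deterministic-key takeaway, hashing the same prompt twice yields the identical key, and hashlib is in the standard library:

```python
import hashlib

def cache_key(prompt: str) -> str:
    # SHA-256 always maps the same input to the same 64-hex-digit digest
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

k1 = cache_key("What is FastAPI?")
k2 = cache_key("What is FastAPI?")
print(k1 == k2)  # True
```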