How to add request caching to FastAPI LLM endpoint
Quick answer
Use an in-memory cache such as Python's functools.lru_cache, or an external cache such as Redis, to store LLM responses keyed by the request parameters in your FastAPI endpoint. This avoids repeated calls to the LLM API for identical inputs, improving response time and reducing API usage.

Prerequisites
- Python 3.8+
- FastAPI
- OpenAI API key (free tier works)
- pip install fastapi uvicorn openai redis
Setup
Install required packages and set your environment variable for the OpenAI API key.
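The code below reads the key from the environment, so export it before starting the app. A minimal sketch for POSIX shells (Linux/macOS); the key value is a placeholder you must replace:

```shell
# Make the OpenAI API key available to the app for this shell session.
# Replace the placeholder with your real key.
export OPENAI_API_KEY="sk-your-key-here"
```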
- Install FastAPI, Uvicorn, the OpenAI SDK, and the Redis client:
pip install fastapi uvicorn openai redis

Step by step
This example shows a FastAPI endpoint that caches LLM responses in Redis, keyed by a hash of the user prompt, to avoid redundant API calls.

```python
import os
import hashlib

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from redis import Redis
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Connect to Redis (default localhost:6379)
redis_client = Redis(host="localhost", port=6379, db=0, decode_responses=True)

class PromptRequest(BaseModel):
    prompt: str

CACHE_TTL_SECONDS = 3600  # Cache expiry time

def cache_key(prompt: str) -> str:
    # Create a deterministic, collision-resistant key for the prompt
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

@app.post("/generate")
async def generate_text(request: PromptRequest):
    key = cache_key(request.prompt)
    cached = redis_client.get(key)
    if cached:
        return {"text": cached, "cached": True}
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": request.prompt}],
        )
        text = response.choices[0].message.content
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    # Cache the response with a TTL so stale entries expire
    redis_client.setex(key, CACHE_TTL_SECONDS, text)
    return {"text": text, "cached": False}
```

Common variations
- Use functools.lru_cache for simple in-memory caching if your app runs on a single instance.
- Use an async Redis client (redis.asyncio, the successor to the deprecated aioredis package) for fully async endpoints.
- Switch models by changing the model parameter in the client.chat.completions.create call.
For the single-instance case, a minimal lru_cache version looks like this:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

@app.post("/generate_lru")
def generate_lru(request: PromptRequest):
    text = cached_generate(request.prompt)
    return {"text": text}
```

Troubleshooting
- If the Redis connection fails, ensure the Redis server is running and accessible at the configured host and port.
- Cache misses are normal on first requests; verify caching by repeating a call with the same prompt and checking the cached flag in the response.
- Handle API errors with try/except so a failed LLM call returns an HTTP error instead of crashing the endpoint.
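The hit/miss behavior can also be checked in isolation, without Redis or an API key. In this sketch a plain dict stands in for Redis and fake_llm stands in for the OpenAI call (both are hypothetical stand-ins, not part of the example above):

```python
import hashlib

# Stand-ins for Redis and the LLM call, to demonstrate hit/miss behavior
cache = {}

def fake_llm(prompt: str) -> str:
    return f"response to: {prompt}"

def cache_key(prompt: str) -> str:
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

def generate(prompt: str) -> dict:
    key = cache_key(prompt)
    if key in cache:
        return {"text": cache[key], "cached": True}
    text = fake_llm(prompt)
    cache[key] = text
    return {"text": text, "cached": False}

first = generate("hello")   # miss: calls the (fake) LLM and stores the result
second = generate("hello")  # hit: served from the cache
print(first["cached"], second["cached"])  # False True
```

The same two-call pattern against the real endpoint (identical prompt, compare the cached flag) confirms the Redis path works.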
Key Takeaways
- Use Redis or in-memory caches to store LLM responses keyed by prompt to reduce redundant API calls.
- Cache keys should be deterministic and collision-resistant, e.g., SHA-256 hash of the prompt.
- Set cache expiration to balance freshness and cost savings.
- Handle API errors and cache misses gracefully in your FastAPI endpoint.
- Async Redis clients enable fully async FastAPI endpoints for better performance.
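As a quick check of the deterministic-key takeaway, hashing the same prompt twice yields the identical key, and hashlib is in the standard library:

```python
import hashlib

def cache_key(prompt: str) -> str:
    # SHA-256 always maps the same input to the same 64-hex-digit digest
    return "llm_cache:" + hashlib.sha256(prompt.encode()).hexdigest()

k1 = cache_key("What is FastAPI?")
k2 = cache_key("What is FastAPI?")
print(k1 == k2)  # True
```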