How-to · Intermediate · 3 min read

Semantic caching for LLMs explained

Quick answer
Semantic caching for LLMs stores vector embeddings of previous queries alongside their responses, so that semantically similar new queries can be answered from the cache instead of triggering a redundant API call. An embedding model plus vector similarity search finds cached responses semantically close to the input, cutting both cost and latency.
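The retrieval test at the heart of this idea is just cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional stand-ins for real embeddings (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the embeddings of a cached query and a new, similar query
cached_query = [0.9, 0.1, 0.2]
new_query = [0.85, 0.15, 0.25]

similarity = cosine_similarity(cached_query, new_query)
is_hit = similarity >= 0.85  # same threshold used in the full example below
print(f"similarity={similarity:.3f}, cache hit: {is_hit}")
```

A similarity above the chosen threshold counts as a cache hit; the stored response is returned without calling the LLM.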

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0
  • pip install faiss-cpu or chromadb

Setup

Install the required Python packages for OpenAI API access and vector similarity search. Set your API key as an environment variable.

bash
pip install openai faiss-cpu
output
Collecting openai
Collecting faiss-cpu
Successfully installed openai-1.x.x faiss-cpu-1.x.x

Step by step

This example demonstrates semantic caching by embedding queries, storing them in a vector index, and retrieving cached responses for similar queries to avoid repeated LLM calls.

python
import os
from openai import OpenAI
import faiss
import numpy as np

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache: list of (embedding, response) pairs
cache_embeddings = []
cache_responses = []

# Function to embed text
def embed_text(text):
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding, dtype=np.float32)

# Initialize FAISS index for cosine similarity (inner product on normalized vectors)
dim = 1536  # embedding dimension for text-embedding-3-small
index = faiss.IndexFlatIP(dim)

# Normalize embeddings before adding to index
def normalize(vec):
    return vec / np.linalg.norm(vec)

# Add to cache
def add_to_cache(query, response_text):
    emb = embed_text(query)
    emb = normalize(emb)
    cache_embeddings.append(emb)
    cache_responses.append(response_text)
    index.add(np.array([emb]))

# Query cache: return a cached response for a sufficiently similar query, else None
def query_cache(query, threshold=0.85):
    if index.ntotal == 0:  # nothing cached yet
        return None
    emb = normalize(embed_text(query))
    D, I = index.search(np.array([emb]), k=1)
    # FAISS returns -1 for missing neighbors, so guard before indexing the cache
    if I[0][0] != -1 and D[0][0] >= threshold:
        return cache_responses[I[0][0]]
    return None

# Main function: serve from cache when possible, otherwise call the LLM
def get_response(query):
    cached = query_cache(query)
    if cached is not None:  # "is not None" so an empty cached string still counts as a hit
        print("Cache hit")
        return cached
        print("Cache hit")
        return cached
    print("Cache miss, calling LLM")
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    )
    response_text = completion.choices[0].message.content
    add_to_cache(query, response_text)
    return response_text

# Example usage
if __name__ == "__main__":
    q1 = "What is semantic caching for LLMs?"
    print(get_response(q1))
    # Repeat similar query to trigger cache hit
    q2 = "Explain semantic caching in large language models."
    print(get_response(q2))
output
Cache miss, calling LLM
Semantic caching for LLMs involves storing vector embeddings of previous queries and their responses to reuse answers for similar inputs, reducing API calls and latency.
Cache hit
Semantic caching for LLMs involves storing vector embeddings of previous queries and their responses to reuse answers for similar inputs, reducing API calls and latency.

Common variations

  • Use chromadb or pinecone for scalable vector stores instead of FAISS.
  • Implement async calls with asyncio and OpenAI's async SDK methods.
  • Adjust similarity threshold to balance cache hit rate and accuracy.
  • Use different embedding models like text-embedding-3-large for better semantic understanding.
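To make the threshold trade-off from these bullets concrete, here is a toy sweep over candidate thresholds. The similarity scores and relevance labels are invented for illustration, not produced by a real embedding model:

```python
# Invented (similarity_to_best_match, was_actually_relevant) pairs
observations = [
    (0.98, True), (0.92, True), (0.88, True),
    (0.86, False), (0.80, False), (0.60, False),
]

for threshold in (0.80, 0.85, 0.90):
    hits = [relevant for sim, relevant in observations if sim >= threshold]
    hit_rate = len(hits) / len(observations)          # fraction served from cache
    precision = sum(hits) / len(hits) if hits else 1.0  # fraction of hits that were correct
    print(f"threshold={threshold:.2f}  hit rate={hit_rate:.2f}  precision={precision:.2f}")
```

Raising the threshold lowers the hit rate but makes each hit more trustworthy; the right balance depends on how costly a wrong cached answer is in your application.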

Troubleshooting

  • If cache hits return irrelevant results, raise the similarity threshold (require closer matches) or improve embedding quality.
  • If FAISS index throws dimension errors, verify embedding dimension matches index setup.
  • For large caches, consider persistent vector databases to avoid memory overflow.
  • Ensure environment variable OPENAI_API_KEY is set correctly to avoid authentication errors.
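The dimension errors mentioned above can be caught early with a small guard before anything reaches the index. A sketch (1536 is the dimension of text-embedding-3-small used in this article):

```python
def check_embedding_dim(embedding, expected_dim=1536):
    """Raise a clear error if an embedding won't fit the vector index."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"Embedding has dimension {len(embedding)}, but the index "
            f"expects {expected_dim}. Did the embedding model change?"
        )
    return embedding

check_embedding_dim([0.0] * 1536)  # passes silently
```

Calling this in `add_to_cache` and `query_cache` turns a cryptic FAISS assertion into an actionable error message.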

Key Takeaways

  • Semantic caching uses vector embeddings to find and reuse similar past LLM responses, reducing API calls and cost.
  • FAISS or vector databases enable efficient similarity search for cached query embeddings.
  • Adjust similarity thresholds to balance cache accuracy and hit rate.
  • Use semantic caching to improve latency and cost efficiency in LLM-powered applications.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small