How-to · Intermediate · 3 min read

Embedding caching strategies

Quick answer
Use embedding caching to store and reuse vector representations from models like text-embedding-3-small, reducing redundant API calls and latency. Common strategies include local in-memory caches, persistent key-value stores, and hybrid approaches combining fast retrieval with periodic refreshes.

PREREQUISITES

  • Python 3.9+ (the example uses built-in generic type hints like list[float])
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as redirection)

Set up the embedding environment

Install the openai Python package and set your API key as an environment variable to access embedding models.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-key-here"

Step-by-step caching example

This example demonstrates caching embeddings locally in a Python dictionary to avoid repeated API calls for the same input text.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache dictionary
embedding_cache = {}

def get_embedding(text: str) -> list[float]:
    if text in embedding_cache:
        print("Cache hit")
        return embedding_cache[text]
    print("Cache miss - calling API")
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    vector = response.data[0].embedding
    embedding_cache[text] = vector
    return vector

# Usage
texts = ["Hello world", "Hello world", "OpenAI embeddings"]
for t in texts:
    vec = get_embedding(t)
    print(f"Embedding length for '{t}':", len(vec))

output
Cache miss - calling API
Embedding length for 'Hello world': 1536
Cache hit
Embedding length for 'Hello world': 1536
Cache miss - calling API
Embedding length for 'OpenAI embeddings': 1536

Common variations

  • Persistent caching: Use databases like Redis, SQLite, or disk-based key-value stores to cache embeddings across sessions.
  • Batch embedding requests: Cache embeddings for batches of texts to reduce API calls and improve throughput.
  • Async caching: Implement asynchronous calls with caching for scalable, non-blocking embedding retrieval.
  • Cache invalidation: Periodically refresh cached embeddings if the model or input data changes.
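
As a sketch of the persistent-caching variation, the in-memory dictionary above can be swapped for a SQLite table. The `make_cached_embedder` helper and `fake_embed` stub below are illustrative names, not part of the OpenAI SDK; in practice `embed_fn` would wrap `client.embeddings.create`.

```python
import json
import sqlite3
from typing import Callable, List

def make_cached_embedder(db_path: str, embed_fn: Callable[[str], List[float]]):
    """Return a get_embedding(text) function that consults a SQLite cache first."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (text TEXT PRIMARY KEY, vector TEXT)"
    )

    def get_embedding(text: str) -> List[float]:
        row = conn.execute(
            "SELECT vector FROM embeddings WHERE text = ?", (text,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: deserialize the stored vector
        vector = embed_fn(text)  # cache miss: compute and persist
        conn.execute(
            "INSERT INTO embeddings VALUES (?, ?)", (text, json.dumps(vector))
        )
        conn.commit()
        return vector

    return get_embedding

# Demo with a stub embedder; use a real file path (e.g. "embeddings.db")
# instead of ":memory:" so the cache survives across sessions.
api_calls = []
def fake_embed(text: str) -> List[float]:
    api_calls.append(text)
    return [float(len(text))] * 3

get_embedding = make_cached_embedder(":memory:", fake_embed)
get_embedding("Hello world")
get_embedding("Hello world")  # second call is served from SQLite
print(len(api_calls))  # 1
```

Storing vectors as JSON keeps the example dependency-free; a binary format or a vector database would be more compact at scale.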

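The batch variation can be sketched like this. `embed_batch_fn` is a placeholder for a single batched API call (the embeddings endpoint accepts a list of inputs), so only texts missing from the cache are sent over the network:

```python
from typing import Callable, Dict, List

def get_embeddings_batch(
    texts: List[str],
    embed_batch_fn: Callable[[List[str]], List[List[float]]],
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Embed many texts, issuing one batched call for the cache misses only."""
    # Deduplicate while preserving order, then keep only uncached texts.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_batch_fn(missing)):
            cache[text] = vector
    return [cache[t] for t in texts]

# Demo with a stub; in practice embed_batch_fn would wrap
# client.embeddings.create(model="text-embedding-3-small", input=missing)
# and return [d.embedding for d in response.data].
batch_calls = []
def fake_batch_embed(batch: List[str]) -> List[List[float]]:
    batch_calls.append(batch)
    return [[float(len(t))] for t in batch]

cache: Dict[str, List[float]] = {}
vectors = get_embeddings_batch(
    ["Hello world", "Hello world", "OpenAI embeddings"], fake_batch_embed, cache
)
print(len(batch_calls), len(vectors))  # 1 3
```
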
Troubleshooting cache issues

  • If you see stale embeddings, implement cache expiration or versioning to refresh outdated vectors.
  • For memory overflow, switch from in-memory cache to persistent storage or limit cache size with eviction policies.
  • Ensure consistent input normalization (e.g., trimming whitespace) to avoid cache misses due to input variations.
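
The last two points can be sketched together: normalize the text before using it as a key, and cap memory with simple least-recently-used eviction. `normalize_key` and `LRUEmbeddingCache` are illustrative names; lowercasing is deliberately omitted because it would change what gets embedded.

```python
import re
from collections import OrderedDict

def normalize_key(text: str) -> str:
    """Trim and collapse whitespace so trivially different inputs share a key."""
    return re.sub(r"\s+", " ", text.strip())

class LRUEmbeddingCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, maxsize: int = 1024):
        self._data: OrderedDict = OrderedDict()
        self._maxsize = maxsize

    def get_or_compute(self, text, embed_fn):
        key = normalize_key(text)
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        vector = embed_fn(key)
        self._data[key] = vector
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the oldest entry
        return vector

# Demo with a stub embedder: whitespace variants resolve to the same entry.
hits = []
def fake_embed(text):
    hits.append(text)
    return [float(len(text))]

cache = LRUEmbeddingCache(maxsize=2)
cache.get_or_compute("Hello world", fake_embed)
cache.get_or_compute("  Hello   world ", fake_embed)  # same key, no new call
print(len(hits))  # 1
```

Note that the normalized text is what gets embedded, so the stored vector always matches its key.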

Key Takeaways

  • Cache embeddings locally or persistently to reduce API calls and latency.
  • Use batch and async embedding requests combined with caching for scalability.
  • Implement cache invalidation strategies to keep embeddings up to date.
Verified 2026-04 · text-embedding-3-small