How to evaluate AI memory quality
Quick answer
To evaluate AI memory quality, query your stored memory vectors retrieval-augmented-generation style and check recall accuracy and relevance. Measure consistency by comparing responses to repeated queries, and use embedding-similarity metrics (text-embedding-3-small below) to quantify memory fidelity.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable to access AI models for memory evaluation.
pip install "openai>=1.0"

Output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
This example demonstrates how to evaluate AI memory quality by embedding stored memory texts, querying with a prompt, and measuring recall accuracy and embedding similarity.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample memory texts stored in AI memory
memory_texts = [
    "The capital of France is Paris.",
    "Python is a popular programming language.",
    "The Earth revolves around the Sun.",
]
# Step 1: Create embeddings for memory texts
embedding_responses = [
    client.embeddings.create(model="text-embedding-3-small", input=text)
    for text in memory_texts
]
memory_vectors = [resp.data[0].embedding for resp in embedding_responses]
# Step 2: Query prompt to test recall
query = "What is the capital city of France?"
query_embedding_resp = client.embeddings.create(model="text-embedding-3-small", input=query)
query_vector = query_embedding_resp.data[0].embedding
# Step 3: Compute cosine similarity between query and memory vectors
import numpy as np
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [cosine_similarity(query_vector, vec) for vec in memory_vectors]
best_match_index = np.argmax(similarities)
# Step 4: Use chat completion to verify recall
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]
messages.append({"role": "user", "content": query})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
answer = response.choices[0].message.content
print(f"Best memory match: {memory_texts[best_match_index]}")
print(f"Similarity score: {similarities[best_match_index]:.4f}")
print(f"AI answer: {answer}")

Output:
Best memory match: The capital of France is Paris.
Similarity score: 0.8723
AI answer: The capital city of France is Paris.
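The loop above makes one embedding request per memory text. The embeddings endpoint also accepts a list of inputs, so the same vectors can be fetched in a single call. A minimal sketch (the `batch_embed` helper is our own name, not part of the SDK; the API call only runs if a key is set):

```python
import os

def batch_embed(client, texts, model="text-embedding-3-small"):
    # One request for all texts; resp.data preserves input order.
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    memory_texts = [
        "The capital of France is Paris.",
        "Python is a popular programming language.",
        "The Earth revolves around the Sun.",
    ]
    memory_vectors = batch_embed(client, memory_texts)
    print(len(memory_vectors))  # one vector per memory text
```

Batching cuts request overhead and keeps the vectors aligned with `memory_texts` by index.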
Common variations
You can evaluate memory quality asynchronously with the SDK's AsyncOpenAI client, test different embedding models such as text-embedding-3-large, or stream chat completions for real-time feedback.
import os
import asyncio
import numpy as np
from openai import AsyncOpenAI

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

async def evaluate_memory_async():
    # AsyncOpenAI returns awaitables, so the embedding calls can run concurrently.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    memory_texts = [
        "The capital of France is Paris.",
        "Python is a popular programming language.",
        "The Earth revolves around the Sun.",
    ]
    embedding_tasks = [
        client.embeddings.create(model="text-embedding-3-small", input=text)
        for text in memory_texts
    ]
    embedding_responses = await asyncio.gather(*embedding_tasks)
    memory_vectors = [resp.data[0].embedding for resp in embedding_responses]
    query = "What is the capital city of France?"
    query_embedding_resp = await client.embeddings.create(model="text-embedding-3-small", input=query)
    query_vector = query_embedding_resp.data[0].embedding
    similarities = [cosine_similarity(query_vector, vec) for vec in memory_vectors]
    best_match_index = np.argmax(similarities)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ]
    response = await client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = response.choices[0].message.content
    print(f"Best memory match: {memory_texts[best_match_index]}")
    print(f"Similarity score: {similarities[best_match_index]:.4f}")
    print(f"AI answer: {answer}")

asyncio.run(evaluate_memory_async())

Output:
Best memory match: The capital of France is Paris.
Similarity score: 0.8723
AI answer: The capital city of France is Paris.
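The streaming option mentioned above can be sketched with the standard `stream=True` flag on chat completions; `collect_stream` is a hypothetical helper of ours that joins the incremental text deltas (the API call only runs if a key is set):

```python
import os

def collect_stream(chunks):
    # Concatenate the incremental text deltas from a streamed completion.
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital city of France?"}],
        stream=True,
    )
    print(collect_stream(stream))
```

In a real-time setting you would print each delta as it arrives instead of collecting them, which gives immediate feedback while the answer is still being generated.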
Troubleshooting
- If similarity scores are unexpectedly low, verify that the embedding model matches between stored memory and queries.
- If AI answers are inconsistent, increase max_tokens or adjust the system prompt for clarity.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"] to avoid authentication errors.
Key takeaways
- Use embedding similarity to quantitatively measure AI memory recall accuracy.
- Combine vector search with chat completions to verify memory relevance and consistency.
- Test asynchronously or with streaming for scalable and real-time memory evaluation.
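Consistency, mentioned in the takeaways, can be estimated by asking the same question several times, embedding each answer, and averaging pairwise cosine similarity. A minimal sketch under that assumption (`consistency_score` is our own name; the API calls only run if a key is set):

```python
import os
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_score(vectors):
    # Mean pairwise cosine similarity; 1.0 means all answers point the same way.
    sims = [cosine_similarity(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    query = "What is the capital city of France?"
    answers = []
    for _ in range(3):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}],
        )
        answers.append(resp.choices[0].message.content)
    vectors = [
        client.embeddings.create(model="text-embedding-3-small", input=a).data[0].embedding
        for a in answers
    ]
    print(f"Consistency score: {consistency_score(vectors):.4f}")
```

Scores near 1.0 suggest stable recall; a noticeably lower score flags answers that drift between runs.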