How to evaluate AI memory quality
Quick answer
To evaluate AI memory quality, query your stored memory vectors retrieval-augmented-generation style and check recall accuracy and relevance. Measure consistency by comparing responses to repeated queries, and use embedding-similarity metrics (text-embedding-3-small below) to quantify memory fidelity.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable to access AI models for memory evaluation.
pip install "openai>=1.0"

Output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
This example demonstrates how to evaluate AI memory quality by embedding stored memory texts, querying with a prompt, and measuring recall accuracy and embedding similarity.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample memory texts stored in AI memory
memory_texts = [
    "The capital of France is Paris.",
    "Python is a popular programming language.",
    "The Earth revolves around the Sun.",
]
# Step 1: Create embeddings for memory texts
embedding_responses = [
    client.embeddings.create(model="text-embedding-3-small", input=text)
    for text in memory_texts
]
memory_vectors = [resp.data[0].embedding for resp in embedding_responses]
# Step 2: Query prompt to test recall
query = "What is the capital city of France?"
query_embedding_resp = client.embeddings.create(model="text-embedding-3-small", input=query)
query_vector = query_embedding_resp.data[0].embedding
# Step 3: Compute cosine similarity between query and memory vectors
import numpy as np
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [cosine_similarity(query_vector, vec) for vec in memory_vectors]
best_match_index = np.argmax(similarities)
# Step 4: Use chat completion to verify recall
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]
messages.append({"role": "user", "content": query})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
answer = response.choices[0].message.content
print(f"Best memory match: {memory_texts[best_match_index]}")
print(f"Similarity score: {similarities[best_match_index]:.4f}")
print(f"AI answer: {answer}")

Output:
Best memory match: The capital of France is Paris.
Similarity score: 0.8723
AI answer: The capital city of France is Paris.
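The loop above makes one embedding request per memory text. The embeddings endpoint also accepts a list of inputs, so the same vectors can be fetched in a single call. A minimal sketch (the `batch_embed` helper is our own name, not part of the SDK; the API call only runs if a key is set):

```python
import os

def batch_embed(client, texts, model="text-embedding-3-small"):
    # One request for all texts; resp.data preserves input order.
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    memory_texts = [
        "The capital of France is Paris.",
        "Python is a popular programming language.",
        "The Earth revolves around the Sun.",
    ]
    memory_vectors = batch_embed(client, memory_texts)
    print(len(memory_vectors))  # one vector per memory text
```

Batching cuts request overhead and keeps the vectors aligned with `memory_texts` by index.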
Common variations
You can evaluate memory quality asynchronously with the SDK's AsyncOpenAI client, test different embedding models such as text-embedding-3-large, or stream chat completions for real-time feedback.
import os
import asyncio
import numpy as np
from openai import AsyncOpenAI

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

async def evaluate_memory_async():
    # AsyncOpenAI returns awaitables, so the embedding calls can run concurrently.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    memory_texts = [
        "The capital of France is Paris.",
        "Python is a popular programming language.",
        "The Earth revolves around the Sun.",
    ]
    embedding_tasks = [
        client.embeddings.create(model="text-embedding-3-small", input=text)
        for text in memory_texts
    ]
    embedding_responses = await asyncio.gather(*embedding_tasks)
    memory_vectors = [resp.data[0].embedding for resp in embedding_responses]
    query = "What is the capital city of France?"
    query_embedding_resp = await client.embeddings.create(model="text-embedding-3-small", input=query)
    query_vector = query_embedding_resp.data[0].embedding
    similarities = [cosine_similarity(query_vector, vec) for vec in memory_vectors]
    best_match_index = np.argmax(similarities)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ]
    response = await client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = response.choices[0].message.content
    print(f"Best memory match: {memory_texts[best_match_index]}")
    print(f"Similarity score: {similarities[best_match_index]:.4f}")
    print(f"AI answer: {answer}")

asyncio.run(evaluate_memory_async())

Output:
Best memory match: The capital of France is Paris.
Similarity score: 0.8723
AI answer: The capital city of France is Paris.
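The streaming option mentioned above can be sketched with the standard `stream=True` flag on chat completions; `collect_stream` is a hypothetical helper of ours that joins the incremental text deltas (the API call only runs if a key is set):

```python
import os

def collect_stream(chunks):
    # Concatenate the incremental text deltas from a streamed completion.
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital city of France?"}],
        stream=True,
    )
    print(collect_stream(stream))
```

In a real-time setting you would print each delta as it arrives instead of collecting them, which gives immediate feedback while the answer is still being generated.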
Troubleshooting
- If similarity scores are unexpectedly low, verify that the embedding model matches between stored memory and queries.
- If AI answers are inconsistent, increase max_tokens or adjust the system prompt for clarity.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"] to avoid authentication errors.
Key takeaways
- Use embedding similarity to quantitatively measure AI memory recall accuracy.
- Combine vector search with chat completions to verify memory relevance and consistency.
- Test asynchronously or with streaming for scalable and real-time memory evaluation.
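Consistency, mentioned in the takeaways, can be estimated by asking the same question several times, embedding each answer, and averaging pairwise cosine similarity. A minimal sketch under that assumption (`consistency_score` is our own name; the API calls only run if a key is set):

```python
import os
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_score(vectors):
    # Mean pairwise cosine similarity; 1.0 means all answers point the same way.
    sims = [cosine_similarity(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install "openai>=1.0"`
    client = OpenAI()
    query = "What is the capital city of France?"
    answers = []
    for _ in range(3):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}],
        )
        answers.append(resp.choices[0].message.content)
    vectors = [
        client.embeddings.create(model="text-embedding-3-small", input=a).data[0].embedding
        for a in answers
    ]
    print(f"Consistency score: {consistency_score(vectors):.4f}")
```

Scores near 1.0 suggest stable recall; a noticeably lower score flags answers that drift between runs.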