How-to · Beginner · 3 min read

Memory retrieval precision metrics

Quick answer
Use precision, recall, and F1 score to measure memory retrieval accuracy in AI systems. Precision measures how many retrieved items are actually relevant; recall measures how many relevant items were retrieved; F1 is their harmonic mean, balancing the two.

PREREQUISITES

  • Python 3.8+
  • pip install scikit-learn
  • Basic knowledge of AI memory retrieval

Setup

Install the scikit-learn library for metric calculations and set up your Python environment.

bash
pip install scikit-learn
output
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.25.0)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.11.0)

Step by step

Calculate precision, recall, and F1 score for memory retrieval by comparing retrieved items against relevant ground truth items.

python
from sklearn.metrics import precision_score, recall_score, f1_score

# Example ground truth and retrieved memory items as binary relevance vectors
# 1 = relevant, 0 = irrelevant

ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved =    [1, 0, 0, 1, 0, 1, 0]

precision = precision_score(ground_truth, retrieved)
recall = recall_score(ground_truth, retrieved)
f1 = f1_score(ground_truth, retrieved)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
output
Precision: 0.67
Recall: 0.50
F1 Score: 0.57
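To see where these numbers come from, the same metrics can be computed by hand from true positive, false positive, and false negative counts. This is a minimal plain-Python sketch of the underlying formulas, using the same vectors as above:

```python
ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved =    [1, 0, 0, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives
tp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 1)
fp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 0 and r == 1)
fn = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 0)

precision = tp / (tp + fp)  # 2 / 3: share of retrieved items that are relevant
recall = tp / (tp + fn)     # 2 / 4: share of relevant items that were retrieved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.2f}")  # 0.67
print(f"Recall: {recall:.2f}")        # 0.50
print(f"F1 Score: {f1:.2f}")          # 0.57
```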

Common variations

You can compute these metrics for multi-class or multi-label retrieval tasks by adjusting the average parameter in scikit-learn metrics. For asynchronous or streaming retrieval, accumulate results incrementally and compute metrics periodically.

python
from sklearn.metrics import precision_score, recall_score, f1_score

# Multi-label example with average='macro'
ground_truth_multi = [[1,0,1],[0,1,1],[1,1,0]]
retrieved_multi = [[1,0,0],[0,1,1],[1,0,0]]

precision = precision_score(ground_truth_multi, retrieved_multi, average='macro')
recall = recall_score(ground_truth_multi, retrieved_multi, average='macro')
f1 = f1_score(ground_truth_multi, retrieved_multi, average='macro')

print(f"Macro Precision: {precision:.2f}")
print(f"Macro Recall: {recall:.2f}")
print(f"Macro F1 Score: {f1:.2f}")
output
Macro Precision: 1.00
Macro Recall: 0.67
Macro F1 Score: 0.78
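For the streaming case mentioned above, one simple approach (a sketch, not tied to any particular framework) is to accumulate true positive, false positive, and false negative counts per batch and derive the metrics on demand:

```python
class StreamingRetrievalMetrics:
    """Accumulates TP/FP/FN counts across batches of binary
    relevance vectors and computes metrics on demand."""

    def __init__(self):
        self.tp = self.fp = self.fn = 0

    def update(self, ground_truth, retrieved):
        for g, r in zip(ground_truth, retrieved):
            self.tp += int(g == 1 and r == 1)
            self.fp += int(g == 0 and r == 1)
            self.fn += int(g == 1 and r == 0)

    def precision(self):
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    def recall(self):
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    def f1(self):
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if p + r else 0.0


metrics = StreamingRetrievalMetrics()
metrics.update([1, 0, 1], [1, 0, 0])        # first batch
metrics.update([1, 0, 0, 1], [1, 0, 1, 0])  # second batch

print(f"Precision: {metrics.precision():.2f}")  # 0.67
print(f"Recall: {metrics.recall():.2f}")        # 0.50
```

Because the counts are additive, the periodic metrics are exact (so-called "micro" aggregation), not an average of per-batch values.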

Troubleshooting

  • If precision_score or recall_score emits an ill-defined metric warning (UndefinedMetricWarning), your retrieved vector contains no positive predictions (precision is undefined) or your ground truth contains no positives (recall is undefined).
  • For sparse or imbalanced retrieval results, use zero_division=0 in metric functions to avoid exceptions.
  • Verify that your ground truth and retrieved lists are aligned and of equal length.
python
# zero_division=0 returns 0.0 instead of warning when a
# metric's denominator is zero (e.g., no positive predictions)
precision = precision_score(ground_truth, retrieved, zero_division=0)
recall = recall_score(ground_truth, retrieved, zero_division=0)
f1 = f1_score(ground_truth, retrieved, zero_division=0)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
output
Precision: 0.67
Recall: 0.50
F1 Score: 0.57

Key Takeaways

  • Use precision, recall, and F1 score to quantify memory retrieval accuracy.
  • Leverage scikit-learn metrics for straightforward calculation with binary or multi-label data.
  • Handle imbalanced or sparse data by setting zero_division=0 to avoid metric errors.
Verified 2026-04