How-to · Beginner · 3 min read

Memory retrieval precision metrics

Quick answer
Use precision, recall, and F1 score to measure memory retrieval accuracy in AI systems. Precision measures how many retrieved items are actually relevant; recall measures how many relevant items were retrieved; F1 is their harmonic mean, balancing the two.

PREREQUISITES

  • Python 3.8+
  • pip install scikit-learn
  • Basic knowledge of AI memory retrieval

Setup

Install the scikit-learn library for metric calculations and set up your Python environment.

bash
pip install scikit-learn
output
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.25.0)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.11.0)

Step by step

Calculate precision, recall, and F1 score for memory retrieval by comparing retrieved items against relevant ground truth items.

python
from sklearn.metrics import precision_score, recall_score, f1_score

# Example ground truth and retrieved memory items as binary relevance vectors
# 1 = relevant, 0 = irrelevant

ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved =    [1, 0, 0, 1, 0, 1, 0]

precision = precision_score(ground_truth, retrieved)
recall = recall_score(ground_truth, retrieved)
f1 = f1_score(ground_truth, retrieved)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
output
Precision: 0.67
Recall: 0.50
F1 Score: 0.57
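To see where these numbers come from, the same metrics can be computed by hand from true positive, false positive, and false negative counts. This is a minimal plain-Python sketch of the underlying formulas, using the same vectors as above:

```python
ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved =    [1, 0, 0, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives
tp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 1)
fp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 0 and r == 1)
fn = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 0)

precision = tp / (tp + fp)  # 2 / 3: share of retrieved items that are relevant
recall = tp / (tp + fn)     # 2 / 4: share of relevant items that were retrieved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.2f}")  # 0.67
print(f"Recall: {recall:.2f}")        # 0.50
print(f"F1 Score: {f1:.2f}")          # 0.57
```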

Common variations

You can compute these metrics for multi-class or multi-label retrieval tasks by adjusting the average parameter in scikit-learn metrics. For asynchronous or streaming retrieval, accumulate results incrementally and compute metrics periodically.

python
from sklearn.metrics import precision_score, recall_score, f1_score

# Multi-label example with average='macro'
ground_truth_multi = [[1,0,1],[0,1,1],[1,1,0]]
retrieved_multi = [[1,0,0],[0,1,1],[1,0,0]]

precision = precision_score(ground_truth_multi, retrieved_multi, average='macro')
recall = recall_score(ground_truth_multi, retrieved_multi, average='macro')
f1 = f1_score(ground_truth_multi, retrieved_multi, average='macro')

print(f"Macro Precision: {precision:.2f}")
print(f"Macro Recall: {recall:.2f}")
print(f"Macro F1 Score: {f1:.2f}")
output
Macro Precision: 1.00
Macro Recall: 0.67
Macro F1 Score: 0.78
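For the streaming case mentioned above, one simple approach (a sketch, not tied to any particular framework) is to accumulate true positive, false positive, and false negative counts per batch and derive the metrics on demand:

```python
class StreamingRetrievalMetrics:
    """Accumulates TP/FP/FN counts across batches of binary
    relevance vectors and computes metrics on demand."""

    def __init__(self):
        self.tp = self.fp = self.fn = 0

    def update(self, ground_truth, retrieved):
        for g, r in zip(ground_truth, retrieved):
            self.tp += int(g == 1 and r == 1)
            self.fp += int(g == 0 and r == 1)
            self.fn += int(g == 1 and r == 0)

    def precision(self):
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    def recall(self):
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    def f1(self):
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if p + r else 0.0


metrics = StreamingRetrievalMetrics()
metrics.update([1, 0, 1], [1, 0, 0])        # first batch
metrics.update([1, 0, 0, 1], [1, 0, 1, 0])  # second batch

print(f"Precision: {metrics.precision():.2f}")  # 0.67
print(f"Recall: {metrics.recall():.2f}")        # 0.50
```

Because the counts are additive, the periodic metrics are exact (so-called "micro" aggregation), not an average of per-batch values.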

Troubleshooting

  • If precision_score or recall_score emits an ill-defined metric warning (UndefinedMetricWarning), your retrieved vector contains no positive predictions (precision is undefined) or your ground truth contains no positives (recall is undefined).
  • For sparse or imbalanced retrieval results, use zero_division=0 in metric functions to avoid exceptions.
  • Verify that your ground truth and retrieved lists are aligned and of equal length.
python
# zero_division=0 returns 0.0 instead of warning when a
# metric's denominator is zero (e.g., no positive predictions)
precision = precision_score(ground_truth, retrieved, zero_division=0)
recall = recall_score(ground_truth, retrieved, zero_division=0)
f1 = f1_score(ground_truth, retrieved, zero_division=0)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
output
Precision: 0.67
Recall: 0.50
F1 Score: 0.57

Key Takeaways

  • Use precision, recall, and F1 score to quantify memory retrieval accuracy.
  • Leverage scikit-learn metrics for straightforward calculation with binary or multi-label data.
  • Handle imbalanced or sparse data by setting zero_division=0 to avoid metric errors.
Verified 2026-04