Memory retrieval precision metrics
Quick answer
Use precision, recall, and F1 score to measure memory retrieval accuracy in AI systems. These metrics quantify how well the retrieved memory items match the relevant ground truth, balancing correctness (precision) and completeness (recall).

Prerequisites
- Python 3.8+
- pip install scikit-learn
- Basic knowledge of AI memory retrieval
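To make the definitions concrete, here is a minimal sketch that computes all three metrics from raw true-positive, false-positive, and false-negative counts. The helper function name is illustrative, not part of any library:

```python
# Compute precision, recall, and F1 by hand from binary relevance vectors.
# retrieval_metrics is an illustrative helper, not a scikit-learn function.

def retrieval_metrics(ground_truth, retrieved):
    tp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 1)
    fp = sum(1 for g, r in zip(ground_truth, retrieved) if g == 0 and r == 1)
    fn = sum(1 for g, r in zip(ground_truth, retrieved) if g == 1 and r == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

p, r, f = retrieval_metrics([1, 0, 1, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1, 0])
print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1: {f:.2f}")
# -> Precision: 0.67, Recall: 0.50, F1: 0.57
```

Precision answers "how much of what I retrieved was relevant?"; recall answers "how much of what was relevant did I retrieve?"; F1 is their harmonic mean.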
Setup
Install the scikit-learn library for metric calculations and set up your Python environment.
pip install scikit-learn

output:
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.25.0)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.11.0)
Step by step
Calculate precision, recall, and F1 score for memory retrieval by comparing retrieved items against relevant ground truth items.
from sklearn.metrics import precision_score, recall_score, f1_score
# Example ground truth and retrieved memory items as binary relevance vectors
# 1 = relevant, 0 = irrelevant
ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved = [1, 0, 0, 1, 0, 1, 0]
precision = precision_score(ground_truth, retrieved)
recall = recall_score(ground_truth, retrieved)
f1 = f1_score(ground_truth, retrieved)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

output:
Precision: 0.67
Recall: 0.50
F1 Score: 0.57
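When you want all three numbers in one call, scikit-learn also provides precision_recall_fscore_support. A short sketch using the same vectors as above:

```python
from sklearn.metrics import precision_recall_fscore_support

ground_truth = [1, 0, 1, 1, 0, 0, 1]
retrieved = [1, 0, 0, 1, 0, 1, 0]

# average='binary' reports metrics for the positive (relevant) class only;
# the fourth return value (support) is None in this mode.
precision, recall, f1, _ = precision_recall_fscore_support(
    ground_truth, retrieved, average="binary"
)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

This avoids three separate passes over the data and keeps the three metrics guaranteed-consistent with each other.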
Common variations
You can compute these metrics for multi-class or multi-label retrieval tasks by adjusting the average parameter in scikit-learn metrics. For asynchronous or streaming retrieval, accumulate results incrementally and compute metrics periodically.
from sklearn.metrics import precision_score, recall_score, f1_score
# Multi-label example with average='macro'
ground_truth_multi = [[1,0,1],[0,1,1],[1,1,0]]
retrieved_multi = [[1,0,0],[0,1,1],[1,0,0]]
precision = precision_score(ground_truth_multi, retrieved_multi, average='macro')
recall = recall_score(ground_truth_multi, retrieved_multi, average='macro')
f1 = f1_score(ground_truth_multi, retrieved_multi, average='macro')
print(f"Macro Precision: {precision:.2f}")
print(f"Macro Recall: {recall:.2f}")
print(f"Macro F1 Score: {f1:.2f}")

output:
Macro Precision: 1.00
Macro Recall: 0.67
Macro F1 Score: 0.78
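For the streaming case mentioned above, one way to accumulate results incrementally is to keep running true-positive, false-positive, and false-negative counts and compute the metrics on demand. The class below is a hypothetical sketch, not a library API:

```python
# Hypothetical accumulator for evaluating streaming retrieval;
# not part of scikit-learn.

class StreamingRetrievalMetrics:
    def __init__(self):
        self.tp = self.fp = self.fn = 0

    def update(self, relevant: bool, retrieved: bool):
        # Count a single (ground truth, prediction) pair.
        if retrieved and relevant:
            self.tp += 1
        elif retrieved and not relevant:
            self.fp += 1
        elif relevant and not retrieved:
            self.fn += 1

    def precision(self):
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    def recall(self):
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    def f1(self):
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

metrics = StreamingRetrievalMetrics()
for g, r in zip([1, 0, 1, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1, 0]):
    metrics.update(bool(g), bool(r))
print(f"Precision: {metrics.precision():.2f}, Recall: {metrics.recall():.2f}")
# -> Precision: 0.67, Recall: 0.50
```

Because only three integer counters are stored, the accumulator works for arbitrarily long streams, and you can read off metrics periodically without re-scanning history.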
Troubleshooting
- If precision_score or recall_score warns about ill-defined metrics, ensure your input arrays contain both positive and negative labels.
- For sparse or imbalanced retrieval results, pass zero_division=0 to the metric functions so an undefined metric silently evaluates to 0 instead of triggering an UndefinedMetricWarning.
- Verify that your ground truth and retrieved lists are aligned and of equal length.
precision = precision_score(ground_truth, retrieved, zero_division=0)
recall = recall_score(ground_truth, retrieved, zero_division=0)
f1 = f1_score(ground_truth, retrieved, zero_division=0)

output (with the same print statements as in the step-by-step example):
Precision: 0.67
Recall: 0.50
F1 Score: 0.57
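To see why zero_division matters, consider a retrieval pass that returns nothing at all: precision is then 0/0, which is undefined. A small sketch of that edge case:

```python
from sklearn.metrics import precision_score

ground_truth = [1, 0, 1, 1]
retrieved = [0, 0, 0, 0]  # nothing retrieved, so precision is 0/0

# With zero_division=0, the undefined precision evaluates to 0.0 silently;
# the default behavior would also return 0.0 but emit an
# UndefinedMetricWarning.
precision = precision_score(ground_truth, retrieved, zero_division=0)
print(f"Precision: {precision:.2f}")
# -> Precision: 0.00
```

This situation is common in sparse memory stores where many queries legitimately retrieve nothing.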
Key takeaways
- Use precision, recall, and F1 score to quantify memory retrieval accuracy.
- Leverage scikit-learn metrics for straightforward calculation with binary or multi-label data.
- Handle imbalanced or sparse data by setting zero_division=0 to avoid ill-defined metric warnings.