How-to · Beginner · 3 min read

Precision and recall for extraction tasks

Quick answer
Use precision and recall to evaluate AI extraction tasks by comparing predicted extractions against ground-truth labels. Precision is the fraction of predicted extractions that are correct; recall is the fraction of true extractions that the model actually found.
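Both definitions reduce to counting true positives (TP), false positives (FP), and false negatives (FN). As a minimal sketch with made-up labels, here is the same calculation done by hand:

```python
# Hand-computed example: precision = TP / (TP + FP), recall = TP / (TP + FN)
true_labels = [1, 0, 1, 1, 0]       # ground truth (made-up data)
predicted_labels = [1, 1, 1, 0, 0]  # model output (made-up data)

tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 0)

precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 3
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

This is exactly what scikit-learn computes for you, as the steps below show.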

PREREQUISITES

  • Python 3.8+
  • pip install scikit-learn
  • Basic knowledge of extraction tasks and evaluation metrics

Setup

Install scikit-learn for evaluation metrics and prepare your Python environment.

bash
pip install scikit-learn
output
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.25.0)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.11.1)

Step by step

Calculate precision and recall for an extraction task by comparing predicted and true labels. Use sklearn.metrics.precision_score and recall_score for binary or multilabel extraction results.

python
from sklearn.metrics import precision_score, recall_score

# Example: binary extraction task (1 = extracted entity, 0 = no entity)
true_labels = [1, 0, 1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 0, 1, 1]

precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
output
Precision: 0.75
Recall: 0.75

Common variations

  • For multilabel extraction, use average='micro' or average='macro' in precision_score and recall_score.
  • Use f1_score to balance precision and recall.
  • For large-scale extraction, compute metrics per entity type or field.
python
from sklearn.metrics import f1_score

# Multilabel example
true_multilabel = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pred_multilabel = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

precision_micro = precision_score(true_multilabel, pred_multilabel, average='micro')
recall_micro = recall_score(true_multilabel, pred_multilabel, average='micro')
f1 = f1_score(true_multilabel, pred_multilabel, average='micro')

print(f"Micro Precision: {precision_micro:.2f}")
print(f"Micro Recall: {recall_micro:.2f}")
print(f"Micro F1 Score: {f1:.2f}")
output
Micro Precision: 0.75
Micro Recall: 0.60
Micro F1 Score: 0.67
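The third variation above, per-entity-type metrics, can be sketched by passing average=None, which returns one score per label column instead of a single aggregate. The entity names used here (PERSON, ORG, DATE) are hypothetical placeholders for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Same multilabel data as above; each column is one entity type
# (PERSON, ORG, DATE are hypothetical names for illustration)
true_multilabel = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pred_multilabel = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

# average=None returns an array with one score per label column
per_label_precision = precision_score(
    true_multilabel, pred_multilabel, average=None, zero_division=0
)
per_label_recall = recall_score(
    true_multilabel, pred_multilabel, average=None, zero_division=0
)

for name, p, r in zip(["PERSON", "ORG", "DATE"], per_label_precision, per_label_recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

A per-type breakdown like this often reveals that a single weak entity type is dragging down an otherwise strong aggregate score.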

Troubleshooting

  • If precision or recall is zero, check that your predicted and true labels align in format and length.
  • For imbalanced multilabel data, use average='weighted' to weight each label's score by its support.
  • Ensure binary labels are 0/1 integers, or pass pos_label to identify the positive class; mismatched label formats can raise errors or produce misleading results.
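Relatedly, when the model predicts no positives at all, precision is undefined (0/0). scikit-learn's zero_division parameter controls what is returned in that case instead of emitting a warning. A minimal sketch with made-up labels:

```python
from sklearn.metrics import precision_score

# Degenerate case: the model predicts no entities at all (made-up data)
true_labels = [1, 0, 1]
predicted_labels = [0, 0, 0]

# With no predicted positives, precision is 0/0; zero_division sets
# the value returned for that undefined case
precision = precision_score(true_labels, predicted_labels, zero_division=0)
print(f"Precision: {precision:.2f}")  # Precision: 0.00
```

Seeing an exact 0.00 here is a hint to inspect the predictions themselves, not just the metric.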

Key Takeaways

  • Use precision_score and recall_score from scikit-learn to evaluate extraction accuracy.
  • Calculate metrics per entity type for detailed extraction performance analysis.
  • Adjust average parameter for multilabel or imbalanced extraction tasks.
  • Validate label formats to avoid metric calculation errors.
Verified 2026-04