ROUGE score for summarization
Quick answer
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics for evaluating summary quality by comparing generated summaries to reference texts, measuring overlapping n-grams and the longest common subsequence. Use Python libraries like rouge-score or datasets to calculate ROUGE metrics programmatically for summarization tasks.

Prerequisites
- Python 3.8+
- pip install rouge-score datasets
- Basic knowledge of text summarization
Setup
Install the required Python packages to compute ROUGE scores. The rouge-score library is a Python reimplementation of the original Perl ROUGE package, and the Hugging Face datasets library offers convenient evaluation utilities.
pip install rouge-score datasets

Output:
Collecting rouge-score
Collecting datasets
Successfully installed rouge-score-0.1.2 datasets-2.14.5
Step by step
Use the rouge_score package to calculate ROUGE-1, ROUGE-2, and ROUGE-L scores between a generated summary and a reference summary.
from rouge_score import rouge_scorer
# Reference summary (ground truth)
reference = "The cat sat on the mat and looked at the dog."
# Generated summary (model output)
generated = "The cat was sitting on the mat looking at the dog."
# Initialize scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Calculate scores
scores = scorer.score(reference, generated)
# Print results
for key, score in scores.items():
    print(f"{key}: Precision={score.precision:.3f}, Recall={score.recall:.3f}, F1={score.fmeasure:.3f}")

Output:
rouge1: Precision=0.818, Recall=0.818, F1=0.818
rouge2: Precision=0.600, Recall=0.600, F1=0.600
rougeL: Precision=0.818, Recall=0.818, F1=0.818
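To make the numbers concrete: ROUGE-1 is clipped unigram overlap, and ROUGE-L is based on the longest common subsequence (LCS). Below is a from-scratch sketch of both, not the library implementation; it does no stemming, so "looked" and "looking" don't match and the scores come out slightly lower than with use_stemmer=True.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, dropping sentence-final periods.
    return [t for t in text.lower().replace(".", " ").split() if t]

def prf(overlap, n_gen, n_ref):
    # Precision against the generated text, recall against the reference.
    p = overlap / n_gen if n_gen else 0.0
    r = overlap / n_ref if n_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def rouge1(reference, generated):
    ref, gen = Counter(tokenize(reference)), Counter(tokenize(generated))
    # Clipped overlap: each token counts at most as often as in the reference.
    overlap = sum((ref & gen).values())
    return prf(overlap, sum(gen.values()), sum(ref.values()))

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rougeL(reference, generated):
    ref, gen = tokenize(reference), tokenize(generated)
    return prf(lcs_length(ref, gen), len(gen), len(ref))

ref = "The cat sat on the mat and looked at the dog."
gen = "The cat was sitting on the mat looking at the dog."
# Without stemming, 8 of 11 tokens match, so ROUGE-1 and ROUGE-L
# both give P = R = F1 ≈ 0.727 here.
print("ROUGE-1 P/R/F1:", tuple(round(v, 3) for v in rouge1(ref, gen)))
print("ROUGE-L P/R/F1:", tuple(round(v, 3) for v in rougeL(ref, gen)))
```

The clipped-count and LCS logic here mirrors what rouge_scorer computes internally; the library additionally lowercases, strips punctuation, and optionally stems before counting.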
Common variations
You can also use the Hugging Face datasets library for batch evaluation and integration with other NLP tasks. It supports the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum metrics. Note that stemming is off by default here; pass use_stemmer=True to compute() to match the rouge_scorer example above.
Example with datasets:
from datasets import load_metric
rouge = load_metric('rouge')
references = ["The cat sat on the mat and looked at the dog."]
predictions = ["The cat was sitting on the mat looking at the dog."]
results = rouge.compute(predictions=predictions, references=references)
for key, value in results.items():
    print(f"{key}: {value.mid.fmeasure:.3f}")

Output:
rouge1: 0.727
rouge2: 0.500
rougeL: 0.727
rougeLsum: 0.727
Troubleshooting
- If you get a ModuleNotFoundError, ensure that rouge-score or datasets installed correctly.
- ROUGE scores depend heavily on tokenization; use_stemmer=True improves matching by reducing inflected forms (e.g. "looked", "looking") to a common stem.
- For multi-document evaluation, aggregate scores over all reference/prediction pairs for a reliable estimate.
Key Takeaways
- Use the rouge-score Python package for precise ROUGE metric calculations.
- The Hugging Face datasets library offers easy batch ROUGE evaluation with multiple metrics.
- Enable stemming (use_stemmer=True) in ROUGE scoring to improve matching accuracy.
- ROUGE evaluates n-gram overlap and longest common subsequence to assess summary quality.