ROUGE score for summarization
Quick answer
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics for evaluating summary quality by comparing generated summaries to reference texts, measuring overlapping n-grams and the longest common subsequence. Use Python libraries like rouge-score or datasets to calculate ROUGE metrics programmatically for summarization tasks.

Prerequisites
- Python 3.8+
- pip install rouge-score datasets
- Basic knowledge of text summarization
Setup
Install the required Python packages to compute ROUGE scores. The rouge-score library is a Python reimplementation of the original Perl ROUGE package, and the Hugging Face datasets library offers convenient evaluation utilities.
pip install rouge-score datasets

Output:
Collecting rouge-score
Collecting datasets
Successfully installed rouge-score-0.1.2 datasets-2.14.5
Step by step
Use the rouge_score package to calculate ROUGE-1, ROUGE-2, and ROUGE-L scores between a generated summary and a reference summary.
from rouge_score import rouge_scorer
# Reference summary (ground truth)
reference = "The cat sat on the mat and looked at the dog."
# Generated summary (model output)
generated = "The cat was sitting on the mat looking at the dog."
# Initialize scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Calculate scores
scores = scorer.score(reference, generated)
# Print results
for key, score in scores.items():
    print(f"{key}: Precision={score.precision:.3f}, Recall={score.recall:.3f}, F1={score.fmeasure:.3f}")

Output:
rouge1: Precision=0.818, Recall=0.818, F1=0.818
rouge2: Precision=0.600, Recall=0.600, F1=0.600
rougeL: Precision=0.818, Recall=0.818, F1=0.818
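To make the numbers concrete: ROUGE-1 is clipped unigram overlap, and ROUGE-L is based on the longest common subsequence (LCS). Below is a from-scratch sketch of both, not the library implementation; it does no stemming, so "looked" and "looking" don't match and the scores come out slightly lower than with use_stemmer=True.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, dropping sentence-final periods.
    return [t for t in text.lower().replace(".", " ").split() if t]

def prf(overlap, n_gen, n_ref):
    # Precision against the generated text, recall against the reference.
    p = overlap / n_gen if n_gen else 0.0
    r = overlap / n_ref if n_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def rouge1(reference, generated):
    ref, gen = Counter(tokenize(reference)), Counter(tokenize(generated))
    # Clipped overlap: each token counts at most as often as in the reference.
    overlap = sum((ref & gen).values())
    return prf(overlap, sum(gen.values()), sum(ref.values()))

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rougeL(reference, generated):
    ref, gen = tokenize(reference), tokenize(generated)
    return prf(lcs_length(ref, gen), len(gen), len(ref))

ref = "The cat sat on the mat and looked at the dog."
gen = "The cat was sitting on the mat looking at the dog."
# Without stemming, 8 of 11 tokens match, so ROUGE-1 and ROUGE-L
# both give P = R = F1 ≈ 0.727 here.
print("ROUGE-1 P/R/F1:", tuple(round(v, 3) for v in rouge1(ref, gen)))
print("ROUGE-L P/R/F1:", tuple(round(v, 3) for v in rougeL(ref, gen)))
```

The clipped-count and LCS logic here mirrors what rouge_scorer computes internally; the library additionally lowercases, strips punctuation, and optionally stems before counting.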
Common variations
You can also use the Hugging Face datasets library for batch evaluation and integration with other NLP tasks. It supports the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum metrics. Note that stemming is off by default here; pass use_stemmer=True to compute() to match the rouge_scorer example above.
Example with datasets:
from datasets import load_metric
rouge = load_metric('rouge')
references = ["The cat sat on the mat and looked at the dog."]
predictions = ["The cat was sitting on the mat looking at the dog."]
results = rouge.compute(predictions=predictions, references=references)
for key, value in results.items():
    print(f"{key}: {value.mid.fmeasure:.3f}")

Output:
rouge1: 0.727
rouge2: 0.500
rougeL: 0.727
rougeLsum: 0.727
Troubleshooting
- If you get a ModuleNotFoundError, ensure that rouge-score or datasets installed correctly.
- ROUGE scores depend heavily on tokenization; use_stemmer=True improves matching by reducing inflected forms (e.g. "looked", "looking") to a common stem.
- For multi-document evaluation, aggregate scores over all reference/prediction pairs for a reliable estimate.
Key Takeaways
- Use the rouge-score Python package for precise ROUGE metric calculations.
- The Hugging Face datasets library offers easy batch ROUGE evaluation with multiple metrics.
- Enable stemming (use_stemmer=True) in ROUGE scoring to improve matching accuracy.
- ROUGE evaluates n-gram overlap and longest common subsequence to assess summary quality.