What is ROUGE score in NLP
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used in Natural Language Processing (NLP) to evaluate the quality of generated text, especially in summarization, by measuring its overlap with one or more reference texts. It compares n-grams, word sequences, and word pairs between candidate and reference summaries to quantify similarity.
How it works
ROUGE works by comparing the generated text (candidate) against one or more reference texts to measure how much content overlaps. It focuses on recall-oriented metrics, meaning it checks how much of the reference content is captured by the candidate. The most common variants are ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigram overlap). Think of it like checking how many words or phrases from a model's summary appear in a human-written summary, similar to grading a student's paraphrase by matching key phrases.
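Of the three variants, ROUGE-L is the least obvious to compute by hand. A minimal sketch (a simplified illustration, not a full library implementation) is to find the longest common subsequence with dynamic programming and divide by the reference length:

```python
def lcs_length(a, b):
    # Classic dynamic-programming LCS table: dp[i][j] is the LCS length
    # of the first i tokens of a and the first j tokens of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# The LCS here is "the cat on the mat" (5 tokens), so recall = 5/6.
rouge_l_recall = lcs_length(candidate, reference) / len(reference)
print(f"ROUGE-L recall: {rouge_l_recall:.2f}")  # ROUGE-L recall: 0.83
```

Because LCS preserves word order without requiring adjacency, ROUGE-L rewards candidates that keep the reference's sentence structure even when extra words are inserted.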
Concrete example
Suppose you have a reference summary and a candidate summary:
Reference: "The cat sat on the mat."
Candidate: "The cat is sitting on the mat."
Using ROUGE-1 (unigram overlap), count overlapping words:
```python
from collections import Counter

# Strip punctuation so "mat." and "mat" count as the same token.
reference = "The cat sat on the mat.".lower().replace(".", "").split()
candidate = "The cat is sitting on the mat.".lower().replace(".", "").split()

# Calculate ROUGE-1 recall with clipped counts: each word matches at most
# as often as it appears in the reference ("the" counts twice here).
overlap = sum((Counter(reference) & Counter(candidate)).values())
recall = overlap / len(reference)
print(f"ROUGE-1 recall: {recall:.2f}")  # ROUGE-1 recall: 0.83
```

Five of the six reference words ("the" twice, "cat", "on", "mat") appear in the candidate, giving a recall of 5/6 ≈ 0.83.
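The same counting generalizes to longer n-grams. A sketch of ROUGE-2 (bigram overlap) on the same pair of sentences:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

ref_bigrams = ngrams(reference, 2)
cand_bigrams = ngrams(candidate, 2)

# Clipped overlap: ("the","cat"), ("on","the"), ("the","mat") match -> 3 of 5.
overlap = sum((ref_bigrams & cand_bigrams).values())
recall = overlap / sum(ref_bigrams.values())
print(f"ROUGE-2 recall: {recall:.2f}")  # ROUGE-2 recall: 0.60
```

Note how the score drops from 0.83 to 0.60: bigrams penalize the reordering introduced by "is sitting", which unigram overlap cannot see.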
When to use it
Use ROUGE scores to evaluate tasks like text summarization, machine translation, or any generation task where you want to measure how well the output matches a reference. It is best when you have high-quality reference texts and want a quantitative metric for content overlap. Avoid using ROUGE for creative generation tasks where exact overlap is less meaningful, such as open-ended dialogue or story generation.
Key terms
| Term | Definition |
|---|---|
| ROUGE-N | Measures overlap of n-grams between candidate and reference. |
| ROUGE-L | Measures longest common subsequence to capture sentence-level structure. |
| ROUGE-S | Measures skip-bigram overlap allowing gaps between words. |
| Recall | Fraction of reference units found in candidate text. |
| Precision | Fraction of candidate units found in reference text. |
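Recall and precision from the table are usually combined into an F-measure when reporting results. A sketch on the earlier example sentences (the harmonic-mean formula here is the standard F1, which common ROUGE implementations report):

```python
from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# Clipped unigram overlap: "the" (x2), "cat", "on", "mat" -> 5.
overlap = sum((Counter(reference) & Counter(candidate)).values())

recall = overlap / len(reference)     # 5/6: reference words found in candidate
precision = overlap / len(candidate)  # 5/7: candidate words found in reference
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.71 R=0.83 F1=0.77
```

Reporting F1 guards against gaming the metric: a very long candidate can inflate recall, but its precision, and hence its F1, will fall.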
Key Takeaways
- ROUGE is a recall-focused metric comparing generated text to reference summaries using n-gram and sequence overlap.
- ROUGE-1, ROUGE-L, and ROUGE-S are common variants measuring different types of overlap.
- Use ROUGE to evaluate summarization and translation quality when reference texts are available.
- ROUGE is less suitable for creative or open-ended generation tasks without fixed references.