What is ROUGE score in NLP
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used in Natural Language Processing (NLP) to evaluate the quality of generated text, especially in summarization, by measuring its overlap with one or more reference texts. It compares n-grams, word sequences, and word pairs between candidate and reference summaries to quantify similarity.
How it works
ROUGE works by comparing the generated text (candidate) against one or more reference texts to measure how much content overlaps. It focuses on recall-oriented metrics, meaning it checks how much of the reference content is captured by the candidate. The most common variants are ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigram overlap). Think of it like checking how many words or phrases from a model's summary appear in a human-written summary, similar to grading a student's paraphrase by matching key phrases.
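Of the three variants, ROUGE-L is the least obvious to compute by hand. A minimal sketch (a simplified illustration, not a full library implementation) is to find the longest common subsequence with dynamic programming and divide by the reference length:

```python
def lcs_length(a, b):
    # Classic dynamic-programming LCS table: dp[i][j] is the LCS length
    # of the first i tokens of a and the first j tokens of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# The LCS here is "the cat on the mat" (5 tokens), so recall = 5/6.
rouge_l_recall = lcs_length(candidate, reference) / len(reference)
print(f"ROUGE-L recall: {rouge_l_recall:.2f}")  # ROUGE-L recall: 0.83
```

Because LCS preserves word order without requiring adjacency, ROUGE-L rewards candidates that keep the reference's sentence structure even when extra words are inserted.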
Concrete example
Suppose you have a reference summary and a candidate summary:
Reference: "The cat sat on the mat."
Candidate: "The cat is sitting on the mat."
Using ROUGE-1 (unigram overlap), count overlapping words:
```python
from collections import Counter

# Strip punctuation so "mat." and "mat" count as the same token.
reference = "The cat sat on the mat.".lower().replace(".", "").split()
candidate = "The cat is sitting on the mat.".lower().replace(".", "").split()

# Calculate ROUGE-1 recall with clipped counts: each word matches at most
# as often as it appears in the reference ("the" counts twice here).
overlap = sum((Counter(reference) & Counter(candidate)).values())
recall = overlap / len(reference)
print(f"ROUGE-1 recall: {recall:.2f}")  # ROUGE-1 recall: 0.83
```

Five of the six reference words ("the" twice, "cat", "on", "mat") appear in the candidate, giving a recall of 5/6 ≈ 0.83.
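The same counting generalizes to longer n-grams. A sketch of ROUGE-2 (bigram overlap) on the same pair of sentences:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

ref_bigrams = ngrams(reference, 2)
cand_bigrams = ngrams(candidate, 2)

# Clipped overlap: ("the","cat"), ("on","the"), ("the","mat") match -> 3 of 5.
overlap = sum((ref_bigrams & cand_bigrams).values())
recall = overlap / sum(ref_bigrams.values())
print(f"ROUGE-2 recall: {recall:.2f}")  # ROUGE-2 recall: 0.60
```

Note how the score drops from 0.83 to 0.60: bigrams penalize the reordering introduced by "is sitting", which unigram overlap cannot see.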
When to use it
Use ROUGE scores to evaluate tasks like text summarization, machine translation, or any generation task where you want to measure how well the output matches a reference. It is best when you have high-quality reference texts and want a quantitative metric for content overlap. Avoid using ROUGE for creative generation tasks where exact overlap is less meaningful, such as open-ended dialogue or story generation.
Key terms
| Term | Definition |
|---|---|
| ROUGE-N | Measures overlap of n-grams between candidate and reference. |
| ROUGE-L | Measures longest common subsequence to capture sentence-level structure. |
| ROUGE-S | Measures skip-bigram overlap allowing gaps between words. |
| Recall | Fraction of reference units found in candidate text. |
| Precision | Fraction of candidate units found in reference text. |
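Recall and precision from the table are usually combined into an F-measure when reporting results. A sketch on the earlier example sentences (the harmonic-mean formula here is the standard F1, which common ROUGE implementations report):

```python
from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# Clipped unigram overlap: "the" (x2), "cat", "on", "mat" -> 5.
overlap = sum((Counter(reference) & Counter(candidate)).values())

recall = overlap / len(reference)     # 5/6: reference words found in candidate
precision = overlap / len(candidate)  # 5/7: candidate words found in reference
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.71 R=0.83 F1=0.77
```

Reporting F1 guards against gaming the metric: a very long candidate can inflate recall, but its precision, and hence its F1, will fall.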
Key Takeaways
- ROUGE is a recall-focused metric comparing generated text to reference summaries using n-gram and sequence overlap.
- ROUGE-1, ROUGE-L, and ROUGE-S are common variants measuring different types of overlap.
- Use ROUGE to evaluate summarization and translation quality when reference texts are available.
- ROUGE is less suitable for creative or open-ended generation tasks without fixed references.