What is the BLEU score in NLP?
BLEU (Bilingual Evaluation Understudy) is an automatic metric in natural language processing that evaluates the quality of machine-generated text by comparing it to one or more human reference texts. It measures n-gram overlap precision, combined with a brevity penalty, to assess how closely the output matches the references.
How it works
The BLEU score works by comparing the n-grams (contiguous sequences of words) in a candidate translation against one or more reference translations. It calculates the precision of these n-grams, meaning the proportion of n-grams in the candidate that appear in the references. To avoid favoring overly short translations, it applies a brevity penalty if the candidate is shorter than the reference. Think of it like grading a student’s essay by checking how many phrases match a model answer, but penalizing if the essay is too short to be complete.
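The mechanics above can be sketched in plain Python, with no NLP library. This is a minimal illustration of modified (clipped) n-gram precision and the brevity penalty; the function names are made up for this example and do not come from any library:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped to how often it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(len(candidate) - n + 1, 1)

def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if candidate is at least reference length,
    # otherwise exp(1 - reference_length / candidate_length)
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = exp(1 - len(reference) / len(candidate))
    return bp * geo_mean
```

For example, `bleu("the cat is on the mat".split(), "there is a cat on the mat".split())` combines unigram precision 5/6, bigram precision 2/5, and a brevity penalty of exp(1 − 7/6), giving about 0.49.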
Concrete example
Suppose we have a candidate translation and a reference translation:
- Candidate: "The cat is on the mat"
- Reference: "There is a cat on the mat"
We calculate unigram (1-gram) precision by counting matching words:
```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['There', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['The', 'cat', 'is', 'on', 'the', 'mat']

# weights=(1, 0, 0, 0) scores unigram precision only; the default
# weights (up to 4-grams) would return a near-zero score here
# because the candidate shares no 4-gram with the reference.
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(f"BLEU score: {score:.4f}")  # BLEU score: 0.7054
```

Five of the six candidate unigrams match (matching is case-sensitive, so "The" does not match "the"), giving a precision of 5/6, which is then multiplied by a brevity penalty of exp(1 − 7/6) ≈ 0.846.
| N-gram | Candidate n-grams | Reference n-grams | Matches |
|---|---|---|---|
| 1-gram | "The", "cat", "is", "on", "the", "mat" | "There", "is", "a", "cat", "on", "the", "mat" | "cat", "is", "on", "the", "mat" |
| 2-gram | "The cat", "cat is", "is on", "on the", "the mat" | "There is", "is a", "a cat", "cat on", "on the", "the mat" | "on the", "the mat" |
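The match counts in the table can be verified with `collections.Counter`; the `ngram_matches` helper below is illustrative, not part of any library:

```python
from collections import Counter

candidate = "The cat is on the mat".split()
reference = "There is a cat on the mat".split()

def ngram_matches(cand, ref, n):
    """Clipped count of candidate n-grams that also occur in the reference."""
    c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    return sum(min(count, r[gram]) for gram, count in c.items())

print(ngram_matches(candidate, reference, 1))  # 5 unigram matches
print(ngram_matches(candidate, reference, 2))  # 2 bigram matches
```

Note that capitalized "The" does not match lowercase "the", which is why only five unigrams match rather than six.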
When to use it
Use BLEU score when you need an automatic, quick, and language-agnostic metric to evaluate machine translation or text generation quality against human references. It is best suited for tasks where n-gram overlap correlates with quality, such as machine translation or summarization. Avoid using BLEU for tasks requiring semantic understanding beyond surface word overlap or for very short texts where n-gram statistics are unreliable.
Key terms
| Term | Definition |
|---|---|
| BLEU | Bilingual Evaluation Understudy, an automatic metric for evaluating text quality by n-gram overlap. |
| N-gram | A contiguous sequence of n words in text, e.g., unigram (1), bigram (2), trigram (3). |
| Precision | The fraction of candidate n-grams that appear in the reference text(s). |
| Brevity penalty | A penalty applied to candidate translations shorter than references to discourage overly short outputs. |
Key takeaways
- BLEU score measures n-gram precision between candidate and reference texts with a brevity penalty.
- It is widely used for automatic evaluation of machine translation quality.
- BLEU works best when multiple reference translations are available for comparison.
- Avoid BLEU for tasks needing deep semantic evaluation or very short text outputs.