Concept Beginner · 3 min read

What is the BLEU score in NLP?

Quick answer
The BLEU (Bilingual Evaluation Understudy) score is an automatic evaluation metric in natural language processing that measures how closely machine-generated text matches one or more human reference texts. It combines n-gram overlap precision with a brevity penalty that penalizes outputs shorter than the reference.

How it works

The BLEU score compares the n-grams (contiguous sequences of words) in a candidate translation against one or more reference translations. For each n-gram size it computes a clipped precision: the proportion of candidate n-grams that also appear in the references, where each n-gram is counted at most as many times as it occurs in a reference. Because precision alone would favor overly short candidates, BLEU multiplies the combined precision by a brevity penalty when the candidate is shorter than the reference. Think of it like grading a student's essay by checking how many phrases match a model answer, while penalizing an essay too short to be complete.
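The two ingredients described above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the helper names are ours:

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[ng]) for ng, count in cand.items())
    return matches / max(sum(cand.values()), 1)

def brevity_penalty(candidate, reference):
    """1 if the candidate is at least as long as the reference,
    otherwise exp(1 - r/c) where r, c are reference/candidate lengths."""
    c, r = len(candidate), len(reference)
    return 1.0 if c >= r else math.exp(1 - r / c)

candidate = "The cat is on the mat".split()
reference = "There is a cat on the mat".split()

p1 = ngram_precision(candidate, reference, 1)  # 5 of 6 unigrams match (case-sensitive)
bp = brevity_penalty(candidate, reference)     # candidate has 6 words, reference 7
print(f"unigram precision: {p1:.4f}, brevity penalty: {bp:.4f}")
```

Note that matching here is case-sensitive, so "The" does not match "the" in the reference.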

Concrete example

Suppose we have a candidate translation and a reference translation:

  • Candidate: "The cat is on the mat"
  • Reference: "There is a cat on the mat"

We calculate unigram (1-gram) precision by counting matching words:

python
from nltk.translate.bleu_score import sentence_bleu

reference = [['There', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['The', 'cat', 'is', 'on', 'the', 'mat']

# The default weights average 1- through 4-gram precision; this pair
# shares no 4-grams, so the default score collapses to ~0. Weighting
# only unigrams and bigrams gives a more informative score here.
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print(f"BLEU score: {score:.4f}")
output
BLEU score: 0.4887
  N-gram | Candidate n-grams                                 | Reference n-grams                                          | Matches
  1-gram | "The", "cat", "is", "on", "the", "mat"            | "There", "is", "a", "cat", "on", "the", "mat"              | "cat", "is", "on", "the", "mat"
  2-gram | "The cat", "cat is", "is on", "on the", "the mat" | "There is", "is a", "a cat", "cat on", "on the", "the mat" | "on the", "the mat"
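The counts in the table can be combined into a final score by hand. A sketch assuming equal unigram and bigram weights and case-sensitive matching:

```python
import math

# Counts from the n-gram table (case-sensitive matching)
p1 = 5 / 6   # 5 of 6 candidate unigrams appear in the reference
p2 = 2 / 5   # "on the" and "the mat" match, out of 5 candidate bigrams
bp = math.exp(1 - 7 / 6)  # candidate (6 words) is shorter than reference (7)

# BLEU = BP * exp(sum of weighted log precisions); here both weights are 0.5
bleu = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(f"{bleu:.4f}")
```

The geometric mean of the precisions, scaled by the brevity penalty, yields roughly 0.49 for this pair.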

When to use it

Use BLEU score when you need an automatic, quick, and language-agnostic metric to evaluate machine translation or text generation quality against human references. It is best suited for tasks where n-gram overlap correlates with quality, such as machine translation or summarization. Avoid using BLEU for tasks requiring semantic understanding beyond surface word overlap or for very short texts where n-gram statistics are unreliable.
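In practice BLEU is usually reported at the corpus level rather than per sentence, because n-gram counts pooled over many sentences are more stable. A sketch using NLTK's corpus_bleu on a hypothetical two-sentence test set (the sentences and weights here are our own illustration):

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of references per hypothesis; a hypothesis may have
# several references (sentence 1 has two, sentence 2 has one)
references = [
    [['the', 'cat', 'sat', 'on', 'the', 'mat'],
     ['a', 'cat', 'was', 'sitting', 'on', 'the', 'mat']],
    [['he', 'read', 'the', 'book']],
]
hypotheses = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['he', 'read', 'a', 'book'],
]

# corpus_bleu pools n-gram counts across all sentences before combining
# them, rather than averaging per-sentence scores
score = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))
print(f"Corpus BLEU: {score:.4f}")
```

Pooling rewards the exact match in sentence 1 while still reflecting the "a"/"the" mismatch in sentence 2.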

Key terms

  Term            | Definition
  BLEU            | Bilingual Evaluation Understudy, an automatic metric for evaluating text quality by n-gram overlap.
  N-gram          | A contiguous sequence of n words in text, e.g., unigram (1), bigram (2), trigram (3).
  Precision       | The fraction of candidate n-grams that appear in the reference text(s).
  Brevity penalty | A penalty applied to candidate translations shorter than references to discourage overly short outputs.

Key takeaways

  • BLEU score measures n-gram precision between candidate and reference texts with a brevity penalty.
  • It is widely used for automatic evaluation of machine translation quality.
  • BLEU works best when multiple reference translations are available for comparison.
  • Avoid BLEU for tasks needing deep semantic evaluation or very short text outputs.