How-to · Beginner · 3 min read

BLEU score for translation evaluation

Quick answer
The BLEU score is a standard metric for machine translation quality that measures n-gram overlap between a candidate translation and one or more reference translations. Use nltk.translate.bleu_score to compute BLEU scores programmatically in Python.

PREREQUISITES

  • Python 3.8+
  • pip install nltk
  • Basic knowledge of machine translation evaluation

Setup

Install the nltk library. The punkt tokenizer data is only needed if you tokenize with nltk.word_tokenize; the BLEU functions themselves require no extra downloads.

bash
pip install nltk

python -m nltk.downloader punkt
output
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.8.1

[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Step by step

This example shows how to compute the BLEU score for a candidate translation against one or more reference translations using nltk.translate.bleu_score.

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Candidate translation (hypothesis)
candidate = "The cat is on the mat".split()

# Reference translations (ground truth)
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split()
]

# Use smoothing to handle short sentences
smooth_fn = SmoothingFunction().method1

# Calculate BLEU score
bleu_score = sentence_bleu(references, candidate, smoothing_function=smooth_fn)

print(f"BLEU score: {bleu_score:.4f}")
output
BLEU score: 0.4671
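
SmoothingFunction provides several smoothing strategies (method1 through method7, from Chen & Cherry's survey); on the same sentence pair, different methods yield somewhat different scores. A quick comparison, reusing the sentences from the example above:

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "The cat is on the mat".split()
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split(),
]

sf = SmoothingFunction()
scores = {}
# Compare a few smoothing strategies on the same inputs
for name, fn in [("method1", sf.method1), ("method2", sf.method2), ("method4", sf.method4)]:
    scores[name] = sentence_bleu(references, candidate, smoothing_function=fn)
    print(f"{name}: {scores[name]:.4f}")

Pick one method and use it consistently across a comparison; mixing smoothing methods makes scores incomparable.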

Common variations

  • Use different n-gram weights to emphasize unigram, bigram, or higher order matches.
  • Calculate corpus-level BLEU for multiple sentences using corpus_bleu.
  • Use other smoothing methods from SmoothingFunction to improve score stability on short texts.
  • Integrate BLEU calculation in translation pipelines or evaluation scripts.
The corpus_bleu function takes a list of reference lists (one list of references per candidate) and a list of candidate sentences, aggregating n-gram statistics before computing the score:

python
from nltk.translate.bleu_score import corpus_bleu

# Multiple candidate sentences
candidates = ["The cat is on the mat".split(), "There is a dog".split()]

# Corresponding references
references = [
    ["There is a cat on the mat".split(), "A cat sits on the mat".split()],
    ["There is a dog".split()]
]

corpus_score = corpus_bleu(references, candidates)
print(f"Corpus BLEU score: {corpus_score:.4f}")
output
Corpus BLEU score: 0.6389
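
The weights parameter controls how much each n-gram order contributes to the score. For example, putting all weight on unigrams gives BLEU-1, and splitting weight between unigrams and bigrams gives BLEU-2 (reusing the sentence pair from the step-by-step example):

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "The cat is on the mat".split()
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split(),
]
smooth_fn = SmoothingFunction().method1

# BLEU-1: all weight on unigram precision
bleu_1 = sentence_bleu(references, candidate,
                       weights=(1, 0, 0, 0), smoothing_function=smooth_fn)

# BLEU-2: equal weight on unigram and bigram precision
bleu_2 = sentence_bleu(references, candidate,
                       weights=(0.5, 0.5, 0, 0), smoothing_function=smooth_fn)

print(f"BLEU-1: {bleu_1:.4f}")
print(f"BLEU-2: {bleu_2:.4f}")

Higher-order n-grams are harder to match, so BLEU-2 is lower than BLEU-1 here; the default weights are (0.25, 0.25, 0.25, 0.25).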

Troubleshooting

  • If you get a score of 0 or a warning about 0 counts of n-gram overlaps (older nltk versions raised ZeroDivisionError), pass a smoothing function such as SmoothingFunction().method1.
  • Ensure candidate and reference texts are tokenized consistently (e.g., using str.split() or nltk.word_tokenize).
  • BLEU scores range from 0 to 1; very low scores may indicate poor translation or tokenization mismatch.
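
To see why consistent tokenization matters, here is a toy example where the reference splits punctuation into its own token but one candidate keeps it attached. The texts are identical; only the tokenization differs, yet the scores diverge:

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth_fn = SmoothingFunction().method1

# Reference tokenized with punctuation as a separate token
reference = ["The", "cat", "is", "on", "the", "mat", "."]

# Same tokenization as the reference
candidate_good = ["The", "cat", "is", "on", "the", "mat", "."]

# Punctuation left attached ("mat." as one token) -- a tokenization mismatch
candidate_bad = ["The", "cat", "is", "on", "the", "mat."]

score_good = sentence_bleu([reference], candidate_good, smoothing_function=smooth_fn)
score_bad = sentence_bleu([reference], candidate_bad, smoothing_function=smooth_fn)

print(f"Consistent tokenization: {score_good:.4f}")
print(f"Mismatched tokenization: {score_bad:.4f}")

The mismatch loses the "mat" and "." matches and triggers the brevity penalty, so the score drops even though the underlying text is identical.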

Key Takeaways

  • Use nltk.translate.bleu_score to calculate BLEU scores for translation evaluation in Python.
  • Apply smoothing functions to handle short sentences and avoid zero BLEU scores.
  • Corpus-level BLEU aggregates multiple sentence scores for overall translation quality.
  • Consistent tokenization of candidate and reference texts is critical for accurate BLEU calculation.
Verified 2026-04