How-to · Beginner · 3 min read

BLEU score for translation evaluation

Quick answer
The BLEU score is a standard metric for machine translation quality that measures n-gram overlap between a candidate translation and one or more reference translations. Use nltk.translate.bleu_score to compute BLEU scores programmatically in Python.

PREREQUISITES

  • Python 3.8+
  • pip install nltk
  • Basic knowledge of machine translation evaluation

Setup

Install the nltk library. The punkt tokenizer data is only needed if you tokenize with nltk.word_tokenize; the BLEU functions themselves require no extra downloads.

bash
pip install nltk

python -m nltk.downloader punkt
output
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.8.1

[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Step by step

This example shows how to compute the BLEU score for a candidate translation against one or more reference translations using nltk.translate.bleu_score.

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Candidate translation (hypothesis)
candidate = "The cat is on the mat".split()

# Reference translations (ground truth)
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split()
]

# Use smoothing to handle short sentences
smooth_fn = SmoothingFunction().method1

# Calculate BLEU score
bleu_score = sentence_bleu(references, candidate, smoothing_function=smooth_fn)

print(f"BLEU score: {bleu_score:.4f}")
output
BLEU score: 0.4671
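
SmoothingFunction provides several smoothing strategies (method1 through method7, from Chen & Cherry's survey); on the same sentence pair, different methods yield somewhat different scores. A quick comparison, reusing the sentences from the example above:

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "The cat is on the mat".split()
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split(),
]

sf = SmoothingFunction()
scores = {}
# Compare a few smoothing strategies on the same inputs
for name, fn in [("method1", sf.method1), ("method2", sf.method2), ("method4", sf.method4)]:
    scores[name] = sentence_bleu(references, candidate, smoothing_function=fn)
    print(f"{name}: {scores[name]:.4f}")

Pick one method and use it consistently across a comparison; mixing smoothing methods makes scores incomparable.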

Common variations

  • Use different n-gram weights to emphasize unigram, bigram, or higher order matches.
  • Calculate corpus-level BLEU for multiple sentences using corpus_bleu.
  • Use other smoothing methods from SmoothingFunction to improve score stability on short texts.
  • Integrate BLEU calculation in translation pipelines or evaluation scripts.
The corpus_bleu function takes a list of reference lists (one list of references per candidate) and a list of candidate sentences, aggregating n-gram statistics before computing the score:

python
from nltk.translate.bleu_score import corpus_bleu

# Multiple candidate sentences
candidates = ["The cat is on the mat".split(), "There is a dog".split()]

# Corresponding references
references = [
    ["There is a cat on the mat".split(), "A cat sits on the mat".split()],
    ["There is a dog".split()]
]

corpus_score = corpus_bleu(references, candidates)
print(f"Corpus BLEU score: {corpus_score:.4f}")
output
Corpus BLEU score: 0.6389
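
The weights parameter controls how much each n-gram order contributes to the score. For example, putting all weight on unigrams gives BLEU-1, and splitting weight between unigrams and bigrams gives BLEU-2 (reusing the sentence pair from the step-by-step example):

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "The cat is on the mat".split()
references = [
    "There is a cat on the mat".split(),
    "A cat sits on the mat".split(),
]
smooth_fn = SmoothingFunction().method1

# BLEU-1: all weight on unigram precision
bleu_1 = sentence_bleu(references, candidate,
                       weights=(1, 0, 0, 0), smoothing_function=smooth_fn)

# BLEU-2: equal weight on unigram and bigram precision
bleu_2 = sentence_bleu(references, candidate,
                       weights=(0.5, 0.5, 0, 0), smoothing_function=smooth_fn)

print(f"BLEU-1: {bleu_1:.4f}")
print(f"BLEU-2: {bleu_2:.4f}")

Higher-order n-grams are harder to match, so BLEU-2 is lower than BLEU-1 here; the default weights are (0.25, 0.25, 0.25, 0.25).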

Troubleshooting

  • If you get a score of 0 or a warning about 0 counts of n-gram overlaps (older nltk versions raised ZeroDivisionError), pass a smoothing function such as SmoothingFunction().method1.
  • Ensure candidate and reference texts are tokenized consistently (e.g., using str.split() or nltk.word_tokenize).
  • BLEU scores range from 0 to 1; very low scores may indicate poor translation or tokenization mismatch.
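
To see why consistent tokenization matters, here is a toy example where the reference splits punctuation into its own token but one candidate keeps it attached. The texts are identical; only the tokenization differs, yet the scores diverge:

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth_fn = SmoothingFunction().method1

# Reference tokenized with punctuation as a separate token
reference = ["The", "cat", "is", "on", "the", "mat", "."]

# Same tokenization as the reference
candidate_good = ["The", "cat", "is", "on", "the", "mat", "."]

# Punctuation left attached ("mat." as one token) -- a tokenization mismatch
candidate_bad = ["The", "cat", "is", "on", "the", "mat."]

score_good = sentence_bleu([reference], candidate_good, smoothing_function=smooth_fn)
score_bad = sentence_bleu([reference], candidate_bad, smoothing_function=smooth_fn)

print(f"Consistent tokenization: {score_good:.4f}")
print(f"Mismatched tokenization: {score_bad:.4f}")

The mismatch loses the "mat" and "." matches and triggers the brevity penalty, so the score drops even though the underlying text is identical.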

Key Takeaways

  • Use nltk.translate.bleu_score to calculate BLEU scores for translation evaluation in Python.
  • Apply smoothing functions to handle short sentences and avoid zero BLEU scores.
  • Corpus-level BLEU aggregates multiple sentence scores for overall translation quality.
  • Consistent tokenization of candidate and reference texts is critical for accurate BLEU calculation.
Verified 2026-04