How is BLEU score calculated?

BLEU multiplies a brevity penalty by the geometric mean of modified n-gram precisions: BLEU = BP · exp(Σ wₙ·ln pₙ), with uniform weights wₙ = 1/N. Each pₙ is the count of candidate n-grams that appear in a reference (clipped to the reference's count) divided by the total candidate n-grams. BLEU-4 uses orders 1 through 4. This is the method from Papineni et al. (2002).

What is a good BLEU score?

On the 0–100 scale, under 10 is almost useless, 10–29 conveys the gist with significant errors, 30–40 is understandable to good, 40–50 is high quality, 50–60 is very high quality, and above 60 is often at or beyond human quality (Google AutoML Translation guidance). Scores are comparable only under identical tokenisation and the same number of references.

What is the brevity penalty in BLEU?

The brevity penalty (BP) stops a system from gaming precision by emitting very short output. With candidate length c and the closest reference length r, BP = 1 when c > r, otherwise BP = exp(1 − r/c). A candidate shorter than the reference is multiplied down; one at least as long is not penalised.
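
A minimal Python sketch of this rule, assuming the token counts c and r are already known. The function name `brevity_penalty` and the guard for an empty candidate are illustrative assumptions, not taken from the tool.

```python
import math

def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and closest reference length r."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    if c > r:
        return 1.0  # longer than the reference: no penalty
    return math.exp(1 - r / c)  # equals 1.0 when c == r, below 1.0 when c < r

print(round(brevity_penalty(5, 6), 4))  # 0.8187
print(brevity_penalty(6, 6))            # 1.0
```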

What is modified n-gram precision and clipping?

Plain precision would let a candidate score 1.0 by repeating one correct word. Modified precision clips each candidate n-gram's count to the maximum times it appears in any single reference. For candidate 'the the the' against a reference with two 'the's, the clipped count is 2, not 3, so p1 = 2/3 instead of 3/3.
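
The difference can be sketched in a few lines of Python. The reference sentence here is hypothetical, chosen only because it contains 'the' twice; `Counter` intersection is one convenient way to clip, not necessarily how the tool does it.

```python
from collections import Counter

candidate = "the the the".split()
reference = "the cat sat near the door".split()  # hypothetical reference with two 'the's

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Counter intersection keeps the minimum count per token, which is the clipped count.
clipped = cand_counts & ref_counts

# Plain precision: every candidate token found in the reference counts in full.
plain_p1 = sum(1 for tok in candidate if tok in ref_counts) / len(candidate)
# Modified precision: credit is capped at the reference count.
modified_p1 = sum(clipped.values()) / len(candidate)

print(plain_p1)     # 1.0
print(modified_p1)  # 0.6666666666666666
```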

Why is my BLEU score 0 for a short sentence?

BLEU-4 needs at least one matching n-gram at every order from 1 to 4. A short candidate, even one that matches the reference exactly, may contain no 4-grams at all, making p4 = 0, which zeroes the geometric mean and the whole score. Turn on ε smoothing (NLTK method 1) or lower the max n-gram order N to get a meaningful number for short text.

Can I use more than one reference?

Yes. Put one reference per line in the reference box. Each candidate n-gram is clipped against whichever single reference contains it most, and the brevity penalty uses the reference length closest to the candidate (ties go to the shorter one). More references usually raise the score because there are more legitimate ways to match.

Does this match NLTK or sacreBLEU?

It matches NLTK sentence_bleu with uniform weights and, when enabled, SmoothingFunction.method1. sacreBLEU can differ because it applies its own tokeniser (e.g. 13a, intl) before scoring. This tool uses plain whitespace tokenisation, so match your tokenisation to compare numbers, and report BLEU with its settings as recommended by Post (2018).

Should I lowercase before scoring?

Case-insensitive BLEU is the common default because a translation should not be punished for capitalisation. Lowercasing is on by default here. Turn it off when case is part of what you are evaluating, such as named-entity casing, and keep the same choice across every system you compare.


BLEU Score Calculator

Paste a candidate translation and one or more references to get the BLEU score on the 0–1 and 0–100 scales, the modified n-gram precisions p1–p4, the brevity penalty, and the clipped match counts — so you can see exactly how the number was derived. Matches NLTK, no signup, runs in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 9, 2026


Runs entirely in your browser — your text is never uploaded, logged, or stored. Method: modified n-gram precision with clipping, brevity penalty, and uniform-weight geometric mean, per Papineni et al. (2002); reconciled to NLTK sentence_bleu. Up to 50,000 characters per box.

How it works

BLEU (Bilingual Evaluation Understudy) is the most widely reported automatic metric for machine translation and other text-generation tasks. It compares a system's output (the candidate) with one or more correct reference texts and rewards candidates that share n-grams with a reference while not being too short. The score, defined by Papineni et al. (2002), is:

BLEU = BP · exp( Σ wₙ · ln pₙ ), wₙ = 1/N

The two ingredients are the modified n-gram precisions pₙ and the brevity penalty BP. They are computed in four steps:

Tokenise. The candidate and each reference are optionally lowercased and split on whitespace into word tokens. This tool uses plain whitespace tokenisation with no stemming, matching the BLEU paper's running examples.
Modified precision with clipping. For each order n from 1 to N, count the candidate's n-grams. Each distinct n-gram is credited only up to the maximum number of times it appears in a single reference: count_clip(g) = min(count_cand(g), max_ref count_ref(g)). Then pₙ = Σ clipped ÷ Σ candidate n-grams. Clipping is what stops a candidate from scoring well by repeating one correct word.
Brevity penalty. Let c be the candidate token count and r the reference length closest to c (ties go to the shorter reference). BP = 1 when c > r, otherwise BP = exp(1 − r/c). Without it, a system could inflate precision by emitting only the words it is most sure of.
Combine. Take the geometric mean of p1…pN with equal weights 1/N, then multiply by BP. Because it is a geometric mean, a single pₙ of 0 makes the whole score 0 — common for short sentences, which is why an ε-smoothing toggle (NLTK method 1) is provided to add a tiny count to empty orders.
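
The four steps can be sketched end to end in plain Python. This is an unsmoothed illustration under the assumptions stated on this page (whitespace tokenisation, uniform weights); the names `ngrams` and `sentence_bleu_sketch` are hypothetical and this is not the tool's actual source.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list (empty when the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_sketch(candidate, references, max_n=4, lowercase=True):
    """Unsmoothed sentence-level BLEU following the four steps above."""
    # Step 1: tokenise on whitespace, optionally lowercasing first.
    prep = (lambda s: s.lower().split()) if lowercase else (lambda s: s.split())
    cand = prep(candidate)
    refs = [prep(r) for r in references]
    if not cand or not refs:
        return 0.0

    # Step 2: modified precision with clipping, accumulated in the log domain.
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngrams(ref, n)  # union keeps the max count per n-gram
        clipped = sum((cand_counts & max_ref).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_sum += math.log(clipped / total) / max_n

    # Step 3: brevity penalty with the closest reference length (ties to the shorter).
    c = len(cand)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)

    # Step 4: combine.
    return bp * math.exp(log_sum)

score = sentence_bleu_sketch("the cat is on mat", ["the cat is on the mat"])
print(round(100 * score, 2))  # 57.89
```

Returning 0.0 as soon as any order has no matches mirrors the unsmoothed behaviour described in the Combine step; smoothing would replace that early return.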

The same precisions are combined a second time in the product domain (the N-th root of the product of the pₙ) and compared with the log-domain result as a cross-check; when they agree the score is flagged “cross-checked”. The numbers reconcile with NLTK's sentence_bleu using uniform weights, so they are directly comparable when tokenisation matches. Because tokenisation and the number of references change the score, Post (2018) recommends always reporting BLEU together with its settings.
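
A cross-check of this kind might look like the following sketch. The function names and the tolerance are assumptions made for illustration; the page does not state how closely the two results must agree.

```python
import math

def geo_mean_log(precisions):
    """Geometric mean in the log domain: exp of the mean of the logs."""
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

def geo_mean_product(precisions):
    """Geometric mean in the product domain: N-th root of the product."""
    return math.prod(precisions) ** (1 / len(precisions))

# Precisions must all be positive here; a zero is handled before this point.
precisions = [1.0, 0.75, 2 / 3, 0.5]
a = geo_mean_log(precisions)
b = geo_mean_product(precisions)
cross_checked = math.isclose(a, b, rel_tol=1e-9)  # tolerance is an assumption

print(round(a, 4), round(b, 4), cross_checked)  # 0.7071 0.7071 True
```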

Worked examples

Partial match → 57.89

Candidate: the cat is on mat
Reference: the cat is on the mat

p1 = 5/5 = 1.0000 (every candidate word is in the reference)
p2 = 3/4 = 0.7500 ('on mat' is not a reference bigram)
p3 = 2/3 = 0.6667, p4 = 1/2 = 0.5000
c = 5, r = 6 → BP = exp(1 − 6/5) = exp(−0.2) = 0.8187
geo mean = (1 · 0.75 · 0.6667 · 0.5)^(1/4) = 0.7071
BLEU = 0.8187 × 0.7071 = 0.5789 → 57.89 / 100

Clipping → BLEU 0

Candidate: the the the the the the the
Reference 1: the cat is on the mat
Reference 2: there is a cat on the mat

Candidate is 'the' ×7; the best single reference has 'the' ×2
p1 = min(7, 2) / 7 = 2/7 = 0.2857 (clipping in action)
No reference contains 'the the' → p2 = 0
c = 7, closest reference length r = 7 → BP = 1
With N = 4 and p2 = 0, the geometric mean is 0
BLEU = 0 — turn on smoothing to see a small non-zero score
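
The counts in this example can be reproduced with a short Python sketch that clips against the best single reference and picks the closest reference length. The helper names are hypothetical, and the two references are the sentences "the cat is on the mat" and "there is a cat on the mat".

```python
from collections import Counter

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, refs, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand_counts = ngram_counts(cand, n)
    clipped = 0
    for gram, count in cand_counts.items():
        # Clip against whichever single reference contains the n-gram most often.
        best_ref = max(ngram_counts(ref, n)[gram] for ref in refs)
        clipped += min(count, best_ref)
    return clipped, sum(cand_counts.values())

def closest_ref_length(cand, refs):
    """Reference length closest to the candidate; ties go to the shorter one."""
    c = len(cand)
    return min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))

print(modified_precision(candidate, references, 1))  # (2, 7)
print(modified_precision(candidate, references, 2))  # (0, 6)
print(closest_ref_length(candidate, references))     # 7
```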

Short perfect match → 0 (then smoothed)

Candidate: hello world
Reference: hello world

p1 = 2/2 = 1, p2 = 1/1 = 1 — a perfect 2-word match
But there are no candidate 3-grams or 4-grams → p3 = p4 = 0
BLEU-4 = 0 even though the text is identical
Enable smoothing: empty orders use ε = 0.1 → p3 = p4 = 0.1
geo mean = (1 · 1 · 0.1 · 0.1)^(1/4) = 0.3162, BP = 1
Smoothed BLEU = 0.3162 → 31.62 / 100
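
The smoothed arithmetic above can be checked with a small Python sketch. It assumes that an order with no candidate n-grams is given a denominator of 1, so that ε = 0.1 yields a precision of 0.1 as in the example; the function name is hypothetical.

```python
import math

def geo_mean_with_epsilon(numerators, denominators, smooth=False, epsilon=0.1):
    """Uniform-weight geometric mean of precisions, with optional ε smoothing."""
    precisions = []
    for num, den in zip(numerators, denominators):
        den = max(1, den)  # assumption: an empty order gets denominator 1
        if num == 0:
            if not smooth:
                return 0.0  # a zero precision zeroes the whole mean
            num = epsilon   # smoothing replaces the zero match count with ε
        precisions.append(num / den)
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# "hello world" vs "hello world": clipped matches and totals for n = 1..4
numerators = [2, 1, 0, 0]
denominators = [2, 1, 0, 0]

print(geo_mean_with_epsilon(numerators, denominators))                         # 0.0
print(round(geo_mean_with_epsilon(numerators, denominators, smooth=True), 4))  # 0.3162
```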


Sources & references

The formulas and the five worked examples on this page were last reconciled against Papineni et al. (2002) and NLTK sentence_bleu on 2026-06-09. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the BLEU math fails fast.

