induwara.lk

BLEU Score Calculator

Paste a candidate translation and one or more references to get the BLEU score on the 0–1 and 0–100 scales, the modified n-gram precisions p1–p4, the brevity penalty, and the clipped match counts, so you can see exactly how the number was derived. Results match NLTK's sentence_bleu, no signup is needed, and everything runs in your browser.

By Induwara Ashinsana · Updated Jun 9, 2026

Runs entirely in your browser — your text is never uploaded, logged, or stored. Method: modified n-gram precision with clipping, brevity penalty, and uniform-weight geometric mean, per Papineni et al. (2002); reconciled to NLTK sentence_bleu. Up to 50,000 characters per box.

How it works

BLEU (Bilingual Evaluation Understudy) is the most widely reported automatic metric for machine translation and other text-generation tasks. It compares a system's output (the candidate) with one or more correct reference texts and rewards candidates that share n-grams with a reference while not being too short. The score, defined by Papineni et al. (2002), is:

BLEU = BP · exp( Σ wₙ · ln pₙ ),  wₙ = 1/N

The two ingredients are the modified n-gram precisions pₙ and the brevity penalty BP. They are computed in four steps:

  1. Tokenise. The candidate and each reference are optionally lowercased and split on whitespace into word tokens. This tool uses plain whitespace tokenisation with no stemming, matching the BLEU paper's running examples.
  2. Modified precision with clipping. For each order n from 1 to N, count the candidate's n-grams. Each distinct n-gram g is credited only up to the maximum number of times it appears in any single reference: count_clip(g) = min(count_cand(g), max over references of count_ref(g)). Then pₙ = Σ clipped counts ÷ Σ candidate n-gram counts. Clipping is what stops a candidate from scoring well by repeating one correct word.
  3. Brevity penalty. Let c be the candidate token count and r the reference length closest to c (ties go to the shorter reference). BP = 1 when c > r, otherwise BP = exp(1 − r/c), which also gives 1 when c = r. Without it, a system could inflate precision by emitting only the words it is most sure of.
  4. Combine. Take the geometric mean of p1…pN with equal weights 1/N, then multiply by BP. Because it is a geometric mean, a single pₙ of 0 makes the whole score 0 — common for short sentences, which is why an ε-smoothing toggle (NLTK method 1) is provided to add a tiny count to empty orders.
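The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the unsmoothed calculation as described on this page, not the tool's actual source; the function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Step 2: return (clipped matches, total candidate n-grams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


def brevity_penalty(c, ref_lengths):
    """Step 3: r is the reference length closest to c; ties go to the shorter."""
    if c == 0:
        return 0.0
    r = min(ref_lengths, key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4, lowercase=True):
    """Unsmoothed sentence-level BLEU with uniform weights 1/N."""
    def tokenise(text):  # Step 1: optional lowercasing, whitespace split
        return (text.lower() if lowercase else text).split()

    cand = tokenise(candidate)
    refs = [tokenise(ref) for ref in references]
    log_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(cand, refs, n)
        if clipped == 0:
            return 0.0  # a single zero precision zeroes the geometric mean
        log_sum += math.log(clipped / total) / max_n
    # Step 4: geometric mean (in the log domain) times the brevity penalty
    return brevity_penalty(len(cand), [len(ref) for ref in refs]) * math.exp(log_sum)


print(round(bleu("the cat is on mat", ["the cat is on the mat"]), 4))
```

The early return for a zero precision mirrors the behaviour described in step 4; smoothing is deliberately left out of this sketch.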

The same precisions are combined a second time in the product domain (the N-th root of the product of the pₙ) and compared with the log-domain result as a cross-check; when they agree the score is flagged “cross-checked”. The numbers reconcile with NLTK's sentence_bleu using uniform weights, so they are directly comparable when tokenisation matches. Because tokenisation and the number of references change the score, Post (2018) recommends always reporting BLEU together with its settings.
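As a rough illustration of that cross-check, the snippet below computes the geometric mean both ways and compares the results. The helper names and the tolerance of 1e-9 are assumptions made for this sketch; the tolerance the tool actually uses is not stated on this page.

```python
import math


def geo_mean_log(precisions):
    """Geometric mean via the log domain: exp(mean of ln p)."""
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def geo_mean_product(precisions):
    """Geometric mean via the product domain: N-th root of the product."""
    return math.prod(precisions) ** (1 / len(precisions))


# Precisions p1..p4 from the "Partial match" worked example on this page.
precisions = [1.0, 3 / 4, 2 / 3, 1 / 2]

log_result = geo_mean_log(precisions)
product_result = geo_mean_product(precisions)

# Assumed tolerance; both routes should agree up to floating-point rounding.
cross_checked = math.isclose(log_result, product_result, rel_tol=1e-9)
print(round(log_result, 4), round(product_result, 4), cross_checked)
```

Both functions require every precision to be positive, since ln 0 is undefined; a zero precision has to be handled (or smoothed) before this step.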

Worked examples

Partial match → 57.89

Candidate
the cat is on mat
Reference
the cat is on the mat
  1. p1 = 5/5 = 1.0000 (every candidate word is in the reference)
  2. p2 = 3/4 = 0.7500 ('on mat' is not a reference bigram)
  3. p3 = 2/3 = 0.6667, p4 = 1/2 = 0.5000
  4. c = 5, r = 6 → BP = exp(1 − 6/5) = exp(−0.2) = 0.8187
  5. geo mean = (1 · 0.75 · 0.6667 · 0.5)^(1/4) = 0.7071
  6. BLEU = 0.8187 × 0.7071 = 0.5789 → 57.89 / 100
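To see where the missing matches come from, a short sketch can list the candidate n-grams that do not occur in the reference. A plain set lookup is enough here only because no n-gram repeats in this candidate, so clipping never applies; the `ngrams` helper is hypothetical.

```python
def ngrams(tokens, n):
    """Return every n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


candidate = "the cat is on mat".split()
reference = "the cat is on the mat".split()

unmatched = {}
for n in range(1, 5):
    ref_grams = set(ngrams(reference, n))
    # Keep the candidate n-grams that never appear in the reference.
    unmatched[n] = [" ".join(g) for g in ngrams(candidate, n) if g not in ref_grams]
    print(n, unmatched[n])
```

Each order from 2 to 4 loses exactly one n-gram, the one ending in "mat", which accounts for the 3/4, 2/3, and 1/2 precisions above.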

Clipping → BLEU 0

Candidate
the the the the the the the
References
the cat is on the mat
there is a cat on the mat
  1. Candidate is 'the' ×7; the best single reference has 'the' ×2
  2. p1 = min(7, 2) / 7 = 2/7 = 0.2857 (clipping in action)
  3. No reference contains 'the the' → p2 = 0
  4. c = 7, closest reference length r = 7 → BP = 1
  5. With N = 4 and p2 = 0, the geometric mean is 0
  6. BLEU = 0 — turn on smoothing to see a small non-zero score
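The effect of clipping in this example can be reproduced with `collections.Counter`. This is a small sketch of the unigram case only, using variable names chosen for illustration.

```python
from collections import Counter

candidate = ["the"] * 7
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

cand_count = Counter(candidate)["the"]
# Highest count of 'the' in any single reference (counts are not summed).
max_ref_count = max(Counter(ref)["the"] for ref in references)

unclipped_p1 = cand_count / len(candidate)
clipped_p1 = min(cand_count, max_ref_count) / len(candidate)

# Closest reference length to c; ties would go to the shorter reference.
c = len(candidate)
r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))

print(unclipped_p1, round(clipped_p1, 4), c, r)
```

Without clipping the unigram precision would be a perfect 1.0, which is exactly the failure mode the modified precision is designed to prevent.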

Short perfect match → 0 (then smoothed)

Candidate
hello world
Reference
hello world
  1. p1 = 2/2 = 1, p2 = 1/1 = 1 — a perfect 2-word match
  2. But there are no candidate 3-grams or 4-grams → p3 = p4 = 0
  3. BLEU-4 = 0 even though the text is identical
  4. Enable smoothing: empty orders use ε = 0.1 → p3 = p4 = 0.1
  5. geo mean = (1 · 1 · 0.1 · 0.1)^(1/4) = 0.3162, BP = 1
  6. Smoothed BLEU = 0.3162 → 31.62 / 100
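The smoothed figure can be checked with a short sketch. It assumes that an empty order is treated as 0 matches out of a denominator floored at 1, with ε added to the zero numerator; that assumption reproduces the p3 = p4 = 0.1 quoted above, but the tool's exact handling is not shown on this page. The `smoothed_precision` helper is hypothetical.

```python
import math

EPSILON = 0.1  # value quoted in the worked example


def smoothed_precision(clipped, total, epsilon=EPSILON):
    """Replace a zero numerator with epsilon; floor the denominator at 1 (assumed)."""
    numerator = clipped if clipped > 0 else epsilon
    return numerator / max(total, 1)


# (clipped matches, candidate n-grams) for n = 1..4 with "hello world" vs itself.
counts = [(2, 2), (1, 1), (0, 0), (0, 0)]
precisions = [smoothed_precision(clipped, total) for clipped, total in counts]

brevity_penalty = 1.0  # c = r = 2
score = brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(precisions, round(score, 4))
```

Because two of the four orders contribute 0.1 each, the geometric mean is the fourth root of 0.01, so even an identical short sentence scores well below 1 under this smoothing.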

Sources & references

The formulas and the five worked examples on this page were last reconciled against Papineni et al. (2002) and NLTK sentence_bleu on 2026-06-09. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the BLEU math fails fast.
