BLEU Score Calculator
Paste a candidate translation and one or more references to get the BLEU score on the 0–1 and 0–100 scales, the modified n-gram precisions p1–p4, the brevity penalty, and the clipped match counts — so you can see exactly how the number was derived. Matches NLTK, no signup, runs in your browser.
How it works
BLEU (Bilingual Evaluation Understudy) is the most widely reported automatic metric for machine translation and other text-generation tasks. It compares a system's output (the candidate) with one or more correct reference texts and rewards candidates that share n-grams with a reference while not being too short. The score, defined by Papineni et al. (2002), is:
BLEU = BP · exp( Σ wₙ · ln pₙ ), wₙ = 1/N
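Written out with explicit summation bounds, the same formula reads as follows. The second line assumes the common default N = 4 (the orders p1–p4 that this page reports), in which case the exponential of the averaged logs is simply the fourth root of the product of the precisions.

```latex
\[
\mathrm{BLEU} \;=\; \mathrm{BP}\cdot\exp\!\Bigl(\sum_{n=1}^{N} w_n \ln p_n\Bigr),
\qquad w_n = \frac{1}{N}
\]
\[
N = 4:\qquad \mathrm{BLEU} \;=\; \mathrm{BP}\cdot\bigl(p_1\,p_2\,p_3\,p_4\bigr)^{1/4}
\]
```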
The two ingredients are the modified n-gram precisions pₙ and the brevity penalty BP. They are computed in four steps:
- Tokenise. The candidate and each reference are optionally lowercased and split on whitespace into word tokens. This tool uses plain whitespace tokenisation with no stemming, matching the BLEU paper's running examples.
- Modified precision with clipping. For each order n from 1 to N, count the candidate's n-grams. Each distinct n-gram is credited only up to the maximum number of times it appears in a single reference: count_clip(g) = min(count_cand(g), max_ref count_ref(g)). Then pₙ = Σ clipped ÷ Σ candidate n-grams. Clipping is what stops a candidate from scoring well by repeating one correct word.
- Brevity penalty. Let c be the candidate token count and r the reference length closest to c (ties go to the shorter reference). BP = 1 when c > r, otherwise BP = exp(1 − r/c). Without it, a system could inflate precision by emitting only the words it is most sure of.
- Combine. Take the geometric mean of p1…pN with equal weights 1/N, then multiply by BP. Because it is a geometric mean, a single pₙ of 0 makes the whole score 0 — common for short sentences, which is why an ε-smoothing toggle (NLTK method 1) is provided to add a tiny count to empty orders.
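The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not this tool's actual code: the names `ngram_counts` and `bleu_sketch` are hypothetical, at least one reference is assumed, and the `epsilon` option only approximates the idea of adding a tiny count to empty orders.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate, references, max_n=4, lowercase=True, epsilon=None):
    """Illustrative sentence-level BLEU; names and defaults are hypothetical."""
    # Step 1: optional lowercasing, then plain whitespace tokenisation.
    def tokenise(text):
        return (text.lower() if lowercase else text).split()

    cand = tokenise(candidate)
    refs = [tokenise(ref) for ref in references]  # assumes at least one reference

    # Step 2: modified precision with clipping, for n = 1..max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        max_ref = Counter()  # highest count of each n-gram in any single reference
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = max(1, sum(cand_counts.values()))  # guard against division by zero
        if clipped == 0 and epsilon is not None:
            precisions.append(epsilon / total)  # assumed epsilon-style smoothing
        else:
            precisions.append(clipped / total)

    # Step 3: brevity penalty; r is the closest reference length, ties go shorter.
    c = len(cand)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    if c == 0:
        bp = 0.0
    elif c > r:
        bp = 1.0
    else:
        bp = math.exp(1 - r / c)

    # Step 4: geometric mean with equal weights 1/max_n, times BP.
    if min(precisions) <= 0:
        score = 0.0  # a single zero precision zeroes the whole score
    else:
        score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return {"bleu": score, "precisions": precisions, "bp": bp}
```

As a usage example, `bleu_sketch("the the the the the the the", ["the cat is on the mat", "there is a cat on the mat"])` yields a unigram precision of 2/7, the clipping example from the BLEU paper, and an unsmoothed score of 0 because no bigram matches. Passing a small `epsilon` (for instance 0.1, chosen here only for illustration) makes the score non-zero.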
The same precisions are combined a second time in the product domain (the N-th root of the product of the pₙ) and compared with the log-domain result as a cross-check; when they agree the score is flagged “cross-checked”. The numbers reconcile with NLTK's sentence_bleu using uniform weights, so they are directly comparable when tokenisation matches. Because tokenisation and the number of references change the score, Post (2018) recommends always reporting BLEU together with its settings.
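The cross-check described above can be sketched as follows. This is a hedged illustration, not the page's own module: the function names and the tolerances passed to `math.isclose` are assumptions, and the precisions and brevity penalty in the usage note are made-up numbers.

```python
import math


def geometric_mean_log(precisions):
    """Log-domain form: exp of the average of ln p_n (0 if any p_n is 0)."""
    if min(precisions) <= 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def geometric_mean_product(precisions):
    """Product-domain form: N-th root of the product of the p_n."""
    return math.prod(precisions) ** (1.0 / len(precisions))


def cross_check(precisions, bp, rel_tol=1e-9):
    """Return both scores and whether they agree within an assumed tolerance."""
    log_score = bp * geometric_mean_log(precisions)
    product_score = bp * geometric_mean_product(precisions)
    agree = math.isclose(log_score, product_score, rel_tol=rel_tol, abs_tol=1e-12)
    return log_score, product_score, agree
```

For example, `cross_check([0.8, 0.6, 0.4, 0.2], 0.9)` gives two scores of roughly 0.398 that agree. To compare against NLTK directly, note that `nltk.translate.bleu_score.sentence_bleu` expects already-tokenised input: a list of reference token lists and a hypothesis token list.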
Sources & references
- Papineni, Roukos, Ward & Zhu (2002) — BLEU: a Method for Automatic Evaluation of Machine Translation (ACL)
- NLTK nltk.translate.bleu_score — reference implementation (sentence_bleu, smoothing methods)
- Post, M. (2018) — A Call for Clarity in Reporting BLEU Scores (sacreBLEU, WMT)
- Google Cloud AutoML Translation — BLEU score interpretation bands
The formulas and the five worked examples on this page were last reconciled against Papineni et al. (2002) and NLTK sentence_bleu on 2026-06-09. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the BLEU math fails fast.