METEOR Score Calculator
Paste a candidate translation and a reference to get the METEOR score with the full breakdown — unigram matches, precision, recall, the recall-weighted Fmean, the chunk-based fragmentation penalty, and the aligned tokens highlighted. Matches NLTK, no signup, runs in your browser.
How it works
METEOR (Metric for Evaluation of Translation with Explicit ORdering), defined by Banerjee & Lavie (2005), scores a candidate against a reference by first aligning their words and then balancing how much content is shared against how well the word order is preserved. Unlike BLEU, it rewards recall heavily and adds an explicit penalty for jumbled output. The score is:
METEOR = Fmean · (1 − Pen)
It is built in four steps from the tokenised, optionally lowercased texts:
- Align unigrams.Build the largest one-to-one mapping between candidate and reference words. This tool uses exact matching and, optionally, a Porter-stem stage applied to the words left over after the exact pass — the same exact/stem ordering as NLTK's
meteor_score. WordNet synonym matching is out of scope here. - Precision and recall. With m mapped unigrams, candidate length c and reference length r,
P = m/candR = m/r. - Fmean. A recall-weighted harmonic mean,
Fmean = (P·R)/(α·P + (1 − α)·R). With the default α = 0.9 this equals 10·P·R/(R + 9·P), so recall pulls nine times harder than precision. - Fragmentation penalty. Group the mapped unigrams into the fewest chunks — runs adjacent in both the candidate and the reference. With ch chunks over m matches,
Pen = γ·(ch/m)^βusing γ = 0.5 and β = 3. Many short chunks (scrambled word order) drive the penalty up; one long chunk barely dents the score.
The α = 0.9, γ = 0.5 and β = 3 defaults are the values from Banerjee & Lavie (2005), confirmed against NLTK's single_meteor_score defaults; Lavie & Agarwal (2007) discuss tuning them per language. Fmean is computed as the direct ratio and independently re-derived as the reciprocal 1/(α/R + (1 − α)/P); when the two agree the score is flagged “cross-checked”. One quirk worth knowing: two identical sentences score about 0.998, not 1, because a single chunk still incurs Pen = γ·(1/m)^β — a correct, documented property of METEOR.
Worked examples
Frequently asked questions
Sources & references
- Banerjee & Lavie (2005) — METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments (ACL Workshop)
- NLTK nltk.translate.meteor_score — reference implementation (single_meteor_score, default α/γ/β)
- Lavie & Agarwal (2007) — METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments (WMT)
- Porter, M. F. (1980) — An algorithm for suffix stripping (the Porter stemmer used in stem mode)
The formulas and the worked examples on this page were last reconciled against Banerjee & Lavie (2005) and NLTK single_meteor_score on 2026-06-12. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the METEOR math fails fast.
Related tools
Comments & feedback
Spotted a bug or want an improvement? Tell us — our team reviews every comment, and good ideas get built. Comments are public and anonymous.
Found a bug, edge case, or want to suggest an improvement?
Email me at [email protected] — most fixes ship within 24 hours.