How is the METEOR score calculated?

METEOR aligns the candidate and reference word by word, then combines a recall-weighted Fmean with a word-order penalty. With m matched unigrams, candidate length c and reference length r: P = m/c, R = m/r, and Fmean = P·R/(α·P + (1−α)·R) with α = 0.9. The mapped words are grouped into the fewest chunks ch, the penalty is Pen = γ·(ch/m)^β with γ = 0.5 and β = 3, and the final score is Fmean·(1 − Pen).
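As a rough illustration, the formula above can be evaluated directly from the four counts. This is a minimal sketch, not the calculator's own code: the function name `meteor_from_counts` is hypothetical, and returning 0 when nothing matches is an assumption made here to avoid dividing by zero.

```python
def meteor_from_counts(m, c, r, ch, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR from match count m, candidate length c, reference length r, chunks ch."""
    if m == 0:
        return 0.0  # assumption: no aligned words means a score of zero
    precision = m / c
    recall = m / r
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (ch / m) ** beta
    return fmean * (1 - penalty)


# "the cat is on the mat" vs "the cat sat on the mat": m = 5, c = r = 6, ch = 2
print(round(meteor_from_counts(5, 6, 6, 2), 4))  # 0.8067
```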

What is a good METEOR score?

There is no universal pass mark: METEOR is only meaningful against a baseline on the same data and tokenisation. As a rough guide on the 0–1 scale, sentence scores under 0.2 share little with the reference, 0.4–0.6 is the typical band for usable machine translation, and above 0.8 is near-reference. Even two identical sentences fall just short of 1 because of the chunk penalty; a six-word identical pair scores about 0.998.

What is the difference between METEOR and BLEU?

BLEU is precision-driven: it counts how many candidate n-grams appear in a reference, with a brevity penalty for short output. METEOR is recall-weighted, scores on aligned unigrams rather than higher-order n-grams, adds a word-order (chunk) penalty, and can match stems and synonyms. METEOR tends to correlate better with human judgment at the sentence level, while BLEU is the long-standing corpus-level standard.

Why does METEOR weight recall more than precision?

Banerjee and Lavie found recall correlates more strongly with human judgment than precision, so the Fmean weight α is set to 0.9, giving recall nine times the pull of precision. The intuition is that missing content from the reference hurts a translation more than including a few extra words, so the metric penalises dropped meaning harder than padding.
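To see the asymmetry in numbers, the sketch below compares a padded candidate (full recall, half precision) with a truncated one (full precision, half recall). The helper name `fmean` and the two scenarios are illustrative assumptions, not taken from the paper.

```python
def fmean(precision, recall, alpha=0.9):
    """Recall-weighted harmonic mean with the default alpha = 0.9."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


# Padded: every reference word is covered, but half the output is extra.
padded = fmean(precision=0.5, recall=1.0)
# Dropped: every output word is right, but half the reference is missing.
dropped = fmean(precision=1.0, recall=0.5)

print(round(padded, 4), round(dropped, 4))  # 0.9091 0.5263
```

With the same amount of error on each side, dropping content costs far more than padding.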

What does the METEOR fragmentation penalty measure?

It measures how scattered the matched words are. The mapped unigrams are grouped into chunks — runs that are adjacent in both the candidate and the reference. Few long chunks mean the word order largely agrees; many short chunks mean the right words appear in the wrong order. The penalty Pen = γ·(ch/m)^β grows with the chunk-to-match ratio, so a jumbled sentence is marked down even when every word matches.
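Chunk counting can be sketched as a single pass over the aligned index pairs. The function name `count_chunks` and the hand-written alignment are assumptions for illustration; the alignment itself would normally come from the matching stage.

```python
def count_chunks(alignment):
    """Count chunks in a list of (candidate_index, reference_index) pairs."""
    pairs = sorted(alignment)
    if not pairs:
        return 0
    chunks = 1
    for (c_prev, r_prev), (c_next, r_next) in zip(pairs, pairs[1:]):
        # A chunk continues only if both sides advance by exactly one position.
        if c_next != c_prev + 1 or r_next != r_prev + 1:
            chunks += 1
    return chunks


# candidate "the bird flew over a house", reference "a bird flew over the house"
alignment = [(0, 4), (1, 1), (2, 2), (3, 3), (4, 0), (5, 5)]
ch = count_chunks(alignment)
penalty = 0.5 * (ch / len(alignment)) ** 3
print(ch, round(penalty, 4))  # 4 0.1481
```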

Does this calculator match NLTK's meteor_score?

Yes, for exact and Porter-stem matching. It uses the same default parameters (α = 0.9, γ = 0.5, β = 3) and the same greedy alignment and chunk counting as NLTK's single_meteor_score, so the numbers reconcile when you tokenise the same way. It does not include the WordNet synonym module, so scores can differ slightly from NLTK runs that enable synonym matching.

Why is the score not exactly 1 for identical sentences?

The fragmentation penalty is never zero when there is at least one match: with all m words in a single chunk, Pen = γ·(1/m)^β. For a six-word identical pair that is 0.5·(1/6)³ ≈ 0.0023, so METEOR ≈ 0.9977. This is a documented and correct property of the metric, not a bug. The penalty shrinks as m grows, so longer identical pairs score closer to 1 but never reach it.
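The effect of sentence length is easy to tabulate. In this small sketch the function name `identical_pair_score` is hypothetical; it assumes Fmean = 1 and a single chunk, which is what an identical pair produces.

```python
def identical_pair_score(m, beta=3.0, gamma=0.5):
    """METEOR for two identical m-word sentences: Fmean = 1, one chunk."""
    return 1.0 - gamma * (1 / m) ** beta


for m in (1, 2, 6, 20):
    print(m, round(identical_pair_score(m), 4))
# 1 0.5
# 2 0.9375
# 6 0.9977
# 20 0.9999
```

Very short identical pairs are hit hardest: under these defaults a one-word pair scores only 0.5.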

Does this support stemming or synonyms?

It supports exact matching and an optional Porter-stem mode, which matches inflected forms like 'running' to 'run' or 'cats' to 'cat' after the exact pass. WordNet synonym matching is out of scope in this version because bundling WordNet would blow the page-weight budget; the page states this clearly so your numbers stay reproducible.


METEOR Score Calculator

Paste a candidate translation and a reference to get the METEOR score with the full breakdown — unigram matches, precision, recall, the recall-weighted Fmean, the chunk-based fragmentation penalty, and the aligned tokens highlighted. Matches NLTK, no signup, runs in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 12, 2026


Runs entirely in your browser — your text is never uploaded, logged, or stored. Method: one-to-one unigram alignment, recall-weighted Fmean, and a chunk-based fragmentation penalty, per Banerjee & Lavie (2005); reconciled to NLTK single_meteor_score. Up to 50,000 characters per box.

How it works

METEOR (Metric for Evaluation of Translation with Explicit ORdering), defined by Banerjee & Lavie (2005), scores a candidate against a reference by first aligning their words and then balancing how much content is shared against how well the word order is preserved. Unlike BLEU, it rewards recall heavily and adds an explicit penalty for jumbled output. The score is:

METEOR = Fmean · (1 − Pen)

It is built in four steps from the tokenised, optionally lowercased texts:

Align unigrams. Build the largest one-to-one mapping between candidate and reference words. This tool uses exact matching and, optionally, a Porter-stem stage applied to the words left over after the exact pass, the same exact/stem ordering as NLTK's meteor_score. WordNet synonym matching is out of scope here.
Precision and recall. With m mapped unigrams, candidate length c and reference length r, P = m/c and R = m/r.
Fmean. A recall-weighted harmonic mean, Fmean = (P·R)/(α·P + (1 − α)·R). With the default α = 0.9 this equals 10·P·R/(R + 9·P), so recall pulls nine times harder than precision.
Fragmentation penalty. Group the mapped unigrams into the fewest chunks — runs adjacent in both the candidate and the reference. With ch chunks over m matches, Pen = γ·(ch/m)^β using γ = 0.5 and β = 3. Many short chunks (scrambled word order) drive the penalty up; one long chunk barely dents the score.
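The four steps can be strung together in a short end-to-end sketch for exact matching only. It rests on several assumptions: whitespace tokenisation, lowercasing, and a simple left-to-right greedy alignment that takes the first unused reference word. That greedy rule is a simplification; it is not guaranteed to find the fewest chunks on every input and is not presented as this tool's or NLTK's implementation. The names `align_exact` and `meteor_exact` are hypothetical.

```python
def align_exact(candidate, reference):
    """Greedy one-to-one exact alignment (a simplification, see lead-in)."""
    used = set()
    pairs = []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs  # already ordered by candidate index


def meteor_exact(candidate_text, reference_text, alpha=0.9, beta=3.0, gamma=0.5):
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    pairs = align_exact(candidate, reference)          # step 1: align
    m = len(pairs)
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)  # step 2
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)  # step 3
    chunks = 1 + sum(                                  # step 4: count chunk breaks
        1
        for (c0, r0), (c1, r1) in zip(pairs, pairs[1:])
        if c1 != c0 + 1 or r1 != r0 + 1
    )
    return fmean * (1 - gamma * (chunks / m) ** beta)


print(round(meteor_exact("the cat is on the mat", "the cat sat on the mat"), 4))  # 0.8067
print(round(meteor_exact("the bird flew over a house", "a bird flew over the house"), 4))  # 0.8519
```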

The α = 0.9, γ = 0.5 and β = 3 defaults are the values from Banerjee & Lavie (2005), confirmed against NLTK's single_meteor_score defaults; Lavie & Agarwal (2007) discuss tuning them per language. Fmean is computed as the direct ratio and independently re-derived as the reciprocal 1/(α/R + (1 − α)/P); when the two agree the score is flagged “cross-checked”. One quirk worth knowing: two identical six-word sentences score about 0.998, not 1, because a single chunk still incurs Pen = γ·(1/m)^β. This is a correct, documented property of METEOR.
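A cross-check of this kind might look like the sketch below. The function names, the sample values of P and R, and the tolerance are all assumptions; the page does not say how close the two results must be. Both forms need P and R to be greater than zero.

```python
import math


def fmean_direct(p, r, alpha=0.9):
    """Direct ratio: P*R / (alpha*P + (1 - alpha)*R)."""
    return p * r / (alpha * p + (1 - alpha) * r)


def fmean_reciprocal(p, r, alpha=0.9):
    """Reciprocal form: 1 / (alpha/R + (1 - alpha)/P)."""
    return 1.0 / (alpha / r + (1 - alpha) / p)


p, r = 3 / 4, 3 / 5  # illustrative values: 3 matches, c = 4, r = 5
a, b = fmean_direct(p, r), fmean_reciprocal(p, r)
cross_checked = math.isclose(a, b, rel_tol=1e-12)  # tolerance is an assumption
print(round(a, 4), cross_checked)  # 0.6122 True
```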

Worked examples

One substitution → 0.8067

Candidate: the cat is on the mat
Reference: the cat sat on the mat

Matches: the, cat, on, the, mat → m = 5 (sat ≠ is)
c = 6, r = 6 → P = R = 5/6 = 0.8333
Fmean = 10·(5/6)² / ((5/6) + 9·(5/6)) = 0.8333
Chunks {the, cat} and {on, the, mat} → ch = 2
Pen = 0.5·(2/5)³ = 0.5·0.064 = 0.0320
METEOR = 0.8333 · (1 − 0.0320) = 0.8067

Same words, reordered → 0.8519

Candidate: the bird flew over a house
Reference: a bird flew over the house

Every word matches → m = 6, P = R = 1, Fmean = 1
Best alignment chunks: {the}, {bird, flew, over}, {a}, {house}
ch = 4 over m = 6
Pen = 0.5·(4/6)³ = 0.5·0.2963 = 0.1481
METEOR = 1 · (1 − 0.1481) = 0.8519
Same words as a perfect match, but the swapped order costs ~0.15

Porter-stem mode → 0.6389

Candidate: the cats are running
Reference: the cat is run

Exact: the = the. Stem: cats → cat, running → run
m = 3 (are/is do not match), c = r = 4
P = R = 3/4 = 0.7500 → Fmean = 0.7500
Chunks {the, cat} and {run} → ch = 2
Pen = 0.5·(2/3)³ = 0.1481
METEOR = 0.7500 · (1 − 0.1481) = 0.6389
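The exact-then-stem ordering in the last example can be sketched as two passes over the unmatched words. Note that `toy_stem` is a hypothetical stand-in that handles only the suffixes in this example; it is not the Porter algorithm, and a real run would plug in a proper Porter stemmer. The name `staged_align` is also an assumption.

```python
def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer; covers only this example."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]  # "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]      # "cats" -> "cat"
    return word


def staged_align(candidate, reference, stem=toy_stem):
    """Exact pass first, then a stem pass over the words still unmatched."""
    pairs = []
    free_c = set(range(len(candidate)))
    free_r = set(range(len(reference)))
    for key in (lambda w: w, stem):  # stage 1: exact, stage 2: stem
        for i in sorted(free_c):
            for j in sorted(free_r):
                if key(candidate[i]) == key(reference[j]):
                    pairs.append((i, j))
                    free_c.discard(i)
                    free_r.discard(j)
                    break
    return sorted(pairs)


candidate = "the cats are running".split()
reference = "the cat is run".split()
pairs = staged_align(candidate, reference)
print(pairs)  # [(0, 0), (1, 1), (3, 3)]
```

The three pairs give m = 3, and the gap between positions 1 and 3 splits them into the two chunks used above.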


Sources & references

The formulas and the worked examples on this page were last reconciled against Banerjee & Lavie (2005) and NLTK single_meteor_score on 2026-06-12. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the METEOR math fails fast.
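A regression check of that kind could be as small as the table-driven sketch below. It is an assumption about the shape of such a check, not the module's actual code: the names `WORKED_EXAMPLES`, `score` and `check_worked_examples` are hypothetical, and the counts are copied from the worked examples on this page.

```python
WORKED_EXAMPLES = [
    # (label, m, c, r, ch, expected score to 4 decimals)
    ("one substitution", 5, 6, 6, 2, 0.8067),
    ("same words, reordered", 6, 6, 6, 4, 0.8519),
    ("Porter-stem mode", 3, 4, 4, 2, 0.6389),
]


def score(m, c, r, ch, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR from counts; assumes m > 0, which holds for every example here."""
    p, rec = m / c, m / r
    fmean = p * rec / (alpha * p + (1 - alpha) * rec)
    return fmean * (1 - gamma * (ch / m) ** beta)


def check_worked_examples():
    """Raise AssertionError if any worked example drifts from its stated score."""
    for label, m, c, r, ch, expected in WORKED_EXAMPLES:
        got = round(score(m, c, r, ch), 4)
        assert got == expected, (label, got, expected)


check_worked_examples()  # silent when every example still reproduces
```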

