How do you calculate a ROUGE score?

Tokenise the candidate and reference, then for ROUGE-N count the overlapping n-grams (clipped to the reference's count). Recall is overlap ÷ reference n-grams, precision is overlap ÷ candidate n-grams, and F1 = 2·P·R ÷ (P + R). ROUGE-L replaces n-gram overlap with the longest common subsequence length. This follows Lin (2004).
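The clipped n-gram count above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function names `ngrams` and `rouge_n` are hypothetical, and it assumes plain lowercased whitespace tokens with no stemming or punctuation handling.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for ROUGE-N on whitespace tokens."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter & Counter keeps min(count_cand, count_ref) per n-gram: the clipping step.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f1 = rouge_n("the cat was found under the bed", "the cat was under the bed", n=1)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8571 1.0 0.9231
```

Passing `n=2` to the same function scores bigrams instead of unigrams.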

What is a good ROUGE score?

There is no universal threshold — ROUGE is comparative, not absolute. On the CNN/DailyMail news summarisation benchmark, strong systems reach roughly 0.40–0.45 ROUGE-1 F1, 0.19–0.21 ROUGE-2, and 0.37–0.42 ROUGE-L. What counts as 'good' depends on the dataset, the reference quality, and your tokenisation settings, so only compare scores computed the same way.

What is the difference between ROUGE-1, ROUGE-2 and ROUGE-L?

ROUGE-1 measures unigram (single-word) overlap and rewards getting the right words. ROUGE-2 measures bigram (word-pair) overlap and rewards short phrases in the right local order. ROUGE-L uses the longest common subsequence, so it rewards words that appear in the same overall order without needing to be adjacent. Papers usually report all three.

Is ROUGE precision or recall?

ROUGE reports both, plus their F1. It was designed as recall-oriented (the 'R' in ROUGE) because summarisation cares whether the reference content was captured. Modern toolkits, including Google's rouge-score and this calculator, return precision, recall and the balanced F1 so you can pick what your task needs. Most leaderboards quote the F1.
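To see how the balance between precision and recall can be shifted, here is a small sketch of the general F-measure from Lin (2004), where β = 1 gives the balanced F1 and larger β weights recall more heavily. The function name `f_measure` is illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """General F-measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (recall + b2 * precision)


p, r = 6 / 7, 1.0
print(round(f_measure(p, r), 4))          # 0.9231 (balanced F1)
print(round(f_measure(p, r, beta=2), 4))  # 0.9677 (recall-weighted)
```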

How is ROUGE-L calculated?

ROUGE-L finds the longest common subsequence (LCS) of the candidate and reference token sequences: the longest series of words that appears in both in the same order, not necessarily adjacent. With LCS length L, recall = L ÷ reference length and precision = L ÷ candidate length, combined as F1. ROUGE-Lsum applies the same idea sentence by sentence and takes the union of the matches.
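A minimal sketch of that calculation, assuming lowercased whitespace tokens and no stemming. The names `lcs_length` and `rouge_l` are hypothetical; the dynamic-programming recurrence is the standard LCS one, kept to two rows to save memory.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) for single-sequence ROUGE-L."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = rouge_l("police shot the gunman dead", "the gunman was shot dead by police")
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.6 0.4286 0.5
```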

Should I turn on stemming and lowercasing?

Google's rouge-score library lowercases and strips punctuation by default; Porter stemming is opt-in there (use_stemmer=True). With stemming on, 'cats' and 'cat' or 'running' and 'run' count as matches. Both options are available here too. Turn stemming off if you need an exact-form comparison, but keep the same settings across every system you compare or the numbers will not line up.

Why does ROUGE-Lsum differ from ROUGE-L?

ROUGE-L treats each text as one token sequence, so reordering whole sentences lowers the score. ROUGE-Lsum splits the text into sentences and matches them independently, so two summaries with the same sentences in a different order can still score highly. For single-sentence input the two metrics are identical.
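The difference can be reproduced with a short sketch of the summary-level union LCS. It is written from the description of the method, not taken from this tool or from rouge-score; the helper names and the naive full-stop sentence splitter are assumptions made for illustration.

```python
from collections import Counter


def lcs_ref_indices(ref, cand):
    """Indices into `ref` of one longest common subsequence with `cand`."""
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    indices, i, j = [], m, n
    while i > 0 and j > 0:  # walk back through the table to recover the match
        if ref[i - 1] == cand[j - 1]:
            indices.append(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return indices[::-1]


def rouge_lsum(candidate_sentences, reference_sentences):
    """Return (precision, recall, f1); each argument is a list of token lists."""
    cand_counts = Counter(t for s in candidate_sentences for t in s)
    ref_counts = Counter(t for s in reference_sentences for t in s)
    n_cand, n_ref = sum(cand_counts.values()), sum(ref_counts.values())
    if not n_cand or not n_ref:
        return 0.0, 0.0, 0.0
    hits = 0
    for ref in reference_sentences:
        # Union of the reference positions matched by any candidate sentence.
        union = set()
        for cand in candidate_sentences:
            union.update(lcs_ref_indices(ref, cand))
        for idx in sorted(union):
            token = ref[idx]
            # Clip by remaining token counts so no token is credited twice.
            if cand_counts[token] > 0 and ref_counts[token] > 0:
                hits += 1
                cand_counts[token] -= 1
                ref_counts[token] -= 1
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision, recall = hits / n_cand, hits / n_ref
    return precision, recall, 2 * precision * recall / (precision + recall)


def split_sentences(text):
    # Naive splitter for illustration only: sentences end at a full stop.
    return [s.lower().split() for s in text.split(".") if s.strip()]


cand = split_sentences("the cat sat. the dog ran.")
ref = split_sentences("the dog ran. the cat sat.")
print(rouge_lsum(cand, ref))  # (1.0, 1.0, 1.0)

# Treating each text as one sequence instead gives an LCS of 3 out of 6 tokens.
flat_cand = [t for s in cand for t in s]
flat_ref = [t for s in ref for t in s]
print(len(lcs_ref_indices(flat_ref, flat_cand)) / len(flat_ref))  # 0.5
```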

Does this match the Python rouge-score package?

Yes, with matching settings. Tokenisation, Porter stemming, the clipped n-gram counts and the summary-level union LCS follow Google's rouge-score. One difference: this tool keeps Unicode letters (so Sinhala and Tamil tokens are scored), whereas rouge-score drops non-ASCII characters. For ASCII English text with stemming on, the numbers reconcile.
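The tokenisation difference described above can be illustrated with two small regular-expression tokenisers. Both are sketches: `ascii_tokens` mimics the ASCII-only behaviour attributed to rouge-score, while `unicode_tokens` is a hypothetical Unicode-aware variant and not necessarily the pattern this tool uses.

```python
import re


def ascii_tokens(text):
    """ASCII-only: anything outside a-z and 0-9 is treated as a separator."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unicode_tokens(text):
    """Hypothetical Unicode-aware variant: keeps letters and digits from any script."""
    # Caveat: Python's \w may not cover combining marks, so scripts that rely on
    # vowel signs (such as Sinhala or Tamil) would need a more careful pattern.
    return re.findall(r"[^\W_]+", text.lower())


sample = "Caf\u00e9 au lait"
print(ascii_tokens(sample))    # ['caf', 'au', 'lait']
print(unicode_tokens(sample))  # ['café', 'au', 'lait']
```

The accented letter is dropped by the ASCII-only pattern, which changes the token and therefore the overlap counts.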

ROUGE Score Calculator

Paste a generated summary and one or more references to get ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each with precision, recall and F1 — plus the matched n-grams and longest common subsequence so the number is explainable. Matches Google rouge-score, no signup, runs in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 10, 2026

Runs entirely in your browser — your text is never uploaded, logged, or stored. Method: clipped n-gram overlap (ROUGE-1/2), sentence-level LCS (ROUGE-L) and summary-level union LCS (ROUGE-Lsum), per Lin (2004); tokenisation and Porter stemming reconciled to Google rouge-score. Up to 20,000 characters per box.

How it works

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard automatic metric for text summarisation, introduced by Chin-Yew Lin in 2004. It compares a system's output (the candidate) with one or more human-written reference texts and rewards overlap. This tool computes the four variants reported in most papers — ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each as precision, recall and F1.

Tokenise. Both texts are optionally lowercased, have punctuation stripped, are split on whitespace, and optionally Porter-stemmed so 'cats' matches 'cat'. These rules follow Google's rouge-score tokeniser, in which stemming is an opt-in setting.
ROUGE-N overlap. For N = 1 (unigrams) and N = 2 (bigrams), count how many candidate n-grams appear in the reference, clipping each to the number of times it occurs in the reference — overlap = Σ min(count_cand, count_ref). Recall is overlap ÷ reference n-grams; precision is overlap ÷ candidate n-grams. Clipping stops a candidate from gaming the score by repeating one correct word.
ROUGE-L. Find the longest common subsequence (LCS) of the two token sequences with the classic O(m·n) dynamic-programming table. With LCS length L, recall = L ÷ |reference| and precision = L ÷ |candidate|. Because LCS keeps word order without demanding adjacency, ROUGE-L credits fluent rephrasings that ROUGE-2 would miss.
ROUGE-Lsum. Split each text into sentences. For every reference sentence, take the union of its LCS matches against all candidate sentences, then clip the matched tokens by their remaining counts so that no token is credited twice. Recall is the total hits ÷ reference tokens and precision is the total hits ÷ candidate tokens. For single-sentence input it equals ROUGE-L.
Combine. Each metric's F1 is 2·P·R ÷ (P + R) — the β = 1 case of Lin's general F-measure. With several references, the score is taken against the best reference (max, Lin's recommendation) or averaged.
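The multi-reference step at the end of the list can be sketched as follows. The helper names are hypothetical, the second reference sentence is made up for the illustration, and only ROUGE-1 F1 on lowercased whitespace tokens is scored to keep the example short.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """ROUGE-1 F1 on lowercased whitespace tokens (no stemming)."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)


def multi_reference_score(candidate, references, metric=unigram_f1, mode="max"):
    """Aggregate a per-reference metric by best match ("max") or mean ("avg")."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [metric(candidate, ref) for ref in references]
    return max(scores) if mode == "max" else sum(scores) / len(scores)


refs = ["the cat was under the bed", "a cat hid beneath the bed"]
cand = "the cat was found under the bed"
print(round(multi_reference_score(cand, refs, mode="max"), 4))  # 0.9231
print(round(multi_reference_score(cand, refs, mode="avg"), 4))  # 0.6923
```

Taking the maximum rewards a candidate for matching any one acceptable phrasing, while the average penalises it for diverging from the other references.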

Every F1 is independently re-derived with the general F-measure formula, and the reconstructed LCS is validated as a genuine common subsequence of both texts; when these agree the result is flagged “cross-checked”. The numbers reconcile with Google's rouge-score for ASCII English text under matching settings. Because tokenisation, stemming and reference count all change the score, always report ROUGE together with the settings used.
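One way such a validation could look is sketched below: reconstruct an LCS by backtracking through the dynamic-programming table, then confirm it is a subsequence of both token lists. The function names are illustrative and this is not the tool's actual check.

```python
def reconstruct_lcs(a, b):
    """Return one longest common subsequence of token lists a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    out, i, j = [], len(a), len(b)
    while i and j:  # backtrack from the bottom-right corner
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def is_subsequence(sub, seq):
    """True if `sub` appears in `seq` in order (not necessarily adjacent)."""
    it = iter(seq)
    # `token in it` consumes the iterator, so order is enforced.
    return all(token in it for token in sub)


cand = "police shot the gunman dead".split()
ref = "the gunman was shot dead by police".split()
lcs = reconstruct_lcs(cand, ref)
assert is_subsequence(lcs, cand) and is_subsequence(lcs, ref)
print(lcs)  # ['the', 'gunman', 'dead']
```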

Worked examples

Summary overlap → ROUGE-1 F1 0.9231

Candidate: the cat was found under the bed
Reference: the cat was under the bed

Candidate 7 tokens, reference 6 tokens (lowercase, no stemming)
ROUGE-1: clipped overlap 6 → R = 6/6 = 1, P = 6/7 = 0.8571
ROUGE-1 F1 = 2·1·0.8571 / 1.8571 = 0.9231
ROUGE-2: bigram overlap 4 → R = 4/5 = 0.8000, P = 4/6 = 0.6667 → F1 = 0.7273
ROUGE-L: LCS = 6 ('the cat was under the bed') → F1 = 0.9231

Word order matters → ROUGE-L diverges

Candidate: police shot the gunman dead
Reference: the gunman was shot dead by police

Candidate 5 tokens, reference 7 tokens
ROUGE-1: all 5 candidate words are in the reference → P = 1
ROUGE-1 R = 5/7 = 0.7143, F1 = 0.8333 (bag of words rewards it)
ROUGE-L: LCS = 3 ('the gunman dead') — reordering breaks the run
ROUGE-L R = 3/7, P = 3/5 → F1 = 0.5000 (order is penalised)

Sentence reorder → ROUGE-Lsum beats ROUGE-L

Candidate: the cat sat. the dog ran.
Reference: the dog ran. the cat sat.

Same two sentences, swapped order, no stemming
ROUGE-L (one sequence): the swap shortens the LCS to 3 of 6 tokens → F1 = 0.5000
ROUGE-Lsum splits into sentences and matches each independently
Each reference sentence has an exact candidate sentence → all hit
ROUGE-Lsum F1 = 1.0000 — order across sentences is forgiven

Sources & references

The formulas and worked examples on this page were last reconciled against Lin (2004) and Google rouge-score on 2026-06-10. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the ROUGE math fails fast.
