induwara.lk

ROUGE Score Calculator

Paste a generated summary and one or more references to get ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each with precision, recall and F1 — plus the matched n-grams and longest common subsequence so the number is explainable. Matches Google rouge-score, no signup, runs in your browser.

By Induwara Ashinsana · Updated Jun 10, 2026

Runs entirely in your browser — your text is never uploaded, logged, or stored. Method: clipped n-gram overlap (ROUGE-1/2), sentence-level LCS (ROUGE-L) and summary-level union LCS (ROUGE-Lsum), per Lin (2004); tokenisation and Porter stemming reconciled to Google rouge-score. Up to 20,000 characters per box.

How it works

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard automatic metric for text summarisation, introduced by Chin-Yew Lin in 2004. It compares a system's output (the candidate) with one or more human-written reference texts and rewards overlap. This tool computes the four variants reported in most papers — ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each as precision, recall and F1.

  1. Tokenise. Each text is lowercased (optional), stripped of punctuation, split on whitespace, and optionally Porter-stemmed so that cats matches cat. These defaults match Google's rouge-score library.
  2. ROUGE-N overlap. For N = 1 (unigrams) and N = 2 (bigrams), count how many candidate n-grams appear in the reference, clipping each to the number of times it occurs in the reference — overlap = Σ min(count_cand, count_ref). Recall is overlap ÷ reference n-grams; precision is overlap ÷ candidate n-grams. Clipping stops a candidate from gaming the score by repeating one correct word.
  3. ROUGE-L. Find the longest common subsequence (LCS) of the two token sequences with the classic O(m·n) dynamic-programming table. With LCS length L, recall = L ÷ |reference| and precision = L ÷ |candidate|. Because LCS keeps word order without demanding adjacency, ROUGE-L credits fluent rephrasings that ROUGE-2 would miss.
  4. ROUGE-Lsum. Split each text into sentences, take the union of LCS matches for every reference sentence across all candidate sentences, de-duplicate by token count, and divide the hits by the total tokens. For single-sentence input it equals ROUGE-L.
  5. Combine. Each metric's F1 is 2·P·R ÷ (P + R) — the β = 1 case of Lin's general F-measure. With several references, the score is taken against the best reference (max, Lin's recommendation) or averaged.
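
The tokenise, ROUGE-N and F1 steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's own code: the names tokenise, ngrams, f_measure and rouge_n are hypothetical, the tokeniser assumes ASCII text and always lowercases, and Porter stemming is left out.

```python
import re
from collections import Counter


def tokenise(text):
    # Assumption: lowercase, treat every non-alphanumeric run as a space, split.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngrams(tokens, n):
    # Multiset of n-grams; empty when the text has fewer than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(precision, recall):
    # F1 = 2PR / (P + R), defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    cand = ngrams(tokenise(candidate), n)
    ref = ngrams(tokenise(reference), n)
    # Counter intersection keeps min(count_cand, count_ref): the clipped overlap.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    cand = "the cat was found under the bed"
    ref = "the cat was under the bed"
    for n in (1, 2):
        p, r, f = rouge_n(cand, ref, n)
        print(f"ROUGE-{n}: P={p:.4f} R={r:.4f} F1={f:.4f}")
    # Clipping in action: repeating "the" four times earns only one match.
    p, r, f = rouge_n("the the the the", "the cat", 1)
    print(f"clipped: P={p:.2f} R={r:.2f}")  # P=0.25 R=0.50
```

Using Counter intersection for the clipped count keeps the sketch short; an explicit loop over min(count_cand, count_ref) would give the same overlap.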

Every F1 is independently re-derived with the general F-measure formula, and the reconstructed LCS is validated as a genuine common subsequence of both texts; when these agree the result is flagged “cross-checked”. The numbers reconcile with Google's rouge-score for ASCII English text under matching settings. Because tokenisation, stemming and reference count all change the score, always report ROUGE together with the settings used.
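
One way such a cross-check could look is sketched below. It is an assumption about the shape of the check, not the tool's actual module: f_beta, is_subsequence, cross_checked and the tolerance tol are hypothetical names, and the inputs are already-tokenised lists.

```python
def f_beta(precision, recall, beta=1.0):
    # General F-measure: (1 + b^2) * P * R / (R + b^2 * P); beta = 1 gives F1.
    denom = recall + beta ** 2 * precision
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def is_subsequence(sub, seq):
    # True when the tokens of `sub` occur in `seq` in order, gaps allowed.
    remaining = iter(seq)
    return all(token in remaining for token in sub)


def cross_checked(precision, recall, f1, lcs, candidate, reference, tol=1e-9):
    # Both checks must pass: F1 re-derives, and the LCS is common to both texts.
    f1_ok = abs(f1 - f_beta(precision, recall)) <= tol
    lcs_ok = is_subsequence(lcs, candidate) and is_subsequence(lcs, reference)
    return f1_ok and lcs_ok


if __name__ == "__main__":
    cand = "police shot the gunman dead".split()
    ref = "the gunman was shot dead by police".split()
    lcs = ["the", "gunman", "dead"]
    print(cross_checked(3 / 5, 3 / 7, 0.5, lcs, cand, ref))  # True
    print(is_subsequence(["dead", "the"], cand))              # False: wrong order
```

Note that this only confirms the reported LCS is a common subsequence; confirming it is the longest one still needs the dynamic-programming table.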

Worked examples

Summary overlap → ROUGE-1 F1 0.9231

Candidate
the cat was found under the bed
Reference
the cat was under the bed
  1. Candidate 7 tokens, reference 6 tokens (lowercase, no stemming)
  2. ROUGE-1: clipped overlap 6 → R = 6/6 = 1, P = 6/7 = 0.8571
  3. ROUGE-1 F1 = 2·1·0.8571 / 1.8571 = 0.9231
  4. ROUGE-2: clipped bigram overlap 4, against 5 reference and 6 candidate bigrams → R = 4/5 = 0.8, P = 4/6 = 0.6667, F1 = 0.7273
  5. ROUGE-L: LCS = 6 ('the cat was under the bed') → F1 = 0.9231

Word order matters → ROUGE-L diverges

Candidate
police shot the gunman dead
Reference
the gunman was shot dead by police
  1. Candidate 5 tokens, reference 7 tokens
  2. ROUGE-1: all 5 candidate words are in the reference → P = 1
  3. ROUGE-1 R = 5/7 = 0.7143, F1 = 0.8333 (bag of words rewards it)
  4. ROUGE-L: LCS = 3 ('the gunman dead') — reordering breaks the run
  5. ROUGE-L R = 3/7, P = 3/5 → F1 = 0.5000 (order is penalised)
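
The LCS figures in this example can be reproduced with the classic dynamic-programming table. The sketch below is illustrative only: lcs_tokens and rouge_l are hypothetical names, and tokenisation is reduced to lowercasing and whitespace splitting, which is enough for this punctuation-free pair.

```python
def lcs_tokens(a, b):
    # table[i][j] = LCS length of a[:i] and b[:j]; O(m*n) time and space.
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def rouge_l(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_tokens(cand, ref)
    precision = len(lcs) / len(cand) if cand else 0.0
    recall = len(lcs) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return lcs, precision, recall, f1


if __name__ == "__main__":
    lcs, p, r, f = rouge_l("police shot the gunman dead",
                           "the gunman was shot dead by police")
    print(lcs)                                   # ['the', 'gunman', 'dead']
    print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")     # P=0.6000 R=0.4286 F1=0.5000
```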

Sentence reorder → ROUGE-Lsum beats ROUGE-L

Candidate
the cat sat. the dog ran.
Reference
the dog ran. the cat sat.
  1. Same two sentences, swapped order, no stemming
  2. ROUGE-L (whole text as one sequence): the swap cuts the LCS to 3 of 6 tokens ('the cat sat') → R = P = 3/6, F1 = 0.5000
  3. ROUGE-Lsum splits into sentences and matches each independently
  4. Each reference sentence has an exact candidate sentence → all hit
  5. ROUGE-Lsum F1 = 1.0000 — order across sentences is forgiven
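
A union-LCS sketch that reproduces this result follows. It is written under stated assumptions rather than taken from the tool: split_sentences, lcs_ref_indices and rouge_lsum are hypothetical names, sentences are assumed to end at '.', '!', '?' or a newline, and stemming is omitted.

```python
import re
from collections import Counter


def tokenise(text):
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!', '?' or a newline.
    sentences = (tokenise(part) for part in re.split(r"[.!?\n]+", text))
    return [s for s in sentences if s]


def lcs_ref_indices(ref, cand):
    # Indices into `ref` of one longest common subsequence with `cand`.
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    idx, i, j = [], m, n
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            idx.append(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return idx[::-1]


def rouge_lsum(candidate, reference):
    cand_sents, ref_sents = split_sentences(candidate), split_sentences(reference)
    n_cand = sum(map(len, cand_sents))
    n_ref = sum(map(len, ref_sents))
    if n_cand == 0 or n_ref == 0:
        return 0.0, 0.0, 0.0
    # Remaining token budgets, so no token is credited more often than it occurs.
    cand_left = Counter(t for s in cand_sents for t in s)
    ref_left = Counter(t for s in ref_sents for t in s)
    hits = 0
    for ref_sent in ref_sents:
        # Union of LCS positions of this reference sentence over all candidates.
        union = set()
        for cand_sent in cand_sents:
            union.update(lcs_ref_indices(ref_sent, cand_sent))
        for i in sorted(union):
            token = ref_sent[i]
            if cand_left[token] > 0 and ref_left[token] > 0:
                hits += 1
                cand_left[token] -= 1
                ref_left[token] -= 1
    precision, recall = hits / n_cand, hits / n_ref
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


if __name__ == "__main__":
    cand, ref = "the cat sat. the dog ran.", "the dog ran. the cat sat."
    print(rouge_lsum(cand, ref))                                 # (1.0, 1.0, 1.0)
    print(len(lcs_ref_indices(tokenise(ref), tokenise(cand))))   # 3, the flat LCS
```

The last line runs the same LCS routine on the unsplit token sequences, which gives the length-3 match behind the ROUGE-L F1 of 0.5000.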

Sources & references

The formulas and worked examples on this page were last reconciled against Lin (2004) and Google rouge-score on 2026-06-10. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the ROUGE math fails fast.
