ROUGE Score Calculator
Paste a generated summary and one or more references to get ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each with precision, recall and F1 — plus the matched n-grams and longest common subsequence so the number is explainable. Matches Google rouge-score, no signup, runs in your browser.
How it works
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard automatic metric for text summarisation, introduced by Chin-Yew Lin in 2004. It compares a system's output (the candidate) with one or more human-written reference texts and rewards overlap. This tool computes the four variants reported in most papers — ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum — each as precision, recall and F1.
- Tokenise. Both texts are optionally lowercased, stripped of punctuation, split on whitespace, and optionally Porter-stemmed so that "cats" matches "cat". These defaults match Google's rouge-score library.
- ROUGE-N overlap. For N = 1 (unigrams) and N = 2 (bigrams), count how many candidate n-grams appear in the reference, clipping each to the number of times it occurs in the reference: overlap = Σ min(count_cand, count_ref). Recall is overlap ÷ reference n-grams; precision is overlap ÷ candidate n-grams. Clipping stops a candidate from gaming the score by repeating one correct word.
- ROUGE-L. Find the longest common subsequence (LCS) of the two token sequences with the classic O(m·n) dynamic-programming table. With LCS length L, recall = L ÷ |reference| and precision = L ÷ |candidate|. Because LCS keeps word order without demanding adjacency, ROUGE-L credits fluent rephrasings that ROUGE-2 would miss.
- ROUGE-Lsum. Split each text into sentences, take the union of LCS matches for every reference sentence across all candidate sentences, de-duplicate by token count, and divide the hits by the total tokens. For single-sentence input it equals ROUGE-L.
- Combine. Each metric's F1 is 2·P·R ÷ (P + R), the β = 1 case of Lin's general F-measure. With several references, the score is taken against the best reference (max, Lin's recommendation) or averaged.
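The ROUGE-N and ROUGE-L steps above can be sketched in a few lines of Python. This is a minimal illustration, not this tool's code or Google's rouge-score: the function names (tokenise, ngrams, prf, rouge_n, lcs_length, rouge_l) are hypothetical, the tokeniser is a simplified ASCII-only stand-in, and stemming, ROUGE-Lsum and multi-reference handling are left out.

```python
import re
from collections import Counter


def tokenise(text):
    # Simplified stand-in tokeniser (assumption): lowercase, treat every
    # non-alphanumeric run as a separator, split on whitespace. No stemming.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngrams(tokens, n):
    # Multiset of n-grams, so repeated n-grams keep their counts.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(hits, cand_total, ref_total):
    # Precision, recall and F1 (the beta = 1 F-measure), guarding against 0/0.
    p = hits / cand_total if cand_total else 0.0
    r = hits / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(candidate, reference, n):
    cand = ngrams(tokenise(candidate), n)
    ref = ngrams(tokenise(reference), n)
    # Clipped overlap: sum of min(count_cand, count_ref) over candidate n-grams.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Classic O(m*n) dynamic programme, keeping only two rows of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    c, r = tokenise(candidate), tokenise(reference)
    return prf(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    cand = "The cat was found under the bed."
    ref = "The cat was under the bed."
    for name, score in [("ROUGE-1", rouge_n(cand, ref, 1)),
                        ("ROUGE-2", rouge_n(cand, ref, 2)),
                        ("ROUGE-L", rouge_l(cand, ref))]:
        print(name, [round(v, 3) for v in score])
```

For this pair the candidate has 7 tokens and the reference 6. All 6 reference unigrams are matched, so ROUGE-1 is P = 6/7, R = 1 and F1 = 12/13; 4 of the bigrams overlap, so ROUGE-2 is P = 4/6, R = 4/5 and F1 = 8/11; the LCS is the whole reference, so ROUGE-L equals ROUGE-1 here.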
Every F1 is independently re-derived with the general F-measure formula, and the reconstructed LCS is validated as a genuine common subsequence of both texts; when these agree the result is flagged “cross-checked”. The numbers reconcile with Google's rouge-score for ASCII English text under matching settings. Because tokenisation, stemming and reference count all change the score, always report ROUGE together with the settings used.
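A cross-check of this kind might look like the sketch below. It is only an assumed shape for illustration, not the page's actual module: the names f_measure, is_subsequence and cross_checked and the tolerance tol are hypothetical. The general F-measure is written in the form used in Lin (2004), F = (1 + β²)·P·R ÷ (R + β²·P), which reduces to 2·P·R ÷ (P + R) at β = 1.

```python
def f_measure(precision, recall, beta=1.0):
    # General F-measure: (1 + beta^2) * P * R / (R + beta^2 * P).
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (recall + b2 * precision)


def is_subsequence(sub, seq):
    # True if the tokens of `sub` occur in `seq` in the same order
    # (not necessarily adjacent). The shared iterator only moves forward.
    it = iter(seq)
    return all(tok in it for tok in sub)


def cross_checked(precision, recall, f1, lcs_tokens, cand_tokens, ref_tokens,
                  tol=1e-9):
    # Hypothetical check: the reported F1 agrees with the general formula,
    # and the reconstructed LCS really is a subsequence of both texts.
    f_ok = abs(f_measure(precision, recall) - f1) <= tol
    lcs_ok = (is_subsequence(lcs_tokens, cand_tokens)
              and is_subsequence(lcs_tokens, ref_tokens))
    return f_ok and lcs_ok


if __name__ == "__main__":
    cand = "the cat was found under the bed".split()
    ref = "the cat was under the bed".split()
    lcs = ref  # for this pair the LCS is the whole reference
    print(cross_checked(6 / 7, 1.0, 12 / 13, lcs, cand, ref))
```

The subsequence test confirms that the reported tokens form a common subsequence, not that it is the longest one; the length itself still comes from the dynamic-programming table.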
Worked examples
Sources & references
- Lin, C.-Y. (2004) — ROUGE: A Package for Automatic Evaluation of Summaries (ACL Workshop)
- Lin & Och (2004) — Automatic Evaluation of MT Quality Using LCS and Skip-Bigram Statistics (ACL)
- Google Research rouge-score: the reference implementation this tool is reconciled against
- Porter, M. F. (1980) — An algorithm for suffix stripping (the Porter stemmer)
The formulas and worked examples on this page were last reconciled against Lin (2004) and Google rouge-score on 2026-06-10. The calculation module ships with a built-in assertion that re-runs every worked example, so a regression in the ROUGE math fails fast.