Plagiarism Checker — compare two texts for overlap
Paste two pieces of writing and see exactly which phrases are reused. Word-level n-gram overlap with Jaccard and containment scores. Runs entirely in your browser — your text is never uploaded.
How it works
The checker uses the same word n-gram fingerprinting that academic near-duplicate-detection systems have used since the 1990s. The procedure is purely textual — there is no AI model, no embedding, no server call — so the result is reproducible and you can verify the maths by hand on small inputs.
- Tokenise. Each text is split into Unicode word runs using the regular expression [\p{L}\p{N}]+. Letters and digits from any script (Latin, Sinhala, Tamil, Devanagari) round-trip cleanly; punctuation and whitespace are dropped. Tokens are lower-cased by default.
- Generate n-grams. A sliding window of n = 5 consecutive tokens produces every contiguous 5-word run in each text. For a document of k tokens (with k ≥ n) there are k − n + 1 n-grams. You can change n with the slider above.
- Compute set overlap. The Jaccard index is |A ∩ B| / |A ∪ B|, the fraction of distinct n-grams shared. The two one-sided containment scores are |A ∩ B| / |A| and |A ∩ B| / |B|, which are useful when the two texts differ in length.
- Highlight. Every word in A that participates in a shared n-gram is marked, and consecutive marked words collapse into contiguous spans for highlighting. The same is done for B. This lets you read the matched passages in context rather than relying on a percentage alone.
- Cross-check. The Jaccard score is recomputed with an independent sort-and-merge implementation. The badge above the calculator stays green while both methods agree to within 10⁻⁹.
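The five steps above can be sketched in a few dozen lines of Python. This is an illustrative sketch, not the code the page runs: the function names are hypothetical, and because Python's built-in re module has no \p{L} or \p{N} classes, the pattern [^\W_]+ (word characters minus the underscore) stands in as an approximation of [\p{L}\p{N}]+.

```python
import re

# Assumption: approximates the page's [\p{L}\p{N}]+ pattern, since Python's
# built-in `re` lacks \p{..} classes. Matches runs of letters and digits.
WORD_RE = re.compile(r"[^\W_]+")

def tokenise(text, lowercase=True):
    tokens = WORD_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

def ngrams(tokens, n=5):
    # k tokens yield k - n + 1 windows (an empty list when k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def scores(grams_a, grams_b):
    a, b = set(grams_a), set(grams_b)
    shared, union = len(a & b), len(a | b)
    return {
        "jaccard": shared / union if union else 0.0,
        "containment_a": shared / len(a) if a else 0.0,
        "containment_b": shared / len(b) if b else 0.0,
    }

def sorted_unique(grams):
    # Deduplicate by sorting, without relying on set operations.
    out = []
    for g in sorted(grams):
        if not out or out[-1] != g:
            out.append(g)
    return out

def jaccard_sort_merge(grams_a, grams_b):
    # Cross-check: walk two sorted, distinct lists and count the matches.
    a, b = sorted_unique(grams_a), sorted_unique(grams_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0

def highlight_spans(tokens, other_grams, n=5):
    # Mark each token inside a shared n-gram, then merge consecutive marks
    # into half-open (start, end) token spans.
    other = set(other_grams)
    marked = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in other:
            for k in range(i, i + n):
                marked[k] = True
    spans, start = [], None
    for idx, flag in enumerate(marked):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            spans.append((start, idx))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans

tokens_a = tokenise("The quick brown fox jumps over the lazy dog near the river bank.")
tokens_b = tokenise("Yesterday the quick brown fox jumps over the lazy dog again.")
grams_a, grams_b = ngrams(tokens_a), ngrams(tokens_b)
result = scores(grams_a, grams_b)
agrees = abs(result["jaccard"] - jaccard_sort_merge(grams_a, grams_b)) < 1e-9
spans_a = highlight_spans(tokens_a, grams_b)
```

With these two sample sentences, A has 9 distinct 5-grams, B has 7, and 5 are shared, so the Jaccard index is 5/11 while the containment scores are 5/9 and 5/7. Small inputs like this are convenient for checking the arithmetic by hand.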
The headline “Similarity” score on the result tile is the maximum of Jaccard, containment-A-in-B, and containment-B-in-A. That makes the number behave intuitively when one text is much shorter than the other: a short passage copied verbatim from a long article reads close to 100% even though the Jaccard score is low.
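To see why the maximum behaves this way, here is a small self-contained sketch with synthetic data. The function name overlap_scores and the placeholder tokens w0, w1, ... are assumptions made for illustration; a 10-word passage is lifted verbatim from a 1,000-word article in which every word is distinct.

```python
def gram_set(tokens, n=5):
    # Set of distinct n-grams for a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_scores(a, b):
    # a and b are sets of distinct n-grams.
    shared, union = len(a & b), len(a | b)
    jaccard = shared / union if union else 0.0
    cont_a = shared / len(a) if a else 0.0
    cont_b = shared / len(b) if b else 0.0
    return {
        "jaccard": jaccard,
        "containment_a": cont_a,
        "containment_b": cont_b,
        # Headline score as described above: the maximum of the three.
        "similarity": max(jaccard, cont_a, cont_b),
    }

# Synthetic article of 1,000 distinct placeholder words (hypothetical data).
article = [f"w{i}" for i in range(1000)]
passage = article[200:210]  # 10 words copied verbatim

a = gram_set(passage)   # 10 - 5 + 1 = 6 n-grams
b = gram_set(article)   # 1000 - 5 + 1 = 996 n-grams
headline = overlap_scores(a, b)
```

Here all 6 of the passage's 5-grams appear in the article, so containment of A in B is 6/6 = 1, while the Jaccard index is only 6/996 (about 0.6%). Because the union is never smaller than either set, the Jaccard index can never exceed either containment score, so the headline figure is in effect the larger of the two containment scores.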
What this tool is not: a web crawler. It cannot tell you whether a paragraph was copied from a website, a textbook, or another student's submission you haven't pasted in. For that, you still need a service with an index of the open web — Turnitin, Copyscape, and Quetext are the names worth knowing.
Worked examples
Sources & references
- Jaccard index — definition and properties (Wikipedia)
- Broder (1997), On the resemblance and containment of documents
- N-gram — Wikipedia overview of word-level fingerprinting
- Unicode TR-18 — letter and digit character classes used by the tokeniser
The algorithm and worked examples on this page were last reconciled against the listed sources on 2026-05-11. The page is reviewed when the underlying method changes; the Jaccard index itself dates from 1901 and is not expected to change.