induwara.lk

Plagiarism Checker — compare two texts for overlap

Paste two pieces of writing and see exactly which phrases are reused. Word-level n-gram overlap with Jaccard and containment scores. Runs entirely in your browser — your text is never uploaded.

By Induwara Ashinsana · Updated May 11, 2026
Compare two texts (n-gram overlap, client-side)

Text A (up to 100,000 characters): paste the first piece of writing, such as your draft or a student submission.

Text B (up to 100,000 characters): paste the second piece of writing to compare against.

n (slider, currently 5): a run of 5 consecutive words must match. Lower = more sensitive, higher = only catches longer reused passages.

Case sensitive: off by default, so “Colombo” matches “colombo”.

Source: word-level n-gram overlap with the Jaccard index (Wikipedia §Jaccard index) and asymmetric containment (Broder, 1997). Comparison is local to your browser — neither text is uploaded.

How it works

The checker uses the same word n-gram fingerprinting that academic near-duplicate-detection systems have used since the 1990s. The procedure is purely textual — there is no AI model, no embedding, no server call — so the result is reproducible and you can verify the maths by hand on small inputs.

  1. Tokenise. Each text is split into Unicode word runs using the regular expression [\p{L}\p{N}]+. Letters and digits from any script (Latin, Sinhala, Tamil, Devanagari) round-trip cleanly; punctuation and whitespace are dropped. Tokens are lower-cased by default.
  2. Generate n-grams. A sliding window of n = 5 consecutive tokens produces every contiguous 5-word run in each text. A document of k tokens yields k − n + 1 n-grams when k ≥ n and none when k < n; a run that occurs more than once counts once in the sets used for scoring. You can change n with the slider above.
  3. Compute set overlap. With A and B denoting the sets of distinct n-grams in each text, the Jaccard index is |A ∩ B| / |A ∪ B|, the fraction of distinct n-grams shared. The two one-sided containment scores are |A ∩ B| / |A| (A in B) and |A ∩ B| / |B| (B in A), which are useful when the two texts differ in length.
  4. Highlight. Every word in A that participates in a shared n-gram is marked, and consecutive marked words collapse into contiguous spans for highlighting. The same is done for B. This is what lets you read the matched passages in context rather than scrolling through a percentage.
  5. Cross-check. The Jaccard score is recomputed with an independent sort-and-merge implementation. The badge above the calculator stays green while both methods agree to within 10⁻⁹.
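
Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the page's own code: the names `tokenize`, `ngrams`, `ratio`, and `compare` are hypothetical, and the pattern `[^\W_]+` is a standard-library approximation of `[\p{L}\p{N}]+`, which Python's `re` module does not accept directly. Returning 0 for an empty denominator is an assumption, chosen to agree with the "too short" worked example.

```python
import re

# Approximation of the page's [\p{L}\p{N}]+ using only the standard library:
# [^\W_] matches Unicode word characters except the underscore.
WORD_RE = re.compile(r"[^\W_]+")


def tokenize(text: str, case_sensitive: bool = False) -> list[str]:
    """Split text into word tokens, lower-casing unless case_sensitive is set."""
    tokens = WORD_RE.findall(text)
    return tokens if case_sensitive else [t.lower() for t in tokens]


def ngrams(tokens: list[str], n: int = 5) -> set[tuple[str, ...]]:
    """Distinct contiguous n-token runs; empty when there are fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ratio(numerator: int, denominator: int) -> float:
    """Division that yields 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def compare(text_a: str, text_b: str, n: int = 5) -> dict[str, float]:
    """Jaccard index and both containment scores over distinct n-gram sets."""
    a = ngrams(tokenize(text_a), n)
    b = ngrams(tokenize(text_b), n)
    shared = len(a & b)
    return {
        "jaccard": ratio(shared, len(a | b)),
        "a_in_b": ratio(shared, len(a)),
        "b_in_a": ratio(shared, len(b)),
    }


result = compare(
    "I went to Colombo today",
    "Yesterday I went to Colombo with a friend",
    n=3,
)
```

On the trigram example from the worked examples, `result` holds 2/7 for `jaccard`, 2/3 for `a_in_b`, and 2/6 for `b_in_a`.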

The headline “Similarity” score on the result tile is the maximum of Jaccard, containment-A-in-B, and containment-B-in-A. That makes the number behave intuitively when one text is much shorter than the other: a short passage copied verbatim from a long article reads close to 100% even though the Jaccard score is low.
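
The effect of taking the maximum can be seen in a small sketch. The function name `headline_similarity` is hypothetical, and integers stand in for n-grams purely to keep the example short.

```python
def headline_similarity(a: set, b: set) -> float:
    """Maximum of Jaccard and the two containment scores for two n-gram sets."""
    shared = len(a & b)

    def ratio(denominator: int) -> float:
        # Assumed convention: an empty denominator scores 0.
        return shared / denominator if denominator else 0.0

    return max(ratio(len(a | b)), ratio(len(a)), ratio(len(b)))


# A short passage (2 n-grams) copied verbatim out of a long article (50 n-grams).
passage = {0, 1}
article = set(range(50))

jaccard = len(passage & article) / len(passage | article)  # 2 / 50
headline = headline_similarity(passage, article)           # 2 / 2
```

Here the Jaccard score is only 4% because the union is dominated by the long article, while the headline score is 100% because every n-gram of the passage appears in the article.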

What this tool is not: a web crawler. It cannot tell you whether a paragraph was copied from a website, a textbook, or another student's submission you haven't pasted in. For that, you still need a service with an index of the open web — Turnitin, Copyscape, and Quetext are the names worth knowing.

Worked examples

Identical paragraph (sanity check)

  • A and B: the same nine-word sentence — “the quick brown fox jumps over the lazy dog”.
  • n = 5 → each side has 5 five-grams, all shared.
  1. tokens(A) = tokens(B) = 9 words
  2. 5-grams(A) = 5-grams(B) = 5
  3. |A ∩ B| = 5, |A ∪ B| = 5
  4. Jaccard = 5/5 = 100%
  5. Containment(A↔B) = 100%
  6. Headline similarity = 100%
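
The same sanity check can be run against two independent Jaccard implementations, in the spirit of the cross-check step under "How it works". This is a sketch only: the page does not show its sort-and-merge code, so `jaccard_sort_merge` and the other names below are assumptions about one reasonable way to write it.

```python
def jaccard_sets(a: list[str], b: list[str]) -> float:
    """Jaccard index computed with hash sets."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0


def sorted_unique(items: list[str]) -> list[str]:
    """Sort, then drop adjacent duplicates without using a set."""
    out: list[str] = []
    for item in sorted(items):
        if not out or out[-1] != item:
            out.append(item)
    return out


def jaccard_sort_merge(a: list[str], b: list[str]) -> float:
    """Jaccard index computed by a two-pointer merge over sorted unique lists."""
    xs, ys = sorted_unique(a), sorted_unique(b)
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    union = len(xs) + len(ys) - shared
    return shared / union if union else 0.0


words = "the quick brown fox jumps over the lazy dog".split()
grams = [" ".join(words[i:i + 5]) for i in range(len(words) - 5 + 1)]

by_sets = jaccard_sets(grams, grams)
by_merge = jaccard_sort_merge(grams, grams)
agree = abs(by_sets - by_merge) < 1e-9
```

For the nine-word sentence, `grams` has 5 entries and both methods return 1.0, so `agree` is true.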

Partial reuse with trigrams

  • A: “I went to Colombo today” (5 tokens, 3 trigrams).
  • B: “Yesterday I went to Colombo with a friend” (8 tokens, 6 trigrams).
  • n = 3, case-insensitive.
  1. Shared trigrams: {“i went to”, “went to colombo”} = 2
  2. Union = 3 + 6 − 2 = 7
  3. Jaccard = 2 / 7 ≈ 28.57%
  4. Containment A-in-B = 2 / 3 ≈ 66.67%
  5. Containment B-in-A = 2 / 6 ≈ 33.33%
  6. Headline similarity = 66.67%
  7. Highlight in A: “I went to Colombo”
  8. Highlight in B: “I went to Colombo”
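
The highlights in steps 7 and 8 can be reproduced with a short sketch of the marking-and-collapsing step. It works on token indices rather than on the original characters, which is a simplification; the names `ngram_set` and `matched_spans` are hypothetical.

```python
def ngram_set(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Distinct contiguous n-token runs as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def matched_spans(tokens: list[str], shared: set, n: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges, end exclusive, covering shared n-grams."""
    # Mark every token that falls inside at least one shared n-gram.
    marked = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in shared:
            for j in range(i, i + n):
                marked[j] = True

    # Collapse runs of consecutive marked tokens into spans.
    spans: list[tuple[int, int]] = []
    start = None
    for idx, flag in enumerate(marked):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            spans.append((start, idx))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


a = "i went to colombo today".split()
b = "yesterday i went to colombo with a friend".split()
shared = ngram_set(a, 3) & ngram_set(b, 3)

spans_a = matched_spans(a, shared, 3)
spans_b = matched_spans(b, shared, 3)
text_a = [" ".join(a[s:e]) for s, e in spans_a]
text_b = [" ".join(b[s:e]) for s, e in spans_b]
```

The two overlapping shared trigrams merge into a single four-word span on each side, which is why the highlight reads "I went to Colombo" rather than two separate three-word fragments.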

Edge — text too short for chosen n

  • A: “one two” (2 tokens).
  • B: “three four five six seven” (5 tokens).
  • n = 5.
  1. 5-grams(A) = 0 (needs ≥ 5 tokens)
  2. 5-grams(B) = 1
  3. |A ∩ B| = 0
  4. Jaccard = 0, Containment(A↔B) = 0
  5. Headline similarity = 0% (tool surfaces a “too short” hint)
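
The arithmetic behind this edge case can be sketched as below. The helper names are hypothetical, and scoring an empty denominator as 0 is an assumption inferred from the result shown in step 4, not something the page states about its code.

```python
def ngram_count(k: int, n: int) -> int:
    """Number of n-grams in a k-token text: k - n + 1, floored at zero."""
    return max(0, k - n + 1)


def safe_ratio(shared: int, total: int) -> float:
    """Score that falls back to 0.0 instead of dividing by zero."""
    return shared / total if total else 0.0


def too_short(k: int, n: int) -> bool:
    """True when a text has fewer tokens than the chosen n."""
    return k < n


n = 5
grams_a = ngram_count(2, n)  # "one two"
grams_b = ngram_count(5, n)  # "three four five six seven"
shared = 0

jaccard = safe_ratio(shared, grams_a + grams_b - shared)
a_in_b = safe_ratio(shared, grams_a)  # would be 0 / 0 without the guard
b_in_a = safe_ratio(shared, grams_b)
hint_needed = too_short(2, n) or too_short(5, n)
```

Text A yields no 5-grams, so every score is 0 and `hint_needed` is true, matching the "too short" hint described in step 5.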

Sources & references

The algorithm and worked examples on this page were last reconciled against the listed sources on 2026-05-11. The page is reviewed when the underlying method changes; the Jaccard index itself dates from 1901 and is not expected to.
