Question 1

Does this plagiarism checker scan the internet?

Accepted Answer

No. It compares two texts you paste against each other. It does not crawl Google, Bing, or any academic database. Treat it as a tool for catching duplicated paragraphs between drafts, between a student submission and a reference document, or between a translation and its source — not a substitute for a paid web-scale service like Turnitin.

Question 2

How does the n-gram overlap method work?

Accepted Answer

Each text is split into Unicode word tokens. A sliding window of n consecutive words (default 5) generates the set of n-grams for each side. The checker reports Jaccard similarity (shared n-grams ÷ unique n-grams across both sides) and one-sided containment (shared n-grams ÷ n-grams in A, and shared n-grams ÷ n-grams in B). When one document is much shorter than the other, containment usually tells you more than Jaccard.
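The steps above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the checker's actual source: the function names are hypothetical, and lowercasing and the inclusion of combining marks (\p{M}) in the token pattern are assumptions.

```javascript
// Split text into word tokens made of Unicode letters, digits and marks.
// Assumptions: case is folded, and \p{M} keeps combining marks attached.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}\p{M}]+/gu) ?? [];
}

// Slide a window of n tokens over the list and collect the unique n-grams.
function ngramSet(tokens, n = 5) {
  const grams = new Set();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

// Return the shared n-gram count, Jaccard, and both containment scores.
function compare(textA, textB, n = 5) {
  const a = ngramSet(tokenize(textA), n);
  const b = ngramSet(tokenize(textB), n);
  let shared = 0;
  for (const gram of a) {
    if (b.has(gram)) shared++;
  }
  const union = a.size + b.size - shared;
  return {
    shared,
    jaccard: union === 0 ? 0 : shared / union,
    containmentA: a.size === 0 ? 0 : shared / a.size, // share of A found in B
    containmentB: b.size === 0 ? 0 : shared / b.size, // share of B found in A
  };
}

const result = compare(
  "the quick brown fox jumps over the lazy dog",
  "a quick brown fox jumps over the sleeping cat",
  3
);
console.log(result); // 4 shared trigrams out of 7 on each side, 10 in the union
```

If a text has fewer than n words, its n-gram set is empty and the sketch returns 0 for every score instead of dividing by zero.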

Question 3

What n-gram size should I pick?

Accepted Answer

Five is the default and the academic norm for prose. Drop to 3 or 4 when texts are short (a few sentences) or paraphrasing is suspected. Raise to 7 or above when you only want to flag long verbatim copies. Smaller n is more sensitive but produces more false positives on common phrases like “in order to” or “according to the”.
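To see the trade-off, here is a small sketch with two unrelated sentences that happen to share a common phrase. The helper name and the sample sentences are made up for illustration, and the token pattern is an assumption.

```javascript
// List the n-grams that two texts have in common (hypothetical helper).
function sharedNgrams(textA, textB, n) {
  const grams = (text) => {
    const tokens = text.toLowerCase().match(/[\p{L}\p{N}\p{M}]+/gu) ?? [];
    const set = new Set();
    for (let i = 0; i + n <= tokens.length; i++) {
      set.add(tokens.slice(i, i + n).join(" "));
    }
    return set;
  };
  const a = grams(textA);
  const b = grams(textB);
  return [...a].filter((gram) => b.has(gram));
}

const first = "We ran the survey twice in order to reduce sampling bias.";
const second = "She left early in order to catch the last train home.";

console.log(sharedNgrams(first, second, 3)); // the stock phrase "in order to" matches
console.log(sharedNgrams(first, second, 5)); // no five-word run is shared
```

With n = 3 the two sentences look related only because of a stock phrase; with n = 5 the false positive disappears.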

Question 4

What similarity percentage counts as plagiarism?

Accepted Answer

There is no universal cut-off. As a rule of thumb for academic prose: under 15% is usually noise (shared common phrases), 15–30% warrants a look, and over 30% with long contiguous matched spans is a strong red flag. Always read the highlighted passages — a 40% score made up of common phrases differs from a 40% score made of one long copied paragraph.
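If you want to apply this rule of thumb in your own script, a rough sketch could look like the following. Both function names are hypothetical, the band labels are paraphrased from the answer above, and the span measure is a simple heuristic (consecutive n-grams of A that all occur somewhere in B), not the checker's highlighting logic.

```javascript
// Map a similarity percentage to the rule-of-thumb bands described above.
function bandFor(percent) {
  if (percent < 15) return "likely noise";
  if (percent <= 30) return "worth a look";
  return "red flag if the matches are long and contiguous";
}

// Length in words of the longest stretch of A whose n-grams all appear in B.
// A run of k consecutive matching n-grams covers k + n - 1 words.
function longestMatchedSpan(tokensA, tokensB, n = 5) {
  const inB = new Set();
  for (let i = 0; i + n <= tokensB.length; i++) {
    inB.add(tokensB.slice(i, i + n).join(" "));
  }
  let run = 0;
  let best = 0;
  for (let i = 0; i + n <= tokensA.length; i++) {
    run = inB.has(tokensA.slice(i, i + n).join(" ")) ? run + 1 : 0;
    best = Math.max(best, run);
  }
  return best === 0 ? 0 : best + n - 1;
}

const tokensA = "x y a b c d e f z".split(" ");
const tokensB = "q a b c d e f r".split(" ");
console.log(bandFor(40));                            // top band
console.log(longestMatchedSpan(tokensA, tokensB, 3)); // 6 words: a b c d e f
```

Two comparisons with the same percentage can return very different span lengths, which is why the score alone is not enough.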

Question 5

Is my text uploaded anywhere?

Accepted Answer

No. All processing happens in your browser using JavaScript. The text never leaves your device, is never stored, and is never logged. You can verify this by opening DevTools → Network and confirming no requests are made while you type.

Question 6

Does it work with Sinhala or Tamil text?

Accepted Answer

Yes. Tokenisation uses Unicode letter and digit classes (\p{L} and \p{N}), so Sinhala, Tamil, Devanagari, Arabic, and Latin scripts all tokenise correctly. Comparing Sinhala against English is meaningless because the words will not match, but Sinhala against Sinhala works exactly the same way as English against English.
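A script-agnostic tokenizer along these lines can be sketched with JavaScript's Unicode property escapes. The function name is hypothetical. One assumption goes beyond the classes named above: the pattern also admits \p{M} (combining marks), because Tamil and Sinhala vowel signs belong to that category and a pattern limited to \p{L} and \p{N} would cut words apart at those signs.

```javascript
// Hypothetical tokenizer. \p{M} is an added assumption so that vowel signs
// and other combining marks stay attached to their base letters.
const WORD = /[\p{L}\p{N}\p{M}]+/gu;

function tokenize(text) {
  return text.match(WORD) ?? [];
}

console.log(tokenize("Plagiarism checker 2024")); // three Latin-script tokens
console.log(tokenize("தமிழ் மொழி"));               // Tamil: two words
console.log(tokenize("සිංහල භාෂාව"));             // Sinhala: two words
```

Because both sides are tokenised by the same rule, matching works identically whichever script you paste.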

Question 7

What does “containment” mean vs Jaccard?

Accepted Answer

Jaccard treats the two texts symmetrically: it tells you what fraction of the combined set of n-grams is shared. Containment is asymmetric: containment of A in B asks “how much of A is reused inside B”. When you compare a short student submission against a long source article, containment of the submission inside the article is the more honest measure of copying, because the article's many unmatched n-grams inflate the Jaccard denominator.
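A small numeric sketch makes the difference concrete. The n-gram sets here are synthetic placeholders (20 n-grams in the submission, all of them also present among the article's 400), chosen only to show the arithmetic.

```javascript
// Synthetic n-gram sets: a short submission copied entirely from a long article.
const submission = new Set();
for (let i = 0; i < 20; i++) submission.add(`gram-${i}`);

const article = new Set();
for (let i = 0; i < 400; i++) article.add(`gram-${i}`);

let shared = 0;
for (const gram of submission) {
  if (article.has(gram)) shared++;
}

const union = submission.size + article.size - shared;     // 400
const jaccard = shared / union;                            // 20 / 400 = 0.05
const containmentOfSubmission = shared / submission.size;  // 20 / 20  = 1
const containmentOfArticle = shared / article.size;        // 20 / 400 = 0.05

console.log({ jaccard, containmentOfSubmission, containmentOfArticle });
```

The submission is copied in full, yet Jaccard reports only 5%. Containment of the submission in the article reports 100%, which is the figure that matters here.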

Question 8

What is the maximum length I can paste?

Accepted Answer

Each side is capped at 100,000 characters — roughly 15,000 words, or a 60-page paper. Longer inputs are trimmed with a warning. For very large jobs, split the document into sections and compare them one at a time.
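If you prepare inputs in your own script before pasting, the cap and the section-by-section approach might be sketched like this. Both helper names are hypothetical, the sketch measures length in UTF-16 code units (JavaScript's string length), which may differ from how the checker counts characters, and splitting on blank lines is just one reasonable way to find section boundaries.

```javascript
const MAX_CHARS = 100000; // the per-side cap stated above

// Trim over-long input and report whether trimming happened,
// so the caller can show a warning.
function capInput(text, maxChars = MAX_CHARS) {
  if (text.length <= maxChars) return { text, trimmed: false };
  return { text: text.slice(0, maxChars), trimmed: true };
}

// Group paragraphs (separated by blank lines) into chunks that fit the cap.
function splitIntoChunks(text, maxChars = MAX_CHARS) {
  const chunks = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    const candidate = current ? current + "\n\n" + para : para;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = para; // a single over-long paragraph is left for capInput to trim
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(capInput("x".repeat(100001)).trimmed);     // true
console.log(splitIntoChunks("aaaa\n\nbbbb\n\ncccc", 10)); // two chunks
```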

Plagiarism Checker — compare two texts for overlap

