induwara.lk

AI Text Similarity Checker — Semantic, Server-Side

Paste two passages and get a 0–100% semantic similarity score from sentence-transformer embeddings — the same model researchers use to benchmark paraphrase detection. A sentence-by-sentence heatmap shows exactly which lines match. Free, no signup, sources cited.

By Induwara Ashinsana · Updated May 12, 2026

Option: lowercase the text before embedding (recommended for prose).

Embedding runs on the server via the Hugging Face Inference API. Your text is sent once for scoring, not stored, not logged.

What this does

Paste any two texts and see how semantically close they are. A sentence-transformer running on the server converts every sentence to a 384-dimensional vector and reports cosine similarity. Catches paraphrases that word-overlap checkers miss.

Methodology: cosine similarity on mean-pooled sentence embeddings from all-MiniLM-L6-v2 (max 256 tokens per sentence). Lexical Jaccard cross-checks the result. Sources cited below.

How it works

The page sends your two texts to a server-side route, which calls the Hugging Face Inference API to embed every sentence (and each whole document) with sentence-transformers/all-MiniLM-L6-v2. Token embeddings are mean-pooled into one 384-dimensional vector per input and L2-normalised, so cosine similarity reduces to a plain dot product. The same route also computes a lexical Jaccard score as a transparent cross-check.

1. Sentence segmentation

Each text is split on terminal punctuation (.!?) followed by whitespace or a newline; double newlines always force a break. Fragments shorter than four characters are dropped as punctuation noise. You can opt into a newline-only splitter when the input is a bullet list or a poem — anything where end-of-sentence punctuation is unreliable.
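The splitting rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the page's actual code: the function name split_sentences, the newline_only flag and the exact regular expressions are hypothetical.

```python
import re

# Break after ., ! or ? when whitespace (including a newline) follows.
# The lookbehind keeps the punctuation attached to its sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A blank line (double newline) always forces a break.
_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")

MIN_FRAGMENT_CHARS = 4  # shorter fragments are dropped as punctuation noise


def split_sentences(text: str, newline_only: bool = False) -> list[str]:
    """Split text into sentences using the rules described above."""
    if newline_only:
        # Bullet lists and poems: trust line breaks, ignore punctuation.
        pieces = text.splitlines()
    else:
        pieces = []
        for paragraph in _PARAGRAPH_BREAK.split(text):
            pieces.extend(_SENTENCE_END.split(paragraph))
    cleaned = (piece.strip() for piece in pieces)
    return [piece for piece in cleaned if len(piece) >= MIN_FRAGMENT_CHARS]


print(split_sentences("Hello world. How are you?\n\nFine! Ok."))
# ['Hello world.', 'How are you?', 'Fine!']
```

A punctuation-based splitter like this one also breaks after abbreviations such as "e.g.", which is one reason a newline-only mode is useful for irregular input.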

2. Embedding (server-side)

Each sentence and each full document is sent in a single batched request to api-inference.huggingface.co. The sentence-transformers pipeline applies the model's WordPiece tokenizer (max 256 tokens; longer inputs are truncated), runs the MiniLM encoder, mean-pools the token embeddings masked by the attention mask, and L2-normalises the result. The API returns one 384-dim unit vector per input. Your browser never downloads model weights.
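The pooling and normalisation step can be illustrated in plain Python. This is a toy sketch on 2-dimensional vectors, not the sentence-transformers implementation; the names mean_pool and l2_normalise are hypothetical, and the real model produces 384 dimensions per token.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                total[i] += value
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / kept for value in total]


def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in vector]


# Three token vectors; the third is padding and is masked out.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # [3.0, 4.0]
unit = l2_normalise(pooled)        # approximately [0.6, 0.8]
```

Masking matters because padded positions would otherwise drag the average toward whatever values the padding tokens carry.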

3. Cosine similarity

For unit-length vectors, cosine similarity = dot product:

cosine(a, b) = Σ aᵢ · bᵢ        (vectors are unit-norm, so no division)

display% = max(0, cosine) × 100   (negatives clamp to 0)

The page returns two cosines side-by-side: overall (the dot product of the two whole-document vectors) and sentence-pair (the symmetric mean of best-match scores across the sentence-by-sentence matrix — the F1-style aggregation recommended in the Sentence-BERT paper). Pick the one that matches your question: overall for "is this paragraph saying the same thing?", sentence-pair for "does every sentence here line up with one over there?".
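Both scores can be sketched on toy unit vectors. This is one reading of "symmetric mean of best-match scores" (average the best match for each sentence in A, do the same for B, then average the two directions); the page's exact aggregation may differ, and the function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def display_percent(cosine: float) -> float:
    """Clamp negatives to 0 and scale to a 0-100 percentage."""
    return max(0.0, cosine) * 100


def sentence_pair_score(sentences_a: list[list[float]], sentences_b: list[list[float]]) -> float:
    """Average of the two directional best-match means (assumed aggregation)."""
    matrix = [[dot(a, b) for b in sentences_b] for a in sentences_a]
    best_for_a = [max(row) for row in matrix]           # best match in B for each A sentence
    best_for_b = [max(col) for col in zip(*matrix)]     # best match in A for each B sentence
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2


# Toy 2-dimensional unit vectors standing in for 384-dimensional embeddings.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [0.6, 0.8]]

score = sentence_pair_score(doc_a, doc_b)   # approximately 0.9
percent = display_percent(score)            # approximately 90
```

Averaging both directions keeps the score symmetric: swapping Text A and Text B gives the same result.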

4. Interpretation bands

The thresholds below come from STS-Benchmark literature and the model card's reported Spearman correlation of 0.82 on STS-B. They are closed-open intervals — i.e., 0.85 is a near-duplicate, 0.849 is highly similar.

  • ≥ 0.85 → Near-duplicate (same idea, near-identical wording)
  • 0.70–0.85 → Highly similar
  • 0.50–0.70 → Moderately similar
  • 0.30–0.50 → Loosely related
  • < 0.30 → Different topics
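The closed-open intervals translate directly into a lookup, sketched here in Python. The names BANDS and band are hypothetical; the cutoffs and labels are the ones listed above.

```python
# Lower bounds are inclusive, so 0.85 is a near-duplicate and 0.849 is not.
BANDS = [
    (0.85, "Near-duplicate"),
    (0.70, "Highly similar"),
    (0.50, "Moderately similar"),
    (0.30, "Loosely related"),
]


def band(cosine: float) -> str:
    """Return the interpretation band for a cosine similarity score."""
    for lower_bound, label in BANDS:
        if cosine >= lower_bound:
            return label
    return "Different topics"


print(band(0.85))   # Near-duplicate
print(band(0.849))  # Highly similar
```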

5. Lexical Jaccard (cross-check)

Alongside the semantic score, the page computes Jaccard similarity over lowercased word sets. When the semantic cosine is high but Jaccard is low, you have a paraphrase — same meaning, different words. When both are high, the texts are close textually. When both are low, they really are about different things. Watching these two numbers together is the fastest way to judge what the headline is really telling you.
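A minimal Python sketch of the cross-check follows. The tokenisation rule (runs of letters, digits and apostrophes) and the high/low cutoffs in cross_check are assumptions made for illustration; the page does not state how it extracts words or where it draws those lines.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and collect its distinct words (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(text_a: str, text_b: str) -> float:
    """Shared words divided by all distinct words across both texts."""
    a, b = word_set(text_a), word_set(text_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cross_check(cosine: float, lexical: float, high: float = 0.70, low: float = 0.30) -> str:
    """Combine the two scores; the cutoffs are hypothetical defaults."""
    if cosine >= high and lexical < low:
        return "paraphrase"
    if cosine >= high and lexical >= high:
        return "textually close"
    if cosine < low and lexical < low:
        return "different topics"
    return "mixed signal"


# {the, quick, brown, fox} vs {the, quick, red, fox}: 3 shared / 5 distinct
print(jaccard("The quick brown fox", "the quick red fox"))  # 0.6
```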

Worked examples

Paraphrase — high semantic, low lexical

Text A

The cat sat quietly on the wool mat next to the window.

Text B

A feline rested silently on the woollen rug beside the window.

  1. Encoder produces a 384-dim unit vector per sentence (server-side)
  2. Overall cosine ≈ 0.82 → Highly similar band (0.70–0.85), just under the near-duplicate cutoff
  3. Lexical Jaccard: {the, cat, sat, quietly, on, wool, mat, next, to, window} ∩ {a, feline, rested, silently, on, the, woollen, rug, beside, window} = {the, on, window}; 3 shared / 17 distinct words ≈ 0.18 → 18%
  4. Verdict: semantic & lexical disagree → it is a paraphrase, not a copy

Opposite sentiment — both score high

Text A

I love programming in TypeScript. The type system is wonderful and the tooling makes every refactor safer.

Text B

I hate programming in TypeScript. The type system is awful and the tooling makes every refactor more painful.

  1. Overall cosine ≈ 0.72–0.78 → Highly similar
  2. Same topic (TypeScript), same syntactic shape — the encoder reads topic and structure, not polarity
  3. Documented limitation of mean-pooled sentence embeddings
  4. Verdict: if polarity matters, run the sentiment analyzer alongside this tool

Unrelated — low on both scales

Text A

Sri Lanka exports Ceylon tea to over a hundred countries each year.

Text B

The Fibonacci sequence appears throughout nature, from sunflower seeds to nautilus shells.

  1. Overall cosine ≈ 0.05–0.20 → Different topics
  2. Jaccard ≈ 4% (stopwords are kept in the set, so the only shared word is "to": 1 / 23)
  3. Verdict: model and lexical agree — no meaningful overlap

Sources & references

The model card, API endpoint, interpretation thresholds, and source paper were last cross-checked on 2026-05-12. The Hugging Face model files and Inference API change independently — when an upstream patch lands, server responses pick it up on the next call. Inputs shorter than 30 characters are rejected client-side so the score has enough context to be meaningful.
