AI Text Similarity Checker — Semantic, Server-Side
Paste two passages and get a 0–100% semantic similarity score computed from sentence-transformer embeddings (sentence-transformers/all-MiniLM-L6-v2). A sentence-by-sentence heatmap shows exactly which lines match. Free, no signup, sources cited.
How it works
The page sends your two texts to a server-side route, which calls the Hugging Face Inference API to embed every sentence (and each whole document) with sentence-transformers/all-MiniLM-L6-v2. Token embeddings are mean-pooled into one 384-dimensional vector per input and then L2-normalised, so cosine similarity reduces to a plain dot product. The same route also computes a lexical Jaccard score as a transparent cross-check.
1. Sentence segmentation
Each text is split on terminal punctuation (.!?) followed by whitespace or a newline; double newlines always force a break. Fragments shorter than four characters are dropped as punctuation noise. You can opt into a newline-only splitter when the input is a bullet list or a poem — anything where end-of-sentence punctuation is unreliable.
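The splitting rules above can be sketched in a few lines of Python. This is an illustration only: the regular expression, the function name `split_sentences`, and the `newline_only` flag are assumptions read off the description, not the page's actual code.

```python
import re

# Assumed rule set, reconstructed from the prose description:
#   - break after . ! or ? when followed by whitespace (including newlines)
#   - a run of two or more newlines always forces a break
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n{2,}")
MIN_FRAGMENT_CHARS = 4  # fragments shorter than this are treated as noise


def split_sentences(text: str, newline_only: bool = False) -> list[str]:
    """Split text into sentences; newline_only=True suits bullet lists or poems."""
    if newline_only:
        parts = text.split("\n")
    else:
        parts = _SENTENCE_BREAK.split(text)
    cleaned = (part.strip() for part in parts)
    return [part for part in cleaned if len(part) >= MIN_FRAGMENT_CHARS]


print(split_sentences("Hello world. How are you? Ok.\n\nFine then"))
print(split_sentences("roses are red\nviolets are blue", newline_only=True))
```

In the first call the fragment "Ok." is only three characters long, so it is dropped, while "Fine then" survives because the double newline forces a break even without terminal punctuation.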
2. Embedding (server-side)
All sentences and both full documents are sent together in a single batched request to api-inference.huggingface.co. The sentence-transformers pipeline applies the model's WordPiece tokenizer (max 256 tokens; longer inputs are truncated), runs the MiniLM encoder, mean-pools the token embeddings using the attention mask so that padding tokens are ignored, and L2-normalises the result. The API returns one 384-dim unit vector per input. Your browser never downloads model weights.
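To make the pooling and normalisation steps concrete, here is a minimal pure-Python sketch on toy 2-dimensional vectors. The real pipeline does this inside the model server on 384-dimensional tensors; the function names `mean_pool` and `l2_normalise` and the toy data are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalise(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


# Toy input: two real tokens plus one padding token that the mask excludes.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # padding row does not affect the mean
unit = l2_normalise(pooled)
print(pooled, unit)
```

Because the padding row is masked out, the pooled vector is the mean of the first two rows only, and the normalised result has length 1.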
3. Cosine similarity
For unit-length vectors, cosine similarity = dot product:
cosine(a, b) = Σ aᵢ · bᵢ (vectors are unit-norm, so no division)
display% = max(0, cosine) × 100 (negatives clamp to 0)
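The same two formulas, written as a short Python sketch. It assumes both inputs are already unit-length; the helper names `dot` and `display_percent` are illustrative.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def display_percent(cosine):
    """Map a cosine to the 0-100 display scale, clamping negatives to 0."""
    return max(0.0, cosine) * 100.0


a = [0.6, 0.8]   # unit vector
b = [0.8, 0.6]   # unit vector
print(display_percent(dot(a, b)))   # roughly 96
print(display_percent(-0.2))        # negative cosine clamps to 0.0
```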
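The page returns two cosines side-by-side: overall (the dot product of the two whole-document vectors) and sentence-pair (the symmetric mean of best-match scores across the sentence-by-sentence matrix). For sentence-pair, each sentence in the first text is matched to its highest-scoring sentence in the second and those scores are averaged, the same is done in the other direction, and the two directional averages are then averaged. This is a greedy-matching aggregation in the spirit of averaging precision and recall. Pick the one that matches your question: overall for "is this paragraph saying the same thing?", sentence-pair for "does every sentence here line up with one over there?".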
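A possible reading of the sentence-pair score, as a sketch. It assumes "symmetric mean" means the arithmetic mean of the two directional best-match averages; the page's exact aggregation may differ, and `sentence_pair_score` is a hypothetical name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (cosine for unit vectors)."""
    return sum(x * y for x, y in zip(a, b))


def sentence_pair_score(vectors_a, vectors_b):
    """Symmetric mean of best-match cosines between two lists of unit vectors."""
    if not vectors_a or not vectors_b:
        raise ValueError("both texts need at least one sentence vector")
    # Rows are sentences of text A, columns are sentences of text B.
    matrix = [[dot(a, b) for b in vectors_b] for a in vectors_a]
    best_for_a = [max(row) for row in matrix]           # A -> best match in B
    best_for_b = [max(col) for col in zip(*matrix)]     # B -> best match in A
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2


# Text A has two sentences, text B has one that matches only the first.
text_a = [[1.0, 0.0], [0.0, 1.0]]
text_b = [[1.0, 0.0]]
print(sentence_pair_score(text_a, text_b))
```

In this toy case every sentence in B finds a perfect match in A, but only half of A finds a match in B, so the score lands between the two directional averages.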
4. Interpretation bands
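The thresholds below come from STS-Benchmark literature and the model card's reported Spearman correlation of 0.82 on STS-B. The values are cosines, so 0.85 corresponds to a displayed 85%. Each band is a closed-open interval: its lower bound is included and its upper bound is not, so 0.85 is a near-duplicate and 0.849 is highly similar.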
- ≥ 0.85 → Near-duplicate (same idea, near-identical wording)
- 0.70 – 0.85 → Highly similar
- 0.50 – 0.70 → Moderately similar
- 0.30 – 0.50 → Loosely related
- < 0.30 → Different topics
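The bands above translate directly into a small lookup. The sketch below assumes the input is the raw cosine (0 to 1), and the function name `interpret` is illustrative; checking from the highest threshold down gives the closed-open behaviour described.

```python
def interpret(cosine: float) -> str:
    """Return the interpretation band for a cosine similarity score."""
    if cosine >= 0.85:
        return "Near-duplicate"
    if cosine >= 0.70:
        return "Highly similar"
    if cosine >= 0.50:
        return "Moderately similar"
    if cosine >= 0.30:
        return "Loosely related"
    return "Different topics"


for score in (0.85, 0.849, 0.50, 0.29):
    print(score, interpret(score))
```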
5. Lexical Jaccard (cross-check)
Alongside the semantic score, the page computes Jaccard similarity over lowercased word sets. When the semantic cosine is high but Jaccard is low, you have a paraphrase — same meaning, different words. When both are high, the texts are close textually. When both are low, they really are about different things. Watching these two numbers together is the fastest way to judge what the headline is really telling you.
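A minimal sketch of Jaccard similarity over lowercased word sets. The tokenisation (runs of word characters via `\w+`) and the choice to return 0.0 when both texts are empty are assumptions; the page does not specify either.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and collect its distinct words (assumed tokenisation)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(text_a: str, text_b: str) -> float:
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    a, b = word_set(text_a), word_set(text_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty texts score 0 rather than 1
    return len(a & b) / len(union)


# Shared words: {"the", "cat"}; combined: {"the", "cat", "sat", "ran"}.
print(jaccard("The cat sat", "the cat ran"))
```

A paraphrase pair would typically score low here even when its semantic cosine is high, which is exactly the contrast the cross-check is meant to expose.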
Sources & references
- Reimers & Gurevych, 2019 — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084)
- Hugging Face — sentence-transformers/all-MiniLM-L6-v2 (model card)
- Hugging Face Inference API — feature-extraction task documentation
- MTEB — STS-Benchmark dataset (interpretation thresholds)
- sentence-transformers — official documentation
The model card, API endpoint, interpretation thresholds, and source paper were last cross-checked on 2026-05-12. The Hugging Face model files and Inference API change independently — when an upstream patch lands, server responses pick it up on the next call. Inputs shorter than 30 characters are rejected client-side so the score has enough context to be meaningful.