induwara.lk

Jaccard Similarity Calculator

Find the Jaccard index between two sets, two texts, or two binary vectors, in your browser. See the similarity score, the Jaccard distance, the intersection and union members, and the substituted formula behind every result. No signup, nothing uploaded.

By Induwara Ashinsana · Updated Jun 10, 2026
Jaccard similarity: 0.6000 (60.00%, range 0–1)
Jaccard distance: 0.4000 (1 − J)
|A ∩ B| / |A ∪ B| = 3 / 5
|A| = 4 · |B| = 4
Interpretation: Moderately similar

J = |A ∩ B| / |A ∪ B| = 3 / 5 = 0.6000
distance = 1 − 0.6000 = 0.4000

Cross-check. The direct intersection-over-union gives 3 / 5 = 0.6000; the independent inclusion–exclusion form |A ∩ B| / (|A| + |B| − |A ∩ B|) = 3 / (4 + 4 − 3) also gives 0.6000. The two agree, as they must.
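
The reconciliation above can be sketched in a few lines of Python. This is only an illustration using the word sets from the example; the function name `jaccard_two_ways` is hypothetical and is not the tool's actual code.

```python
import math


def jaccard_two_ways(a: set, b: set) -> tuple[float, float]:
    """Return J computed with the union built directly and via inclusion-exclusion."""
    inter = len(a & b)
    union_direct = len(a | b)                  # count the distinct members
    union_incl_excl = len(a) + len(b) - inter  # |A| + |B| - |A ∩ B|
    if union_direct == 0:                      # both sets empty: J is defined as 0
        return 0.0, 0.0
    return inter / union_direct, inter / union_incl_excl


a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "red", "fox"}
direct, derived = jaccard_two_ways(a, b)
assert math.isclose(direct, derived)           # the two routes must agree
print(f"{direct:.4f} {derived:.4f}")           # 0.6000 0.6000
```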

Set breakdown

Set A (4): the, quick, brown, fox
Set B (4): the, quick, red, fox
Intersection A ∩ B (3): the, quick, fox
Union A ∪ B (5): the, quick, brown, fox, red

Method: J = |A ∩ B| / |A ∪ B|; distance = 1 − J — the scikit-learn jaccard_score and Jaccard (1912) definition. The empty-union case is defined as J = 0, not NaN. Nothing leaves this page.

How it works

The Jaccard similarity coefficient — also called the Jaccard index or coefficient of community — measures how much two sets overlap as intersection over union. Two identical sets score 1, two sets with nothing in common score 0. The definition is the one used by scikit-learn's jaccard_score and goes back to Paul Jaccard's 1912 study of alpine flora.

For two sets A and B, the similarity is the number of shared members divided by the number of distinct members across both:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

The tool builds the two sets, then computes this in three steps:

  1. Intersection. The members present in both sets — A ∩ B. Its size is the numerator.
  2. Union. Every distinct member across both sets — A ∪ B. Its size is the denominator. If the union is empty (both sets empty), the result is defined as 0 rather than a divide-by-zero, matching scikit-learn.
  3. Divide, then derive distance. The similarity is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J.
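
The three steps map almost directly onto Python's built-in set operators. The sketch below is one possible reading of them, not the calculator's source; the names `jaccard_similarity` and `jaccard_distance` are chosen here for illustration.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Intersection over union, with the empty-union case defined as 0."""
    intersection = a & b      # step 1: members present in both sets
    union = a | b             # step 2: every distinct member across both
    if not union:             # both sets empty: return 0 instead of dividing by zero
        return 0.0
    return len(intersection) / len(union)   # step 3: divide


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance, 1 - J."""
    return 1.0 - jaccard_similarity(a, b)


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```

One consequence of the J = 0 convention in this sketch is that two empty sets get a distance of 1, even though they are identical.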

The three input modes differ only in how the sets are built. Sets mode splits a list on your chosen separator, trims each item, and discards duplicates and order, because a set ignores both. Text mode tokenises each snippet into either a set of words or a set of character n-grams (the contiguous length-n substrings, the shingling approach used in near-duplicate detection). Binary mode reads two equal-length 0/1 label vectors and takes each vector's “present” set to be the positions holding a 1, which is equivalent to how scikit-learn's jaccard_score treats binary indicator arrays.

As a credibility check, the calculator also computes the coefficient a second way, from the inclusion–exclusion identity |A ∪ B| = |A| + |B| − |A ∩ B|, and in binary mode compares the result against jaccard_score, confirming that all routes agree.
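
To make the set-building step concrete, here is a rough Python sketch of the three modes. The function names are hypothetical, and several details are assumptions rather than documented behaviour of the tool: lowercasing and whitespace splitting for words, dropping empty items in sets mode, and 1-based positions in binary mode.

```python
def set_from_list(text: str, sep: str = ",") -> set[str]:
    """Sets mode: split on a separator, trim each item, drop duplicates.
    Skipping empty items is an assumption of this sketch."""
    return {item.strip() for item in text.split(sep) if item.strip()}


def word_set(text: str) -> set[str]:
    """Text mode (words): lowercase and split on whitespace (assumed tokenisation)."""
    return set(text.lower().split())


def char_ngram_set(text: str, n: int = 3) -> set[str]:
    """Text mode (n-grams): every contiguous length-n substring (shingles)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def present_positions(bits: list[int]) -> set[int]:
    """Binary mode: the 1-based positions that hold a 1."""
    return {i for i, bit in enumerate(bits, start=1) if bit == 1}


print(sorted(set_from_list("apple, banana, apple ,cherry")))  # ['apple', 'banana', 'cherry']
print(sorted(char_ngram_set("fox", 2)))                       # ['fo', 'ox']
print(sorted(present_positions([1, 0, 1, 1, 0])))             # [1, 3, 4]
```

Whichever builder is used, the resulting sets go through the same intersection-over-union calculation.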

Worked examples

Sets — A = {apple, banana, cherry, date}, B = {banana, cherry, fig, grape}

  1. Intersection A ∩ B = {banana, cherry} → 2
  2. Union A ∪ B = {apple, banana, cherry, date, fig, grape} → 6
  3. J = 2 / 6 = 0.3333
  4. distance = 1 − 0.3333 = 0.6667 → Weakly similar
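
The same example can be reproduced with exact fractions, which shows where the rounded figures come from. This is an illustrative snippet using only Python's standard library.

```python
from fractions import Fraction

a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "fig", "grape"}

# 2 shared members out of 6 distinct members, reduced to 1/3
j = Fraction(len(a & b), len(a | b))

print(j)                      # 1/3
print(f"{float(j):.4f}")      # 0.3333
print(f"{float(1 - j):.4f}")  # 0.6667
```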

Text / words — “the quick brown fox” vs “the quick red fox”

  1. A = {the, quick, brown, fox}, B = {the, quick, red, fox}
  2. Intersection = {the, quick, fox} → 3
  3. Union = {the, quick, brown, fox, red} → 5
  4. J = 3 / 5 = 0.6000, distance = 0.4000 → Moderately similar

Binary — y_true = [1,0,1,1,0], y_pred = [1,1,1,0,0] (scikit-learn cross-check)

  1. Present positions (counting from 1): A = {1, 3, 4}, B = {1, 2, 3}
  2. Intersection = {1, 3} → 2 (both vectors hold a 1)
  3. Union = {1, 2, 3, 4} → 4 (either is 1)
  4. J = 2 / 4 = 0.5000 = sklearn.metrics.jaccard_score(y_true, y_pred)
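
For binary labels, the set view is equivalent to counting true positives, false positives, and false negatives: J = TP / (TP + FP + FN). The dependency-free sketch below uses that form; `binary_jaccard` is a hypothetical helper, and scikit-learn is deliberately not imported so the snippet runs on its own.

```python
def binary_jaccard(y_true: list[int], y_pred: list[int]) -> float:
    """Jaccard index of two equal-length 0/1 vectors via TP / (TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # in both "present" sets
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # only in y_pred
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # only in y_true
    denom = tp + fp + fn                                # size of the union
    return tp / denom if denom else 0.0                 # empty union defined as 0


print(binary_jaccard([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # 0.5
```

With scikit-learn installed, calling sklearn.metrics.jaccard_score on the same two vectors should return the same 0.5, as the worked example states.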

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-10. The Jaccard index is a stable mathematical definition, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled.
