How do you calculate the Jaccard similarity coefficient?

Build the two sets, then divide the size of their intersection by the size of their union: J(A,B) = |A ∩ B| / |A ∪ B|. The intersection is the elements in both sets; the union is the elements in either. For example, {a,b,c,d} and {b,c,e,f} share {b,c} (2 elements) out of a union of 6, so J = 2/6 = 0.3333. The result is always between 0 and 1.
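The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function name `jaccard` is hypothetical, and returning 0.0 for two empty inputs is an assumed convention.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets empty: 0/0 is undefined, so this sketch returns 0.0 by convention.
        return 0.0
    return len(set_a & set_b) / len(union)


score = jaccard({"a", "b", "c", "d"}, {"b", "c", "e", "f"})
print(f"{score:.4f}")  # 0.3333
```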

What is the difference between Jaccard similarity and Jaccard distance?

Jaccard similarity measures overlap, from 0 (no shared elements) to 1 (identical sets). Jaccard distance is simply 1 − similarity, so it runs the other way: 0 for identical sets, up to 1 for disjoint ones. Use similarity when a higher number should mean more alike, and distance when a lower number should mean more alike. Both come from the same intersection-over-union count.

How is Jaccard similarity different from cosine similarity?

Jaccard works on set membership — does an element appear or not — and ignores how many times it occurs. Cosine similarity works on numeric vectors, comparing their direction using counts or weights. For two short texts, Jaccard asks what fraction of the distinct words they share, while cosine compares word-frequency vectors. Jaccard is natural for tags, fingerprints and presence/absence data; cosine suits weighted or embedding vectors.
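To make the contrast concrete, here is a small sketch that scores the same pair of texts both ways. The helper names `word_jaccard` and `word_cosine` are hypothetical, and splitting on whitespace after lower-casing is an assumed tokenisation, not necessarily what either calculator does.

```python
import math
from collections import Counter


def word_jaccard(text_a, text_b):
    """Share of distinct words the two texts have in common."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_cosine(text_a, text_b):
    """Cosine of the angle between the two word-frequency vectors."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


once = "spam eggs"
many = "spam spam spam eggs"
print(word_jaccard(once, many))           # 1.0: the distinct words are identical
print(round(word_cosine(once, many), 4))  # 0.8944: the counts differ
```

Repeating "spam" leaves the Jaccard score unchanged but pulls the cosine score below 1, which is the membership-versus-frequency difference described above.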

Is the Jaccard index the same as Intersection over Union (IoU)?

Yes. IoU is the Jaccard index applied to areas instead of discrete elements. Object-detection IoU divides the overlapping area of two bounding boxes by the area of their union (the two box areas added together, with the overlap counted only once), which is exactly |A ∩ B| / |A ∪ B| for continuous regions. This calculator handles set, text and label overlap; for bounding boxes, use the dedicated IoU calculator listed under Related tools. The underlying coefficient is identical.
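As an illustration of the area version, the sketch below computes IoU for two axis-aligned boxes. The function name `box_iou` and the (x_min, y_min, x_max, y_max) box layout are assumptions made for this example, not the IoU calculator's interface.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Inclusion-exclusion on areas: area(A) + area(B) - area(A ∩ B).
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285, i.e. 1 / 7
```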

What is a good Jaccard similarity score?

It depends on the task, because the score is sensitive to set size and how you tokenise. For near-duplicate text detection, word-set Jaccard above roughly 0.5–0.7 often signals strong overlap; for sparse tag sets, even 0.3 can be meaningful. There is no universal threshold — scores are not comparable across different tokenisations, so calibrate a cut-off on examples you have already judged by hand.
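One simple way to calibrate a cut-off is to try each observed score as a threshold and keep the one that agrees most often with your hand judgements. The sketch below assumes that approach; `best_threshold` is a hypothetical helper and the labelled scores are made-up illustration data, not measurements.

```python
def best_threshold(scored_pairs):
    """Pick the cut-off that classifies the most hand-labelled pairs correctly.

    scored_pairs: non-empty list of (jaccard_score, is_duplicate) tuples.
    """
    candidates = sorted({score for score, _ in scored_pairs})
    best_cut, best_correct = None, -1
    for cut in candidates:
        # Predict "duplicate" when the score is at or above the candidate cut-off.
        correct = sum((score >= cut) == label for score, label in scored_pairs)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut, best_correct / len(scored_pairs)


# Hypothetical hand-labelled scores, for illustration only.
labelled = [(0.10, False), (0.25, False), (0.40, False),
            (0.55, True), (0.70, True), (0.90, True)]
cut, accuracy = best_threshold(labelled)
print(cut, accuracy)  # 0.55 1.0
```

A cut-off found this way only holds for the tokenisation that produced the scores; change the tokeniser and it needs recalibrating.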

Does Jaccard count repeated words or just unique ones?

Just unique ones. A set, by definition, ignores repeats and order, so a word appearing five times counts the same as a word appearing once. This calculator dedupes each input before comparing. If you need repeat counts to matter, you want a weighted or multiset variant (sometimes called the Tanimoto coefficient on counts), which is outside the scope of this set-based tool.
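For comparison, one common count-aware variant divides the sum of the per-word minimum counts by the sum of the per-word maximum counts. The sketch below assumes that definition; `multiset_jaccard` is a hypothetical name and this is not something the set-based calculator computes.

```python
from collections import Counter


def multiset_jaccard(tokens_a, tokens_b):
    """Weighted Jaccard on counts: sum of minimum counts over sum of maximum counts."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    keys = ca.keys() | cb.keys()
    numerator = sum(min(ca[k], cb[k]) for k in keys)    # a missing key counts as 0
    denominator = sum(max(ca[k], cb[k]) for k in keys)
    return numerator / denominator if denominator else 0.0


a = "spam spam spam eggs".split()
b = "spam eggs".split()
print(len(set(a) & set(b)) / len(set(a) | set(b)))  # 1.0 (set-based, repeats ignored)
print(multiset_jaccard(a, b))                       # 0.5 (repeats counted)
```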

What happens when both sets are empty?

The union is empty, so the formula would be 0/0, which is undefined. This calculator follows scikit-learn's documented convention and defines J = 0 in that case rather than showing NaN, and it flags that the convention was applied. As soon as either set has at least one element the result is the ordinary intersection-over-union.
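A sketch of how the empty-union case might be handled and flagged is shown here. The function name `jaccard_with_flag` and the `empty_value` parameter are hypothetical, chosen only to make the convention visible to the caller.

```python
def jaccard_with_flag(a, b, empty_value=0.0):
    """Return (score, convention_applied) so callers can see when 0/0 was replaced."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets are empty: report the conventional value and flag it.
        return empty_value, True
    return len(set_a & set_b) / len(union), False


print(jaccard_with_flag([], []))        # (0.0, True)
print(jaccard_with_flag(["x"], []))     # (0.0, False)
print(jaccard_with_flag(["x"], ["x"]))  # (1.0, False)
```

Returning the flag alongside the score lets a caller tell a genuine 0 (disjoint sets) from a 0 that stands in for an undefined result.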

Does this calculator send my text or data anywhere?

No. Parsing your input, deduping into sets, computing the intersection and union, and dividing — all of it runs in your browser with plain JavaScript. Nothing is uploaded, logged, or stored, so you can paste private documents or label data without concern. The page keeps working offline once it has loaded.


Jaccard Similarity Calculator

Find the Jaccard index between two sets, two texts, or two binary vectors, in your browser. See the similarity score, the Jaccard distance, the intersection and union members, and the substituted formula behind every result. No signup, nothing uploaded.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 10, 2026

Jaccard similarity calculator

Text A

Turned into a set of words or character n-grams.

Text B

Compared token-set vs token-set with A.

Case-insensitive (treat “Apple” and “apple” as one element)

Jaccard similarity: 0.6000 (60.00% · range 0–1)

Jaccard distance: 0.4000 (1 − J)

|A ∩ B| / |A ∪ B| = 3 / 5

|A| = 4 · |B| = 4

Interpretation: Moderately similar

J = |A ∩ B| / |A ∪ B| = 3 / 5 = 0.6000

distance = 1 − 0.6000 = 0.4000

Cross-check. The direct intersection-over-union gives 3 / 5 = 0.6000; the independent inclusion–exclusion form |A ∩ B| / (|A| + |B| − |A ∩ B|) = 3 / (4 + 4 − 3) also gives 0.6000. They reconcile, as they must.

Set breakdown

Set A (4): the, quick, brown, fox

Set B (4): the, quick, red, fox

Intersection A ∩ B (3): the, quick, fox

Union A ∪ B (5): the, quick, brown, fox, red

Method: J = |A ∩ B| / |A ∪ B|; distance = 1 − J — the scikit-learn jaccard_score and Jaccard (1912) definition. The empty-union case is defined as J = 0, not NaN. Nothing leaves this page.

How it works

The Jaccard similarity coefficient — also called the Jaccard index or coefficient of community — measures how much two sets overlap as intersection over union. Two identical sets score 1, two sets with nothing in common score 0. The definition is the one used by scikit-learn's jaccard_score and goes back to Paul Jaccard's 1912 study of alpine flora.

For two sets A and B, the similarity is the number of shared members divided by the number of distinct members across both:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)
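The two forms of the denominator can be checked against each other in code. The following is a minimal sketch; `jaccard_two_ways` is a hypothetical helper, and returning 0.0 for two empty inputs is an assumed convention.

```python
def jaccard_two_ways(a, b):
    """Compute J with the union counted directly and via inclusion-exclusion."""
    set_a, set_b = set(a), set(b)
    shared = len(set_a & set_b)
    direct_union = len(set_a | set_b)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
    derived_union = len(set_a) + len(set_b) - shared
    assert direct_union == derived_union, "the two union counts must agree"
    if direct_union == 0:
        return 0.0, 0.0
    return shared / direct_union, shared / derived_union


direct, derived = jaccard_two_ways("the quick brown fox".split(),
                                   "the quick red fox".split())
print(direct, derived)  # 0.6 0.6
```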

The tool builds the two sets, then computes this in three steps:

1. Intersection. The members present in both sets, A ∩ B. Its size is the numerator.
2. Union. Every distinct member across both sets, A ∪ B. Its size is the denominator. If the union is empty (both sets empty), the result is defined as 0 rather than a divide-by-zero, matching scikit-learn.
3. Divide, then derive distance. The similarity is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J.

The three input modes differ only in how the sets are built. Sets mode splits a list on your chosen separator, trims each item, and discards duplicates and ordering, because a set has neither. Text mode tokenises each snippet either into a set of words or into a set of character n-grams (the contiguous length-n substrings, the shingling approach used in near-duplicate detection). Binary mode reads two equal-length 0/1 label vectors and takes each vector's “present” set to be the positions holding a 1, which is how scikit-learn's jaccard_score treats binary indicator arrays. As a consistency check, the calculator also computes the coefficient a second way, from the inclusion–exclusion identity |A ∪ B| = |A| + |B| − |A ∩ B|, and, in binary mode, checks it against jaccard_score, confirming that all routes agree.
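The set-building step for the list and text modes might look like the sketch below. All three helper names are hypothetical, and details such as whitespace-only word splitting and dropping empty list items are assumptions, since the page does not spell them out.

```python
def list_set(raw, separator=","):
    """Sets mode: split on a separator, trim each item, drop empties and duplicates."""
    return {item.strip() for item in raw.split(separator) if item.strip()}


def word_set(text, case_insensitive=True):
    """Text mode (words): split on whitespace and dedupe into a set."""
    if case_insensitive:
        text = text.lower()
    return set(text.split())


def char_ngram_set(text, n=3, case_insensitive=True):
    """Text mode (n-grams): set of contiguous length-n substrings (character shingles)."""
    if case_insensitive:
        text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


print(sorted(list_set("apple, banana ,apple")))  # ['apple', 'banana']
print(sorted(word_set("The quick brown fox")))   # ['brown', 'fox', 'quick', 'the']
print(sorted(char_ngram_set("night", n=3)))      # ['ght', 'igh', 'nig']
```

Once both inputs are reduced to sets this way, the same intersection-over-union calculation applies regardless of mode.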

Worked examples

Sets — A = {apple, banana, cherry, date}, B = {banana, cherry, fig, grape}

Intersection A ∩ B = {banana, cherry} → 2
Union A ∪ B = {apple, banana, cherry, date, fig, grape} → 6
J = 2 / 6 = 0.3333
distance = 1 − 0.3333 = 0.6667 → Weakly similar

Text / words — “the quick brown fox” vs “the quick red fox”

A = {the, quick, brown, fox}, B = {the, quick, red, fox}
Intersection = {the, quick, fox} → 3
Union = {the, quick, brown, fox, red} → 5
J = 3 / 5 = 0.6000, distance = 0.4000 → Moderately similar

Binary — y_true = [1,0,1,1,0], y_pred = [1,1,1,0,0] (scikit-learn cross-check)

Present positions (1-indexed): A = {1, 3, 4}, B = {1, 2, 3}
Intersection = {1, 3} → 2 (both vectors are 1)
Union = {1, 2, 3, 4} → 4 (either vector is 1)
J = 2 / 4 = 0.5000 = sklearn.metrics.jaccard_score(y_true, y_pred)
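The binary example can be reproduced without any dependencies by building the two position sets explicitly. This is a sketch under stated assumptions: `binary_jaccard` is a hypothetical helper, and 1-based positions are used only to match the worked example.

```python
def binary_jaccard(y_true, y_pred):
    """Jaccard of two equal-length 0/1 vectors via their 'present' position sets."""
    if len(y_true) != len(y_pred):
        raise ValueError("vectors must have the same length")
    # 1-based positions holding a 1, matching the worked example.
    present_a = {i for i, v in enumerate(y_true, start=1) if v == 1}
    present_b = {i for i, v in enumerate(y_pred, start=1) if v == 1}
    union = present_a | present_b
    score = len(present_a & present_b) / len(union) if union else 0.0
    return score, present_a, present_b


score, a, b = binary_jaccard([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(sorted(a), sorted(b), score)  # [1, 3, 4] [1, 2, 3] 0.5
```

If scikit-learn is installed, `sklearn.metrics.jaccard_score(y_true, y_pred)` on the same two vectors should give the same 0.5, as the example states.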

Frequently asked questions

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-10. The Jaccard index is a stable mathematical definition, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled.

Related tools

IoU Calculator

Compute Intersection over Union for two bounding boxes or two label sets. Returns IoU, the Dice/F1 coefficient, raw intersection and union, a threshold pass/fail verdict, and a scaled overlap diagram — entirely in your browser, sources cited.

Cosine Similarity Calc

Compute the cosine similarity, cosine distance, and angle between two numeric vectors or two short texts, with the full dot-product and magnitude working. Matches scikit-learn, runs entirely in the browser.

Adjusted Rand Index Calc

Compares two clusterings (or a clustering against ground-truth labels) and computes the Adjusted Rand Index, the raw Rand Index, and the Fowlkes–Mallows index from two pasted label lists, with the pair-counting working shown. Matches scikit-learn, runs in your browser.
