induwara.lk

TF-IDF Calculator

Paste a few documents and see the full TF-IDF working — the term-frequency counts, the IDF for every word, and the final weighted matrix. Supports the textbook formula and scikit-learn's smoothed, L2-normalised variant. No signup, nothing uploaded.

By Induwara Ashinsana · Updated Jun 10, 2026

Each line is one document. Up to 20 documents, 2,000 characters each. Nothing is uploaded.

Documents (N): 4 · Vocabulary terms: 11 · Matrix cells: 44

IDF per term

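| Term | df | N | idf (substituted) |
|---|---|---|---|
| ceylon | 3 | 4 | ln(4 / 3) = 0.2877 |
| cinnamon | 2 | 4 | ln(4 / 2) = 0.6931 |
| exports | 1 | 4 | ln(4 / 1) = 1.3863 |
| famous | 2 | 4 | ln(4 / 2) = 0.6931 |
| fine | 1 | 4 | ln(4 / 1) = 1.3863 |
| grows | 1 | 4 | ln(4 / 1) = 1.3863 |
| is | 2 | 4 | ln(4 / 2) = 0.6931 |
| lanka | 2 | 4 | ln(4 / 2) = 0.6931 |
| sri | 2 | 4 | ln(4 / 2) = 0.6931 |
| tea | 2 | 4 | ln(4 / 2) = 0.6931 |
| world | 2 | 4 | ln(4 / 2) = 0.6931 |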

Cross-check. Every idf was also computed a second, independent way, using the subtraction form (log N − log df) ÷ log b. The two results agree to 10 decimal places (maximum difference 0.0000000000), as they must, since the two forms are algebraically identical.
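The cross-check can be reproduced in a few lines. This is a minimal sketch, not the calculator's own code: the function names `idf_ratio` and `idf_subtraction` are made up for illustration, and the df values are copied from the table above.

```python
import math

# Document frequencies copied from the IDF table (N = 4 documents).
N = 4
df = {"ceylon": 3, "cinnamon": 2, "exports": 1, "famous": 2, "fine": 1,
      "grows": 1, "is": 2, "lanka": 2, "sri": 2, "tea": 2, "world": 2}

def idf_ratio(n_docs, doc_freq, base=math.e):
    """Standard idf computed as log_base(N / df)."""
    return math.log(n_docs / doc_freq, base)

def idf_subtraction(n_docs, doc_freq, base=math.e):
    """The same quantity computed as (log N - log df) / log base."""
    return (math.log(n_docs) - math.log(doc_freq)) / math.log(base)

for term, d in df.items():
    a, b = idf_ratio(N, d), idf_subtraction(N, d)
    print(f"{term:<9} df={d}  idf={a:.4f}  |difference|={abs(a - b):.10f}")
```

Any difference between the two routes comes only from floating-point rounding, so it is far below the 10 decimal places displayed.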

TF-IDF matrix

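| Term | D1 | D2 | D3 | D4 |
|---|---|---|---|---|
| ceylon | 0.0575 | 0.0575 | 0.0575 | 0.0000 |
| cinnamon | 0.0000 | 0.1386 | 0.0000 | 0.1386 |
| exports | 0.0000 | 0.0000 | 0.2773 | 0.0000 |
| famous | 0.1386 | 0.1386 | 0.0000 | 0.0000 |
| fine | 0.0000 | 0.0000 | 0.0000 | 0.2773 |
| grows | 0.0000 | 0.0000 | 0.0000 | 0.2773 |
| is | 0.1386 | 0.1386 | 0.0000 | 0.0000 |
| lanka | 0.0000 | 0.0000 | 0.1386 | 0.1386 |
| sri | 0.0000 | 0.0000 | 0.1386 | 0.1386 |
| tea | 0.1386 | 0.0000 | 0.1386 | 0.0000 |
| world | 0.1386 | 0.1386 | 0.0000 | 0.0000 |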

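Column header tooltips show each document's token count. Weights are the plain product tf × idf with no L2 normalisation. The values shown correspond to relative TF: each count is divided by the document's 5 tokens, so ceylon in D1 is (1/5) × 0.2877 = 0.0575.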
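The matrix can be rebuilt from the information shown on the page. The sketch below makes two assumptions: each document's terms are read off the non-zero cells of the matrix (the original sentences and their word order are not reproduced here), and TF is relative (count divided by document length), which is what the displayed numbers imply. The helper name `weight` is illustrative only.

```python
import math

# Term membership read off the non-zero cells of the matrix.
# Word order inside each document is unknown and does not affect the result.
docs = {
    "D1": ["ceylon", "famous", "is", "tea", "world"],
    "D2": ["ceylon", "cinnamon", "famous", "is", "world"],
    "D3": ["ceylon", "exports", "lanka", "sri", "tea"],
    "D4": ["cinnamon", "fine", "grows", "lanka", "sri"],
}

vocab = sorted({t for tokens in docs.values() for t in tokens})
n_docs = len(docs)
df = {t: sum(t in tokens for tokens in docs.values()) for t in vocab}
idf = {t: math.log(n_docs / df[t]) for t in vocab}

def weight(term, tokens):
    """Relative term frequency (count / document length) times standard idf."""
    return tokens.count(term) / len(tokens) * idf[term]

matrix = {t: [weight(t, docs[d]) for d in docs] for t in vocab}

print("term     " + "".join(f"{d:>8}" for d in docs))
for t in vocab:
    print(f"{t:<9}" + "".join(f"{w:8.4f}" for w in matrix[t]))
```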

Most distinctive terms

D1 · 5 tokens: famous 0.1386, is 0.1386, tea 0.1386, world 0.1386

D2 · 5 tokens: cinnamon 0.1386, famous 0.1386, is 0.1386, world 0.1386

D3 · 5 tokens: exports 0.2773, lanka 0.1386, sri 0.1386, tea 0.1386

D4 · 5 tokens: fine 0.2773, grows 0.2773, cinnamon 0.1386, lanka 0.1386
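The ranking above is consistent with sorting each document's weights from highest to lowest, breaking ties alphabetically, and keeping the top four. That rule is inferred from the displayed lists rather than documented by the tool, so treat this sketch (and the name `top_terms`) as an assumption.

```python
# Non-zero weights for two documents, copied from the TF-IDF matrix.
d3 = {"ceylon": 0.0575, "exports": 0.2773, "lanka": 0.1386,
      "sri": 0.1386, "tea": 0.1386}
d4 = {"cinnamon": 0.1386, "fine": 0.2773, "grows": 0.2773,
      "lanka": 0.1386, "sri": 0.1386}

def top_terms(weights, k=4):
    """Rank terms by weight, highest first; ties fall back to alphabetical order."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

print(top_terms(d3))
print(top_terms(d4))
```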

Method: tf-idf = tf × idf, with idf = log(N/df) (Manning IR) or the scikit-learn smoothed form ln[(1+N)/(1+df)] + 1; optional L2 row-normalisation matches TfidfVectorizer. Sources: Manning, Raghavan & Schütze (IR-book Ch. 6) and scikit-learn. Nothing leaves this page.

How it works

TF-IDF (term frequency–inverse document frequency) scores how important a word is to one document within a collection. A word that appears often in a document but rarely across the corpus gets a high score; a word that appears everywhere gets a low one. The definitions here follow Manning, Raghavan & Schütze's Introduction to Information Retrieval, Chapter 6, and scikit-learn's TfidfVectorizer.

The tool computes it in four steps:

  1. Tokenise. Each line is split on whitespace into unigrams. With the default toggle on, tokens are lower-cased and stripped of leading and trailing punctuation, then a sorted vocabulary is built from every document.
  2. Term frequency. raw uses the count itself; relative divides by the document length; and sublinear uses 1 + ln(count) for any count above zero, damping very frequent words. This is the same option as scikit-learn's sublinear_tf.
  3. Inverse document frequency. The document frequency df is how many documents contain the term. Standard idf is log_b(N / df) for base e, 10, or 2. The scikit-learn smoothed form is ln[(1 + N) / (1 + df)] + 1; the +1 inside avoids dividing by zero, and the trailing +1 stops a term that appears in every document from being zeroed out.
  4. Multiply and optionally normalise. Each weight is tf × idf. Turning on L2-normalisation divides each document's vector (a column in the matrix shown by this tool, a row in scikit-learn's output) by its Euclidean norm, so every document vector has unit length. This is required to match TfidfVectorizer's default norm='l2'.
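The four steps above can be sketched in plain Python. This is an illustrative reading of the description, not the tool's source: all function names are hypothetical, punctuation stripping uses ASCII `string.punctuation` only, and the sample input lines at the bottom are invented.

```python
import math
import string
from collections import Counter

def tokenise(line, clean=True):
    """Step 1: split on whitespace; optionally lower-case and trim edge punctuation."""
    tokens = line.split()
    if clean:
        tokens = [t.lower().strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t]

def term_frequency(count, doc_len, mode="raw"):
    """Step 2: raw count, count / length, or 1 + ln(count)."""
    if count == 0:
        return 0.0
    if mode == "relative":
        return count / doc_len
    if mode == "sublinear":
        return 1 + math.log(count)
    return float(count)

def inverse_document_frequency(n_docs, df, smooth=False, base=math.e):
    """Step 3: log_b(N / df), or the smoothed natural-log form."""
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df, base)

def tfidf(lines, tf_mode="raw", smooth=False, l2=False):
    """Step 4: multiply tf by idf, then optionally L2-normalise each document."""
    docs = [tokenise(line) for line in lines]
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vec = [term_frequency(counts[t], len(d), tf_mode)
               * inverse_document_frequency(len(docs), df[t], smooth)
               for t in vocab]
        if l2:
            norm = math.sqrt(sum(w * w for w in vec))
            vec = [w / norm if norm else 0.0 for w in vec]
        vectors.append(vec)
    return vocab, vectors

# Invented sample input: one document per line.
vocab, vectors = tfidf(["Ceylon tea, please!", "Tea and cinnamon."],
                       tf_mode="raw", smooth=True, l2=True)
print(vocab)
for vec in vectors:
    print([round(w, 4) for w in vec])
```

Each vector is returned as a list aligned with `vocab`, so one list corresponds to one document column in the tool's matrix.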

One subtlety worth knowing: scikit-learn's default tokeniser is a regex that keeps only tokens of two or more word characters, so single-character tokens are discarded and its vocabulary can differ slightly from this tool's plain whitespace split. For the toy corpora students usually check, the two agree once you select raw TF, smoothed IDF, and L2-normalisation. As a sanity check, the calculator re-derives every idf a second, independent way, using the subtraction form (log N − log df) ÷ log b, and confirms the two routes agree. The optional cosine-similarity matrix then reuses the same vectors to show how alike the documents are.
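The tokeniser difference is easy to see without installing scikit-learn. The sketch below applies the regex documented as TfidfVectorizer's default `token_pattern` using Python's `re` module, next to a whitespace split of the kind described for this tool. The function names and the sample sentence are invented for illustration.

```python
import re
import string

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
SKLEARN_DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def whitespace_tokens(text):
    """Whitespace split, lower-cased, with edge punctuation trimmed."""
    words = (w.lower().strip(string.punctuation) for w in text.split())
    return [w for w in words if w]

def regex_tokens(text):
    """Tokens as the default scikit-learn pattern would extract them."""
    return SKLEARN_DEFAULT_PATTERN.findall(text.lower())

text = "A cat is a pet; x-ray it!"
print(whitespace_tokens(text))
print(regex_tokens(text))
```

On this sentence the whitespace split keeps "a" and "x-ray", while the regex drops the single-character "a" and reduces "x-ray" to "ray", so the two vocabularies differ.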

Worked examples

Textbook — “the cat sat” / “the dog sat” / “the cat ran” (relative TF, standard idf base-e)

  1. N = 3. df: the = 3, cat = 2, sat = 2, dog = 1, ran = 1
  2. idf(the) = ln(3/3) = 0 (appears in every document)
  3. idf(cat) = ln(3/2) = 0.4055
  4. D1 “the cat sat”: each TF = 1/3 = 0.3333
  5. w(cat, D1) = 0.3333 × 0.4055 = 0.1352; w(the, D1) = 0.3333 × 0 = 0
  6. Most distinctive words in D1: cat, sat. “the” correctly drops to 0.
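The textbook example can be checked step by step. This is a small sketch with illustrative variable names, using natural-log idf and relative TF as stated in the example heading.

```python
import math

corpus = ["the cat sat", "the dog sat", "the cat ran"]
docs = [line.split() for line in corpus]
n_docs = len(docs)

# Step 1: document frequency = number of documents containing the term.
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}

# Steps 2-3: standard idf with the natural log.
idf = {t: math.log(n_docs / df[t]) for t in vocab}

# Steps 4-5: relative TF for D1, multiplied by idf.
d1 = docs[0]
weights_d1 = {t: d1.count(t) / len(d1) * idf[t] for t in d1}

print(df)
print({t: round(w, 4) for t, w in weights_d1.items()})
```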

scikit-learn reconciliation — same corpus, raw TF, smoothed idf, L2-normalised, term “dog” in D2

  1. idf(dog) = ln[(1+3)/(1+1)] + 1 = ln(2) + 1 = 1.6931
  2. idf(the) = ln(4/4) + 1 = 1.0000; idf(sat) = ln(4/3) + 1 = 1.2877
  3. D2 “the dog sat” raw counts all 1 → column (the, sat, dog) = (1.0000, 1.2877, 1.6931)
  4. ‖D2‖ = √(1.0000² + 1.2877² + 1.6931²) = √5.5249 = 2.3505
  5. w(dog, D2) = 1.6931 / 2.3505 = 0.7203
  6. Matches TfidfVectorizer(smooth_idf=True, norm='l2').
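The reconciliation arithmetic can be replayed with the standard library alone. This sketch does not call scikit-learn; it only reproduces the numbers in the steps above, and the names `smoothed_idf`, `column`, and `unit` are illustrative.

```python
import math

n_docs = 3
# Document frequencies for the three terms in D2, "the dog sat".
df = {"the": 3, "sat": 2, "dog": 1}

def smoothed_idf(n, d):
    """Smoothed idf: ln[(1 + N) / (1 + df)] + 1."""
    return math.log((1 + n) / (1 + d)) + 1

# Raw TF is 1 for every term in D2, so each weight equals its idf.
column = {t: 1 * smoothed_idf(n_docs, d) for t, d in df.items()}

# L2 normalisation: divide by the Euclidean norm of the document vector.
norm = math.sqrt(sum(w * w for w in column.values()))
unit = {t: w / norm for t, w in column.items()}

print({t: round(w, 4) for t, w in column.items()})
print(round(norm, 4))
print({t: round(w, 4) for t, w in unit.items()})
```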

IDF-zero boundary — a word in every document

  1. If a term sits in all N documents, df = N
  2. Standard idf = log(N / N) = log(1) = 0
  3. So its TF-IDF is 0 in every document, whatever its count
  4. Smoothed idf = ln[(1+N)/(1+N)] + 1 = 0 + 1 = 1, keeping a baseline weight
  5. This is why “stop words” like the, is, of often vanish under standard idf.
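The boundary case holds for any corpus size, which a short loop makes concrete. A minimal sketch with illustrative function names and arbitrary values of N:

```python
import math

def standard_idf(n, df, base=math.e):
    """Standard idf: log_base(N / df)."""
    return math.log(n / df, base)

def smoothed_idf(n, df):
    """Smoothed idf: ln[(1 + N) / (1 + df)] + 1."""
    return math.log((1 + n) / (1 + df)) + 1

for n in (3, 10, 1000):
    # df == n means the term appears in every document.
    print(f"N={n}: standard={standard_idf(n, n)}, smoothed={smoothed_idf(n, n)}")
```

Because the standard idf is exactly zero here, multiplying by any term count still gives zero, while the smoothed form leaves the raw TF unchanged.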


Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-10. TF-IDF is a stable mathematical definition, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled against scikit-learn.
