How do you calculate TF-IDF by hand?

For each document, count how often a term appears — that is the term frequency (TF). Then count how many of the N documents contain the term (its document frequency, df) and compute the inverse document frequency idf = log(N / df). Multiply: TF-IDF = TF × idf. Example: in a 3-document corpus where “cat” appears in 2, idf(cat) = ln(3/2) = 0.4055; if its TF in one document is 1/3, the weight is 0.3333 × 0.4055 = 0.1352.
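The same arithmetic can be sketched in a few lines of Python. This is only an illustration, not the calculator's own code: the function name `tf_idf` and the plain whitespace split are assumptions made for the example, and the three documents are the ones from the textbook worked example on this page.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """Relative TF times natural-log idf (the textbook form)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    idf = math.log(len(corpus_tokens) / df)  # assumes the term occurs somewhere (df > 0)
    return tf * idf

corpus = [line.split() for line in ["the cat sat", "the dog sat", "the cat ran"]]
weight = tf_idf("cat", corpus[0], corpus)
print(round(weight, 4))  # 0.1352
```

Swapping `math.log` for `math.log10` or `math.log2` changes every weight by a constant factor but leaves the ranking of terms untouched.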

What is the difference between TF-IDF and term frequency?

Term frequency only counts how often a word appears in one document, so common filler words like “the” score highly everywhere. TF-IDF multiplies that count by the inverse document frequency, which shrinks the weight of words that appear in many documents and lifts words that are rare across the corpus. The result highlights terms that are distinctive to a document rather than just frequent in it.

Why does scikit-learn's TF-IDF give different numbers?

scikit-learn's TfidfVectorizer uses a smoothed idf, ln[(1+N)/(1+df)] + 1, not the textbook log(N/df), and it L2-normalises each document vector by default so every row has unit length. It also drops single-character tokens and uses raw counts for TF. Switch this tool's IDF to “scikit-learn smoothed”, turn on L2-normalise, and pick raw-count TF to reproduce its output exactly.

What does an IDF of 0 mean?

With the standard formula idf = log(N / df), a term that appears in every document has df = N, so idf = log(1) = 0 and its TF-IDF weight becomes zero in every document. That is the intended behaviour: a word present everywhere carries no power to tell documents apart. scikit-learn's smoothed idf adds 1, so such a term keeps a small baseline weight instead of vanishing.

Is a higher TF-IDF score better?

Higher means more distinctive, not better in any absolute sense. A high weight says the term is frequent in this document but rare across the corpus, so it characterises that document well — useful for ranking, keyword extraction, and search. The scores are only comparable within the same corpus and settings; adding or removing documents changes every idf, so weights are not portable between corpora.
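To see why weights are not portable, here is a small sketch that computes the idf of one term in two corpora. The helper `idf` and the three extra documents are hypothetical, added only to show the effect.

```python
import math

def idf(term, corpus_tokens):
    """Standard idf = ln(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / df)

small = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
# Hypothetical extra documents that never mention "cat".
large = small + [d.split() for d in ["the bird flew", "the fish swam", "the horse ran"]]

print(round(idf("cat", small), 4))  # 0.4055  (N = 3, df = 2)
print(round(idf("cat", large), 4))  # 1.0986  (N = 6, df = 2)
```

The word and its documents are unchanged, yet its idf more than doubles because the surrounding corpus grew.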

What does L2 normalisation do to TF-IDF?

L2 normalisation divides each document's TF-IDF vector by its Euclidean length, so the squares of its components sum to 1. This makes long and short documents comparable and is what scikit-learn applies by default. It changes only the scale of the weights, not which terms rank highest within a document, and it makes cosine similarity between documents reduce to a plain dot product.
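A minimal sketch of those three properties, using two made-up weight vectors over the same vocabulary. The helper names `l2_normalise`, `dot` and `cosine` are assumptions for this example, not functions from the tool or from scikit-learn.

```python
import math

def l2_normalise(vec):
    """Divide a vector by its Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Illustrative numbers only.
d1 = [0.0, 0.2, 0.4, 0.1]
d2 = [0.3, 0.2, 0.0, 0.1]
u1, u2 = l2_normalise(d1), l2_normalise(d2)

top_before = max(range(len(d1)), key=lambda i: d1[i])
top_after = max(range(len(u1)), key=lambda i: u1[i])

print(math.isclose(sum(x * x for x in u1), 1.0))  # True: unit length
print(math.isclose(dot(u1, u2), cosine(d1, d2)))  # True: cosine is now a dot product
print(top_before == top_after)                    # True: top term unchanged
```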

Does this calculator send my documents anywhere?

No. Tokenising the text, counting terms, computing the logarithms and assembling the matrix all run in your browser with plain JavaScript. Nothing is uploaded, logged, or stored, so you can paste private or unpublished text safely. The CSV export is generated locally too, and the page keeps working offline once it has loaded.


TF-IDF Calculator

Paste a few documents and see the full TF-IDF working — the term-frequency counts, the IDF for every word, and the final weighted matrix. Supports the textbook formula and scikit-learn's smoothed, L2-normalised variant. No signup, nothing uploaded.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 10, 2026

TF-IDF calculator

Documents — one per line

Each line is one document. Up to 20 documents, 2,000 characters each. Nothing is uploaded.


IDF per term

Term	df	N	idf (substituted)
ceylon	3	4	ln(4 / 3) = 0.2877
cinnamon	2	4	ln(4 / 2) = 0.6931
exports	1	4	ln(4 / 1) = 1.3863
famous	2	4	ln(4 / 2) = 0.6931
fine	1	4	ln(4 / 1) = 1.3863
grows	1	4	ln(4 / 1) = 1.3863
is	2	4	ln(4 / 2) = 0.6931
lanka	2	4	ln(4 / 2) = 0.6931
sri	2	4	ln(4 / 2) = 0.6931
tea	2	4	ln(4 / 2) = 0.6931
world	2	4	ln(4 / 2) = 0.6931

Cross-check. Every idf was computed a second, independent way — the subtraction form (log N − log df) ÷ log base — and the two routes agree to within 0.0000000000. They reconcile, as they must.
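The two routes can be sketched as follows, using df values copied from the table above. The function names and the 1e-12 tolerance are assumptions for this example; the calculator's own check may be implemented differently.

```python
import math

def idf_direct(n_docs, df, base=math.e):
    """idf = log_base(N / df)."""
    return math.log(n_docs / df) / math.log(base)

def idf_subtraction(n_docs, df, base=math.e):
    """The same quantity via (log N - log df) / log base."""
    return (math.log(n_docs) - math.log(df)) / math.log(base)

N = 4
doc_freq = {"ceylon": 3, "cinnamon": 2, "exports": 1}  # from the IDF table

for term, df in doc_freq.items():
    a, b = idf_direct(N, df), idf_subtraction(N, df)
    print(term, round(a, 4), abs(a - b) < 1e-12)
# ceylon 0.2877 True
# cinnamon 0.6931 True
# exports 1.3863 True
```

Passing `base=10` or `base=2` gives the other two log bases the tool offers.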

TF-IDF matrix

Term	D1	D2	D3	D4
ceylon	0.0575	0.0575	0.0575	0.0000
cinnamon	0.0000	0.1386	0.0000	0.1386
exports	0.0000	0.0000	0.2773	0.0000
famous	0.1386	0.1386	0.0000	0.0000
fine	0.0000	0.0000	0.0000	0.2773
grows	0.0000	0.0000	0.0000	0.2773
is	0.1386	0.1386	0.0000	0.0000
lanka	0.0000	0.0000	0.1386	0.1386
sri	0.0000	0.0000	0.1386	0.1386
tea	0.1386	0.0000	0.1386	0.0000
world	0.1386	0.1386	0.0000	0.0000

Column header tooltips show each document's token count. Weights are tf × idf with relative TF (count ÷ token count) and no L2 normalisation; for example, ceylon in D1 is (1/5) × 0.2877 = 0.0575.

Most distinctive terms

D1 · 5 tokens

famous 0.1386 · is 0.1386 · tea 0.1386 · world 0.1386

D2 · 5 tokens

cinnamon 0.1386 · famous 0.1386 · is 0.1386 · world 0.1386

D3 · 5 tokens

exports 0.2773 · lanka 0.1386 · sri 0.1386 · tea 0.1386

D4 · 5 tokens

fine 0.2773 · grows 0.2773 · cinnamon 0.1386 · lanka 0.1386

Method: tf-idf = tf × idf, with idf = log(N/df) (Manning IR) or the scikit-learn smoothed form ln[(1+N)/(1+df)] + 1; optional L2 row-normalisation matches TfidfVectorizer. Sources: Manning, Raghavan & Schütze (IR-book Ch. 6) and scikit-learn. Nothing leaves this page.

How it works

TF-IDF (term frequency–inverse document frequency) scores how important a word is to one document within a collection. A word that appears often in a document but rarely across the corpus gets a high score; a word that appears everywhere gets a low one. The definitions here follow Manning, Raghavan & Schütze's Introduction to Information Retrieval, Chapter 6, and scikit-learn's TfidfVectorizer.

The tool computes it in four steps:

1. Tokenise. Each line is split on whitespace into unigrams. With the default toggle on, tokens are lower-cased and stripped of leading and trailing punctuation, then a sorted vocabulary is built from every document.
2. Term frequency. raw uses the count itself; relative divides by the document length; and sublinear uses 1 + ln(count), damping very frequent words (the same option as scikit-learn's sublinear_tf).
3. Inverse document frequency. The document frequency df is how many documents contain the term. Standard idf is log_b(N / df) for base e, 10, or 2. The scikit-learn smoothed form is ln[(1 + N) / (1 + df)] + 1; the +1 inside avoids dividing by zero, and the trailing +1 stops a term that appears in every document from being zeroed out.
4. Multiply and optionally normalise. Each weight is tf × idf. Turning on L2-normalisation divides each document's column by its Euclidean norm, so every document vector has unit length, as required to match TfidfVectorizer's default norm='l2'.
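The four steps can be sketched end to end in Python. This is a rough reconstruction from the description above, not the tool's JavaScript: every function and parameter name here is an assumption, and the sketch returns one list per document, whereas the tool displays documents as columns.

```python
import math
import string

def tokenise(line, clean=True):
    """Step 1: whitespace split; optionally lower-case and trim edge punctuation."""
    tokens = line.split()
    if clean:
        tokens = [t.lower().strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t]

def term_frequency(count, length, mode="relative"):
    """Step 2: raw, relative, or sublinear TF; absent terms always score 0."""
    if count == 0:
        return 0.0
    if mode == "raw":
        return float(count)
    if mode == "relative":
        return count / length
    if mode == "sublinear":
        return 1.0 + math.log(count)
    raise ValueError(f"unknown tf mode: {mode}")

def inverse_document_frequency(n_docs, df, mode="standard", base=math.e):
    """Step 3: standard log_b(N/df) or the smoothed ln[(1+N)/(1+df)] + 1."""
    if mode == "standard":
        return math.log(n_docs / df) / math.log(base)
    if mode == "smoothed":
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    raise ValueError(f"unknown idf mode: {mode}")

def tf_idf_matrix(lines, tf_mode="relative", idf_mode="standard", l2=False):
    """Step 4: multiply tf by idf, then optionally L2-normalise each document."""
    docs = [tokenise(line) for line in lines]
    vocab = sorted({t for doc in docs for t in doc})
    idf = {t: inverse_document_frequency(len(docs), sum(t in d for d in docs), idf_mode)
           for t in vocab}
    matrix = []
    for doc in docs:
        vec = [term_frequency(doc.count(t), len(doc), tf_mode) * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        if l2 and norm:
            vec = [w / norm for w in vec]
        matrix.append(vec)
    return vocab, matrix

vocab, matrix = tf_idf_matrix(["The cat sat.", "The dog sat.", "The cat ran."])
print(vocab)                             # ['cat', 'dog', 'ran', 'sat', 'the']
print([round(w, 4) for w in matrix[0]])  # [0.1352, 0.0, 0.0, 0.1352, 0.0]
```

Calling `tf_idf_matrix(lines, "raw", "smoothed", l2=True)` selects the combination the page describes for reproducing scikit-learn's defaults.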

One subtlety worth knowing: scikit-learn additionally discards single-character tokens and uses a regex tokeniser, so for very short words its vocabulary can differ slightly from this tool's plain whitespace split. For the toy corpora students usually check, the two agree once you select raw TF, smoothed IDF, and L2-normalisation. As a credibility check, the calculator re-derives every idf a second, independent way — the subtraction form (log N − log df) ÷ log b — and confirms the two routes agree. The optional cosine-similarity matrix then reuses the same vectors to show how alike the documents are.
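The tokeniser difference is easy to see in a short sketch. The regular expression below is the pattern scikit-learn documents as TfidfVectorizer's default `token_pattern`; the two helper functions and the sample sentence are made up for this illustration.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
SKLEARN_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def whitespace_tokens(text):
    """Plain whitespace split, as this tool is described as doing."""
    return text.lower().split()

def sklearn_style_tokens(text):
    """Regex tokenisation that drops single-character tokens."""
    return SKLEARN_TOKEN_PATTERN.findall(text.lower())

text = "a cat is on a mat"
print(whitespace_tokens(text))     # ['a', 'cat', 'is', 'on', 'a', 'mat']
print(sklearn_style_tokens(text))  # ['cat', 'is', 'on', 'mat']
```

Because "a" disappears from one vocabulary but not the other, both the document lengths and the L2 norms differ, which is enough to shift every weight in that document.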

Worked examples

Textbook — “the cat sat” / “the dog sat” / “the cat ran” (relative TF, standard idf base-e)

N = 3. df: the = 3, cat = 2, sat = 2, dog = 1, ran = 1
idf(the) = ln(3/3) = 0 (appears in every document)
idf(cat) = ln(3/2) = 0.4055
D1 “the cat sat”: each TF = 1/3 = 0.3333
w(cat, D1) = 0.3333 × 0.4055 = 0.1352; w(the, D1) = 0.3333 × 0 = 0
Most distinctive words in D1: cat, sat. “the” correctly drops to 0.
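As a rough check of this example, the sketch below weights every term in D1 and ranks them. The names `weights` and `ranked` are assumptions for the illustration, and ties are broken alphabetically, which may not be how the tool orders them.

```python
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
N = len(docs)
# Document frequency: count each term once per document it appears in.
df = Counter(term for doc in docs for term in set(doc))

def weights(doc):
    """Relative TF x ln(N / df) for every distinct term in one document."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in counts.items()}

w = weights(docs[0])
ranked = sorted(w, key=lambda t: (-w[t], t))  # highest weight first

print(ranked)                                  # ['cat', 'sat', 'the']
print({t: round(v, 4) for t, v in w.items()})  # {'the': 0.0, 'cat': 0.1352, 'sat': 0.1352}
```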

scikit-learn reconciliation — same corpus, raw TF, smoothed idf, L2-normalised, term “dog” in D2

idf(dog) = ln[(1+3)/(1+1)] + 1 = ln(2) + 1 = 1.6931
idf(the) = ln(4/4) + 1 = 1.0000; idf(sat) = ln(4/3) + 1 = 1.2877
D2 “the dog sat” raw counts all 1 → column (the, sat, dog) = (1.0000, 1.2877, 1.6931)
‖D2‖ = √(1.0000² + 1.2877² + 1.6931²) = √5.5249 = 2.3505
w(dog, D2) = 1.6931 / 2.3505 = 0.7203
Matches TfidfVectorizer(smooth_idf=True, norm='l2').
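The reconciliation can be re-run without scikit-learn installed. The sketch below only repeats the arithmetic above in plain Python; it does not call TfidfVectorizer, and the variable names are assumptions for the example.

```python
import math

N = 3
df = {"the": 3, "sat": 2, "dog": 1}  # document frequencies from the example corpus

def smoothed_idf(n_docs, doc_freq):
    """scikit-learn style smoothed idf: ln[(1 + N) / (1 + df)] + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# D2 = "the dog sat": every raw count is 1, so each weight equals the idf.
column = {t: 1 * smoothed_idf(N, df[t]) for t in ("the", "sat", "dog")}
norm = math.sqrt(sum(w * w for w in column.values()))
normalised = {t: w / norm for t, w in column.items()}

print({t: round(w, 4) for t, w in column.items()})  # {'the': 1.0, 'sat': 1.2877, 'dog': 1.6931}
print(round(norm, 4))                                # 2.3505
print(round(normalised["dog"], 4))                   # 0.7203
```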

IDF-zero boundary — a word in every document

If a term sits in all N documents, df = N
Standard idf = log(N / N) = log(1) = 0
So its TF-IDF is 0 in every document, whatever its count
Smoothed idf = ln[(1+N)/(1+N)] + 1 = 0 + 1 = 1, keeping a baseline weight
This is why “stop words” like the, is, of often vanish under standard idf.
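A short sketch of the boundary for a few corpus sizes. The function names and the sample values of N are illustrative assumptions.

```python
import math

def standard_idf(n_docs, df):
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    return math.log((1 + n_docs) / (1 + df)) + 1.0

for n_docs in (3, 10, 1000):
    df = n_docs  # the term appears in every document
    print(n_docs, standard_idf(n_docs, df), smoothed_idf(n_docs, df))
# 3 0.0 1.0
# 10 0.0 1.0
# 1000 0.0 1.0

# Whatever the raw count, the standard weight stays at zero.
print(50 * standard_idf(4, 4))  # 0.0
```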

Frequently asked questions

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-10. TF-IDF is a stable mathematical definition, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled against scikit-learn.

