TF-IDF Calculator
Paste a few documents and see the full TF-IDF working — the term-frequency counts, the IDF for every word, and the final weighted matrix. Supports the textbook formula and scikit-learn's smoothed, L2-normalised variant. No signup, nothing uploaded.
How it works
TF-IDF (term frequency–inverse document frequency) scores how important a word is to one document within a collection. A word that appears often in a document but rarely across the corpus gets a high score; a word that appears everywhere gets a low one. The definitions here follow Manning, Raghavan & Schütze's Introduction to Information Retrieval, Chapter 6, and scikit-learn's TfidfVectorizer.
The tool computes it in four steps:
- Tokenise. Each line is split on whitespace into unigrams. With the default toggle on, tokens are lower-cased and stripped of leading and trailing punctuation, then a sorted vocabulary is built from every document.
- Term frequency. `raw` uses the count itself; `relative` divides by the document length; and `sublinear` uses 1 + ln(count), damping very frequent words (the same option as scikit-learn's `sublinear_tf`).
- Inverse document frequency. The document frequency df is how many documents contain the term. Standard idf is `log_b(N / df)` for base e, 10, or 2. The scikit-learn smoothed form is `ln[(1 + N) / (1 + df)] + 1`; the +1 inside avoids dividing by zero, and the trailing +1 stops a term that appears in every document from being zeroed out.
- Multiply and optionally normalise. Each weight is `tf × idf`. Turning on L2-normalisation divides each document's column by its Euclidean norm, so every document vector has unit length, as required to match `TfidfVectorizer`'s default `norm='l2'`.
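The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the calculator's actual source: the function names (`tokenise`, `tf`, `idf`, `tfidf`) and their parameters are hypothetical, tokens that become empty after punctuation stripping are assumed to be dropped, and the matrix is stored with one document per row rather than per column.

```python
import math
import string
from collections import Counter


def tokenise(doc):
    """Split on whitespace, lower-case, strip leading/trailing punctuation."""
    stripped = (tok.lower().strip(string.punctuation) for tok in doc.split())
    # Assumption: tokens left empty after stripping are discarded.
    return [tok for tok in stripped if tok]


def tf(count, length, mode="raw"):
    """Term frequency: 'raw', 'relative', or 'sublinear' (1 + ln(count))."""
    if count == 0:
        return 0.0
    if mode == "relative":
        return count / length
    if mode == "sublinear":
        return 1.0 + math.log(count)
    return float(count)


def idf(n_docs, df, smooth=False, base=math.e):
    """Standard log_b(N / df), or the smoothed ln[(1 + N) / (1 + df)] + 1."""
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    return math.log(n_docs / df, base)


def tfidf(docs, tf_mode="raw", smooth=False, base=math.e, l2=False):
    """Return (sorted vocabulary, one weight row per document)."""
    tokenised = [tokenise(d) for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for toks in tokenised if t in toks) for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [tf(counts[t], len(toks), tf_mode) * idf(n_docs, df[t], smooth, base)
               for t in vocab]
        if l2:
            norm = math.sqrt(sum(w * w for w in row))
            if norm > 0:
                row = [w / norm for w in row]
        rows.append(row)
    return vocab, rows


docs = ["The cat sat.", "the dog sat", "The cat ran!"]
vocab, weights = tfidf(docs, smooth=True, l2=True)
for doc, row in zip(docs, weights):
    print(doc, [round(w, 4) for w in row])
```

With the standard (unsmoothed) idf, "the" appears in all three documents and scores exactly 0; with the smoothed form its idf is 1, so it keeps a non-zero weight.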
One subtlety worth knowing: scikit-learn's default tokeniser is a regex that keeps only tokens of two or more alphanumeric characters, so it discards single-character words and can break a token at internal punctuation. Its vocabulary can therefore differ slightly from this tool's plain whitespace split. For the toy corpora students usually check, the two agree once you select raw TF, smoothed IDF, and L2-normalisation. As a consistency check, the calculator re-derives every idf a second way, using the subtraction form (log N − log df) ÷ log b, and confirms the two routes agree. The optional cosine-similarity matrix then reuses the same vectors to show how alike the documents are.
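The idf cross-check and the cosine-similarity step can be illustrated as follows. This is a rough sketch rather than the tool's own code: the helper names (`idf_ratio`, `idf_subtraction`, `idfs_agree`, `cosine`) and the tolerance of 1e-12 are assumptions made for the example.

```python
import math


def idf_ratio(n_docs, df, base=math.e):
    """Direct form: log_b(N / df)."""
    return math.log(n_docs / df) / math.log(base)


def idf_subtraction(n_docs, df, base=math.e):
    """Subtraction form: (log N - log df) / log b."""
    return (math.log(n_docs) - math.log(df)) / math.log(base)


def idfs_agree(n_docs, df, base=math.e, tol=1e-12):
    """True when both routes give the same idf up to floating-point noise."""
    return math.isclose(idf_ratio(n_docs, df, base),
                        idf_subtraction(n_docs, df, base),
                        rel_tol=tol, abs_tol=tol)


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumption: an all-zero vector is similar to nothing.
    return dot / (norm_u * norm_v)


# Check every base the calculator offers on a three-document corpus.
for base in (math.e, 10, 2):
    for df in (1, 2, 3):
        assert idfs_agree(3, df, base)
```

When the vectors are already L2-normalised, both norms are 1 and the cosine reduces to the plain dot product, which is why the similarity matrix can reuse the same vectors directly.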
Worked examples
Frequently asked questions
Sources & references
- Manning, Raghavan & Schütze — Introduction to Information Retrieval, Ch. 6: tf, df, idf = log(N/df), and tf-idf weighting
- scikit-learn — Tf–idf term weighting: the smoothed idf formula, sublinear_tf, and L2 normalisation
The formulas on this page were last cross-checked against these sources on 2026-06-10. TF-IDF is a stable mathematical definition, so the formulas themselves need no periodic updates; only the worked examples are periodically re-reconciled against scikit-learn.