induwara.lk

AI Keyword Extractor — Free, In-Browser, No Upload

Paste an article, abstract, or blog draft and get its most important keywords and key phrases, ranked. Runs two deterministic extractors — YAKE and RAKE — side by side, entirely in your browser. No signup, no model download, sources cited.

By Induwara Ashinsana · Updated Jun 8, 2026
Text stays on your device: pure-JS extractors, no upload, no model download, no logging.
Both extractors run in milliseconds on your device. The English stopword set is 179 terms (Snowball/Porter).

What this does

Reads any block of English text and returns its most important keywords and key phrases, ranked. YAKE looks at position, case, and neighbour diversity per token. RAKE splits the text on stopwords and ranks each phrase by its tokens' degree-over-frequency scores. Pick a method and press Extract keywords.

Methodology: YAKE single-document statistical extractor (Campos et al. 2020 — lower score = better) and RAKE phrase-level extractor (Rose et al. 2010 — higher = better). Both are deterministic. Sources linked below; last verified 2026-05-12.

How it works

The page runs two independent extractors over the same input and shows both, side by side. They use completely different methods, and watching them agree (or disagree) is the fastest way to judge how trustworthy a keyword set is for your text.

1. Shared pre-processing

The input is NFKC-normalized, split into sentences on terminal punctuation, then tokenized on Unicode letter boundaries. Tokens shorter than two characters and pure-numeric runs are dropped. From the surviving tokens the page builds candidate n-grams up to the user-selected length, admitting an n-gram only if neither its first nor its last token is a Snowball English stopword (179 terms).
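The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the page's own JavaScript: the function names `preprocess` and `build_candidates`, the regular expressions, and the tiny stopword subset are all assumptions made for the example, and lowercasing is done here only for simplicity (a YAKE implementation would need the original casing for its case feature).

```python
import re
import unicodedata

# Hypothetical, tiny stopword subset for illustration only;
# the page describes a 179-term English list.
STOPWORDS = {"the", "of", "and", "a", "in", "is", "to", "for"}


def preprocess(text: str) -> list[list[str]]:
    """Return one list of lowercase tokens per sentence."""
    text = unicodedata.normalize("NFKC", text)
    # Split on runs of terminal punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        # [^\W\d_]+ matches runs of Unicode letters, so digits never form tokens.
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())
        # Drop tokens shorter than two characters.
        result.append([t for t in tokens if len(t) >= 2])
    return result


def build_candidates(sentences: list[list[str]], max_n: int = 2) -> dict[str, int]:
    """Count n-grams (1..max_n) whose first and last tokens are not stopwords."""
    counts: dict[str, int] = {}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                key = " ".join(gram)
                counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sentences = preprocess("Ceylon tea is good. Tea farmers in 2024!")
    print(build_candidates(sentences, max_n=2))
```

Note that an n-gram may still contain a stopword in the middle when the length is 3; only the first and last tokens are checked.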

2. YAKE (statistical extractor)

YAKE (Campos et al. 2020, Information Sciences 509) scores every single token using five local features:

TCase     = max(TF_UPPER, TF_ACRONYM) / log2(1 + TF)
TPos      = log(log(3 + medianSentenceIndex))
TFNorm    = TF / (meanTF + stdTF)
TRel      = 1 + (DL + DR) * TF / maxTF
TSentence = sentenceFreq / totalSentences

S(token)  = (TPos * TRel) / (TCase + (TFNorm + TSentence) / TRel)

Lower S(token) means more keyword-like. For an n-gram of tokens (t₁ … tₙ) with raw frequency KF, the candidate score is S(kw) = mean(S(tᵢ)) / (1 + KF). The extractor is fully deterministic — same input always yields the same ranking.
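To make the arithmetic concrete, here is a small sketch that evaluates the formulas above for pre-computed feature values. It assumes `log` means the natural logarithm and treats DL and DR as already-computed neighbour-diversity values; the `TokenStats` container and the function names are hypothetical and chosen for this example only.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenStats:
    tf: int                 # total occurrences of the token (TF)
    tf_upper: int           # capitalized occurrences (TF_UPPER)
    tf_acronym: int         # all-caps occurrences (TF_ACRONYM)
    median_sentence: float  # median index of sentences containing the token
    dl: float               # left-neighbour diversity (DL), assumed precomputed
    dr: float               # right-neighbour diversity (DR), assumed precomputed
    sentence_freq: int      # number of sentences containing the token


def token_score(s: TokenStats, mean_tf: float, std_tf: float,
                max_tf: int, total_sentences: int) -> float:
    """S(token) as written above; lower means more keyword-like."""
    t_case = max(s.tf_upper, s.tf_acronym) / math.log2(1 + s.tf)
    t_pos = math.log(math.log(3 + s.median_sentence))  # natural log assumed
    tf_norm = s.tf / (mean_tf + std_tf)
    t_rel = 1 + (s.dl + s.dr) * s.tf / max_tf
    t_sentence = s.sentence_freq / total_sentences
    return (t_pos * t_rel) / (t_case + (tf_norm + t_sentence) / t_rel)


def candidate_score(token_scores: list[float], kf: int) -> float:
    """S(kw) = mean(S(t_i)) / (1 + KF) for an n-gram seen KF times."""
    return (sum(token_scores) / len(token_scores)) / (1 + kf)


if __name__ == "__main__":
    early_capitalized = TokenStats(2, 2, 0, 0, 0.5, 0.5, 2)
    late_lowercase = TokenStats(2, 0, 0, 3, 0.5, 0.5, 2)
    for stats in (early_capitalized, late_lowercase):
        print(token_score(stats, mean_tf=1, std_tf=1, max_tf=2, total_sentences=4))
```

With these made-up inputs the early, capitalized token receives the lower (better) score, which matches the direction of the TPos and TCase features.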

3. RAKE (co-occurrence extractor)

RAKE (Rose et al. 2010) takes a different angle on the same text. It splits the document into candidate phrases at every stopword and every punctuation boundary, giving a list of maximal runs of content words. For each token w it then counts:

freq(w)   = total occurrences of w across all phrases
deg(w)    = sum of phrase lengths over phrases containing w
wordScore = deg(w) / freq(w)

S(phrase) = sum of wordScore over the phrase's tokens

Higher S(phrase) means more keyword-like. RAKE is particularly effective on technical text where the same content words co-occur in long compound phrases (“reduced graphene oxide”, “impedance spectroscopy plot”) — the degree count rewards exactly those compounds.
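The RAKE steps described above can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the page's JavaScript: the stopword subset, the punctuation set, and the names `candidate_phrases` and `rake` are hypothetical, and degree is accumulated per occurrence of a word.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword subset for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "is", "to", "for", "on", "with"}


def candidate_phrases(text: str) -> list[list[str]]:
    """Split text into maximal runs of content words."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for chunk in re.split(r'[.!?,;:()"]+', text.lower()):
        current: list[str] = []
        for word in re.findall(r"[^\W\d_]+", chunk):
            if word in STOPWORDS:
                # A stopword flushes whatever phrase has been collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text: str) -> list[tuple[str, float]]:
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq: dict[str, int] = defaultdict(int)
    deg: dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            deg[word] += len(phrase)  # phrase length counts toward each member's degree
    scores = {" ".join(p): sum(deg[w] / freq[w] for w in p) for p in phrases}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ("Reduced graphene oxide electrodes improve "
              "the capacitance of graphene supercapacitors.")
    for phrase, score in rake(sample):
        print(f"{score:5.1f}  {phrase}")
```

In the sample sentence the five-word run scores highest because every word in it inherits a degree of at least 5, which is the compound-rewarding behaviour the section describes.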

4. Method agreement

The page reports the percentage overlap between YAKE's top-K and RAKE's top-K. High agreement (≥ 60%) means both extractors converge on the same vocabulary, which is a strong credibility signal: when a feature-based statistical extractor and a co-occurrence extractor independently pick the same phrases, those phrases are very likely central to the document. Low agreement is informative too: it usually flags input where word frequency alone is misleading (lists, code, or text with repeated function-word neighbours).
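One plausible way to compute such an overlap figure is shown below. The exact formula the page uses is not stated, so this sketch assumes the shared phrases are divided by K (the longer of the two lists) after case and whitespace normalization; `agreement_percent` is a hypothetical name.

```python
def _normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(phrase.lower().split())


def agreement_percent(yake_top: list[str], rake_top: list[str]) -> float:
    """Percentage of top-K phrases that appear in both lists (assumed definition)."""
    k = max(len(yake_top), len(rake_top))
    if k == 0:
        return 0.0
    shared = {_normalize(p) for p in yake_top} & {_normalize(p) for p in rake_top}
    return 100.0 * len(shared) / k


if __name__ == "__main__":
    yake_top = ["a", "b", "c", "d", "e", "y1", "y2", "y3"]
    rake_top = ["A", "b", "c", "d", "e", "r1", "r2", "r3"]
    print(agreement_percent(yake_top, rake_top))
```

Under this definition, 5 shared phrases out of 8 give 62.5%, which would fall just above the 60% threshold.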

5. Why not KeyBERT / a neural extractor?

KeyBERT (Grootendorst 2020) and similar embedding-based extractors require shipping a 20–90 MB sentence-transformer model to the browser, or routing your text through a paid inference API. Neither fits the privacy and zero-friction goals of this site. Campos et al. 2020 report that YAKE outperforms ten unsupervised baselines and one supervised method across their benchmark datasets (including SemEval-2010 and Inspec), with no training and no inference cost, so the trade-off is worthwhile.

6. Validation and limits

Inputs shorter than 50 characters are rejected: a fragment that short does not contain enough context for either method to rank candidates meaningfully. Inputs longer than 20,000 characters are rejected to keep extraction well under 100 ms even on older hardware. For book-length material, run the extractor on each section separately. The number of returned keywords is bounded between 1 and 30; the n-gram length is bounded between 1 and 3 tokens.
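The limits above could be enforced along these lines. This is a sketch under assumptions: the status strings and function names are hypothetical, the length is taken as the raw character count, and out-of-range options are clamped rather than rejected, which is one reading of "bounded".

```python
MIN_CHARS, MAX_CHARS = 50, 20_000
MIN_TOP_K, MAX_TOP_K = 1, 30
MIN_NGRAM, MAX_NGRAM = 1, 3


def check_length(text: str) -> str:
    """Return 'ok', 'too_short', or 'too_long' (hypothetical status codes)."""
    if len(text) < MIN_CHARS:
        return "too_short"
    if len(text) > MAX_CHARS:
        return "too_long"
    return "ok"


def clamp(value: int, low: int, high: int) -> int:
    """Pull value into the closed interval [low, high]."""
    return max(low, min(high, value))


def clamp_options(top_k: int, ngram: int) -> tuple[int, int]:
    """Clamp the keyword count and n-gram length into their allowed ranges."""
    return clamp(top_k, MIN_TOP_K, MAX_TOP_K), clamp(ngram, MIN_NGRAM, MAX_NGRAM)


if __name__ == "__main__":
    for size in (49, 50, 20_000, 20_001):
        print(size, check_length("x" * size))
    print(clamp_options(top_k=99, ngram=0))
```

The four sizes in the demo exercise both boundaries: 49 and 20,001 characters are rejected, while 50 and 20,000 pass.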

7. Using it in a content workflow

Keyword extraction is most useful as one step in a larger editing pass. A common sequence: draft the piece, run it through the AI Text Summarizer to confirm the lede actually states the main point, then run it through this extractor to check that your target phrase sits in the top five. If the summary and the keyword list both circle the same idea, the draft is on-topic; if they diverge, the body is burying the lede. For reviews, testimonials, or support replies, pair the keyword list with the AI Sentiment Analyzer so you can see both what a passage is about and the tone it carries. And because the stopword list here is English-only, run mixed-language drafts through the AI Language Detector first: if a passage is mostly Sinhala or Tamil, expect some function words to leak into the ranking until a validated stopword set for those languages ships.

Everything runs in your browser. Neither extractor ever touches the network — the page is fully usable offline once loaded.

Worked examples

Ceylon-tea news brief → top keywords (YAKE + RAKE)

Sri Lankan tea production reached 248 million kilograms in 2024, the lowest annual output in twenty-six years. Smallholder Ceylon tea farmers in the central highlands cite rising fertilizer costs, irregular monsoon rainfall, and labour shortages…

  1. Pre-processing produces 4 sentences and ~70 unique non-stopword tokens.
  2. At n-gram = 2, the YAKE candidate set contains ~90 phrases; RAKE produces ~30 candidate phrases (it splits more aggressively on stopwords).
  3. YAKE surfaces proper-noun phrases first because TCase rewards tokens capitalized after a non-stopword (Ceylon, Sri Lankan, Nuwara Eliya).
  4. RAKE rewards the high deg/freq word "tea", so multi-word phrases containing it ("ceylon tea", "tea farmers", "tea production") sit near the top.
  5. Both methods place "ceylon tea", "tea farmers", and "fertilizer costs" in their top-5, so the agreement column lights up — those are the safe SEO targets.

Graphene supercapacitor abstract → method disagreement

Graphene supercapacitors have emerged as a leading candidate for high-power energy storage thanks to their high specific surface area and excellent electrochemical performance…

  1. Method = Both, n-gram = 3, top-K = 8.
  2. YAKE leans on TPos (early sentences) and TCase: 'Graphene supercapacitors', 'specific capacitance', 'reduced graphene oxide' rank near the top.
  3. RAKE leans on co-occurrence degree: long technical compounds like 'reduced graphene oxide electrodes' and 'galvanostatic charge-discharge curves' get inflated word scores.
  4. Agreement: 5 of 8 phrases match — both methods agree on the document's core vocabulary, while disagreeing on which 3 longer compounds matter most.
  5. Compression: ~180 source words → 8 keywords covering ~85% of the abstract's distinct concepts.

Edge case — pure stopword input

The of and a but it is were the of and a but it is were.

  1. Validator passes (≥ 50 chars) but pre-processing finds 0 content tokens.
  2. buildCandidates returns an empty array — every n-gram starts or ends with a stopword.
  3. rankByRake's phrase buffer also stays empty — every token is a stopword and every match flushes the buffer.
  4. The UI shows a specific message: "No valid keyword candidates were found. The text may be too short or made entirely of stopwords."
  5. Behaviour is symmetric at the bounds: 49 characters → too-short message; 20,001 characters → overflow message.

Sources & references


Found a keyword that misfired, an edge case, or want a different stopword set?

Email me at [email protected] — most fixes ship within 24 hours.