Question 1

What is named entity recognition (NER)?

Accepted Answer

Named entity recognition is the task of finding spans of text that refer to real-world things and labelling each one with a category. The four standard categories for English, set by the CoNLL-2003 shared task, are PER (people), ORG (organisations), LOC (locations) and MISC (other proper nouns such as nationalities, languages, and named events). NER is the first step of most knowledge-extraction pipelines: once you know which people, organisations, and places a text mentions, you can link, count, summarise, or fact-check them.

Question 2

How accurate is this NER tool?

Accepted Answer

It runs dslim/bert-base-NER, a BERT-base checkpoint fine-tuned on CoNLL-2003 (Reuters newswire). The upstream model card reports F1 ≈ 91.3 on the CoNLL-2003 test set. F1 is the harmonic mean of precision (the share of predicted entities that are correct) and recall (the share of true entities that are found), so on news-style English roughly nine out of ten entities are both found and labelled correctly. Accuracy drops on social posts with informal capitalisation, on heavily technical jargon, and on text that differs from the 1996–1997 Reuters articles the model was trained on. Raising the confidence threshold above trades recall for precision; the title-case heuristic mentioned in the methodology section is a sanity check you can run by eye.
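To make the F1 figure concrete, here is a minimal sketch that computes precision, recall, and F1 from entity-level counts. The function name and the counts are made up for illustration; they are not taken from the model card.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for entity-level counts."""
    predicted = true_positives + false_positives   # entities the model output
    actual = true_positives + false_negatives      # entities in the gold data
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(true_positives=90, false_positives=10, false_negatives=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.918 f1=0.909
```

Raising a confidence threshold usually removes some false positives (higher precision) at the price of dropping some true entities (lower recall); F1 summarises both in one number.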

Question 3

Is anything uploaded? Where does my text go?

Accepted Answer

Your text is sent once to this site's server, where it is passed to the Hugging Face Inference API for scoring, then discarded. We do not log input text, do not store it, and do not run analytics on its contents. If neural inference is not enabled on the current build (no HF token configured), the tool returns a clear placeholder instead of running anything.
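The request flow described above could look roughly like the sketch below. It is not this site's actual server code: `analyse`, `call_model`, and the response fields are hypothetical names, and the real call to the Hugging Face Inference API is passed in as a function so the sketch stays runnable without network access.

```python
from typing import Callable, List, Optional


def analyse(text: str,
            hf_token: Optional[str],
            call_model: Callable[[str, str], List[dict]]) -> dict:
    """Score one piece of text, or return a placeholder if inference is off.

    Nothing is written to disk or to a log; `text` only lives for the
    duration of this call.
    """
    if not hf_token:
        # No token configured: do not run anything, return a clear placeholder.
        return {
            "status": "disabled",
            "entities": [],
            "message": "Neural inference is not enabled on this build.",
        }
    entities = call_model(text, hf_token)  # single upstream call
    return {"status": "ok", "entities": entities}


def fake_model(text: str, token: str) -> List[dict]:
    """Stand-in for the upstream API, returning a fixed made-up result."""
    return [{"word": "Colombo", "entity_group": "LOC", "score": 0.99}]


print(analyse("Flights to Colombo resumed.", None, fake_model)["status"])      # disabled
print(analyse("Flights to Colombo resumed.", "token", fake_model)["status"])   # ok
```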

Question 4

Why server-side and not in my browser?

Accepted Answer

A browser-side BERT-NER would have to download roughly 110 MB of model weights before the first analysis, and that figure already assumes 8-bit quantisation: BERT-base has about 110 million parameters, so full-precision weights are roughly four times larger. On a typical Sri Lankan home connection that is a 30-second wait the user did not ask for, and on mobile it eats data. Running inference server-side keeps the page lightweight (under 100 KB of JavaScript), keeps the first analysis fast, and lets the tool work on low-end devices that struggle to host ONNX Runtime Web.
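The wait time is simple arithmetic: size in megabytes times eight, divided by link speed in megabits per second. The sketch below works it through; the 30 Mbit/s and 10 Mbit/s speeds are assumed values for illustration, not measurements.

```python
def download_seconds(size_megabytes: float, speed_megabits_per_second: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_megabytes * 8 / speed_megabits_per_second


print(round(download_seconds(110, 30), 1))  # 29.3
print(round(download_seconds(110, 10), 1))  # 88.0
```

So a 30-second wait corresponds to a link of about 30 Mbit/s; on a slower connection the same download takes proportionally longer.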

Question 5

Does it work with Sinhala or Tamil text?

Accepted Answer

No. dslim/bert-base-NER was trained on English Reuters newswire only. Sinhala or Tamil names written in Latin script are sometimes caught because their capitalisation looks like an English proper noun, but the labels are unreliable. Input written in Sinhala or Tamil script returns no entities at all. There is no production-quality browser-runnable Sinhala or Tamil NER model today; if that changes we will ship a localised page.
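If you are calling a tool like this from your own script, a cheap pre-check can flag Sinhala or Tamil script before you send the text. The sketch below is an assumption about how such a check might look, not something this tool does; it relies only on the Unicode block ranges for Sinhala (U+0D80 to U+0DFF) and Tamil (U+0B80 to U+0BFF).

```python
from typing import Optional


def detect_unsupported_script(text: str) -> Optional[str]:
    """Return 'Sinhala' or 'Tamil' if the text contains that script, else None."""
    for ch in text:
        code_point = ord(ch)
        if 0x0D80 <= code_point <= 0x0DFF:   # Sinhala Unicode block
            return "Sinhala"
        if 0x0B80 <= code_point <= 0x0BFF:   # Tamil Unicode block
            return "Tamil"
    return None


print(detect_unsupported_script("Colombo"))   # None
print(detect_unsupported_script("කොළඹ"))      # Sinhala
print(detect_unsupported_script("சென்னை"))    # Tamil
```

The function returns on the first matching character, so mixed-script text is reported by whichever script appears first.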

Question 6

What are the four CoNLL entity types?

Accepted Answer

PER — named human beings (politicians, authors, fictional characters). ORG — named organisations (companies, parties, agencies, sports clubs). LOC — geographic locations (cities, countries, regions, named features). MISC — every other proper noun, including nationalities, languages, named events, and titles of works. The taxonomy is intentionally small — DATE, MONEY, EVENT, PRODUCT are NOT in it. If you need those, you want a different model.

Question 7

What is the difference between the `simple` and `first` aggregation strategies?

Accepted Answer

Both decide how to merge per-token predictions into entity spans. `simple` merges any neighbouring tokens that share the same base type (PER, ORG, etc.) — greedy and good for everyday English but it can swallow a LOC inside an ORG name (`Election Commission of Sri Lanka`). `first` starts a new span on every B- tag, so embedded entities surface as separate rows at the cost of more fragments. Switch between them with the dropdown above; results update instantly without a model re-run.
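The difference between the two strategies as described here can be sketched in a few lines. This is an illustration of the behaviour in this answer, not the tool's actual code and not the Hugging Face pipeline implementation; the function name and the tags in the example are hypothetical.

```python
def aggregate(tagged, strategy="simple"):
    """Merge (word, tag) pairs into (type, text) spans.

    Tags use the BIO scheme: 'B-ORG', 'I-ORG', ... or 'O' for no entity.
    'simple' merges neighbours with the same base type and ignores B-/I-.
    'first' also starts a new span whenever it sees a B- tag.
    """
    spans = []
    current = None  # span currently being built
    for word, tag in tagged:
        if tag == "O":
            current = None
            continue
        prefix, base = tag.split("-", 1)
        starts_new = (
            current is None
            or current["type"] != base
            or (strategy == "first" and prefix == "B")
        )
        if starts_new:
            current = {"type": base, "words": [word]}
            spans.append(current)
        else:
            current["words"].append(word)
    return [(span["type"], " ".join(span["words"])) for span in spans]


# Hypothetical per-token tags for the phrase used above.
tagged = [("Election", "B-ORG"), ("Commission", "I-ORG"), ("of", "I-ORG"),
          ("Sri", "B-ORG"), ("Lanka", "I-ORG")]

print(aggregate(tagged, "simple"))
# [('ORG', 'Election Commission of Sri Lanka')]
print(aggregate(tagged, "first"))
# [('ORG', 'Election Commission of'), ('ORG', 'Sri Lanka')]
```

Because both strategies work on the same per-token predictions, switching between them only needs the merge step to run again, which is why no new model call is required.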

Question 8

Can I export the entity list?

Accepted Answer

Yes. The Copy JSON button gives you the full structured payload (one object per unique entity with every occurrence). Copy List drops a newline-separated plain-text list of canonical surface forms. Download CSV gives you an RFC 4180-compliant file with Entity, Type, Count, AvgConfidence and FirstPosition columns. Everything is generated client-side after the inference call returns.
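For readers who want to reproduce the CSV layout in their own scripts, here is a minimal sketch using the column names listed above. It is written in Python for illustration (the tool itself builds the file in the browser), and the input dictionary keys are hypothetical. Python's `csv` writer in its default dialect emits CRLF line endings and doubles embedded quotes, which matches the main rules of RFC 4180.

```python
import csv
import io


def entities_to_csv(rows):
    """Serialise entity summaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")   # keep the writer's CRLF endings untouched
    writer = csv.writer(buffer)        # default dialect: CRLF, quotes doubled
    writer.writerow(["Entity", "Type", "Count", "AvgConfidence", "FirstPosition"])
    for row in rows:
        writer.writerow([
            row["entity"],
            row["type"],
            row["count"],
            f'{row["avg_confidence"]:.3f}',
            row["first_position"],
        ])
    return buffer.getvalue()


# Made-up rows; the comma in the first entity forces quoting.
rows = [
    {"entity": "Smith, John", "type": "PER", "count": 2,
     "avg_confidence": 0.95, "first_position": 14},
    {"entity": "Reuters", "type": "ORG", "count": 1,
     "avg_confidence": 0.9912, "first_position": 40},
]
print(entities_to_csv(rows))
```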

Question 9

What is the input character limit?

Accepted Answer

10,000 characters per analysis. That is comfortably enough for a full-length news article or a several-page research extract. Note, however, that the BERT tokeniser truncates at 512 sub-word tokens, which is typically a few hundred English words, so for longer inputs only the early portion is scored. For long documents, run the recogniser on one short section at a time rather than on whole chapters or books.
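One way to stay inside the 512-token window is to split a long document into paragraph-aligned chunks and analyse each one separately. The sketch below uses a word budget as a rough stand-in for the token count, since counting real tokens requires the model's tokeniser; `split_into_chunks` and the default of 300 words are illustrative assumptions, not part of this tool.

```python
def split_into_chunks(text, max_words=300):
    """Group paragraphs (separated by blank lines) into chunks of at most
    max_words words. A single paragraph longer than the budget is kept
    whole as its own chunk and may still be truncated by the model."""
    chunks = []
    current = []   # paragraphs collected for the chunk being built
    count = 0      # words in `current`
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Five made-up paragraphs of 120 words each.
paragraph = " ".join(["word"] * 120)
document = "\n\n".join([paragraph] * 5)
print([len(chunk.split()) for chunk in split_into_chunks(document)])
# [240, 240, 120]
```

Splitting on paragraph boundaries avoids cutting an entity name in half, at the cost of chunks that are sometimes well under the budget.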

Question 10

When were the model and source list last verified?

Accepted Answer

Model card, API endpoint, and entity taxonomy were last cross-checked on 2026-05-12. The Hugging Face model files and Inference API change independently — when an upstream patch lands, server responses pick it up on the next call.

Named Entity Recognition Online — Free, Server-Side, No Signup
