Question 1

How many languages can this tool detect?

Accepted Answer

Two complementary layers cover different sets. The Unicode-script pass reliably identifies every language whose script is unique to it: Sinhala, Tamil, Thai, Korean, Hebrew, Khmer, Greek, Armenian, Georgian and the rest of the unique-script set in our curated registry of 55 languages. The franc tri-gram classifier adds fine-grained disambiguation across 187 languages for Latin, Cyrillic, Arabic, Devanagari and Han scripts, where script alone is ambiguous (English vs French vs German, Hindi vs Marathi, Russian vs Ukrainian).

Question 2

Does it detect Sinhala and Tamil correctly?

Accepted Answer

Yes, and instantly. Sinhala (Unicode block U+0D80 to U+0DFF) and Tamil (U+0B80 to U+0BFF) each has its own Unicode block that no other widely-written language uses. A single Sinhala or Tamil code point is enough to identify the script with certainty. The pass is deterministic and runs in well under a millisecond on a typical paragraph.

Question 3

Is anything uploaded? Where does my text go?

Accepted Answer

Yes. Detection runs on this server, not on your device. Your text is sent once over HTTPS to be scored and the result comes back; the text is not stored or logged. There is no third-party AI service involved: the classifier is an open-source tri-gram library that runs in the API route. If you want a fully offline tool, copy the text into one of the offline alternatives listed in the methodology section.

Question 4

Why is the model not running in my browser?

Accepted Answer

Full-size neural language-ID models can run to tens or hundreds of megabytes. Shipping that to every visitor on a phone with limited data is a non-starter for a free utility. Server-side detection keeps the page tiny (the textarea loads in milliseconds), works on every device including budget Androids, and uses one shared compute pool instead of making each visitor spend their own battery on the computation.

Question 5

How does Unicode-script detection actually work?

Accepted Answer

Each character has a fixed numeric code point. The Unicode Consortium publishes which ranges belong to which script — for example, U+0D80 to U+0DFF is Sinhala, U+0B80 to U+0BFF is Tamil, U+0900 to U+097F is Devanagari. The tool iterates over every character in the input, looks up its range, and counts attributions per script. Spaces, digits, and punctuation are ignored. The script with the largest share usually wins; for languages whose script is unique to them, that one signal is enough.
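The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the table holds only the three block ranges quoted in the answer plus basic ASCII Latin (added here for the demo), and the names SCRIPT_RANGES, script_of, count_scripts and dominant_script are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional

# (start, end, script). Only the ranges quoted above plus ASCII Latin;
# a real table would cover many more scripts.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),       # A-Z (assumed addition for the demo)
    (0x0061, 0x007A, "Latin"),       # a-z (assumed addition for the demo)
    (0x0900, 0x097F, "Devanagari"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0D80, 0x0DFF, "Sinhala"),
]


def script_of(ch: str) -> Optional[str]:
    """Return the script of one character, or None if it is not counted."""
    # Count only letters (L*) and marks (M*), so spaces, digits and
    # punctuation are ignored. Marks matter: Sinhala and Tamil vowel
    # signs are combining marks, not letters.
    if unicodedata.category(ch)[0] not in ("L", "M"):
        return None
    code_point = ord(ch)
    for start, end, name in SCRIPT_RANGES:
        if start <= code_point <= end:
            return name
    return "Other"


def count_scripts(text: str) -> Counter:
    """Count attributed characters per script."""
    return Counter(s for s in map(script_of, text) if s is not None)


def dominant_script(text: str) -> Optional[str]:
    """Return the script with the largest share, or None for empty input."""
    counts = count_scripts(text)
    return counts.most_common(1)[0][0] if counts else None
```

Calling dominant_script on a Sinhala sentence with a few digits in it returns "Sinhala", because the digits never enter the count.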

Question 6

How does the franc classifier work?

Accepted Answer

franc is a pure-JavaScript implementation of the tri-gram language-ID method described by Vatanen, Väyrynen and Virpioja in their 2010 LREC paper. Its per-language profiles of character tri-gram frequencies are built from translations of the Universal Declaration of Human Rights; at detection time it compares the input's tri-gram profile to each known profile and ranks the languages by similarity. It returns ISO 639-3 codes (e.g. "eng", "fra", "sin", "tam"). It works best on at least one full sentence: short snippets are noisy, and the tool flags any input below 24 characters as low-reliability.
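To make the profile comparison concrete, here is a rough Python sketch of rank-based tri-gram matching. It is not franc's code or data: the two reference profiles are built from single toy sentences, the "out-of-place" rank distance is one common way to compare profiles, and the names trigrams, profile, distance, rank_languages, REFERENCES and the penalty value of 300 are all assumptions made for illustration.

```python
from collections import Counter


def trigrams(text: str) -> list:
    """Lower-case, collapse whitespace, pad with spaces, slice into 3-grams."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def profile(text: str, top: int = 300) -> dict:
    """Map each of the most frequent tri-grams to its rank (0 = most frequent)."""
    ranked = Counter(trigrams(text)).most_common(top)
    return {gram: rank for rank, (gram, _count) in enumerate(ranked)}


def distance(sample: dict, reference: dict, penalty: int = 300) -> int:
    """Sum of rank differences; a tri-gram absent from the reference costs `penalty`."""
    return sum(
        abs(rank - reference[gram]) if gram in reference else penalty
        for gram, rank in sample.items()
    )


def rank_languages(text: str, references: dict) -> list:
    """Return language codes ordered from closest to furthest profile."""
    sample = profile(text)
    scores = {lang: distance(sample, ref) for lang, ref in references.items()}
    return sorted(scores, key=scores.get)


# Toy reference profiles keyed by ISO 639-3 code (illustrative only).
REFERENCES = {
    "eng": profile("the quick brown fox jumps over the lazy dog "
                   "and then the cat sleeps in the house"),
    "fra": profile("le renard brun saute par dessus le chien paresseux "
                   "et puis le chat dort dans la maison"),
}
```

The first element of the returned list is the best guess and the second is the runner-up, which is the pair a result panel would display.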

Question 7

How accurate is it for Latin-script languages?

Accepted Answer

On a paragraph or longer, the tri-gram classifier is usually correct on the first try for the major Latin-script languages (English, French, German, Spanish, Italian, Portuguese, Dutch). Closely-related pairs (e.g. Danish vs Norwegian, Indonesian vs Malay, Czech vs Slovak) sometimes swap places in the top-two — the result panel shows the runner-up so you can spot these. Very short input (single words, slogans, brand names) is unreliable; the page surfaces a warning for short inputs.

Question 8

What about mixed-language text, like English with a few Sinhala words?

Accepted Answer

The script breakdown shows every script that contributes 5% or more of the characters, and the result tile flags the input as 'mixed'. For mixed input, the server feeds only the Latin-script characters to the tri-gram classifier so the non-Latin runs do not confuse it; the non-Latin share is reported separately under 'Unicode script breakdown'. For genuinely bilingual content (code-switching, song lyrics, glossaries), split the text into single-language chunks for a per-chunk answer.
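A rough Python sketch of the mixed-input handling described above follows. It is an approximation under stated assumptions: the script of a character is guessed from the first word of its Unicode character name (a cheap stand-in for a real script table), whitespace is kept in the classifier input so word boundaries survive, and the names MIXED_THRESHOLD, script_shares and analyse_mixed are hypothetical.

```python
import unicodedata
from collections import Counter

MIXED_THRESHOLD = 0.05  # the 5% share quoted above


def script_shares(text: str) -> dict:
    """Return each script's share of the counted characters."""
    counts = Counter()
    for ch in text:
        # Skip spaces, digits and punctuation; keep letters and marks.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        # Generic combining accents carry no script of their own.
        if not name or name.startswith("COMBINING"):
            continue
        # Assumption: the first word of the name approximates the script,
        # e.g. "SINHALA LETTER MAYANNA" -> "Sinhala".
        counts[name.split(" ")[0].title()] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


def analyse_mixed(text: str) -> dict:
    """Report scripts at or above the threshold and the Latin-only text."""
    shares = {
        script: share
        for script, share in script_shares(text).items()
        if share >= MIXED_THRESHOLD
    }
    latin_only = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.name(ch, "").startswith("LATIN ")
    )
    return {
        "shares": shares,
        "mixed": len(shares) > 1,
        "classifier_input": " ".join(latin_only.split()),
    }
```

For an English sentence containing two Sinhala letters, the result is flagged as mixed and classifier_input holds only the English words.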

Question 9

Can I use this for very short snippets — single words or phrases?

Accepted Answer

For unique-script languages, yes: a single Sinhala or Tamil character is enough. For Latin-script or other shared-script input (Cyrillic, Arabic, Devanagari), the classifier needs context, and 24 characters is the practical floor. Below that the page shows a warning and the confidence drops sharply. For brand names and proper nouns specifically there is no good answer either way, because many languages share most names.
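The two cases above can be expressed as a small check. This is a sketch only: how the tool counts the 24 characters is not stated, so the example assumes the length of the trimmed input, lists just the two unique-script languages named in the answer, and uses the hypothetical names MIN_SHARED_SCRIPT_CHARS, UNIQUE_SCRIPT_RANGES and short_input_check.

```python
MIN_SHARED_SCRIPT_CHARS = 24  # practical floor quoted above

# ISO 639-3 code -> Unicode block for two unique-script languages.
UNIQUE_SCRIPT_RANGES = {
    "sin": (0x0D80, 0x0DFF),  # Sinhala
    "tam": (0x0B80, 0x0BFF),  # Tamil
}


def short_input_check(text: str) -> dict:
    """Decide whether a snippet can be labelled reliably.

    One code point from a unique-script block settles the language.
    Anything else is treated as shared-script input and flagged as
    low-reliability when it is shorter than the floor.
    """
    for ch in text:
        for lang, (start, end) in UNIQUE_SCRIPT_RANGES.items():
            if start <= ord(ch) <= end:
                return {"language": lang, "low_reliability": False}
    # Assumption: length is measured on the trimmed input.
    too_short = len(text.strip()) < MIN_SHARED_SCRIPT_CHARS
    return {"language": None, "low_reliability": too_short}
```

A single Sinhala letter comes back as "sin" with no warning, while a four-letter brand name comes back with no language and the low-reliability flag set.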

Question 10

Can it detect romanized Sinhala or Tamil typed in English letters (Singlish)?

Accepted Answer

Not reliably, and that is a known limitation rather than a bug. When Sinhala or Tamil is typed phonetically in Latin letters ("mama gedara yanawa"), the Unicode pass sees only the Latin script, so it hands the text to the tri-gram classifier. franc has no trained profile for romanized Sinhala, so it usually guesses a Latin-script language that shares similar letter patterns. For an accurate read, paste the text in its native script. Romanized-input detection is a genuinely hard, largely unsolved problem.

Question 11

Is the language detector free, and are there usage limits?

Accepted Answer

Yes. It is free, with no signup, no API key, and no ads. The only limit is 10,000 characters per detection, which is far more than a paragraph or two. There is no daily quota for normal use. Because detection runs server-side on an open-source library rather than a paid AI API, there is no per-request cost to pass on to you.

Question 12

What is the input character limit?

Accepted Answer

10,000 characters per detection. For longer documents, run a representative paragraph rather than the whole text: accuracy is comparable and the result comes back faster. Alternatively, split the document into chunks and run each one separately, which helps surface language changes inside long documents that a single overall label would smooth over.
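One way to prepare a long document for chunk-by-chunk detection is sketched below. It is an illustrative helper, not part of the tool: it assumes paragraphs are separated by blank lines, packs whole paragraphs into chunks of at most 10,000 characters, and hard-slices any single paragraph that exceeds the limit. The names MAX_CHARS and split_for_detection are hypothetical.

```python
MAX_CHARS = 10_000  # per-detection limit quoted above


def split_for_detection(document: str, limit: int = MAX_CHARS) -> list:
    """Split a document into chunks no longer than `limit` characters."""
    chunks = []
    current = ""
    for para in document.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A single oversized paragraph is cut into fixed-size slices.
        while len(para) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:limit])
            para = para[limit:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate      # paragraph still fits in this chunk
        else:
            chunks.append(current)   # close the chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted on its own, giving one label per chunk instead of a single label for the whole document.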

Question 13

When were the sources last verified?

Accepted Answer

Library version (franc 6.2.0), language list, and Unicode script ranges were last cross-checked on 2026-05-12. The Unicode script blocks themselves rarely change between versions — Sinhala, Tamil, Devanagari, etc. have been stable for over twenty years. The classifier library is reviewed each time we update or add languages.

AI Language Detector — 100+ Languages, No Signup

