AI Language Detector — 100+ Languages, No Signup
Paste any text and find out what language it is. Sinhala, Tamil, Hindi, English, Chinese, French: detection is instant for scripts that map to a single language via Unicode analysis, and fine-grained for shared scripts such as Latin, Cyrillic, and Devanagari via a server-side tri-gram classifier. No upload to a third party, no API keys, no ads.
How it works
The page runs two passes on the same input and reconciles them. The two methods use entirely different signals, which is exactly why they cover each other's weak spots — and why watching them agree or disagree is a useful trust check on any single reading.
1. Unicode script analysis
Every character in any text has a fixed Unicode code point — for example, the Sinhala letter ක is U+0D9A. The Unicode Consortium publishes which blocks belong to which script; you can read the full mapping at the link in the sources section. The tool iterates over the input character-by-character and counts how many code points fall into each script range. Whitespace, digits, and common punctuation are counted but not attributed to any script.
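The counting pass described above can be sketched in a few lines of Python. This is an illustration only, not the page's actual code: the names script_of and count_scripts are hypothetical, the range table is a small hand-picked subset of Unicode blocks, and the Latin test (any alphabetic code point up to U+024F) is a simplifying assumption.

```python
from collections import Counter
from typing import Optional, Tuple

# Illustrative subset of Unicode block ranges (inclusive). A complete table
# would be derived from the Unicode Character Database (Scripts.txt).
SCRIPT_RANGES = {
    "Sinhala": (0x0D80, 0x0DFF),
    "Tamil": (0x0B80, 0x0BFF),
    "Thai": (0x0E00, 0x0E7F),
    "Hebrew": (0x0590, 0x05FF),
    "Devanagari": (0x0900, 0x097F),
    "Cyrillic": (0x0400, 0x04FF),
}


def script_of(ch: str) -> Optional[str]:
    """Return the script name for one character, or None if unattributed."""
    cp = ord(ch)
    if cp <= 0x024F:
        # Assumption: alphabetic characters in the Basic Latin through
        # Latin Extended-B blocks are treated as Latin.
        return "Latin" if ch.isalpha() else None
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= cp <= high:
            return name
    return None


def count_scripts(text: str) -> Tuple[Counter, int]:
    """Count code points per script; also tally unattributed characters."""
    counts: Counter = Counter()
    unattributed = 0
    for ch in text:
        script = script_of(ch)
        if script is None:
            unattributed += 1  # whitespace, digits, punctuation, unknown blocks
        else:
            counts[script] += 1
    return counts, unattributed
```

Calling count_scripts on a string that mixes one Sinhala letter with an English word returns a per-script tally plus a separate count for the spaces, digits, and punctuation, which mirrors the "counted but not attributed" behaviour described above.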
Many scripts map to one language in everyday writing: Sinhala (U+0D80–U+0DFF), Tamil (U+0B80–U+0BFF), Thai (U+0E00–U+0E7F), Hebrew (U+0590–U+05FF), Khmer (U+1780–U+17FF), Korean Hangul (U+AC00–U+D7AF), Greek (U+0370–U+03FF), Armenian, Georgian, Ethiopic, Lao, Myanmar, Bengali, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, and Oriya. For these, the tool treats a single character of input as enough to name the language, and the classifier is not even consulted. A few of these scripts do carry other languages (Hebrew script is also used for Yiddish, Bengali script for Assamese, Ethiopic for both Amharic and Tigrinya), so the answer is best read as the script's default language rather than a certainty.
Some scripts are shared by many languages: Latin (English, French, German, Spanish, Italian, Dutch, Polish, Turkish, Vietnamese, Swahili, hundreds more), Cyrillic (Russian, Bulgarian, Ukrainian, Serbian), Arabic (Arabic, Urdu, Persian, Pashto), Devanagari (Hindi, Marathi, Sanskrit, Nepali), and Han (Chinese, Japanese kanji). For these the script alone is not enough — and that is where the tri-gram classifier steps in.
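The split between single-language scripts and shared scripts amounts to a routing decision on the dominant script. A minimal Python sketch follows, assuming per-script counts are already available; the table contents, the route function, and the returned tuple shape are all hypothetical and only cover a handful of the scripts listed above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup: scripts treated as single-language, with ISO 639-3 codes.
UNIQUE_SCRIPT_LANGUAGE = {
    "Sinhala": "sin",
    "Tamil": "tam",
    "Thai": "tha",
    "Hebrew": "heb",
    "Khmer": "khm",
    "Hangul": "kor",
    "Greek": "ell",
}

# Scripts shared by several languages; these need the classifier.
SHARED_SCRIPTS = {"Latin", "Cyrillic", "Arabic", "Devanagari", "Han"}


def route(counts: Dict[str, int]) -> Tuple[str, Optional[str], float]:
    """Decide how to handle an input from its per-script letter counts.

    Returns (decision, language, share) where decision is "script",
    "classifier", or "none", and share is the dominant script's fraction
    of all attributed characters.
    """
    total = sum(counts.values())
    if total == 0:
        return ("none", None, 0.0)
    dominant = max(counts, key=lambda name: counts[name])
    share = counts[dominant] / total
    if dominant in UNIQUE_SCRIPT_LANGUAGE:
        return ("script", UNIQUE_SCRIPT_LANGUAGE[dominant], share)
    # Shared (or unlisted) script: the script alone cannot decide.
    return ("classifier", None, share)
```

With counts of three Sinhala letters and one Latin letter, this sketch answers "sin" directly with a share of 0.75; with Latin-only counts it defers to the classifier.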
2. Tri-gram classifier (server-side)
When the dominant script is shared, the API route invokes franc v6.2.0, a pure-JavaScript implementation of the tri-gram language-ID method published by Vatanen, Väyrynen and Virpioja (LREC 2010). The library ships per-language tri-gram profiles built from translations of the Universal Declaration of Human Rights and compares your input's tri-gram frequencies to every known profile, returning ranked ISO 639-3 candidates with a similarity score between 0 and 1. It covers 187 languages out of the box, weighs about 150 KB on disk, and is never downloaded to your browser; only the small JSON result is.
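To make the idea of comparing tri-gram profiles concrete, here is a toy Python sketch. It is not franc's code or API: the function names are hypothetical, the reference profiles are built from whatever sample text you pass in, and the rank-difference ("out-of-place") distance with a fixed penalty of 300 is one common choice that may differ from what the library does internally.

```python
from collections import Counter
from typing import Dict, List, Tuple

TOP_N = 300  # assumed profile size and penalty for an unseen tri-gram


def trigrams(text: str) -> Counter:
    """Count overlapping 3-character windows in normalised, padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def ranked(profile: Counter) -> Dict[str, int]:
    """Map each of the most frequent tri-grams to its rank (0 = most common)."""
    return {gram: rank for rank, (gram, _) in enumerate(profile.most_common(TOP_N))}


def distance(sample_ranks: Dict[str, int], reference_ranks: Dict[str, int]) -> int:
    """Sum of rank differences; tri-grams missing from the reference cost TOP_N."""
    total = 0
    for gram, rank in sample_ranks.items():
        if gram in reference_ranks:
            total += abs(rank - reference_ranks[gram])
        else:
            total += TOP_N
    return total


def classify(text: str, references: Dict[str, Dict[str, int]]) -> List[Tuple[str, float]]:
    """Return (language, score) pairs, best first, with scores in [0, 1]."""
    sample_ranks = ranked(trigrams(text))
    worst = max(len(sample_ranks) * TOP_N, 1)  # largest possible distance
    scored = [
        (lang, 1 - distance(sample_ranks, ranks) / worst)
        for lang, ranks in references.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A reference set would be built once, for example `{"eng": ranked(trigrams(english_sample)), "fra": ranked(trigrams(french_sample))}`, and reused for every request. Real profiles come from far larger corpora than a single sentence, which is why a toy version like this is only useful for seeing the mechanics.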
We feed the classifier the whole input unless the text is mixed-script. For mixed inputs the server filters down to the dominant-script characters first so the classifier is not confused by foreign runs. The top five candidates are surfaced; the top one becomes the headline answer when the dominant script is shared.
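The filtering step for mixed-script input can be sketched as follows. This Python version is an approximation under stated assumptions: it guesses a character's script from the first word of its Unicode character name (for example "CYRILLIC" or "LATIN"), which is a rough heuristic rather than the authoritative Scripts.txt mapping, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional


def rough_script(ch: str) -> Optional[str]:
    """Guess a script label from the Unicode character name (heuristic)."""
    if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
        return None  # whitespace, digits, punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def filter_to_dominant(text: str) -> str:
    """Keep only dominant-script characters when the input mixes scripts."""
    counts = Counter(s for s in map(rough_script, text) if s is not None)
    if len(counts) <= 1:
        return text  # single-script input is passed through whole
    dominant = counts.most_common(1)[0][0]
    # Replace foreign runs with spaces so word boundaries survive, then tidy up.
    kept = "".join(ch if rough_script(ch) == dominant else " " for ch in text)
    return " ".join(kept.split())
```

For an input that is mostly Cyrillic with a trailing English word, the sketch returns only the Cyrillic words, so the classifier sees tri-grams from one script only.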
3. Reconciliation
Both signals feed reconcile(), which applies a simple rule set:
if dominant script is unique to one language:
    return that language (method = "script", confidence = script-share)
else if classifier returned a recognised label:
    if classifier's script matches the dominant script:
        return classifier top-1 (method = "agreement")
    else:
        return classifier top-1 (method = "classifier")
else if some script is present:
    return script-default language (method = "script")
else:
    refuse with NO_LETTERS_MESSAGE
The result panel always shows the detected language, confidence, and method, plus a per-script breakdown so any mixed-script input is visible at a glance. Inputs shorter than 24 characters get a low-reliability warning, because tri-gram similarity scores become much less reliable on very short snippets.
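The rule set above translates almost line for line into runnable code. The Python sketch below is one possible reading, not the page's implementation: the Result type, the two lookup tables, the message strings, and the choice to report the classifier's similarity score as the confidence are all assumptions. The "und" check reflects the label franc returns when it cannot determine a language.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Placeholder wording; the page's actual messages are not given in the text.
NO_LETTERS_MESSAGE = "No letters found in the input."
LOW_RELIABILITY_WARNING = "Short input: result may be unreliable."
SHORT_INPUT_THRESHOLD = 24  # characters, as stated above

# Hypothetical tables for illustration.
UNIQUE_SCRIPT_LANGUAGE = {"Sinhala": "sin", "Tamil": "tam", "Thai": "tha"}
SCRIPT_DEFAULT_LANGUAGE = {"Latin": "eng", "Cyrillic": "rus", "Devanagari": "hin"}


@dataclass
class Result:
    language: Optional[str]
    method: Optional[str]
    confidence: float
    warning: Optional[str] = None


def reconcile(
    text: str,
    dominant: Optional[str],
    share: float,
    classifier_top: Optional[Tuple[str, float]] = None,
    classifier_script: Optional[str] = None,
) -> Result:
    """Combine the script pass and the classifier pass into one answer."""
    warning = LOW_RELIABILITY_WARNING if len(text) < SHORT_INPUT_THRESHOLD else None
    if dominant in UNIQUE_SCRIPT_LANGUAGE:
        return Result(UNIQUE_SCRIPT_LANGUAGE[dominant], "script", share, warning)
    if classifier_top is not None and classifier_top[0] != "und":
        label, score = classifier_top
        method = "agreement" if classifier_script == dominant else "classifier"
        return Result(label, method, score, warning)  # confidence = score (assumed)
    if dominant is not None:
        return Result(SCRIPT_DEFAULT_LANGUAGE.get(dominant), "script", share, warning)
    return Result(None, None, 0.0, NO_LETTERS_MESSAGE)
```

Keeping the method field in the result is what lets the panel show whether the two passes agreed, which is the trust check the page relies on.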
Language detection is usually step zero in a longer pipeline. Once you know the language, the next move is often analysis: confirm the tone of a paragraph with the AI Sentiment Analyzer, condense a long passage with the AI Text Summarizer, or — if you are feeding the text to a large language model — measure how much it will cost with the AI Token Counter. Each one runs in the same no-signup, sources-cited way as this detector.
Sources & references
- franc (v6.2.0) — pure-JS tri-gram language identification library
- Vatanen, Väyrynen, Virpioja (2010) — Language identification of short text segments with N-gram models (LREC)
- Unicode Consortium — Code Charts (script range definitions)
- Unicode Character Database — Scripts.txt (authoritative codepoint-to-script mapping)
- SIL International — ISO 639-3 code tables
The classifier library, language list, and Unicode script ranges were last cross-checked on 2026-05-12. Script blocks are stable across Unicode versions for the languages covered here; the library is reviewed each time we update languages or upgrade dependencies.
Comments & feedback
Found a misclassification, missing language, or have a model suggestion?
Email me at [email protected] — most fixes ship within 24 hours.