induwara.lk

AI Language Detector — 100+ Languages, No Signup

Paste any text and find out what language it is. Sinhala, Tamil, Hindi, English, Chinese, French — instant for non-Latin scripts via Unicode analysis, fine-grained for Latin scripts via a server-side tri-gram classifier. No upload to a third party, no API keys, no ads.

By Induwara Ashinsana · Updated May 12, 2026
Detection runs server-side. Text is scored once and not stored.
Non-Latin scripts (Sinhala, Tamil, Thai, Korean…) are decided instantly from Unicode alone. Latin / shared scripts go through the franc classifier server-side.

What this does

Reads any text and tells you what language it is, in two passes: a Unicode-script pass that handles Sinhala, Tamil, Thai, Korean, Arabic, and 20+ other scripts instantly, plus a tri-gram classifier covering 187+ languages that disambiguates Latin-script and other shared-script languages (English, French, German, Hindi vs Marathi, etc.). Both run on the server — no model weights download.

Methodology: deterministic Unicode-script analysis (55 languages covered) + the franc tri-gram classifier (187+ languages). Both signals are reconciled and shown side-by-side. Sources are listed under “Sources & references” below.

How it works

The page runs two passes on the same input and reconciles them. The two methods use entirely different signals, which is exactly why they cover each other's weak spots — and why watching them agree or disagree is a useful trust check on any single reading.

1. Unicode script analysis

Every character in any text has a fixed Unicode code point; for example, the Sinhala letter ක is U+0D9A. The Unicode Consortium publishes which blocks belong to which script; you can read the full mapping at the link in the sources section. The tool iterates over the input character-by-character and counts how many code points fall into each script range. Whitespace, digits, and common punctuation are counted but not attributed to any script.
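
The counting loop described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `SCRIPT_RANGES` table is a hypothetical four-script subset, and the names `script_of` and `count_scripts` are invented for the example.

```python
from collections import Counter

# Hypothetical subset of script ranges (inclusive code points) for illustration.
SCRIPT_RANGES = {
    "Sinhala": [(0x0D80, 0x0DFF)],
    "Tamil": [(0x0B80, 0x0BFF)],
    "Thai": [(0x0E00, 0x0E7F)],
    # Basic Latin letters plus accented letters, skipping the × and ÷ signs.
    "Latin": [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x24F)],
}


def script_of(ch):
    """Return the script name for one character, or None if unattributed."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def count_scripts(text):
    """Count code points per script; everything else is tallied as unattributed."""
    counts = Counter()
    unattributed = 0
    for ch in text:
        script = script_of(ch)
        if script is None:
            unattributed += 1  # whitespace, digits, punctuation, unknown scripts
        else:
            counts[script] += 1
    return counts, unattributed


counts, unattributed = count_scripts("Sri Lanka 2026 \u0d9a")
print(dict(counts), unattributed)
# prints: {'Latin': 8, 'Sinhala': 1} 7
```

Keeping the unattributed tally separate means shares can later be computed over attributable characters only, which is how the worked examples further down report them.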

Many scripts map to a single language in everyday writing: Sinhala (U+0D80–U+0DFF), Tamil (U+0B80–U+0BFF), Thai (U+0E00–U+0E7F), Hebrew (U+0590–U+05FF), Khmer (U+1780–U+17FF), Korean Hangul (U+AC00–U+D7AF), Greek (U+0370–U+03FF), Armenian, Georgian, Ethiopic, Lao, Myanmar, Bengali, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, and Oriya. For these, a single character of input is enough for the script pass to name the language, and the classifier is not even consulted. The mapping is a practical default rather than a guarantee: Hebrew script also carries Yiddish, and Bengali script also carries Assamese.

Some scripts are shared by many languages: Latin (English, French, German, Spanish, Italian, Dutch, Polish, Turkish, Vietnamese, Swahili, hundreds more), Cyrillic (Russian, Bulgarian, Ukrainian, Serbian), Arabic (Arabic, Urdu, Persian, Pashto), Devanagari (Hindi, Marathi, Sanskrit, Nepali), and Han (Chinese, Japanese kanji). For these the script alone is not enough — and that is where the tri-gram classifier steps in.

2. Tri-gram classifier (server-side)

When the dominant script is shared, the API route invokes franc v6.2.0, a pure-JavaScript implementation of the tri-gram language-ID method published by Vatanen, Väyrynen and Virpioja (LREC 2010). The library ships per-language tri-gram profiles built from translations of the Universal Declaration of Human Rights and compares your input's tri-gram frequencies to every known profile, returning ranked ISO 639-3 candidates with a similarity score from 0 to 1. It covers 187+ languages out of the box, weighs about 150 KB on disk, and never downloads to your browser; only the small JSON result does.
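
To make the idea of tri-gram profile comparison concrete, here is a generic sketch in Python. It is not franc's code or API: the function names, the rank-difference distance, the penalty of 300, and the two toy profiles (each built from a single sentence) are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def trigrams(text):
    """Return a Counter of overlapping 3-character sequences."""
    # Lowercase, keep only runs of letters, and pad both ends with a space.
    cleaned = " " + " ".join(re.findall(r"[^\W\d_]+", text.lower())) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def ranked(profile, top=300):
    """Map each of the most frequent tri-grams to its rank (0 = most frequent)."""
    return {gram: rank for rank, (gram, _) in enumerate(profile.most_common(top))}


def distance(sample_ranks, language_ranks, penalty=300):
    """Sum of rank differences; tri-grams unknown to the language pay the penalty."""
    return sum(
        abs(rank - language_ranks[gram]) if gram in language_ranks else penalty
        for gram, rank in sample_ranks.items()
    )


def classify(text, profiles):
    """Return language codes ordered from closest profile to farthest."""
    sample = ranked(trigrams(text))
    scored = sorted((distance(sample, ranks), code) for code, ranks in profiles.items())
    return [code for _, code in scored]


# Toy profiles; a real classifier would use profiles built from much more text.
profiles = {
    "eng": ranked(trigrams("the quick brown fox jumps over the lazy dog and the cat")),
    "fra": ranked(trigrams("le renard brun saute par dessus le chien paresseux et le chat")),
}
print(classify("the dog and the fox", profiles))
```

A lower distance means a closer match, so a real system would still need to convert distances into the 0 to 1 similarity scale shown on the result panel; that conversion is left out here.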

We feed the classifier the whole input unless the text is mixed-script. For mixed inputs the server filters down to the dominant-script characters first so the classifier is not confused by foreign runs. The top five candidates are surfaced; the top one becomes the headline answer when the dominant script is shared.
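
The filtering step for mixed input might look something like the sketch below. The helper names are hypothetical, and treating digits and punctuation as removable is an assumption; the page only states that the input is filtered down to dominant-script characters.

```python
import unicodedata


def is_latin_letter(ch):
    """True for letters whose Unicode name starts with LATIN (A, é, ß, ...)."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def filter_to_script(text, in_script):
    """Blank out every non-space character that fails in_script, then tidy spaces."""
    kept = "".join(ch if in_script(ch) or ch.isspace() else " " for ch in text)
    return " ".join(kept.split())


sample = "The IRD published the 2025/26 brackets \u2014 \u0db8\u0dd9\u0dba."
print(filter_to_script(sample, is_latin_letter))
# prints: The IRD published the brackets
```

Replacing foreign runs with spaces rather than deleting them keeps word boundaries intact, which matters because tri-grams that span a word boundary carry part of the signal.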

3. Reconciliation

Both signals feed reconcile() which applies a simple rule set:

if dominant script is unique to one language:
    return that language (method = "script", confidence = script-share)

else if classifier returned a recognised label:
    if classifier's script matches the dominant script:
        return classifier top-1 (method = "agreement")
    else:
        return classifier top-1 (method = "classifier")

else if some script is present:
    return script-default language (method = "script")

else:
    refuse with NO_LETTERS_MESSAGE
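
One way to turn the rule set above into runnable code is sketched here in Python. The lookup tables, the `Result` shape, the message wording, and the choice to report the classifier score as confidence in the second branch are all assumptions; only the branch order and the method labels come from the rules above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NO_LETTERS_MESSAGE = "No letters found in the input."  # placeholder wording

# Hypothetical lookup tables, trimmed to a few entries for illustration.
UNIQUE_SCRIPT_LANGUAGE = {"Sinhala": "sin", "Tamil": "tam", "Thai": "tha"}
SCRIPT_DEFAULT_LANGUAGE = {"Latin": "eng", "Cyrillic": "rus", "Devanagari": "hin"}
LANGUAGE_SCRIPT = {"eng": "Latin", "fra": "Latin", "rus": "Cyrillic", "hin": "Devanagari"}


@dataclass
class Result:
    language: Optional[str]
    method: str
    confidence: Optional[float] = None
    message: Optional[str] = None


def reconcile(dominant_script: Optional[str], script_share: float,
              classifier_top: Optional[Tuple[str, float]]) -> Result:
    # Branch 1: the dominant script maps to exactly one language.
    if dominant_script in UNIQUE_SCRIPT_LANGUAGE:
        return Result(UNIQUE_SCRIPT_LANGUAGE[dominant_script], "script", script_share)
    # Branch 2: the classifier produced a label we recognise.
    if classifier_top is not None and classifier_top[0] in LANGUAGE_SCRIPT:
        code, score = classifier_top
        agrees = LANGUAGE_SCRIPT[code] == dominant_script
        # Reporting the classifier score as confidence is an assumption.
        return Result(code, "agreement" if agrees else "classifier", score)
    # Branch 3: some script is present, so fall back to its default language.
    if dominant_script is not None:
        return Result(SCRIPT_DEFAULT_LANGUAGE.get(dominant_script), "script")
    # Branch 4: nothing attributable was found.
    return Result(None, "none", message=NO_LETTERS_MESSAGE)


print(reconcile("Latin", 1.0, ("fra", 1.0)).method)
# prints: agreement
```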

The result panel always shows the detected language, confidence, and method, plus a per-script breakdown so any mixed-script input is visible at a glance. Inputs shorter than 24 characters get a low-reliability warning, because tri-gram similarity calibration falls off sharply on very short snippets.

Language detection is usually step zero in a longer pipeline. Once you know the language, the next move is often analysis: confirm the tone of a paragraph with the AI Sentiment Analyzer, condense a long passage with the AI Text Summarizer, or — if you are feeding the text to a large language model — measure how much it will cost with the AI Token Counter. Each one runs in the same no-signup, sources-cited way as this detector.

Worked examples

Sinhala — instant, classifier skipped

ශ්‍රී ලංකාවේ අධ්‍යාපනය නොමිලයේ ලබා දෙන රටවල් අතර එකකි.

  1. Codepoint scan: every attributable character lies in U+0D80–U+0DFF (Sinhala)
  2. Script-share table: Sinhala = 100% of attributable characters
  3. Sinhala is a uniquely-mapped script → method = "script"
  4. Result: Sinhala (සිංහල) · ISO 639-1 "si" · 639-3 "sin" · confidence 1.00
  5. Classifier never invoked — Unicode signal is conclusive

French — classifier disambiguates Latin script

Le Sri Lanka est une île de l'océan Indien, située au sud-est de l'Inde.

  1. Codepoint scan: ASCII letters dominate (with é and î)
  2. Script-share table: Latin = 100% of attributable characters
  3. Latin is shared → invoke franc tri-gram classifier
  4. Classifier returns: fra ≈ 1.00, spa ≈ 0.88, src ≈ 0.84, …
  5. Result: French (Français) · ISO 639-1 "fr" · 639-3 "fra"
  6. Script signal matches the classifier → method = "agreement"

Mixed script — Sinhala + English

The IRD published the 2025/26 brackets — මෙය නවතම තොරතුරයි.

  1. Codepoint scan attributes ~26 chars to Latin, ~15 to Sinhala
  2. Script-share table: Latin ≈ 63%, Sinhala ≈ 37%
  3. Neither share crosses the 90% dominance threshold
  4. Two scripts each above 5% → 'mixed-script' flag raised
  5. Server filters input to Latin-only chars and runs the classifier
  6. Result: English headline + Sinhala share shown in the breakdown
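
The two thresholds used in this example (90% for dominance, 5% for the mixed-script flag) can be sketched as a small Python helper. The function name, the return shape, and the exact comparison operators are assumptions; the page does not say whether a share of exactly 90% or exactly 5% counts.

```python
DOMINANCE_THRESHOLD = 0.90  # a script at or above this share is treated as dominant
MIXED_MINIMUM = 0.05        # a script above this share counts toward the mixed flag


def summarise(counts):
    """Turn per-script character counts into shares plus the two flags."""
    total = sum(counts.values())
    if total == 0:
        return {"top_script": None, "dominant": False, "mixed": False, "shares": {}}
    shares = {script: n / total for script, n in counts.items()}
    top_script = max(shares, key=shares.get)
    significant = [s for s, share in shares.items() if share > MIXED_MINIMUM]
    return {
        "top_script": top_script,
        "dominant": shares[top_script] >= DOMINANCE_THRESHOLD,
        "mixed": len(significant) >= 2,
        "shares": shares,
    }


summary = summarise({"Latin": 60, "Sinhala": 40})
print(summary["top_script"], summary["dominant"], summary["mixed"])
# prints: Latin False True
```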

Edge case — short Latin snippet, low reliability

Bonjour

  1. Codepoint scan: 7 Latin letters, no accents present
  2. Script-share table: Latin = 100% — script alone cannot decide
  3. Input is 7 characters, below the 24-character reliability floor
  4. Classifier still runs: fra is the top tri-gram match, but the score is soft
  5. Result: French is surfaced, with a low-reliability warning on the panel
  6. Takeaway: a single word can match several languages — add a full sentence to firm up the call
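
The reliability floor from step 3 amounts to a simple length check. In this minimal sketch the function name is hypothetical, and measuring the trimmed input in code points (rather than letters only) is an assumption.

```python
RELIABILITY_FLOOR = 24  # characters, per the warning rule described on this page


def is_low_reliability(text):
    """Flag inputs too short for tri-gram scores to be trusted."""
    return len(text.strip()) < RELIABILITY_FLOOR


print(is_low_reliability("Bonjour"))
# prints: True
```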

Sources & references

The classifier library, language list, and Unicode script ranges were last cross-checked on 2026-05-12. Script blocks are stable across Unicode versions for the languages covered here; the library is reviewed each time we update languages or upgrade dependencies.

Found a misclassification, missing language, or have a model suggestion?

Email me at [email protected] — most fixes ship within 24 hours.