English to Sinhala Translator — Tamil & 63+ Languages, In Your Browser
Translate English ↔ Sinhala, English ↔ Tamil, and 50+ other language pairs without leaving the page. Meta's NLLB-200 runs entirely in your browser via transformers.js — no signup, no upload, no daily character limit, sources cited below.
How it works
The translator wraps Meta's open-weights NLLB-200 distilled-600M model — the “No Language Left Behind” project from Meta AI Research, designed to push machine-translation quality up for the long tail of under-served languages including Sinhala and Tamil. The model weights are quantized to int8 and converted to ONNX by the Xenova port, so they run in your browser through @huggingface/transformers on top of ONNX Runtime Web. WebGPU is used automatically when the browser advertises it; WebAssembly SIMD is the universal fallback.
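The backend choice described above can be sketched as a small capability check. This is a minimal illustration, not the page's actual code: `chooseBackend` and the `Backend` type are hypothetical names, and the check only looks for the `gpu` property that WebGPU-capable browsers expose on `navigator`.

```typescript
// Hypothetical helper: derive pipeline options from browser capabilities.
type Backend = { device: "webgpu" | "wasm"; dtype: "q8" };

// `nav` is anything shaped like the browser's `navigator` object.
// WebGPU-capable browsers expose a `gpu` property on it.
function chooseBackend(nav: { gpu?: unknown }): Backend {
  const hasWebGpu = nav.gpu !== undefined && nav.gpu !== null;
  // "q8" requests the int8-quantized weights; "wasm" is the fallback device.
  return { device: hasWebGpu ? "webgpu" : "wasm", dtype: "q8" };
}
```

In a page, the returned object could be passed as the options argument of `pipeline("translation", "Xenova/nllb-200-distilled-600M", options)` from @huggingface/transformers. A browser can expose `navigator.gpu` and still fail to provide an adapter, so a real loader would probably retry with the WebAssembly backend if WebGPU initialisation fails.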
1. Pre-processing
Input is Unicode-normalised to NFC, zero-width joiners and control characters are stripped (whitespace and line breaks are preserved), and the text is split into sentences using a multilingual terminator regex (English . ! ?, Sanskrit/Devanagari danda ।, line breaks). Common English abbreviations (Mr., Dr., U.S., etc.) are guarded so they don't trigger spurious splits.
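A rough sketch of this step follows, under stated assumptions: the function names, the three-entry abbreviation list, and the exact regular expressions are illustrative rather than the tool's own, and the sketch trims sentences instead of recording the original delimiters for later reassembly.

```typescript
// Illustrative subset of guarded abbreviations (the real list is not published here).
const ABBREVIATIONS: string[] = ["Mr.", "Dr.", "U.S."];

// Normalise to NFC, then strip zero-width characters and control characters.
// Tab, line feed and carriage return are kept so line breaks survive.
function cleanInput(text: string): string {
  return text
    .normalize("NFC")
    .replace(/[\u200B-\u200D\uFEFF]/g, "")
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "");
}

// Split after . ! ? or danda (U+0964) when followed by whitespace, and at line breaks.
function splitSentences(text: string): string[] {
  const boundary = /[.!?\u0964]+(?=\s)|\n+/g;
  const sentences: string[] = [];
  let start = 0;
  let match: RegExpExecArray | null = boundary.exec(text);
  while (match !== null) {
    const end = match.index + match[0].length;
    const candidate = text.slice(start, end).trim();
    const lastWord = candidate.split(/\s+/).pop() || "";
    // A guarded abbreviation does not end the sentence; keep scanning.
    if (ABBREVIATIONS.indexOf(lastWord) === -1) {
      if (candidate.length > 0) sentences.push(candidate);
      start = end;
    }
    match = boundary.exec(text);
  }
  const tail = text.slice(start).trim();
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}
```

For example, `splitSentences("Dr. Perera arrived. He spoke!")` keeps "Dr. Perera arrived." together as one sentence because "Dr." is on the guard list.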
2. Source-language detection
When you pick “Auto-detect”, the tool counts the script of every Unicode codepoint in the input and picks the majority script: Sinhala (U+0D80–U+0DFF), Tamil (U+0B80–U+0BFF), Latin, Devanagari, Arabic, Han, etc. Each script maps to a default FLORES-200 source code (Latin → English, Devanagari → Hindi, and so on). Sinhala and Tamil are settled with high confidence from a single codepoint because no other widely written language uses their script. For a script shared by many languages, the tool can only pick that script's default (English for Latin input, for example); you can override the source from the dropdown at any time.
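The majority-script vote can be sketched as below. This is a simplified illustration: `SCRIPT_RANGES` and `detectSource` are hypothetical names, only four scripts are listed, Latin is reduced to the basic A–Z and a–z ranges, and the single-codepoint shortcut for Sinhala and Tamil is left out in favour of a plain majority count.

```typescript
type ScriptRange = { from: number; to: number; flores: string };

// Illustrative subset of codepoint ranges and their default FLORES-200 codes.
const SCRIPT_RANGES: ScriptRange[] = [
  { from: 0x0d80, to: 0x0dff, flores: "sin_Sinh" }, // Sinhala
  { from: 0x0b80, to: 0x0bff, flores: "tam_Taml" }, // Tamil
  { from: 0x0900, to: 0x097f, flores: "hin_Deva" }, // Devanagari
  { from: 0x0041, to: 0x005a, flores: "eng_Latn" }, // Latin A-Z
  { from: 0x0061, to: 0x007a, flores: "eng_Latn" }, // Latin a-z
];

// Count codepoints per script and return the code of the most frequent one.
function detectSource(text: string): string {
  const counts = new Map<string, number>();
  for (const ch of Array.from(text)) {
    const cp = ch.codePointAt(0);
    if (cp === undefined) continue;
    const hit = SCRIPT_RANGES.find((r) => cp >= r.from && cp <= r.to);
    if (hit) counts.set(hit.flores, (counts.get(hit.flores) || 0) + 1);
  }
  let best = "eng_Latn"; // assumed default when no listed script is found
  let bestCount = 0;
  counts.forEach((count, flores) => {
    if (count > bestCount) {
      best = flores;
      bestCount = count;
    }
  });
  return best;
}
```

Digits, punctuation and whitespace fall outside every listed range, so they never influence the vote.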
3. Entity preservation
Before each sentence is translated, the tool masks the following classes of span with sentinels («NUM_001», «URL_001», etc.) so they survive the model unchanged:
- Numbers and dates (digits with optional thousands/decimal separators, dates in YYYY-MM-DD or DD/MM/YYYY) — preserved verbatim, never transliterated.
- URLs and email addresses — copied through unchanged.
- Sri Lankan institution and place names from a curated dictionary of 53+ entries: universities, ministries, regulators, banks, the 17 major cities, the country's constitutional name, and so on. The source-language form is masked, the sentence is translated, and the target-language form is restored. This pins canonical names like University of Jaffna → யாழ்ப்பாணப் பல்கலைக்கழகம் instead of letting NLLB pick a different rendering each run.
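The mask-and-restore round trip can be sketched as follows. Everything here is an assumption made for illustration: the function names, the `ENT` sentinel kind, the regular expressions, and the one-entry dictionary (which reuses the University of Jaffna example from the list above) are not taken from the tool's source.

```typescript
type MaskedSentence = { text: string; slots: Map<string, string> };

// Hypothetical one-entry excerpt of an English-to-Tamil override dictionary.
const OVERRIDES_EN_TA: Record<string, string> = {
  "University of Jaffna": "யாழ்ப்பாணப் பல்கலைக்கழகம்",
};

// Existing sentinels | URLs and emails | ISO dates, DD/MM/YYYY dates, plain numbers.
const SPAN =
  /(«[A-Z]+_\d+»)|(https?:\/\/\S+|[\w.+-]+@[\w-]+\.[\w.-]+)|\d{4}-\d{2}-\d{2}|\d{2}\/\d{2}\/\d{4}|\d+(?:[.,]\d+)*/g;

function maskSentence(sentence: string, overrides: Record<string, string>): MaskedSentence {
  const slots = new Map<string, string>();
  const counters: Record<string, number> = {};
  // Allocate the next sentinel of a given kind and remember what it restores to.
  const reserve = (kind: string, restored: string): string => {
    counters[kind] = (counters[kind] || 0) + 1;
    const n = String(counters[kind]);
    const key = "«" + kind + "_" + "000".slice(n.length) + n + "»";
    slots.set(key, restored);
    return key;
  };
  let text = sentence;
  // Named entities: mask the source form, store the target-language form.
  for (const source of Object.keys(overrides)) {
    if (text.indexOf(source) === -1) continue;
    text = text.split(source).join(reserve("ENT", overrides[source]));
  }
  // URLs, emails, dates and numbers are stored verbatim; sentinels pass through untouched.
  text = text.replace(SPAN, (match: string, sentinel?: string, url?: string) =>
    sentinel ? match : url ? reserve("URL", match) : reserve("NUM", match)
  );
  return { text, slots };
}

// After translation, swap every sentinel back for its stored value.
function restoreSentinels(text: string, slots: Map<string, string>): string {
  let out = text;
  slots.forEach((value, key) => {
    out = out.split(key).join(value);
  });
  return out;
}
```

Matching existing sentinels first keeps the number pattern from re-masking the digits inside a sentinel such as «ENT_001».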
4. Translation (beam search)
Each masked sentence is fed to the translation pipeline with the resolved source/target FLORES-200 codes (sin_Sinh, tam_Taml, eng_Latn, …). Three modes select different beam-search settings:
- Fast: greedy decode (num_beams = 1), lowest latency, slightly rougher draft.
- Standard: 4-beam search, length_penalty = 1.0. Balanced and the page default.
- Quality: 6-beam search, length_penalty = 1.2. Slowest and most polished; useful for tricky idiomatic phrasing.
Sampling is off in every mode, so the model's output is fully deterministic for a given (input, mode, language pair). The per-sentence max_new_tokens budget is 512, which comfortably fits sentences up to about 300 words.
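One way to express the three modes is a lookup table that is merged with the shared settings. The sketch below is illustrative only: `MODE_SETTINGS` and `generationOptions` are hypothetical names, and the `length_penalty` of 1.0 for Fast is an assumption, since the text does not state one and the value has no effect on greedy decoding. The option names follow the Hugging Face generation parameters; whether a given transformers.js release honours each of them is worth checking against its documentation.

```typescript
type Mode = "fast" | "standard" | "quality";

interface GenerationOptions {
  src_lang: string;
  tgt_lang: string;
  num_beams: number;
  length_penalty: number;
  do_sample: boolean;
  max_new_tokens: number;
}

// Beam settings per mode, taken from the list above (Fast's penalty is assumed).
const MODE_SETTINGS: Record<Mode, { num_beams: number; length_penalty: number }> = {
  fast: { num_beams: 1, length_penalty: 1.0 },
  standard: { num_beams: 4, length_penalty: 1.0 },
  quality: { num_beams: 6, length_penalty: 1.2 },
};

// Build the options for one sentence: sampling off, 512-token budget.
function generationOptions(mode: Mode, srcLang: string, tgtLang: string): GenerationOptions {
  const beams = MODE_SETTINGS[mode];
  return {
    src_lang: srcLang,
    tgt_lang: tgtLang,
    num_beams: beams.num_beams,
    length_penalty: beams.length_penalty,
    do_sample: false,
    max_new_tokens: 512,
  };
}
```

The result of `generationOptions("standard", "eng_Latn", "sin_Sinh")` could then be passed as the second argument when calling the translation pipeline on a masked sentence.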
5. Post-processing and quality band
After each sentence is translated, the sentinels are restored from the mapping table (numbers verbatim; named entities via the target-language entry in the override dictionary). The translated sentences are reassembled using the original delimiters so user line breaks are preserved. Each translation displays a quality badge — Good / Fair / Limited — derived from the chrF++ score of the resolved direction on the FLORES-200 devtest, sourced directly from Meta's NLLB-200 paper (Table 14, distilled-600M row). For directions outside the cited baseline table, the badge falls back to the lower of (source ↔ English) and (English ↔ target), labelled “via English pivot”.
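The badge logic, including the English-pivot fallback, might look like the sketch below. The thresholds (50 and 35) and any scores passed in are placeholders chosen for illustration, not values from the NLLB-200 paper or from this page, and returning "Limited" when a pivot score is missing is an assumption.

```typescript
type BadgeLabel = "Good" | "Fair" | "Limited";
type QualityBadge = { label: BadgeLabel; viaPivot: boolean };

// Placeholder thresholds; the page's real cut-offs are not stated.
function bandFor(chrf: number): BadgeLabel {
  if (chrf >= 50) return "Good";
  if (chrf >= 35) return "Fair";
  return "Limited";
}

// `scores` maps "src->tgt" to a chrF++ value for directions in the baseline table.
function qualityBadge(src: string, tgt: string, scores: Map<string, number>): QualityBadge {
  const direct = scores.get(src + "->" + tgt);
  if (direct !== undefined) return { label: bandFor(direct), viaPivot: false };
  // No direct baseline: take the weaker of the two English legs.
  const toEnglish = scores.get(src + "->eng_Latn");
  const fromEnglish = scores.get("eng_Latn->" + tgt);
  if (toEnglish === undefined || fromEnglish === undefined) {
    return { label: "Limited", viaPivot: true }; // assumed behaviour for missing data
  }
  return { label: bandFor(Math.min(toEnglish, fromEnglish)), viaPivot: true };
}
```

Taking the minimum of the two legs is the conservative reading of "the lower of (source ↔ English) and (English ↔ target)": a pivoted translation is rated no better than its weaker half.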
Sources & references
- Hugging Face — Xenova/nllb-200-distilled-600M (ONNX model card)
- Hugging Face — facebook/nllb-200-distilled-600M (original Meta AI model card and licence)
- Costa-jussà et al. (2022) — No Language Left Behind: Scaling Human-Centered Machine Translation (arXiv)
- FLORES-200 — language-code list and dev/devtest splits used for the chrF++ baselines
- Hugging Face Transformers.js (v3) — in-browser ML runtime
- Unicode Character Database — Scripts.txt (authoritative codepoint-to-script mapping)
Model card, paper, library version, FLORES-200 score table, and the Sri Lankan institution override dictionary were last cross-checked on 2026-05-12. The Xenova model and FLORES baselines are reviewed quarterly; override dictionary entries are updated whenever a named institution publishes a name change.
NLLB-200 model weights are licensed CC-BY-NC 4.0 (non-commercial). NLLB-200 supports 200 languages upstream; this build surfaces a curated subset of 63 languages biased toward Sri Lankan use cases.