induwara.lk

English to Sinhala Translator — Tamil & 63+ Languages, In Your Browser

Translate English ↔ Sinhala, English ↔ Tamil, and 50+ other language pairs without leaving the page. Meta's NLLB-200 runs entirely in your browser via transformers.js, with no signup, no upload, and no daily character limit. Sources are cited below.

By Induwara Ashinsana · Updated May 12, 2026
Text stays on your device
Everything runs in your browser. No upload, no logging.
Substitutes 53+ Sri Lankan institution and place names from a curated dictionary; numbers, dates, URLs, and emails are always preserved verbatim.
Server-side translation — no model downloads, no signup. Sentence-by-sentence with override dictionary post-processing.

What this does

Translates between 63+ languages with Meta's NLLB-200 model running entirely in your browser. Auto-detects the source language from Unicode script, splits long input into sentences and translates them piece by piece, and pins Sri Lankan institution and place names from a curated dictionary so “University of Jaffna” survives a round-trip into Sinhala or Tamil. NLLB-200 supports 200 languages upstream; we surface a curated subset on this build.

Methodology: Unicode-script source detection + server-side translation via the free public Google Translate endpoint, with the 53-entry Sri Lankan override dictionary applied client-side. Methodology details cited under “Sources” below.

How it works

The translator wraps Meta's open-weights NLLB-200 distilled-600M model — the “No Language Left Behind” project from Meta AI Research, designed to push machine-translation quality up for the long tail of under-served languages including Sinhala and Tamil. The model weights are quantized to int8 and converted to ONNX by the Xenova port, so they run in your browser through @huggingface/transformers on top of ONNX Runtime Web. WebGPU is used automatically when the browser advertises it; WebAssembly SIMD is the universal fallback.

1. Pre-processing

Input is Unicode-normalised to NFC, zero-width joiners and control characters are stripped (whitespace and line breaks are preserved), and the text is split into sentences using a multilingual terminator regex (English . ! ?, the Sanskrit/Devanagari danda ।, and line breaks). Common English abbreviations (Mr., Dr., U.S., etc.) are guarded so they don't trigger spurious splits.
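The pre-processing step can be sketched roughly as follows. This is an illustration in Python, not the page's own browser code: the abbreviation list, the private-use guard character, and the function names are assumptions, and for brevity the sketch drops the delimiters instead of keeping them for reassembly.

```python
import re
import unicodedata

# Hypothetical abbreviation list; the text names only "Mr., Dr., U.S., etc."
ABBREVIATIONS = ("Mr.", "Dr.", "U.S.")
# Private-use character that temporarily stands in for a guarded full stop.
GUARD = "\uE000"
# Split after . ! ? or the danda (U+0964) followed by spaces, or at line breaks.
BOUNDARY = re.compile(r"(?<=[.!?\u0964])[ \t]+|\n+")


def clean(text: str) -> str:
    """Normalise to NFC and drop control/format characters, keeping whitespace."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def split_sentences(text: str) -> list[str]:
    """Split on sentence terminators while protecting known abbreviations."""
    guarded = text
    for abbr in ABBREVIATIONS:
        guarded = guarded.replace(abbr, abbr.replace(".", GUARD))
    parts = BOUNDARY.split(guarded)
    return [p.replace(GUARD, ".") for p in parts if p.strip()]


def preprocess(text: str) -> list[str]:
    return split_sentences(clean(text))
```

Calling preprocess on a paragraph returns the list of sentences that would each be masked and translated in the later steps.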

2. Source-language detection

When you pick “Auto-detect”, the tool counts the script of every Unicode codepoint in the input and picks the majority script — Sinhala (U+0D80–U+0DFF), Tamil (U+0B80–U+0BFF), Latin, Devanagari, Arabic, Han, etc. Each script maps to a default FLORES-200 source code (Latin → English, Devanagari → Hindi, and so on). Sinhala and Tamil are settled with high confidence from a single codepoint because no other widely-written language uses their script. For shared-script input the default is English; you can override the source from the dropdown at any time.

3. Entity preservation

Before each sentence is translated, the tool masks four classes of span (numbers and dates, URLs, email addresses, and dictionary names) with sentinels such as «NUM_001» and «URL_001» so they survive the model unchanged:

  • Numbers and dates (digits with optional thousands/decimal separators, dates in YYYY-MM-DD or DD/MM/YYYY) — preserved verbatim, never transliterated.
  • URLs and email addresses — copied through unchanged.
  • Sri Lankan institution and place names from a curated dictionary of 53+ entries: universities, ministries, regulators, banks, the 17 major cities, the country's constitutional name, etc. The source-language form is masked, the sentence is translated, and the target-language form is restored. This pins canonical names like University of Jaffna → யாழ்ப்பாணப் பல்கலைக்கழகம் instead of letting NLLB pick a different rendering each run.
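The masking and restoring round trip could look something like the sketch below. It is an assumption-laden illustration: the regexes, the one-entry dictionary, the EMAIL sentinel name, and the function names are hypothetical, and only English source forms are masked here. A single combined pattern is used so that the digits inside an already inserted sentinel are never matched as numbers.

```python
import re

# Hypothetical one-entry override dictionary: English form -> {target code: form}.
OVERRIDES = {
    "University of Jaffna": {"tam_Taml": "யாழ்ப்பாணப் பல்கலைக்கழகம்"},
}

# One alternation, tried left to right at each position.
PATTERN = re.compile(
    r"(?P<URL>https?://\S+)"
    r"|(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<NE>" + "|".join(map(re.escape, sorted(OVERRIDES, key=len, reverse=True))) + r")"
    r"|(?P<NUM>\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4}|\d+(?:[.,]\d+)*)"
)


def mask(sentence: str) -> tuple[str, dict[str, tuple[str, str]]]:
    """Replace protected spans with sentinels and return the mapping table."""
    table: dict[str, tuple[str, str]] = {}
    counters: dict[str, int] = {}

    def repl(match: re.Match) -> str:
        kind = match.lastgroup  # name of the alternative that matched
        counters[kind] = counters.get(kind, 0) + 1
        key = f"«{kind}_{counters[kind]:03d}»"
        table[key] = (kind, match.group())
        return key

    return PATTERN.sub(repl, sentence), table


def restore(translated: str, table: dict[str, tuple[str, str]], tgt: str) -> str:
    """Put spans back: names via the target-language entry, the rest verbatim."""
    for key, (kind, original) in table.items():
        value = OVERRIDES[original].get(tgt, original) if kind == "NE" else original
        translated = translated.replace(key, value)
    return translated
```

In use, mask runs before the model sees the sentence and restore runs on the model's output, with tgt set to the resolved target code such as tam_Taml.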

4. Translation (beam search)

Each masked sentence is fed to the translation pipeline with the resolved source/target FLORES-200 codes (sin_Sinh, tam_Taml, eng_Latn, …). Three modes select different beam-search settings:

  • Fast — greedy decode (num_beams = 1), lowest latency, slightly rougher draft.
  • Standard — 4-beam search, length_penalty = 1.0. Balanced and the page default.
  • Quality — 6-beam search, length_penalty = 1.2. Slowest and most polished — useful for tricky idiomatic phrasing.

Sampling is off in every mode, so the model's output is fully deterministic for a given (input, mode, language pair). The per-sentence max_new_tokens budget is 512, which comfortably fits sentences up to about 300 words.
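As a rough illustration, the three modes can be written as a small settings table. The values come from the list above; the dictionary layout and function name are assumptions, and the key names follow the generation parameters used by Hugging Face libraries (num_beams, length_penalty, do_sample, max_new_tokens). No length_penalty is given for Fast mode, so none is set.

```python
# Mode-specific beam-search settings, as described in the list above.
MODES = {
    "fast": {"num_beams": 1},
    "standard": {"num_beams": 4, "length_penalty": 1.0},
    "quality": {"num_beams": 6, "length_penalty": 1.2},
}


def generation_kwargs(mode: str = "standard") -> dict:
    """Merge the per-mode settings with the options shared by every mode."""
    return {
        **MODES[mode],
        "do_sample": False,     # sampling is off, so output is deterministic
        "max_new_tokens": 512,  # per-sentence token budget
    }
```

The resulting dictionary would be passed to the translation call for each masked sentence along with the source and target FLORES-200 codes.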

5. Post-processing and quality band

After each sentence is translated, the sentinels are restored from the mapping table (numbers verbatim; named entities via the target-language entry in the override dictionary). The translated sentences are reassembled using the original delimiters so user line breaks are preserved. Each translation displays a quality badge — Good / Fair / Limited — derived from the chrF++ score of the resolved direction on the FLORES-200 devtest, sourced directly from Meta's NLLB-200 paper (Table 14, distilled-600M row). For directions outside the cited baseline table, the badge falls back to the lower of (source ↔ English) and (English ↔ target), labelled “via English pivot”.
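The badge logic, including the English-pivot fallback, might be sketched like this. The three scores are the ones shown in the worked examples on this page; the band cut-offs are hypothetical, since the page does not state where Good, Fair, and Limited change over, and the function names are illustrative.

```python
# chrF++ scores taken from the worked examples on this page.
SCORES = {
    ("eng_Latn", "sin_Sinh"): 44.7,
    ("tam_Taml", "eng_Latn"): 51.8,
    ("sin_Sinh", "eng_Latn"): 47.2,
}

# Hypothetical cut-offs; the real thresholds are not stated in the text.
GOOD_MIN = 40.0
FAIR_MIN = 30.0
PIVOT = "eng_Latn"


def band(score: float) -> str:
    """Map a chrF++ score to a quality band using the assumed cut-offs."""
    if score >= GOOD_MIN:
        return "Good"
    if score >= FAIR_MIN:
        return "Fair"
    return "Limited"


def quality_badge(src: str, tgt: str, scores: dict) -> tuple[str, bool]:
    """Return (band, via_english_pivot) for a translation direction."""
    direct = scores.get((src, tgt))
    if direct is not None:
        return band(direct), False
    # No direct baseline: take the weaker of the two legs through English.
    weakest = min(scores[(src, PIVOT)], scores[(PIVOT, tgt)])
    return band(weakest), True
```

For Tamil → Sinhala, which has no direct entry in this small table, the sketch takes the lower of 51.8 and 44.7 and flags the result as "via English pivot".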

Worked examples

English → Sinhala · UGC notice

Source

The University Grants Commission announced that the 2026 academic year will begin on 1 September.

Translation

විශ්වවිද්‍යාල ප්‍රතිපාදන කොමිෂන් සභාව නිවේදනය කළේ 2026 අධ්‍යයන වර්ෂය සැප්තැම්බර් 1 වැනි දින ආරම්භ වන බවයි.

  1. Detect: 100% Latin codepoints → eng_Latn
  2. Mask: «NE_001» = University Grants Commission (override dict)
  3. Mask: «NUM_001» = 2026 ; «NUM_002» = 1
  4. Translate via NLLB-200, src=eng_Latn, tgt=sin_Sinh, Standard mode
  5. Restore: «NE_001» → විශ්වවිද්‍යාල ප්‍රතිපාදන කොමිෂන් සභාව
  6. Restore: «NUM_001» → 2026 ; «NUM_002» → 1
  7. chrF++ on FLORES-200: 44.7 → Good quality band

Tamil → English · Jaffna admissions

Source

யாழ்ப்பாணம் பல்கலைக்கழகம் 2026 ஆம் ஆண்டுக்கான புதிய மாணவர் சேர்க்கையை ஆகஸ்ட் 15 அன்று தொடங்கும்.

Translation

The University of Jaffna will begin the new student admission for 2026 on 15 August.

  1. Detect: all codepoints in U+0B80–U+0BFF → tam_Taml
  2. Mask: «NE_001» = யாழ்ப்பாணம் பல்கலைக்கழகம் (override dict id=uoj)
  3. Mask: «NUM_001» = 2026 ; «NUM_002» = 15
  4. Translate via NLLB-200, src=tam_Taml, tgt=eng_Latn, Standard mode
  5. Restore: «NE_001» → University of Jaffna
  6. Restore: «NUM_001» → 2026 ; «NUM_002» → 15
  7. chrF++ on FLORES-200: 51.8 → Good quality band

Sinhala → English · idiomatic phrasing

Source

අද උදේ ආගිය වර්ෂාව නිසා කොළඹ මාර්ග ජලයෙන් යටවී ඇත.

Translation

Due to the heavy rain this morning, the roads in Colombo are flooded.

  1. Detect: 100% U+0D80–U+0DFF (Sinhala) codepoints → sin_Sinh
  2. Mask: «NE_001» = කොළඹ (override dict id=colombo)
  3. Translate via NLLB-200 in Quality mode (num_beams=6)
  4. Beam search picks idiomatic 'are flooded' over literal 'submerged in water'
  5. Restore: «NE_001» → Colombo
  6. chrF++ on FLORES-200: 47.2 → Good quality band

Sources & references

Model card, paper, library version, FLORES-200 score table, and the Sri Lankan institution override dictionary were last cross-checked on 2026-05-12. The Xenova model and FLORES baselines are reviewed quarterly; override dictionary entries are updated whenever a named institution publishes a name change.

NLLB-200 model weights are licensed CC-BY-NC 4.0 (non-commercial). NLLB-200 supports 200 languages upstream; this build surfaces a curated subset of 63 languages biased toward Sri Lankan use cases.

Spotted a wrong translation, a name not in the override dictionary, or a missing language?

Email me at [email protected] — most fixes ship within 24 hours.