Question 1

Is my audio uploaded to a server?

Accepted Answer

Yes. The file is sent once to induwara.lk's transcription proxy, which forwards the bytes to the Hugging Face Inference API hosting OpenAI Whisper. Nothing is written to disk on our end, and the upstream is configured with caching disabled, so no copy lingers after the response. There is no account, no log of audio content, and no third party other than Hugging Face touches the bytes.

Question 2

How do I transcribe audio to text for free?

Accepted Answer

Drop a file (MP3, WAV, M4A, OGG, FLAC, MP4, or WebM, up to 25.0 MB) onto the tool above, leave the language on Auto-detect, and press Transcribe audio. The server forwards your audio to Whisper large-v3 on the Hugging Face Inference API; a one-minute clip typically returns in 15–30 seconds.

Question 3

Why is this server-side instead of in-browser?

Accepted Answer

Whisper's quantised weights are between 75 MB (tiny) and ~1.5 GB (large-v3). Downloading even the smallest checkpoint into every Sri Lankan visitor's browser eats a chunk of their mobile data plan, and the larger checkpoints — the ones that actually transcribe Sinhala and Tamil well — are not realistic for in-browser delivery at all. Server-side inference via the Hugging Face Inference API gives users the highest-quality checkpoint with no download.

Question 4

Is there a free Sinhala or Tamil audio-to-text tool?

Accepted Answer

Yes — this one. Whisper covers 99 languages and the multilingual checkpoints include Sinhala (si) and Tamil (ta). Quality on Sinhala is significantly better on the large-v3 model than on medium, per the per-language WER table in the Whisper paper. Set the source language explicitly (instead of Auto-detect) on short clips for a small accuracy bump.

Question 5

How do I generate SRT subtitles from an MP3 file?

Accepted Answer

Upload the MP3, transcribe it, then press the .srt download button. The tool builds SRT cues from Whisper's chunk timestamps — index, HH:MM:SS,mmm start --> HH:MM:SS,mmm end, cue text, blank line. Drop the .srt into Premiere, DaVinci Resolve, Final Cut, or YouTube Studio and your subtitles are wired up.
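
The cue layout described above can be sketched in Python. This is a minimal illustration, not the tool's actual code. It assumes each chunk is a dict with a `timestamp` pair of (start, end) seconds and a `text` string; the function names `format_srt_time` and `chunks_to_srt` are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list[dict]) -> str:
    """Build SRT text: index, time range, cue text, then a blank line between cues."""
    cues = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        cues.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # Each cue already ends with a newline, so joining on "\n" yields the blank separator line.
    return "\n".join(cues)
```

Calling `chunks_to_srt` on a list of chunks returns a string that can be written straight to a `.srt` file with UTF-8 encoding.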

Question 6

What is the difference between SRT and VTT?

Accepted Answer

SRT (SubRip) is the older format that video editors expect: numbered cues and a comma before the milliseconds (HH:MM:SS,mmm). VTT (WebVTT) is the W3C standard that the HTML <track> element loads: the same timestamps, but with a dot before the milliseconds (HH:MM:SS.mmm) and a WEBVTT header line. This tool emits both from the same chunk data, so the cues carry identical timings in either format.
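
To make the difference concrete, here is a small Python sketch that renders one list of chunks in either format. It is an illustration under assumptions, not the tool's code: the chunk shape (a `timestamp` pair in seconds plus `text`) and the names `format_time` and `render_cues` are hypothetical.

```python
def format_time(seconds: float, ms_separator: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm, where <sep> is ',' for SRT and '.' for VTT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_separator}{ms:03d}"


def render_cues(chunks: list[dict], fmt: str) -> str:
    """Render the same chunks as 'srt' or 'vtt' subtitle text."""
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    sep = "," if fmt == "srt" else "."
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        timing = f"{format_time(start, sep)} --> {format_time(end, sep)}"
        text = chunk["text"].strip()
        # SRT needs the numeric index; WebVTT cue identifiers are optional and omitted here.
        lines = [str(index), timing, text] if fmt == "srt" else [timing, text]
        blocks.append("\n".join(lines))
    header = ["WEBVTT"] if fmt == "vtt" else []
    return "\n\n".join(header + blocks) + "\n"
```

Because both outputs come from the same `chunks` list, only the separator, the header, and the cue numbering differ.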

Question 7

What does Translate → English do?

Accepted Answer

Whisper has a built-in translate task that decodes any supported source language into English text. It is one-way only — there is no Sinhala→French in the model. Useful for getting English captions on a Sinhala or Tamil interview without a separate translation step. Leave the task on Transcribe if you want the words in their original language.

Question 8

What file types and sizes are supported?

Accepted Answer

Anything ffmpeg can decode on the Hugging Face backend: MP3, WAV, M4A (AAC), OGG/OGA, FLAC, AAC, plus the video containers MP4, WebM, and MOV. The hard cap is 25.0 MB per file, roughly 26 minutes of MP3 at 128 kbps. Longer recordings should be split with Audacity or ffmpeg before uploading.
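
A pre-upload check and the size-to-duration arithmetic can be sketched as follows. This is a hedged example, not the tool's validation logic: it assumes the cap means 25 × 1000 × 1000 bytes (the tool may count in 1024-based units), it judges file type by extension only, and the names `check_upload` and `max_minutes` are hypothetical.

```python
from pathlib import Path

# Assumption: "25.0 MB" is read as decimal megabytes.
MAX_BYTES = 25 * 1000 * 1000
ALLOWED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".ogg", ".oga", ".flac", ".aac", ".mp4", ".webm", ".mov",
}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds the 25 MB cap")
    return problems


def max_minutes(bitrate_kbps: float, cap_bytes: int = MAX_BYTES) -> float:
    """Longest constant-bitrate recording, in minutes, that fits under the cap."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return cap_bytes / bytes_per_second / 60
```

For example, `max_minutes(128)` is about 26. Variable-bitrate files and container overhead will shift the real figure slightly.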

Question 9

Can I transcribe a one-hour lecture?

Accepted Answer

Not in one go. The per-file cap is 25.0 MB, which is roughly 26 minutes of MP3 at 128 kbps. For a one-hour lecture, split it into three ~20-minute chunks with Audacity or ffmpeg, transcribe each, and concatenate the SRT files, shifting the timestamps of the second and third parts by the combined duration of the parts before them and renumbering the cues. A dedicated long-form transcription mode is on the roadmap.
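
The concatenation step is where mistakes tend to creep in, so here is a minimal Python sketch of it. It assumes well-formed SRT input with `\n` line endings and that you know each part's real audio duration in seconds (not just the end time of its last cue). The names `shift_time` and `merge_srt` are hypothetical.

```python
import re

TIME_RE = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def shift_time(match: re.Match, offset_ms: int) -> str:
    """Add offset_ms to one HH:MM:SS,mmm timestamp and re-render it."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def merge_srt(parts: list[str], durations_s: list[float]) -> str:
    """Concatenate SRT texts; each part is shifted by the total length of the parts before it."""
    merged, offset_ms, index = [], 0, 1
    for text, duration in zip(parts, durations_s):
        for block in text.strip().split("\n\n"):
            lines = block.split("\n")
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip anything that is not a well-formed cue
            timing = TIME_RE.sub(lambda m: shift_time(m, offset_ms), lines[1])
            merged.append("\n".join([str(index), timing, *lines[2:]]))
            index += 1  # renumber cues continuously across parts
        offset_ms += round(duration * 1000)
    return "\n\n".join(merged) + "\n"
```

With three 20-minute parts, `merge_srt([p1, p2, p3], [1200, 1200, 1200])` shifts the second part by 20 minutes and the third by 40.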

Question 10

When were these models and references last verified?

Accepted Answer

The Whisper paper, OpenAI model card, Hugging Face Inference API reference, WebVTT specification, and SRT format notes were all cross-checked on 2026-05-12. The page is reviewed whenever Hugging Face rotates the Whisper variants in its hosted inference catalogue or OpenAI ships a new Whisper checkpoint.

Model	Params	English WER	Notes
Whisper large-v3 (best quality)	1550M	2.7%	OpenAI's flagship Whisper checkpoint. Best Sinhala and Tamil quality. Default choice; ~2× slower than medium on the HF servers, but the only tier that is reliably usable for Sinhala, Tamil, and other low-resource languages.
Whisper medium (faster)	769M	4.1%	About twice as fast as large-v3 on the HF infrastructure. Strong on English and most European languages; noticeably weaker on Sinhala, Tamil, and other low-resource languages. Use the larger model when accuracy matters.
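
The English WER column is word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal Python sketch of that calculation follows; it splits on whitespace only, whereas published evaluations usually normalise case, punctuation, and numerals first. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from the hypothesis
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 2.7% therefore means roughly 27 word-level errors per 1,000 reference words.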

AI Audio Transcriber — Free Whisper, 99 Languages
