induwara.lk

AI Audio Transcriber — Free Whisper, 99 Languages

Drop an audio or video file and get a timestamped transcript — including Sinhala and Tamil — in seconds. Powered by OpenAI Whisper on the Hugging Face Inference API. No signup, no API key, no audio stored. Export plain text, SRT subtitles, or WebVTT captions.

By Induwara Ashinsana · Updated May 12, 2026

Transcription runs on our server via the Hugging Face Inference API. The audio is forwarded once and never stored.

What this does

Reads any audio or video file up to 25.0 MB and writes a timestamped transcript. Pick a Whisper checkpoint, the source language (or leave on Auto-detect), choose Transcribe or Translate → English, then press Transcribe audio. Export plain text, .srt subtitles, or .vtt captions.

On Auto-detect, Whisper identifies the language from the first 30 seconds of audio.

Server transcription via Whisper large-v3 (best quality). Typical clip returns in 20–90 s on the HF Inference API.

Sources cited on the page below. Last cross-checked 2026-05-12. Whisper paper: Radford et al., arXiv:2212.04356.

How it works

The tool runs OpenAI Whisper on the Hugging Face Inference API. The default checkpoint is openai/whisper-large-v3 (1550 M parameters, the lowest published WER of the public Whisper checkpoints); a faster whisper-medium (769 M) is offered for English-only dictation. We keep transcription on the server because every published quantised Whisper is at least 75 MB — too much to ship into a Sri Lankan mobile browser — and the larger checkpoints, the ones that actually read Sinhala and Tamil well, are out of reach for in-browser delivery entirely.

A single transcription goes through five deterministic steps:

  1. Validate. The file must be an audio or video format ffmpeg can decode (MP3, WAV, M4A/AAC, OGG, FLAC, MP4, WebM, MOV) and at most 25.0 MB.
  2. Forward. The bytes are POSTed once to /api/tools/transcribe-audio, which streams them to api-inference.huggingface.co with the chosen language and task. Nothing is persisted on our server — the bytes live only in request memory and are released when the response returns.
  3. Decode + resample. The HF backend uses ffmpeg to decode the container and resample to 16,000 Hz mono — the sample rate Whisper's feature extractor expects.
  4. Chunk + decode. Whisper natively processes 30-second windows. The HF pipeline is called with return_timestamps: true and chunk_length_s: 30 so cues stay aligned across chunk boundaries on files longer than 30 seconds. Decoding is greedy (no sampling) with the source language token forced if you picked one, or Whisper's built-in language-detection token if you left the picker on Auto-detect. The task token selects transcribe or translate (any-language → English).
  5. Build cues. The API returns { text, chunks: [{ text, timestamp:[start,end] }, …] }. SRT cues are built with HH:MM:SS,mmm timestamps (the SubRip convention, with a comma before the milliseconds). WebVTT cues use the same timestamps with a dot instead of a comma, per the W3C WebVTT specification. A second, independent formatter (formatTimestampByMath) cross-checks the primary formatter with plain arithmetic on every cue, and a build-time assert flags any drift.
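Step 1 above can be sketched as a cheap pre-check that runs before any bytes are forwarded. This is a minimal sketch, not the site's code: validateUpload, MAX_BYTES and ALLOWED_EXTENSIONS are hypothetical names, the limit assumes that 25.0 MB means 25 × 1024 × 1024 bytes, and the extension check is only a stand-in for asking ffmpeg whether it can actually decode the file.

```typescript
// Hypothetical pre-check for step 1. The byte limit assumes 25.0 MB = 25 MiB.
const MAX_BYTES = 25 * 1024 * 1024;

// Extensions for the container formats listed in step 1.
const ALLOWED_EXTENSIONS: readonly string[] = [
  "mp3", "wav", "m4a", "aac", "ogg", "flac", "mp4", "webm", "mov",
];

type ValidationResult =
  | { ok: true }
  | { ok: false; reason: string };

function validateUpload(fileName: string, sizeBytes: number): ValidationResult {
  // Take the text after the last dot as the extension, case-insensitively.
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  if (ALLOWED_EXTENSIONS.indexOf(ext) === -1) {
    return { ok: false, reason: `Unsupported file type: .${ext || "(none)"}` };
  }
  if (sizeBytes <= 0) {
    return { ok: false, reason: "File is empty" };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, reason: "File exceeds the 25.0 MB limit" };
  }
  return { ok: true };
}
```

A call such as validateUpload("meeting.m4a", 12000000) passes, while a .txt file or anything one byte over the limit is rejected with a reason the page can show to the user.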

End-to-end latency depends on the HF queue and the chosen checkpoint. A 1-minute clip on whisper-large-v3 typically returns in 15–30 seconds; a 5-minute clip in 30–90 seconds. Medium is about twice as fast and is the right choice for English-only speech, where the slight accuracy hit is invisible. The server enforces a 120-second timeout per request and shows a friendly “model unavailable” banner when the upstream is queued or otherwise unreachable, so the page never shows a dead state.
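The timeout behaviour described above could look roughly like the following. It is a sketch under assumptions: withUpstreamTimeout, Outcome and the banner wording are hypothetical, and the upstream call is passed in as a function so that the example does not depend on any particular endpoint. The abort only takes effect if that function honours the AbortSignal it receives, as fetch does when the signal is passed in its options.

```typescript
// 120-second limit from the text; the helper itself is a hypothetical sketch.
const TIMEOUT_MS = 120000;

type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; banner: string };

async function withUpstreamTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = TIMEOUT_MS
): Promise<Outcome<T>> {
  const controller = new AbortController();
  // Abort the in-flight call once the time budget is used up.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const value = await call(controller.signal);
    return { ok: true, value };
  } catch {
    // Timeouts, queueing errors and network failures all map to one banner,
    // so the caller always has something to render.
    return { ok: false, banner: "Model unavailable, please try again shortly." };
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would wrap its upstream request as withUpstreamTimeout((signal) => fetch(url, { method: "POST", body, signal })), where url and body stand for whatever the server forwards.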

Two privacy notes. The audio bytes are forwarded as a single multipart/form-data POST and discarded as soon as the upstream response returns; there is no log of audio content, only a server-side timing counter (request ms, model id). And the tool never attempts speaker identification or age or ethnicity inference: Whisper is a sequence-to-sequence ASR model, not a speaker classifier.

Worked examples

UCSC supervisor meeting — English, large-v3, SRT export

A 12-minute M4A recorded on a phone, transcribed with whisper-large-v3 in English. Whisper emits two cues per minute on average at the default 30 s chunk; here are the first two for illustration.

Model: Whisper large-v3 (best quality)
Language: English

SRT preview

1
00:00:00,000 --> 00:00:30,000
Right, so let's go through the literature review you sent last week.

2
00:00:30,000 --> 00:01:00,000
I have a few suggestions on the methodology section in particular.

Notes: On the HF Inference API a 12-minute file processes in roughly 25–60 seconds end-to-end, including queueing. The SRT body is deterministic because Whisper runs with do_sample=false on the upstream: given the same audio and the same checkpoint, the timestamps and text come back identical across requests.

Sinhala radio clip — large-v3, VTT export

A 4-minute MP3 of a Sinhala-language radio show. We recommend whisper-large-v3 for Sinhala because medium's per-language WER rises significantly on low-resource languages, per the Whisper paper's Table 8.

Model: Whisper large-v3 (best quality)
Language: Sinhala

VTT preview

WEBVTT

00:00:00.000 --> 00:00:12.340
ආයුබෝවන් සියලු දෙනාටම, අද වැඩසටහනට.

00:00:12.340 --> 00:00:28.900
අපි කතා කරන්නේ දේශීය ක්‍රීඩා ගැන.

Notes: Auto-detect would also have tagged this file as Sinhala — Whisper emits a language token before the first content token and the HF endpoint surfaces it on the response. The cues use non-integer seconds to verify the millisecond formatting branch.
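As an illustration of how a preview like this one could be produced from the chunks array described under How it works, here is a minimal sketch. The names Chunk, formatVttTimestamp and buildVtt are hypothetical, and the code assumes every chunk carries both a start and an end time in seconds.

```typescript
// One chunk as described in step 5: { text, timestamp: [start, end] }.
interface Chunk {
  text: string;
  timestamp: [number, number]; // seconds; assumes both values are present
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Round to whole milliseconds first so 12.34 s cannot drift to 12.339 s.
function formatVttTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// WebVTT: a "WEBVTT" header, then cues separated by blank lines.
function buildVtt(chunks: Chunk[]): string {
  const cues = chunks.map((c) => {
    const start = formatVttTimestamp(c.timestamp[0]);
    const end = formatVttTimestamp(c.timestamp[1]);
    return `${start} --> ${end}\n${c.text.trim()}`;
  });
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}
```

Feeding it two chunks with timestamps [0, 12.34] and [12.34, 28.9] yields the same two timing lines as the preview above; the trim call is there because chunk text may arrive with leading whitespace.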

Hour boundary — verifies HH:MM:SS rollover

Edge case: a single cue that crosses the one-hour mark. The formatter must roll the minute field over cleanly without dropping the hour.

Model: Whisper large-v3 (best quality)
Language: English

SRT preview

1
00:59:59,500 --> 01:00:01,500
And that brings us to the close of part one.

Notes: Hand math: 3599.5 s → 0 h, 59 m, 59.5 s → 00:59:59,500. 3601.5 s → 1 h, 0 m, 1.5 s → 01:00:01,500. The cross-check formatter formatTimestampByMath agrees on both timestamps, so any future change to formatSrtTimestamp that breaks this case will fail the page's assert at build time.
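The cross-check idea can be sketched with two formatters that reach the same string by different routes. Only the names formatSrtTimestamp and formatTimestampByMath come from this page; the bodies below are assumptions written for illustration, and assertFormattersAgree is a hypothetical helper.

```typescript
// Assumed primary route: let Date do the field splitting.
// Note: this route wraps around for inputs of 24 hours or more.
function formatSrtTimestamp(seconds: number): string {
  const iso = new Date(Math.round(seconds * 1000)).toISOString();
  // "1970-01-01T00:59:59.500Z" -> "00:59:59.500" -> "00:59:59,500"
  return iso.slice(11, 23).replace(".", ",");
}

// Left-pad a non-negative integer with zeros to the given width.
function padZeros(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Assumed cross-check route: peel off hours, minutes and seconds by subtraction.
function formatTimestampByMath(seconds: number): string {
  let remainingMs = Math.round(seconds * 1000);
  const h = Math.floor(remainingMs / 3600000);
  remainingMs -= h * 3600000;
  const m = Math.floor(remainingMs / 60000);
  remainingMs -= m * 60000;
  const s = Math.floor(remainingMs / 1000);
  remainingMs -= s * 1000;
  return `${padZeros(h, 2)}:${padZeros(m, 2)}:${padZeros(s, 2)},${padZeros(remainingMs, 3)}`;
}

// Throws if the two routes disagree; returns the agreed string otherwise.
function assertFormattersAgree(seconds: number): string {
  const primary = formatSrtTimestamp(seconds);
  const check = formatTimestampByMath(seconds);
  if (primary !== check) {
    throw new Error(`Timestamp drift at ${seconds} s: ${primary} vs ${check}`);
  }
  return primary;
}
```

Running assertFormattersAgree on 3599.5 and 3601.5 reproduces the hand math above. Because the two routes share no field-splitting logic, a bug introduced in one of them is unlikely to be mirrored in the other.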

Model variants at a glance

Whisper large-v3 is the default, with the best Sinhala and Tamil quality of any public Whisper checkpoint. Medium is about twice as fast on the HF infrastructure and is the right pick for English-only dictation where the per-language WER gap is small. The WER numbers below are English test scores as reported in the Whisper paper.

| Model | Params | English WER | Notes |
| --- | --- | --- | --- |
| Whisper large-v3 (best quality) | 1550M | 2.7% | OpenAI's flagship Whisper checkpoint. Best Sinhala and Tamil quality. Default choice; ~2× slower than medium on the HF servers but the only tier that's reliably usable for non-English speech. |
| Whisper medium (faster) | 769M | 4.1% | About twice as fast as large-v3 on the HF infrastructure. Strong on English and most European languages; noticeably weaker on Sinhala, Tamil and other low-resource scripts, so use the larger model when accuracy matters. |

Sources & references

The Whisper paper, model cards, Hugging Face Inference API reference, and the WebVTT specification were all cross-checked on 2026-05-12. The page is reviewed quarterly and whenever Hugging Face rotates the Whisper variants in its hosted inference catalogue.
