AI Audio Transcriber — Free Whisper, 99 Languages
Drop an audio or video file and get a timestamped transcript — including Sinhala and Tamil — in seconds. Powered by OpenAI Whisper on the Hugging Face Inference API. No signup, no API key, no audio stored. Export plain text, SRT subtitles, or WebVTT captions.
How it works
The tool runs OpenAI Whisper on the Hugging Face Inference API. The default checkpoint is openai/whisper-large-v3 (1550 M parameters, the lowest published WER of the public Whisper checkpoints); a faster whisper-medium (769 M) is offered for English-only dictation. We keep transcription on the server because every published quantised Whisper is at least 75 MB — too much to ship into a Sri Lankan mobile browser — and the larger checkpoints, the ones that actually read Sinhala and Tamil well, are out of reach for in-browser delivery entirely.
A single transcription goes through five deterministic steps:
- Validate. The file must be an audio or video format ffmpeg can decode (MP3, WAV, M4A/AAC, OGG, FLAC, MP4, WebM, MOV) and at most 25.0 MB.
- Forward. The bytes are POSTed once to `/api/tools/transcribe-audio`, which streams them to `api-inference.huggingface.co` with the chosen language and task. Nothing is persisted on our server: the bytes live only in request memory and are released when the response returns.
- Decode + resample. The HF backend uses ffmpeg to decode the container and resample to 16,000 Hz mono, the sample rate Whisper's feature extractor expects.
- Chunk + decode. Whisper natively processes 30-second windows. The HF pipeline is called with `return_timestamps: true` and `chunk_length_s: 30` so cues stay aligned across chunk boundaries on files longer than 30 seconds. Decoding is greedy (no sampling), with the source language token forced if you picked one, or Whisper's built-in language-detection token if you left the picker on Auto-detect. The task token selects `transcribe` or `translate` (any language → English).
- Build cues. The API returns `{ text, chunks: [{ text, timestamp: [start, end] }, …] }`. SRT cues are built with `HH:MM:SS,mmm` timestamps (SubRip convention, comma separator). WebVTT cues use the same timestamps with a dot instead of a comma, per the W3C WebVTT specification. A second, independent formatter (`formatTimestampByMath`) cross-checks the primary formatter using pure arithmetic on every cue; a build-time assert flags any drift.
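The cue-building step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names (`format_timestamp`, `format_timestamp_by_math`, `build_srt`, `build_vtt`) are hypothetical, and the sketch assumes every chunk carries both a start and an end timestamp in seconds.

```python
from typing import Dict, List

Chunk = Dict[str, object]  # assumed shape: {"text": str, "timestamp": [start, end]}


def format_timestamp(seconds: float, separator: str = ",") -> str:
    """Primary formatter: split whole milliseconds with successive divmod calls."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def format_timestamp_by_math(seconds: float, separator: str = ",") -> str:
    """Cross-check formatter: derive each field directly with floor division and modulo."""
    total_ms = round(seconds * 1000)
    hours = total_ms // 3_600_000
    minutes = (total_ms // 60_000) % 60
    secs = (total_ms // 1000) % 60
    ms = total_ms % 1000
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def check_formatters_agree(chunks: List[Chunk]) -> None:
    """Raise AssertionError if the two formatters ever disagree on a cue boundary."""
    for chunk in chunks:
        for value in chunk["timestamp"]:
            assert format_timestamp(value) == format_timestamp_by_math(value), value


def build_srt(chunks: List[Chunk]) -> str:
    """SRT: numbered cues, comma before the milliseconds, blank line between cues."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{str(chunk['text']).strip()}\n"
        )
    return "\n".join(blocks)


def build_vtt(chunks: List[Chunk]) -> str:
    """WebVTT: 'WEBVTT' header, dot before the milliseconds, no cue numbers needed."""
    blocks = ["WEBVTT\n"]
    for chunk in chunks:
        start, end = chunk["timestamp"]
        blocks.append(
            f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n"
            f"{str(chunk['text']).strip()}\n"
        )
    return "\n".join(blocks)


example_chunks = [
    {"text": " Hello", "timestamp": [0.0, 1.5]},
    {"text": " world", "timestamp": [1.5, 3661.5]},
]
check_formatters_agree(example_chunks)
print(build_srt(example_chunks))
print(build_vtt(example_chunks))
```

Both formatters work on whole milliseconds so that rounding happens once, before any field is derived; that keeps the SRT and WebVTT outputs identical apart from the separator.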
End-to-end latency depends on the HF queue and the chosen checkpoint. A 1-minute clip on whisper-large-v3 typically returns in 15–30 seconds; a 5-minute clip in 30–90 seconds. Medium is about twice as fast and is the right choice for English-only speech, where the slight accuracy hit is rarely noticeable. The server enforces a 120-second timeout per request and shows a “model unavailable” banner when the upstream is queued or otherwise unreachable, so the page never shows a dead state.
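One way to picture the timeout-and-banner behaviour is a small wrapper around the upstream call. This is a transport-agnostic sketch under stated assumptions: `transcribe_with_fallback` and the shape of the returned dictionary are hypothetical, and a real server would more likely set the timeout on its HTTP client than on a worker thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

REQUEST_TIMEOUT_S = 120.0  # per-request limit stated in the text


def transcribe_with_fallback(
    call_upstream: Callable[[], dict],
    timeout_s: float = REQUEST_TIMEOUT_S,
) -> dict:
    """Run the upstream call; on timeout or error return a banner instead of raising."""
    started = time.monotonic()
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call_upstream)
    try:
        outcome = {"ok": True, "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        outcome = {"ok": False, "banner": "model unavailable", "reason": "timeout"}
    except Exception as exc:  # upstream queued, unreachable, or otherwise failing
        outcome = {"ok": False, "banner": "model unavailable", "reason": type(exc).__name__}
    finally:
        # Do not block the response while a timed-out call is still finishing.
        executor.shutdown(wait=False)
    outcome["elapsed_ms"] = round((time.monotonic() - started) * 1000)
    return outcome
```

The caller always receives a dictionary, so the page can render either the transcript or the banner without a separate error path.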
Two privacy notes. The audio bytes are forwarded as a single `multipart/form-data` POST and discarded as soon as the upstream response returns; there is no log of audio content, only a server-side timing counter (request ms, model id). And the tool never attempts to infer speaker identity, age, or ethnicity: Whisper is a sequence-to-sequence ASR model, not a speaker classifier.
Model variants at a glance
Whisper large-v3 is the default, with the best Sinhala and Tamil quality of any public Whisper checkpoint. Medium is about twice as fast on the HF infrastructure and is the right pick for English-only dictation, where the WER gap is small. The WER column below reports English word error rate. Note that large-v3 was released after the 2022 Whisper paper, so its figure comes from a later evaluation rather than from the paper itself.
| Model | Params | English WER | Notes |
|---|---|---|---|
| Whisper large-v3 (best quality) | 1550M | 2.7% | OpenAI's flagship Whisper checkpoint. Best Sinhala and Tamil quality. Default choice; ~2× slower than medium on the HF servers but the only tier that's reliably usable for non-English speech. |
| Whisper medium (faster) | 769M | 4.1% | About twice as fast as large-v3 on the HF infrastructure. Strong on English and most European languages; noticeably weaker on Sinhala, Tamil and other low-resource scripts — use the larger model when accuracy matters. |
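The selection rule implied by the table can be written as a short helper. This is only an illustrative sketch: `pick_checkpoint`, the `prefer_speed` flag, and the use of `None` for Auto-detect are assumptions, not the tool's actual interface. Language codes follow ISO 639-1 (`en`, `si`, `ta`).

```python
from typing import Optional

# Model ids and parameter counts as listed in the table above.
CHECKPOINTS = {
    "large-v3": {"model_id": "openai/whisper-large-v3", "params_m": 1550},
    "medium": {"model_id": "openai/whisper-medium", "params_m": 769},
}


def pick_checkpoint(language: Optional[str], prefer_speed: bool = False) -> str:
    """Return a model id: medium only for English when speed is requested, else large-v3.

    `language` is an ISO 639-1 code, or None when the picker is left on Auto-detect.
    """
    if prefer_speed and language == "en":
        return CHECKPOINTS["medium"]["model_id"]
    return CHECKPOINTS["large-v3"]["model_id"]


print(pick_checkpoint("si"))                     # Sinhala stays on the default
print(pick_checkpoint("en", prefer_speed=True))  # English dictation can use medium
```

Auto-detect falls back to large-v3 because the language is unknown at request time and may turn out to be one that medium handles poorly.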
Sources & references
- Radford et al. — Robust Speech Recognition via Large-Scale Weak Supervision (Whisper paper, 2022)
- Hugging Face — openai/whisper-large-v3 model card
- Hugging Face — openai/whisper-medium model card
- Hugging Face — Inference API reference (automatic-speech-recognition)
- W3C — WebVTT 1.0 specification (caption file format)
- SubRip — SRT file format reference
The Whisper paper, model cards, Hugging Face Inference API reference, and the WebVTT specification were all cross-checked on 2026-05-12. The page is reviewed quarterly and whenever Hugging Face rotates the Whisper variants in its hosted inference catalogue.