AI Speech-to-Text API Comparison
Compare 14 hosted transcription APIs — Whisper, Deepgram, AssemblyAI, Google, Azure, Amazon, Groq, ElevenLabs and Rev AI — by price per minute, real-time support, diarization, language coverage and published WER. Enter your monthly audio volume and rank them by cost. Every figure cites the vendor source.
How it works
Choosing a speech-to-text (STT) provider is a multi-axis decision: price per minute, whether you need real-time streaming or just batch (pre-recorded) jobs, speaker diarization, word-level timestamps, language coverage, maximum file size, and published accuracy. This page lays all of those out for the 14 models that almost every developer ends up shortlisting, drawn from 9 vendors, and ranks them by what they would actually cost at your volume.
1. The cost formula
Every provider here is priced per unit of audio — some per minute, some per hour. The tool normalises everything to a per-minute rate and your volume to minutes, then applies any standing free tier:
monthly_cost = max(0, minutes − free_tier_minutes) × usd_per_minute
Hourly-priced vendors (AssemblyAI at $0.27/hr, Groq at $0.04–$0.111/hr, Azure at $1/hr, ElevenLabs at $0.40/hr) are divided by 60 to get the per-minute rate. The data module cross-checks every figure a second way — via the hourly rate — so the two routes have to agree to the millionth of a dollar before the page will build.
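The formula and the hourly cross-check can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual data module: the function names and the `tolerance` parameter are assumptions, and the only figures used are the AssemblyAI and Azure hourly rates and the Azure free allowance quoted on this page.

```python
def per_minute_rate(usd_per_hour: float) -> float:
    """Convert an hourly list price to a per-minute rate."""
    return usd_per_hour / 60.0


def monthly_cost(minutes: float, usd_per_minute: float,
                 free_tier_minutes: float = 0.0) -> float:
    """monthly_cost = max(0, minutes - free_tier_minutes) * usd_per_minute"""
    billable_minutes = max(0.0, minutes - free_tier_minutes)
    return billable_minutes * usd_per_minute


def monthly_cost_via_hours(minutes: float, usd_per_hour: float,
                           free_tier_minutes: float = 0.0) -> float:
    """Second route: convert billable minutes to hours, then bill hourly."""
    billable_hours = max(0.0, minutes - free_tier_minutes) / 60.0
    return billable_hours * usd_per_hour


def checked_monthly_cost(minutes: float, usd_per_hour: float,
                         free_tier_minutes: float = 0.0,
                         tolerance: float = 1e-6) -> float:
    """Compute the cost both ways and fail if the routes disagree.

    The tolerance of one millionth of a dollar mirrors the check described
    above; how the real build enforces it is an assumption.
    """
    by_minute = monthly_cost(minutes, per_minute_rate(usd_per_hour),
                             free_tier_minutes)
    by_hour = monthly_cost_via_hours(minutes, usd_per_hour, free_tier_minutes)
    if abs(by_minute - by_hour) > tolerance:
        raise ValueError(f"cost routes disagree: {by_minute} vs {by_hour}")
    return by_minute


# AssemblyAI at $0.27/hr, 1,000 minutes, no free tier.
print(f"{checked_monthly_cost(1000, 0.27):.2f}")
# Azure at $1/hr with the 300-minute F0 allowance, 1,000 minutes.
print(f"{checked_monthly_cost(1000, 1.00, free_tier_minutes=300):.2f}")
```

At 1,000 minutes a month the AssemblyAI figure works out to 1,000 × $0.0045 = $4.50, and the Azure figure to 700 billable minutes × $1/60, or about $11.67. A volume below the free allowance bills as zero rather than going negative.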
2. Batch versus real-time
Batch (pre-recorded) transcription processes a finished file; real-time streaming transcribes a live audio stream as it arrives. The two modes are billed differently, and not every model offers both. Deepgram Nova-3 is $0.0043/min batch but $0.0077/min streaming; AssemblyAI is the reverse, with streaming cheaper than batch. Groq, ElevenLabs Scribe, Rev AI's machine endpoint and OpenAI's whisper-1 are batch-only. The mode toggle switches which rate the ranking uses, and a batch-only provider in real-time mode is marked unsupported rather than shown as free.
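One way to keep "unsupported" distinct from "free" is to store a missing rate as `None` instead of zero. The sketch below assumes that representation; the `Provider` class, the `rank` function and the batch-only entry with its $0.0030/min rate are hypothetical and purely illustrative. Only the Deepgram Nova-3 rates come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Provider:
    name: str
    batch_usd_per_min: Optional[float]
    streaming_usd_per_min: Optional[float]  # None means batch-only


def rate_for(provider: Provider, mode: str) -> Optional[float]:
    """Return the rate for the selected mode, or None if unsupported."""
    if mode == "batch":
        return provider.batch_usd_per_min
    if mode == "realtime":
        return provider.streaming_usd_per_min
    raise ValueError(f"unknown mode: {mode!r}")


def rank(providers: List[Provider], mode: str,
         minutes: float) -> List[Tuple[str, Optional[float]]]:
    """Sort supported providers by cost; list unsupported ones last as None."""
    supported, unsupported = [], []
    for p in providers:
        rate = rate_for(p, mode)
        if rate is None:
            unsupported.append((p.name, None))  # never reported as $0
        else:
            supported.append((p.name, minutes * rate))
    supported.sort(key=lambda row: row[1])
    return supported + unsupported


providers = [
    Provider("Deepgram Nova-3", 0.0043, 0.0077),      # rates quoted above
    Provider("example-batch-only", 0.0030, None),     # hypothetical rate
]

for name, cost in rank(providers, "realtime", minutes=1000):
    print(name, "unsupported" if cost is None else f"${cost:.2f}")
```

With a zero in place of `None`, the batch-only entry would sort to the top of the real-time ranking as the cheapest option, which is exactly the failure the unsupported marker avoids.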
3. Free tiers and tiered pricing
Two providers here have a standing monthly free allowance: Google Cloud Speech-to-Text gives 60 free minutes a month, and Azure's F0 tier gives 5 free audio hours (300 minutes). Those are subtracted before billing. Amazon Transcribe's 60-minute free tier only lasts 12 months on new accounts, so it is treated as zero here. Amazon's per-minute rate also drops at high volume; the tier-1 rate is shown.
4. Accuracy (WER) is benchmark-dependent
Word error rate (WER) is the number of word substitutions, deletions and insertions in a model's output, divided by the number of words in a human reference transcript. The WER figures here are each vendor's or Artificial Analysis's published English benchmark number, never our own measurement, and they are indicative only. Accent, recording quality, domain vocabulary and background noise move WER far more than the small gaps between leading models. Use the column to shortlist, then test your finalists on your own audio.
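For testing finalists on your own audio, WER can be computed with a word-level edit distance between a reference transcript and the model's output. The sketch below is a minimal version that only lowercases and splits on whitespace; published benchmarks typically apply heavier text normalisation (punctuation, numerals, contractions), so its numbers will not match vendor figures exactly. The function name is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The example has two errors against a six-word reference, so the result is one third. Because insertions count as errors, WER can exceed 1.0 when the model outputs many more words than the reference contains.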
5. Best-for badges
The “Cheapest”, “Lowest WER”, “Best real-time” and “Most languages” callouts are derived deterministically from your current selection and volume — cheapest is the lowest projected monthly cost among supported providers, lowest-WER is the minimum published figure, best-real-time is the lowest streaming rate among streaming-capable providers, and most-languages is the highest documented language count. Requiring a feature greys out providers that lack it without deleting them, so the comparison stays honest.
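The badge rules above reduce to four min/max selections over the current rows. The following sketch assumes a simple `Row` record with `None` for anything unsupported or unpublished; the class, the field names and the three sample rows are hypothetical and carry no real vendor data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    name: str
    monthly_cost: Optional[float]           # None = unsupported in this mode
    streaming_usd_per_min: Optional[float]  # None = no streaming offered
    wer: Optional[float]                    # None = no published figure
    languages: int


def badges(rows: List[Row]) -> Dict[str, Optional[str]]:
    """Derive each badge from the rows; ties go to the earlier row."""
    def best(candidates, key, pick=min):
        candidates = list(candidates)
        return pick(candidates, key=key).name if candidates else None

    return {
        "Cheapest": best((r for r in rows if r.monthly_cost is not None),
                         key=lambda r: r.monthly_cost),
        "Lowest WER": best((r for r in rows if r.wer is not None),
                           key=lambda r: r.wer),
        "Best real-time": best(
            (r for r in rows if r.streaming_usd_per_min is not None),
            key=lambda r: r.streaming_usd_per_min),
        "Most languages": best(rows, key=lambda r: r.languages, pick=max),
    }


# Illustrative rows only; none of these numbers are vendor figures.
example_rows = [
    Row("A", monthly_cost=4.30, streaming_usd_per_min=0.0077, wer=0.08, languages=30),
    Row("B", monthly_cost=3.00, streaming_usd_per_min=None, wer=0.06, languages=99),
    Row("C", monthly_cost=5.00, streaming_usd_per_min=0.0050, wer=None, languages=10),
]

print(badges(example_rows))
```

Because `min` and `max` return the first match on a tie, the result depends only on the rows and their order, which keeps the callouts deterministic for a given selection and volume.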
Sources & references
- OpenAI — API pricing (Whisper, gpt-4o-transcribe)
- Deepgram — pricing (Nova-3, Nova-2)
- AssemblyAI — pricing (Universal, Slam-1, Streaming)
- Google Cloud — Speech-to-Text pricing
- Microsoft Azure — AI Speech pricing
- Amazon — Transcribe pricing
- Groq — pricing (Whisper Large v3 / Turbo)
- ElevenLabs — API pricing (Scribe)
- Rev AI — pricing
- Artificial Analysis — Speech-to-Text leaderboard (WER benchmarks)
Every rate and capability flag was last cross-checked against these sources on 2026-06-20. Speech-to-text pricing and models change frequently; this page is reviewed manually, and again whenever a provider announces a substantive pricing or model update. WER figures are vendor- or Artificial-Analysis-published and benchmark-dependent.