AI Speech-to-Text API Comparison

Compare 14 hosted transcription APIs — Whisper, Deepgram, AssemblyAI, Google, Azure, Amazon, Groq, ElevenLabs and Rev AI — by price per minute, real-time support, diarization, language coverage and published WER. Enter your monthly audio volume and rank them by cost. Every figure cites the vendor source.

By Induwara Ashinsana · Updated Jun 20, 2026
Compare speech-to-text APIs: 14 models · 9 vendors

Sample configuration: 4 providers selected, 2,400 audio minutes per month.

Cheapest: Whisper Large v3 Turbo, $1.60/mo
Lowest WER: Universal, 6.6%
Best real-time: Universal, $0.0025/min
Most languages: Whisper Large v3 Turbo, 99 languages

Projected monthly cost (batch, cheapest first)

#1 Whisper Large v3 Turbo (Groq): $1.60/mo at $0.0007/min · 99 langs · Cheapest
#2 Nova-3 (Deepgram): $10.32/mo at $0.0043/min
#3 Universal (AssemblyAI): $10.80/mo at $0.0045/min · WER 6.6% · 99 langs · Best real-time
#4 Whisper (OpenAI): $14.40/mo at $0.0060/min · 99 langs

Feature matrix

Provider | Batch $/min | Real-time | Diarization | WER
Whisper Large v3 Turbo (Groq) | $0.0007 | — | No | 8.4%
Nova-3 (Deepgram) | $0.0043 | $0.0077 | Yes | 6.8%
Universal (AssemblyAI) | $0.0045 | $0.0025 | Yes | 6.6%
Whisper (OpenAI) | $0.0060 | — | No | 8.1%

Real-time column shows the streaming per-minute rate where offered; “—” means the model is batch-only. Languages and max-input are vendor-documented and indicative. WER is benchmark-dependent — it measures English-language accuracy on the cited benchmark and is not a guarantee for your audio.

Per-provider notes

  • Groq Whisper Large v3 Turbo: Cheapest hosted option in this table, at pennies per hour. No diarization; English-leaning accuracy.
  • Deepgram Nova-3: Fast, cheap batch with diarization and word timestamps built in. Strong real-time option too.
  • AssemblyAI Universal: Rich audio-intelligence add-ons (sentiment, topics, PII redaction). Streaming is cheaper per hour than batch.
  • OpenAI Whisper: The open Whisper model, hosted. Open weights mean you can self-host. No built-in diarization.

Static comparison: no audio uploaded, no API key, no logging.

Picking a provider here sends nothing to any vendor. Rates are dated constants reviewed manually; confirm the current price on the linked pricing page before you commit. LKR figures use a single indicative rate of Rs 300 per USD — not a live exchange rate.

How it works

Choosing a speech-to-text (STT) provider is a multi-axis decision: price per minute, whether you need real-time streaming or just batch (pre-recorded) jobs, speaker diarization, word-level timestamps, language coverage, maximum file size, and published accuracy. This page lays all of those out for 14 commonly shortlisted models from 9 vendors, and ranks them by what they would actually cost at your volume.

1. The cost formula

Every provider here is priced per unit of audio — some per minute, some per hour. The tool normalises everything to a per-minute rate and your volume to minutes, then applies any standing free tier:

monthly_cost = max(0, minutes − free_tier_minutes) × usd_per_minute

Hourly-priced vendors (AssemblyAI at $0.27/hr, Groq at $0.04–$0.111/hr, Azure at $1/hr, ElevenLabs at $0.40/hr) are divided by 60 to get the per-minute rate. The data module cross-checks every figure a second way — via the hourly rate — so the two routes have to agree to the millionth of a dollar before the page will build.
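
The normalisation and cost formula above can be sketched in Python as follows. This is a minimal illustration, not the page's actual data module: the function names `per_minute_from_hourly`, `monthly_cost` and `routes_agree` are hypothetical, and the one-millionth tolerance is simply taken from the description above.

```python
def per_minute_from_hourly(usd_per_hour: float) -> float:
    """Convert an hourly list price to a per-minute rate."""
    return usd_per_hour / 60


def monthly_cost(minutes: float, usd_per_minute: float,
                 free_tier_minutes: float = 0) -> float:
    """monthly_cost = max(0, minutes - free_tier_minutes) * usd_per_minute"""
    billable = max(0.0, minutes - free_tier_minutes)
    return billable * usd_per_minute


def routes_agree(usd_per_minute: float, usd_per_hour: float,
                 tol: float = 1e-6) -> bool:
    """Cross-check: the per-minute and hourly figures must match within tol."""
    return abs(usd_per_minute - per_minute_from_hourly(usd_per_hour)) <= tol


# AssemblyAI is listed at $0.27/hr, which is $0.0045/min.
assemblyai_rate = per_minute_from_hourly(0.27)
print(round(monthly_cost(2400, assemblyai_rate), 2))
print(routes_agree(0.0045, 0.27))
```

Keeping the free tier as a separate argument means the same function serves providers with and without an allowance.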

2. Batch versus real-time

Batch (pre-recorded) transcription processes a finished file; real-time streaming transcribes a live audio stream as it arrives. They are billed differently and not every model offers both. Deepgram Nova-3 is $0.0043/min batch but $0.0077/min streaming; AssemblyAI is the reverse, cheaper streaming than batch. Groq, ElevenLabs Scribe, Rev AI's machine endpoint and OpenAI's whisper-1 are batch-only. The mode toggle switches which rate the ranking uses, and a batch-only provider in real-time mode is marked unsupported rather than shown as free.
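
One way to model the mode toggle is to store a batch and a streaming rate per model and treat a missing rate as "unsupported". The sketch below is an assumption about how this could be done, not the tool's code: `Rates`, `RATES`, `rate_for_mode` and `describe` are hypothetical names, and only the rates quoted in this section are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    batch: Optional[float]       # USD per minute, None if not offered
    streaming: Optional[float]   # USD per minute, None if batch-only


# Figures quoted in this section; the table itself is illustrative.
RATES = {
    "Deepgram Nova-3": Rates(batch=0.0043, streaming=0.0077),
    "AssemblyAI Universal": Rates(batch=0.0045, streaming=0.0025),
    "Groq Whisper Large v3 Turbo": Rates(batch=0.04 / 60, streaming=None),
}


def rate_for_mode(provider: str, mode: str) -> Optional[float]:
    """Return the per-minute rate for 'batch' or 'realtime', or None if unsupported."""
    rates = RATES[provider]
    if mode == "batch":
        return rates.batch
    if mode == "realtime":
        return rates.streaming
    raise ValueError(f"unknown mode: {mode!r}")


def describe(provider: str, mode: str, minutes: float) -> str:
    """Projected monthly cost as text, or a label when the mode is unsupported."""
    rate = rate_for_mode(provider, mode)
    if rate is None:
        # Unsupported is reported explicitly, never as $0.
        return "No real-time API" if mode == "realtime" else "No batch API"
    return f"${rate * minutes:.2f}/mo"
```

Returning `None` rather than `0` for a missing rate is what keeps a batch-only model from sorting to the top of a streaming ranking.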

3. Free tiers and tiered pricing

Two providers here have a standing monthly free allowance: Google Cloud Speech-to-Text gives 60 free minutes a month, and Azure's F0 tier gives 5 free audio hours (300 minutes). Those are subtracted before billing. Amazon Transcribe's 60-minute free tier only lasts 12 months on new accounts, so it is treated as zero here. Amazon's per-minute rate also drops at high volume; the tier-1 rate is shown.
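
The free-tier rule can be expressed as a small lookup, sketched below. The dictionary keys and the `billable_minutes` helper are hypothetical; the allowances are the ones stated above, with Amazon's 12-month introductory tier deliberately set to zero.

```python
# Standing monthly free allowances, in minutes, as described above.
FREE_TIER_MINUTES = {
    "Google Cloud Speech-to-Text": 60,
    "Azure (F0 tier)": 5 * 60,   # 5 free audio hours
    "Amazon Transcribe": 0,      # 12-month intro tier treated as zero
}


def billable_minutes(provider: str, minutes: float) -> float:
    """Minutes left to bill after subtracting the standing free allowance."""
    free = FREE_TIER_MINUTES.get(provider, 0)  # unlisted providers: no free tier
    return max(0.0, minutes - free)


for name in FREE_TIER_MINUTES:
    print(name, billable_minutes(name, 500))
```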

4. Accuracy (WER) is benchmark-dependent

Word error rate is the fraction of words a model gets wrong against a human transcript. The WER figures here are each vendor's or Artificial Analysis's published English benchmark number — never our own measurement — and they are indicative. Accent, recording quality, domain vocabulary and background noise move WER far more than the small gaps between leading models. Use the column to shortlist, then test your finalists on your own audio.
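
To test finalists on your own audio, WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, using a word-level edit distance. The sketch below assumes simple lowercasing and whitespace tokenisation; published benchmarks usually apply heavier text normalisation, so its numbers will not match vendor figures exactly. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.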

5. Best-for badges

The “Cheapest”, “Lowest WER”, “Best real-time” and “Most languages” callouts are derived deterministically from your current selection and volume — cheapest is the lowest projected monthly cost among supported providers, lowest-WER is the minimum published figure, best-real-time is the lowest streaming rate among streaming-capable providers, and most-languages is the highest documented language count. Requiring a feature greys out providers that lack it without deleting them, so the comparison stays honest.
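
The badge rules above reduce to four min/max selections. The following is a hedged sketch rather than the page's implementation: the `Provider` record and `badges` function are hypothetical, it covers batch mode only, it ignores free tiers, and ties go to the first provider in the list (an assumption). The sample data uses only figures shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    name: str
    batch_rate: float                 # USD per minute
    streaming_rate: Optional[float]   # None means batch-only
    wer: Optional[float]              # published WER in percent, if any
    languages: int


def badges(providers: list[Provider], minutes: float) -> dict[str, str]:
    """Derive the four callouts from the current selection (batch mode)."""
    streaming = [p for p in providers if p.streaming_rate is not None]
    with_wer = [p for p in providers if p.wer is not None]
    result = {
        "Cheapest": min(providers, key=lambda p: p.batch_rate * minutes).name,
        "Most languages": max(providers, key=lambda p: p.languages).name,
    }
    if with_wer:
        result["Lowest WER"] = min(with_wer, key=lambda p: p.wer).name
    if streaming:
        result["Best real-time"] = min(streaming, key=lambda p: p.streaming_rate).name
    return result


SELECTION = [
    Provider("Whisper Large v3 Turbo", 0.04 / 60, None, 8.4, 99),
    Provider("Universal", 0.0045, 0.0025, 6.6, 99),
    Provider("Whisper", 0.0060, None, 8.1, 99),
]
print(badges(SELECTION, 2400))
```

Filtering to streaming-capable providers before taking the minimum is what prevents a batch-only model from winning "Best real-time".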

Worked examples

Podcast side-project — 40 hours/month, batch

A Colombo freelancer transcribes about 40 hours of podcast audio a month and wants the cheapest provider. 40 h = 2,400 minutes.

  1. Volume: 40 hours × 60 = 2,400 audio minutes, batch mode.
  2. Groq Whisper Large v3 Turbo: 2,400 × ($0.04 ÷ 60) = $1.60/mo (about $0.000667/min, displayed rounded as $0.0007/min).
  3. Deepgram Nova-3: 2,400 × $0.0043 = $10.32/mo (and it adds diarization).
  4. AssemblyAI Universal: 2,400 × ($0.27 ÷ 60) = 2,400 × $0.0045 = $10.80/mo.
  5. OpenAI Whisper: 2,400 × $0.006 = $14.40/mo.
  6. Cheapest is Groq at $1.60 — but it has no diarization. Needing speaker labels, the freelancer picks Deepgram Nova-3 at $10.32.

Voice-note app — 500 minutes/month, free tiers

A small app transcribes 500 minutes a month. This example shows how free tiers change the ranking (batch mode, USD).

  1. Volume: 500 audio minutes, batch mode.
  2. Google STT (Chirp 2): first 60 min free → bill 440 min × $0.016 = $7.04/mo.
  3. Amazon Transcribe: no standing free tier → 500 × $0.024 = $12.00/mo.
  4. Deepgram Nova-3: no free tier but a low rate → 500 × $0.0043 = $2.15/mo.
  5. Even with Google's free 60 minutes, Deepgram's low per-minute rate wins at $2.15/mo. The free tier matters most at tiny volumes.
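
To put a number on "tiny volumes", one can solve for the volume at which a free-tier provider and a cheaper flat-rate provider cost the same. Setting (m − free) × r₁ = m × r₂ gives m = free × r₁ ÷ (r₁ − r₂). The helper below is a hypothetical sketch using the rates from the steps above; for Google versus Deepgram it works out to roughly 82 minutes a month.

```python
def break_even_minutes(free_minutes: float, rate_with_free_tier: float,
                       flat_rate: float) -> float:
    """Volume m where (m - free) * rate_with_free_tier == m * flat_rate.

    Below m the free-tier provider is cheaper; above it the flat-rate one is.
    """
    if rate_with_free_tier <= flat_rate:
        raise ValueError("the free-tier provider is never the more expensive one")
    return free_minutes * rate_with_free_tier / (rate_with_free_tier - flat_rate)


# Google STT (60 free minutes, $0.016/min) versus Deepgram Nova-3 ($0.0043/min).
m = break_even_minutes(60, 0.016, 0.0043)
print(round(m, 1))
```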

Edge case — free-tier boundary and zero volume

This example tests the arithmetic exactly at Google's 60-minute free boundary, and at zero, to confirm the math never produces a negative value or NaN.

  1. At exactly 60 minutes on Google: max(0, 60 − 60) = 0 billable → $0.00.
  2. At 61 minutes: max(0, 61 − 60) = 1 billable × $0.016 = $0.016/mo.
  3. At 0 minutes (or a blank/negative input): clamps to 0 → $0.00 for every provider, no NaN.
  4. Switching Groq to real-time mode returns “No real-time API” rather than a misleading $0, so it can't win a streaming ranking it doesn't serve.
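
The clamping in the steps above can be sketched as an input guard that runs before the cost formula. This is an assumption about one reasonable implementation, not the page's code; `clamp_minutes` and `google_cost` are hypothetical names.

```python
import math


def clamp_minutes(raw) -> float:
    """Turn user input into a finite, non-negative minute count; else 0."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0   # blank or non-numeric input
    if not math.isfinite(value) or value < 0:
        return 0.0   # NaN, infinity or negative input
    return value


def google_cost(raw_minutes) -> float:
    """Google STT at $0.016/min with 60 free minutes, as in the steps above."""
    return max(0.0, clamp_minutes(raw_minutes) - 60) * 0.016


for raw in (60, 61, 0, "", -5, "nan"):
    print(repr(raw), google_cost(raw))
```

Sanitising the input once, before any arithmetic, means every provider's formula can assume a valid number.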

Sources & references

Every rate and capability flag was last cross-checked against these sources on 2026-06-20. Speech-to-text pricing and models change frequently; this page is reviewed manually and whenever a provider announces a substantive pricing or model update. WER figures are vendor- or Artificial-Analysis-published and benchmark-dependent.

Spot a stale price, a missing provider, or a misclaimed capability?

Email me at [email protected] — most fixes ship within 24 hours.