AI Text-to-Speech (TTS) API Comparison
Compare 18 hosted text-to-speech APIs — ElevenLabs, OpenAI, Google Cloud, Azure, Amazon Polly, PlayHT, Murf, Cartesia and Deepgram Aura — by price per character, voice cloning, streaming latency, language coverage and published naturalness. Enter your monthly volume and rank them by cost. Every figure cites the vendor source.
How it works
Choosing a text-to-speech (TTS) provider is a multi-axis decision: price per character, voice naturalness, whether you need voice cloning, whether you need low-latency streaming for a live voice agent or just batch rendering for narration, language coverage, SSML support, and the commercial-use licence. This page lays all of those out for the 18 models that developers and creators most often shortlist, drawn from 9 vendors, and ranks them by what they would actually cost at your volume.
1. The cost formula
Almost every TTS provider bills per character of input text. The tool normalises your volume to characters and applies any standing free tier:
monthly_cost = max(0, characters − free_tier) ÷ 1,000,000 × usd_per_million
If you enter words or minutes instead of characters, they are converted first: one word ≈ 6 characters (≈5 letters plus a space), and one spoken minute ≈ 900 characters (≈150 words a minute × 6). The data module cross-checks every figure a second way — via the per-1,000-character rate — so the two routes must agree to the millionth of a dollar before the page will build.
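The formula and unit conversions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the page's actual data module: the function names are hypothetical, and the second function mirrors the described per-1,000-character cross-check as one plausible reading of it.

```python
from decimal import Decimal

CHARS_PER_WORD = 6      # from the text: about 5 letters plus a space
CHARS_PER_MINUTE = 900  # from the text: about 150 words a minute x 6


def to_characters(amount: float, unit: str) -> int:
    """Normalise a monthly volume given in characters, words or minutes."""
    factors = {"characters": 1, "words": CHARS_PER_WORD, "minutes": CHARS_PER_MINUTE}
    if unit not in factors:
        raise ValueError(f"unknown unit: {unit!r}")
    return round(amount * factors[unit])


def monthly_cost(characters: int, usd_per_million: Decimal, free_tier: int = 0) -> Decimal:
    """max(0, characters - free_tier) / 1,000,000 x usd_per_million."""
    billable = max(0, characters - free_tier)
    return Decimal(billable) / Decimal(1_000_000) * usd_per_million


def monthly_cost_via_per_1k(characters: int, usd_per_million: Decimal, free_tier: int = 0) -> Decimal:
    """Second route (assumed shape of the cross-check): go through the per-1,000 rate."""
    usd_per_1k = usd_per_million / Decimal(1000)
    billable = max(0, characters - free_tier)
    return Decimal(billable) / Decimal(1000) * usd_per_1k


def routes_agree(characters: int, usd_per_million: Decimal, free_tier: int = 0) -> bool:
    """True when both routes match to within a millionth of a dollar."""
    a = monthly_cost(characters, usd_per_million, free_tier)
    b = monthly_cost_via_per_1k(characters, usd_per_million, free_tier)
    return abs(a - b) < Decimal("0.000001")
```

For example, `monthly_cost(to_characters(100, "minutes"), Decimal("15"))` prices 90,000 characters at a made-up rate of $15 per million. `Decimal` is used here so that the two routes can be compared without floating-point noise.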
2. Per-character, credit and per-minute pricing
The big clouds (OpenAI, Google, Azure, Amazon) publish a clean per-character rate. ElevenLabs, PlayHT, Murf and Cartesia sell credits instead, where the cost per character depends on your plan; those rows show an effective per-million-character rate (marked “eff”) for a representative tier. The one exception among the per-character vendors is OpenAI's gpt-4o-mini-tts, which is priced per audio minute and converted here using the 900-characters-per-minute assumption. Every conversion is documented in the data file and labelled in the table so nothing is hidden.
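The two conversions described above might look like the following sketch. Both function names are hypothetical, and the credit-plan calculation (plan price divided by the characters the plan covers) is an assumption about how an effective rate could be derived; the page does not spell out its exact method.

```python
from decimal import Decimal

CHARS_PER_MINUTE = 900  # the page's stated assumption


def per_minute_to_per_million(usd_per_minute: Decimal) -> Decimal:
    """Convert a per-audio-minute price to USD per million characters."""
    return usd_per_minute * Decimal(1_000_000) / Decimal(CHARS_PER_MINUTE)


def effective_per_million(plan_usd: Decimal, plan_characters: int) -> Decimal:
    """Assumed effective rate for a credit plan: price / covered characters x 1,000,000."""
    if plan_characters <= 0:
        raise ValueError("plan_characters must be positive")
    return plan_usd * Decimal(1_000_000) / Decimal(plan_characters)
```

With made-up inputs, a price of $0.009 per minute converts to $10 per million characters, and a $20 plan covering 400,000 characters works out to an effective $50 per million. Neither figure is a real vendor rate.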
3. Free tiers
Three providers here offer a free tier, and two of them have a standing monthly allowance that is subtracted before billing: Google Cloud gives 4,000,000 free Standard characters and 1,000,000 free WaveNet/Neural2 characters every month; Azure's F0 tier gives 500,000 free characters a month. Amazon Polly's free tier lasts only 12 months on new accounts, so it is treated as zero here. At small volumes the free tier can make a pricier per-character rate the cheapest overall, and the ranking accounts for that automatically.
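To see how a free tier can flip the ranking, here is a minimal sketch with two invented rows. The row names and the per-million rates are hypothetical and chosen only to make the arithmetic easy to follow; they are not vendor prices.

```python
def monthly_cost(characters: int, usd_per_million: float, free_tier: int = 0) -> float:
    """Cost after subtracting the free allowance, never below zero."""
    return max(0, characters - free_tier) / 1_000_000 * usd_per_million


# Hypothetical rows: a pricier rate with a free allowance vs a cheaper rate without one.
ROWS = [
    {"name": "with-free-tier", "usd_per_million": 16.0, "free_tier": 1_000_000},
    {"name": "no-free-tier", "usd_per_million": 10.0, "free_tier": 0},
]


def rank_by_cost(rows: list[dict], characters: int) -> list[dict]:
    """Sort rows from cheapest to most expensive at the given monthly volume."""
    return sorted(
        rows,
        key=lambda r: monthly_cost(characters, r["usd_per_million"], r["free_tier"]),
    )
```

At 1,500,000 characters the row with the free allowance costs $8 against $15 and ranks first; at 5,000,000 characters it costs $64 against $50 and drops to second. The crossover for these invented numbers sits at roughly 2.67 million characters.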
4. Quality (Elo) is benchmark-dependent
The Elo column is each model's community-published naturalness standing (TTS-Arena and Artificial Analysis), rounded and indicative — never our own measurement. Higher is better. Listener preference, the target language, and the kind of script (conversational versus formal narration) move naturalness far more than the small gaps between leading models. Use the column to shortlist, then generate a sample of your own text on your finalists.
5. Best-for badges
The “Cheapest”, “Best quality (Elo)”, “Lowest latency” and “Most languages” callouts are derived deterministically from your current selection and volume — cheapest is the lowest projected monthly cost, best-quality is the highest published Elo, lowest-latency is the minimum published streaming time-to-first-byte among streaming-capable providers, and most-languages is the highest documented locale count. Requiring a feature greys out providers that lack it without deleting them, so the comparison stays honest.
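The badge rules above can be sketched as a small pure function. Everything in this example is an assumption for illustration: the `Row` fields, the tie-break on name, and the choice to compute badges only over rows that satisfy the required feature are not confirmed by the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    name: str
    monthly_cost: float     # projected cost at the user's volume
    elo: int                # published naturalness Elo
    ttfb_ms: Optional[int]  # streaming time-to-first-byte; None if no streaming
    languages: int          # documented locale count
    cloning: bool


def split_by_requirement(rows: list[Row], require_cloning: bool = False):
    """Keep every row, but separate those greyed out by a required feature."""
    eligible = [r for r in rows if r.cloning or not require_cloning]
    greyed = [r for r in rows if require_cloning and not r.cloning]
    return eligible, greyed


def badges(rows: list[Row]) -> dict[str, str]:
    """Derive each badge deterministically; ties break on name (an assumption)."""
    if not rows:
        return {}
    result = {
        "Cheapest": min(rows, key=lambda r: (r.monthly_cost, r.name)).name,
        "Best quality (Elo)": min(rows, key=lambda r: (-r.elo, r.name)).name,
        "Most languages": min(rows, key=lambda r: (-r.languages, r.name)).name,
    }
    streaming = [r for r in rows if r.ttfb_ms is not None]
    if streaming:
        result["Lowest latency"] = min(streaming, key=lambda r: (r.ttfb_ms, r.name)).name
    return result


# Hypothetical sample data, not real providers or figures.
SAMPLE = [
    Row("A", 30.0, 1100, 300, 30, True),
    Row("B", 12.0, 1050, None, 50, False),
    Row("C", 20.0, 1080, 150, 10, True),
]
```

On the sample rows, `badges(SAMPLE)` gives B the "Cheapest" and "Most languages" badges, A "Best quality (Elo)" and C "Lowest latency". Requiring cloning moves B into the greyed list rather than dropping it, and "Cheapest" passes to C.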
Sources & references
- ElevenLabs — API pricing (Multilingual v2, Flash v2.5)
- OpenAI — API pricing (tts-1, tts-1-hd, gpt-4o-mini-tts)
- Google Cloud — Text-to-Speech pricing
- Microsoft Azure — AI Speech (TTS) pricing
- Amazon — Polly pricing (Standard, Neural, Generative)
- PlayHT — pricing
- Murf AI — API pricing
- Cartesia — pricing (Sonic)
- Deepgram — pricing (Aura-2 TTS)
- TTS-Arena — community naturalness Elo leaderboard
Every rate and capability flag was last cross-checked against these sources on 2026-06-21. Text-to-speech pricing, voices and models change frequently; this page is reviewed manually, and again whenever a provider announces a substantive pricing or model update. Quality Elo figures are community-published and benchmark-dependent.