AI Audio Token & Cost Calculator
Enter how long your audio is and see how many audio tokens GPT-4o-audio and Gemini bill for it, then the cost per request and per month in USD and LKR, side by side. Gemini's 32-tokens-per-second rule is taken straight from Google's docs. Everything runs in your browser.
How it works
A multimodal model doesn't bill audio by the megabyte — it converts the sound into tokens, the same unit it charges for text, and prices them at its audio input rate. The token count depends only on the clip's duration, not its bitrate, sample rate, or file format, so a 3-minute voice note costs the same whether it's a 16 kHz WhatsApp recording or a studio WAV.
Gemini: the documented anchor. Google's token-counting docs state that audio is counted at a fixed 32 tokens per second, so audioTokens = durationSeconds × 32. One minute is 1,920 tokens; a 3-minute clip is 5,760. This rule is exact and does not change when pricing changes, which is why it's the backbone of this calculator.
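The Gemini rule above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function name is hypothetical, and rounding fractional seconds up to a whole token is an assumption of this sketch rather than something the docs quoted here specify.

```python
import math

# Fixed rate quoted above from Google's token-counting docs.
GEMINI_AUDIO_TOKENS_PER_SECOND = 32


def gemini_audio_tokens(duration_seconds: float) -> int:
    """Return the audio token count for a clip of the given duration.

    Only the duration matters; bitrate, sample rate and file format
    play no part in the count.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: a partial token is rounded up to the next whole token.
    return math.ceil(duration_seconds * GEMINI_AUDIO_TOKENS_PER_SECOND)


one_minute = gemini_audio_tokens(60)      # 1,920 tokens, as in the text
three_minutes = gemini_audio_tokens(180)  # 5,760 tokens, as in the text
```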
OpenAI: an estimate, plus an exact path. OpenAI bills audio as distinct audio tokens but does not publish a fixed tokens-per-second figure. The tool uses an estimate of about 25 tokens per second for the GPT-4o audio models, derived from OpenAI's per-minute audio pricing, and marks those rows with a ≈. For an exact figure, OpenAI returns input_token_details.audio_tokens in the usage object of every response; paste that value into the "By known token count" mode for precise costs.
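The two OpenAI paths described above, estimate versus exact count, might look like this in Python. It is a sketch under stated assumptions: the function names are hypothetical, the usage object is treated as a plain dictionary, and the field path `input_token_details.audio_tokens` is taken from the paragraph above and may be named differently depending on the endpoint you call.

```python
from typing import Any, Mapping, Optional, Tuple

# The page's estimate for GPT-4o audio models, not a published OpenAI figure.
OPENAI_EST_TOKENS_PER_SECOND = 25


def estimate_openai_audio_tokens(duration_seconds: float) -> int:
    """Approximate audio tokens from duration alone (the rows marked with ≈)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds * OPENAI_EST_TOKENS_PER_SECOND)


def exact_audio_tokens(usage: Mapping[str, Any]) -> Optional[int]:
    """Read the billed audio token count from a usage dict, if present.

    Assumes the shape described in the text:
    usage["input_token_details"]["audio_tokens"].
    """
    details = usage.get("input_token_details") or {}
    return details.get("audio_tokens")


def openai_audio_tokens(
    duration_seconds: float,
    usage: Optional[Mapping[str, Any]] = None,
) -> Tuple[int, bool]:
    """Return (tokens, is_exact), preferring the exact count when available."""
    if usage is not None:
        exact = exact_audio_tokens(usage)
        if exact is not None:
            return exact, True
    return estimate_openai_audio_tokens(duration_seconds), False
```

Returning the `is_exact` flag alongside the count lets a caller decide whether to show the ≈ marker next to the resulting cost.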
Cost. Each request has up to three billed parts, each priced at the model's own per-1M-token rate:
- audioInput = audioTokens / 1,000,000 × audioInputRate
- textInput = textPromptTokens / 1,000,000 × textInputRate
- output = outputTokens / 1,000,000 × outputRate
The per-request total is the sum of those three; the monthly figure multiplies it by your requests per month, and the LKR column applies your exchange rate. As a cross-check, the audio line can also be read per minute — 32 × 60 / 1,000,000 × audioInputRate — and the two derivations agree to the cent, which the build verifies on every deploy. Output here means generated text (a summary or answer); spoken audio replies use a separate rate and are out of scope.
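The cost arithmetic above, including the per-minute cross-check for Gemini, can be written out as a short Python sketch. All names here are hypothetical, and the rates and exchange rate in the usage lines are placeholders chosen for easy arithmetic, not real prices; substitute figures from the current pricing pages.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, supplied by the caller."""
    audio_input: float
    text_input: float
    output: float


def request_cost_usd(audio_tokens: int, text_prompt_tokens: int,
                     output_tokens: int, rates: Rates) -> float:
    """Sum the three billed parts of one request."""
    audio_input = audio_tokens / 1_000_000 * rates.audio_input
    text_input = text_prompt_tokens / 1_000_000 * rates.text_input
    output = output_tokens / 1_000_000 * rates.output
    return audio_input + text_input + output


def monthly_cost(per_request_usd: float, requests_per_month: int,
                 usd_to_lkr: float) -> Tuple[float, float]:
    """Return (monthly USD, monthly LKR) at the given exchange rate."""
    usd = per_request_usd * requests_per_month
    return usd, usd * usd_to_lkr


def gemini_audio_cost_per_minute(audio_input_rate: float) -> float:
    """Cross-check: the Gemini audio line read per minute (32 tokens/s)."""
    return 32 * 60 / 1_000_000 * audio_input_rate


# Placeholder rates and exchange rate, for illustration only.
rates = Rates(audio_input=1.0, text_input=0.5, output=2.0)
per_request = request_cost_usd(5_760, 100, 200, rates)  # 3-minute Gemini clip
usd, lkr = monthly_cost(per_request, requests_per_month=1_000, usd_to_lkr=300.0)
```

For a 3-minute clip, `gemini_audio_cost_per_minute(rates.audio_input) * 3` should match the audio part of `request_cost_usd` up to floating-point rounding, which is the agreement the cross-check relies on.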
Worked examples
Frequently asked questions
Sources & references
- Google — Gemini token counting (audio = 32 tokens/second)
- Google — Gemini API pricing (audio, text input and output rates)
- OpenAI — Audio guide (audio billed as audio tokens; usage object)
- OpenAI — API pricing (GPT-4o audio input and output rates)
- Central Bank of Sri Lanka — daily indicative USD/LKR rate
The 32 tokens/second rule and the per-1M rates were last cross-checked against these sources on 2026-06-30. The count rule is stable; pricing is revised periodically, so confirm dollar figures against your latest invoice.