Self-Hosting LLM Cost Calculator (vs API Break-Even)
Renting a GPU to self-host an open model is a flat monthly cost; a closed API bills per token. This tool finds the exact monthly token volume where self-hosting starts to win, shows both costs in USD and LKR, and gives a plain self-host or stay-on-API verdict. No signup, sources cited.
How it works
Two cost curves cross. A closed-LLM API charges per token, so its monthly bill rises in a straight line with your usage. A rented cloud GPU costs the same every month whether it is busy or idle. The weights of open models (Llama, Mistral, Qwen) are free to run, so the only cost is the GPU plus your overhead. Because one line slopes up and the other is flat, there is a single crossover volume. Below it the API is cheaper; above it self-hosting wins. All figures are monthly, using an average month of 30.4 days, so hours per month H = 24 × 30.4 = 729.6.
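The crossover idea can be sketched in a few lines of Python. This is only an illustration, not the tool's own code: the blended price of 2.00 USD per million tokens and the flat self-host cost of 1,500 USD per month are made-up inputs, and the function names are hypothetical.

```python
# Average month used throughout: 30.4 days -> 729.6 hours.
HOURS_PER_MONTH = 24 * 30.4


def api_monthly_cost(tokens_millions, blended_price_per_million):
    """API bill: a straight line through zero, rising with usage."""
    return tokens_millions * blended_price_per_million


def crossover_millions(blended_price_per_million, self_host_flat):
    """Volume (millions of tokens) where the sloped line meets the flat one."""
    return self_host_flat / blended_price_per_million


def cheaper_option(tokens_millions, blended_price_per_million, self_host_flat):
    """Return which path costs less at a given monthly volume."""
    api = api_monthly_cost(tokens_millions, blended_price_per_million)
    return "api" if api < self_host_flat else "self-host"


# Hypothetical inputs, NOT taken from any provider's price list.
PRICE = 2.00        # USD per million tokens (assumed)
SELF_HOST = 1500.0  # USD per month, flat (assumed)

for volume in (100, 500, 1000, 2000):
    print(volume, cheaper_option(volume, PRICE, SELF_HOST))
```

With these assumed numbers the two lines meet at 750 million tokens per month; volumes below that favour the API and volumes above it favour self-hosting.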
- API path. api_cost = in_M × price_in + out_M × price_out, where in_M and out_M are input and output tokens in millions and the prices are the selected model's per-million-token USD list prices.
- Self-host path. gpu_cost = gpu_hourly × gpu_count × H × utilisation, then self_cost = gpu_cost × (1 + overhead). Within the chosen pool this is independent of token volume: a flat line.
- Capacity check. Effective output capacity = tps × gpu_count × 3600 × H × utilisation tokens per month, using a conservative vLLM steady-state tokens/sec figure. If your monthly output exceeds it, or the model doesn't fit in VRAM, the tool says “add GPUs” instead of reporting a false saving.
- Break-even volume. Holding your input:output mix r = in_M / (in_M + out_M) fixed, the blended price is r × price_in + (1 − r) × price_out per million tokens, so the crossover is T_be = self_cost / blended_price, in millions of tokens per month.
- Verdict and LKR. Compare api_cost against self_cost at your actual volume, report the signed difference, then multiply every USD figure by your USD-to-LKR rate for the rupee line.
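The five steps above can be sketched as a small Python module. Treat it as a rough illustration under stated assumptions rather than the tool's implementation: the Scenario class, its field names, and every number in the example are hypothetical, and the VRAM-fit check is left out because it needs a model and GPU table that is not reproduced here.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30.4  # H = 729.6


@dataclass
class Scenario:
    in_m: float         # input tokens per month, millions
    out_m: float        # output tokens per month, millions
    price_in: float     # USD per million input tokens
    price_out: float    # USD per million output tokens
    gpu_hourly: float   # USD per GPU-hour
    gpu_count: int
    utilisation: float  # fraction between 0 and 1
    overhead: float     # e.g. 0.2 for 20 % on top of GPU cost
    tps: float          # steady-state output tokens/sec per GPU
    usd_to_lkr: float


def api_cost(s: Scenario) -> float:
    return s.in_m * s.price_in + s.out_m * s.price_out


def self_cost(s: Scenario) -> float:
    gpu_cost = s.gpu_hourly * s.gpu_count * HOURS_PER_MONTH * s.utilisation
    return gpu_cost * (1 + s.overhead)


def output_capacity_tokens(s: Scenario) -> float:
    # tokens/sec * GPUs * seconds per hour * hours per month * utilisation
    return s.tps * s.gpu_count * 3600 * HOURS_PER_MONTH * s.utilisation


def break_even_millions(s: Scenario) -> float:
    r = s.in_m / (s.in_m + s.out_m)  # input share of total tokens
    blended_price = r * s.price_in + (1 - r) * s.price_out
    return self_cost(s) / blended_price


def verdict(s: Scenario) -> str:
    # Capacity first, so an undersized pool never reports a false saving.
    if s.out_m * 1_000_000 > output_capacity_tokens(s):
        return "add GPUs"
    return "self-host" if api_cost(s) > self_cost(s) else "stay on API"


# Every value below is a made-up input for illustration only.
example = Scenario(in_m=800, out_m=200, price_in=2.5, price_out=10.0,
                   gpu_hourly=2.0, gpu_count=1, utilisation=1.0,
                   overhead=0.2, tps=1000, usd_to_lkr=300.0)

saving_usd = api_cost(example) - self_cost(example)  # signed difference
print(verdict(example))
print(round(break_even_millions(example), 2), "million tokens at break-even")
print(round(saving_usd, 2), "USD =", round(saving_usd * example.usd_to_lkr, 2), "LKR")
```

For these assumed inputs the API bill is 4,000 USD, the self-host bill is 1,751.04 USD, and the break-even sits at roughly 437.76 million tokens per month, so the sketch returns a self-host verdict. Swapping in real list prices and a measured tokens/sec figure is what makes the result meaningful.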
Prices come from the providers' own pages (OpenAI, Anthropic, Google, DeepSeek); GPU hourly rates from RunPod and Lambda; tokens/sec from the vLLM benchmark suite; and the FX default from the Central Bank of Sri Lanka. Each figure carries a last-verified date because provider pricing changes without notice. The throughput table deliberately uses steady-state rather than peak numbers so the tool errs against over-promising self-hosting.
Sources & references
- OpenAI — API pricing (per-million-token rates)
- Anthropic — Claude API pricing
- Google — Gemini API pricing
- DeepSeek — API pricing
- RunPod — cloud GPU hourly pricing (on-demand & community/spot)
- Lambda Cloud — on-demand GPU pricing
- vLLM — serving throughput benchmarks (tokens/sec)
- Central Bank of Sri Lanka — daily indicative USD/LKR rate
API prices, GPU hourly rates, throughput figures, and the USD-to-LKR default were last cross-checked against these sources on 2026-06-09. The tool is refreshed each quarter and whenever a major provider changes its pricing. It pairs with the GPU Cloud Cost Calculator (raw rental cost) and the LLM VRAM Calculator (which GPU fits which model).