
Self-Hosting LLM Cost Calculator (vs API Break-Even)

Renting a GPU to self-host an open model is a flat monthly cost; a closed API bills per token. This tool finds the exact monthly token volume where self-hosting starts to win, shows both costs in USD and LKR, and gives a plain self-host or stay-on-API verdict. No signup, sources cited.

By Induwara Ashinsana · Updated Jun 9, 2026
Self-host vs API break-even
Prices verified 2026-06-09
Monthly workload

Input tokens (M): prompt tokens you send, in millions. e.g. 120 = 120M.

Output tokens (M): tokens the model generates, in millions. e.g. 40 = 40M.

GPU rental setup

GPU count: integer 1–8. Add GPUs to fit bigger models.

Pricing: spot is cheaper but can be reclaimed mid-run.

Utilisation (%): share of the month the GPU is actually serving (1–100).

Overhead (%): ops, storage, egress markup on GPU cost (0–100).

USD-to-LKR rate (Rs): CBSL indicative default. Used only for LKR lines.
Stay on the API — self-hosting costs $888.46 more/mo

At 160M tokens/month your volume is too low to amortise a 24/7 GPU. The API bill is $110.00 vs $998.46 to self-host.

API monthly cost: $110.00 (Rs 33,550)
Self-host monthly cost: $998.46 (Rs 304,530)
Self-host costs extra: $888.46 (Rs 270,980)

Break-even volume

With this GPU, model and mix, self-hosting beats the API above 1.45B tokens/month. You're at 160M, below it, so the API wins.

Assumptions

Input | Value used | Source
API price (in / out) | $0.25 / $2 per 1M | OpenAI
GPU rate | $1.19/hr × 1 (on-demand) | RunPod/Lambda
Hours/month | 729.6 hr (24 × 30.4 × 100%) | calendar
Overhead | 15% → $130.23 | your input
Serving throughput | 2,500 tok/s → 6.57B/mo capacity | vLLM
VRAM | 80 GB available · 18 GB needed | 8B model

Fixed GPU cost is volume-independent; API cost scales with tokens. The throughput figure is conservative steady-state, not peak, so the capacity check never reports a self-host saving the GPU couldn't actually serve.

API prices are each provider's published list price; GPU $/hr are RunPod/Lambda reference rates; tokens/sec are conservative vLLM steady-state figures. All last verified 2026-06-09 and listed with links below the calculator. Open-weight model weights are free to run — the only self-host cost modelled is GPU rental plus your overhead.

How it works

Two cost curves cross. A closed-LLM API charges per token, so its monthly bill rises in a straight line with your usage. A rented cloud GPU costs the same every month whether it is busy or idle — open-weight model weights (Llama, Mistral, Qwen) are free to run, so the only cost is the GPU plus your overhead. Because one line slopes up and the other is flat, there is a single crossover volume. Below it the API is cheaper; above it self-hosting wins. All figures are monthly, using an average month of 30.4 days, so hours per month H = 24 × 30.4 = 729.6.

  1. API path. api_cost = in_M × price_in + out_M × price_out, where in_M and out_M are input and output tokens in millions and the prices are the selected model's per-million-token USD list prices.
  2. Self-host path. gpu_cost = gpu_hourly × gpu_count × H × utilisation, then self_cost = gpu_cost × (1 + overhead). Within the chosen pool this is independent of token volume — a flat line.
  3. Capacity check. Effective output capacity = tps × gpu_count × 3600 × H × utilisation, using a conservative vLLM steady-state tokens/sec figure. If your monthly output exceeds it — or the model doesn't fit in VRAM — the tool says “add GPUs” instead of reporting a false saving.
  4. Break-even volume. Holding your input:output mix r = in_M / (in_M + out_M) fixed, the blended price is r × price_in + (1 − r) × price_out per million tokens, so the crossover is T_be = self_cost / blended_price, expressed in millions of total tokens per month.
  5. Verdict and LKR. Compare api_cost against self_cost at your actual volume, report the signed difference, then multiply every USD figure by your USD-to-LKR rate for the rupee line.
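
The first, second and fourth steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculator's actual source: the function names (api_cost, self_host_cost, break_even_millions) and the default arguments are hypothetical, while the formulas and the sample figures are the ones given on this page.

```python
HOURS_PER_MONTH = 24 * 30.4  # 729.6, the average month used on this page


def api_cost(in_m: float, out_m: float, price_in: float, price_out: float) -> float:
    """Monthly API bill in USD; volumes in millions of tokens, prices per 1M."""
    return in_m * price_in + out_m * price_out


def self_host_cost(gpu_hourly: float, gpu_count: int,
                   utilisation: float = 1.0, overhead: float = 0.15) -> float:
    """Flat monthly self-host bill in USD: GPU rental plus the overhead markup."""
    gpu_cost = gpu_hourly * gpu_count * HOURS_PER_MONTH * utilisation
    return gpu_cost * (1 + overhead)


def break_even_millions(self_cost: float, in_m: float, out_m: float,
                        price_in: float, price_out: float) -> float:
    """Total monthly tokens (in millions) at which the API bill equals self_cost."""
    total = in_m + out_m
    if total <= 0:
        raise ValueError("need a non-zero token volume to define the mix")
    r = in_m / total  # input share of the mix, held fixed
    blended = r * price_in + (1 - r) * price_out  # USD per 1M tokens
    return self_cost / blended


if __name__ == "__main__":
    # Figures from the "small support bot" example on this page.
    api = api_cost(120, 40, 0.25, 2.00)
    hosted = self_host_cost(1.19, 1)
    t_be = break_even_millions(hosted, 120, 40, 0.25, 2.00)
    print(f"API ${api:.2f}, self-host ${hosted:.2f}, break-even {t_be:,.0f}M tokens")
```

Because self_host_cost takes no token volume, it stays constant as usage grows, which is why a single division by the blended price is enough to locate the crossover.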

Prices come from the providers' own pages (OpenAI, Anthropic, Google, DeepSeek); GPU hourly rates from RunPod and Lambda; tokens/sec from the vLLM benchmark suite; and the FX default from the Central Bank of Sri Lanka. Each figure carries a last-verified date because provider pricing changes without notice. The throughput table deliberately uses steady-state rather than peak numbers so the tool errs against over-promising self-hosting.

Worked examples

Small support bot — the API wins

120M input + 40M output · GPT-5 mini vs Llama 3.1 8B on 1× A100 80GB

  1. API: 120 × $0.25 + 40 × $2.00 = $30 + $80 = $110.00/mo
  2. GPU: $1.19 × 1 × 729.6 = $868.22; +15% overhead = $998.46/mo
  3. Verdict: stay on API — self-hosting costs $888.46 more
  4. Mix r = 120/160 = 0.75; blended = 0.75×0.25 + 0.25×2.00 = $0.6875/M
  5. Break-even: $998.46 / $0.6875 ≈ 1,452M (1.45B) tokens/mo
  6. At 160M tokens you're far below break-even, so the verdict is the expected one: don't self-host

High-volume product — self-hosting wins

1,500M input + 500M output (2B total) · same A100 setup

  1. API: 1500 × $0.25 + 500 × $2.00 = $375 + $1,000 = $1,375.00/mo
  2. Self-host (flat, unchanged): $998.46/mo
  3. Verdict: self-host saves $376.54/mo
  4. Capacity: A100 serves ~2,500 tok/s → ~6.57B output tokens/mo ≥ 500M ✓
  5. Consistency: 2B total > 1.45B break-even, so self-hosting wins — matches example 1
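
The verdict step for both examples can be sketched as a signed comparison followed by the rupee conversion. This is a rough sketch, not the tool's own code: the function name verdict and the returned dictionary keys are hypothetical, and the rate of Rs 305 per USD is an assumption inferred from the figures shown on this page (Rs 33,550 for $110.00), not a quoted CBSL value.

```python
def verdict(api_usd: float, self_usd: float, usd_to_lkr: float) -> dict:
    """Compare the two monthly bills and report the gap in USD and LKR."""
    diff = api_usd - self_usd  # positive means self-hosting is cheaper
    # A tie is treated as "stay on API" here; that tie-break is an assumption.
    choice = "self-host" if diff > 0 else "stay on API"
    gap = abs(diff)
    return {
        "choice": choice,
        "difference_usd": round(gap, 2),
        "difference_lkr": round(gap * usd_to_lkr),
    }


if __name__ == "__main__":
    rate = 305.0  # assumed USD-to-LKR rate, inferred from this page's figures
    print(verdict(110.00, 998.46, rate))    # small support bot
    print(verdict(1375.00, 998.46, rate))   # high-volume product
```

With the assumed rate, the first call reproduces the $888.46 and Rs 270,980 gap shown in the result panel, and the second reports the $376.54 saving from the high-volume example.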

Edge case — capacity guard fires

0M input + 20,000M output · Llama 3.1 8B on 1× RTX 4090

  1. Capacity: 1,300 tok/s × 1 × 3600 × 729.6 ≈ 3.41B output tokens/mo
  2. Requested output: 20,000M (20B) > 3.41B servable
  3. Verdict: setup can't serve this volume — add GPUs, no saving reported
  4. This guard stops the tool from claiming a saving the GPU couldn't deliver
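
The capacity guard can be sketched as two checks: does the model fit in VRAM, and does the requested output fit in the month's servable tokens. The function names below are hypothetical, and two details are assumptions rather than anything stated on this page: VRAM is compared as a single total for the pool, and the RTX 4090 is taken to have 24 GB.

```python
HOURS_PER_MONTH = 24 * 30.4  # 729.6, the average month used on this page


def monthly_output_capacity(tps: float, gpu_count: int,
                            utilisation: float = 1.0) -> float:
    """Output tokens the pool can serve in one month at steady-state tps."""
    return tps * gpu_count * 3600 * HOURS_PER_MONTH * utilisation


def can_serve(out_m: float, tps: float, gpu_count: int,
              vram_needed_gb: float, vram_available_gb: float,
              utilisation: float = 1.0) -> bool:
    """True only if the model fits in VRAM and the output volume is servable."""
    fits_in_vram = vram_needed_gb <= vram_available_gb
    capacity_m = monthly_output_capacity(tps, gpu_count, utilisation) / 1e6
    return fits_in_vram and out_m <= capacity_m


if __name__ == "__main__":
    # Edge case from this page: 20,000M output tokens on one RTX 4090.
    cap = monthly_output_capacity(1300, 1)
    print(f"capacity: {cap / 1e9:.2f}B output tokens/mo")
    if not can_serve(20_000, 1300, 1, vram_needed_gb=18, vram_available_gb=24):
        print("setup can't serve this volume: add GPUs, no saving reported")
```

Running the guard before the cost comparison means a saving is only ever reported for a setup that passes both checks.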

Sources & references

API prices, GPU hourly rates, throughput figures, and the USD-to-LKR default were last cross-checked against these sources on 2026-06-09. The tool is refreshed each quarter and whenever a major provider changes its pricing. It pairs with the GPU Cloud Cost Calculator (raw rental cost) and the LLM VRAM Calculator (which GPU fits which model).
