Is it cheaper to self-host an LLM or use the OpenAI/Claude API?

It depends entirely on volume. A rented GPU is a fixed monthly cost whether it serves one token or a billion; an API bills per token. Below a crossover volume the API is cheaper because you aren't paying for an idle 24/7 GPU. Above it, self-hosting wins. This tool computes that exact crossover for your chosen GPU, open model, and token mix.

At what volume does running my own LLM become worth it?

For a single A100 80GB on-demand (~$1.19/hr ≈ $998/month with 15% overhead) serving Llama 3.1 8B against GPT-5 mini, the break-even sits near 1.45 billion tokens per month at a 3:1 input:output mix. Cheaper APIs push it higher; pricier APIs or spot GPUs pull it lower. Enter your own numbers to see your figure.

How much does it cost to self-host Llama 3.1 or Mistral per month?

Only the GPU rental plus your operational overhead — open weights are free to run. One A100 80GB at ~$1.19/hr is about $868/month at 24/7 (729.6 hours), or roughly $998 after a 15% ops markup. An RTX 4090 is far cheaper (~$0.69/hr) but holds smaller models. The calculator shows the exact figure for your GPU, count, and utilisation.

How many tokens per second can one A100 or H100 serve?

With vLLM continuous batching, a single A100 80GB serves roughly 2,500 output tokens/sec for an 8B model and an H100 around 3,800. A 70B model, tensor-parallel-sharded across two cards, lands far lower per card. This tool uses conservative published steady-state numbers, not peak benchmarks, so it errs against over-promising capacity.
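
To put those tokens/sec figures on the same monthly scale as the rest of the page, here is a minimal sketch that converts a steady-state rate into a monthly output ceiling. The function name and parameters are illustrative, not the tool's own code; the 729.6-hour month and both throughput figures come from this page.

```python
HOURS_PER_MONTH = 24 * 30.4  # 729.6, the average month used on this page


def monthly_output_capacity(tokens_per_sec: float,
                            gpu_count: int = 1,
                            utilisation: float = 1.0) -> float:
    """Output tokens a GPU pool can generate in one month (hypothetical helper)."""
    return tokens_per_sec * gpu_count * 3600 * HOURS_PER_MONTH * utilisation


a100 = monthly_output_capacity(2_500)  # 8B model on one A100 80GB
h100 = monthly_output_capacity(3_800)  # 8B model on one H100

print(f"A100: {a100 / 1e9:.2f}B output tokens/month")  # 6.57B
print(f"H100: {h100 / 1e9:.2f}B output tokens/month")  # 9.98B
```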

What GPU do I need to self-host a 70B model and what does it cost?

A 70B model in fp16 needs about 140GB of VRAM for the weights alone (70 billion parameters × 2 bytes), plus room for a working cache, so one 80GB card is not enough; you need at least two A100/H100 80GB GPUs. At two H100s on-demand that is roughly $3,900/month. The calculator flags the VRAM shortfall and tells you to add GPUs rather than reporting a false saving.
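
The VRAM arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not the calculator's actual check: it counts weights at a fixed number of bytes per parameter and treats the cache allowance as a caller-supplied guess (`cache_gb` is a hypothetical parameter).

```python
import math


def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB; fp16 uses 2 bytes per parameter."""
    return params_billion * bytes_per_param


def min_gpus(params_billion: float,
             vram_per_gpu_gb: float = 80.0,
             cache_gb: float = 0.0,
             bytes_per_param: float = 2.0) -> int:
    """Smallest GPU count whose pooled VRAM holds weights plus the assumed cache."""
    needed = weights_gb(params_billion, bytes_per_param) + cache_gb
    return math.ceil(needed / vram_per_gpu_gb)


print(weights_gb(70))             # 140.0 GB of fp16 weights
print(min_gpus(70))               # 2 cards of 80 GB
print(min_gpus(8, cache_gb=2.0))  # 1 card (18 GB needed, as in the assumptions table)
```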

Does the calculator account for spot vs on-demand GPU pricing?

Yes. Toggle between on-demand (secure, never reclaimed) and spot/community (cheaper, can be pre-empted mid-run). Spot roughly halves the GPU cost, which lowers the break-even volume — but you accept interruption risk. The assumptions table shows the exact $/hr used in each mode.
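
Because the self-host cost is the numerator of the break-even formula, the break-even volume scales in direct proportion to the GPU rate. The sketch below illustrates this with the page's A100 example; the spot rate of exactly half the on-demand rate is an assumption taken from "roughly halves", not a quoted price, and the function name is illustrative.

```python
def break_even_millions(gpu_hourly: float,
                        blended_price_per_m: float,
                        gpu_count: int = 1,
                        hours: float = 729.6,
                        overhead: float = 0.15) -> float:
    """Monthly token volume (in millions) at which self-hosting matches the API."""
    self_cost = gpu_hourly * gpu_count * hours * (1 + overhead)
    return self_cost / blended_price_per_m


BLENDED = 0.6875  # $/1M tokens for GPT-5 mini at a 3:1 input:output mix (from the page)

on_demand = break_even_millions(1.19, BLENDED)
spot = break_even_millions(1.19 * 0.5, BLENDED)  # ASSUMED spot rate: half of on-demand

print(f"on-demand break-even: {on_demand:,.0f}M tokens/month")  # 1,452M
print(f"spot break-even:      {spot:,.0f}M tokens/month")       # 726M
```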

Why does the tool sometimes say self-hosting can't serve my volume?

Two guards stop false savings. First, a VRAM check: if the GPU pool can't physically load the model, it tells you to add GPUs. Second, a throughput check: if your monthly output tokens exceed what the GPU pool can generate at the chosen utilisation, it flags the shortfall. Only when both pass does it compare costs and pick a winner.
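
The order of the two guards can be sketched like this. It is an illustration of the logic described above, not the tool's source; the `Pool` class, `check_pool` function, and the returned messages are all hypothetical names.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 729.6  # 24 x 30.4


@dataclass
class Pool:
    gpu_count: int
    vram_per_gpu_gb: float
    tokens_per_sec: float      # per GPU, conservative steady-state
    utilisation: float = 1.0   # share of the month spent serving


def check_pool(pool: Pool, model_vram_gb: float, monthly_output_tokens: float) -> str:
    """Run the VRAM guard first, then the throughput guard."""
    if model_vram_gb > pool.gpu_count * pool.vram_per_gpu_gb:
        return "add GPUs: model does not fit in VRAM"
    capacity = (pool.tokens_per_sec * pool.gpu_count * 3600
                * HOURS_PER_MONTH * pool.utilisation)
    if monthly_output_tokens > capacity:
        return "add GPUs: output volume exceeds throughput capacity"
    return "ok: compare costs"


a100 = Pool(gpu_count=1, vram_per_gpu_gb=80, tokens_per_sec=2_500)
print(check_pool(a100, model_vram_gb=18, monthly_output_tokens=40e6))   # ok
print(check_pool(a100, model_vram_gb=140, monthly_output_tokens=40e6))  # VRAM guard
print(check_pool(a100, model_vram_gb=18, monthly_output_tokens=20e9))   # throughput guard
```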

Are training and fine-tuning costs included?

No — this is an inference-only break-even tool. It assumes you serve an existing open-weight checkpoint. Fine-tuning or training costs are separate one-time spends covered by the Fine-Tuning Cost Calculator and GPU Cloud Cost Calculator. Electricity for owned hardware is also out of scope; this models cloud GPU rental only.

When were these prices last verified?

The API per-token prices, GPU hourly rates, throughput figures, and the USD-to-LKR default were last cross-checked against provider, RunPod/Lambda, vLLM, and CBSL sources on 2026-06-09. Provider prices move often, so the tool states its verified-on date inline and is refreshed each quarter.


Self-Hosting LLM Cost Calculator (vs API Break-Even)

Renting a GPU to self-host an open model is a flat monthly cost; a closed API bills per token. This tool finds the exact monthly token volume where self-hosting starts to win, shows both costs in USD and LKR, and gives a plain self-host or stay-on-API verdict. No signup, sources cited.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 9, 2026

Self-host vs API break-even

Prices verified 2026-06-09

Monthly workload

Input tokens / month

Prompt tokens you send, in millions. e.g. 120 = 120M.

Output tokens / month

Tokens the model generates, in millions. e.g. 40 = 40M.

Example workloads

Compare these two paths

Closed API model

Open model to self-host

GPU rental setup

GPU

Number of GPUs

Integer 1–8. Add GPUs to fit bigger models.

Pricing mode

Spot is cheaper but can be reclaimed mid-run.

Utilisation %

Share of the month the GPU is actually serving (1–100).

Overhead %

Ops, storage, egress markup on GPU cost (0–100).

USD → LKR

CBSL indicative default. Used only for LKR lines.

Stay on the API — self-hosting costs $888.46 more/mo

At 160M tokens/month your volume is too low to amortise a 24/7 GPU. The API bill is $110.00 vs $998.46 to self-host.

API monthly cost

$110.00

Rs 33,550

Self-host monthly cost

$998.46

Rs 304,530

Self-host costs extra

$888.46

Rs 270,980

Break-even volume

With this GPU, model and mix, self-hosting beats the API above 1.45B tokens/month. You're at 160M — below it, so the API wins.

[Chart: monthly token volume on a scale from 0 to 2.9B, with the break-even marked at 1.45B]

Assumptions

Input	Value used	Source
API price (in / out)	$0.25 / $2 per 1M	OpenAI
GPU rate	$1.19/hr × 1 (on-demand)	RunPod/Lambda
Hours/month	729.6 hr (24 × 30.4 × 100%)	calendar
Overhead	15% → $130.23	your input
Serving throughput	2,500 tok/s → 6.57B/mo capacity	vLLM
VRAM	80 GB available · 18 GB needed	8B model

Fixed GPU cost is volume-independent; API cost scales with tokens. The throughput figure is conservative steady-state, not peak, so the capacity check never reports a self-host saving the GPU couldn't actually serve.

API prices are each provider's published list price; GPU $/hr are RunPod/Lambda reference rates; tokens/sec are conservative vLLM steady-state figures. All last verified 2026-06-09 and listed with links below the calculator. Open-weight model weights are free to run — the only self-host cost modelled is GPU rental plus your overhead.

How it works

Two cost curves cross. A closed-LLM API charges per token, so its monthly bill rises in a straight line with your usage. A rented cloud GPU costs the same every month whether it is busy or idle — open-weight model weights (Llama, Mistral, Qwen) are free to run, so the only cost is the GPU plus your overhead. Because one line slopes up and the other is flat, there is a single crossover volume. Below it the API is cheaper; above it self-hosting wins. All figures are monthly, using an average month of 30.4 days, so hours per month H = 24 × 30.4 = 729.6.
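
Written out as equations, the crossover looks like this. The symbols are shorthand introduced here for illustration: T is total monthly tokens in millions, r the input share of those tokens, g the GPU hourly rate, n the GPU count, u the utilisation, and o the overhead fraction.

```latex
\begin{aligned}
C_{\text{api}}(T) &= \bar{p}\,T, \qquad \bar{p} = r\,p_{\text{in}} + (1 - r)\,p_{\text{out}} \\
C_{\text{self}} &= g \cdot n \cdot H \cdot u \cdot (1 + o), \qquad H = 24 \times 30.4 = 729.6 \\
T_{be} &= \frac{C_{\text{self}}}{\bar{p}} \qquad \text{(from } C_{\text{api}}(T_{be}) = C_{\text{self}}\text{)}
\end{aligned}
```

Since the API cost is linear in T and the self-host cost does not depend on T, the two lines meet at exactly one volume.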

API path. api_cost = in_M × price_in + out_M × price_out, where in_M and out_M are input and output tokens in millions and the prices are the selected model's per-million-token USD list prices.
Self-host path. gpu_cost = gpu_hourly × gpu_count × H × utilisation, then self_cost = gpu_cost × (1 + overhead). Within the chosen pool this is independent of token volume — a flat line.
Capacity check. Effective output capacity = tps × gpu_count × 3600 × H × utilisation, using a conservative vLLM steady-state tokens/sec figure. If your monthly output exceeds it — or the model doesn't fit in VRAM — the tool says “add GPUs” instead of reporting a false saving.
Break-even volume. Holding your input share r = in_M / (in_M + out_M) fixed, the blended price is r × price_in + (1 − r) × price_out per million tokens, so the crossover is T_be = self_cost / blended_price, expressed in millions of tokens per month.
Verdict and LKR. Compare api_cost against self_cost at your actual volume, report the signed difference, then multiply every USD figure by your USD-to-LKR rate for the rupee line.
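
The cost steps above can be sketched in a few lines of Python. This is a minimal illustration of the published formulas, not the calculator's source code: the `Inputs` class and function names are hypothetical, and the default rate of 305 LKR per USD is inferred from the rupee figures shown on this page rather than stated by it.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30.4  # H = 729.6


@dataclass
class Inputs:
    in_m: float                # input tokens per month, in millions
    out_m: float               # output tokens per month, in millions
    price_in: float            # USD per 1M input tokens
    price_out: float           # USD per 1M output tokens
    gpu_hourly: float          # USD per GPU-hour
    gpu_count: int = 1
    utilisation: float = 1.0   # 1.0 = 100%
    overhead: float = 0.15     # ops markup on GPU cost
    usd_to_lkr: float = 305.0  # ASSUMED: inferred from the page's Rs figures


def api_cost(x: Inputs) -> float:
    return x.in_m * x.price_in + x.out_m * x.price_out


def self_cost(x: Inputs) -> float:
    gpu_cost = x.gpu_hourly * x.gpu_count * HOURS_PER_MONTH * x.utilisation
    return gpu_cost * (1 + x.overhead)


def break_even_m(x: Inputs) -> float:
    """Break-even volume in millions of tokens, holding the input share fixed."""
    r = x.in_m / (x.in_m + x.out_m)
    blended = r * x.price_in + (1 - r) * x.price_out
    return self_cost(x) / blended


# GPT-5 mini vs Llama 3.1 8B on one on-demand A100 80GB
bot = Inputs(in_m=120, out_m=40, price_in=0.25, price_out=2.00, gpu_hourly=1.19)
diff = self_cost(bot) - api_cost(bot)  # positive means the API is cheaper

print(f"API ${api_cost(bot):.2f} vs self-host ${self_cost(bot):.2f}")  # 110.00 vs 998.46
print(f"difference ${diff:.2f} (Rs {diff * bot.usd_to_lkr:,.0f})")     # 888.46
print(f"break-even {break_even_m(bot):,.0f}M tokens/month")            # 1,452M
```

The capacity and VRAM checks are deliberately left out of this sketch; they gate whether the comparison is reported at all.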

Prices come from the providers' own pages (OpenAI, Anthropic, Google, DeepSeek); GPU hourly rates from RunPod and Lambda; tokens/sec from the vLLM benchmark suite; and the FX default from the Central Bank of Sri Lanka. Each figure carries a last-verified date because provider pricing changes without notice. The throughput table deliberately uses steady-state rather than peak numbers so the tool errs against over-promising self-hosting.

Worked examples

Small support bot — the API wins

120M input + 40M output · GPT-5 mini vs Llama 3.1 8B on 1× A100 80GB

API: 120 × $0.25 + 40 × $2.00 = $30 + $80 = $110.00/mo
GPU: $1.19 × 1 × 729.6 = $868.22; +15% overhead = $998.46/mo
Verdict: stay on API — self-hosting costs $888.46 more
Mix r = 120/160 = 0.75; blended = 0.75×0.25 + 0.25×2.00 = $0.6875/M
Break-even: $998.46 / $0.6875 ≈ 1,452M (1.45B) tokens/mo
At 160M tokens you're far below break-even — correctly: don't self-host

High-volume product — self-hosting wins

1,500M input + 500M output (2B total) · same A100 setup

API: 1500 × $0.25 + 500 × $2.00 = $375 + $1,000 = $1,375.00/mo
Self-host (flat, unchanged): $998.46/mo
Verdict: self-host saves $376.54/mo
Capacity: A100 serves ~2,500 tok/s → ~6.57B output tokens/mo ≥ 500M ✓
Consistency: 2B total > 1.45B break-even, so self-hosting wins — matches example 1

Edge case — capacity guard fires

0M input + 20,000M output · Llama 3.1 8B on 1× RTX 4090

Capacity: 1,300 tok/s × 1 × 3600 × 729.6 ≈ 3.41B output tokens/mo
Requested output: 20,000M (20B) > 3.41B servable
Verdict: setup can't serve this volume — add GPUs, no saving reported
This guard stops the tool from claiming a saving the GPU couldn't deliver
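
As a follow-on to this edge case, the sketch below estimates how many GPUs the throughput check would need before it passes. It assumes capacity scales linearly with GPU count, as the capacity formula on this page does; the function name is illustrative and the result ignores the VRAM check and pricing.

```python
import math

HOURS_PER_MONTH = 729.6  # 24 x 30.4


def gpus_for_output(monthly_output_tokens: float,
                    tokens_per_sec: float,
                    utilisation: float = 1.0) -> int:
    """GPU count needed to serve the output volume, assuming linear scaling."""
    per_gpu = tokens_per_sec * 3600 * HOURS_PER_MONTH * utilisation
    return math.ceil(monthly_output_tokens / per_gpu)


# 20,000M output tokens on RTX 4090s at 1,300 tok/s each
needed = gpus_for_output(20_000e6, 1_300)
print(needed)  # 6
```

Six cards is still inside the form's 1–8 GPU range, so this workload could be re-run with a larger pool.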

Frequently asked questions

Sources & references

API prices, GPU hourly rates, throughput figures, and the USD-to-LKR default were last cross-checked against these sources on 2026-06-09. The tool is refreshed each quarter and whenever a major provider changes its pricing. It pairs with the GPU Cloud Cost Calculator (raw rental cost) and the LLM VRAM Calculator (which GPU fits which model).

