How many FLOPs does it take to train an LLM?

The standard estimate is C ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. For example, GPT-3 (175B params, 300B tokens) needs about 3.15×10²³ FLOPs, or roughly 3,640 petaFLOP/s-days. The calculator applies this 6ND rule and also reports GPU-hours and cost.

What is the 6ND rule for transformer training compute?

Kaplan et al. (2020) showed training a dense transformer costs about 6 FLOPs per parameter per token: 2 for the forward pass and 4 for the backward pass (which is roughly twice the forward cost). Multiply by the parameter count N and token count D to get total compute, C ≈ 6ND.

How many tokens should I train a model on (Chinchilla optimal)?

Hoffmann et al. (2022, the Chinchilla paper) found that for a fixed compute budget, parameters and tokens should scale together — about 20 training tokens per parameter. So a 7B model is compute-optimal at roughly 140B tokens, and a 70B model at about 1.4T tokens.
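If you start from a compute budget rather than a model size, the two rules can be combined. The sketch below assumes the 6ND estimate and the 20-tokens-per-parameter rule of thumb both hold exactly, so C = 6 × N × 20N = 120 × N². The function name `compute_optimal_split` is illustrative and not part of the calculator; Hoffmann et al. fit more detailed laws than this shortcut.

```python
import math

TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb (Hoffmann et al. 2022)


def compute_optimal_split(budget_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (N, D), assuming C = 6*N*D and D = 20*N."""
    if budget_flops <= 0:
        raise ValueError("budget_flops must be positive")
    # C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120)
    n_params = math.sqrt(budget_flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(5.88e21)
    print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # 7.0B params, 140B tokens
```

A budget of 5.88×10²¹ FLOPs recovers the 7B / 140B pairing quoted above.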

How many GPU-hours does it take to train a 7B parameter model?

At the Chinchilla-optimal 140B tokens, a 7B model needs about 5.9×10²¹ FLOPs. On an H100 SXM at 40% MFU (≈396 TFLOP/s effective) that is roughly 4,100 GPU-hours — about $10,000 at $2.50/GPU-hour. Lower MFU or a slower GPU raises both numbers proportionally.

What is a petaFLOP/s-day?

One petaFLOP/s-day is the compute of a machine running at 10¹⁵ floating-point operations per second for a full day (86,400 seconds) — that is 8.64×10¹⁹ FLOPs. It is the unit OpenAI used to report GPT-3's training compute (≈3,640 petaFLOP/s-days), so it is handy for comparing run sizes.

What is MFU and why does it matter?

Model FLOPs Utilization is the fraction of a GPU's peak throughput your training run actually achieves after memory, communication, and kernel overhead. Real large runs report 30–55% (PaLM hit ~46%). Because GPU-hours equal compute divided by effective throughput, halving MFU doubles your GPU-hours and cost.
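To estimate MFU for a run you have already measured, divide the model FLOP/s you achieve by the cluster's peak. This is a rough sketch: it counts 6 × N FLOPs per token (the same dense approximation as 6ND, with no attention term), and the throughput of 75,000 tokens/s is a hypothetical figure chosen for illustration.

```python
def estimate_mfu(
    n_params: float,
    tokens_per_second: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Fraction of peak throughput achieved, counting 6*N FLOPs per token."""
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    return achieved_flops_per_second / (n_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical run: 7B model on 8 H100 SXM GPUs (989 TFLOP/s dense BF16 each)
    mfu = estimate_mfu(7e9, 75_000, 8, 989e12)
    print(f"{mfu:.1%}")  # 39.8%
```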

Why does my estimate differ from a model's published GPU-hours?

The 6ND rule is an approximation for dense models. Real runs vary with sequence length (attention adds FLOPs at long context), Mixture-of-Experts routing (compute follows the active parameters, not the total), restarts and evaluation overhead, and the exact MFU achieved. Treat the output as a planning estimate that can be off by roughly ±10–30%, not as an exact figure.

Does this include inference, fine-tuning, or energy cost?

No — this tool covers pre-training compute only. For serving and fine-tuning budgets use the AI GPU cloud and fine-tuning cost calculators; for memory sizing use the LLM VRAM calculator; for energy and carbon use the AI energy & carbon calculator. They are linked under Related tools.

LLM Training Compute (FLOPs) Calculator

Estimate the FLOPs, petaFLOP/s-days, GPU-hours, and dollar cost to pre-train a transformer language model from scratch. Enter the parameter count and tokens, and the tool applies the 6ND rule, checks the Chinchilla-optimal token count, and prices the run. Runs entirely in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 9, 2026.

Training compute estimate

6ND · Kaplan + Chinchilla

Reference runs

Parameters (N)

GPU (peak BF16/FP16)

Training tokens (D)

D is set to 20 × N = 140B tokens (Hoffmann et al. 2022).

MFU (% of peak)

Model FLOPs Utilization — real runs hit 30–55%.

Cost / GPU-hour (USD)

Total compute (FLOPs): 5.88×10²¹

PetaFLOP/s-days: 68.06

GPU-hours: 4,129

Est. cost (USD): $10,321.87

Chinchilla optimality

Compute-optimal

Compute-optimal tokens for 7B params = 20 × 7B = 140B. Your D is 1× the optimum. Tokens are within ±25% of the Chinchilla optimum (20 tokens/param).

Wall-clock time

Cluster size	Wall-clock (ideal)
1 GPU	172 days
8 GPUs	22 days
64 GPUs	2.7 days
512 GPUs	8.1 hrs
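The table above is the GPU-hours figure divided by cluster size. A minimal sketch of that division follows, assuming perfectly linear scaling with no communication overhead (which real clusters do not achieve); `ideal_wall_clock_hours` is an illustrative name, not the tool's own code.

```python
def ideal_wall_clock_hours(
    gpu_hours: float,
    cluster_sizes: tuple[int, ...] = (1, 8, 64, 512),
) -> dict[int, float]:
    """Wall-clock hours per cluster size, assuming ideal linear scaling."""
    if any(n <= 0 for n in cluster_sizes):
        raise ValueError("cluster sizes must be positive")
    return {n: gpu_hours / n for n in cluster_sizes}


if __name__ == "__main__":
    # 7B model at 140B tokens on H100 SXM at 40% MFU: about 4,129 GPU-hours
    for n_gpus, hours in ideal_wall_clock_hours(4128.75).items():
        print(f"{n_gpus:>4} GPUs  {hours:10.1f} h  {hours / 24:8.2f} days")
```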

Magnitude check vs. real runs

Run	Regime	FLOPs (6ND)	PF/s-days	vs. published
GPT-3 175B	Undertrained	3.15×10²³	3,645.83	+0.32%
Chinchilla 70B	Compute-optimal	5.88×10²³	6,805.56	+2.08%
Llama-2 7B	Over-trained (inference-efficient)	8.40×10²²	972.22	n/a

The 6ND estimate reproduces GPT-3's published 3.14×10²³ FLOPs (≈3,640 PF/s-days) to within ~0.3% — the same idea as cross-checking a tax figure against the regulator's own formula.

Compute uses C ≈ 6·N·D from Kaplan et al. (2020) and the 20-tokens-per-parameter optimum from Hoffmann et al. (2022, Chinchilla). GPU peaks are dense BF16/FP16 from NVIDIA datasheets; estimates assume your stated MFU and ideal scaling. This is the dense 6ND approximation — MoE, attention-FLOPs, and very long context can shift the real number ±10–30%.

How it works

Every figure comes from two published scaling-law results, applied as closed-form arithmetic in your browser — nothing is trained, uploaded, or downloaded.

1. Total compute — the 6ND rule

Training a dense transformer costs about six floating-point operations per parameter per token: C ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. The factor 6 is two FLOPs for the forward pass plus four for the backward pass (the backward pass costs roughly twice the forward pass). This is the standard estimate from Kaplan et al.'s Scaling Laws for Neural Language Models (2020).
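As a minimal sketch of this step, the function below applies the 6ND rule with the forward and backward shares kept separate. The constant and function names are illustrative, not the tool's own code.

```python
FORWARD_FLOPS = 2   # FLOPs per parameter per token, forward pass
BACKWARD_FLOPS = 4  # backward pass, roughly twice the forward cost


def training_flops(n_params: float, n_tokens: float) -> float:
    """Dense-transformer training compute, C ≈ 6 * N * D."""
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    return (FORWARD_FLOPS + BACKWARD_FLOPS) * n_params * n_tokens


if __name__ == "__main__":
    # GPT-3: 175B parameters, 300B training tokens
    print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```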

2. PetaFLOP/s-days

To compare runs, total FLOPs are divided by the compute in one petaFLOP/s-day: 1 PF/s-day = 10¹⁵ × 86,400 = 8.64×10¹⁹ FLOPs. This is the unit OpenAI used to report GPT-3 at ≈3,640 petaFLOP/s-days.
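The unit conversion is a single division. A short sketch, with illustrative names:

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY  # 8.64e19 FLOPs


def to_pfs_days(total_flops: float) -> float:
    """Convert a FLOP count to petaFLOP/s-days."""
    return total_flops / FLOPS_PER_PFS_DAY


if __name__ == "__main__":
    print(round(to_pfs_days(3.15e23), 2))  # 3645.83 (GPT-3 under 6ND)
```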

3. Chinchilla-optimal tokens

Hoffmann et al. (2022) showed that for a fixed compute budget, model size and dataset size should grow together, at about 20 training tokens per parameter (D ≈ 20 × N). The tool computes that optimum and flags your run as undertrained (well below it), compute-optimal (within ±25%), or over-trained (well above it; a deliberate choice when you want a smaller model that is cheaper to serve, as with Llama-2).
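The classification can be sketched as a ratio test against the optimum. The ±25% band comes from the text above; treating everything below 0.75× as undertrained and everything above 1.25× as over-trained is an assumption here, since the tool's exact cutoffs are not stated.

```python
TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb
BAND = 0.25              # "compute-optimal" means within ±25% of the optimum


def classify_run(n_params: float, n_tokens: float) -> tuple[float, str]:
    """Return (ratio of D to the Chinchilla optimum, regime label)."""
    optimum_tokens = TOKENS_PER_PARAM * n_params
    ratio = n_tokens / optimum_tokens
    if ratio < 1 - BAND:      # assumed cutoff for "well below"
        label = "undertrained"
    elif ratio > 1 + BAND:    # assumed cutoff for "well above"
        label = "over-trained"
    else:
        label = "compute-optimal"
    return ratio, label


if __name__ == "__main__":
    for name, n, d in [
        ("GPT-3 175B", 175e9, 300e9),
        ("7B at 140B tokens", 7e9, 140e9),
        ("Llama-2 7B", 7e9, 2e12),
    ]:
        ratio, label = classify_run(n, d)
        print(f"{name}: {ratio:.2f}x optimum, {label}")
```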

4. GPU-hours, wall-clock, and cost

GPU-hours are the compute divided by the GPU's effective throughput: GPU-hours = C / (peak FLOP/s × MFU) / 3600. Peak throughput is the dense BF16/FP16 tensor figure from the NVIDIA datasheet (A100 312 TFLOP/s, H100 SXM 989 TFLOP/s), and MFU (Model FLOPs Utilization) is the fraction of peak a real run achieves — typically 30–55%. Wall-clock time divides GPU-hours by the number of GPUs (ideal scaling), and cost multiplies GPU-hours by your price per GPU-hour.
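A minimal sketch of the GPU-hours and cost arithmetic, using the two datasheet peaks quoted above. The dictionary and function names are illustrative; the MFU and price are whatever you supply.

```python
# Dense BF16/FP16 tensor peaks in FLOP/s, as quoted in the text
PEAK_FLOPS = {"A100": 312e12, "H100 SXM": 989e12}


def gpu_hours(total_flops: float, peak_flops: float, mfu: float) -> float:
    """GPU-hours = C / (peak FLOP/s * MFU) / 3600."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    return total_flops / (peak_flops * mfu) / 3600


def cost_usd(hours: float, price_per_gpu_hour: float) -> float:
    """Cost is GPU-hours times the hourly price."""
    return hours * price_per_gpu_hour


if __name__ == "__main__":
    hours = gpu_hours(5.88e21, PEAK_FLOPS["H100 SXM"], 0.40)
    print(round(hours))                       # 4129
    print(f"${cost_usd(hours, 2.50):,.2f}")   # $10,321.87
```

Because MFU sits in the denominator, halving it doubles both outputs.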

5. Cross-check against published runs

As an independent check, the 6ND formula is applied to three documented runs. For GPT-3 and Chinchilla — where the papers state a compute number — the tool shows the percentage difference. The 6ND estimate reproduces GPT-3's published 3.14×10²³ FLOPs to within about 0.3%, the same idea as cross-checking a tax figure against the regulator's own formula. This is the dense approximation: Mixture-of-Experts, the attention-FLOPs correction, and very long context can shift the real number by ±10–30%.
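The percentage shown in the cross-check is a signed relative difference. The sketch below uses only the GPT-3 figure quoted in this section; the function name is illustrative.

```python
def percent_diff(estimate: float, published: float) -> float:
    """Signed difference of the estimate from the published value, in percent."""
    if published == 0:
        raise ValueError("published must be non-zero")
    return (estimate - published) / published * 100


if __name__ == "__main__":
    # GPT-3: 6ND estimate vs. the published 3.14e23 FLOPs
    print(f"{percent_diff(3.15e23, 3.14e23):+.2f}%")  # +0.32%
```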

Worked examples

Reproduce GPT-3 — 175B params, 300B tokens

Total compute: 6 × 175e9 × 300e9 = 3.15×10²³ FLOPs (published 3.14×10²³, +0.3%)
PetaFLOP/s-days: 3.15×10²³ / 8.64×10¹⁹ = 3,646 (published ≈ 3,640 ✓)
Chinchilla optimum: 20 × 175e9 = 3.5×10¹² → D is 0.09× the optimum ⇒ undertrained
A100 @ 30% MFU: 3.15×10²³ / (312e12 × 0.30) / 3600 ≈ 934,829 GPU-hours
Cost at $1.50/GPU-hour ≈ $1.40M

7B model — Chinchilla-optimal mode

Tokens: 20 × 7e9 = 1.4×10¹¹ (140B), ratio 1.0 ⇒ compute-optimal
Total compute: 6 × 7e9 × 1.4e11 = 5.88×10²¹ FLOPs
PetaFLOP/s-days: 5.88×10²¹ / 8.64×10¹⁹ = 68.1
H100 @ 40% MFU: 5.88×10²¹ / (989e12 × 0.40) / 3600 ≈ 4,129 GPU-hours
Cost at $2.50/GPU-hour ≈ $10,300

Edge case — Llama-2 7B was deliberately over-trained

Real Llama-2 7B used ~2T tokens, not the 140B Chinchilla optimum
Total compute: 6 × 7e9 × 2e12 = 8.4×10²² FLOPs ; 972 PF/s-days
Ratio: 2e12 / 1.4e11 = 14.3× the optimum ⇒ over-trained
Why: training a small model on more tokens makes it stronger, and a small model is cheaper to serve at inference

Frequently asked questions

Sources & references

Formulas, constants, and GPU peak-throughput figures were last verified against the sources above on 2026-06-09. The output is a planning estimate, not a guarantee: real runs vary by roughly ±10–30% with architecture, context length, and achieved MFU.

Related tools

AI Parameter Count Calc

Compute the exact parameter count of a decoder-only (GPT-style) transformer from its architecture — vocab, hidden size, layers, FFN size, and head config — broken down into embedding, attention, feed-forward, and norm shares. GPT-2 124M and GPT-3 175B verified, formulas cited.

GPU Cloud Cost Calculator

Estimate what it costs to rent cloud GPUs (RTX 4090, A100, H100, B200) to train or serve an AI model, and compare the same job across RunPod, Lambda, Vast.ai, and AWS — on-demand and spot — in USD and LKR.

Self-Hosting Cost Calc

Find the monthly token volume where renting a cloud GPU to self-host an open LLM (Llama, Mistral, Qwen) beats paying a closed API per token. Shows both monthly costs, the crossover volume, and a self-host or stay-on-API verdict in USD and LKR.
