
LLM Training Compute (FLOPs) Calculator

Estimate the FLOPs, petaFLOP/s-days, GPU-hours, and dollar cost to pre-train a transformer language model from scratch. Enter the parameter count and tokens, and the tool applies the 6ND rule, checks the Chinchilla-optimal token count, and prices the run. Runs entirely in your browser.

By Induwara Ashinsana · Updated Jun 9, 2026
Training compute estimate
6ND · Kaplan + Chinchilla

D is set to 20 × N = 140B tokens (Hoffmann et al. 2022).

Model FLOPs Utilization — real runs hit 30–55%.

Total compute (FLOPs): 5.88×10²¹
PetaFLOP/s-days: 68.06
GPU-hours: 4,129
Est. cost (USD): $10,321.87
Chinchilla optimality: Compute-optimal

Compute-optimal tokens for 7B params = 20 × 7B = 140B. Your D is 1× the optimum. Tokens are within ±25% of the Chinchilla optimum (20 tokens/param).

Wall-clock time

Cluster size    Wall-clock (ideal)
1 GPU           172 days
8 GPUs          22 days
64 GPUs         2.7 days
512 GPUs        8.1 hrs

Magnitude check vs. real runs

Run               Chinchilla optimality                 FLOPs (6ND)   PF/s-days   vs. published
GPT-3 175B        Undertrained                          3.15×10²³     3,645.83    +0.32%
Chinchilla 70B    Compute-optimal                       5.88×10²³     6,805.56    +2.08%
Llama-2 7B        Over-trained (inference-efficient)    8.40×10²²     972.22      n/a

The 6ND estimate reproduces GPT-3's published 3.14×10²³ FLOPs (≈3,640 PF/s-days) to within ~0.3% — the same idea as cross-checking a tax figure against the regulator's own formula.

Compute uses C ≈ 6·N·D from Kaplan et al. (2020) and the 20-tokens-per-parameter optimum from Hoffmann et al. (2022, Chinchilla). GPU peaks are dense BF16/FP16 from NVIDIA datasheets; estimates assume your stated MFU and ideal scaling. This is the dense 6ND approximation — MoE, attention-FLOPs, and very long context can shift the real number ±10–30%.

How it works

Every figure comes from two published scaling-law results, applied as closed-form arithmetic in your browser — nothing is trained, uploaded, or downloaded.

1. Total compute — the 6ND rule

Training a dense transformer costs about six floating-point operations per parameter per token: C ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. The factor 6 is two FLOPs for the forward pass plus four for the backward pass (the backward pass costs roughly twice the forward pass). This is the standard estimate from Kaplan et al.'s Scaling Laws for Neural Language Models (2020).
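The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own source; the function name `training_flops` and the constant names are hypothetical.

```python
# Hypothetical helper illustrating C ≈ 6 × N × D for a dense transformer.
FORWARD_FLOPS = 2.0   # FLOPs per parameter per token, forward pass
BACKWARD_FLOPS = 4.0  # backward pass, roughly twice the forward pass


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return (FORWARD_FLOPS + BACKWARD_FLOPS) * n_params * n_tokens


# GPT-3: 175B parameters, 300B tokens
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```

The split into forward and backward constants only makes the origin of the factor 6 visible; the result is the same as multiplying by 6 directly.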

2. PetaFLOP/s-days

To compare runs, total FLOPs are divided by the compute in one petaFLOP/s-day: 1 PF/s-day = 10¹⁵ × 86,400 = 8.64×10¹⁹ FLOPs. This is the unit OpenAI used to report GPT-3 at ≈3,640 petaFLOP/s-days.
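As a rough sketch of the unit conversion described above (the names `FLOPS_PER_PFS_DAY` and `to_pfs_days` are illustrative, not taken from the tool):

```python
SECONDS_PER_DAY = 86_400
# One petaFLOP/s sustained for one day: 1e15 × 86,400 = 8.64e19 FLOPs
FLOPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY


def to_pfs_days(total_flops: float) -> float:
    """Convert a total FLOP count into petaFLOP/s-days."""
    return total_flops / FLOPS_PER_PFS_DAY


# 6ND estimate for GPT-3 (3.15e23 FLOPs)
print(round(to_pfs_days(3.15e23), 2))  # 3645.83
```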

3. Chinchilla-optimal tokens

Hoffmann et al. (2022) showed that for a fixed compute budget, model size and dataset size should grow together, at about 20 training tokens per parameter (D ≈ 20 × N). The tool computes that optimum and flags your run as undertrained (well below it), compute-optimal (within ±25%), or over-trained (well above it; a deliberate choice when you want a smaller model that is cheaper to serve, as with Llama-2).
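One way to express that check in code is shown below. The ±25% band for "compute-optimal" comes from the text; treating everything outside that band as undertrained or over-trained is an assumption of this sketch, and the function name `classify_run` is hypothetical.

```python
TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb: D ≈ 20 × N
TOLERANCE = 0.25         # "compute-optimal" band of ±25% around the optimum


def classify_run(n_params: float, n_tokens: float) -> tuple[float, str]:
    """Return (ratio of D to the Chinchilla optimum, label)."""
    optimum_tokens = TOKENS_PER_PARAM * n_params
    ratio = n_tokens / optimum_tokens
    # Assumption: anything outside the ±25% band counts as under/over-trained.
    if ratio < 1.0 - TOLERANCE:
        label = "undertrained"
    elif ratio > 1.0 + TOLERANCE:
        label = "over-trained"
    else:
        label = "compute-optimal"
    return ratio, label


for name, n, d in [("GPT-3 175B", 175e9, 300e9),
                   ("7B optimal", 7e9, 140e9),
                   ("Llama-2 7B", 7e9, 2e12)]:
    ratio, label = classify_run(n, d)
    print(f"{name}: {ratio:.2f}x optimum, {label}")
```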

4. GPU-hours, wall-clock, and cost

GPU-hours are the compute divided by the GPU's effective throughput: GPU-hours = C / (peak FLOP/s × MFU) / 3600. Peak throughput is the dense BF16/FP16 tensor figure from the NVIDIA datasheet (A100 312 TFLOP/s, H100 SXM 989 TFLOP/s), and MFU (Model FLOPs Utilization) is the fraction of peak a real run achieves — typically 30–55%. Wall-clock time divides GPU-hours by the number of GPUs (ideal scaling), and cost multiplies GPU-hours by your price per GPU-hour.
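The arithmetic in this step can be sketched as follows. The peak figures are the two quoted above; the function names and the `PEAK_FLOPS` table are illustrative, and the MFU and price are whatever you supply.

```python
# Dense BF16/FP16 peak throughput in FLOP/s, as quoted in the text
PEAK_FLOPS = {"A100": 312e12, "H100 SXM": 989e12}


def gpu_hours(total_flops: float, peak_flops: float, mfu: float) -> float:
    """GPU-hours = C / (peak FLOP/s × MFU) / 3600."""
    return total_flops / (peak_flops * mfu) / 3600


def wall_clock_hours(hours: float, n_gpus: int) -> float:
    """Ideal scaling: the work divides evenly across GPUs."""
    return hours / n_gpus


def cost_usd(hours: float, price_per_gpu_hour: float) -> float:
    return hours * price_per_gpu_hour


# 7B model, 140B tokens (5.88e21 FLOPs) on H100 SXM at 40% MFU
hours = gpu_hours(5.88e21, PEAK_FLOPS["H100 SXM"], 0.40)
print(round(hours))                            # 4129
print(round(wall_clock_hours(hours, 512), 1))  # 8.1
print(f"${cost_usd(hours, 2.50):,.2f}")        # $10,321.87
```

Ideal scaling ignores communication overhead, so real wall-clock time on large clusters will be longer than this division suggests.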

5. Cross-check against published runs

As an independent check, the 6ND formula is applied to three documented runs. For GPT-3 and Chinchilla — where the papers state a compute number — the tool shows the percentage difference. The 6ND estimate reproduces GPT-3's published 3.14×10²³ FLOPs to within about 0.3%, the same idea as cross-checking a tax figure against the regulator's own formula. This is the dense approximation: Mixture-of-Experts, the attention-FLOPs correction, and very long context can shift the real number by ±10–30%.
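The percentage difference can be reproduced with a signed relative error. This sketch uses only the GPT-3 figure quoted above; the function name `percent_diff` is hypothetical.

```python
def percent_diff(estimate: float, published: float) -> float:
    """Signed difference of the estimate from the published value, in percent."""
    return (estimate - published) / published * 100.0


estimate = 6 * 175e9 * 300e9  # 6ND for GPT-3: 3.15e23 FLOPs
published = 3.14e23           # figure quoted in the text
print(f"{percent_diff(estimate, published):+.2f}%")  # +0.32%
```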

Worked examples

Reproduce GPT-3 — 175B params, 300B tokens

  1. Total compute: 6 × 175e9 × 300e9 = 3.15×10²³ FLOPs (published 3.14×10²³, +0.3%)
  2. PetaFLOP/s-days: 3.15×10²³ / 8.64×10¹⁹ = 3,646 (published ≈ 3,640 ✓)
  3. Chinchilla optimum: 20 × 175e9 = 3.5×10¹² → D is 0.09× the optimum ⇒ undertrained
  4. A100 @ 30% MFU: 3.15×10²³ / (312e12 × 0.30) / 3600 ≈ 934,829 GPU-hours
  5. Cost at $1.50/GPU-hour ≈ $1.40M

7B model — Chinchilla-optimal mode

  1. Tokens: 20 × 7e9 = 1.4×10¹¹ (140B), ratio 1.0 ⇒ compute-optimal
  2. Total compute: 6 × 7e9 × 1.4e11 = 5.88×10²¹ FLOPs
  3. PetaFLOP/s-days: 5.88×10²¹ / 8.64×10¹⁹ = 68.1
  4. H100 @ 40% MFU: 5.88×10²¹ / (989e12 × 0.40) / 3600 ≈ 4,129 GPU-hours
  5. Cost at $2.50/GPU-hour ≈ $10,300

Edge case — Llama-2 7B was deliberately over-trained

  1. Real Llama-2 7B used ~2T tokens, not the 140B Chinchilla optimum
  2. Total compute: 6 × 7e9 × 2e12 = 8.4×10²² FLOPs ; 972 PF/s-days
  3. Ratio: 2e12 / 1.4e11 = 14.3× the optimum ⇒ over-trained
  4. Why: training a small model on more data makes it stronger for its size, and a smaller model is cheaper to serve at inference


Sources & references

Formulas, constants, and GPU peak-throughput figures were last verified against the sources above on 2026-06-09. The output is an order-of-magnitude planning estimate, not a guarantee — real runs vary ±10–30% with architecture, context length, and achieved MFU.

