LLM Training Compute (FLOPs) Calculator
Estimate the FLOPs, petaFLOP/s-days, GPU-hours, and dollar cost to pre-train a transformer language model from scratch. Enter the parameter count and the number of training tokens, and the tool applies the 6ND rule, compares your token count with the Chinchilla-optimal one, and prices the run. Runs entirely in your browser.
How it works
Every figure comes from two published scaling-law results, applied as closed-form arithmetic in your browser — nothing is trained, uploaded, or downloaded.
1. Total compute — the 6ND rule
Training a dense transformer costs about six floating-point operations per parameter per token: C ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. The factor 6 is two FLOPs for the forward pass plus four for the backward pass (the backward pass costs roughly twice the forward pass). This is the standard estimate from Kaplan et al.'s Scaling Laws for Neural Language Models (2020).
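As a minimal sketch of the 6ND arithmetic described above (the function name and the 7B-parameter, 1T-token inputs are illustrative assumptions, not the tool's own code):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training compute, C ~ 6 * N * D."""
    forward_flops = 2   # per parameter per token (one multiply, one add)
    backward_flops = 4  # roughly twice the forward pass
    return (forward_flops + backward_flops) * n_params * n_tokens


# Illustrative inputs (assumed): 7e9 parameters, 1e12 training tokens.
flops_7b = training_flops(7e9, 1e12)
print(f"{flops_7b:.2e} FLOPs")  # on the order of 4e22
```

Because the rule is a single product, doubling either the parameter count or the token count doubles the estimate.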
2. PetaFLOP/s-days
To compare runs, total FLOPs are divided by the compute in one petaFLOP/s-day: 1 PF/s-day = 10¹⁵ × 86,400 = 8.64×10¹⁹ FLOPs. This is the unit OpenAI used to report GPT-3 at ≈3,640 petaFLOP/s-days.
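The unit conversion can be sketched as follows. The constant and function names are assumptions made for illustration; the 3.14×10²³ FLOPs input is GPT-3's published training compute.

```python
SECONDS_PER_DAY = 86_400
PFS_DAY_FLOPS = 1e15 * SECONDS_PER_DAY  # 8.64e19 FLOPs in one petaFLOP/s-day


def to_pfs_days(total_flops: float) -> float:
    """Convert a raw FLOP count into petaFLOP/s-days."""
    return total_flops / PFS_DAY_FLOPS


# GPT-3's published compute lands close to the ~3,640 PF/s-days quoted above.
gpt3_pfs_days = to_pfs_days(3.14e23)
print(f"{gpt3_pfs_days:,.0f} petaFLOP/s-days")
```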
3. Chinchilla-optimal tokens
Hoffmann et al. (2022) showed that for a fixed compute budget, model size and dataset size should grow together, at about 20 training tokens per parameter (D ≈ 20 × N). The tool computes that optimum and flags your run as undertrained (well below it), compute-optimal (within ±25%), or over-trained (well above it). Over-training is often a deliberate choice when you want a smaller model that is cheaper to serve, as with Llama-2.
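One way to sketch the classification is shown below. The text only says "well below" and "well above", so treating everything outside the ±25% band as undertrained or over-trained is an assumption of this sketch, as are the function and constant names.

```python
TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb: D ~ 20 * N
TOLERANCE = 0.25         # band treated as compute-optimal (assumed cut-off)


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def classify_run(n_params: float, n_tokens: float) -> str:
    """Label a run by how its token count compares with the optimum."""
    ratio = n_tokens / chinchilla_optimal_tokens(n_params)
    if ratio < 1 - TOLERANCE:
        return "undertrained"
    if ratio > 1 + TOLERANCE:
        return "over-trained"
    return "compute-optimal"


# Chinchilla itself: 70e9 parameters on 1.4e12 tokens.
print(classify_run(70e9, 1.4e12))
# A 7e9-parameter model on 2e12 tokens, in the style of Llama-2 7B.
print(classify_run(7e9, 2e12))
```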
4. GPU-hours, wall-clock, and cost
GPU-hours are the compute divided by the GPU's effective throughput: GPU-hours = C / (peak FLOP/s × MFU) / 3600. Peak throughput is the dense BF16/FP16 tensor figure from the NVIDIA datasheet (A100 312 TFLOP/s, H100 SXM 989 TFLOP/s), and MFU (Model FLOPs Utilization) is the fraction of peak a real run achieves — typically 30–55%. Wall-clock time divides GPU-hours by the number of GPUs (ideal scaling), and cost multiplies GPU-hours by your price per GPU-hour.
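The three formulas above can be sketched together. The peak-throughput figures are the ones quoted in the text; the compute budget, 40% MFU, 256-GPU cluster, and $2.50 per GPU-hour price are hypothetical inputs chosen only for illustration.

```python
PEAK_FLOPS = {          # dense BF16/FP16 tensor throughput, FLOP/s
    "A100": 312e12,
    "H100 SXM": 989e12,
}


def gpu_hours(total_flops: float, peak_flops: float, mfu: float) -> float:
    """GPU-hours = C / (peak FLOP/s * MFU) / 3600."""
    if not 0 < mfu <= 1:
        raise ValueError("MFU must be a fraction in (0, 1]")
    return total_flops / (peak_flops * mfu) / 3600


def wall_clock_days(hours: float, n_gpus: int) -> float:
    """Wall-clock time assuming ideal (linear) scaling across GPUs."""
    return hours / n_gpus / 24


def cost_usd(hours: float, price_per_gpu_hour: float) -> float:
    """Total cost at a flat price per GPU-hour."""
    return hours * price_per_gpu_hour


# Hypothetical run: 4.2e22 FLOPs on 256 H100 SXM GPUs at 40% MFU, $2.50/GPU-hour.
hours = gpu_hours(4.2e22, PEAK_FLOPS["H100 SXM"], mfu=0.40)
days = wall_clock_days(hours, n_gpus=256)
dollars = cost_usd(hours, price_per_gpu_hour=2.50)
print(f"{hours:,.0f} GPU-hours, {days:.1f} days, ${dollars:,.0f}")
```

Note that adding GPUs shortens the wall-clock time but leaves GPU-hours and cost unchanged under the ideal-scaling assumption.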
5. Cross-check against published runs
As an independent check, the 6ND formula is applied to three documented runs. For GPT-3 and Chinchilla, where the papers state a compute number, the tool shows the percentage difference. The 6ND estimate reproduces GPT-3's published 3.14×10²³ FLOPs to within about 0.3%. The idea is the same as cross-checking a tax figure against the regulator's own formula. Keep in mind that 6ND is a dense approximation: Mixture-of-Experts architectures, the attention-FLOPs correction, and very long context can shift the real number by ±10–30%.
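The GPT-3 cross-check can be reproduced with a few lines. This sketch assumes GPT-3's commonly cited 175 billion parameters and 300 billion training tokens; the helper name is illustrative.

```python
def percent_difference(estimate: float, published: float) -> float:
    """Signed difference of the estimate relative to the published value, in %."""
    return (estimate - published) / published * 100


# GPT-3: N = 175e9 parameters, D = 300e9 tokens (assumed inputs, see lead-in).
gpt3_estimate = 6 * 175e9 * 300e9
gpt3_published = 3.14e23  # FLOPs, as quoted above
gpt3_diff = percent_difference(gpt3_estimate, gpt3_published)
print(f"6ND estimate {gpt3_estimate:.2e} FLOPs, {gpt3_diff:+.1f}% vs published")
```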
Worked examples
Frequently asked questions
Sources & references
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models (the 6ND rule)
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla, 20 tokens/param)
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3 compute, Table D.1)
- NVIDIA — H100 datasheet (peak BF16/FP16 tensor throughput)
- NVIDIA — A100 datasheet (peak BF16/FP16 tensor throughput)
Formulas, constants, and GPU peak-throughput figures were last verified against the sources above on 2026-06-09. The output is an order-of-magnitude planning estimate, not a guarantee — real runs vary ±10–30% with architecture, context length, and achieved MFU.