Transformer Parameter Count Calculator
Work out the exact parameter count of a GPT-style transformer from its architecture — vocab, hidden size, layers, feed-forward width, and head config — and see it split into embedding, attention, feed-forward, and norm shares. GPT-2's 124M and GPT-3's 175B both verified. No signup, formulas cited.
Have a parameter count already? Feed it into the LLM VRAM Calculator to size a GPU, or the AI Training Compute Calculator to estimate training cost.
How it works
The calculator follows the closed-form parameter breakdown in EleutherAI's Transformer Math 101 and the original architecture papers. Write h for hidden size (d_model), L for the number of layers, V for vocabulary size, and d_ff for the feed-forward inner size. The total is the sum of three groups.
Embeddings. The token embedding matrix holds V · h weights. Learned absolute positions add n_ctx · h; rotary (RoPE) and ALiBi positions are computed rather than learned, so they add nothing. The output unembedding adds another V · h unless it is tied to the input embedding, in which case it is free.
Per layer (×L). Attention has four projections. With standard multi-head attention the query, key, value, and output matrices are each h×h, giving 4 · h² weights. Under grouped-query attention the key and value projections shrink to h · (n_kv · head_dim), where head_dim = h / n_heads. The feed-forward block is two linear layers, so 2 · h · d_ff weights. Each layer has two normalization layers: LayerNorm carries a weight and a bias (4h per layer), while RMSNorm carries weight only (2h). Biases on the linear layers add a small linear-in-h term when enabled.
Totals and the scaling rule. The grand total is embeddings plus L times the per-layer count plus one final norm. For comparison, the calculator also prints the Kaplan et al. (2020) non-embedding approximation N ≈ 12 · L · h², which comes from assuming d_ff = 4h and multi-head attention (4h² for attention plus 8h² for the feed-forward gives 12h² per layer). It holds within a fraction of a percent for GPT-style models, but reads high for grouped-query models because their attention is smaller than 4h². The tool reports the signed deviation so you can see exactly how close the rule of thumb is for your configuration.
Worked examples
Frequently asked questions
Sources & references
- EleutherAI — Transformer Math 101 (per-component parameter breakdown)
- Vaswani et al. 2017 — Attention Is All You Need (transformer block, FFN = two linear layers)
- Radford et al. 2019 — GPT-2 (architecture behind the 124M reconciliation)
- Brown et al. 2020 — GPT-3 (Table 2.1, the 175B configuration)
- Kaplan et al. 2020 — Scaling Laws (the 12·L·d_model² non-embedding approximation)
Formulas and presets were last cross-checked against published parameter counts on 2026-06-11. Every built-in preset reproduces its model card's reported size.
Related tools
Comments & feedback
Spotted a bug or want an improvement? Tell us — our team reviews every comment, and good ideas get built. Comments are public and anonymous.
Found a bug, edge case, or want to suggest an improvement?
Email me at [email protected] — most fixes ship within 24 hours.