Transformer Parameter Count Calculator

Work out the exact parameter count of a GPT-style transformer from its architecture — vocab, hidden size, layers, feed-forward width, and head config — and see it split into embedding, attention, feed-forward, and norm shares. GPT-2's 124M and GPT-3's 175B both verified. No signup, formulas cited.

By Induwara Ashinsana · Updated Jun 11, 2026
Transformer parameter count (decoder-only, dense)

Inputs

Vocabulary size: number of distinct tokens in the tokenizer.
Hidden size: model width. The head count must divide this.
Layers: stacked transformer blocks.
Feed-forward size: inner width of the feed-forward layer.
Attention heads: must divide the hidden size evenly.
KV heads: equal to heads = standard MHA; fewer = grouped-query.
Context length: only adds parameters with learned positions.
Options: bias in linear layers, tied embeddings.

Total parameters: 124.4 M (124,439,808)
Non-embedding: 85.1 M (68.35% of total)
Embedding params: 39.4 M (31.65% of total)
12·L·h² estimate: 84.9 M (0.14% under exact)
Shares: Embeddings 31.65% · Attention 22.78% · Feed-forward 45.54% · Norms 0.03%

Parameter breakdown

Component                     Parameters    Share
Token embedding               38,597,376   31.02%
Positional embedding             786,432    0.63%
Output unembedding                     0       0%
Attention (all layers)        28,348,416   22.78%
Feed-forward (all layers)     56,669,184   45.54%
Norms (all layers + final)        38,400    0.03%
Total                        124,439,808     100%

Per layer: attention 2,362,368, feed-forward 4,722,432, norms 3,072 = 7,087,872 parameters, repeated 12×.
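The share column can be reproduced from the raw counts alone. The snippet below is a minimal sketch, not the tool's own code: the `shares` helper and the dictionary layout are illustrative names, and the counts are copied from the breakdown above.

```python
def shares(counts):
    """Return each component's share of the total, in percent, to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Component counts copied from the parameter breakdown above.
counts = {
    "Token embedding": 38_597_376,
    "Positional embedding": 786_432,
    "Output unembedding": 0,
    "Attention (all layers)": 28_348_416,
    "Feed-forward (all layers)": 56_669_184,
    "Norms (all layers + final)": 38_400,
}

total = sum(counts.values())
for name, pct in shares(counts).items():
    print(f"{name:<28}{counts[name]:>12,}{pct:>8.2f}%")
print(f"{'Total':<28}{total:>12,}")
```

The norms row is 12 layers × 3,072 plus the 1,536-weight final norm, which is why it is 38,400 rather than 36,864.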

Formulas follow EleutherAI's Transformer Math 101. Assumes dense decoder-only models with a 2-matrix FFN.

Have a parameter count already? Feed it into the LLM VRAM Calculator to size a GPU, or the AI Training Compute Calculator to estimate training cost.

How it works

The calculator follows the closed-form parameter breakdown in EleutherAI's Transformer Math 101 and the original architecture papers. Write h for hidden size (d_model), L for the number of layers, V for vocabulary size, n_ctx for the context length, and d_ff for the feed-forward inner size. The total is the sum of three groups: the embeddings, the L stacked layers, and one final norm.

Embeddings. The token embedding matrix holds V · h weights. Learned absolute positions add n_ctx · h; rotary (RoPE) and ALiBi positions are computed rather than learned, so they add nothing. The output unembedding adds another V · h unless it is tied to the input embedding, in which case it is free.
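The embedding rules above can be sketched in a few lines. This is an illustrative helper, not the calculator's source; the function name and its keyword arguments are assumptions made for the example.

```python
def embedding_params(vocab, hidden, n_ctx=0, learned_positions=True, tied=True):
    """Return (token, positional, output) embedding parameter counts."""
    token = vocab * hidden
    # RoPE and ALiBi are computed, so only learned absolute positions cost weights.
    positional = n_ctx * hidden if learned_positions else 0
    # A tied output head reuses the token embedding matrix, so it adds nothing.
    output = 0 if tied else vocab * hidden
    return token, positional, output


# GPT-2 Small: learned positions over 1,024 tokens, tied output head.
print(embedding_params(50_257, 768, n_ctx=1_024))
# Llama 3 8B: RoPE, untied output head.
print(embedding_params(128_256, 4_096, learned_positions=False, tied=False))
```

Untying the output head doubles the vocabulary cost, which is why it matters most for models with large vocabularies.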

Per layer (×L). Attention has four projections. With standard multi-head attention the query, key, value, and output matrices are each h×h, giving 4 · h² weights. Under grouped-query attention the key and value projections shrink to h · (n_kv · head_dim), where head_dim = h / n_heads. The feed-forward block is two linear layers, so 2 · h · d_ff weights. Each layer has two normalization layers: LayerNorm carries a weight and a bias (4h per layer), while RMSNorm carries weight only (2h). Biases on the linear layers add a small linear-in-h term when enabled.
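A per-layer count along these lines might look as follows. It is a sketch under stated assumptions: the function name and arguments are hypothetical, and the bias layout (one bias per output feature of each linear layer, so the K and V biases shrink with the KV width) is an assumption that matches the GPT-2 figures on this page but is not spelled out in the text.

```python
def layer_params(hidden, d_ff, n_heads, n_kv_heads=None, bias=True, norm="layernorm"):
    """Return (attention, feed_forward, norms) parameter counts for one block."""
    if hidden % n_heads:
        raise ValueError("n_heads must divide hidden evenly")
    n_kv_heads = n_heads if n_kv_heads is None else n_kv_heads
    head_dim = hidden // n_heads
    kv_width = n_kv_heads * head_dim

    # Q and O are h x h; K and V are h x kv_width (equal to h x h under MHA).
    attention = 2 * hidden * hidden + 2 * hidden * kv_width
    # Two linear layers: h -> d_ff and d_ff -> h.
    feed_forward = 2 * hidden * d_ff
    if bias:
        # Assumption: one bias per output feature of each linear layer.
        attention += 2 * hidden + 2 * kv_width
        feed_forward += d_ff + hidden

    # LayerNorm has weight and bias (2h); RMSNorm has weight only (h).
    per_norm = 2 * hidden if norm == "layernorm" else hidden
    return attention, feed_forward, 2 * per_norm


# GPT-2 Small: MHA, biases, LayerNorm.
print(layer_params(768, 3_072, n_heads=12))
# Llama 3 8B: 8 KV heads, no biases, RMSNorm, 2-matrix-equivalent d_ff.
print(layer_params(4_096, 21_504, n_heads=32, n_kv_heads=8, bias=False, norm="rmsnorm"))
```

With MHA the head count affects only the divisibility check, not the total; it starts to matter once n_kv_heads is smaller than n_heads.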

Totals and the scaling rule. The grand total is embeddings plus L times the per-layer count plus one final norm. For comparison, the calculator also prints the Kaplan et al. (2020) non-embedding approximation N ≈ 12 · L · h², which comes from assuming d_ff = 4h and multi-head attention (4h² for attention plus 8h² for the feed-forward gives 12h² per layer). It holds within a fraction of a percent for GPT-style models. For grouped-query models with d_ff = 4h it reads high, because their attention is smaller than 4h². A feed-forward block wider than 4h pushes the other way: in the Llama 3 8B example below, the 2-matrix-equivalent d_ff is 5.25h, and the rule reads low. The tool reports the signed deviation so you can see exactly how close the rule of thumb is for your configuration.
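The rule of thumb and its signed deviation can be checked with a short sketch. The helper names are illustrative, the deviation is defined here as (estimate − exact) / exact, which may differ from the sign convention the tool displays, and the GQA configuration is a hypothetical one chosen only to isolate the effect of smaller K and V projections.

```python
def kaplan_estimate(n_layers, hidden):
    """Non-embedding approximation N ~ 12 * L * h^2."""
    return 12 * n_layers * hidden ** 2


def signed_deviation_pct(estimate, exact):
    """Positive when the estimate is above the exact count."""
    return 100.0 * (estimate - exact) / exact


def matrix_params_per_layer(hidden, d_ff, kv_width):
    """Weight matrices only (no biases, no norms) for one block."""
    attention = 2 * hidden * hidden + 2 * hidden * kv_width
    feed_forward = 2 * hidden * d_ff
    return attention + feed_forward


h = 4_096
# MHA with d_ff = 4h reproduces 12 h^2 exactly.
mha = matrix_params_per_layer(h, 4 * h, kv_width=h)
# Hypothetical GQA config: same d_ff, KV width cut to h / 4.
gqa = matrix_params_per_layer(h, 4 * h, kv_width=h // 4)

print(mha == 12 * h * h)
print(f"GQA, d_ff = 4h: {signed_deviation_pct(12 * h * h, gqa):+.2f}%")

# GPT-2 Small: exact non-embedding count is the 12-layer stack plus the final norm.
gpt2_exact = 85_054_464 + 1_536
print(f"GPT-2 Small: {signed_deviation_pct(kaplan_estimate(12, 768), gpt2_exact):+.2f}%")
```

The small GPT-2 gap comes entirely from biases and norms, which grow linearly in h while the matrices grow as h².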

Worked examples

GPT-2 Small — 124,439,808 parameters

V=50,257 · h=768 · L=12 · d_ff=3,072 · MHA · learned pos · tied

  1. Token embedding: 50,257 × 768 = 38,597,376
  2. Positional (learned): 1,024 × 768 = 786,432
  3. Output: 0 (tied to input embedding)
  4. Per layer — attention 4 × 768² + biases = 2,362,368
  5. Per layer — feed-forward 2 × 768 × 3,072 + biases = 4,722,432
  6. Per layer — 2 LayerNorms = 4 × 768 = 3,072 → layer total 7,087,872
  7. Stack: 7,087,872 × 12 = 85,054,464; final norm = 1,536
  8. Total: 38,597,376 + 786,432 + 85,054,464 + 1,536 = 124,439,808
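The eight steps above translate directly into integer arithmetic. This is a plain transcription for checking the figures, with variable names chosen for the example; the bias terms (4h for attention, d_ff + h for the feed-forward) are inferred from the step totals rather than stated in the text.

```python
V, h, L, d_ff, n_ctx = 50_257, 768, 12, 3_072, 1_024

token_emb = V * h                         # step 1
pos_emb = n_ctx * h                       # step 2: learned positions
output = 0                                # step 3: tied to the input embedding
attention = 4 * h * h + 4 * h             # step 4: four h x h matrices + four biases
feed_forward = 2 * h * d_ff + d_ff + h    # step 5: two matrices + two biases
norms = 2 * (2 * h)                       # step 6: two LayerNorms, weight + bias each
layer = attention + feed_forward + norms
stack = layer * L                         # step 7
final_norm = 2 * h
total = token_emb + pos_emb + output + stack + final_norm  # step 8

print(f"layer = {layer:,}")
print(f"stack = {stack:,}")
print(f"total = {total:,}")
```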

GPT-3 175B — 174.6B parameters

V=50,257 · h=12,288 · L=96 · d_ff=49,152 · MHA · learned pos · tied

  1. Embeddings: 50,257 × 12,288 + 2,048 × 12,288 = 642,723,840
  2. Per layer — attention 4 × 12,288² + biases = 604,028,928
  3. Per layer — feed-forward 2 × 12,288 × 49,152 + biases = 1,208,020,992
  4. Per layer — 2 LayerNorms = 49,152 → layer total 1,812,099,072
  5. Stack: 1,812,099,072 × 96 = 173,961,510,912; final norm = 24,576
  6. Total: 174,604,259,328 ≈ 174.6B (rounds to the reported 175B)
  7. 12·L·h² check: 12 × 96 × 12,288² = 173,946,175,488 → 0.01% under exact
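For GPT-style models with d_ff = 4h, the steps above collapse into a closed form: each layer costs 12h² + 13h (4h² + 4h for attention, 8h² + 5h for the feed-forward, 4h for the two LayerNorms). The sketch below assumes MHA, biases on, LayerNorm, learned positions, and tied embeddings; the function name is illustrative.

```python
def gpt_style_counts(vocab, n_ctx, hidden, layers):
    """Return (embeddings, non_embedding, total) for d_ff = 4h GPT-style models."""
    embeddings = (vocab + n_ctx) * hidden
    per_layer = 12 * hidden ** 2 + 13 * hidden
    final_norm = 2 * hidden
    non_embedding = layers * per_layer + final_norm
    return embeddings, non_embedding, embeddings + non_embedding


emb, non_emb, total = gpt_style_counts(50_257, 2_048, 12_288, 96)
estimate = 12 * 96 * 12_288 ** 2
gap = non_emb - estimate  # what the 12*L*h^2 rule leaves out: 13*h*L + 2*h

print(f"total    = {total:,} (~{total / 1e9:.1f}B)")
print(f"estimate = {estimate:,}")
print(f"gap      = {gap:,} ({100 * gap / non_emb:.2f}% of non-embedding)")
```

The gap grows linearly in h while the matrices grow quadratically, so the rule of thumb tightens as models get wider.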

Llama 3 8B (GQA) — 8.03B parameters

V=128,256 · h=4,096 · L=32 · 32 heads · 8 KV heads · d_ff=21,504 (2-matrix equivalent) · RoPE · RMSNorm · no bias · untied

  1. head_dim = 4,096 / 32 = 128; KV width = 8 × 128 = 1,024
  2. Attention 2 × 4,096² + 2 × 4,096 × 1,024 = 41,943,040 (GQA shrinks K, V)
  3. Feed-forward (SwiGLU as 2-matrix, d_ff=21,504) = 2 × 4,096 × 21,504 = 176,160,768
  4. RMSNorms: 2 × 4,096 = 8,192 → layer total 218,112,000
  5. Stack: 218,112,000 × 32 = 6,979,584,000
  6. Embeddings (untied): 2 × 128,256 × 4,096 = 1,050,673,152; final norm 4,096
  7. Total: 8,030,261,248 ≈ 8.03B (matches the published count)
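Step 3 folds a three-matrix SwiGLU block into the calculator's two-matrix formula. The sketch below shows that conversion and re-derives the total. The intermediate size of 14,336 is Llama 3 8B's published value and is an assumption here, since this page quotes only the converted 21,504; the helper names are illustrative.

```python
def swiglu_params(hidden, intermediate):
    """Gate, up, and down projections, no biases."""
    return 3 * hidden * intermediate


def two_matrix_equivalent_d_ff(intermediate):
    """Solve 2 * h * d_ff == 3 * h * intermediate for d_ff."""
    return 3 * intermediate // 2


h, layers, vocab = 4_096, 32, 128_256
n_heads, n_kv_heads = 32, 8
intermediate = 14_336  # assumption: published Llama 3 8B value, not quoted on this page

d_ff = two_matrix_equivalent_d_ff(intermediate)
kv_width = n_kv_heads * (h // n_heads)
attention = 2 * h * h + 2 * h * kv_width   # GQA shrinks K and V
feed_forward = 2 * h * d_ff                # same count as the 3-matrix SwiGLU
norms = 2 * h                              # two RMSNorms, weight only
total = layers * (attention + feed_forward + norms) + 2 * vocab * h + h

print(d_ff, feed_forward == swiglu_params(h, intermediate))
print(f"total = {total:,}")
```

The conversion is exact whenever the intermediate size is even, which is why the two-matrix formula can stand in for SwiGLU without changing the total.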
