How do you calculate the number of parameters in a transformer?

Add the embedding parameters (vocab × hidden, plus learned positions) to the per-layer parameters times the layer count. Each layer holds attention projections (about 4·h² for standard multi-head attention), a two-matrix feed-forward block (2·h·d_ff), and two normalization layers. This calculator sums every term exactly and shows the breakdown.

Why does GPT-2 small have 124 million parameters?

GPT-2 small uses vocab 50,257, hidden size 768, 12 layers, d_ff 3072, and ties its input and output embeddings. Token embedding is 38.6M, the 12 layers add 85.1M, positions add 0.8M, and the final norm adds a few thousand — totalling 124,439,808. Load the GPT-2 Small preset to see every line.

What is the 12·L·d_model² rule for transformer parameters?

It is the non-embedding approximation from Kaplan et al. (2020). With d_ff = 4·h and standard multi-head attention, each layer's weights come to roughly 4·h² (attention) + 8·h² (feed-forward) = 12·h², so the stack is about 12·L·h². It ignores biases, norms, and embeddings, and it overshoots grouped-query models that keep d_ff at 4·h, because their K and V projections are smaller than h².

Do token embeddings count toward a model's parameter count?

Yes. The token embedding matrix (vocab × hidden) is a trainable weight and is always counted in the reported total. People sometimes quote a separate "non-embedding" count for scaling-law comparisons, because embeddings dominate small models but become a rounding error in large ones. This tool shows both numbers.

How does tied embedding (weight sharing) change the parameter count?

When the input embedding and the output projection share one weight matrix, you save one vocab × hidden matrix. For GPT-2 small that is 38.6M parameters — about 31% of the model. GPT-2, GPT-3, and many others tie; some models keep them separate. Toggle "Tied embeddings" to compare.

Does this handle grouped-query attention (GQA) and RoPE?

Yes. Set KV heads below the head count and the K and V projections shrink to h·(n_kv·head_dim), which is how Llama 3 and Mistral cut their key/value memory. Choosing RoPE or ALiBi sets positional parameters to zero, since those schemes are computed, not learned.
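As a rough illustration of the shrinkage described above, the sketch below counts attention weights (biases excluded) for one layer. The function name `attention_weights` is made up for this example, and the figures use the Llama 3 8B shape from the worked examples (h = 4,096, 32 heads, 8 KV heads).

```python
def attention_weights(h: int, n_heads: int, n_kv_heads: int) -> int:
    """Weight count (no biases) of one attention block.

    Hypothetical helper for illustration; not the calculator's own code.
    """
    if h % n_heads != 0:
        raise ValueError("n_heads must divide h evenly")
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_kv_heads must divide n_heads evenly")
    head_dim = h // n_heads
    kv_width = n_kv_heads * head_dim      # width of the K and V projections
    q_and_out = 2 * h * h                 # Q and output stay h x h
    k_and_v = 2 * h * kv_width            # K and V shrink under GQA
    return q_and_out + k_and_v


mha = attention_weights(4096, 32, 32)     # KV heads == heads: standard MHA
gqa = attention_weights(4096, 32, 8)      # 8 KV heads: grouped-query
print(mha)  # 67108864
print(gqa)  # 41943040
```

With 8 KV heads the K and V projections are a quarter of their multi-head size, so the block drops from 4·h² to 2.5·h² weights.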

What about SwiGLU and Mixture-of-Experts models?

This version models the standard two-matrix feed-forward. Gated MLPs like SwiGLU use three matrices (≈1.5× the feed-forward parameters), so enter an effective d_ff of 1.5× the reported intermediate size — that is exactly what the Llama presets do. Mixture-of-Experts (separate active vs total counts) is out of scope for this dense-model version.
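The 1.5× substitution can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names; the intermediate size 14,336 is assumed here as Llama 3 8B's reported value (it is the number whose 1.5× multiple gives the 21,504 used in the worked example).

```python
def ffn_two_matrix(h: int, d_ff: int) -> int:
    """Standard feed-forward block: up and down projections."""
    return 2 * h * d_ff


def ffn_gated(h: int, d_intermediate: int) -> int:
    """Gated MLP such as SwiGLU: gate, up, and down projections."""
    return 3 * h * d_intermediate


h = 4096
d_intermediate = 14336                    # assumed reported intermediate size
effective_d_ff = d_intermediate * 3 // 2  # 1.5x, entered into the calculator

print(effective_d_ff)                     # 21504
print(ffn_gated(h, d_intermediate))       # 176160768
print(ffn_two_matrix(h, effective_d_ff))  # 176160768
```

The two counts agree exactly whenever 1.5× the intermediate size is a whole number, which is why the workaround loses nothing for bias-free gated MLPs.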

How accurate is the result against real model cards?

Every built-in preset reproduces its published count to the precision the model card states: GPT-2 Small returns 124,439,808 and GPT-3 175B returns 174.6B, which rounds to the reported 175B. The Llama presets land on 6.74B and 8.03B once SwiGLU is entered as an effective d_ff. The formulas were last verified on 2026-06-11.


Transformer Parameter Count Calculator

Work out the exact parameter count of a GPT-style transformer from its architecture — vocab, hidden size, layers, feed-forward width, and head config — and see it split into embedding, attention, feed-forward, and norm shares. GPT-2's 124M and GPT-3's 175B both verified. No signup, formulas cited.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 11, 2026

Transformer parameter count

Decoder-only · dense

Presets

Vocabulary size (V)

Number of distinct tokens in the tokenizer.

Hidden size / d_model (h)

Model width. Heads must divide this.

Number of layers (L)

Stacked transformer blocks.

FFN intermediate size (d_ff)

Inner width of the feed-forward layer.

Attention heads

Must divide the hidden size evenly.

KV heads (GQA)

Equal to heads = standard MHA; fewer = grouped-query.

Max context (n_ctx)

Only adds parameters with learned positions.

Positional scheme

Normalization

Bias in linear layers

Tied embeddings

Total parameters

124.4 M

124,439,808

Non-embedding

85.1 M

68.35% of total

Embedding params

39.4 M

31.65% of total

12·L·h² estimate

84.9 M

−0.14% vs exact

Embeddings 31.65% · Attention 22.78% · Feed-forward 45.54% · Norms 0.03%

Parameter breakdown

Component	Parameters	Share
Token embedding	38,597,376	31.02%
Positional embedding	786,432	0.63%
Output unembedding	0	0%
Attention (all layers)	28,348,416	22.78%
Feed-forward (all layers)	56,669,184	45.54%
Norms (all layers + final)	38,400	0.03%
Total	124,439,808	100%

Per layer: attention 2,362,368, feed-forward 4,722,432, norms 3,072 = 7,087,872 parameters, repeated 12×.

Formulas follow the sources named under "How it works" below. Dense decoder-only models; 2-matrix FFN.

Have a parameter count already? Feed it into the LLM VRAM Calculator to size a GPU, or the AI Training Compute Calculator to estimate training cost.

How it works

The calculator follows the closed-form parameter breakdown in EleutherAI's Transformer Math 101 and the original architecture papers. Write h for hidden size (d_model), L for the number of layers, V for vocabulary size, and d_ff for the feed-forward inner size. The total is the sum of three groups.

Embeddings. The token embedding matrix holds V · h weights. Learned absolute positions add n_ctx · h; rotary (RoPE) and ALiBi positions are computed rather than learned, so they add nothing. The output unembedding adds another V · h unless it is tied to the input embedding, in which case it is free.
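The embedding group can be sketched as below. The function and argument names are illustrative only, and the Llama 3 context length of 8,192 is an assumption that does not affect the result because RoPE adds no parameters.

```python
def embedding_params(vocab: int, h: int, n_ctx: int,
                     learned_positions: bool, tied: bool) -> dict[str, int]:
    """Embedding-group parameters, split by component (illustrative sketch)."""
    token = vocab * h
    # RoPE and ALiBi are computed, so only learned positions cost parameters.
    positional = n_ctx * h if learned_positions else 0
    # A tied output projection reuses the token matrix and adds nothing.
    unembedding = 0 if tied else vocab * h
    return {"token": token, "positional": positional, "unembedding": unembedding}


gpt2 = embedding_params(50_257, 768, 1_024, learned_positions=True, tied=True)
llama3 = embedding_params(128_256, 4_096, 8_192, learned_positions=False, tied=False)

print(gpt2)                  # {'token': 38597376, 'positional': 786432, 'unembedding': 0}
print(sum(gpt2.values()))    # 39383808
print(sum(llama3.values()))  # 1050673152
```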

Per layer (×L). Attention has four projections. With standard multi-head attention the query, key, value, and output matrices are each h×h, giving 4 · h² weights. Under grouped-query attention the key and value projections shrink to h · (n_kv · head_dim), where head_dim = h / n_heads. The feed-forward block is two linear layers, so 2 · h · d_ff weights. Each layer has two normalization layers: LayerNorm carries a weight and a bias (4h per layer), while RMSNorm carries weight only (2h). Biases on the linear layers, when enabled, add one value per output unit: 4h for the attention projections under multi-head attention and d_ff + h for the feed-forward block.
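A per-layer count along these lines might look like the sketch below. The names are hypothetical, and the bias handling (one bias per output unit of each linear layer) is an assumption chosen to match the GPT-2 Small worked example rather than a statement about the calculator's internals.

```python
def per_layer_params(h: int, d_ff: int, n_heads: int, n_kv_heads: int,
                     layernorm: bool = True, bias: bool = True) -> dict[str, int]:
    """Parameters in one transformer block, split by component (sketch)."""
    head_dim = h // n_heads
    kv_width = n_kv_heads * head_dim

    attention = 2 * h * h + 2 * h * kv_width   # Q, O are h x h; K, V are h x kv_width
    feed_forward = 2 * h * d_ff                # up and down projections
    if bias:
        attention += 2 * h + 2 * kv_width      # one bias per output unit
        feed_forward += d_ff + h
    # LayerNorm has weight + bias (2h each); RMSNorm has weight only (h each).
    norms = 2 * (2 * h if layernorm else h)
    return {"attention": attention, "feed_forward": feed_forward, "norms": norms}


gpt2_layer = per_layer_params(768, 3072, n_heads=12, n_kv_heads=12)
print(gpt2_layer)                # {'attention': 2362368, 'feed_forward': 4722432, 'norms': 3072}
print(sum(gpt2_layer.values()))  # 7087872
```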

Totals and the scaling rule. The grand total is embeddings plus L times the per-layer count plus one final norm. For comparison, the calculator also prints the Kaplan et al. (2020) non-embedding approximation N ≈ 12 · L · h², which comes from assuming d_ff = 4h and multi-head attention (4h² for attention plus 8h² for the feed-forward gives 12h² per layer). It holds within a fraction of a percent for GPT-style models. For grouped-query models that keep d_ff = 4h it reads high, because their attention is smaller than 4h²; a feed-forward block wider than 4h pushes the error the other way. The tool reports the signed deviation so you can see exactly how close the rule of thumb is for your configuration.
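The comparison against the rule of thumb can be reproduced with the sketch below. It assumes the deviation is measured as (estimate − exact) / exact against the non-embedding count (stack plus final norm); the helper names are illustrative, and the exact counts are taken from the worked examples on this page.

```python
def kaplan_estimate(layers: int, h: int) -> int:
    """Non-embedding approximation N = 12 * L * h^2."""
    return 12 * layers * h * h


def signed_deviation_pct(estimate: int, exact: int) -> float:
    """Signed error of the estimate relative to the exact count, in percent."""
    return 100.0 * (estimate - exact) / exact


# Exact non-embedding counts: layer stack plus the final norm.
gpt2_exact = 85_054_464 + 1_536
gpt3_exact = 173_961_510_912 + 24_576

gpt2_dev = signed_deviation_pct(kaplan_estimate(12, 768), gpt2_exact)
gpt3_dev = signed_deviation_pct(kaplan_estimate(96, 12_288), gpt3_exact)

print(kaplan_estimate(12, 768))   # 84934656
print(f"{gpt2_dev:+.2f}%")        # -0.14%
print(f"{gpt3_dev:+.2f}%")        # -0.01%
```

Both deviations are negative because the rule drops the biases and norms that the exact count includes.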

Worked examples

GPT-2 Small — 124,439,808 parameters

V=50,257 · h=768 · L=12 · d_ff=3,072 · MHA · learned pos · tied

Token embedding: 50,257 × 768 = 38,597,376
Positional (learned): 1,024 × 768 = 786,432
Output: 0 (tied to input embedding)
Per layer — attention 4 × 768² + biases = 2,362,368
Per layer — feed-forward 2 × 768 × 3,072 + biases = 4,722,432
Per layer — 2 LayerNorms = 4 × 768 = 3,072 → layer total 7,087,872
Stack: 7,087,872 × 12 = 85,054,464; final norm = 1,536
Total: 38,597,376 + 786,432 + 85,054,464 + 1,536 = 124,439,808

GPT-3 175B — 174.6B parameters

V=50,257 · h=12,288 · L=96 · d_ff=49,152 · MHA · learned pos · tied

Embeddings: 50,257 × 12,288 + 2,048 × 12,288 = 642,723,840
Per layer — attention 4 × 12,288² + biases = 604,028,928
Per layer — feed-forward 2 × 12,288 × 49,152 + biases = 1,208,020,992
Per layer — 2 LayerNorms = 49,152 → layer total 1,812,099,072
Stack: 1,812,099,072 × 96 = 173,961,510,912; final norm = 24,576
Total: 174,604,259,328 ≈ 174.6B (rounds to the reported 175B)
12·L·h² check: 12 × 96 × 12,288² = 173,946,175,488 → 0.01% under exact

Llama 3 8B (GQA) — 8.03B parameters

V=128,256 · h=4,096 · L=32 · 8 KV heads · RoPE · RMSNorm · no bias · untied

head_dim = 4,096 / 32 = 128; KV width = 8 × 128 = 1,024
Attention 2 × 4,096² + 2 × 4,096 × 1,024 = 41,943,040 (GQA shrinks K, V)
Feed-forward (SwiGLU as 2-matrix, d_ff=21,504) = 2 × 4,096 × 21,504 = 176,160,768
RMSNorms: 2 × 4,096 = 8,192 → layer total 218,112,000
Stack: 218,112,000 × 32 = 6,979,584,000
Embeddings (untied): 2 × 128,256 × 4,096 = 1,050,673,152; final norm 4,096
Total: 8,030,261,248 ≈ 8.03B (matches the published count)
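The three worked examples can be re-derived end to end with a short script. This is a sketch under stated assumptions, not the calculator's source: the `Config` fields and `total_params` are names invented here, biases are counted as one per output unit, and the GPT-2 and GPT-3 head counts (12 and 96) and Llama 3 context length (8,192) are assumed values that do not change these totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    vocab: int
    h: int
    layers: int
    d_ff: int
    n_heads: int
    n_kv_heads: int
    n_ctx: int
    learned_positions: bool
    layernorm: bool
    bias: bool
    tied: bool


def total_params(c: Config) -> int:
    """Embeddings + layers * per-layer count + one final norm."""
    kv_width = c.n_kv_heads * (c.h // c.n_heads)
    attention = 2 * c.h * c.h + 2 * c.h * kv_width
    ffn = 2 * c.h * c.d_ff
    if c.bias:
        attention += 2 * c.h + 2 * kv_width
        ffn += c.d_ff + c.h
    norm = 2 * c.h if c.layernorm else c.h   # LayerNorm: weight + bias
    layer = attention + ffn + 2 * norm

    embeddings = c.vocab * c.h
    if c.learned_positions:
        embeddings += c.n_ctx * c.h
    if not c.tied:
        embeddings += c.vocab * c.h          # separate output unembedding
    return embeddings + c.layers * layer + norm


gpt2_small = Config(50_257, 768, 12, 3_072, 12, 12, 1_024, True, True, True, True)
gpt3_175b = Config(50_257, 12_288, 96, 49_152, 96, 96, 2_048, True, True, True, True)
# SwiGLU entered as an effective 2-matrix d_ff of 21,504.
llama3_8b = Config(128_256, 4_096, 32, 21_504, 32, 8, 8_192, False, False, False, False)

print(total_params(gpt2_small))  # 124439808
print(total_params(gpt3_175b))   # 174604259328
print(total_params(llama3_8b))   # 8030261248
```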

Frequently asked questions

Sources & references

Formulas and presets were last cross-checked against published parameter counts on 2026-06-11. Every built-in preset reproduces its model card's reported size.
