How many active parameters does DeepSeek-V3 use per token?

DeepSeek-V3 activates about 37 billion of its 671 billion total parameters for each token — roughly 5.5%. Only a shared trunk plus 8 of its 256 routed experts fire per token, so it decodes at the speed of a ~37B dense model while still needing all 671B in memory.

Why does an MoE model need so much VRAM if only a few experts run?

Routing is per-token and unpredictable: the next token might need any expert. All experts must therefore sit in VRAM ready to fire — you cannot leave most of them on disk without paging on almost every token. VRAM is set by total parameters; speed is set by active parameters.

Is Mixtral 8x7B as fast as a 13B model or a 47B model?

Speed-wise it behaves like a ~13B dense model: it activates 12.9B parameters per token (top-2 of 8 experts plus the shared trunk). Memory-wise it behaves like a 47B model: all 46.7B parameters must load, needing about 93 GB in FP16 or 23 GB in INT4.

How much GPU memory do I need to run DeepSeek-V3 or Llama 4?

For weights alone: DeepSeek-V3 needs ~671 GB in FP8 (about nine 80 GB GPUs) or ~336 GB in INT4. Llama 4 Scout (109B) needs ~218 GB in FP16 or ~55 GB in INT4. Add KV cache and activation memory on top — use the LLM VRAM Calculator for the full figure.

What is the difference between total and active parameters in an LLM?

Total parameters are every weight in the model — they set the memory and disk footprint. Active parameters are only those that run for a given token; in an MoE that is the shared layers plus the top-k selected experts. In a dense model the two are equal; in an MoE the active count is far smaller.

How does the tool calculate active parameters for a custom config?

Per-expert size = (total − shared) ÷ number of experts. Active per token = shared + top-k × per-expert. Active % = active ÷ total × 100. It matches the Mixtral paper's method: for Mixtral 8x7B a 1.63B shared trunk plus two 5.63B experts gives 12.9B active.

What precision should I pick for the VRAM estimate?

Use the precision you will actually run. FP16/BF16 uses 2 bytes per parameter, FP8 uses 1, and INT4 quantisation uses 0.5. DeepSeek-V3 ships native FP8. INT4 roughly quarters the FP16 footprint with a small quality trade-off, which is how large MoE models fit on fewer GPUs.

Does the VRAM number include the KV cache?

No. This tool reports weights-only VRAM (parameters × bytes per parameter). KV cache and activations add more and scale with context length and batch size; CUDA context and allocator fragmentation add further overhead. For the complete inference footprint, use the LLM VRAM Calculator to estimate those extras on top of the weights.


Mixture-of-Experts (MoE) Calculator

Find an MoE model's active parameters per token (its speed) and the total VRAM you need to load it (its memory footprint). Presets for Mixtral, DeepSeek-V3, Llama 4, DBRX, Qwen and Grok, or enter a custom config. No signup, sources cited below.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jul 1, 2026

Mixture-of-Experts sizing

Model-card verified

MoE model: Mixtral 8x7B (preset)

Total and active parameters are the vendor's published figures. Source: Mistral AI, arXiv:2401.04088.

Weight precision: FP16 / BF16

Metric	Value	Note
Active params / token	12.9B	Decides speed & per-token compute
Active fraction	27.6%	of 46.7B total
VRAM to load (FP16 / BF16)	93.4 GB	All experts must fit in memory
80 GB GPUs to load	2×	Weights only, before KV cache

Speed-vs-memory verdict

Computes about as fast as a ~12.9B dense model, but must be stored like a ~46.7B dense model, so you need 2× 80 GB GPUs at FP16 / BF16 just to load the weights. The constraint is memory capacity, not compute.

Parameter breakdown

Total parameters (load into VRAM)	46.7B
Shared / dense trunk (always active, derived)	1.6B
Per-expert size (8 experts)	5.6B
Active per token (shared + top-2)	12.9B
Inference FLOPs per token (≈ 2 × active)	25.8 GFLOP

Precision	Bytes / param	VRAM to load
FP16 / BF16	2	93.4 GB
FP8	1	46.7 GB
INT4 (4-bit)	0.5	23.4 GB

MoE models compared

Model	Total	Active	Active %	VRAM (FP16 / BF16)
Mixtral 8x7B	46.7B	12.9B	27.6%	93.4 GB
Mixtral 8x22B	141B	39B	27.7%	282 GB
DeepSeek-V3	671B	37B	5.5%	1,342 GB
DeepSeek-V2	236B	21B	8.9%	472 GB
Llama 4 Scout	109B	17B	15.6%	218 GB
Llama 4 Maverick	400B	17B	4.3%	800 GB
Databricks DBRX	132B	36B	27.3%	264 GB
Qwen2 57B-A14B	57B	14B	24.6%	114 GB
Qwen1.5-MoE-A2.7B	14.3B	2.7B	18.9%	28.6 GB
Grok-1	314B	86B	27.4%	628 GB

Totals and active counts are vendor-published figures (see sources). VRAM is weights-only at the selected precision.

VRAM shown is weights only (params × bytes-per-param). Add KV cache, activations and overhead with the LLM VRAM Calculator. Formulas from Mixtral (arXiv:2401.04088), DeepSeek-V3 (arXiv:2412.19437) and EleutherAI Transformer Math 101 — cited in full below.
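The Active % and VRAM columns of the comparison table follow directly from the published totals and active counts. The sketch below recomputes them; `PRESETS`, `fp16_row` and `rows` are illustrative names rather than the tool's code, and the figures are copied from the table above. Borderline values such as 17 / 400 = 4.25% may round differently from the table by one unit in the last digit.

```python
# Published (total, active) parameter counts in billions, copied from the table.
PRESETS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Mixtral 8x22B": (141, 39),
    "DeepSeek-V3": (671, 37),
    "DeepSeek-V2": (236, 21),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "Databricks DBRX": (132, 36),
    "Qwen2 57B-A14B": (57, 14),
    "Qwen1.5-MoE-A2.7B": (14.3, 2.7),
    "Grok-1": (314, 86),
}


def fp16_row(total_b, active_b):
    """Return (active %, weights-only FP16 VRAM in GB) for one model."""
    active_pct = active_b / total_b * 100
    vram_gb = total_b * 2.0  # 2 bytes per parameter at FP16 / BF16
    return active_pct, vram_gb


rows = {name: fp16_row(t, a) for name, (t, a) in PRESETS.items()}

for name, (pct, vram) in rows.items():
    total_b, active_b = PRESETS[name]
    print(f"{name:<20}{total_b:>7}B{active_b:>7}B{pct:>7.1f}%{vram:>9,.1f} GB")
```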

How it works

A Mixture-of-Experts (MoE) transformer replaces each dense feed-forward block with many parallel expert blocks and a router that sends every token to only a few of them. That gives the model two sizes that coincide in a dense model but diverge sharply here: the total parameters you must fit in memory, and the far smaller active parameters that actually run per token. This tool computes both, plus the VRAM to load the weights.

Let T be total parameters, E the number of experts, k the experts activated per token (top-k routing), and S the shared / dense parameters (attention, embeddings and any always-on layers). All in billions.

Per-expert size: per_expert = (T − S) / E. The expert parameters are the routed FFN blocks; everything else is shared (Mixtral paper §2).
Active parameters per token: active = S + k × per_expert. Only the shared trunk plus the top-k routed experts run for any given token (Mixtral §2.1; DeepSeek-V3 report §2).
Active fraction: active% = active / T × 100.
VRAM to load weights: vram_GB = T × bytes_per_param, where bytes per parameter is 2 for FP16/BF16, 1 for FP8 and 0.5 for INT4 (EleutherAI Transformer Math 101). Every expert must reside in memory because routing is per-token and unpredictable — you cannot leave experts on disk without paging on nearly every token.
Compute intuition: inference FLOPs per token ≈ 2 × active (for example, 12.9B active parameters ≈ 25.8 GFLOP), so decode throughput tracks the active count, not the total. That is why a large MoE feels fast but is memory-hungry: its hardware requirement is set by memory capacity, not compute.
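The formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual source: `size_moe`, `MoESizing` and `BYTES_PER_PARAM` are hypothetical names, parameter counts are in billions, and 1 GB is taken as 10^9 bytes.

```python
from dataclasses import dataclass

# Bytes per parameter for each weight precision, as listed in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


@dataclass
class MoESizing:
    per_expert_b: float     # size of one routed expert, billions
    active_b: float         # parameters that run per token, billions
    active_pct: float       # active / total, percent
    vram_gb: float          # weights-only VRAM, GB
    gflop_per_token: float  # approx. inference compute, 2 x active


def size_moe(total_b, shared_b, n_experts, top_k, precision="fp16"):
    """Apply the T / S / E / k formulas to a custom MoE config."""
    if not 0 < top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    if not 0 <= shared_b <= total_b:
        raise ValueError("shared_b must be between 0 and total_b")
    per_expert = (total_b - shared_b) / n_experts
    active = shared_b + top_k * per_expert
    return MoESizing(
        per_expert_b=per_expert,
        active_b=active,
        active_pct=active / total_b * 100,
        vram_gb=total_b * BYTES_PER_PARAM[precision],
        gflop_per_token=2 * active,
    )


# Hypothetical config: 46.5B total, 2.5B shared, 8 experts, top-2, FP16.
cfg = size_moe(total_b=46.5, shared_b=2.5, n_experts=8, top_k=2)
print(cfg)
```

Keeping the inputs in billions means the VRAM result comes out directly in GB, which mirrors how the formulas are stated.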

For the built-in presets the tool uses each vendor's published total and active counts, so preset numbers match the model cards to the decimal. It then back-solves the implied shared trunk from S = (E·active − k·T) / (E − k) and recomputes active with the formula above as a cross-check — for Mixtral 8x7B this reproduces 12.9B exactly. Custom configs use the S / E / k formula path directly.
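The back-solve formula follows from rearranging the active-parameter formula. A short derivation, writing A for the published active count (a shorthand introduced here, not a symbol from the sources) and assuming k < E:

```latex
\begin{align*}
A   &= S + k\,\frac{T - S}{E} \\
E A &= E S + k T - k S = (E - k)\,S + k T \\
S   &= \frac{E A - k T}{E - k}, \qquad k < E
\end{align*}
```

Because S is solved from the same equation, substituting it back into active = S + k × per_expert returns A by construction; the cross-check guards against arithmetic slips rather than independently validating the published figure.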

Worked examples

Mixtral 8x7B (FP16) — preset

46.7B total · 8 experts · top-2

Published active = 12.9B (Mistral, arXiv:2401.04088)
Active % = 12.9 / 46.7 × 100 = 27.6%
Derived shared = (8×12.9 − 2×46.7) / (8−2) = 9.8 / 6 = 1.63B
Per-expert = (46.7 − 1.63) / 8 = 5.63B
Cross-check active = 1.63 + 2×5.63 = 12.9B ✓
VRAM FP16 = 46.7 × 2 = 93.4 GB; INT4 = 46.7 × 0.5 = 23.35 GB
Verdict: fast like a ~13B model, stored like a ~47B model
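The same steps can be reproduced in code. The sketch below assumes a hypothetical helper `implied_shared` (not part of the tool) that back-solves the shared trunk from the published Mixtral 8x7B figures and then recomputes the active count as a cross-check.

```python
def implied_shared(total_b, active_b, n_experts, top_k):
    """Back-solve the shared trunk S = (E*active - k*T) / (E - k), in billions."""
    if top_k >= n_experts:
        raise ValueError("back-solve needs top_k < n_experts")
    return (n_experts * active_b - top_k * total_b) / (n_experts - top_k)


# Published Mixtral 8x7B figures: 46.7B total, 12.9B active, 8 experts, top-2.
total_b, active_b, n_experts, top_k = 46.7, 12.9, 8, 2

shared = implied_shared(total_b, active_b, n_experts, top_k)
per_expert = (total_b - shared) / n_experts
active_check = shared + top_k * per_expert  # should land back on 12.9

print(f"shared     = {shared:.2f}B")
print(f"per-expert = {per_expert:.2f}B")
print(f"active     = {active_check:.1f}B")
```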

DeepSeek-V3 (FP8) — preset

671B total · 256 experts · top-8

Published active = 37B (DeepSeek-V3, arXiv:2412.19437)
Active % = 37 / 671 × 100 = 5.5%
VRAM FP8 (native) = 671 × 1 = 671 GB
GPUs to load @ 80 GB = ceil(671 / 80) = 9 GPUs
VRAM FP16 = 1,342 GB; INT4 = 335.5 GB
Verdict: decode speed of a ~37B model, footprint of a 671B model
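The GPU count is a ceiling division of weights-only VRAM by per-GPU memory. A minimal sketch, with `gpus_to_load` and `BYTES` as illustrative names and 80 GB as the default card size used in the example above; it counts weights only, so real deployments need additional headroom for KV cache and activations.

```python
import math

# Bytes per parameter for each precision listed in the text.
BYTES = {"FP16 / BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def gpus_to_load(total_b, bytes_per_param, gpu_gb=80.0):
    """Minimum number of GPUs whose combined memory holds the weights alone."""
    vram_gb = total_b * bytes_per_param  # total_b is in billions of parameters
    return math.ceil(vram_gb / gpu_gb)


# DeepSeek-V3: 671B total parameters.
counts = {name: gpus_to_load(671, b) for name, b in BYTES.items()}
for name, n in counts.items():
    print(f"{name:<12} {671 * BYTES[name]:>8,.1f} GB -> {n} x 80 GB GPUs")
```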

Custom config — exercises the formula

46.5B total · 8 experts · top-2 · 2.5B shared · FP16

Per-expert = (46.5 − 2.5) / 8 = 44 / 8 = 5.5B
Active = 2.5 + 2 × 5.5 = 13.5B
Active % = 13.5 / 46.5 × 100 = 29.0%
VRAM FP16 = 46.5 × 2 = 93.0 GB
All four outputs reconcile with the methodology formulas ✓

Frequently asked questions

Sources & references

Preset total and active counts are the published figures from each vendor's model card or paper, last cross-checked on 2026-07-01. Grok-1's active count is the vendor's approximate figure. VRAM is weights-only and excludes KV cache, activations and runtime overhead.

