Mixture-of-Experts (MoE) Calculator
Find an MoE model's active parameters per token (its speed) and the total VRAM you need to load it (its memory footprint). Presets for Mixtral, DeepSeek-V3, Llama 4, DBRX, Qwen and Grok, or enter a custom config. No signup, sources cited below.
How it works
A Mixture-of-Experts (MoE) transformer replaces each dense feed-forward block with many parallel expert blocks and a router that sends every token to only a few of them. This separates two numbers that coincide in a dense model: the total parameters you must fit in memory, and the far smaller set of active parameters that actually run per token. This tool computes both, plus the VRAM needed to load the weights.
Let T be total parameters, E the number of experts, k the experts activated per token (top-k routing), and S the shared / dense parameters (attention, embeddings and any always-on layers). T and S are in billions of parameters; E and k are counts.
- Per-expert size: per_expert = (T − S) / E. The expert parameters are the routed FFN blocks; everything else is shared (Mixtral paper §2).
- Active parameters per token: active = S + k × per_expert. Only the shared trunk plus the top-k routed experts run for any given token (Mixtral §2.1; DeepSeek-V3 report §2).
- Active fraction: active% = active / T × 100.
- VRAM to load weights: vram_GB = T × bytes_per_param, where bytes per parameter is 2 for FP16/BF16, 1 for FP8 and 0.5 for INT4 (EleutherAI Transformer Math 101). Every expert must reside in memory because routing is per-token and unpredictable: you cannot leave experts on disk without paging on nearly every token.
- Compute intuition: inference FLOPs per token ≈ 2 × active, so decode throughput tracks the active count, not the total. That is why a large MoE feels fast but is memory-hungry: it is memory-bound, not compute-bound.
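The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `moe_stats`, the `BYTES_PER_PARAM` table and the example config (T = 100, E = 10, k = 2, S = 20) are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
# Bytes per parameter by precision, as listed above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def moe_stats(total_b, experts, top_k, shared_b, precision="fp16"):
    """Return per-expert size, active params, active % and weights-only VRAM.

    total_b and shared_b are in billions of parameters; experts and top_k
    are counts. Hypothetical helper, not the tool's real implementation.
    """
    if not 0 < top_k <= experts:
        raise ValueError("top_k must be between 1 and experts")
    if total_b <= 0 or not 0 <= shared_b <= total_b:
        raise ValueError("need total_b > 0 and 0 <= shared_b <= total_b")

    per_expert = (total_b - shared_b) / experts   # routed FFN params per expert
    active = shared_b + top_k * per_expert        # shared trunk + top-k experts
    return {
        "per_expert_b": per_expert,
        "active_b": active,
        "active_pct": active / total_b * 100,
        # 1e9 params x bytes/param = decimal GB, so billions x bytes gives GB.
        "vram_gb": total_b * BYTES_PER_PARAM[precision],
        # Rough forward-pass cost per token in GFLOPs (2 x active params).
        "gflops_per_token": 2 * active,
    }


# Made-up config: 100B total, 10 experts, top-2 routing, 20B shared.
stats = moe_stats(100, 10, 2, 20, precision="fp16")
print(stats)
```

For this made-up config, 36B of the 100B parameters run per token, while the FP16 weights alone need about 200 GB.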
For the built-in presets the tool uses each vendor's published total and active counts, so preset numbers match the model cards to the decimal. It then back-solves the implied shared trunk from S = (E·active − k·T) / (E − k) and recomputes active with the formula above as a cross-check — for Mixtral 8x7B this reproduces 12.9B exactly. Custom configs use the S / E / k formula path directly.
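The back-solve and cross-check can be reproduced roughly as follows. This is a sketch that assumes the formula above is applied as written; `implied_shared` is a hypothetical helper name, and the resulting S is a value implied by the formula, not a figure published by Mistral AI.

```python
def implied_shared(total_b, active_b, experts, top_k):
    """Solve active = S + k * (T - S) / E for S (all sizes in billions)."""
    if experts <= top_k:
        raise ValueError("need experts > top_k, otherwise S is undetermined")
    return (experts * active_b - top_k * total_b) / (experts - top_k)


# Mixtral 8x7B figures cited on this page: 46.7B total, 12.9B active, top-2 of 8.
T, ACTIVE, E, K = 46.7, 12.9, 8, 2

S = implied_shared(T, ACTIVE, E, K)
recomputed = S + K * (T - S) / E   # forward formula, used as the cross-check

print(f"implied shared S = {S:.2f}B")
print(f"recomputed active = {recomputed:.1f}B")
```

With these inputs the implied shared trunk comes out at roughly 1.63B and the recomputed active count is 12.9B. Because the check inverts the same formula, it confirms the algebra is consistent rather than independently validating the split.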
Sources & references
- Mistral AI, “Mixtral of Experts” (arXiv:2401.04088): 46.7B total / 12.9B active, top-2 of 8 experts
- DeepSeek-AI, “DeepSeek-V3 Technical Report” (arXiv:2412.19437): 671B total / 37B active, native FP8
- Meta, Llama 4 model cards: Scout 109B total / 17B active, Maverick 400B total / 17B active
- EleutherAI, “Transformer Math 101”: bytes per parameter by precision, FLOPs ≈ 2 × active
Preset total and active counts are the published figures from each vendor's model card or paper, last cross-checked on 2026-07-01. Grok-1's active count is the vendor's approximate figure. VRAM is weights-only and excludes KV cache, activations and runtime overhead.