induwara.lk

Mixture-of-Experts (MoE) Calculator

Find an MoE model's active parameters per token (its speed) and the total VRAM you need to load it (its memory footprint). Presets for Mixtral, DeepSeek-V3, Llama 4, DBRX, Qwen and Grok, or enter a custom config. No signup, sources cited below.

By Induwara Ashinsana · Updated Jul 1, 2026
Mixture-of-Experts sizing
Model-card verified

Total and active parameters are the vendor's published figures. Source: Mistral AI, arXiv:2401.04088.

Weight precision: FP16 / BF16
Active params / token: 12.9B (decides speed and per-token compute)
Active fraction: 27.6% of 46.7B total
VRAM to load (FP16 / BF16): 93.4 GB (all experts must fit in memory)
80 GB GPUs to load: 2 (weights only, before KV cache)
Speed-vs-memory verdict

Computes about as fast as a ~12.9B dense model but must be stored like a ~46.7B dense model, so you need 2× 80 GB GPUs at FP16 / BF16 just to load the weights. The binding constraint is memory capacity, not compute.

Parameter breakdown

| Quantity | Value |
|---|---|
| Total parameters (load into VRAM) | 46.7B |
| Shared / dense trunk (always active, derived) | 1.6B |
| Per-expert size (8 experts) | 5.6B |
| Active per token (shared + top-2) | 12.9B |
| Inference FLOPs per token (≈ 2 × active) | 25.8 GFLOP |

| Precision | Bytes / param | VRAM to load |
|---|---|---|
| FP16 / BF16 | 2 | 93.4 GB |
| FP8 | 1 | 46.7 GB |
| INT4 (4-bit) | 0.5 | 23.4 GB |

MoE models compared

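| Model | Total | Active | Active % | VRAM (FP16 / BF16) |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 27.6% | 93.4 GB |
| Mixtral 8x22B | 141B | 39B | 27.7% | 282 GB |
| DeepSeek-V3 | 671B | 37B | 5.5% | 1,342 GB |
| DeepSeek-V2 | 236B | 21B | 8.9% | 472 GB |
| Llama 4 Scout | 109B | 17B | 15.6% | 218 GB |
| Llama 4 Maverick | 400B | 17B | 4.3% | 800 GB |
| Databricks DBRX | 132B | 36B | 27.3% | 264 GB |
| Qwen2 57B-A14B | 57B | 14B | 24.6% | 114 GB |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 18.9% | 28.6 GB |
| Grok-1 | 314B | 86B | 27.4% | 628 GB |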

Totals and active counts are vendor-published figures (see sources). VRAM is weights-only at the selected precision.

VRAM shown is weights only (params × bytes-per-param). Add KV cache, activations and overhead with the LLM VRAM Calculator. Formulas from Mixtral (arXiv:2401.04088), DeepSeek-V3 (arXiv:2412.19437) and EleutherAI Transformer Math 101 — cited in full below.
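
The Active % and VRAM columns in the comparison table can be recomputed from the Total and Active columns alone. The sketch below is a minimal illustration, assuming the figures are copied from the table as shown; `PRESETS` and `active_pct` are names introduced here and are not the tool's own code.

```python
# (model, total params in billions, active params in billions),
# copied from the comparison table; names here are illustrative only.
PRESETS = [
    ("Mixtral 8x7B", 46.7, 12.9),
    ("Mixtral 8x22B", 141.0, 39.0),
    ("DeepSeek-V3", 671.0, 37.0),
    ("DeepSeek-V2", 236.0, 21.0),
    ("Llama 4 Scout", 109.0, 17.0),
    ("Llama 4 Maverick", 400.0, 17.0),
    ("Databricks DBRX", 132.0, 36.0),
    ("Qwen2 57B-A14B", 57.0, 14.0),
    ("Qwen1.5-MoE-A2.7B", 14.3, 2.7),
    ("Grok-1", 314.0, 86.0),
]


def active_pct(total_b: float, active_b: float) -> float:
    """Share of the total parameters that run for each token, in percent."""
    return active_b / total_b * 100


# Sort from sparsest (smallest active fraction) to densest.
rows = sorted(PRESETS, key=lambda r: active_pct(r[1], r[2]))

for name, total_b, active_b in rows:
    vram_fp16_gb = total_b * 2  # 2 bytes per parameter at FP16 / BF16
    print(f"{name:<20} {active_pct(total_b, active_b):5.1f}%  {vram_fp16_gb:8.1f} GB")
```

Sorting by active fraction makes the spread visible: the sparsest presets activate around 5% of their weights per token, the densest a little under 28%.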

How it works

A Mixture-of-Experts (MoE) transformer replaces each dense feed-forward block with many parallel expert blocks and a router that sends every token to only a few of them. That splits a model into two numbers that are usually equal in a dense model but very different here: the total parameters you must fit in memory, and the far smaller active parameters that actually run per token. This tool computes both, plus the VRAM to load the weights.

Let T be total parameters, E the number of experts, k the experts activated per token (top-k routing), and S the shared / dense parameters (attention, embeddings and any always-on layers). T and S are expressed in billions of parameters; E and k are plain counts.

  1. Per-expert size: per_expert = (T − S) / E. The expert parameters are the routed FFN blocks; everything else is shared (Mixtral paper §2).
  2. Active parameters per token: active = S + k × per_expert. Only the shared trunk plus the top-k routed experts run for any given token (Mixtral §2.1; DeepSeek-V3 report §2).
  3. Active fraction: active% = active / T × 100.
  4. VRAM to load weights: vram_GB = T × bytes_per_param, where bytes per parameter is 2 for FP16/BF16, 1 for FP8 and 0.5 for INT4 (EleutherAI Transformer Math 101). Every expert must reside in memory because routing is per-token and unpredictable — you cannot leave experts on disk without paging on nearly every token.
  5. Compute intuition: inference FLOPs per token ≈ 2 × active (one multiply and one add per active weight), so decode throughput tracks the active count, not the total. That is why a large MoE feels fast but is memory-hungry: its limit is memory capacity, not compute.
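
The five steps above can be sketched as a single function. This is a minimal illustration of the custom-config formula path, not the tool's actual source; `MoEConfig`, `size_moe` and the dictionary keys are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Bytes per parameter for each weight precision, as listed in step 4.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical container; field names are illustrative only."""
    total_b: float   # T: total parameters, in billions
    shared_b: float  # S: shared / dense parameters, in billions
    experts: int     # E: number of routed experts
    top_k: int       # k: experts activated per token


def size_moe(cfg: MoEConfig, precision: str = "fp16") -> dict:
    """Apply steps 1 to 5 to a custom config and return the results."""
    if cfg.total_b <= 0:
        raise ValueError("total_b must be positive")
    if not 0 <= cfg.shared_b <= cfg.total_b:
        raise ValueError("shared_b must be between 0 and total_b")
    if not 0 < cfg.top_k <= cfg.experts:
        raise ValueError("top_k must be between 1 and the number of experts")

    per_expert = (cfg.total_b - cfg.shared_b) / cfg.experts   # step 1
    active = cfg.shared_b + cfg.top_k * per_expert            # step 2
    return {
        "per_expert_b": per_expert,
        "active_b": active,
        "active_pct": active / cfg.total_b * 100,              # step 3
        "vram_gb": cfg.total_b * BYTES_PER_PARAM[precision],   # step 4
        "gflop_per_token": 2 * active,                         # step 5
    }


result = size_moe(MoEConfig(total_b=46.5, shared_b=2.5, experts=8, top_k=2))
print(result)
```

The call at the end uses a 46.5B-total, 8-expert, top-2 config with a 2.5B shared trunk. Because T and S are in billions, the VRAM figure comes out directly in GB and the FLOP figure in GFLOP.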

For the built-in presets the tool uses each vendor's published total and active counts, so preset numbers match the model cards to the decimal. It then back-solves the implied shared trunk from S = (E·active − k·T) / (E − k) and recomputes active with the formula above as a cross-check — for Mixtral 8x7B this reproduces 12.9B exactly. Custom configs use the S / E / k formula path directly.
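
The back-solve expression follows from substituting the per-expert size into the active-parameter formula and solving for S. The derivation below uses the symbols defined earlier, with A standing for the published active count (a label introduced here for brevity).

```latex
\begin{align*}
A &= S + k \cdot \frac{T - S}{E} \\
E A &= E S + k T - k S \\
E A - k T &= S \, (E - k) \\
S &= \frac{E A - k T}{E - k}, \qquad E > k
\end{align*}
```

The condition E > k matters: if every expert is active for every token, the model is effectively dense and the published figures no longer determine S.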

Worked examples

Mixtral 8x7B (FP16) — preset

46.7B total · 8 experts · top-2

  1. Published active = 12.9B (Mistral, arXiv:2401.04088)
  2. Active % = 12.9 / 46.7 × 100 = 27.6%
  3. Derived shared = (8×12.9 − 2×46.7) / (8−2) = 9.8 / 6 = 1.63B
  4. Per-expert = (46.7 − 1.63) / 8 = 5.63B
  5. Cross-check active = 1.63 + 2×5.63 = 12.9B ✓
  6. VRAM FP16 = 46.7 × 2 = 93.4 GB; INT4 = 46.7 × 0.5 = 23.35 GB
  7. Verdict: fast like a ~13B model, stored like a ~47B model
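
The arithmetic in steps 3 to 5 can be reproduced in a few lines. This is a sketch under the assumption that the published Mixtral figures are entered in billions; `back_solve_shared` is a hypothetical helper name, not a function from the tool.

```python
def back_solve_shared(total_b: float, active_b: float, experts: int, top_k: int) -> float:
    """Implied shared trunk S = (E*active - k*T) / (E - k), in billions."""
    if experts <= top_k:
        raise ValueError("need more experts than top_k to back-solve S")
    return (experts * active_b - top_k * total_b) / (experts - top_k)


# Mixtral 8x7B preset figures from the worked example.
total_b, active_b, experts, top_k = 46.7, 12.9, 8, 2

shared_b = back_solve_shared(total_b, active_b, experts, top_k)
per_expert_b = (total_b - shared_b) / experts

# Cross-check: feed the derived S back into the forward formula.
recomputed_active_b = shared_b + top_k * per_expert_b

print(f"shared = {shared_b:.2f}B, per-expert = {per_expert_b:.2f}B, "
      f"active = {recomputed_active_b:.1f}B")
```

The cross-check recovers the published active count by construction, since the back-solve is the algebraic inverse of the forward formula. Its value is as a guard against typos in the preset data rather than as independent confirmation.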

DeepSeek-V3 (FP8) — preset

671B total · 256 experts · top-8

  1. Published active = 37B (DeepSeek-V3, arXiv:2412.19437)
  2. Active % = 37 / 671 × 100 = 5.5%
  3. VRAM FP8 (native) = 671 × 1 = 671 GB
  4. GPUs to load @ 80 GB = ceil(671 / 80) = 9 GPUs
  5. VRAM FP16 = 1,342 GB; INT4 = 335.5 GB
  6. Verdict: decode speed of a ~37B model, footprint of a 671B model
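
The GPU count in step 4 is a ceiling division of weights-only VRAM by per-GPU memory. The sketch below repeats it for each precision; `gpus_to_load` is a name introduced for this example, and the 80 GB default simply mirrors the GPU size used above.

```python
import math

# Bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def gpus_to_load(total_b: float, precision: str, gpu_gb: float = 80.0) -> int:
    """Minimum GPU count whose combined memory holds the weights alone."""
    vram_gb = total_b * BYTES_PER_PARAM[precision]
    return math.ceil(vram_gb / gpu_gb)


# DeepSeek-V3: 671B total parameters.
for precision in ("fp16", "fp8", "int4"):
    vram_gb = 671 * BYTES_PER_PARAM[precision]
    print(f"{precision}: {vram_gb} GB -> {gpus_to_load(671, precision)} x 80 GB GPUs")
```

Treat the result as a lower bound: it covers weights only, so KV cache, activations and runtime overhead can push the practical count higher.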

Custom config — exercises the formula

46.5B total · 8 experts · top-2 · 2.5B shared · FP16

  1. Per-expert = (46.5 − 2.5) / 8 = 44 / 8 = 5.5B
  2. Active = 2.5 + 2 × 5.5 = 13.5B
  3. Active % = 13.5 / 46.5 × 100 = 29.0%
  4. VRAM FP16 = 46.5 × 2 = 93.0 GB
  5. All four outputs reconcile with the methodology formulas ✓


Sources & references

Preset total and active counts are the published figures from each vendor's model card or paper, last cross-checked on 2026-07-01. Grok-1's active count is the vendor's approximate figure. VRAM is weights-only and excludes KV cache, activations and runtime overhead.
