induwara.lk

LLM GPU Memory (VRAM) Calculator

Work out the GPU VRAM you need to run or fine-tune an open large language model — at any precision, context length, and batch size — and see whether it fits your RTX, T4, or A100 before you download a single weight. Runs entirely in your browser.

By Induwara Ashinsana · Updated Jun 5, 2026
GPU memory estimate
Formulas cited · EleutherAI + QLoRA
Total estimated VRAM: 5.24 GiB
Planning range: 4.45 GiB to 6.03 GiB (±15%)
Fit check

Fits on RTX 4090 24 GB with 18.8 GiB headroom.

Memory breakdown

Component                        Memory     Share
Model weights                    3.74 GiB   71.37%
KV cache                         0.50 GiB    9.54%
Activations & runtime overhead   1.00 GiB   19.09%
Total                            5.24 GiB  100%

Cross-check: Hugging Face's rule of thumb (inference ≈ 1.2× weights, KV cache excluded) gives 4.49 GiB for the weights+overhead share — in line with the 4.74 GiB this tool computes for those two rows.

Same setup at FP16 / INT8 / INT4

Precision   Bytes/param   Total VRAM   Fit check
FP16        2             18.4 GiB     Fits RTX 4090 24 GB
INT8        1             9.47 GiB     Fits RTX 4090 24 GB
INT4        0.5           5.24 GiB     Fits RTX 4090 24 GB

Quantizing from FP16 to INT4 roughly quarters the weight memory — the single biggest lever for fitting a bigger model on a smaller card.

Estimates use closed-form formulas from EleutherAI's Transformer Math 101 and the QLoRA paper, cross-checked against Hugging Face's memory estimator. Real usage varies ±10–20% with kernels, fragmentation, and framework overhead. Model specs verified from official model cards.

How it works

Every figure is computed in bytes from published transformer-memory formulas, then shown in GiB (1 GiB = 2³⁰ bytes). The calculator never downloads or runs a model; it reads each model's parameter count and attention shape from its official model card and applies the closed-form equations below.

1. Model weights

Weight memory is the parameter count P times bytes-per-weight: FP32 = 4, FP16/BF16 = 2, INT8 = 1, and INT4/Q4 = 0.5. So an 8B model is ~15 GiB (16 GB) at FP16 but only ~3.7 GiB (4 GB) at 4-bit. (EleutherAI, Transformer Math 101.)
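The weight formula is small enough to sketch directly. The snippet below is a minimal illustration, not the calculator's own source: the function name `weight_memory_gib` and the lookup table `BYTES_PER_WEIGHT` are made up for this example, and the 8.03B parameter count is the Llama 3 8B figure used in the worked examples.

```python
GIB = 2**30  # bytes per GiB

# Bytes per weight at each precision, as listed above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(params: float, precision: str) -> float:
    """Weight memory in GiB: parameter count times bytes per weight."""
    return params * BYTES_PER_WEIGHT[precision] / GIB


# Llama 3 8B (8.03B parameters) at three precisions.
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gib(8.03e9, precision):.2f} GiB")
```

Dividing by 2³⁰ rather than 10⁹ is what turns the familiar "16 GB at FP16" into roughly 15 GiB.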

2. KV cache (inference)

During generation the model caches a key and a value for every token in every layer: 2 × batch × tokens × layers × (KV heads × head dim) × 2 bytes. The leading 2 counts the key and the value; the trailing 2 bytes is the size of one FP16 cache entry. Grouped-query attention (GQA), used by Llama 3, Mistral, and Qwen, uses far fewer KV heads than query heads, which is why their cache stays small even at long context. DeepSeek 7B keeps full multi-head attention, so its cache is larger than that of a GQA model of the same size.
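To make the effect of GQA concrete, here is a small sketch of the KV-cache formula. The function name `kv_cache_bytes` is hypothetical. The first call uses the Llama 3 8B shape from the worked examples (32 layers, 8 KV heads, head dim 128); the second keeps everything the same but assumes 32 KV heads, an illustrative full multi-head configuration rather than any specific model's published shape.

```python
GIB = 2**30


def kv_cache_bytes(batch, tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in bytes; bytes_per_value=2 assumes an FP16 cache."""
    # Leading 2: one key plus one value per token, per layer.
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_value


gqa = kv_cache_bytes(batch=1, tokens=4096, layers=32, kv_heads=8, head_dim=128)
mha = kv_cache_bytes(batch=1, tokens=4096, layers=32, kv_heads=32, head_dim=128)

print(f"GQA (8 KV heads):  {gqa / GIB:.2f} GiB")
print(f"MHA (32 KV heads): {mha / GIB:.2f} GiB")
```

With everything else fixed, the cache scales linearly with the number of KV heads, so the 32-head variant needs four times the memory of the 8-head one.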

3. Training memory

Each trainable parameter under mixed-precision AdamW costs 16 bytes: a 2-byte FP16 weight, a 2-byte FP16 gradient, a 4-byte FP32 master copy, and 8 bytes for Adam's momentum and variance (4 bytes each). 8-bit Adam shrinks those two optimizer states to 1 byte each, for 10 bytes/param; plain SGD drops them entirely, for 8 bytes/param. Full fine-tuning applies this to every parameter (≈120 GiB for 8B). LoRA freezes the base in your chosen precision and trains a small adapter (~0.5% of parameters) at those rates. QLoRA freezes the base at 4-bit and trains the adapter on top, which is how an 8B fine-tune drops to single-digit GiB. (EleutherAI; Dettmers et al., QLoRA.)
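The per-parameter accounting above can be sketched as follows. This is an illustration under the stated rates, not the tool's actual code: the names `bytes_per_trainable_param`, `full_finetune_gib`, and `qlora_gib` are hypothetical, and the 0.5% adapter fraction is the approximate figure quoted above. Both totals cover weights, gradients, and optimizer state only; activations and the CUDA floor are left out.

```python
GIB = 2**30

# Optimizer-state bytes per trainable parameter, per the breakdown above.
OPTIMIZER_BYTES = {"adamw": 8, "adam8bit": 2, "sgd": 0}


def bytes_per_trainable_param(optimizer: str = "adamw") -> int:
    fp16_weight, fp16_grad, fp32_master = 2, 2, 4
    return fp16_weight + fp16_grad + fp32_master + OPTIMIZER_BYTES[optimizer]


def full_finetune_gib(params: float, optimizer: str = "adamw") -> float:
    """Every parameter is trainable."""
    return params * bytes_per_trainable_param(optimizer) / GIB


def qlora_gib(params: float, adapter_fraction: float = 0.005,
              optimizer: str = "adamw") -> float:
    """Frozen 4-bit base (0.5 bytes/param) plus a small trainable adapter."""
    base = params * 0.5 / GIB
    adapter = params * adapter_fraction * bytes_per_trainable_param(optimizer) / GIB
    return base + adapter


print(f"Full fine-tune, 8.03B params: {full_finetune_gib(8.03e9):.1f} GiB")
print(f"QLoRA, 8.03B params:          {qlora_gib(8.03e9):.2f} GiB")
```

The gap comes almost entirely from the base model: full fine-tuning pays 16 bytes for each of the 8.03B parameters, while QLoRA pays 0.5 bytes for the frozen base and 16 bytes only for the adapter.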

4. Activations & overhead

For training, activation memory with gradient checkpointing is estimated as 2 × batch × tokens × hidden size × layers bytes, plus a fixed ~1 GiB CUDA-context floor. For inference the tool follows Hugging Face's rule of thumb and adds the larger of 1 GiB or 20% of the weight size. Because real frameworks vary, the headline shows a ±15% planning range; size for the top of it.
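A rough sketch of these overhead rules follows. All three function names are hypothetical, and the hidden size of 4096 with 32 layers is an assumption taken from Llama 3 8B's published configuration; the page itself only reports the resulting 0.50 GiB.

```python
GIB = 2**30


def training_overhead_gib(batch, tokens, hidden_size, layers, cuda_floor_gib=1.0):
    """Checkpointed activations (in bytes, converted to GiB) plus the fixed CUDA-context floor."""
    activations = 2 * batch * tokens * hidden_size * layers / GIB
    return activations + cuda_floor_gib


def inference_overhead_gib(weights_gib):
    """The larger of 1 GiB or 20% of the weight size."""
    return max(1.0, 0.2 * weights_gib)


def planning_range(total_gib, margin=0.15):
    """Lower and upper bound of the planning range around an estimate."""
    return total_gib * (1 - margin), total_gib * (1 + margin)


print(training_overhead_gib(batch=1, tokens=2048, hidden_size=4096, layers=32))
print(inference_overhead_gib(3.74))    # small model: the 1 GiB floor applies
print(inference_overhead_gib(131.4))   # large model: the 20% rule applies
low, high = planning_range(5.24)
print(f"{low:.2f} to {high:.2f} GiB")
```

The `max` in the inference rule means small models always pay the 1 GiB floor, while anything above 5 GiB of weights pays the proportional 20%.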

5. Fit check & cross-check

The total is compared against the chosen GPU's VRAM; if it overflows, the tool reports how many cards you need. As an independent check, the same inference estimate is run against Hugging Face's ≈ 1.2× model size heuristic and the two are shown side by side, the same idea as cross-checking a tax figure against the regulator's own formula.
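The fit check and the cross-check reduce to a ceiling division and a single multiplication. The sketch below uses hypothetical helper names and assumes the card capacity is expressed in the same unit as the estimate, which is how the headroom figures on this page appear to be computed.

```python
import math


def cards_needed(total_gib: float, card_gib: float) -> int:
    """Smallest number of identical cards whose combined VRAM covers the total."""
    return max(1, math.ceil(total_gib / card_gib))


def fit_report(total_gib: float, card_gib: float) -> str:
    if total_gib <= card_gib:
        return f"fits with {card_gib - total_gib:.1f} GiB headroom"
    return f"needs {cards_needed(total_gib, card_gib)} cards"


def hf_rule_of_thumb_gib(weights_gib: float) -> float:
    """Hugging Face heuristic: inference memory is about 1.2x the weights (KV cache excluded)."""
    return 1.2 * weights_gib


print(fit_report(5.24, 24))     # Llama 3 8B at 4-bit on a 24 GB card
print(fit_report(167.7, 80))    # Llama 3 70B at FP16 on 80 GB cards
print(f"{hf_rule_of_thumb_gib(3.74):.2f} GiB")
```

Splitting a model across cards adds communication and duplication overhead that a plain ceiling division ignores, so the card count is best read as a lower bound.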

Worked examples

Run Llama 3 8B at 4-bit on a 12 GB card

  1. Weights: 8.03B × 0.5 bytes = 4.015 GB = 3.74 GiB
  2. KV cache: 2 × 1 × 4096 × 32 layers × 1024 (8 KV heads × 128) × 2 = 536,870,912 B = 0.50 GiB
  3. Activations & overhead: max(1 GiB, 0.2 × 3.74) = 1.00 GiB
  4. Total ≈ 5.24 GiB → fits an RTX 3060 12 GB with ~6.8 GiB to spare ✅

Full fine-tune vs. QLoRA — Llama 3 8B, AdamW, 2048 tokens

  1. Full: weights 2×8.03B + grads 2×8.03B + optimizer 12×8.03B = 119.7 GiB
  2. Full: + activations 0.50 GiB + 1 GiB CUDA = 121.2 GiB → ceil(121.2 / 80) = 2× A100 80 GB ❌ on one card
  3. QLoRA: 4-bit base 3.74 GiB + adapter weights/grads/optimizer ≈ 0.60 GiB
  4. QLoRA: + activations 0.50 GiB + 1 GiB CUDA = 5.84 GiB → fits a free Colab T4 16 GB ✅

Edge case — Llama 3 70B at FP16 with 32k context

  1. Weights: 70.55B × 2 bytes = 131.4 GiB
  2. KV cache: 2 × 1 × 32768 × 80 layers × 1024 × 2 = 10,737,418,240 B = 10.0 GiB
  3. Activations & overhead: max(1 GiB, 0.2 × 131.4) = 26.3 GiB
  4. Total ≈ 167.7 GiB → ceil(167.7 / 80) = 3× A100 80 GB
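The inference examples above all follow the same three-step recipe, which can be sketched end to end. The `ModelSpec` dataclass and `inference_estimate_gib` function are hypothetical names for illustration; the parameter counts and attention shapes are the ones quoted in the worked examples, and the KV cache is assumed to be FP16 (2 bytes per value).

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass
class ModelSpec:
    params: float   # total parameter count
    layers: int
    kv_heads: int
    head_dim: int


def inference_estimate_gib(spec: ModelSpec, bytes_per_weight: float,
                           tokens: int, batch: int = 1) -> float:
    """Weights + FP16 KV cache + max(1 GiB, 20% of weights), in GiB."""
    weights = spec.params * bytes_per_weight / GIB
    kv_cache = 2 * batch * tokens * spec.layers * spec.kv_heads * spec.head_dim * 2 / GIB
    overhead = max(1.0, 0.2 * weights)
    return weights + kv_cache + overhead


llama3_8b = ModelSpec(params=8.03e9, layers=32, kv_heads=8, head_dim=128)
llama3_70b = ModelSpec(params=70.55e9, layers=80, kv_heads=8, head_dim=128)

print(f"8B, INT4, 4k context:   {inference_estimate_gib(llama3_8b, 0.5, 4096):.2f} GiB")
print(f"70B, FP16, 32k context: {inference_estimate_gib(llama3_70b, 2, 32768):.1f} GiB")
```

Keeping the model shape in one small record makes it easy to rerun the same estimate at a different precision or context length without touching the formula.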

Sources & references

Model parameter counts, attention shapes, and GPU VRAM capacities were last verified against the sources above on 2026-06-05. Estimates are guidance, not a guarantee — plan for the upper end of the ±15% range.
