LLM GPU Memory (VRAM) Calculator
Work out the GPU VRAM you need to run or fine-tune an open large language model — at any precision, context length, and batch size — and see whether it fits your RTX, T4, or A100 before you download a single weight. Runs entirely in your browser.
How it works
Every figure is computed in bytes from published transformer-memory formulas, then shown in GiB (1 GiB = 2³⁰ bytes). The calculator never downloads or runs a model; it reads each model's parameter count and attention shape from its official model card and applies the closed-form equations below.
1. Model weights
Weight memory is the parameter count P times bytes per weight: FP32 = 4, FP16/BF16 = 2, INT8 = 1, and INT4/Q4 = 0.5. So an 8B model needs 16 GB (about 14.9 GiB) at FP16 but only 4 GB (about 3.7 GiB) at 4-bit. (EleutherAI, Transformer Math 101.)
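As a rough illustration of the weight formula, the sketch below multiplies a parameter count by the bytes-per-weight values listed above and converts to GiB. The function and table names are made up for this example and are not the calculator's own code; the parameter count is a nominal 8e9 rather than an exact model-card figure.

```python
GIB = 2**30  # bytes per GiB

# Bytes per weight at each precision, as listed in the text above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, precision: str) -> float:
    """Weight memory in bytes: parameter count times bytes per weight."""
    return num_params * BYTES_PER_WEIGHT[precision]


if __name__ == "__main__":
    params = 8e9  # nominal 8B model (assumption, not an exact count)
    for precision in ("fp16", "int4"):
        print(f"{precision}: {weight_bytes(params, precision) / GIB:.1f} GiB")
```

With a nominal 8e9 parameters this gives about 14.9 GiB at FP16 (16 GB in decimal units) and about 3.7 GiB at 4-bit.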
2. KV cache (inference)
During generation the model caches a key and a value vector for every token in every layer: 2 × batch × tokens × layers × (KV heads × head dim) × 2 bytes, where the leading 2 counts keys and values and the trailing 2 bytes assumes a 16-bit cache. Grouped-query attention (GQA), used by Llama 3, Mistral, and Qwen, shares each KV head across several query heads, which is why their cache stays small even at long context. DeepSeek 7B keeps full multi-head attention, so its cache is larger than that of a GQA model of similar size.
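To make the GQA effect concrete, here is a minimal sketch of the KV-cache formula. The function name is hypothetical. The first shape (32 layers, 8 KV heads, head dim 128) is what the Llama 3 8B model card is understood to list, so treat it as an assumption; the second is an invented same-shape model with full multi-head attention, used only for comparison.

```python
GIB = 2**30  # bytes per GiB


def kv_cache_bytes(batch: int, tokens: int, layers: int,
                   kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 counts one key and one value."""
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Assumed Llama 3 8B shape: 32 layers, 8 KV heads (GQA), head dim 128.
    gqa = kv_cache_bytes(batch=1, tokens=8192, layers=32, kv_heads=8, head_dim=128)
    # Hypothetical same-shape model with full multi-head attention (32 KV heads).
    mha = kv_cache_bytes(batch=1, tokens=8192, layers=32, kv_heads=32, head_dim=128)
    print(f"GQA: {gqa / GIB:.1f} GiB, MHA: {mha / GIB:.1f} GiB")
```

At 8,192 tokens and batch size 1 the GQA shape needs 1 GiB of cache, while the full multi-head variant needs 4 GiB, a ratio that matches the 4× difference in KV heads.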
3. Training memory
Each trainable parameter under mixed-precision AdamW costs 16 bytes: a 2-byte FP16 weight, a 2-byte FP16 gradient, a 4-byte FP32 master copy, and 8 bytes for Adam's two FP32 moment estimates (momentum and variance). 8-bit Adam shrinks the optimizer states from 8 bytes to 2, giving 10 bytes/param; plain SGD drops them entirely, giving 8 bytes/param. Full fine-tuning applies this to every parameter (≈120 GiB for 8B). LoRA freezes the base in your chosen precision and trains a small adapter (~0.5% of parameters) at those rates. QLoRA freezes the base at 4-bit and trains the adapter on top, which is how an 8B fine-tune drops to single-digit GiB. (EleutherAI; Dettmers et al., QLoRA.)
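The per-parameter rates above can be turned into a small sketch. Function names are hypothetical, the adapter is assumed to be exactly 0.5% of the base parameter count, and the results cover weights, gradients, and optimizer states only (no activations or framework overhead).

```python
GIB = 2**30  # bytes per GiB

# Bytes per trainable parameter, as described in the text above.
TRAIN_BYTES_PER_PARAM = {"adamw": 16, "adam8bit": 10, "sgd": 8}
# Bytes per frozen weight at each base precision.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def full_finetune_bytes(num_params: float, optimizer: str = "adamw") -> float:
    """Every parameter is trainable and pays the full per-parameter rate."""
    return num_params * TRAIN_BYTES_PER_PARAM[optimizer]


def lora_bytes(num_params: float, base_precision: str,
               optimizer: str = "adamw", adapter_fraction: float = 0.005) -> float:
    """Frozen base weights plus a small trainable adapter (fraction is assumed)."""
    frozen = num_params * BYTES_PER_WEIGHT[base_precision]
    adapter = num_params * adapter_fraction * TRAIN_BYTES_PER_PARAM[optimizer]
    return frozen + adapter


if __name__ == "__main__":
    params = 8e9  # nominal 8B model (assumption)
    print(f"Full AdamW: {full_finetune_bytes(params) / GIB:.1f} GiB")
    print(f"LoRA, FP16 base: {lora_bytes(params, 'fp16') / GIB:.1f} GiB")
    print(f"QLoRA, 4-bit base: {lora_bytes(params, 'int4') / GIB:.1f} GiB")
```

Under these assumptions a nominal 8B model comes to roughly 119.2 GiB for full AdamW fine-tuning, 15.5 GiB for LoRA on an FP16 base, and 4.3 GiB for QLoRA, before activations and overhead are added.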
4. Activations & overhead
For training, activation memory with gradient checkpointing is estimated as 2 × batch × tokens × hidden size × layers, plus a fixed ~1 GiB CUDA-context floor. For inference the tool follows Hugging Face's rule of thumb and adds the larger of 1 GiB or 20% of the weight size. Because real frameworks vary, the headline shows a ±15% planning range — size for the top of it.
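A minimal sketch of these overhead rules follows. The names are illustrative, and the activation product is assumed to be in bytes, which the text above does not state explicitly. The ±15% range is applied to whatever total you pass in.

```python
GIB = 2**30  # bytes per GiB


def training_overhead_bytes(batch: int, tokens: int, hidden_size: int,
                            layers: int, cuda_floor_bytes: int = GIB) -> int:
    """Checkpointed activation estimate plus the fixed CUDA-context floor."""
    activations = 2 * batch * tokens * hidden_size * layers  # assumed to be bytes
    return activations + cuda_floor_bytes


def inference_overhead_bytes(weight_bytes: float) -> float:
    """The larger of 1 GiB and 20% of the weight size."""
    return max(GIB, 0.20 * weight_bytes)


def planning_range(total_bytes: float, margin: float = 0.15) -> tuple[float, float]:
    """Low and high ends of the planning range around an estimate."""
    return total_bytes * (1 - margin), total_bytes * (1 + margin)


if __name__ == "__main__":
    # Hypothetical shape: batch 1, 2,048 tokens, hidden size 4096, 32 layers.
    train = training_overhead_bytes(batch=1, tokens=2048, hidden_size=4096, layers=32)
    print(f"Training overhead: {train / GIB:.2f} GiB")
    low, high = planning_range(train)
    print(f"Planning range: {low / GIB:.2f} to {high / GIB:.2f} GiB")
```

Note that for small or heavily quantized models the 1 GiB floor dominates: 20% of a 4 GB weight file is only 0.8 GB, so the inference overhead stays at 1 GiB.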
5. Fit check & cross-check
The total is compared against the chosen GPU's VRAM; if it overflows, the tool reports how many cards you need. As an independent check, the same inference estimate is compared with Hugging Face's ≈ 1.2 × model size heuristic and the two are shown side by side, the same idea as cross-checking a tax figure against the regulator's own formula.
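The fit check and cross-check might look something like the sketch below. It assumes the card count is a simple ceiling division of the total by one card's VRAM, which ignores any multi-GPU communication or sharding overhead; the tool's exact rule may differ. GPU capacities are passed in as GiB and the example values are placeholders, not verified specifications.

```python
import math

GIB = 2**30  # bytes per GiB


def cards_needed(total_bytes: float, gpu_vram_gib: float) -> int:
    """Smallest whole number of cards whose combined VRAM covers the total."""
    return max(1, math.ceil(total_bytes / (gpu_vram_gib * GIB)))


def hf_heuristic_bytes(weight_bytes: float) -> float:
    """Cross-check figure: roughly 1.2 times the weight size."""
    return 1.2 * weight_bytes


if __name__ == "__main__":
    weights = 16e9              # nominal 8B model at FP16 (assumption)
    estimate = weights + 3.2e9  # weights plus 20% overhead, KV cache omitted
    print(f"Estimate: {estimate / GIB:.1f} GiB")
    print(f"HF heuristic: {hf_heuristic_bytes(weights) / GIB:.1f} GiB")
    print(f"Cards at 16 GiB each: {cards_needed(estimate, 16)}")
```

With the KV cache left out, the estimate and the heuristic coincide by construction; the two figures diverge once a long context or large batch adds a sizeable cache on top.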
Sources & references
- EleutherAI — Transformer Math 101 (inference, KV cache, and training memory formulas)
- Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs
- Hugging Face — Accelerate Model Memory Estimator (cross-check multipliers)
- Meta — Llama 3 8B model card (parameters, layers, GQA config)
Model parameter counts, attention shapes, and GPU VRAM capacities were last verified against the sources above on 2026-06-05. Estimates are guidance, not a guarantee — plan for the upper end of the ±15% range.