How much VRAM do I need to run a 7B or 8B model?

At 4-bit (Q4) an 8B model needs about 5–6 GiB for short contexts — it fits a 12 GB card or a free T4. At FP16 the same model needs roughly 18–20 GiB, so you want a 24 GB RTX 4090. The KV cache adds memory as your context length grows, which the calculator shows as its own row.

How much GPU memory does fine-tuning an LLM take?

Full fine-tuning with AdamW needs about 16 bytes per parameter — roughly 120 GiB for an 8B model (weights + gradients + fp32 master + optimizer states), so it spans 2× A100 80 GB. LoRA and QLoRA only train a tiny adapter, dropping the requirement by an order of magnitude.

What is QLoRA and how much VRAM does it save?

QLoRA freezes the base model at 4-bit (NF4) and trains small LoRA adapters on top. For an 8B model that is roughly 6 GiB total instead of ~120 GiB for full fine-tuning — which is why QLoRA fits a free Colab or Kaggle T4 (16 GB) while full fine-tuning does not.

Can I run an LLM on an RTX 3060 12 GB or a free Colab T4?

Yes for inference of 7B–8B models at INT4/Q4 (about 5–6 GiB), and for QLoRA fine-tuning of those models on a T4 (about 6–10 GiB). A 13B–14B model at INT4 also fits with care. FP16 inference of 8B (~18 GiB) and any full fine-tune will not fit these cards.

Why is the answer an estimate and not exact?

Real VRAM use depends on the framework, attention kernel (FlashAttention vs. eager), memory fragmentation, and CUDA context — typically ±10–20% around the formula. The calculator shows a planning range and a CUDA-overhead row so you size for the upper bound, not the floor.

Does this download or run the model?

No. Every number is pure arithmetic computed in your browser from published parameter counts and transformer memory formulas. Nothing is uploaded, downloaded, or executed — you get the answer before spending bandwidth on a model that might not fit.

Which models and GPUs are supported?

Presets cover Llama 3 8B/70B, Mistral 7B, Qwen2.5 7B/14B, Gemma 2 9B, and DeepSeek 7B, plus a Custom mode where you enter parameters and attention shape. GPUs include the RTX 3060 12 GB, RTX 4090 24 GB, T4 16 GB, and A100 40/80 GB. Closed models (GPT-4, Claude, Gemini) are excluded because their sizes are unpublished.


LLM GPU Memory (VRAM) Calculator

Work out the GPU VRAM you need to run or fine-tune an open large language model — at any precision, context length, and batch size — and see whether it fits your RTX, T4, or A100 before you download a single weight. Runs entirely in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 5, 2026

GPU memory estimate

Formulas cited · EleutherAI + QLoRA

Inputs: model, mode, precision / quantization, target GPU, context length (tokens), batch size

Total estimated VRAM

5.24 GiB

Planning range 4.45 GiB – 6.03 GiB (±15%)

Fit check

Fits on RTX 4090 24 GB with 18.8 GiB headroom.

Memory breakdown

Component	Memory	Share
Model weights	3.74 GiB	71.37%
KV cache	0.50 GiB	9.54%
Activations & runtime overhead	1.00 GiB	19.09%
Total	5.24 GiB	100%

Cross-check: Hugging Face's rule of thumb (inference ≈ 1.2× weights, KV cache excluded) gives 4.49 GiB for the weights+overhead share — in line with the 4.74 GiB this tool computes for those two rows.
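The breakdown and cross-check above can be reproduced with a few lines of arithmetic. The sketch below assumes the setup from the first worked example on this page (Llama 3 8B at INT4, 4096 tokens, batch size 1) and applies the formulas described under How it works; the variable names are illustrative and this is not the tool's actual source code.

```python
GIB = 2**30  # 1 GiB = 2^30 bytes

# Assumed setup: Llama 3 8B (8.03B params), INT4, 4096 tokens, batch 1.
weights = 8.03e9 * 0.5 / GIB                      # 0.5 bytes per parameter
kv_cache = 2 * 1 * 4096 * 32 * (8 * 128) * 2 / GIB  # 32 layers, 8 KV heads x 128
overhead = max(1.0, 0.2 * weights)                # larger of 1 GiB or 20% of weights

rows = {
    "Model weights": weights,
    "KV cache": kv_cache,
    "Activations & runtime overhead": overhead,
}
total = sum(rows.values())

for name, gib in rows.items():
    print(f"{name}\t{gib:.2f} GiB\t{100 * gib / total:.2f}%")
print(f"Total\t{total:.2f} GiB")

# Hugging Face rule of thumb: inference ~ 1.2 x weights (KV cache excluded).
hf_rule = 1.2 * weights
tool_weights_plus_overhead = weights + overhead
print(f"HF rule: {hf_rule:.2f} GiB vs tool: {tool_weights_plus_overhead:.2f} GiB")
```

The two figures differ here because the 1 GiB floor exceeds 20% of the 4-bit weights; for larger weight sizes the two rules coincide.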

Same setup at FP16 / INT8 / INT4

Precision	Bytes per parameter	Total VRAM	Fit check
FP16	2	18.4 GiB	Fits RTX 4090 24 GB
INT8	1	9.47 GiB	Fits RTX 4090 24 GB
INT4	0.5	5.24 GiB	Fits RTX 4090 24 GB

Quantizing from FP16 to INT4 roughly quarters the weight memory — the single biggest lever for fitting a bigger model on a smaller card.

Estimates use closed-form formulas from EleutherAI's Transformer Math 101 and the QLoRA paper, cross-checked against Hugging Face's memory estimator. Real usage varies ±10–20% with kernels, fragmentation, and framework overhead. Model specs verified from official model cards.

How it works

Every figure is computed in bytes from published transformer-memory formulas, then shown in GiB (1 GiB = 2³⁰ bytes). The calculator never downloads or runs a model; it reads each model's parameter count and attention shape from its official model card and applies the closed-form equations below.

1. Model weights

Weight memory is the parameter count P times bytes-per-weight: FP32 = 4, FP16/BF16 = 2, INT8 = 1, and INT4/Q4 = 0.5. So an 8B model (8.03B parameters) is ~15 GiB (about 16 GB) at FP16 but only ~3.7 GiB at 4-bit. (EleutherAI, Transformer Math 101.)
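A minimal sketch of the weight formula, assuming the 8.03B parameter count used for Llama 3 8B in the worked examples on this page. The function name `weight_gib` and the lookup table are illustrative choices, not the tool's own code.

```python
# Bytes per weight for each precision, as listed in the text.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
GIB = 2**30  # 1 GiB = 2^30 bytes


def weight_gib(params: float, precision: str) -> float:
    """Weight memory in GiB: parameter count times bytes per weight."""
    return params * BYTES_PER_WEIGHT[precision] / GIB


if __name__ == "__main__":
    for precision in ("FP16", "INT8", "INT4"):
        print(f"{precision}: {weight_gib(8.03e9, precision):.2f} GiB")
```

For 8.03B parameters this gives roughly 14.96 GiB at FP16 (16.06 GB in decimal units), 7.48 GiB at INT8, and 3.74 GiB at INT4, so the FP16-to-INT4 ratio is exactly 4.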

2. KV cache (inference)

During generation the model caches a key and a value for every token in every layer: 2 × batch × tokens × layers × (KV heads × head dim) × 2 bytes. The leading 2 counts the key and the value; the trailing 2 bytes is the size of one 16-bit cache entry. Grouped-query attention (GQA), used by Llama 3, Mistral, and Qwen, has far fewer KV heads than query heads, which is why their cache stays small even at long context. DeepSeek 7B keeps full multi-head attention, so its cache is larger than that of a GQA model of similar size at the same context length.
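To make the effect of GQA concrete, here is a small sketch of the KV-cache formula. The first call uses the Llama 3 8B shape from the worked examples (32 layers, 8 KV heads, head dim 128); the second is a hypothetical variant with 32 KV heads, included only to illustrate full multi-head attention, and is not the configuration of any specific model on this page.

```python
GIB = 2**30


def kv_cache_bytes(batch: int, tokens: int, layers: int,
                   kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 is one key plus one value."""
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_value


# Llama 3 8B shape with grouped-query attention (8 KV heads).
gqa = kv_cache_bytes(batch=1, tokens=4096, layers=32, kv_heads=8, head_dim=128)
# Hypothetical comparison: same shape but 32 KV heads (full multi-head attention).
mha = kv_cache_bytes(batch=1, tokens=4096, layers=32, kv_heads=32, head_dim=128)

print(f"GQA: {gqa / GIB:.2f} GiB, MHA: {mha / GIB:.2f} GiB")
```

With these inputs the GQA cache is 0.50 GiB and the hypothetical multi-head variant is 2.00 GiB, a 4× difference that comes entirely from the KV head count.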

3. Training memory

Each trainable parameter under mixed-precision AdamW costs 16 bytes: a 2-byte FP16 weight, a 2-byte FP16 gradient, a 4-byte FP32 master copy, and 8 bytes for Adam's two FP32 moment estimates (momentum and variance). 8-bit Adam shrinks the optimizer states to 2 bytes, for 10 bytes/param; plain SGD drops them entirely, for 8 bytes/param. Full fine-tuning applies this to every parameter (≈120 GiB for 8B). LoRA freezes the base in your chosen precision and trains a small adapter (~0.5% of parameters) at the same per-parameter rates. QLoRA freezes the base at 4-bit and trains the adapter on top, which is how an 8B fine-tune drops to single-digit GiB. (EleutherAI; Dettmers et al., QLoRA.)
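The per-parameter accounting can be written out as a short sketch. The optimizer keys (`adamw`, `adam8bit`, `sgd`) are illustrative labels chosen for this example, not the tool's option names.

```python
GIB = 2**30

# Optimizer-state bytes per trainable parameter, per the text:
# AdamW keeps two FP32 moments (8 bytes), 8-bit Adam keeps 2 bytes, SGD keeps none.
OPTIMIZER_STATE_BYTES = {"adamw": 8, "adam8bit": 2, "sgd": 0}


def train_bytes_per_param(optimizer: str) -> int:
    """FP16 weight (2) + FP16 gradient (2) + FP32 master copy (4) + optimizer state."""
    return 2 + 2 + 4 + OPTIMIZER_STATE_BYTES[optimizer]


def full_finetune_gib(params: float, optimizer: str = "adamw") -> float:
    """Weights, gradients, master copy and optimizer state for every parameter."""
    return params * train_bytes_per_param(optimizer) / GIB


if __name__ == "__main__":
    for name in OPTIMIZER_STATE_BYTES:
        print(f"{name}: {train_bytes_per_param(name)} bytes/param, "
              f"{full_finetune_gib(8.03e9, name):.1f} GiB for 8.03B params")
```

These totals exclude activations and the CUDA-context floor, which are added separately.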

4. Activations & overhead

For training, activation memory with gradient checkpointing is estimated as 2 × batch × tokens × hidden size × layers bytes, plus a fixed ~1 GiB CUDA-context floor. For inference the tool follows Hugging Face's rule of thumb and adds the larger of 1 GiB or 20% of the weight size. Because real frameworks vary, the headline shows a ±15% planning range; size for the top of it.
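The overhead rules and the planning range are simple enough to sketch directly. The function names below are hypothetical, and the hidden size of 4096 used in the usage lines is the Llama 3 8B value assumed from its model card.

```python
GIB = 2**30
CUDA_FLOOR_GIB = 1.0  # fixed CUDA-context floor from the text


def training_activations_gib(batch: int, tokens: int,
                             hidden_size: int, layers: int) -> float:
    """Checkpointed activations (2 x batch x tokens x hidden x layers bytes) plus the CUDA floor."""
    return 2 * batch * tokens * hidden_size * layers / GIB + CUDA_FLOOR_GIB


def inference_overhead_gib(weights_gib: float) -> float:
    """Larger of 1 GiB or 20% of the weight size."""
    return max(1.0, 0.2 * weights_gib)


def planning_range(total_gib: float, tolerance: float = 0.15) -> tuple[float, float]:
    """Lower and upper planning bounds around the headline estimate."""
    return total_gib * (1 - tolerance), total_gib * (1 + tolerance)


if __name__ == "__main__":
    print(training_activations_gib(1, 2048, 4096, 32))  # 0.5 GiB activations + 1 GiB floor
    print(inference_overhead_gib(3.74))                 # the 1 GiB floor applies
    low, high = planning_range(5.24)
    print(f"{low:.2f} GiB to {high:.2f} GiB")
```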

5. Fit check & cross-check

The total is compared against the chosen GPU's VRAM; if it overflows, the tool reports how many cards you need. As an independent check, the same inference estimate is run against Hugging Face's ≈ 1.2× model size heuristic and the two are shown side by side, the same idea as cross-checking a tax figure against the regulator's own formula.
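A sketch of the fit check alone, under one stated assumption: the card's nominal capacity (for example 24 GB) is treated as the same number of GiB, which is what the headroom figures on this page imply. The function name and return shape are illustrative.

```python
import math


def fit_check(total_gib: float, gpu_gib: float) -> tuple[bool, float, int]:
    """Return (fits on one card, headroom in GiB, number of cards needed)."""
    cards = max(1, math.ceil(total_gib / gpu_gib))
    return total_gib <= gpu_gib, gpu_gib - total_gib, cards


if __name__ == "__main__":
    fits, headroom, cards = fit_check(5.24, 24)
    print(f"fits={fits}, headroom={headroom:.1f} GiB, cards={cards}")
    print(fit_check(121.2, 80))  # overflows one 80 GB card
```

A negative headroom means the estimate overflows a single card, and the card count is the ceiling of the total divided by the per-card capacity. This ignores any extra memory that multi-GPU sharding itself may need.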

Worked examples

Run Llama 3 8B at 4-bit on a 12 GB card

Weights: 8.03B × 0.5 bytes = 4.015 GB = 3.74 GiB
KV cache: 2 × 1 × 4096 × 32 layers × 1024 (8 KV heads × 128) × 2 = 536,870,912 B = 0.50 GiB
Activations & overhead: max(1 GiB, 0.2 × 3.74) = 1.00 GiB
Total ≈ 5.24 GiB → fits an RTX 3060 12 GB with ~6.8 GiB to spare ✅

Full fine-tune vs. QLoRA — Llama 3 8B, AdamW, 2048 tokens

Full: weights 2×8.03B + grads 2×8.03B + FP32 master and Adam states 12×8.03B = 119.7 GiB
Full: + activations 0.50 GiB + 1 GiB CUDA = 121.2 GiB → ceil(121.2 / 80) = 2× A100 80 GB ❌ on one card
QLoRA: 4-bit base 3.74 GiB + adapter weights/grads/optimizer ≈ 0.60 GiB
QLoRA: + activations 0.50 GiB + 1 GiB CUDA = 5.84 GiB → fits a free Colab T4 16 GB ✅
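The comparison above can be checked with a short script. It assumes an adapter of 0.5% of the base parameters trained at 16 bytes per parameter, as described under How it works, and a hidden size of 4096 with 32 layers for Llama 3 8B; the constant names are illustrative.

```python
GIB = 2**30
PARAMS = 8.03e9           # Llama 3 8B
ADAPTER_FRACTION = 0.005  # ~0.5% of parameters are trainable under (Q)LoRA
ADAMW_BYTES = 16          # bytes per trainable parameter with mixed-precision AdamW
CUDA_FLOOR_GIB = 1.0

# Checkpointed activations: 2 x batch x tokens x hidden size x layers bytes.
activations = 2 * 1 * 2048 * 4096 * 32 / GIB

# Full fine-tuning: every parameter is trainable.
full = PARAMS * ADAMW_BYTES / GIB + activations + CUDA_FLOOR_GIB

# QLoRA: frozen 4-bit base plus a small trainable adapter.
base_4bit = PARAMS * 0.5 / GIB
adapter = PARAMS * ADAPTER_FRACTION * ADAMW_BYTES / GIB
qlora = base_4bit + adapter + activations + CUDA_FLOOR_GIB

print(f"Full fine-tune: {full:.1f} GiB")
print(f"QLoRA: {qlora:.2f} GiB (adapter {adapter:.2f} GiB)")
print(f"Ratio: {full / qlora:.0f}x")
```

Under these assumptions the full fine-tune needs roughly 21 times the memory of QLoRA.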

Edge case — Llama 3 70B at FP16 with 32k context

Weights: 70.55B × 2 bytes = 131.4 GiB
KV cache: 2 × 1 × 32768 × 80 layers × 1024 × 2 = 10,737,418,240 B = 10.0 GiB
Activations & overhead: max(1 GiB, 0.2 × 131.4) = 26.3 GiB
Total ≈ 167.7 GiB → ceil(167.7 / 80) = 3× A100 80 GB
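The same arithmetic shows why this counts as an edge case: under the page's formulas, context length alone changes the card count. The sketch below reruns the 70B estimate at two context lengths; the 4096-token run is an extra illustration that does not appear in the example above, and the function name is hypothetical.

```python
import math

GIB = 2**30
PARAMS = 70.55e9   # Llama 3 70B
LAYERS = 80
KV_WIDTH = 1024    # KV heads x head dim, as in the example above
GPU_GIB = 80       # A100 80 GB, treated as 80 GiB


def estimate(tokens: int, batch: int = 1) -> tuple[float, int]:
    """Return (total GiB, cards needed) for FP16 inference."""
    weights = PARAMS * 2 / GIB
    kv_cache = 2 * batch * tokens * LAYERS * KV_WIDTH * 2 / GIB
    overhead = max(1.0, 0.2 * weights)
    total = weights + kv_cache + overhead
    return total, math.ceil(total / GPU_GIB)


for tokens in (4096, 32768):
    total, cards = estimate(tokens)
    print(f"{tokens} tokens: {total:.1f} GiB -> {cards}x A100 80 GB")
```

At 4096 tokens the estimate is about 158.9 GiB and sits just under two cards; at 32k the extra 8.75 GiB of KV cache pushes it to three.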

Frequently asked questions

Sources & references

Model parameter counts, attention shapes, and GPU VRAM capacities were last verified against the sources above on 2026-06-05. Estimates are guidance, not a guarantee — plan for the upper end of the ±15% range.
