How do you calculate the size of an LLM in GB?

Multiply the parameter count by the bytes used per parameter, then divide by 10⁹ for gigabytes. Bytes per parameter = bits-per-weight ÷ 8. So a 7B model at FP16 (16 bits = 2 bytes) is 7,000,000,000 × 2 = 14,000,000,000 bytes = 14 GB. Quantization lowers the bits per weight and shrinks the file.

How much smaller is a 4-bit quantized model than FP16?

FP16 stores 16 bits per weight; a 4-bit quant stores about 4–4.83. Q4_K_M (4.83 bits) is 4.83 ÷ 16 ≈ 30% of the FP16 size — roughly a 3.3× reduction. Plain INT4 (4 bits) is exactly 25% of FP16, a 4× reduction. The comparison table on this page shows the exact percentage for any model.

What is the file size of Llama 3 70B at Q4_K_M?

About 42.3 GB. Q4_K_M averages 4.83 bits per weight, so 70,000,000,000 × 4.83 ÷ 8 = 42.26 × 10⁹ bytes ≈ 42.26 GB (about 39.4 GiB on disk). That matches the ~42–43 GB Q4_K_M GGUF files distributed for 70B models. Adding the ~3% tokenizer/metadata overhead brings it to roughly 43.5 GB.

How many bytes per parameter does FP16 use?

Two bytes. FP16 and BF16 are 16-bit formats, and 16 bits ÷ 8 = 2 bytes per parameter (the PyTorch torch.float16 element size). FP32 uses 4 bytes, FP8 and INT8 use 1 byte, and INT4 uses half a byte. GGUF k-quants fall between these because they add small per-block scale and minimum values.

How long does it take to download a 40 GB model at 25 Mbps?

About 3 hours 33 minutes. Download seconds = file bytes × 8 ÷ (speed in Mbps × 10⁶). For 40 GB: 40 × 10⁹ × 8 ÷ (25 × 10⁶) = 12,800 seconds ≈ 3h 33m. At 25 Mbps — Sri Lanka's median fixed broadband — large 70B quants take several hours, so the smaller quants are worth considering on a metered line.

Is this the same as the VRAM I need to run the model?

No. This tool gives disk and download size: the static file you store. Running the model needs those weights in GPU memory plus a KV cache, activations, and CUDA overhead, so the total is larger than the file and grows with context length. For that, use the LLM VRAM Calculator. Disk size is the floor; when the whole model is loaded onto the GPU, runtime VRAM is higher.

What is the difference between GB and GiB here?

GB is decimal (10⁹ bytes) — how Hugging Face and model cards usually quote sizes. GiB is binary (2³⁰ bytes) — what your operating system's disk meter reports. A 140 GB FP16 model shows as about 130 GiB in your file manager. This tool shows both so the number always matches whatever you are comparing against.

Why doesn't a 4-bit model come out to exactly 4 bits per weight?

GGUF k-quants (Q4_K_M, Q5_K_M, Q6_K, etc.) don't store every weight at the same width. They group weights into blocks and keep a small scale and minimum per block, plus they leave some sensitive tensors at higher precision. Averaged across the whole model that works out to about 4.83 bits for Q4_K_M, which is why the file is slightly larger than a flat 4-bit INT4.
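The per-block bookkeeping can be made concrete with a little arithmetic. The sketch below is illustrative only: the byte counts are my reading of the llama.cpp/ggml block layouts and should be treated as assumptions, the helper names are hypothetical, and the 84/16 split is a made-up mix chosen to show how an average near 4.83 can arise. It is not the real Q4_K_M recipe.

```python
def block_bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Effective bits per weight for one quantized block, scales included."""
    return block_bytes * 8 / weights_per_block


# Assumed block layouts (byte counts are assumptions, see the note above).
# Q8_0: 32 one-byte weights plus one 2-byte FP16 scale.
Q8_0_BPW = block_bits_per_weight(32 + 2, 32)
# Q4_K: 256 weights at 4 bits (128 bytes), 12 bytes of packed sub-block
# scales and minimums, and two 2-byte FP16 super-block values.
Q4_K_BPW = block_bits_per_weight(128 + 12 + 2 + 2, 256)
# Q6_K: 256 weights at 6 bits (192 bytes), 16 one-byte scales, one FP16 scale.
Q6_K_BPW = block_bits_per_weight(192 + 16 + 2, 256)


def average_bits_per_weight(mix: dict) -> float:
    """Weighted average over a {bits_per_weight: share_of_weights} mapping."""
    total_share = sum(mix.values())
    return sum(bpw * share for bpw, share in mix.items()) / total_share


# Hypothetical split: most weights in Q4_K, a sensitive minority in Q6_K.
HYPOTHETICAL_MIX = {Q4_K_BPW: 0.84, Q6_K_BPW: 0.16}

if __name__ == "__main__":
    print(Q8_0_BPW, Q4_K_BPW, Q6_K_BPW)
    print(round(average_bits_per_weight(HYPOTHETICAL_MIX), 2))
```

Under these assumptions the per-block values come out to 8.5, 4.5, and 6.5625 bits, which lines up with the 8.5 and 6.56 figures quoted for Q8_0 and Q6_K, and shows why a nominally 4-bit block already costs more than 4 bits before any tensors are kept at higher precision.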

AI Model Size & Quantization Calculator

Find the exact file size of any LLM — FP32, FP16, 8-bit, 4-bit, or a GGUF quant like Q4_K_M — and how long it takes to download on your connection. Enter a parameter count, pick a precision, done. No signup, sources cited.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 12, 2026.

Model size & download time (disk, not VRAM)

Model size (parameters)

Billion parameters. E.g. Llama 3 70B → 70, unit B.

Precision / quantization

Most popular local-LLM quant — best size/quality trade-off.

Download speed (Mbps)

25 Mbps is Sri Lanka's median fixed-broadband speed.

File size (GB): 43.53 GB, Q4_K_M (GGUF)

On disk (GiB): 40.54 GiB, what your OS reports

Download time: 3h 52m 10s at the speed above

Bits per weight: 4.83 bpw (0.604 bytes/param)

Every precision, side by side

Precision	bpw	Size (GB)	% of FP16	Download time
FP32 (full precision)	32	288.4	200%	25h 38m 8s
FP16 / BF16 (half)	16	144.2	100%	12h 49m 4s
FP8 / INT8 (8-bit)	8	72.1	50%	6h 24m 32s
INT4 (4-bit)	4	36.05	25%	3h 12m 16s
Q8_0 (GGUF)	8.5	76.61	53.13%	6h 48m 34s
Q6_K (GGUF)	6.56	59.12	41%	5h 15m 19s
Q5_K_M (GGUF)	5.67	51.1	35.44%	4h 32m 32s
Q4_K_M (GGUF, selected)	4.83	43.53	30.19%	3h 52m 10s
Q4_0 (GGUF)	4.55	41.01	28.44%	3h 38m 42s
Q3_K_M (GGUF)	3.91	35.24	24.44%	3h 7m 56s
Q2_K (GGUF)	2.96	26.68	18.5%	2h 22m 17s

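Size = parameters × bits-per-weight ÷ 8 (Hugging Face model-memory anatomy). Bits per weight come from the PyTorch dtype table and the llama.cpp GGUF quant documentation. The rows above are for a 70B model with the ~3% overhead included, downloaded at 25 Mbps. This is disk/download size, not GPU VRAM.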
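The comparison table can be reproduced from that one formula. This is a minimal sketch rather than the tool's actual code: the function names, the rounding choices, and the defaults (3% overhead, 25 Mbps) are assumptions picked to match the rows shown for a 70B model.

```python
# Bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "FP32 (full precision)": 32.0,
    "FP16 / BF16 (half)": 16.0,
    "FP8 / INT8 (8-bit)": 8.0,
    "INT4 (4-bit)": 4.0,
    "Q8_0 (GGUF)": 8.5,
    "Q6_K (GGUF)": 6.56,
    "Q5_K_M (GGUF)": 5.67,
    "Q4_K_M (GGUF)": 4.83,
    "Q4_0 (GGUF)": 4.55,
    "Q3_K_M (GGUF)": 3.91,
    "Q2_K (GGUF)": 2.96,
}


def comparison_rows(parameters: float, speed_mbps: float = 25.0,
                    overhead: float = 0.03) -> list:
    """Return (name, bpw, size_gb, pct_of_fp16, download_seconds) per precision."""
    rows = []
    for name, bpw in BITS_PER_WEIGHT.items():
        total_bytes = parameters * bpw / 8 * (1 + overhead)
        seconds = total_bytes * 8 / (speed_mbps * 1e6)
        rows.append((name, bpw, round(total_bytes / 1e9, 2),
                     round(bpw / 16 * 100, 2), round(seconds)))
    return rows


def as_hms(seconds: int) -> str:
    """Format whole seconds as 'Hh Mm Ss'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for name, bpw, size_gb, pct, seconds in comparison_rows(70e9):
        print(f"{name}\t{bpw}\t{size_gb}\t{pct}%\t{as_hms(seconds)}")
```

Setting `overhead=0.0` gives the weights-only figures used in the worked examples further down.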

How it works

A model file is just its weights written to disk, so its size follows one identity from Hugging Face's model-memory anatomy: the number of parameters multiplied by the number of bytes used to store each one.

sizeBytes = parameters × (bitsPerWeight ÷ 8)

The only variable that changes between formats is bits per weight. Full precision (FP32) uses 32 bits — 4 bytes — per parameter. The released half-precision weights (FP16 or BF16) use 16 bits, or 2 bytes. Quantization trades a little quality for a much smaller file by storing each weight in fewer bits:

FP32 → 32 bits (4 bytes/param)
FP16 / BF16 → 16 bits (2 bytes/param)
FP8 / INT8 → 8 bits (1 byte/param)
INT4 → 4 bits (0.5 byte/param)
GGUF quants → Q8_0 ≈ 8.5, Q6_K ≈ 6.56, Q5_K_M ≈ 5.67, Q4_K_M ≈ 4.83, Q4_0 ≈ 4.55, Q3_K_M ≈ 3.91, Q2_K ≈ 2.96 bits (Q8_0 and Q4_0 are older block quants rather than k-quants)

The floating-point and integer figures come straight from the PyTorch torch.dtype element-size table. The GGUF k-quant numbers are the documented effective bits per weight from llama.cpp — they are not round numbers because each k-quant keeps small per-block scale and minimum values, and leaves a few sensitive tensors at higher precision, so the true average lands a bit above the nominal bit width.
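The size identity above translates directly into code. A minimal sketch, assuming hypothetical function names and a simple input check that the original tool may or may not perform:

```python
def bytes_per_parameter(bits_per_weight: float) -> float:
    """Convert a bit width into bytes stored per parameter."""
    return bits_per_weight / 8


def size_bytes(parameters: float, bits_per_weight: float) -> float:
    """Weights-only file size: parameters x (bitsPerWeight / 8)."""
    if parameters <= 0 or bits_per_weight <= 0:
        raise ValueError("parameters and bits_per_weight must be positive")
    return parameters * bytes_per_parameter(bits_per_weight)


if __name__ == "__main__":
    # 70B parameters at FP16 (16 bits) and at Q4_K_M (about 4.83 bits).
    print(size_bytes(70e9, 16) / 1e9, "GB")
    print(size_bytes(70e9, 4.83) / 1e9, "GB")
```

Passing bits per weight as a plain number keeps the function independent of any particular quant list, so new formats only need their average bit width.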

Two presentation details matter. First, sizes appear in both GB (decimal, 10⁹ bytes, how vendors quote sizes) and GiB (binary, 2³⁰ bytes, what your operating system's disk meter shows); a 140 GB model reads as roughly 130 GiB in your file manager. Second, an optional ~3% overhead can be added for the tokenizer, config, and safetensors/GGUF metadata shipped alongside the weights. It is toggleable so the core weights-only math stays exact and easy to verify.
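Both presentation details are one-line conversions. The sketch below assumes a hypothetical `reported_sizes` helper and treats the overhead as a flat 3% multiplier, which is how the description above reads; the real tool may apply it differently.

```python
GB = 10**9    # decimal gigabyte, as quoted on model cards
GIB = 2**30   # binary gibibyte, as shown by most file managers


def reported_sizes(weight_bytes: float, include_overhead: bool = False,
                   overhead_fraction: float = 0.03) -> dict:
    """Return the size in bytes, GB, and GiB, optionally with metadata overhead."""
    total = weight_bytes
    if include_overhead:
        total = weight_bytes * (1 + overhead_fraction)
    return {"bytes": total, "gb": total / GB, "gib": total / GIB}


if __name__ == "__main__":
    fp16 = reported_sizes(140e9)
    print(round(fp16["gb"], 1), "GB =", round(fp16["gib"], 1), "GiB")
    q4 = reported_sizes(42.2625e9, include_overhead=True)
    print(round(q4["gb"], 2), "GB =", round(q4["gib"], 2), "GiB")
```

Keeping the overhead behind a flag mirrors the toggle: with it off, the result is exactly the weights-only number from the size formula.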

Download time uses the standard link-rate identity: seconds = file bits ÷ link bits-per-second = (sizeBytes × 8) ÷ (speedMbps × 10⁶). The default speed is 25 Mbps, Sri Lanka's median fixed-broadband speed from the Ookla Speedtest Global Index, because on a metered or modest connection the total gigabytes and the wait both decide whether a quant is worth pulling. Remember this is disk and download size only — running a model needs more GPU memory than the file size, which the LLM VRAM Calculator handles.
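The link-rate identity is equally short in code. A minimal sketch with hypothetical function names; rounding to the nearest whole second is an assumption, and real downloads will be slower than this ideal figure because of protocol overhead and varying throughput.

```python
def download_seconds(size_bytes: float, speed_mbps: float) -> float:
    """Ideal transfer time: file bits divided by link bits per second."""
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    return size_bytes * 8 / (speed_mbps * 1e6)


def format_duration(seconds: float) -> str:
    """Format a duration as 'Hh Mm Ss', rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # A 40 GB file on a 25 Mbps line.
    print(format_duration(download_seconds(40e9, 25)))
```

Note the factor of 8: file sizes are in bytes, while link speeds are quoted in bits per second.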

Worked examples

Llama 3 70B at FP16 vs Q4_K_M

Parameters: 70 B = 70 × 10⁹
FP16 (16 bits): 70e9 × 16 ÷ 8 = 140 × 10⁹ bytes = 140.0 GB (130.4 GiB)
Q4_K_M (4.83 bits): 70e9 × 4.83 ÷ 8 = 42.2625 × 10⁹ bytes ≈ 42.26 GB
Relative size: 4.83 ÷ 16 = 30.2% of FP16
Download at 25 Mbps (weights only, no overhead): 42.2625e9 × 8 ÷ 25e6 = 13,524 s = 3h 45m 24s

Mistral 7B at Q8_0 (near-lossless)

Parameters: 7 B = 7 × 10⁹
FP16 baseline: 7e9 × 16 ÷ 8 = 14.0 GB (13.04 GiB)
Q8_0 (8.5 bits): 7e9 × 8.5 ÷ 8 = 7.4375 × 10⁹ bytes = 7.44 GB
Relative size: 8.5 ÷ 16 = 53.1% of FP16
Download at 100 Mbps: 7.4375e9 × 8 ÷ 100e6 = 595 s = 9m 55s

Edge case — 2-bit on a tiny disk budget

Goal: fit a 70B model under a 30 GB disk budget
Q2_K (2.96 bits): 70e9 × 2.96 ÷ 8 = 25.9 × 10⁹ bytes = 25.9 GB — fits
Q3_K_M (3.91 bits): 70e9 × 3.91 ÷ 8 = 34.2 GB — over budget
With ~3% overhead, Q2_K becomes 25.9 × 1.03 = 26.7 GB — still fits
Trade-off: Q2_K is the smallest quant in the table above and loses the most quality
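The budget check in this example generalizes to "pick the largest quant that still fits". A minimal sketch, assuming a hypothetical helper name and using the GGUF bits-per-weight values from the comparison table; choosing the largest fitting file is one reasonable policy for keeping as much quality as the budget allows, not a rule from the tool itself.

```python
from typing import Optional, Tuple

# GGUF bits per weight, copied from the comparison table.
GGUF_BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.67, "Q4_K_M": 4.83,
    "Q4_0": 4.55, "Q3_K_M": 3.91, "Q2_K": 2.96,
}


def largest_quant_within(parameters: float, budget_gb: float,
                         overhead: float = 0.0) -> Optional[Tuple[str, float]]:
    """Return (quant name, size in GB) of the largest quant under budget, or None."""
    best = None
    for name, bpw in GGUF_BITS_PER_WEIGHT.items():
        size_gb = parameters * bpw / 8 * (1 + overhead) / 1e9
        if size_gb <= budget_gb and (best is None or size_gb > best[1]):
            best = (name, size_gb)
    return best


if __name__ == "__main__":
    print(largest_quant_within(70e9, 30))                 # weights only
    print(largest_quant_within(70e9, 30, overhead=0.03))  # with ~3% overhead
```

A `None` result means no listed quant fits, in which case a smaller model is the only option within that budget.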

Frequently asked questions

Sources & references

The bits-per-weight constants used here were last cross-checked against the PyTorch dtype table and the llama.cpp quantize README on 2026-06-12. Sizes are computed in bytes and shown in both decimal GB and binary GiB.
