
AI Model Size & Quantization Calculator

Find the exact file size of any LLM — FP32, FP16, 8-bit, 4-bit, or a GGUF quant like Q4_K_M — and how long it takes to download on your connection. Enter a parameter count, pick a precision, done. No signup, sources cited.

By Induwara Ashinsana · Updated Jun 12, 2026
Model size & download time (disk, not VRAM)

Inputs:

  • Parameters: count in billions. E.g. Llama 3 70B → 70, unit B.
  • Precision: Q4_K_M is the most popular local-LLM quant, with the best size/quality trade-off.
  • Connection speed: 25 Mbps is Sri Lanka's median fixed-broadband speed.
  • Extra files: optional overhead for the files shipped alongside the weights.

Result for Q4_K_M (GGUF):

  • File size: 43.53 GB
  • On disk: 40.54 GiB (what your OS reports)
  • Download time: 3h 52m 10s at 25 Mbps
  • Bits per weight: 4.83 bpw (0.604 bytes/param)

Every precision, side by side

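| Precision             | bpw  | Size (GB) | % of FP16 | Download    |
|-----------------------|------|-----------|-----------|-------------|
| FP32 (full precision) | 32   | 288.4     | 200%      | 25h 38m 8s  |
| FP16 / BF16 (half)    | 16   | 144.2     | 100%      | 12h 49m 4s  |
| FP8 / INT8 (8-bit)    | 8    | 72.1      | 50%       | 6h 24m 32s  |
| INT4 (4-bit)          | 4    | 36.05     | 25%       | 3h 12m 16s  |
| Q8_0 (GGUF)           | 8.5  | 76.61     | 53.13%    | 6h 48m 34s  |
| Q6_K (GGUF)           | 6.56 | 59.12     | 41%       | 5h 15m 19s  |
| Q5_K_M (GGUF)         | 5.67 | 51.1      | 35.44%    | 4h 32m 32s  |
| Q4_K_M (GGUF)         | 4.83 | 43.53     | 30.19%    | 3h 52m 10s  |
| Q4_0 (GGUF)           | 4.55 | 41.01     | 28.44%    | 3h 38m 42s  |
| Q3_K_M (GGUF)         | 3.91 | 35.24     | 24.44%    | 3h 7m 56s   |
| Q2_K (GGUF)           | 2.96 | 26.68     | 18.5%     | 2h 22m 17s  |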

Size = parameters × bits-per-weight ÷ 8 (Hugging Face model-memory anatomy). Bits per weight from the PyTorch dtype table and llama.cpp GGUF k-quants. This is disk/download size — not GPU VRAM.
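The side-by-side figures can be reproduced with a few lines of Python. This is a sketch under assumptions: a 70-billion-parameter model, the ~3% extra-files overhead switched on, and a 25 Mbps link. Those inputs are inferred from the numbers themselves (for example 144.2 GB at FP16), not stated next to the table, and the name `table_row` is hypothetical.

```python
PARAMETERS = 70e9   # assumed: 70B model, inferred from the FP16 figure
OVERHEAD = 0.03     # assumed: ~3% extra files included
SPEED_MBPS = 25     # default link speed quoted on the page


def table_row(bpw: float) -> tuple[float, float, int]:
    """Return (size in GB, percent of FP16, download seconds) for one precision."""
    size_bytes = PARAMETERS * bpw / 8 * (1 + OVERHEAD)
    percent_of_fp16 = bpw / 16 * 100
    seconds = round(size_bytes * 8 / (SPEED_MBPS * 1e6))
    return size_bytes / 1e9, percent_of_fp16, seconds


for name, bpw in [("FP16", 16), ("Q4_K_M", 4.83), ("Q2_K", 2.96)]:
    size_gb, pct, seconds = table_row(bpw)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    print(f"{name:8s} {size_gb:7.2f} GB  {pct:6.2f}%  {hours}h {minutes}m {secs}s")
```

The percentage column depends only on bits per weight (bpw ÷ 16), so it stays the same for any parameter count or overhead setting.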

How it works

A model file is just its weights written to disk, so its size follows one identity from Hugging Face's model-memory anatomy: the number of parameters multiplied by the number of bytes used to store each one.

sizeBytes = parameters × (bitsPerWeight ÷ 8)

The only variable that changes between formats is bits per weight. Full precision (FP32) uses 32 bits — 4 bytes — per parameter. The released half-precision weights (FP16 or BF16) use 16 bits, or 2 bytes. Quantization trades a little quality for a much smaller file by storing each weight in fewer bits:

  • FP32 → 32 bits (4 bytes/param)
  • FP16 / BF16 → 16 bits (2 bytes/param)
  • FP8 / INT8 → 8 bits (1 byte/param)
  • INT4 → 4 bits (0.5 byte/param)
  • GGUF quants → Q8_0 ≈ 8.5, Q6_K ≈ 6.56, Q5_K_M ≈ 5.67, Q4_K_M ≈ 4.83, Q4_0 ≈ 4.55, Q3_K_M ≈ 3.91, Q2_K ≈ 2.96 bits
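As a minimal sketch, the identity and the bit widths listed above fit in one lookup table and one function. The names `BITS_PER_WEIGHT` and `size_bytes` are illustrative, not part of any library.

```python
# Bits per weight, copied from the list above.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "INT8": 8.0,
    "INT4": 4.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q3_K_M": 3.91,
    "Q2_K": 2.96,
}


def size_bytes(parameters: float, precision: str) -> float:
    """Weights-only file size in bytes: parameters x bits-per-weight / 8."""
    return parameters * BITS_PER_WEIGHT[precision] / 8


for name in ("FP16", "Q4_K_M"):
    print(f"{name}: {size_bytes(70e9, name) / 1e9:.2f} GB")
```

The result is weights only. Any allowance for tokenizer, config, or metadata files is a separate multiplier applied afterwards.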

The floating-point and integer figures come straight from the PyTorch torch.dtype element-size table. The GGUF k-quant numbers are the documented effective bits per weight from llama.cpp — they are not round numbers because each k-quant keeps small per-block scale and minimum values, and leaves a few sensitive tensors at higher precision, so the true average lands a bit above the nominal bit width.
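The per-block overhead is easiest to see in the simplest case, Q8_0. The sketch below assumes a block of 32 eight-bit weights that share one 16-bit scale. That layout is an assumption brought in for illustration rather than something stated above, so check it against the llama.cpp source before relying on it. The function name `effective_bpw` is hypothetical.

```python
def effective_bpw(weights_per_block: int, bits_per_weight: int, metadata_bits: int) -> float:
    """Average bits per weight once per-block metadata is spread over the block."""
    total_bits = weights_per_block * bits_per_weight + metadata_bits
    return total_bits / weights_per_block


# Assumed Q8_0-style layout: 32 weights x 8 bits, plus one 16-bit scale per block.
q8_0_bpw = effective_bpw(weights_per_block=32, bits_per_weight=8, metadata_bits=16)
print(q8_0_bpw)  # 8.5
```

Under that assumption the 16 extra bits add 0.5 bit per weight, which matches the 8.5 figure quoted for Q8_0. The k-quants use more elaborate block structures, so their averages are less tidy.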

Two presentation details matter. First, sizes appear in both GB (decimal, 10⁹ bytes, how vendors quote sizes) and GiB (binary, 2³⁰ bytes, what your operating system's disk meter shows); a 140 GB model reads as roughly 130 GiB in your file manager. Second, an optional ~3% overhead can be added for the tokenizer, config, and safetensors/GGUF metadata shipped alongside the weights. It is toggleable so the core weights-only math stays exact and easy to verify.
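Both presentation details reduce to a constant each. A minimal sketch, with hypothetical helper names and the 3% figure treated as an adjustable default:

```python
GB = 10**9    # decimal gigabyte, as quoted by vendors
GIB = 2**30   # binary gibibyte, as shown by most OS disk meters


def to_gb(size_in_bytes: float) -> float:
    return size_in_bytes / GB


def to_gib(size_in_bytes: float) -> float:
    return size_in_bytes / GIB


def with_overhead(size_in_bytes: float, fraction: float = 0.03) -> float:
    """Add an allowance for tokenizer, config, and metadata files."""
    return size_in_bytes * (1 + fraction)


weights = 140e9  # a 70B model at FP16
print(f"{to_gb(weights):.1f} GB = {to_gib(weights):.1f} GiB")
print(f"with overhead: {to_gb(with_overhead(weights)):.1f} GB")
```

Keeping the overhead as a separate step mirrors the toggle: the weights-only number stays checkable by hand.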

Download time uses the standard link-rate identity: seconds = file bits ÷ link bits-per-second = (sizeBytes × 8) ÷ (speedMbps × 10⁶). The default speed is 25 Mbps, Sri Lanka's median fixed-broadband speed from the Ookla Speedtest Global Index, because on a metered or modest connection the total gigabytes and the wait both decide whether a quant is worth pulling. Remember this is disk and download size only — running a model needs more GPU memory than the file size, which the LLM VRAM Calculator handles.
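The link-rate identity above can be sketched as follows. The function names are illustrative, and the result is an ideal figure that ignores protocol overhead and speed fluctuations.

```python
def download_seconds(size_in_bytes: float, speed_mbps: float) -> float:
    """Seconds = file bits / link bits-per-second."""
    return size_in_bytes * 8 / (speed_mbps * 1e6)


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '3h 45m 24s', dropping leading zero units."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"


# 70B at Q4_K_M, weights only, on a 25 Mbps link
print(format_duration(download_seconds(42.2625e9, 25)))
```

Note the units: file sizes are in bytes while link speeds are in bits per second, hence the factor of 8.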

Worked examples

Llama 3 70B at FP16 vs Q4_K_M

  1. Parameters: 70 B = 70 × 10⁹
  2. FP16 (16 bits): 70e9 × 16 ÷ 8 = 140 × 10⁹ bytes = 140.0 GB (130.4 GiB)
  3. Q4_K_M (4.83 bits): 70e9 × 4.83 ÷ 8 = 42.2625 × 10⁹ bytes = 42.26 GB
  4. Relative size: 4.83 ÷ 16 = 30.2% of FP16
  5. Download at 25 Mbps: 42.2625e9 × 8 ÷ 25e6 = 13,524 s = 3h 45m 24s

Mistral 7B at Q8_0 (near-lossless)

  1. Parameters: 7 B = 7 × 10⁹
  2. FP16 baseline: 7e9 × 16 ÷ 8 = 14.0 GB (13.04 GiB)
  3. Q8_0 (8.5 bits): 7e9 × 8.5 ÷ 8 = 7.4375 × 10⁹ bytes = 7.44 GB
  4. Relative size: 8.5 ÷ 16 = 53.1% of FP16
  5. Download at 100 Mbps: 7.4375e9 × 8 ÷ 100e6 = 595 s = 9m 55s

Edge case — 2-bit on a tiny disk budget

  1. Goal: fit a 70B model under a 30 GB disk budget
  2. Q2_K (2.96 bits): 70e9 × 2.96 ÷ 8 = 25.9 × 10⁹ bytes = 25.9 GB — fits
  3. Q3_K_M (3.91 bits): 70e9 × 3.91 ÷ 8 = 34.2 GB — over budget
  4. With ~3% overhead, Q2_K becomes 25.9 × 1.03 = 26.7 GB — still fits
  5. Trade-off: Q2_K is the smallest usable quant and loses the most quality
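The budget check in this example generalises to "pick the largest quant that still fits". The sketch below assumes that more bits per weight means better quality, which is a simplification, and uses the GGUF figures from the comparison table. The function name `largest_fitting_quant` is hypothetical.

```python
# Bits per weight for the GGUF quants listed on this page.
GGUF_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q4_0": 4.55,
    "Q3_K_M": 3.91,
    "Q2_K": 2.96,
}


def largest_fitting_quant(parameters: float, budget_gb: float, overhead: float = 0.0):
    """Return (name, size in GB) of the biggest quant within budget, or None."""
    fitting = {}
    for name, bpw in GGUF_BPW.items():
        size_gb = parameters * bpw / 8 * (1 + overhead) / 1e9
        if size_gb <= budget_gb:
            fitting[name] = size_gb
    if not fitting:
        return None
    best = max(fitting, key=fitting.get)
    return best, fitting[best]


print(largest_fitting_quant(70e9, budget_gb=30))
print(largest_fitting_quant(70e9, budget_gb=30, overhead=0.03))
```

For a 70B model and a 30 GB budget, both calls select Q2_K, with and without the ~3% overhead, in line with steps 2 to 4 above.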
