AI Model Size & Quantization Calculator
Estimate the file size of any LLM at FP32, FP16, 8-bit, 4-bit, or a GGUF quant like Q4_K_M, and see how long it takes to download on your connection. Enter a parameter count, pick a precision, done. No signup, sources cited.
How it works
A model file is just its weights written to disk, so its size follows one identity from Hugging Face's model-memory anatomy: the number of parameters multiplied by the number of bytes used to store each one.
sizeBytes = parameters × (bitsPerWeight ÷ 8)
The only variable that changes between formats is bits per weight. Full precision (FP32) uses 32 bits, or 4 bytes, per parameter. Most models are released in half precision (FP16 or BF16), which uses 16 bits, or 2 bytes. Quantization trades a little quality for a much smaller file by storing each weight in fewer bits:
- FP32 → 32 bits (4 bytes/param)
- FP16 / BF16 → 16 bits (2 bytes/param)
- FP8 / INT8 → 8 bits (1 byte/param)
- INT4 → 4 bits (0.5 byte/param)
- GGUF quants (effective bits/param) → Q8_0 ≈ 8.5, Q6_K ≈ 6.56, Q5_K_M ≈ 5.67, Q4_K_M ≈ 4.83, Q3_K_M ≈ 3.91, Q2_K ≈ 2.96 bits
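The identity and the list above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the names `BITS_PER_WEIGHT` and `size_bytes` are hypothetical, and the 7-billion-parameter count is just an example input.

```python
# Bits per weight, copied from the list above.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "BF16": 16.0,
    "FP8": 8.0,
    "INT8": 8.0,
    "INT4": 4.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q3_K_M": 3.91,
    "Q2_K": 2.96,
}


def size_bytes(parameters: float, precision: str) -> float:
    """Weights-only size: parameters × (bitsPerWeight ÷ 8)."""
    return parameters * BITS_PER_WEIGHT[precision] / 8


if __name__ == "__main__":
    params = 7e9  # hypothetical 7-billion-parameter model
    for name in ("FP16", "INT8", "Q4_K_M"):
        print(f"{name}: {size_bytes(params, name) / 1e9:.2f} GB")
```

For the example model this gives 14 GB at FP16, 7 GB at INT8, and a little over 4.2 GB at Q4_K_M.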
The floating-point and integer figures come straight from the PyTorch torch.dtype element-size table. The GGUF numbers are the documented effective bits per weight from llama.cpp. They are not round numbers because each quantized block also stores small per-block scale (and, for some types, minimum) values, and the mixed variants such as Q4_K_M leave a few sensitive tensors at higher precision, so the true average lands a bit above the nominal bit width.
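To see where the fractional bits come from, the per-block overhead can be worked out for the two simplest cases. The block layouts below are assumptions based on how the Q8_0 and Q6_K formats are commonly described (they are not stated on this page), and `effective_bits` is a hypothetical helper name.

```python
def effective_bits(weights_per_block: int, bits_per_weight: int,
                   overhead_bits: int) -> float:
    """Average bits per weight once per-block metadata is included."""
    total_bits = weights_per_block * bits_per_weight + overhead_bits
    return total_bits / weights_per_block


# Assumed Q8_0 layout: 32 weights at 8 bits plus one 16-bit scale.
q8_0 = effective_bits(32, 8, 16)

# Assumed Q6_K layout: 256 weights at 6 bits, sixteen 8-bit sub-block
# scales, and one 16-bit super-block scale.
q6_k = effective_bits(256, 6, 16 * 8 + 16)

if __name__ == "__main__":
    print(q8_0, q6_k)
```

Under those assumptions the results are 8.5 and 6.5625 bits, which match the Q8_0 and Q6_K entries in the list. The mixed variants (the `_M` types) cannot be derived from a single block layout, because they combine several quant types across tensors.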
Two presentation details matter. First, sizes appear in both GB (decimal, 10⁹ bytes, which is how vendors quote sizes) and GiB (binary, 2³⁰ bytes, which is what some operating systems such as Windows report, often still labeled "GB"); a 140 GB model reads as roughly 130 GiB in such a file manager. Second, an optional ~3% overhead can be added for the tokenizer, config, and safetensors/GGUF metadata shipped alongside the weights. It is toggleable so the core weights-only math stays exact and easy to verify.
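Both presentation details reduce to simple arithmetic, sketched here. The function names are hypothetical, and the 3% default simply mirrors the optional overhead described above.

```python
GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte


def to_gb(n_bytes: float) -> float:
    """Bytes to decimal gigabytes."""
    return n_bytes / GB


def to_gib(n_bytes: float) -> float:
    """Bytes to binary gibibytes."""
    return n_bytes / GIB


def with_overhead(n_bytes: float, overhead: float = 0.03) -> float:
    """Add the optional allowance for tokenizer, config, and metadata."""
    return n_bytes * (1 + overhead)


if __name__ == "__main__":
    weights = 140e9  # the 140 GB example from the text
    print(f"{to_gb(weights):.1f} GB = {to_gib(weights):.1f} GiB")
    print(f"with overhead: {to_gb(with_overhead(weights)):.1f} GB")
```

Keeping the overhead in its own function means the weights-only figure can always be checked by hand first.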
Download time uses the standard link-rate identity: seconds = file bits ÷ link bits-per-second = (sizeBytes × 8) ÷ (speedMbps × 10⁶). The default speed is 25 Mbps, Sri Lanka's median fixed-broadband speed from the Ookla Speedtest Global Index, because on a metered or modest connection the total gigabytes and the wait both decide whether a quant is worth pulling. Remember this is disk and download size only — running a model needs more GPU memory than the file size, which the LLM VRAM Calculator handles.
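The link-rate identity above can be sketched as follows. The function names are hypothetical, the 25 Mbps default is the one quoted in the text, and the calculation assumes the link sustains its full rate with no protocol overhead.

```python
def download_seconds(n_bytes: float, speed_mbps: float = 25.0) -> float:
    """seconds = (sizeBytes × 8) ÷ (speedMbps × 10^6)."""
    return n_bytes * 8 / (speed_mbps * 1e6)


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes, and seconds."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    fp16_7b = 14e9  # weights-only size of a hypothetical 7B model at FP16
    print(format_duration(download_seconds(fp16_7b)))
```

At 25 Mbps a 14 GB file works out to 4,480 seconds, or about an hour and a quarter.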
Sources & references
- PyTorch — torch.dtype element sizes (float32/16, bfloat16, float8, int8)
- Hugging Face — Model Memory Anatomy (size = parameters × bytes per parameter)
- llama.cpp — quantize README (GGUF k-quant types and effective bits per weight)
- llama.cpp PR #1684 — k-quants introduction and bits-per-weight notes
- Ookla Speedtest Global Index — Sri Lanka median fixed-broadband speed
The bits-per-weight constants used here were last cross-checked against the PyTorch dtype table and the llama.cpp quantize README on 2026-06-12. Sizes are computed in bytes and shown in both decimal GB and binary GiB.