LLM Inference Speed (Tokens per Second) Calculator
Estimate how fast a large language model will generate text on a given GPU (decode tokens/second, time-to-first-token, and total generation time) from the model's size, quantization, and the GPU's published memory bandwidth and FP16 compute. No signup required; the formulas and specs are cited below.
How it works
Generating text with a transformer is two separate jobs with two separate bottlenecks, and this calculator models both. The headline number — decode throughput — comes from a simple physical fact: to produce each new token, the GPU must read every active model weight out of memory exactly once. That memory read, not arithmetic, sets the pace. So single-stream decode is memory-bandwidth bound.
- Weight bytes per token. Multiply the active parameter count by the bytes per parameter for your quantization — FP16 = 2, FP8/INT8 = 1, INT4 = 0.5. A dense 70B at INT4 reads 70e9 × 0.5 = 35 GB per token. Mixture-of-Experts models only read their routed (active) experts, so the tool uses the active count, not the total.
- Decode throughput. Peak tokens/second = aggregate memory bandwidth ÷ weight bytes per token. Realistic throughput multiplies that by an efficiency factor (memory-bandwidth utilisation, MBU), because no engine sustains 100% of rated bandwidth; measured utilisation is typically 60–85%, so 70% is the default.
- Prefill throughput. Processing your prompt is compute bound, not memory bound, because all prompt tokens run through the network in parallel. A forward pass costs about 2 FLOPs per parameter per token, so prefill tokens/second = aggregate FP16 compute × 45% ÷ (2 × parameters), where 45% is the assumed compute utilisation (MFU).
- Time to first token and total time. TTFT = prompt tokens ÷ prefill throughput. Total generation time = TTFT + output tokens ÷ decode throughput.
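The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as stated, not the calculator's actual source: the function names are hypothetical, and the hardware figures in the example (1 TB/s of bandwidth, 100 TFLOPS of FP16 compute) are round illustrative numbers rather than the specs of any particular GPU.

```python
# Bytes per parameter for each quantization, as listed in the first bullet.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_bytes_per_token(active_params, quant):
    """Bytes of weights read from memory to produce one decoded token."""
    return active_params * BYTES_PER_PARAM[quant]


def decode_tokens_per_s(bandwidth_bytes_per_s, active_params, quant, efficiency=0.70):
    """Memory-bound decode rate: bandwidth x efficiency / bytes per token."""
    return bandwidth_bytes_per_s * efficiency / weight_bytes_per_token(active_params, quant)


def prefill_tokens_per_s(fp16_flops_per_s, params, utilisation=0.45):
    """Compute-bound prefill rate: about 2 FLOPs per parameter per token."""
    return fp16_flops_per_s * utilisation / (2.0 * params)


def generation_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time to first token, total generation time) in seconds."""
    ttft = prompt_tokens / prefill_tps
    return ttft, ttft + output_tokens / decode_tps


# Example: a dense 70B model at INT4 on hypothetical hardware
# (1e12 bytes/s of bandwidth, 100e12 FP16 FLOPs/s; illustrative values only).
params = 70e9
decode_tps = decode_tokens_per_s(1e12, params, "int4")    # about 20 tokens/s
prefill_tps = prefill_tokens_per_s(100e12, params)        # about 321 tokens/s
ttft, total = generation_time_s(1000, 500, prefill_tps, decode_tps)
print(f"decode {decode_tps:.1f} tok/s, prefill {prefill_tps:.0f} tok/s")
print(f"TTFT {ttft:.2f} s, total {total:.2f} s")
```

For a Mixture-of-Experts model, pass the active parameter count to the decode function. Whether prefill should also use the active count is not spelled out above, so the sketch leaves that choice to the caller.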
Multiple GPUs (tensor parallel) add bandwidth but spend some of it on cross-GPU communication, so the tool discounts each extra card by 8% (a 0.92 factor per added GPU). The accounting follows the standard transformer inference arithmetic — a forward pass at roughly 2 × params FLOPs per token (Kaplan et al. 2020), and decode reading the full active weight set once per token (kipp.ly). Every estimate is deterministic given its inputs.
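One way to read the multi-GPU discount is as a compounding factor: every card beyond the first multiplies the summed bandwidth by 0.92. The sketch below assumes that reading. The function name is hypothetical, the exact form the tool uses may differ, and the text does not say whether the same discount is applied to compute, so only bandwidth is shown.

```python
def aggregate_bandwidth(per_gpu_bandwidth, num_gpus, per_added_gpu_factor=0.92):
    """Summed bandwidth across GPUs, discounted once per added card.

    Assumption: the 0.92 factor compounds, so N cards give
    N x per_gpu_bandwidth x 0.92 ** (N - 1).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return per_gpu_bandwidth * num_gpus * per_added_gpu_factor ** (num_gpus - 1)


# Effective bandwidth relative to one card, for a few GPU counts.
for n in (1, 2, 4, 8):
    print(n, round(aggregate_bandwidth(1.0, n), 3))
```

Under this reading, two cards deliver 1.84 times the bandwidth of one rather than 2 times, and the gap widens as more cards are added.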
This is a single-stream, first-order model. It does not simulate continuous batching, speculative decoding, FlashAttention kernel tuning, or KV-cache spill, all of which depend on your exact serving stack. It also does not check whether the model fits in VRAM; for that, use the LLM VRAM calculator. The estimates are meant to land within the usual variance of real-world single-stream throughput, and the cross-check anchors below sit inside measured ranges.
Worked examples
Frequently asked questions
Sources & references
- Carol Chen — Transformer Inference Arithmetic (decode is memory-bandwidth bound)
- Kaplan et al. 2020 — Scaling Laws for Neural Language Models (≈2 FLOPs/param/token)
- Databricks — LLM Inference Performance Engineering (memory-bandwidth utilisation)
- NVIDIA H100 datasheet — memory bandwidth and FP16 TFLOPS
- NVIDIA GeForce RTX 4090 — memory bandwidth specification
GPU specifications and model parameter counts were last cross-checked against the sources above on 2026-06-14. This tool gives a first-principles estimate, not a measured benchmark; real throughput varies with your inference engine, context length, and batch settings.