induwara.lk

LLM Inference Speed (Tokens per Second) Calculator

Estimate how fast a large language model will generate text on a given GPU (decode tokens/second, time-to-first-token, and total generation time) from the model's size, quantization, and the GPU's published memory bandwidth. No signup required; formulas and specs are cited below.

By Induwara Ashinsana · Updated Jun 14, 2026
LLM inference speed (tokens / second)
Bandwidth-bound model

Model: 70 B parameters (dense).

Quantization: 4-bit weights (GPTQ/AWQ/GGUF Q4), ~4× faster decode than FP16 at a small quality cost.

GPU: 3,350 GB/s · 989 TFLOPS · 80 GB

GPU count: each added GPU adds bandwidth but loses ~8% to tensor-parallel comms.

Efficiency (MFU, %): share of peak bandwidth realised in practice (10–95). 70% is a good default.

Prompt tokens: input tokens processed during prefill.

Output tokens: tokens the model generates.

Decode throughput: 67 tok/s (peak 95.7 tok/s × MFU)
Prefill throughput: 3,179 tok/s
Time to first token: 315 ms
Total generation: 7.78 s
Verdict: Interactive (> 30 tok/s) · Memory-bandwidth bound

Assumptions used

Active weights read / token: 35 GB
Effective (active) parameters: 70 B
Aggregate memory bandwidth: 3,350 GB/s
Aggregate FP16 compute: 989 TFLOPS
Tensor-parallel efficiency: 100%
Single-stream, first-order estimate — no batching, speculative decoding, or KV-cache spill. GPU bandwidth and compute are from NVIDIA datasheets; model sizes from official model cards. Specs last verified 2026-06-14. Full sources are listed below the calculator.

How it works

Generating text with a transformer is two separate jobs with two separate bottlenecks, and this calculator models both. The headline number — decode throughput — comes from a simple physical fact: to produce each new token, the GPU must read every active model weight out of memory exactly once. That memory read, not arithmetic, sets the pace. So single-stream decode is memory-bandwidth bound.

  1. Weight bytes per token. Multiply the active parameter count by the bytes per parameter for your quantization — FP16 = 2, FP8/INT8 = 1, INT4 = 0.5. A dense 70B at INT4 reads 70e9 × 0.5 = 35 GB per token. Mixture-of-Experts models only read their routed (active) experts, so the tool uses the active count, not the total.
  2. Decode throughput. Peak tokens/second = aggregate memory bandwidth ÷ weight bytes per token. Realistic throughput multiplies that by an efficiency factor, which the tool labels MFU although here it measures the share of peak memory bandwidth actually achieved rather than FLOPs utilisation. No engine sustains 100% of rated bandwidth; measured utilisation is typically 60–85%, so 70% is the default.
  3. Prefill throughput. Processing your prompt is compute bound, not memory bound, because all prompt tokens run through the network in parallel. A forward pass costs about 2 FLOPs per parameter per token, so prefill tokens/second = aggregate FP16 compute × 45% ÷ (2 × parameters), where 45% is the assumed share of peak compute achieved during prefill.
  4. Time to first token and total time. TTFT = prompt tokens ÷ prefill throughput. Total generation time = TTFT + output tokens ÷ decode throughput.
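
The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the stated formulas, not the calculator's actual source; the function name `estimate` and the parameter names `decode_efficiency` and `prefill_efficiency` are assumptions made for this example.

```python
# Bytes per parameter for each weight format, as listed in step 1.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def estimate(active_params, quant, bandwidth_gb_s, compute_tflops,
             prompt_tokens, output_tokens,
             decode_efficiency=0.70, prefill_efficiency=0.45):
    """First-order, single-stream estimate following steps 1-4 (hypothetical helper)."""
    # Step 1: bytes of weights read per generated token.
    weight_bytes = active_params * BYTES_PER_PARAM[quant]
    # Step 2: decode is bandwidth bound; scale the peak by the efficiency factor.
    peak_decode = bandwidth_gb_s * 1e9 / weight_bytes
    decode = decode_efficiency * peak_decode
    # Step 3: prefill is compute bound at about 2 FLOPs per parameter per token.
    prefill = compute_tflops * 1e12 * prefill_efficiency / (2 * active_params)
    # Step 4: time to first token, then total generation time.
    ttft_s = prompt_tokens / prefill
    total_s = ttft_s + output_tokens / decode
    return {
        "peak_decode_tok_s": peak_decode,
        "decode_tok_s": decode,
        "prefill_tok_s": prefill,
        "ttft_s": ttft_s,
        "total_s": total_s,
    }


# Dense 70B at INT4 on one 3,350 GB/s, 989 TFLOPS GPU; 1,000 prompt, 500 output tokens.
result = estimate(70e9, "int4", 3350, 989, prompt_tokens=1000, output_tokens=500)
print(f"decode  {result['decode_tok_s']:.1f} tok/s")
print(f"prefill {result['prefill_tok_s']:.0f} tok/s")
print(f"TTFT    {result['ttft_s'] * 1000:.0f} ms")
print(f"total   {result['total_s']:.2f} s")
```

With these inputs the sketch gives 67.0 tok/s decode, 3,179 tok/s prefill, a 315 ms time to first token, and 7.78 s in total. For a Mixture-of-Experts model, pass the active parameter count rather than the total.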

Multiple GPUs (tensor parallel) add bandwidth but spend some of it on cross-GPU communication, so the tool multiplies the summed bandwidth of N cards by 0.92^(N−1), an 8% discount for each added GPU. The accounting follows the standard transformer inference arithmetic: a forward pass at roughly 2 × params FLOPs per token (Kaplan et al. 2020), and decode reading the full active weight set once per token (kipp.ly). Every estimate is deterministic given its inputs.
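
The tensor-parallel discount can be written as a one-line function. This is a sketch under the assumption that the 0.92 factor compounds once per added GPU, as the 0.92^(2−1) term in the two-GPU worked example suggests; the function name `aggregate_bandwidth_gb_s` is hypothetical.

```python
def aggregate_bandwidth_gb_s(per_gpu_gb_s, n_gpus, tp_factor=0.92):
    """Summed bandwidth of n_gpus cards, discounted by tp_factor per added GPU."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    # One card pays no penalty; each extra card multiplies the total by tp_factor.
    return per_gpu_gb_s * n_gpus * tp_factor ** (n_gpus - 1)


# Scaling for a 2,039 GB/s card (the A100 80GB figure used in the worked examples).
for n in (1, 2, 4, 8):
    total = aggregate_bandwidth_gb_s(2039, n)
    print(f"{n} GPU(s): {total:,.0f} GB/s ({total / 2039:.2f}x one card)")
```

Under this assumption two cards deliver 1.84× the bandwidth of one rather than 2×, and the shortfall grows with every card added.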

This is a single-stream, first-order model. It does not simulate continuous batching, speculative decoding, FlashAttention kernel tuning, or KV-cache spill, all of which depend on your exact serving stack. It also does not check whether the model fits in VRAM; for that, use the LLM VRAM calculator. The estimates are meant to fall within the usual variance of real-world single-stream throughput; the cross-check anchors below compare them against measured ranges.

Worked examples

Llama 3 70B · INT4 · 1× H100 SXM · MFU 70%

  1. Weight bytes/token: 70e9 × 0.5 = 35e9 B (35 GB)
  2. H100 SXM bandwidth: 3,350 GB/s = 3.35e12 B/s
  3. Peak decode: 3.35e12 ÷ 35e9 = 95.7 tok/s
  4. Realistic decode: 0.70 × 95.7 = 67.0 tok/s
  5. Prefill: 989e12 × 0.45 ÷ (2 × 70e9) = 3,179 tok/s
  6. TTFT (1,000 prompt): 1,000 ÷ 3,179 = 315 ms
  7. Total (500 output): 0.315 + 500 ÷ 67.0 = 7.78 s → Interactive

Mistral 7B · FP16 · 1× RTX 4090 · MFU 70%

  1. Weight bytes/token: 7e9 × 2 = 14e9 B (14 GB)
  2. RTX 4090 bandwidth: 1,008 GB/s
  3. Peak decode: 1,008e9 ÷ 14e9 = 72.0 tok/s
  4. Realistic decode: 0.70 × 72.0 = 50.4 tok/s
  5. Verdict: Interactive — comfortable single-user chat on one 4090
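
To see how the quantization choice moves this result, the same arithmetic can be repeated for each weight format. This is an illustrative sketch only: it uses the bytes-per-parameter values from How it works, the helper name `decode_tok_s` is hypothetical, and it treats decode speed as purely proportional to bytes read, ignoring any dequantization overhead.

```python
# Bytes per parameter, from step 1 of "How it works".
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def decode_tok_s(params, bytes_per_param, bandwidth_gb_s, efficiency=0.70):
    """Realistic decode rate: efficiency x bandwidth / weight bytes per token."""
    return efficiency * bandwidth_gb_s * 1e9 / (params * bytes_per_param)


# 7B dense model on a single 1,008 GB/s card at 70% efficiency.
rates = {q: decode_tok_s(7e9, b, 1008) for q, b in BYTES_PER_PARAM.items()}
for quant, rate in rates.items():
    print(f"{quant}: {rate:.1f} tok/s")
```

In this model, halving the bytes per parameter doubles the estimate, so the FP16 figure of 50.4 tok/s becomes 100.8 at INT8 and 201.6 at INT4.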

Qwen2.5 72B · FP16 · 2× A100 80GB · MFU 70% (tensor parallel)

  1. Weight bytes/token: 72e9 × 2 = 144e9 B (144 GB)
  2. Tensor-parallel efficiency: 0.92^(2−1) = 0.92
  3. Aggregate bandwidth: 2,039 × 2 × 0.92 = 3,752 GB/s
  4. Realistic decode: 0.70 × 3,752e9 ÷ 144e9 = 18.2 tok/s
  5. Verdict: Usable — FP16 72B is sluggish; INT4 would roughly quadruple it

Sources & references

GPU specifications and model parameter counts were last cross-checked against the sources above on 2026-06-14. This tool gives a first-principles estimate, not a measured benchmark; real throughput varies with your inference engine, context length, and batch settings.
