How many tokens per second can an RTX 4090 generate?

It depends on the model and quantization. The RTX 4090 has 1,008 GB/s of memory bandwidth, so a 7B model at FP16 (14 GB of weights) gives roughly 1008 ÷ 14 = 72 tok/s peak, or about 50 tok/s at 70% efficiency. A quantized 8B at INT8 runs faster; a 70B won't fit in 24 GB without offload. Pick the exact model and precision above for your number.

Is LLM inference memory-bandwidth bound or compute bound?

Single-stream decode (generating one token at a time) is memory-bandwidth bound: the GPU must read every active weight from memory once per generated token, and that read dominates the time. Prompt processing (prefill) is compute bound because it runs all prompt tokens through the network in parallel. This tool models both and labels which limit applies.
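
To see why decode lands on the memory side, the sketch below compares the time to read the weights once with the time to do the arithmetic for one token, both at 100% of rated peak. The function name is hypothetical, and the 2 FLOPs per parameter figure and the H100 numbers are the ones quoted elsewhere on this page; it is an illustration of the reasoning, not the tool's actual code.

```python
def per_token_times(params: float, bytes_per_param: float,
                    bandwidth_bytes_s: float, flops_s: float) -> tuple[float, float]:
    """Return (memory_time_s, compute_time_s) for one decoded token at rated peak."""
    memory_time = params * bytes_per_param / bandwidth_bytes_s  # read every weight once
    compute_time = 2 * params / flops_s                          # ~2 FLOPs per parameter
    return memory_time, compute_time

# Dense 70B at INT4 on one H100 (3,350 GB/s, 989 TFLOPS FP16).
mem_s, cmp_s = per_token_times(70e9, 0.5, 3.35e12, 989e12)
bound = "memory-bandwidth bound" if mem_s > cmp_s else "compute bound"
print(f"memory {mem_s * 1e3:.2f} ms, compute {cmp_s * 1e3:.3f} ms -> {bound}")
```

With these inputs the weight read takes roughly 10 ms per token while the arithmetic takes a fraction of a millisecond, which is why the memory read sets the pace for single-stream decode.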

How do I estimate tokens per second for a 70B model?

Multiply the parameters by the bytes per parameter for your quantization (FP16 = 2, INT8 = 1, INT4 = 0.5) to get the weight bytes read per token. Divide GPU memory bandwidth by that figure for the theoretical peak, then multiply by a realistic efficiency (MFU), about 70%. A 70B at INT4 is 35 GB; on an H100 (3,350 GB/s) the peak is 3,350 ÷ 35 ≈ 95.7 tok/s, or roughly 67 tok/s at 70% efficiency.
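
The same arithmetic as a minimal Python sketch. The function and dictionary names are hypothetical, chosen only for illustration; the bytes-per-parameter values, the 70% default, and the H100 bandwidth are taken from the answer above.

```python
# Bytes per parameter for each quantization, as listed above.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def decode_tok_per_s(params: float, quant: str,
                     bandwidth_gb_s: float, efficiency: float = 0.70) -> float:
    """First-order decode estimate: bandwidth / weight bytes, scaled by efficiency."""
    weight_bytes = params * BYTES_PER_PARAM[quant]   # bytes read per generated token
    peak = bandwidth_gb_s * 1e9 / weight_bytes       # theoretical peak tok/s
    return peak * efficiency

rate = decode_tok_per_s(70e9, "INT4", 3350)  # 70B, INT4, one H100
print(round(rate))  # 67
```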

Does quantization (INT4/FP8) make inference faster?

Yes. Because decode speed is set by how many bytes of weights are read per token, halving the bytes per parameter roughly doubles tokens per second. Going from FP16 (2 bytes) to INT4 (0.5 bytes) cuts the weight bytes to a quarter, so decode is about 4× faster, at a small quality cost. That trade-off is why INT4 is a common choice for large models on single GPUs.
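
As a rough illustration, the first-order speedup is just the ratio of bytes per parameter. The names below are hypothetical, and the sketch ignores dequantization overhead and any quality impact, so treat the ratios as upper bounds under this page's bandwidth-bound model.

```python
# Bytes per parameter by precision, as used throughout this page.
PRECISION_BYTES = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def speedup_vs_fp16(quant: str) -> float:
    """Decode speedup relative to FP16 when weight reads dominate."""
    return PRECISION_BYTES["FP16"] / PRECISION_BYTES[quant]

for name in PRECISION_BYTES:
    print(f"{name}: {speedup_vs_fp16(name):.0f}x decode vs FP16")
```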

What is time to first token and how is it calculated?

Time to first token (TTFT) is the delay before the model starts replying — the time to process your prompt. It is compute bound: a forward pass costs about 2 FLOPs per parameter per token, so TTFT ≈ prompt tokens ÷ prefill throughput, where prefill throughput = GPU FP16 compute × ~45% ÷ (2 × parameters). Longer prompts and bigger models raise TTFT.
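
A small sketch of that formula, assuming the ~45% prefill efficiency quoted above. The function name and argument names are hypothetical.

```python
def ttft_seconds(prompt_tokens: int, params: float,
                 fp16_tflops: float, prefill_efficiency: float = 0.45) -> float:
    """TTFT = prompt tokens / prefill throughput (compute-bound prefill)."""
    # Prefill throughput: usable FLOP/s divided by ~2 FLOPs per parameter per token.
    prefill_tok_s = fp16_tflops * 1e12 * prefill_efficiency / (2 * params)
    return prompt_tokens / prefill_tok_s

# 1,000-token prompt, dense 70B, one H100 (989 TFLOPS FP16).
print(f"{ttft_seconds(1000, 70e9, 989) * 1e3:.0f} ms")  # 315 ms
```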

Why is the real speed lower than the theoretical peak?

Real inference stacks never hit 100% of a GPU's rated memory bandwidth. Kernel launch overhead, attention reads of the KV cache, and synchronisation all cost time. Measured memory-bandwidth utilisation (MBU, which this tool labels MFU) is typically 60–85%, so this tool defaults to 70%. You can raise or lower it to match your engine (vLLM, TensorRT-LLM, llama.cpp) and batch settings.

Does using two GPUs double the speed?

Not quite. Splitting a model across GPUs (tensor parallel) adds memory bandwidth but spends some of it on cross-GPU communication. This tool discounts each added GPU by about 8% (a 0.92 factor per extra card), so two GPUs give roughly 1.84× the bandwidth, not 2×. Multi-GPU mainly helps when a model is too large to fit or run quickly on one card.
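
A sketch of the discount as described: a 0.92 factor applied once per extra card. Extending it to more than two GPUs as 0.92^(n−1) follows the exponent form used in the worked examples and should be read as an assumption about how the tool generalises; the function name is hypothetical.

```python
def aggregate_bandwidth_gb_s(per_gpu_gb_s: float, gpu_count: int,
                             per_extra_gpu_factor: float = 0.92) -> float:
    """Aggregate bandwidth under tensor parallelism with a per-extra-card discount."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    tp_efficiency = per_extra_gpu_factor ** (gpu_count - 1)  # 1.0 for a single GPU
    return per_gpu_gb_s * gpu_count * tp_efficiency

one = aggregate_bandwidth_gb_s(2039, 1)   # single A100 80GB
two = aggregate_bandwidth_gb_s(2039, 2)   # two A100 80GB, tensor parallel
print(f"{two:.0f} GB/s, {two / one:.2f}x")  # 3752 GB/s, 1.84x
```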

Does this account for batching or multiple users?

No — this is a single-stream, first-order estimate for one request at a time. Continuous batching raises a server's aggregate tokens/second but not the latency any single user feels, so mixing the two misleads. For batched serving cost and throughput, use the self-hosting cost calculator linked under related tools.

When were the GPU specs and model sizes last verified?

The GPU memory-bandwidth and FP16 compute figures (from NVIDIA datasheets) and the model parameter counts (from official model cards) were last cross-checked on 2026-06-14. Datasheet specs rarely change, but the page is reviewed when new GPUs or model families ship.

LLM Inference Speed (Tokens per Second) Calculator

Estimate how fast a large language model will generate text on a given GPU — decode tokens/second, time-to-first-token, and total generation time — from the model's size, quantization, and the GPU's published memory bandwidth. No signup, formulas and specs cited below.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 14, 2026

LLM inference speed (tokens / second)

Bandwidth-bound model

Model

70 B parameters (dense).

Quantization

4-bit weights (GPTQ/AWQ/GGUF Q4) — ~4× faster decode, small quality cost.

GPU

3,350 GB/s · 989 TFLOPS · 80 GB

GPU count (tensor parallel)

Each added GPU adds bandwidth but loses ~8% to tensor-parallel comms.

Decode efficiency (MFU)

Share of peak bandwidth realised in practice (10–95). 70% is a good default.

Prompt tokens

Input tokens processed during prefill.

Output tokens

Tokens the model generates.

Decode throughput

67 tok/s

Peak 95.7 tok/s × MFU

Prefill throughput

3,179 tok/s

Time to first token

315 ms

Total generation

7.78 s

Interactive (> 30 tok/s) · Memory-bandwidth bound

Assumptions used

Active weights read / token	35 GB
Effective (active) parameters	70 B
Aggregate memory bandwidth	3,350 GB/s
Aggregate FP16 compute	989 TFLOPS
Tensor-parallel efficiency	100%

Single-stream, first-order estimate — no batching, speculative decoding, or KV-cache spill. GPU bandwidth and compute are from NVIDIA datasheets; model sizes from official model cards. Specs last verified 2026-06-14. Full sources are listed below the calculator.

How it works

Generating text with a transformer is two separate jobs with two separate bottlenecks, and this calculator models both. The headline number — decode throughput — comes from a simple physical fact: to produce each new token, the GPU must read every active model weight out of memory exactly once. That memory read, not arithmetic, sets the pace. So single-stream decode is memory-bandwidth bound.

Weight bytes per token. Multiply the active parameter count by the bytes per parameter for your quantization — FP16 = 2, FP8/INT8 = 1, INT4 = 0.5. A dense 70B at INT4 reads 70e9 × 0.5 = 35 GB per token. Mixture-of-Experts models only read their routed (active) experts, so the tool uses the active count, not the total.
Decode throughput. Peak tokens/second = aggregate memory bandwidth ÷ weight bytes per token. Realistic throughput multiplies that by an efficiency factor (labelled MFU in the tool; for decode it is the share of rated memory bandwidth actually achieved), because no engine sustains 100% of rated bandwidth. Measured utilisation is typically 60–85%, so 70% is the default.
Prefill throughput. Processing your prompt is compute bound, not memory bound, because all prompt tokens run through the network in parallel. A forward pass costs about 2 FLOPs per parameter per token, so prefill tokens/second = aggregate FP16 compute × 45% ÷ (2 × parameters).
Time to first token and total time. TTFT = prompt tokens ÷ prefill throughput. Total generation time = TTFT + output tokens ÷ decode throughput.
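
The four steps above can be chained into one small estimator. This is a minimal sketch under the page's stated assumptions (70% decode efficiency, 45% prefill efficiency, 2 FLOPs per parameter per token), not the calculator's actual source; all names are hypothetical, and using the active parameter count for prefill as well as decode is an assumption.

```python
from typing import NamedTuple

class Estimate(NamedTuple):
    decode_tok_s: float
    prefill_tok_s: float
    ttft_s: float
    total_s: float

def estimate(active_params: float, bytes_per_param: float,
             bandwidth_gb_s: float, fp16_tflops: float,
             prompt_tokens: int, output_tokens: int,
             decode_eff: float = 0.70, prefill_eff: float = 0.45) -> Estimate:
    """Single-stream, first-order estimate following the four steps above."""
    weight_bytes = active_params * bytes_per_param                    # step 1
    decode = decode_eff * bandwidth_gb_s * 1e9 / weight_bytes         # step 2
    prefill = fp16_tflops * 1e12 * prefill_eff / (2 * active_params)  # step 3
    ttft = prompt_tokens / prefill                                    # step 4
    total = ttft + output_tokens / decode
    return Estimate(decode, prefill, ttft, total)

# Dense 70B, INT4, one H100, 1,000 prompt tokens, 500 output tokens.
result = estimate(70e9, 0.5, 3350, 989, prompt_tokens=1000, output_tokens=500)
print(f"{result.decode_tok_s:.1f} tok/s, "
      f"TTFT {result.ttft_s * 1e3:.0f} ms, total {result.total_s:.2f} s")
```

With these inputs the sketch gives about 67 tok/s decode, a 315 ms time to first token, and 7.78 s in total, matching the calculator output shown on this page.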

Multiple GPUs (tensor parallel) add bandwidth but spend some of it on cross-GPU communication, so the tool discounts each extra card by 8% (a 0.92 factor per added GPU). The accounting follows the standard transformer inference arithmetic — a forward pass at roughly 2 × params FLOPs per token (Kaplan et al. 2020), and decode reading the full active weight set once per token (kipp.ly). Every estimate is deterministic given its inputs.

This is a single-stream, first-order model. It does not simulate continuous batching, speculative decoding, FlashAttention kernel tuning, or KV-cache spill, because those depend on your exact serving stack. It also does not check whether the model fits in VRAM; for that, use the LLM VRAM calculator. The estimates are intended to land within the usual variance of measured single-stream throughput, and the worked examples below act as cross-check anchors.

Worked examples

Llama 3 70B · INT4 · 1× H100 SXM · MFU 70%

Weight bytes/token: 70e9 × 0.5 = 35e9 B (35 GB)
H100 SXM bandwidth: 3,350 GB/s = 3.35e12 B/s
Peak decode: 3.35e12 ÷ 35e9 = 95.7 tok/s
Realistic decode: 0.70 × 95.7 = 67.0 tok/s
Prefill: 989e12 × 0.45 ÷ (2 × 70e9) = 3,179 tok/s
TTFT (1,000 prompt): 1,000 ÷ 3,179 = 315 ms
Total (500 output): 0.315 + 500 ÷ 67.0 = 7.78 s → Interactive

Mistral 7B · FP16 · 1× RTX 4090 · MFU 70%

Weight bytes/token: 7e9 × 2 = 14e9 B (14 GB)
RTX 4090 bandwidth: 1,008 GB/s
Peak decode: 1,008e9 ÷ 14e9 = 72.0 tok/s
Realistic decode: 0.70 × 72.0 = 50.4 tok/s
Verdict: Interactive — comfortable single-user chat on one 4090

Qwen2.5 72B · FP16 · 2× A100 80GB · MFU 70% (tensor parallel)

Weight bytes/token: 72e9 × 2 = 144e9 B (144 GB)
Tensor-parallel efficiency: 0.92^(2−1) = 0.92
Aggregate bandwidth: 2,039 × 2 × 0.92 = 3,752 GB/s
Realistic decode: 0.70 × 3,752e9 ÷ 144e9 = 18.2 tok/s
Verdict: Usable — FP16 72B is sluggish; INT4 would roughly quadruple it

Frequently asked questions

Sources & references

GPU specifications and model parameter counts were last cross-checked against the sources above on 2026-06-14. This tool gives a first-principles estimate, not a measured benchmark; real throughput varies with your inference engine, context length, and batch settings.

Related tools

LLM Speed Comparison

Compare the real-world response speed of hosted LLM APIs — GPT, Claude, Gemini, Llama, DeepSeek, Grok, Mistral. Pick 2–6 models, enter your output length, and rank them by estimated total response time, output tokens/sec, and time-to-first-token. Every figure cited from Artificial Analysis medians.

LLM VRAM Calculator

Estimate the GPU VRAM needed to run or fine-tune any open LLM (Llama 3, Mistral, Qwen, Gemma, DeepSeek) at a given precision, context, and batch size — and check whether it fits your GPU. Formulas cited, runs in your browser.

AI Audio Token Cost Calc

Convert an audio clip's duration (or a measured audio_tokens count) into the exact audio input tokens GPT-4o-audio and Gemini bill, then price it per request and per month in USD and LKR. Gemini's fixed 32 tokens/second rule is cited; compares all four models side by side. Runs in your browser, no signup.
