Which LLM is the fastest for streaming responses?

It depends on length. For the first token to appear instantly (a typing indicator), pick the model with the lowest TTFT — usually a lightweight model like Gemini 3 Flash, Claude Haiku 4.5, or Grok 4.1 Fast, all under ~0.6 s. For finishing a long reply quickly, pick the highest output speed — specialised hosts like Cerebras and Groq push 480–525 tokens/sec. Sort the table by 'Total time' for your output length to settle it.

What is a good tokens-per-second speed for an LLM API?

For a chat UI where a human reads the stream, 30–50 tokens/sec already feels faster than most people read. Above ~80 tokens/sec the text outpaces reading entirely. Mainstream flagships sit at 80–170 tokens/sec; specialised inference hosts (Cerebras, Groq) reach 400–525. For batch or agent pipelines where no human waits, higher is always better: streaming time is tokens ÷ output speed, so doubling throughput halves the generation time.

What is time-to-first-token (TTFT) and why does it matter?

TTFT is the delay between sending your request and the first token streaming back. It sets how responsive the app feels: a 0.4 s TTFT shows a typing indicator almost instantly, while 2 s feels laggy. TTFT matters most for short replies and interactive UIs; for long generations the per-token throughput dominates the total time instead.

Is Gemini Flash faster than GPT-4o?

On the Artificial Analysis medians used here, Gemini 3 Flash has higher output throughput (≈200 vs ≈131 tokens/sec) while GPT-4o has a slightly lower TTFT (≈0.45 vs ≈0.50 s). So GPT-4o begins a hair sooner, but Gemini Flash finishes first on any reply longer than about 19 tokens, the length at which its throughput advantage cancels the 0.05 s TTFT gap. Select both, enter your real output length, and the tool shows the exact winner.

Does a faster model always mean a quicker total response?

No. Total time = TTFT + (tokens ÷ output speed). A model with a tiny TTFT but low throughput wins on very short replies and loses on long ones. The two metrics frequently disagree, so a single 'fastest model' label is misleading — the answer changes with your output length, which is exactly why this tool asks for it.

Why does the same model appear twice with different speeds?

Because the host serving the weights matters as much as the model. DeepSeek V4 on its first-party API and on Together AI are the same model but measure differently because the serving stack, hardware, and batching differ. Llama 4 on Cerebras or Groq is far faster than on a general-purpose host. We store each measured endpoint as its own row so you compare what you would actually call.

Are these speeds guaranteed for my region in Sri Lanka?

No. The figures are independently-measured medians and exclude the network round-trip from your own location, which adds latency that grows with distance to the provider's region. From Sri Lanka, choosing a provider region in Singapore or India will feel faster than us-east. Treat the rankings as a relative guide; your absolute numbers will be a little higher.

Does fastest mean best?

No — speed is one axis. A fast model can be weaker at reasoning, coding, or following instructions. Use this tool to shortlist on latency, then check quality on the LLM Benchmark Comparison and cost on the AI Model Comparison before you ship. The fastest model that is also good enough for your task is the one to pick.

When were these figures last verified?

The output-speed and TTFT medians were transcribed from Artificial Analysis on 2026-06-24. The source dataset is measured continuously, so the page is refreshed when Artificial Analysis publishes a material change. Every row links to its Artificial Analysis source page.

Fastest LLM — Speed & Latency Comparison

Find the fastest LLM API for your workload. Pick 2–6 models, enter your output length, and rank them by estimated total response time. Output throughput (tokens/sec) and time-to-first-token (TTFT) are reported separately, so you can tell which model is fast to start and which is fast to finish. Every figure is a cited Artificial Analysis median.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 24, 2026

Compare LLM speed

16 endpoints · 9 hosts

Artificial Analysis medians · verified 2026-06-24

Pick models to compare (2–6) · 3 selected

Tap to add or remove a model. The same model on a different host is a separate row.

Output length (tokens): 500

≈ 375 English words. A token is ~4 characters.

Fastest to finish

3.00 s

Gemini 3 Flash

Google AI Studio

1.27 s sooner than GPT-4o (about 30% less total time).

Fast to start

0.45 s TTFT

GPT-4o

Lowest time-to-first-token — the typing indicator feels instant.

Fast to finish

200 tok/s

Gemini 3 Flash

Highest throughput — dominates total time on long outputs.

Heads up: the model that starts fastest isn't the one that finishes fastest at this length. Lower TTFT wins on short replies; higher throughput wins on long ones.

Results for 500-token output

Model · host	Output (tok/s)	TTFT (s)	ms/token	Total (s)	vs fastest
Gemini 3 Flash · Google AI Studio	200	0.50	5.00	3.00	(fastest)
GPT-4o · OpenAI API	131	0.45	7.63	4.27	1.42× slower
Claude Sonnet 4.6 · Anthropic API	104	0.74	9.62	5.55	1.85× slower

“Total” is the estimated wall-clock time for a streaming response of this length. Lower is faster. Figures are medians and exclude network round-trips from your own location.

Output speed and TTFT are independently-measured Artificial Analysis medians, last verified 2026-06-24. Real latency varies by provider, region, prompt length, and server load — these are medians, not guarantees. Estimated total time = TTFT + (tokens ÷ output speed).
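The ranking and the "× slower" figures in the results table can be reproduced from the two medians per row. The Python sketch below is only an illustration: the `Endpoint` class and the `rank_by_total` helper are hypothetical names, not the tool's actual code, and the data are the three rows shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    model: str
    host: str
    speed_tps: float  # median output speed, tokens per second
    ttft_s: float     # median time-to-first-token, seconds

    def total_s(self, n_tokens: int) -> float:
        # Estimated total time = TTFT + (tokens / output speed)
        return self.ttft_s + n_tokens / self.speed_tps


def rank_by_total(endpoints, n_tokens):
    """Sort fastest-first and attach each row's slowdown versus the winner."""
    ordered = sorted(endpoints, key=lambda e: e.total_s(n_tokens))
    best = ordered[0].total_s(n_tokens)
    return [(e, e.total_s(n_tokens), e.total_s(n_tokens) / best) for e in ordered]


# The three default rows from the results table (hypothetical variable name).
ROWS = [
    Endpoint("GPT-4o", "OpenAI API", 131, 0.45),
    Endpoint("Claude Sonnet 4.6", "Anthropic API", 104, 0.74),
    Endpoint("Gemini 3 Flash", "Google AI Studio", 200, 0.50),
]

if __name__ == "__main__":
    for e, total, ratio in rank_by_total(ROWS, 500):
        print(f"{e.model} ({e.host}): {total:.2f} s, {ratio:.2f}x")
```

At 500 tokens this orders the rows as Gemini 3 Flash, GPT-4o, Claude Sonnet 4.6, with ratios that round to 1.00, 1.42 and 1.85, matching the table.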

How it works

This tool turns two independently-measured serving metrics into the one number you actually care about: how long your request takes from “send” to the last token. The two inputs per endpoint come from Artificial Analysis, which continuously measures every major hosted model across providers and publishes the median (not best-case) figure:

Output speed — median output throughput in tokens per second once the stream is flowing.
Time-to-first-token (TTFT) — median delay before the first token streams back, in seconds.

For an output of N tokens, the estimated total response time is computed with one exact formula:

totalSeconds = TTFT + (N ÷ output speed)

The first term is the latency before anything appears; the second is the time to stream the rest of the answer at the model’s median throughput. Per-token latency is shown as 1000 ÷ output speed milliseconds per token, and the relative-speed bar normalises every row against the fastest selected endpoint. To guard the arithmetic, the data module computes each total a second way — through the per-token latency path — and a self-check reconciles the two to one-millionth of a second before the page can build, the same belt-and-braces approach the site’s tax calculator uses against the IRD’s alternate formula.
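The arithmetic and the two-path self-check described above could look roughly like the following Python sketch. The function names and the `TOLERANCE_S` constant are assumptions made for illustration and are not taken from the site's data module; only the two formulas and the one-millionth-of-a-second tolerance come from the text.

```python
TOLERANCE_S = 1e-6  # one-millionth of a second, as described in the text


def total_direct(ttft_s: float, speed_tps: float, n_tokens: int) -> float:
    """totalSeconds = TTFT + (N / output speed)."""
    return ttft_s + n_tokens / speed_tps


def ms_per_token(speed_tps: float) -> float:
    """Per-token latency in milliseconds: 1000 / output speed."""
    return 1000.0 / speed_tps


def total_via_ms_per_token(ttft_s: float, speed_tps: float, n_tokens: int) -> float:
    """The same total, reached through the per-token latency path."""
    return ttft_s + n_tokens * ms_per_token(speed_tps) / 1000.0


def check_endpoint(ttft_s: float, speed_tps: float, n_tokens: int) -> float:
    """Return the total, raising if the two paths disagree beyond the tolerance."""
    a = total_direct(ttft_s, speed_tps, n_tokens)
    b = total_via_ms_per_token(ttft_s, speed_tps, n_tokens)
    if abs(a - b) > TOLERANCE_S:
        raise ValueError(f"self-check failed: {a!r} vs {b!r}")
    return a


if __name__ == "__main__":
    # GPT-4o row from the results table: 131 tok/s, 0.45 s TTFT, 500 tokens.
    print(f"{check_endpoint(0.45, 131, 500):.2f} s")  # 4.27 s
    print(f"{ms_per_token(131):.2f} ms/token")        # 7.63 ms/token
```

Running the check at build time, rather than at request time, means a transcription or arithmetic slip fails loudly before any page is published.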

Two deliberate design choices keep the answer honest. First, the host is part of each endpoint’s identity: the same weights served by Cerebras, Groq, Together, or a first-party API differ by up to an order of magnitude in throughput, so they appear as separate rows (16 endpoints across 9 hosts). Second, the “fastest to start” (lowest TTFT) and “fastest to finish” (highest throughput) verdicts are surfaced separately, because for short replies a low TTFT wins and for long ones throughput wins; a single “fastest model” label hides that trade-off. The figures are medians and exclude the network round-trip from your own location, so treat the ranking as a relative guide rather than a guarantee.
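Both design choices can be sketched in a few lines of Python. This is an illustration under assumptions: the `(model, host)` key, the `Medians` tuple and the `verdicts` function are hypothetical names, and the data are the three default rows from the results table rather than the full 16-endpoint dataset.

```python
from typing import Dict, NamedTuple, Tuple

EndpointKey = Tuple[str, str]  # (model, host): the host is part of the identity


class Medians(NamedTuple):
    speed_tps: float  # median output speed, tokens per second
    ttft_s: float     # median time-to-first-token, seconds


def verdicts(rows: Dict[EndpointKey, Medians], n_tokens: int) -> Dict[str, EndpointKey]:
    """Report the three winners separately instead of one 'fastest' label."""
    def total(key: EndpointKey) -> float:
        m = rows[key]
        return m.ttft_s + n_tokens / m.speed_tps

    return {
        "fastest_to_start": min(rows, key=lambda k: rows[k].ttft_s),
        "highest_throughput": max(rows, key=lambda k: rows[k].speed_tps),
        "lowest_total": min(rows, key=total),
    }


ROWS: Dict[EndpointKey, Medians] = {
    ("Gemini 3 Flash", "Google AI Studio"): Medians(200, 0.50),
    ("GPT-4o", "OpenAI API"): Medians(131, 0.45),
    ("Claude Sonnet 4.6", "Anthropic API"): Medians(104, 0.74),
}

if __name__ == "__main__":
    v = verdicts(ROWS, 500)
    for label, (model, host) in v.items():
        print(f"{label}: {model} on {host}")
    if v["fastest_to_start"] != v["lowest_total"]:
        print("Heads up: the fastest starter is not the fastest finisher at this length.")
```

Keying rows by the pair rather than by model name alone means a second host for the same weights is simply another dictionary entry and never overwrites the first.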

Worked examples

Short chatbot reply — 500 tokens, default trio

Formula: totalSeconds = TTFT + (N ÷ output speed), N = 500
GPT-4o (131 t/s, 0.45 s): 0.45 + 500/131 = 0.45 + 3.817 = 4.27 s
Claude Sonnet 4.6 (104 t/s, 0.74 s): 0.74 + 500/104 = 5.55 s
Gemini 3 Flash (200 t/s, 0.50 s): 0.50 + 500/200 = 3.00 s ← winner
Gap to runner-up GPT-4o: 4.27 − 3.00 = 1.27 s (≈30% less total time)

Tiny reply — 50 tokens, low TTFT wins

GPT-4o (131 t/s, 0.45 s): 0.45 + 50/131 = 0.45 + 0.382 = 0.83 s ← winner
GPT-5.5 Mini (168 t/s, 0.61 s): 0.61 + 50/168 = 0.61 + 0.298 = 0.91 s
GPT-4o finishes 0.08 s sooner despite LOWER throughput,
because at 50 tokens the 0.16 s TTFT head start outweighs speed.

Same pair, 500 tokens — throughput overtakes (edge case)

GPT-4o (131 t/s, 0.45 s): 0.45 + 500/131 = 4.27 s
GPT-5.5 Mini (168 t/s, 0.61 s): 0.61 + 500/168 = 3.59 s ← winner
Mini now wins by 0.68 s even though it STARTS 0.16 s later.
Crossover: 0.45 + N/131 = 0.61 + N/168 → N ≈ 95 tokens.
Below ~95 tokens GPT-4o wins; above it, GPT-5.5 Mini wins.
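The crossover in the last example generalises to any pair of endpoints: setting the two totals equal and solving for N gives N = (TTFT_b − TTFT_a) ÷ (1/speed_a − 1/speed_b). A minimal Python sketch follows; `crossover_tokens` is a hypothetical helper name, not part of the tool.

```python
from typing import Optional


def crossover_tokens(
    ttft_a: float, speed_a: float, ttft_b: float, speed_b: float
) -> Optional[float]:
    """Output length N at which endpoints A and B take the same total time.

    Solves ttft_a + N/speed_a = ttft_b + N/speed_b for N. Returns None when
    there is no positive crossover, i.e. the speeds are equal or one endpoint
    is at least as good on both metrics.
    """
    if speed_a == speed_b:
        return None
    n = (ttft_b - ttft_a) / (1 / speed_a - 1 / speed_b)
    return n if n > 0 else None


if __name__ == "__main__":
    # GPT-4o (131 t/s, 0.45 s) versus GPT-5.5 Mini (168 t/s, 0.61 s)
    n = crossover_tokens(0.45, 131, 0.61, 168)
    print(f"crossover at about {n:.0f} tokens")  # about 95 tokens
```

Below the returned length the endpoint with the lower TTFT finishes first; above it, the endpoint with the higher throughput does.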

Frequently asked questions

Sources & references

Output-speed and TTFT medians were transcribed from Artificial Analysis on 2026-06-24. They are independently-measured medians that vary by provider, region, prompt length, and server load — no SLA is implied. Each row in the tool links to its Artificial Analysis source page.

Speed is one axis. Compare model quality on benchmarks, project pricing and capabilities, or estimate self-hosted GPU throughput before you ship.
