What is the maximum output tokens for GPT-4o?

GPT-4o caps a single response at 16,384 output tokens — roughly 12,288 English words. That is separate from its 128K-token context window, which covers your prompt plus the reply. To produce a longer answer you must split the job across multiple calls.

How many tokens can Claude output in one response?

Claude Opus 4.8, 4.7 and 4.6 allow up to 128,000 output tokens per call (~96,000 words). Claude Sonnet 4.6 and Haiku 4.5 allow 64,000 output tokens. Among the proprietary models in the table below, 128,000 is the highest single-call output cap, matched only by GPT-5.

What is the difference between context window and max output tokens?

The context window is the total token budget for one request — your prompt, any documents, and the model's reply all share it. Max output tokens is a separate, smaller cap on just the reply. A model with a 1M context window might still only output 64K–128K tokens in one call, so a huge window does not mean a huge single answer.
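The interaction between the two limits can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: the function name and the prompt sizes are assumptions, and the limits are the Claude Sonnet 4.6 figures from the table on this page (reading 1M as 1,000,000 tokens).

```python
def max_reply_tokens(context_window: int, max_output: int, prompt_tokens: int) -> int:
    """Largest reply that satisfies both limits for a given prompt size."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # The reply shares the context window with the prompt and documents...
    room_in_context = max(0, context_window - prompt_tokens)
    # ...and is additionally capped by the separate max-output limit.
    return min(max_output, room_in_context)


# Claude Sonnet 4.6 as listed on this page: 1M context, 64,000 max output.
print(max_reply_tokens(1_000_000, 64_000, prompt_tokens=50_000))   # 64000
print(max_reply_tokens(1_000_000, 64_000, prompt_tokens=980_000))  # 20000
```

With a small prompt the output cap is the binding limit; only when the prompt nearly fills the window does the context window take over.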

Why does my model stop generating before it finishes?

It almost certainly hit the max output token cap. The API signals this in the stop reason: OpenAI's Chat Completions API returns finish_reason "length", and Anthropic's Messages API returns stop_reason "max_tokens" instead of "end_turn". Either raise the max_tokens parameter (up to the model's documented limit), pick a model with a higher output cap, or split the response into chunks and stitch them together.
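A truncation check can be kept independent of any SDK by looking only at the stop-reason string. The sketch below assumes the values the two APIs report for a reply cut off at the output cap: finish_reason "length" for OpenAI Chat Completions and stop_reason "max_tokens" for Anthropic Messages. The helper name and the provider keys are hypothetical.

```python
from typing import Optional

# Stop-reason values that mean "cut off at the output cap" (assumed per provider):
#   OpenAI Chat Completions -> choices[0].finish_reason == "length"
#   Anthropic Messages      -> stop_reason == "max_tokens"
TRUNCATION_REASONS = {
    "openai": {"length"},
    "anthropic": {"max_tokens"},
}


def was_truncated(provider: str, stop_reason: Optional[str]) -> bool:
    """Return True when the stop reason says the reply hit the output cap."""
    try:
        reasons = TRUNCATION_REASONS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return stop_reason in reasons


print(was_truncated("openai", "length"))        # True
print(was_truncated("anthropic", "end_turn"))   # False
print(was_truncated("anthropic", "max_tokens")) # True
```

When the check returns True, the remedies are the ones listed above: raise the cap, switch models, or continue in a further call.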

How many words is 128K tokens?

Using OpenAI's average of 1 token ≈ 0.75 words, 128,000 output tokens is about 96,000 words — a 380-page book. Exact counts vary by tokeniser and language; for an exact count of real text, use a dedicated token counter.

Do reasoning tokens count toward the output limit?

Yes. On reasoning models (OpenAI o3/o4-mini, DeepSeek-R1) the hidden chain-of-thought consumes part of the output budget before any visible answer is produced. Plan for that: reserve headroom so the model has room to both think and reply, or the visible answer can be truncated.
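Reserving headroom is simple arithmetic, sketched here in Python. The function names and the 25,000-token reasoning reserve are hypothetical; how many tokens a model actually spends on reasoning varies by task and is not something this page documents. The 100,000-token cap is the OpenAI o3 figure from the table.

```python
def visible_answer_budget(max_output: int, reasoning_reserve: int) -> int:
    """Tokens left for the visible answer after setting aside room for hidden reasoning."""
    if max_output <= 0 or reasoning_reserve < 0:
        raise ValueError("max_output must be positive and reasoning_reserve non-negative")
    return max(0, max_output - reasoning_reserve)


def risks_truncation(max_output: int, reasoning_reserve: int, answer_tokens: int) -> bool:
    """True when the planned visible answer no longer fits once reasoning is reserved."""
    return answer_tokens > visible_answer_budget(max_output, reasoning_reserve)


# OpenAI o3 as listed on this page: 100,000 max output tokens.
print(visible_answer_budget(100_000, 25_000))      # 75000
print(risks_truncation(100_000, 25_000, 80_000))   # True
print(risks_truncation(100_000, 25_000, 60_000))   # False
```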

Are open-weight models like Llama limited the same way?

No. Open-weight models (Llama, Mistral) publish no separate output cap — the only hard limit is the context window. In practice the host you run on (Together, Groq, Ollama, etc.) sets its own default max_tokens, which is often much lower, so check your provider's settings.

How current are these numbers?

Every cap was cross-checked against the vendor's official model documentation on 2026-06-13, and each row in the table links its source. LLM limits change often; if a value looks out of date, email me and I will verify and update it.

AI Model Max Output Tokens Lookup

The max output tokens for every current LLM — Claude, GPT, Gemini, Llama — in one table. Pick a model, enter how long a reply you want, and see instantly whether it fits in a single API call or needs chunking. Sources cited, no signup.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 13, 2026.

Will my reply fit in one call?

Example (limits verified 2026-06-13): Claude Opus 4.8, max output 128,000 tokens, context 1M.

Desired response length: 2,000 words ≈ 2,660 tokens, converted at ~1.33 tokens per word (OpenAI average).

Verdict: fits in one call. The whole reply can be generated in a single API call.

Model max output: 128,000 tokens (≈ 96,000 words per call).

Output cap used: 2,660 / 128,000 tokens, which is 2% of Claude Opus 4.8's single-call output cap, with 125,340 tokens to spare.

Max output tokens by model

24 of 24 models can return your requested length in a single call.

Model	Vendor	Max output	≈ words	Context	Your request	Source
Claude Opus 4.8	Anthropic	128,000	96,000	1M	2%	docs
Claude Opus 4.7	Anthropic	128,000	96,000	1M	2%	docs
Claude Opus 4.6	Anthropic	128,000	96,000	1M	2%	docs
Claude Sonnet 4.6	Anthropic	64,000	48,000	1M	4%	docs
Claude Haiku 4.5	Anthropic	64,000	48,000	200K	4%	docs
GPT-5	OpenAI	128,000	96,000	400K	2%	docs
OpenAI o3	OpenAI	100,000	75,000	200K	3%	docs
OpenAI o4-mini	OpenAI	100,000	75,000	200K	3%	docs
GPT-4.1	OpenAI	32,768	24,576	1M	8%	docs
GPT-4.1 mini	OpenAI	32,768	24,576	1M	8%	docs
GPT-4o	OpenAI	16,384	12,288	128K	16%	docs
GPT-4o mini	OpenAI	16,384	12,288	128K	16%	docs
GPT-4 Turbo	OpenAI	4,096	3,072	128K	65%	docs
GPT-3.5 Turbo	OpenAI	4,096	3,072	16K	65%	docs
Gemini 2.5 Pro	Google	65,536	49,152	1M	4%	docs
Gemini 2.5 Flash	Google	65,536	49,152	1M	4%	docs
Gemini 2.0 Flash	Google	8,192	6,144	1M	32%	docs
Gemini 1.5 Pro	Google	8,192	6,144	2M	32%	docs
Gemini 1.5 Flash	Google	8,192	6,144	1M	32%	docs
Llama 4 Maverick (open)	Meta	1,000,000	750,000	1M	0%	docs
Llama 3.3 70B (open)	Meta	128,000	96,000	128K	2%	docs
DeepSeek-V3	DeepSeek	8,192	6,144	128K	32%	docs
DeepSeek-R1	DeepSeek	8,192	6,144	128K	32%	docs
Mistral Large 2 (open)	Mistral	128,000	96,000	128K	2%	docs

24 models listed. "open" = open-weight model with no separate output cap; its limit is the context window and the host may cap it lower.

Max-output caps are vendor-documented values verified on 2026-06-13 (each row links its source below). Only the word↔token bridge is an average (1 token ≈ 0.75 words ≈ 4 characters; 1 word ≈ 1.33 tokens, OpenAI guidance). Max output is the single-call completion cap — separate from, and smaller than, the context window.
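For use in your own code, the table reduces to a lookup from model name to cap. The sketch below copies a handful of rows from the table above into a dictionary and filters them by a requested length; the dictionary and function names are illustrative, not part of this tool.

```python
from typing import Dict, List

# A subset of the max-output caps from the table above (tokens per call).
OUTPUT_CAPS: Dict[str, int] = {
    "Claude Opus 4.8": 128_000,
    "Claude Sonnet 4.6": 64_000,
    "GPT-4.1": 32_768,
    "GPT-4o": 16_384,
    "GPT-4 Turbo": 4_096,
    "Gemini 2.5 Pro": 65_536,
    "Gemini 2.0 Flash": 8_192,
}


def models_that_fit(requested_tokens: int, caps: Dict[str, int]) -> List[str]:
    """Names of models whose single-call output cap covers the request."""
    return [name for name, cap in caps.items() if requested_tokens <= cap]


# 2,660 tokens (the 2,000-word example) fits every listed model.
print(len(models_that_fit(2_660, OUTPUT_CAPS)))       # 7
# 6,650 tokens (a 5,000-word article) rules out GPT-4 Turbo.
print("GPT-4 Turbo" in models_that_fit(6_650, OUTPUT_CAPS))  # False
```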

How it works

Every large language model has two different token limits, and developers constantly confuse them. The context window is the total budget for one request: your prompt, any attached documents, and the model's reply all share it. The max output tokens limit is a separate, smaller cap on just the completion. A model can have a one-million token context window and still refuse to write more than 64,000 tokens in a single answer. This tool is about that second number, the one that produces a truncation stop reason (max_tokens on Anthropic, length on OpenAI) and cuts your generation off mid-sentence.

The numbers themselves are not estimated. Each model's output cap is taken from the vendor's own documentation — Anthropic's models overview, OpenAI's models reference, Google's Gemini API model list, and the Llama and Mistral model cards — and every row in the table above links straight to its source. Open-weight models (Llama, Mistral) publish no separate output cap at all: their only hard limit is the context window, so they are flagged "open" and the host you run on usually applies its own lower default.

The fit check uses four small, deterministic steps:

1. Convert your length to tokens. If you enter tokens, they are used as-is. If you enter words, they are converted with OpenAI's published average of about 1.33 tokens per word: tokens = ceil(words × 1.33).
2. Compare to the cap. The reply fits in one call when requestedTokens ≤ maxOutputTokens.
3. Count chunks when it does not fit. chunks = ceil(requestedTokens / maxOutputTokens), which is how many calls you would split the job into.
4. Show headroom. The bar reads min(100, round(requestedTokens / maxOutputTokens × 100))% of the cap used.
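The four steps above can be sketched in Python as follows. This is a re-implementation from the formulas as written, not the page's own source code; the names FitResult, words_to_tokens and check_fit are assumptions. The word conversion uses integer arithmetic (133/100) so that the ceiling is not affected by floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitResult:
    requested_tokens: int
    fits: bool
    chunks: int
    percent_used: int


def words_to_tokens(words: int) -> int:
    """Step 1: ceil(words * 1.33), computed with integers."""
    return -(-words * 133 // 100)


def check_fit(requested_tokens: int, max_output_tokens: int) -> FitResult:
    if requested_tokens <= 0 or max_output_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Step 2: the boundary is inclusive.
    fits = requested_tokens <= max_output_tokens
    # Step 3: ceil(requested / cap) via integer division.
    chunks = -(-requested_tokens // max_output_tokens)
    # Step 4: percentage of the cap, clamped at 100. Python's round() rounds
    # exact .5 ties to the nearest even integer, which may differ from the page.
    percent = min(100, round(requested_tokens / max_output_tokens * 100))
    return FitResult(requested_tokens, fits, chunks, percent)


# 5,000 words on GPT-4o (cap 16,384 tokens).
print(check_fit(words_to_tokens(5_000), 16_384))
# 100,000 words on Claude Opus 4.8 (cap 128,000 tokens).
print(check_fit(words_to_tokens(100_000), 128_000))
```

The first call reports 6,650 tokens, one chunk and 41% used; the second reports 133,000 tokens, two chunks and the bar clamped at 100%.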

The "≈ words" columns invert the same ratio: words = floor(maxOutputTokens × 0.75). Only this word↔token bridge is an approximation — exact counts depend on each model's tokeniser and your specific text. The fit verdict is cross-checked two independent ways (in the token domain and the word domain) so the answer is consistent for realistic inputs. The output caps are exact, cited figures.
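The inverse conversion, and one possible reading of the two-domain cross-check, can be sketched like this. How the page actually performs its cross-check is not stated, so fits_both_ways is an assumption, as are all function names.

```python
from typing import Tuple


def tokens_to_words(tokens: int) -> int:
    """floor(tokens * 0.75), computed with integers."""
    return tokens * 3 // 4


def words_to_tokens(words: int) -> int:
    """ceil(words * 1.33), computed with integers."""
    return -(-words * 133 // 100)


def fits_both_ways(words: int, max_output_tokens: int) -> Tuple[bool, bool]:
    """Check the fit in the token domain and in the word domain."""
    token_domain = words_to_tokens(words) <= max_output_tokens
    word_domain = words <= tokens_to_words(max_output_tokens)
    return token_domain, word_domain


print(tokens_to_words(16_384))            # 12288, the GPT-4o row
print(fits_both_ways(5_000, 16_384))      # (True, True)
print(fits_both_ways(100_000, 128_000))   # (False, False)
```

Because 1.33 × 0.75 = 0.9975 rather than exactly 1, the two conversions are not perfect inverses. Under this reading the two checks can disagree for requests within a fraction of a percent of the cap, which is consistent with the bridge being described as an approximation.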

Worked examples

5,000-word article on GPT-4o

Fits in one call

Tokens: ceil(5,000 × 1.33) = 6,650
GPT-4o max output: 16,384 tokens
6,650 ≤ 16,384 → fits in one call
Headroom: round(6,650 / 16,384 × 100) = 41% used
Spare: 16,384 − 6,650 = 9,734 tokens left

400-page book (~100,000 words) on Claude Opus 4.8

Too long — needs 2 calls

Tokens: ceil(100,000 × 1.33) = 133,000
Opus 4.8 max output: 128,000 tokens
133,000 > 128,000 → does not fit
Chunks: ceil(133,000 / 128,000) = 2
Even the highest-cap proprietary model on this page needs at least 2 calls, so chunk it.

At the boundary: 64,000 tokens on Claude Sonnet 4.6

Fits exactly

Sonnet 4.6 max output: 64,000 tokens
Request 64,000 tokens: 64,000 ≤ 64,000 → fits (boundary inclusive)
Headroom: 100% used — zero tokens to spare
Request 65,000 tokens: 65,000 > 64,000 → ceil(65,000 / 64,000) = 2 calls
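When a request does not fit, the chunk count can be turned into a per-call token plan. The sketch below fills each call to the cap and puts the remainder in the last call; that greedy split is an assumption chosen for simplicity (the page only specifies the number of chunks), and plan_chunks is a hypothetical name.

```python
from typing import List


def plan_chunks(requested_tokens: int, max_output_tokens: int) -> List[int]:
    """Per-call token budgets whose count equals ceil(requested / cap)."""
    if requested_tokens <= 0 or max_output_tokens <= 0:
        raise ValueError("token counts must be positive")
    full_calls, remainder = divmod(requested_tokens, max_output_tokens)
    budgets = [max_output_tokens] * full_calls
    if remainder:
        budgets.append(remainder)  # the last call carries whatever is left
    return budgets


print(plan_chunks(133_000, 128_000))  # [128000, 5000]
print(plan_chunks(64_000, 64_000))    # [64000]
print(plan_chunks(65_000, 64_000))    # [64000, 1000]
```

In practice it is usually safer to split at natural boundaries such as chapters or sections and leave some slack under the cap, since the word-to-token conversion is only an average.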

Sources & references

The 24 model caps on this page were last cross-checked against the vendor documentation above on 2026-06-13. LLM limits change often; the page is reviewed when new models ship or vendors revise their docs.

Spotted an output limit that has changed, or a model I should add?

Email me at [email protected] — most updates ship within 24 hours.