AI Reasoning Model Comparison
Compare 10 reasoning LLMs from 6 providers side by side — reasoning-token pricing, context window, knowledge cutoff and AIME / GPQA / SWE-bench scores — then estimate what a single task costs once hidden thinking tokens are counted. Free, no signup, every figure sourced.
How it works
This is mostly a curated data-lookup tool — the comparison table, the “pick by use case” helper and the filters all read from one hand-verified dataset of reasoning models. The only computation is the optional per-task cost estimate, which has two parts.
Visible cost (exact). Every provider publishes a price per million input and output tokens. The visible cost of a request is:
cost = input ÷ 1,000,000 × price_in + output ÷ 1,000,000 × price_out
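As a minimal sketch, the visible-cost formula above can be written as a small function. The function name and the example prices are hypothetical, chosen only to illustrate the arithmetic; they are not taken from any vendor's price list.

```python
def visible_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Visible cost in dollars; both prices are per million tokens."""
    return (input_tokens / 1_000_000 * price_in
            + output_tokens / 1_000_000 * price_out)


# Hypothetical prices: $2.00 per million input tokens, $8.00 per million output tokens.
example = visible_cost(10_000, 1_000, price_in=2.00, price_out=8.00)
print(f"${example:.4f}")  # 0.02 for input + 0.008 for output
```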
Hidden reasoning cost (estimate). Reasoning models also generate hidden “thinking” tokens. OpenAI, Anthropic and Google all bill these at the output-token rate (cited in each vendor's reasoning guide below), so they can quietly multiply the bill. Hidden tokens can't be predicted exactly without running the model, so we estimate them with a transparent heuristic: a base budget of 2,000 tokens scaled by an effort multiplier (low ×0.25, medium ×1.0, high ×2.5), which gives 500, 2,000 or 5,000 estimated reasoning tokens. Models that always reason and whose effort can't be adjusted (DeepSeek-R1, Grok 4) use a documented typical budget instead. The estimated total is:
total = input ÷ 1,000,000 × price_in + (output + reasoning) ÷ 1,000,000 × price_out
Sorting and filtering are pure operations on the stored rows — a stable sort on the chosen column (price, context, AIME, GPQA or SWE-bench) and boolean predicate matching for the provider, controllable-effort and open-weight filters. Benchmark columns are verbatim vendor-reported figures with a source link per model; we never recompute them, and missing figures sink to the bottom rather than counting as zero. No request leaves your browser — the whole dataset ships with the page, last verified 2026-06-27.
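One way to get the sorting and filtering behavior described above is sketched below: a stable sort that pushes missing values to the bottom regardless of direction, plus simple predicate filters. The row fields, model names and scores are invented for the example and do not come from the page's dataset.

```python
# Hypothetical rows; names and scores are made up for illustration.
rows = [
    {"model": "model-a", "provider": "X", "open_weight": False, "aime": 90.0},
    {"model": "model-b", "provider": "Y", "open_weight": True, "aime": None},
    {"model": "model-c", "provider": "X", "open_weight": True, "aime": 95.0},
]


def sort_rows(rows, column, descending=True):
    """Return a new list sorted on column; rows with no value go last."""
    present = [r for r in rows if r.get(column) is not None]
    missing = [r for r in rows if r.get(column) is None]
    # sorted() is stable, so rows with equal values keep their original order.
    return sorted(present, key=lambda r: r[column], reverse=descending) + missing


def filter_rows(rows, provider=None, open_weight=None):
    """Keep rows matching every filter that is set; None means 'no filter'."""
    return [
        r for r in rows
        if (provider is None or r["provider"] == provider)
        and (open_weight is None or r["open_weight"] == open_weight)
    ]


print([r["model"] for r in sort_rows(rows, "aime")])
print([r["model"] for r in filter_rows(rows, provider="X", open_weight=True)])
```

Splitting present and missing rows before sorting keeps a missing score from being compared as if it were zero, and both functions return new lists so the stored rows are never mutated.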
Sources & references
- OpenAI — Reasoning guide (reasoning_effort, reasoning-token billing)
- OpenAI — API pricing
- Anthropic — Extended thinking docs
- Anthropic — API pricing
- Google — Gemini thinking docs
- Google — Gemini API pricing
- DeepSeek — API pricing & R1 model card
- xAI — Grok models & pricing
- Alibaba Cloud — Qwen (Model Studio) models & pricing
Prices and benchmark scores were last cross-checked against these sources on 2026-06-27 (June 2026 vendor snapshot). Benchmark figures are vendor-reported and not independently re-run. The page is reviewed on each major reasoning-model release.
Spotted a stale price, a new model, or a wrong benchmark?
Email me at [email protected] — refreshes usually ship within 24 hours.