induwara.lk

AI Reasoning Model Comparison

Compare 10 reasoning LLMs from 6 providers side by side — reasoning-token pricing, context window, knowledge cutoff and AIME / GPQA / SWE-bench scores — then estimate what a single task costs once hidden thinking tokens are counted. Free, no signup, every figure sourced.

By Induwara Ashinsana · Updated Jun 27, 2026
Compare AI reasoning models · 10 models · 6 providers

Pick by use case

Highest AIME 2025 score.

#1 GPT-5 (thinking) · OpenAI · AIME 2025: 94.6% (vendor source)
Frontier general reasoning with a 400K context and coding lead.

#2 Grok 4 · xAI · AIME 2025: 93.3% (vendor source)
Reasoning with live web / X search baked into the model.

#3 o4-mini · OpenAI · AIME 2025: 92.7% (vendor source)
Best price-for-reasoning: near-o3 math at a fraction of the cost.

Estimate a task's cost

Input tokens: ≈ 750 words per 1,000 tokens.

Visible output tokens: the reply you actually keep.

Reasoning effort: applied to models with an effort dial.

Cheapest estimated total: Gemini 2.5 Flash at $0.0066 per task (medium effort).

Model                   Provider    Visible cost   Est. total
Gemini 2.5 Flash        Google      $0.0016        $0.0066   (cheapest)
DeepSeek-R1             DeepSeek    $0.0016        $0.0104
o4-mini                 OpenAI      $0.0033        $0.0121
Qwen3-235B (thinking)   Qwen        $0.0021        $0.0133
o3                      OpenAI      $0.006         $0.022
Gemini 2.5 Pro          Google      $0.0063        $0.0263
GPT-5 (thinking)        OpenAI      $0.0063        $0.0263
Claude Sonnet 4.5       Anthropic   $0.0105        $0.0405
Claude Opus 4.5         Anthropic   $0.0175        $0.0675
Grok 4                  xAI         $0.0105        $0.0705

“Est. total” adds a heuristic hidden-reasoning estimate and is not a quote. Always-on models (e.g. DeepSeek-R1, Grok 4) use their documented typical thinking budget and ignore the effort dial.

Full comparison (10 of 10)

Model                   Provider    Cutoff    In $/M   Out $/M   AIME
DeepSeek-R1             DeepSeek    2024-07   $0.55    $2.19     87.5%
Gemini 2.5 Flash        Google      2025-01   $0.30    $2.5      78.0%
Qwen3-235B (thinking)   Qwen        2024-12   $0.70    $2.8      92.3%
o4-mini                 OpenAI      2024-06   $1.1     $4.4      92.7%
o3                      OpenAI      2024-06   $2       $8        88.9%
Gemini 2.5 Pro          Google      2025-01   $1.25    $10       88.0%
GPT-5 (thinking)        OpenAI      2024-10   $1.25    $10       94.6%
Claude Sonnet 4.5       Anthropic   2025-03   $3       $15       87.0%
Grok 4                  xAI         2024-11   $3       $15       93.3%
Claude Opus 4.5         Anthropic   2025-03   $5       $25       89.0%

AIME = AIME 2025, GPQA = GPQA Diamond, SWE-b = SWE-bench Verified. Scores are vendor-reported, in percent. “—” means the vendor did not publish that figure.

Static comparison — no API key, no proxy, no logging.

Choosing a model here sends nothing to any AI provider. Every price, context window and benchmark traces to the vendor's own docs (see Sources below). Reviewed on each major model release.

How it works

This is mostly a curated data-lookup tool — the comparison table, the “pick by use case” helper and the filters all read from one hand-verified dataset of reasoning models. The only computation is the optional per-task cost estimate, which has two parts.

Visible cost (exact). Every provider publishes a price per million input and output tokens. The visible cost of a request is:

cost = input ÷ 1,000,000 × price_in + output ÷ 1,000,000 × price_out
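
The visible-cost formula above can be sketched in Python as follows. The function name `visible_cost` and its parameter names are illustrative and not taken from the tool's own code; prices are in US dollars per million tokens, and the sample figures are the o4-mini prices listed on this page.

```python
def visible_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Visible cost in USD; both prices are per 1,000,000 tokens."""
    return (input_tokens / 1_000_000 * price_in
            + output_tokens / 1_000_000 * price_out)


# o4-mini at $1.10 in / $4.40 out per 1M tokens.
cost = visible_cost(10_000, 2_000, price_in=1.10, price_out=4.40)
print(f"${cost:.4f}")  # $0.0198
```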

Hidden reasoning cost (estimate). Reasoning models also generate hidden “thinking” tokens. OpenAI, Anthropic and Google all bill these at the output-token rate (cited in each vendor's reasoning guide below), so they can quietly multiply the bill. Hidden tokens can't be predicted exactly without running the model, so we estimate them with a transparent heuristic: a base budget of 2,000 tokens scaled by an effort multiplier (low ×0.25, medium ×1.0, high ×2.5). Models that always reason and can't be metered (DeepSeek-R1, Grok 4) use a documented typical budget instead. The estimated total is:

total = input ÷ 1e6 × price_in + (output + reasoning) ÷ 1e6 × price_out
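
A minimal Python sketch of this estimate, assuming the base budget and multipliers described above. The names `estimated_total`, `EFFORT_MULTIPLIER` and the `fixed_reasoning_tokens` parameter are hypothetical; the last one stands in for the "documented typical budget" of always-on models, whose value is not given here, so the caller must supply it.

```python
from typing import Optional

BASE_REASONING_TOKENS = 2_000
EFFORT_MULTIPLIER = {"low": 0.25, "medium": 1.0, "high": 2.5}


def estimated_total(input_tokens: int, output_tokens: int,
                    price_in: float, price_out: float,
                    effort: str = "medium",
                    fixed_reasoning_tokens: Optional[int] = None) -> float:
    """Estimated USD cost including hidden reasoning tokens.

    Prices are per 1,000,000 tokens. If fixed_reasoning_tokens is set
    (hypothetical hook for always-on models), the effort dial is ignored.
    """
    if fixed_reasoning_tokens is not None:
        reasoning = fixed_reasoning_tokens
    else:
        reasoning = BASE_REASONING_TOKENS * EFFORT_MULTIPLIER[effort]
    # Hidden tokens are billed at the output-token rate.
    return (input_tokens / 1e6 * price_in
            + (output_tokens + reasoning) / 1e6 * price_out)


# o4-mini, 10,000 input + 2,000 visible output tokens, high effort.
total = estimated_total(10_000, 2_000, 1.10, 4.40, effort="high")
print(f"${total:.4f}")  # $0.0418
```

This is a heuristic, not a quote: real reasoning-token counts vary per request and are only known after the model has run.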

Sorting and filtering are pure operations on the stored rows — a stable sort on the chosen column (price, context, AIME, GPQA or SWE-bench) and boolean predicate matching for the provider, controllable-effort and open-weight filters. Benchmark columns are verbatim vendor-reported figures with a source link per model; we never recompute them, and missing figures sink to the bottom rather than counting as zero. No request leaves your browser — the whole dataset ships with the page, last verified 2026-06-27.
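
One way to get a stable sort in which missing figures sink to the bottom is sketched below in Python. This is an illustration of the behaviour described, not the page's actual code; the row layout, the `sort_rows` helper and the `Example-X` row (a made-up model with no published score) are assumptions.

```python
rows = [
    {"model": "o3", "aime": 88.9},
    {"model": "Example-X", "aime": None},  # hypothetical row, no published score
    {"model": "GPT-5 (thinking)", "aime": 94.6},
    {"model": "Grok 4", "aime": 93.3},
]


def sort_rows(rows, column, descending=False):
    """Sort by one column; rows with a missing value always go last."""
    present = [r for r in rows if r.get(column) is not None]
    missing = [r for r in rows if r.get(column) is None]
    # sorted() is stable, so rows with equal values keep their original order.
    ordered = sorted(present, key=lambda r: r[column], reverse=descending)
    return ordered + missing


for row in sort_rows(rows, "aime", descending=True):
    print(row["model"], row["aime"])
```

Splitting the rows first, rather than substituting zero for a missing value, keeps an unpublished score from being ranked as if it were the worst score.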

Worked examples

Visible cost only — o4-mini

10,000 input + 2,000 visible output tokens ($1.10 / $4.40 per 1M)

  1. Input: 10,000 ÷ 1,000,000 × $1.10 = $0.011
  2. Output: 2,000 ÷ 1,000,000 × $4.40 = $0.0088
  3. Visible cost = $0.011 + $0.0088 = $0.0198 ≈ 1.98¢

With hidden reasoning — high effort

Same task and prices, reasoning effort = high

  1. Reasoning tokens ≈ 2,000 base × 2.5 (high) = 5,000
  2. Output billed = 2,000 visible + 5,000 hidden = 7,000
  3. Output cost: 7,000 ÷ 1,000,000 × $4.40 = $0.0308
  4. Estimated total = $0.011 + $0.0308 = $0.0418 ≈ 4.18¢
  5. That is ~2.1× the visible-only cost — the hidden-token tax.

Filter + use-case pick — cheapest open-weight reasoner

Toggle 'Open weight', then 'Pick by use case → Math'

  1. Open-weight filter leaves DeepSeek-R1 and Qwen3-235B (thinking).
  2. Sort by Out $/M ascending → DeepSeek-R1 ($2.19) on top.
  3. Math use-case ranks by AIME 2025 → Qwen3-235B (92.3%) edges R1 (87.5%).
  4. So: cheapest open reasoner = DeepSeek-R1; strongest open math = Qwen3.
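
The same filter-then-rank steps can be reproduced in a few lines of Python. The field names (`out_price`, `aime`, `open_weight`) are assumed for illustration, and only three rows from the comparison table are included to keep the sketch short.

```python
rows = [
    {"model": "DeepSeek-R1", "out_price": 2.19, "aime": 87.5, "open_weight": True},
    {"model": "Qwen3-235B (thinking)", "out_price": 2.80, "aime": 92.3, "open_weight": True},
    {"model": "o4-mini", "out_price": 4.40, "aime": 92.7, "open_weight": False},
]

# Step 1: boolean predicate for the open-weight filter.
open_rows = [r for r in rows if r["open_weight"]]

# Step 2: lowest output price per 1M tokens among the open-weight models.
cheapest = min(open_rows, key=lambda r: r["out_price"])

# Step 3: highest AIME 2025 score among the open-weight models.
best_math = max(open_rows, key=lambda r: r["aime"])

print(cheapest["model"])   # DeepSeek-R1
print(best_math["model"])  # Qwen3-235B (thinking)
```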

Sources & references

Prices and benchmark scores were last cross-checked against these sources on 2026-06-27 (June 2026 vendor snapshot). Benchmark figures are vendor-reported and not independently re-run. The page is reviewed on each major reasoning-model release.


Spotted a stale price, a new model, or a wrong benchmark?

Email me at [email protected] — refreshes usually ship within 24 hours.