What is the best LLM for coding right now?

Sort the table by SWE-bench Verified — the benchmark that asks a model to fix real GitHub issues so the repo's own tests pass. On the cited figures, Claude Opus 4.5 leads on SWE-bench Verified, with Claude Sonnet 4.5 and GPT-5 close behind. Pick the models your budget allows, then read the per-cell source to confirm the setting matches your use case.

What does SWE-bench Verified measure?

SWE-bench gives a model a real bug report from a popular Python repository and asks it to write a patch. The patch is scored only by whether the project's existing test suite passes. "Verified" is a 500-task subset that OpenAI and the SWE-bench team hand-checked to remove broken or ambiguous tasks, which makes it one of the cleaner public signals of real-world coding skill.

Is GPQA Diamond harder than MMLU?

Yes. MMLU and MMLU-Pro test broad multiple-choice knowledge. GPQA Diamond is a set of 198 of the hardest PhD-written science questions. Domain experts answer them correctly only about two-thirds of the time, and non-experts score only modestly above chance even with internet access. A high GPQA score is therefore hard to reach through recalled facts alone and points to stronger reasoning.

Which AI model scores highest on math benchmarks (AIME / MATH)?

Sort by AIME or MATH. Reasoning-tuned models dominate here: Claude Opus 4.5 and Grok 4 report the highest AIME figures, while DeepSeek R1 posts a near-perfect MATH-500 score. Note that vendors report different AIME years (2024 vs 2025), shown in each cell's setting, so compare like with like before drawing conclusions.

How are LLM benchmark scores actually measured?

Most are pass@1 accuracy: one attempt, scored right or wrong, averaged over the test set. Some vendors report with chain-of-thought prompting, extra test-time compute, or tool access, which inflates the number. This tool stores each figure's setting and flags any score reported off the canonical setting with a "†", excluding it from the composite so you never compare a tools-on number against a tools-off one.
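As a rough illustration of the pass@1 definition above, the sketch below averages one right-or-wrong attempt per task. The function name `pass_at_1` and the sample results are hypothetical; this is not the tool's code or any vendor's evaluation harness.

```python
from typing import Sequence


def pass_at_1(results: Sequence[bool]) -> float:
    """Return pass@1 accuracy as a percentage.

    `results` holds one entry per task: True if the model's single
    attempt was scored correct, False otherwise.
    """
    if not results:
        raise ValueError("need at least one task result")
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 4 tasks, 3 solved on the first attempt.
sample = [True, True, False, True]
print(pass_at_1(sample))  # 75.0
```

A score reported with several attempts per task or with tool access would need a different scoring function, which is why such figures are not directly comparable to this one.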

How is the composite score calculated?

The composite is the plain unweighted average of a model's scores across the benchmarks that every currently-selected model reports on the canonical setting (the "common set"). Change the selection and the common set — and the composite — recompute. This keeps every comparison apples-to-apples: a model is never rewarded for skipping a hard benchmark.

Why do some cells show "—" instead of a number?

A dash means the vendor never published that benchmark for that model. It is deliberately not treated as a zero, because "didn't report" is not the same as "scored zero." Benchmarks with a dash for any selected model are dropped from the composite for everyone, so the missing data never penalises a rival.
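One way to keep "not reported" distinct from zero is to parse a dash into a missing value rather than a number. The sketch below is only an assumption about how such parsing could look; `parse_cell` is a hypothetical helper, not the tool's actual code.

```python
from typing import List, Optional


def parse_cell(cell: str) -> Optional[float]:
    """Parse one table cell; a dash means 'not reported' and becomes None."""
    text = cell.strip()
    if text in {"—", "-", ""}:
        return None  # missing, deliberately not 0.0
    return float(text)


# The Claude Opus 4.5 row from the score table (HumanEval and MATH are dashes).
row = ["88.0", "87.0", "80.9", "—", "96.0", "—", "80.7"]
scores: List[Optional[float]] = [parse_cell(c) for c in row]
reported = [s for s in scores if s is not None]
print(len(reported))  # 5
```

Keeping the missing cells as `None` forces any later averaging step to decide explicitly what to do with them, instead of silently dragging a mean down with zeros.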

How current are these numbers, and where do they come from?

Every score is transcribed verbatim from the vendor's own model card or launch post — OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek and xAI — and each cell links back to its source. The full set was last reconciled against those cards on 2026-06-13. It is a manual, curated dataset, so there are no stale scraped numbers, but a brand-new release may take a few days to be added.

Should I just pick the model with the highest composite?

Not blindly. The composite is a quick overall signal, but the right model depends on your task: sort by SWE-bench for coding agents, AIME for math, GPQA for science reasoning, MMMU for vision. Also weigh price and context window — those live in the companion AI Model Comparison tool, which this page links to both ways.

AI LLM Benchmark Comparison

Compare 7 leading language models across 7 published benchmarks — MMLU-Pro, GPQA Diamond, SWE-bench Verified, HumanEval, AIME, MATH and MMMU. Pick 2–6 models, sort by any score, and get an apples-to-apples composite. Every figure is cited from the vendor's own model card.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 13, 2026

Compare LLM benchmarks · 7 models · 7 benchmarks

Vendor-cited · verified 2026-06-13

Best overall: Claude Opus 4.5 (highest composite over the shared benchmark set)
Best for coding: Claude Opus 4.5 (SWE-bench Verified)
Best for math: Claude Opus 4.5 (AIME)
Best for reasoning: Claude Opus 4.5 (GPQA Diamond)
Best for vision: GPT-5 (MMMU)

Composite ranking (#1 leads #2 by 1.2 pts)

1. Claude Opus 4.5: 86.5%
2. GPT-5: 85.3%
3. Gemini 2.5 Pro: 80.8%

Full score table

Model	Composite	MMLU-Pro	GPQA	SWE-bench	HumanEval	AIME	MATH	MMMU
Claude Opus 4.5 (Anthropic)	86.5	88.0	87.0	80.9	—	96.0	—	80.7
GPT-5 (OpenAI)	85.3	87.0	85.7	74.9	—	94.6	—	84.2
Gemini 2.5 Pro (Google)	80.8	86.2	84.0	63.8	—	88.0	90.0	82.0

Composite = unweighted mean over the 5 benchmarks every selected model reports on-setting (MMLU-Pro, GPQA, SWE-bench, AIME, MMMU). A “†” marks an off-setting figure: shown for transparency, excluded from the composite.

Every figure is a vendor-published headline number, transcribed verbatim with its reported setting — tap any score to open the model card it came from. “—” means the vendor never published that benchmark (never counted as zero). Scores last reconciled against the source cards on 2026-06-13.

How it works

This page is, first and foremost, a lookup table of figures the model vendors have published themselves. No score is invented, smoothed, or re-measured by us — each one is transcribed verbatim from an OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek or xAI model card, stored with the exact setting it was reported under, and linked back to its source. Two small, deterministic computations sit on top of that table.

The composite. For the models you currently have selected, the tool finds the common set: the benchmarks for which every selected model has a figure on that benchmark's canonical setting. The composite for each model is the plain unweighted mean of its scores over that common set: composite = (1 / |B|) · Σ score(model, b) for every benchmark b in the common set B. Because the common set is recomputed every time you change the selection, the comparison stays apples-to-apples: if you add a model that never published HumanEval, HumanEval drops out of the composite for everyone, so the missing data penalises no one. A benchmark with a dash for any selected model is excluded, because a dash means “not reported,” never a zero.
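The composite rule above can be sketched in a few lines. This is a minimal illustration, assuming scores are held as a model-to-benchmark mapping with `None` for a dash; the names `common_set`, `composite` and `selected` are hypothetical and not taken from the tool. The figures are the three rows of the score table.

```python
from typing import Dict, Optional, Set

Scores = Dict[str, Dict[str, Optional[float]]]  # model -> benchmark -> score


def common_set(scores: Scores) -> Set[str]:
    """Benchmarks for which every selected model has a reported score."""
    benchmarks = set().union(*(row.keys() for row in scores.values()))
    return {
        b for b in benchmarks
        if all(row.get(b) is not None for row in scores.values())
    }


def composite(scores: Scores) -> Dict[str, float]:
    """Unweighted mean of each model's scores over the common set."""
    shared = common_set(scores)
    if not shared:
        raise ValueError("no benchmark is reported by every selected model")
    return {
        model: sum(row[b] for b in shared) / len(shared)
        for model, row in scores.items()
    }


selected: Scores = {
    "Claude Opus 4.5": {"MMLU-Pro": 88.0, "GPQA": 87.0, "SWE-bench": 80.9,
                        "HumanEval": None, "AIME": 96.0, "MATH": None, "MMMU": 80.7},
    "GPT-5": {"MMLU-Pro": 87.0, "GPQA": 85.7, "SWE-bench": 74.9,
              "HumanEval": None, "AIME": 94.6, "MATH": None, "MMMU": 84.2},
    "Gemini 2.5 Pro": {"MMLU-Pro": 86.2, "GPQA": 84.0, "SWE-bench": 63.8,
                       "HumanEval": None, "AIME": 88.0, "MATH": 90.0, "MMMU": 82.0},
}

for model, value in composite(selected).items():
    print(f"{model}: {value:.1f}")
# Claude Opus 4.5: 86.5
# GPT-5: 85.3
# Gemini 2.5 Pro: 80.8
```

MATH is reported only by Gemini 2.5 Pro in this selection, so it falls out of the common set along with HumanEval, leaving five benchmarks in the mean.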

The lead delta. On whichever benchmark you sort by, the tool reports the gap between first and second place in percentage points: lead = score(rank 1) − score(rank 2). A two-point lead on SWE-bench Verified amounts to about 10 of its 500 tasks, a smaller real-world difference than the headline ranking may suggest, and the delta makes that explicit.
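The lead delta is simple enough to sketch directly. The helper below is a hypothetical illustration rather than the tool's implementation; it ranks one benchmark column and returns the gap between the top two, using the SWE-bench figures from the score table.

```python
from typing import Dict, Tuple


def lead_delta(column: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (leader, runner_up, gap in percentage points) for one benchmark."""
    if len(column) < 2:
        raise ValueError("need at least two models to compute a lead")
    ranked = sorted(column.items(), key=lambda item: item[1], reverse=True)
    (first, top), (second, next_best) = ranked[0], ranked[1]
    # Round to one decimal to match the precision of the published scores.
    return first, second, round(top - next_best, 1)


swe_bench = {"Claude Opus 4.5": 80.9, "GPT-5": 74.9, "Gemini 2.5 Pro": 63.8}
print(lead_delta(swe_bench))  # ('Claude Opus 4.5', 'GPT-5', 6.0)
```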

Settings matter. Vendors sometimes report the same benchmark on different settings: pass@1 with no tools versus a score boosted by test-time tools or extra compute. Each benchmark here has one canonical setting (for example, SWE-bench Verified is pass@1 with no internet access). Any figure reported on a different setting is shown in the table for transparency but marked with a “†” and excluded from the composite, so a tools-on number is never silently compared against a tools-off one. The benchmarks themselves are defined by their original authors or organisers: MMLU-Pro (TIGER-Lab), GPQA (Rein et al.), SWE-bench Verified (the SWE-bench team and OpenAI), HumanEval (Chen et al.), AIME (the Mathematical Association of America), MATH (Hendrycks et al.) and MMMU (Yue et al.), all linked under Sources below. For pricing, context windows and modality flags rather than capability scores, see the companion AI Model Comparison tool.
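The off-setting rule described above could be modelled roughly as follows. Everything here is an assumption made for illustration: the `Figure` record, the setting labels in `CANONICAL`, and the two sample models are hypothetical, since the page does not show how the tool stores its settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Figure:
    model: str
    benchmark: str
    score: float
    setting: str  # the setting the vendor reported the score under


# Hypothetical canonical setting labels, one per benchmark.
CANONICAL = {"SWE-bench Verified": "pass@1, no tools"}


def is_on_setting(fig: Figure) -> bool:
    """True when the figure was reported on the benchmark's canonical setting."""
    return CANONICAL.get(fig.benchmark) == fig.setting


def display(fig: Figure) -> str:
    """Render a cell, marking off-setting figures with a dagger."""
    mark = "" if is_on_setting(fig) else "†"
    return f"{fig.score:.1f}{mark}"


def composite_inputs(figs: List[Figure]) -> List[Figure]:
    """Keep only the figures that may count towards a composite."""
    return [f for f in figs if is_on_setting(f)]


figures = [
    Figure("Model X", "SWE-bench Verified", 72.0, "pass@1, no tools"),
    Figure("Model Y", "SWE-bench Verified", 78.0, "with test-time tools"),
]
print([display(f) for f in figures])                 # ['72.0', '78.0†']
print([f.model for f in composite_inputs(figures)])  # ['Model X']
```

Separating display from composite eligibility keeps the off-setting number visible while guaranteeing it never enters an average.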

Worked examples

Composite over the common benchmark set

Select three models that all report {MMLU-Pro, GPQA, SWE-bench, HumanEval}.
Model A scores 88.0 / 84.0 / 72.0 / 96.0.
Composite(A) = (88.0 + 84.0 + 72.0 + 96.0) / 4 = 340.0 / 4 = 85.0
Model B at 86.0 / 82.0 / 70.0 / 94.0 → 332.0 / 4 = 83.0
Model C at 85.0 / 80.0 / 65.0 / 92.0 → 322.0 / 4 = 80.5
Podium: A (85.0), B (83.0), C (80.5). SWE-bench lead = 72.0 − 70.0 = 2.0 pts.

The common set shrinks when a model lacks a benchmark

Add Model D, which never published HumanEval.
Common set across A/B/C/D becomes {MMLU-Pro, GPQA, SWE-bench} — HumanEval drops.
A = (88.0 + 84.0 + 72.0) / 3 = 244.0 / 3 = 81.33
D = (87.0 + 83.0 + 71.0) / 3 = 241.0 / 3 = 80.33
B = 238.0 / 3 = 79.33, C = 230.0 / 3 = 76.67
New podium: A, D, B. D's missing HumanEval no longer penalises anyone.
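The two worked examples above can be checked with a short script. This is a sketch under the assumption that Model D's three figures map to MMLU-Pro, GPQA and SWE-bench in that order; `worked` and `podium` are hypothetical names.

```python
from typing import Dict, List, Tuple

worked: Dict[str, Dict[str, float]] = {
    "A": {"MMLU-Pro": 88.0, "GPQA": 84.0, "SWE-bench": 72.0, "HumanEval": 96.0},
    "B": {"MMLU-Pro": 86.0, "GPQA": 82.0, "SWE-bench": 70.0, "HumanEval": 94.0},
    "C": {"MMLU-Pro": 85.0, "GPQA": 80.0, "SWE-bench": 65.0, "HumanEval": 92.0},
    "D": {"MMLU-Pro": 87.0, "GPQA": 83.0, "SWE-bench": 71.0},  # no HumanEval
}


def podium(selection: List[str]) -> List[Tuple[str, float]]:
    """Rank the selected models by mean score over their shared benchmarks."""
    shared = set.intersection(*(set(worked[m]) for m in selection))
    means = {m: sum(worked[m][b] for b in shared) / len(shared) for m in selection}
    ranked = sorted(means.items(), key=lambda pair: pair[1], reverse=True)
    return [(m, round(v, 2)) for m, v in ranked]


print(podium(["A", "B", "C"]))
# [('A', 85.0), ('B', 83.0), ('C', 80.5)]
print(podium(["A", "B", "C", "D"]))
# [('A', 81.33), ('D', 80.33), ('B', 79.33), ('C', 76.67)]
```

Adding D shrinks the shared set to three benchmarks, so every composite is recomputed rather than D being scored as zero on HumanEval.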

A real selection — GPT-5 vs Claude Opus 4.5 vs Gemini 2.5 Pro

All three report MMLU-Pro, GPQA, SWE-bench, AIME and MMMU on-setting.
Common set = those 5 benchmarks (none has a dash across the three).
Sort by SWE-bench Verified: Claude Opus 4.5 (80.9) leads GPT-5 (74.9).
Lead delta = 80.9 − 74.9 = 6.0 pts — a clear coding edge.
Sort by AIME: Claude Opus 4.5 (96.0) edges GPT-5 (94.6) and Gemini 2.5 Pro (88.0).

Sources & references

Each score in the table links to the exact vendor model card it was transcribed from. The benchmark definitions below are the canonical papers and pages describing what each test measures.

Vendor score cards: OpenAI (openai.com/index), Anthropic (anthropic.com/news), Google DeepMind (deepmind.google/models/gemini), Meta (ai.meta.com/blog), DeepSeek (api-docs.deepseek.com/news) and xAI (x.ai/news). All figures were last reconciled against these sources on 2026-06-13.
