AI LLM Benchmark Comparison

Compare 7 leading language models across 7 published benchmarks — MMLU-Pro, GPQA Diamond, SWE-bench Verified, HumanEval, AIME, MATH and MMMU. Pick 2–6 models, sort by any score, and get an apples-to-apples composite. Every figure is cited from the vendor's own model card.

By Induwara Ashinsana · Updated Jun 13, 2026
Compare LLM benchmarks · 7 models · 7 benchmarks
Vendor-cited · verified 2026-06-13

Best overall: Claude Opus 4.5 (highest composite over the shared benchmark set)
Best for coding: Claude Opus 4.5 (SWE-bench Verified)
Best for math: Claude Opus 4.5 (AIME)
Best for reasoning: Claude Opus 4.5 (GPQA Diamond)
Best for vision: GPT-5 (MMMU)

Composite ranking

#1 leads #2 by 1.2 pts

  1. Claude Opus 4.5: 86.5%
  2. GPT-5: 85.3%
  3. Gemini 2.5 Pro: 80.8%

Full score table

Model (vendor)               Composite  MMLU-Pro  GPQA  SWE-bench  HumanEval  AIME  MATH  MMMU
Claude Opus 4.5 (Anthropic)  86.5       88.0      87.0  80.9       —          96.0  —     80.7
GPT-5 (OpenAI)               85.3       87.0      85.7  74.9       —          94.6  —     84.2
Gemini 2.5 Pro (Google)      80.8       86.2      84.0  63.8       —          88.0  90.0  82.0

Composite = unweighted mean over the 5 benchmarks every selected model reports on-setting (MMLU-Pro, GPQA, SWE-bench, AIME, MMMU). A “†” marks an off-setting figure: shown for transparency, excluded from the composite.

Every figure is a vendor-published headline number, transcribed verbatim with its reported setting — tap any score to open the model card it came from. “—” means the vendor never published that benchmark (never counted as zero). Scores last reconciled against the source cards on 2026-06-13.
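The table notation described above can be sketched in a few lines of Python. This is only an illustration: `format_cell` and `format_row` are hypothetical helper names, and the sample values are made up rather than taken from any model card.

```python
from typing import Optional

def format_cell(value: Optional[float], on_setting: bool = True) -> str:
    """Render one table cell: a dash if unreported, a dagger if off-setting."""
    if value is None:
        return "—"  # not reported; never rendered or counted as 0.0
    text = f"{value:.1f}"
    return text if on_setting else text + "†"

def format_row(model: str, cells: list[tuple[Optional[float], bool]]) -> str:
    """Join a model name and its (value, on_setting) cells into one text row."""
    return "  ".join([model] + [format_cell(v, ok) for v, ok in cells])

# Illustrative values only.
print(format_row("Model X", [(88.0, True), (None, True), (72.5, False)]))
```

Keeping an unreported score as `None` rather than `0.0` is what lets later code tell "no figure" apart from "scored zero".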

How it works

This page is, first and foremost, a lookup table of figures the model vendors have published themselves. No score is invented, smoothed, or re-measured by us — each one is transcribed verbatim from an OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek or xAI model card, stored with the exact setting it was reported under, and linked back to its source. Two small, deterministic computations sit on top of that table.

The composite. For the models you currently have selected, the tool finds the common set: the benchmarks for which every selected model has a figure on that benchmark's canonical setting. The composite for each model is the plain unweighted mean of its scores over that common set: composite(model) = (1 / |B|) · Σ score(model, b), summed over every benchmark b in the common set B. Because the common set is recomputed every time you change the selection, the comparison stays apples-to-apples: if you add a model that never published HumanEval, HumanEval drops out of the composite for everyone, so the missing data penalises no one. A benchmark with a dash for any selected model is excluded, because a dash means “not reported,” never a zero.
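A minimal Python sketch of the common-set and composite computation described above. It is not the tool's actual code: the table layout, the `Scores` alias and the function names are assumptions, and the demo figures are illustrative only.

```python
from typing import Optional

# Hypothetical in-memory score table; None stands for a dash ("not reported").
Scores = dict[str, dict[str, Optional[float]]]

def common_set(scores: Scores, selected: list[str]) -> list[str]:
    """Benchmarks on which every selected model has a figure."""
    if not selected:
        return []
    return [
        b for b in scores[selected[0]]
        if all(scores[m].get(b) is not None for m in selected)
    ]

def composite(scores: Scores, selected: list[str]) -> dict[str, float]:
    """Unweighted mean of each model's scores over the common set."""
    shared = common_set(scores, selected)
    if not shared:
        return {}  # no shared benchmark, so no composite
    return {m: sum(scores[m][b] for b in shared) / len(shared) for m in selected}

# Illustrative figures only, not taken from any model card.
demo = {
    "X": {"MMLU-Pro": 80.0, "HumanEval": 90.0},
    "Y": {"MMLU-Pro": 70.0, "HumanEval": None},
}
print(common_set(demo, ["X", "Y"]))
print(composite(demo, ["X", "Y"]))
```

With both models selected, HumanEval falls out of the common set because Y has no figure for it, so X is averaged over MMLU-Pro alone.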

The lead delta. On whichever benchmark you sort by, the tool reports the gap between first and second place in percentage points: lead = score(rank 1) − score(rank 2). A two-point lead on SWE-bench Verified is a much smaller real-world difference than the headline ranking suggests, and the delta makes that explicit.
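The lead delta is a single subtraction once the column is sorted. A hedged sketch, assuming the sort column arrives as a plain model-to-score mapping (`lead_delta` is a hypothetical name); the sample scores are the SWE-bench figures for Models A, B and C from the first worked example.

```python
def lead_delta(column: dict[str, float]) -> tuple[str, str, float]:
    """Return (leader, runner-up, gap in percentage points) for one benchmark."""
    if len(column) < 2:
        raise ValueError("need at least two models with a score")
    ranked = sorted(column.items(), key=lambda item: item[1], reverse=True)
    (first, top), (second, runner_up) = ranked[0], ranked[1]
    # Round to one decimal place to hide floating-point noise in the difference.
    return first, second, round(top - runner_up, 1)

print(lead_delta({"A": 72.0, "B": 70.0, "C": 65.0}))
```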

Settings matter. Vendors sometimes report the same benchmark on different settings: pass@1 with no tools versus a score boosted by test-time tools or extra compute. Each benchmark here has one canonical setting (for example, SWE-bench Verified is pass@1 with no internet access). Any figure reported on a different setting is shown in the table for transparency but marked with a “†” and excluded from the composite, so a tools-on number is never silently compared against a tools-off one.

The benchmarks themselves are defined by their original papers and maintainers: MMLU-Pro (TIGER-Lab), GPQA (Rein et al.), SWE-bench Verified (the SWE-bench team and OpenAI), HumanEval (Chen et al.), AIME (the Mathematical Association of America), MATH (Hendrycks et al.) and MMMU (Yue et al.), all linked under Sources below. For pricing, context windows and modality flags rather than capability scores, see the companion AI Model Comparison tool.
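One way to keep off-setting figures out of the composite is to store the setting next to every value and filter on it. The sketch below is an assumption about how that could look, not the tool's real data model: `Figure`, `CANONICAL` and `composite_inputs` are hypothetical names, the setting strings are invented labels, and only the SWE-bench Verified canonical setting is stated on this page.

```python
from dataclasses import dataclass

# Assumed lookup of canonical settings; only this entry is stated on the page.
CANONICAL = {"SWE-bench Verified": "pass@1, no internet access"}

@dataclass(frozen=True)
class Figure:
    benchmark: str
    value: float
    setting: str  # the setting the vendor reported the figure under

    @property
    def on_setting(self) -> bool:
        """True when the reported setting matches the canonical one."""
        return CANONICAL.get(self.benchmark) == self.setting

def composite_inputs(figures: list[Figure]) -> list[Figure]:
    """Keep only on-setting figures; off-setting ones are display-only."""
    return [f for f in figures if f.on_setting]

# Illustrative values only.
plain = Figure("SWE-bench Verified", 70.0, "pass@1, no internet access")
boosted = Figure("SWE-bench Verified", 78.0, "pass@1, test-time tools")
print([f.value for f in composite_inputs([plain, boosted])])
```

The boosted figure stays available for display with a “†” but never reaches the mean.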

Worked examples

Composite over the common benchmark set

  1. Select three models that all report {MMLU-Pro, GPQA, SWE-bench, HumanEval}.
  2. Model A scores 88.0 / 84.0 / 72.0 / 96.0.
  3. Composite(A) = (88.0 + 84.0 + 72.0 + 96.0) / 4 = 340.0 / 4 = 85.0
  4. Model B at 86.0 / 82.0 / 70.0 / 94.0 → 332.0 / 4 = 83.0
  5. Model C at 85.0 / 80.0 / 65.0 / 92.0 → 322.0 / 4 = 80.5
  6. Podium: A (85.0), B (83.0), C (80.5). SWE-bench lead = 72.0 − 70.0 = 2.0 pts.

The common set shrinks when a model lacks a benchmark

  1. Add Model D, which never published HumanEval.
  2. Common set across A/B/C/D becomes {MMLU-Pro, GPQA, SWE-bench} — HumanEval drops.
  3. A = (88.0 + 84.0 + 72.0) / 3 = 244.0 / 3 = 81.33
  4. D = (87.0 + 83.0 + 71.0) / 3 = 241.0 / 3 = 80.33
  5. B = 238.0 / 3 = 79.33, C = 230.0 / 3 = 76.67
  6. New podium: A, D, B. D's missing HumanEval no longer penalises anyone.
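The arithmetic in the two worked examples above can be reproduced with a short script. This is a sketch, not the tool's own code: here the common set is computed as a set intersection, and a missing benchmark is modelled by leaving the key out. The figures are copied from the examples.

```python
from statistics import mean

# Figures copied from the worked examples; Model D has no HumanEval entry.
scores = {
    "A": {"MMLU-Pro": 88.0, "GPQA": 84.0, "SWE-bench": 72.0, "HumanEval": 96.0},
    "B": {"MMLU-Pro": 86.0, "GPQA": 82.0, "SWE-bench": 70.0, "HumanEval": 94.0},
    "C": {"MMLU-Pro": 85.0, "GPQA": 80.0, "SWE-bench": 65.0, "HumanEval": 92.0},
    "D": {"MMLU-Pro": 87.0, "GPQA": 83.0, "SWE-bench": 71.0},
}

def podium(selected):
    """Return (sorted common set, models ranked by composite, highest first)."""
    # Common set = intersection of the benchmarks each selected model reports.
    shared = set.intersection(*(set(scores[m]) for m in selected))
    ranked = sorted(
        ((m, round(mean(scores[m][b] for b in shared), 2)) for m in selected),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return sorted(shared), ranked

print(podium(["A", "B", "C"]))
print(podium(["A", "B", "C", "D"]))
```

Adding D removes HumanEval from the shared set, and every composite is recomputed over the remaining three benchmarks.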

A real selection — GPT-5 vs Claude Opus 4.5 vs Gemini 2.5 Pro

  1. All three report MMLU-Pro, GPQA, SWE-bench, AIME and MMMU on-setting.
  2. Common set = those 5 benchmarks (none has a dash across the three).
  3. Sort by SWE-bench Verified: Claude Opus 4.5 (80.9) leads GPT-5 (74.9).
  4. Lead delta = 80.9 − 74.9 = 6.0 pts — a clear coding edge.
  5. Sort by AIME: Claude Opus 4.5 (96.0) edges GPT-5 (94.6) and Gemini 2.5 Pro (88.0).
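The same selection can be checked in code. The per-benchmark values below are read off the full score table for the five common benchmarks. The `table` layout and the helper names are assumptions made for illustration.

```python
# Figures as shown in the full score table (common set only).
table = {
    "Claude Opus 4.5": {"MMLU-Pro": 88.0, "GPQA": 87.0, "SWE-bench": 80.9, "AIME": 96.0, "MMMU": 80.7},
    "GPT-5":           {"MMLU-Pro": 87.0, "GPQA": 85.7, "SWE-bench": 74.9, "AIME": 94.6, "MMMU": 84.2},
    "Gemini 2.5 Pro":  {"MMLU-Pro": 86.2, "GPQA": 84.0, "SWE-bench": 63.8, "AIME": 88.0, "MMMU": 82.0},
}

def sort_by(benchmark):
    """Rank the selected models on one benchmark, highest first."""
    return sorted(table, key=lambda m: table[m][benchmark], reverse=True)

def composite(model):
    """Unweighted mean over the five common benchmarks, to one decimal place."""
    values = table[model].values()
    return round(sum(values) / len(values), 1)

for name in sort_by("SWE-bench"):
    print(name, table[name]["SWE-bench"], composite(name))
```

Sorting by MMMU instead puts GPT-5 first, which matches the "Best for vision" pick.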

Sources & references

Each score in the table links to the exact vendor model card it was transcribed from. The benchmark definitions below are the canonical papers and pages describing what each test measures.

Vendor score cards: OpenAI (openai.com/index), Anthropic (anthropic.com/news), Google DeepMind (deepmind.google/models/gemini), Meta (ai.meta.com/blog), DeepSeek (api-docs.deepseek.com/news) and xAI (x.ai/news). All figures were last reconciled against these sources on 2026-06-13.

