AI LLM Benchmark Comparison
Compare 7 leading language models across 7 published benchmarks — MMLU-Pro, GPQA Diamond, SWE-bench Verified, HumanEval, AIME, MATH and MMMU. Pick 2–6 models, sort by any score, and get an apples-to-apples composite. Every figure is cited from the vendor's own model card.
How it works
This page is, first and foremost, a lookup table of figures the model vendors have published themselves. No score is invented, smoothed, or re-measured by us — each one is transcribed verbatim from an OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek or xAI model card, stored with the exact setting it was reported under, and linked back to its source. Two small, deterministic computations sit on top of that table.
The composite. For the models you currently have selected, the tool finds the common set: the benchmarks for which every selected model has a figure on that benchmark's canonical setting. The composite for each model is the plain unweighted mean of its scores over that common set: composite = (1 / |B|) · Σ score(model, b), summed over every benchmark b in the common set B. Because the common set is recomputed every time you change the selection, the comparison stays apples-to-apples: if you add a model that never published HumanEval, HumanEval drops out of the composite for everyone, so the missing data penalises no one. A benchmark with a dash for any selected model is excluded, because a dash means “not reported,” never a zero.
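The composite rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual code: the `SCORES` layout, the function names, the model names and every number in it are placeholders invented for the example, and returning an empty result when nothing is shared is an assumption the text does not state.

```python
from statistics import mean

# Placeholder scores, made up for illustration (NOT real benchmark results).
# None stands for a dash in the table: "not reported", never a zero.
# A figure reported on a non-canonical setting would also be stored as None here.
SCORES = {
    "model_a": {"MMLU-Pro": 80.0, "HumanEval": 90.0, "AIME": 70.0},
    "model_b": {"MMLU-Pro": 78.0, "HumanEval": None, "AIME": 74.0},
    "model_c": {"MMLU-Pro": 75.0, "HumanEval": 88.0, "AIME": 60.0},
}


def common_set(scores, selected):
    """Benchmarks for which every selected model has a reported score."""
    benchmarks = scores[selected[0]].keys()
    return [
        b for b in benchmarks
        if all(scores[m].get(b) is not None for m in selected)
    ]


def composites(scores, selected):
    """Unweighted mean of each selected model's scores over the common set."""
    shared = common_set(scores, selected)
    if not shared:
        return {}  # assumption: no composite when no benchmark is shared
    return {m: mean(scores[m][b] for b in shared) for m in selected}


print(composites(SCORES, ["model_a", "model_c"]))
print(composites(SCORES, ["model_a", "model_b", "model_c"]))
```

With only `model_a` and `model_c` selected, all three benchmarks count. Adding `model_b`, which has no HumanEval figure, removes HumanEval from every model's composite, so the average is taken over MMLU-Pro and AIME for all three.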
The lead delta. On whichever benchmark you sort by, the tool reports the gap between first and second place in percentage points: lead = score(rank 1) − score(rank 2). A two-point lead on SWE-bench Verified is a much smaller real-world difference than the headline ranking suggests, and the delta makes that explicit.
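As a rough sketch of the lead delta, assuming the scores for the sorted benchmark arrive as a hypothetical model-to-score mapping with `None` for unreported figures (the function name and the numbers are placeholders, not the tool's own):

```python
def lead_delta(scores_by_model):
    """Gap in percentage points between rank 1 and rank 2 on one benchmark.

    Returns None when fewer than two models have a reported score
    (an assumption; the page does not say what is shown in that case).
    """
    reported = sorted(
        (s for s in scores_by_model.values() if s is not None),
        reverse=True,
    )
    if len(reported) < 2:
        return None
    return reported[0] - reported[1]


# Placeholder figures, made up for illustration (NOT real benchmark results).
print(lead_delta({"model_a": 72.0, "model_b": 70.0, "model_c": None}))  # 2.0
```

Sorting only the reported scores keeps a dash from ever being ranked as a zero, which matches how unreported figures are treated elsewhere on the page.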
Settings matter. Vendors sometimes report the same benchmark on different settings, such as pass@1 with no tools versus a score boosted by test-time tools or extra compute. Each benchmark here has one canonical setting (for example, SWE-bench Verified is pass@1 with no internet access). Any figure reported on a different setting is shown in the table for transparency but marked with a “†” and excluded from the composite, so a tools-on number is never silently compared against a tools-off one. The benchmarks themselves are defined by their original sources: MMLU-Pro (TIGER-Lab), GPQA (Rein et al.), SWE-bench Verified (the SWE-bench team and OpenAI), HumanEval (Chen et al.), AIME (the Mathematical Association of America), MATH (Hendrycks et al.) and MMMU (Yue et al.), all linked under Sources below. For pricing, context windows and modality flags rather than capability scores, see the companion AI Model Comparison tool.
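One way to picture the “†” rule is a small sketch in which every stored figure carries the setting it was reported under. The `Entry` type, the setting label strings and the function names are assumptions made for this example, and the numbers are placeholders rather than real scores.

```python
from dataclasses import dataclass
from typing import Optional

# Canonical setting per benchmark. The SWE-bench Verified setting comes from
# the text above; the exact label string is a hypothetical choice.
CANONICAL = {"SWE-bench Verified": "pass@1, no internet access"}


@dataclass
class Entry:
    score: Optional[float]         # None means "not reported" (a dash)
    setting: Optional[str] = None  # setting the vendor reported the score under


def counts_toward_composite(benchmark, entry):
    """Only figures reported on the canonical setting enter the composite."""
    return entry.score is not None and entry.setting == CANONICAL[benchmark]


def render_cell(benchmark, entry):
    """Table text: a dash if unreported, a dagger if off the canonical setting."""
    if entry.score is None:
        return "-"
    mark = "" if counts_toward_composite(benchmark, entry) else "†"
    return f"{entry.score:.1f}{mark}"


# Placeholder figures, made up for illustration (NOT real benchmark results).
bench = "SWE-bench Verified"
print(render_cell(bench, Entry(70.0, "pass@1, no internet access")))  # 70.0
print(render_cell(bench, Entry(75.0, "with test-time tools")))        # 75.0†
print(render_cell(bench, Entry(None)))                                # -
```

Keeping the display decision and the composite decision in one predicate means a daggered figure can never be shown as canonical in the table while also being averaged in.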
Sources & references
Each score in the table links to the exact vendor model card it was transcribed from. The benchmark definitions below are the canonical papers and pages describing what each test measures.
- MMLU-Pro — Graduate-level multitask knowledge across 14 subjects — a harder, 10-option rebuild of MMLU that resists guessing.
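- GPQA Diamond — Graduate-level biology, physics and chemistry questions written by PhD-level domain experts. The full GPQA set has 448 questions, which experts answer correctly about 65% of the time; the Diamond subset is its 198-question, most heavily vetted slice.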
- SWE-bench Verified — Real GitHub issues from popular Python repos; the model must produce a patch that makes the project's own test suite pass. Verified = a 500-task human-validated subset.
- HumanEval — 164 hand-written Python programming problems scored by whether generated code passes hidden unit tests. The classic code-generation benchmark.
- AIME — American Invitational Mathematics Examination — competition problems with integer answers, well above typical school maths. A standard hard-math benchmark.
- MATH — 12,500 competition mathematics problems with step-by-step solutions across seven topics; the MATH-500 subset is the commonly reported slice.
- MMMU — Massive Multi-discipline Multimodal Understanding — 11.5K college-level questions that pair text with charts, diagrams and images. The headline vision-reasoning test.
- SWE-bench — official leaderboard and task definition
Vendor score cards: OpenAI (openai.com/index), Anthropic (anthropic.com/news), Google DeepMind (deepmind.google/models/gemini), Meta (ai.meta.com/blog), DeepSeek (api-docs.deepseek.com/news) and xAI (x.ai/news). All figures were last reconciled against these sources on 2026-06-13.
Spotted a score that's out of date or a new model worth adding?
Email me at [email protected] — most updates ship within 24 hours.