AI Model Comparison — GPT, Claude, Gemini, Llama side by side
18 of the most-used LLMs in one table — context window, input and output pricing, vision, audio, function calling, training cutoff. Pick three to compare side by side and project the monthly cost at your workload. Every figure cites the vendor source.
How it works
Picking an AI model is a multi-axis decision: input price, output price, context window, output cap, modalities, training cutoff, and whether the weights are downloadable. Most comparison pages on the web pick two of those axes and call it done. This page lays out all of them at once for the 18 models that almost every Sri Lankan developer, student, or startup ends up considering — drawn from 7 vendors.
1. The pricing formula
Every commercial LLM API charges separately for input and output tokens. The cost of one call is:
usd_per_call = (input_tokens ÷ 1,000,000) × input_$/M + (output_tokens ÷ 1,000,000) × output_$/M
Monthly cost is then the per-call cost multiplied by the number of calls you make per month. The cost projection in the tool above uses this formula directly, with no caching credits, batch discounts, or enterprise rates. The result reflects the published list price, which is what you actually pay on most plans.
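The formula above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual code: the `ModelPricing` type and both function names are assumptions made for this example.

```typescript
// Hypothetical shape for one model's list prices, in USD per million tokens.
interface ModelPricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

const TOKENS_PER_MILLION = 1_000_000;

// Cost of one API call at list price (no caching credits or batch discounts).
function usdPerCall(
  pricing: ModelPricing,
  inputTokens: number,
  outputTokens: number,
): number {
  return (
    (inputTokens / TOKENS_PER_MILLION) * pricing.inputUsdPerMillion +
    (outputTokens / TOKENS_PER_MILLION) * pricing.outputUsdPerMillion
  );
}

// Monthly cost: the per-call cost times the number of calls per month.
function usdPerMonth(
  pricing: ModelPricing,
  inputTokens: number,
  outputTokens: number,
  callsPerMonth: number,
): number {
  return usdPerCall(pricing, inputTokens, outputTokens) * callsPerMonth;
}
```

With placeholder prices of $2 per million input tokens and $8 per million output tokens (illustrative values, not any vendor's rates), a call with 1,500 input tokens and 500 output tokens comes to about $0.007, and 10,000 such calls to about $70 per month.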
2. Reasoning models bill hidden tokens
Models with extended thinking (OpenAI o1, o3-mini, Claude with extended thinking, DeepSeek R1) emit a chain of thought before the visible answer, and the vendor bills all of those tokens as output. A 200-word visible reply can consume 2,000–5,000 output tokens on a hard problem. When you compare a reasoning model to a chat model on this page, multiply the reasoning model's expected output token count by 3–10 before drawing conclusions.
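One way to apply that adjustment is to turn the visible output estimate into a billed range before running the cost formula. The sketch below assumes the 3–10 multiplier from the paragraph above is a planning estimate chosen by the reader; the function names are hypothetical and no vendor publishes such a multiplier.

```typescript
// Assumption: the multiplier is a rough planning estimate, not a vendor figure.
function billedOutputTokens(
  visibleOutputTokens: number,
  reasoningMultiplier: number,
): number {
  if (reasoningMultiplier < 1) {
    throw new RangeError("reasoningMultiplier must be at least 1");
  }
  // Round up so the estimate never undercounts a partial token.
  return Math.ceil(visibleOutputTokens * reasoningMultiplier);
}

// Low and high estimates using the 3x and 10x ends of the range.
function billedOutputRange(visibleOutputTokens: number): { low: number; high: number } {
  return {
    low: billedOutputTokens(visibleOutputTokens, 3),
    high: billedOutputTokens(visibleOutputTokens, 10),
  };
}
```

For a visible reply of 500 tokens, this gives a billed range of 1,500 to 5,000 output tokens, which is the number to feed into the pricing formula for a reasoning model.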
3. Context window vs output cap
Context window is the size of the prompt the model can read. Output max is the size of the reply the model can write. These are different limits. Gemini 2.5 Pro reads 1M tokens but only emits ~65K. Llama 4 Maverick advertises 1M context with an 8K output cap. For long-form generation (article drafts, code refactors), the output cap is the one that bites first.
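The two limits can be checked separately before sending a request. The following is a minimal sketch with hypothetical type and function names. It follows the definition above and checks the prompt against the context window on its own; some APIs count prompt plus reply against the window, which this sketch does not model.

```typescript
// Hypothetical per-model limits, both measured in tokens.
interface ModelLimits {
  contextWindowTokens: number;
  maxOutputTokens: number;
}

type FitResult = "ok" | "prompt-too-long" | "output-cap-exceeded";

// Report which limit, if any, a planned request would hit.
function checkFit(
  limits: ModelLimits,
  promptTokens: number,
  desiredOutputTokens: number,
): FitResult {
  // The output cap is checked first because it is usually the tighter limit.
  if (desiredOutputTokens > limits.maxOutputTokens) {
    return "output-cap-exceeded";
  }
  if (promptTokens > limits.contextWindowTokens) {
    return "prompt-too-long";
  }
  return "ok";
}
```

Taking the rounded Llama 4 Maverick figures from the paragraph above (1,000,000 context, 8,000 output), a 20,000-token prompt that asks for a 12,000-token draft returns "output-cap-exceeded" even though the prompt uses only a small fraction of the context window.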
4. Capability flags
Vision, audio, function calling, reasoning, and open weights are independent dimensions. A model can support any combination. The comparison table marks each capability with a chip; struck-through chips mean the model lacks that capability. The vendor docs URL on each row is the authoritative source — capabilities sometimes ship in API tiers behind allowlists or paid plans.
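Because the flags are independent, a set of capability names is a natural way to model them. The sketch below is an assumption about how such data could be represented, not the page's actual data module; `ModelRow`, the capability names, and the helper functions are all hypothetical.

```typescript
// The five independent capability flags named above.
type Capability =
  | "vision"
  | "audio"
  | "functionCalling"
  | "reasoning"
  | "openWeights";

// Hypothetical table row: an id plus whichever capabilities the model has.
interface ModelRow {
  id: string;
  capabilities: ReadonlySet<Capability>;
}

// True only if the model has every required capability.
function supportsAll(model: ModelRow, required: readonly Capability[]): boolean {
  return required.every((cap) => model.capabilities.has(cap));
}

// Keep the rows that satisfy all required capabilities.
function filterByCapabilities(
  models: readonly ModelRow[],
  required: readonly Capability[],
): ModelRow[] {
  return models.filter((model) => supportsAll(model, required));
}
```

An empty `required` list matches every row, so the filter degrades gracefully when no capability is selected.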
5. Cross-check
The data module exports a deterministic verifyWorkedExamples() function that recomputes seven hand-derived test cases — including zero-input edges, the 1M-token boundary, and a 10⁹-token large input — to assert the cost math matches the file-header arithmetic to within a millionth of a cent. A second integrity check asserts unique ids, non-negative prices, positive context windows, and non-empty positioning notes. Both run at typecheck time. If a row drifts during a quarterly update, the build fails.
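The two checks described above might look roughly like the sketch below. It is not the data module's real `verifyWorkedExamples()` implementation: the types, function names, and field names are assumptions for illustration, and a millionth of a cent is taken as 1e-8 USD.

```typescript
// Hypothetical row and worked-example shapes; prices are USD per million tokens.
interface PricedRow {
  id: string;
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
  contextWindowTokens: number;
  positioningNote: string;
}

interface WorkedExample {
  inputTokens: number;
  outputTokens: number;
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
  expectedUsd: number; // hand-derived value to compare against
}

const TOLERANCE_USD = 1e-8; // one millionth of a cent

function callCostUsd(e: WorkedExample): number {
  return (
    (e.inputTokens / 1_000_000) * e.inputUsdPerMillion +
    (e.outputTokens / 1_000_000) * e.outputUsdPerMillion
  );
}

// Return the examples whose recomputed cost drifts beyond the tolerance.
function failedExamples(examples: readonly WorkedExample[]): WorkedExample[] {
  return examples.filter(
    (e) => Math.abs(callCostUsd(e) - e.expectedUsd) > TOLERANCE_USD,
  );
}

// Collect integrity problems: duplicate ids, negative prices,
// non-positive context windows, and empty positioning notes.
function integrityErrors(rows: readonly PricedRow[]): string[] {
  const errors: string[] = [];
  const seenIds = new Set<string>();
  for (const row of rows) {
    if (seenIds.has(row.id)) errors.push(`duplicate id: ${row.id}`);
    seenIds.add(row.id);
    if (row.inputUsdPerMillion < 0 || row.outputUsdPerMillion < 0) {
      errors.push(`negative price: ${row.id}`);
    }
    if (!(row.contextWindowTokens > 0)) {
      errors.push(`non-positive context window: ${row.id}`);
    }
    if (row.positioningNote.trim() === "") {
      errors.push(`empty positioning note: ${row.id}`);
    }
  }
  return errors;
}
```

A build step could fail whenever either function returns a non-empty array, which gives the same effect as the drift check described above.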
Sources & references
- OpenAI — API pricing
- OpenAI — Models documentation (context windows, capabilities)
- Anthropic — Claude API pricing
- Anthropic — Claude model cards
- Google — Gemini API pricing
- Google — Gemini model documentation
- Together AI — Llama 4 hosted pricing
- Meta — Llama model cards
- DeepSeek — API pricing
- xAI — Grok models and pricing
- Mistral AI — La Plateforme pricing
The dataset and its cited sources were last cross-checked on 2026-05-12. Pricing and capability flags are reviewed quarterly and whenever a vendor publishes a substantive pricing or model update.
Spot a stale price, a missing model, or a misclaimed capability?
Email me at [email protected] — most fixes ship within 24 hours.