Which AI model hallucinates the least?

On the Vectara HHEM snapshot used here (2026-05-11), Finix S1 32B by Ant Group has the lowest measured hallucination rate at 1.8%, followed by GPT-5.4 nano at 3.1%. Rankings shift with each leaderboard refresh, so check the live table for the current order.

What is a hallucination rate for an LLM?

It is the share of answered summaries that contain at least one claim not supported by the source document. Vectara measures it by giving each model a fixed set of documents to summarise, then scoring every summary with its HHEM factual-consistency model. A 4% rate means roughly 4 of every 100 summaries invent or distort a fact.

How is AI hallucination measured?

Vectara runs each model over 7,700+ articles (50–24,000 words each), asking only for a faithful summary. The HHEM model then scores each summary for factual consistency against its source. The hallucination rate is computed over answered prompts only, which is why the answer rate is shown alongside — a model that skips hard documents can look artificially accurate.
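The answered-only denominator can be sketched in a few lines. This is a minimal illustration and not Vectara's pipeline: the `Outcome` labels and the two example models are hypothetical, and the sketch assumes every summary has already been judged consistent or not by the scoring model.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-document outcome labels (not Vectara's schema)."""
    REFUSED = "refused"            # model produced no summary
    CONSISTENT = "consistent"      # summary judged faithful to its source
    HALLUCINATED = "hallucinated"  # at least one unsupported claim


def rates(outcomes: list[Outcome]) -> tuple[float, float]:
    """Return (answer_rate, hallucination_rate) as percentages.

    The hallucination rate is computed over answered documents only,
    so refusals shrink the denominator instead of counting as errors.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    answered = [o for o in outcomes if o is not Outcome.REFUSED]
    answer_rate = 100 * len(answered) / len(outcomes)
    if not answered:
        raise ValueError("no answered documents to score")
    hallucinated = sum(o is Outcome.HALLUCINATED for o in answered)
    return answer_rate, 100 * hallucinated / len(answered)


# Hypothetical model X answers all 100 documents and gets 5 wrong.
model_x = [Outcome.HALLUCINATED] * 5 + [Outcome.CONSISTENT] * 95
# Hypothetical model Y refuses 40 documents and gets 2 of the other 60 wrong.
model_y = ([Outcome.REFUSED] * 40 + [Outcome.HALLUCINATED] * 2
           + [Outcome.CONSISTENT] * 58)

print(rates(model_x))  # (100.0, 5.0)
print(rates(model_y))  # answer rate 60.0, hallucination rate about 3.3
```

Model Y shows the lower hallucination rate only because 40 documents were never scored, which is the effect the answer-rate column is there to expose.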

Does a bigger AI model hallucinate less?

Not reliably. The leaderboard regularly shows small models out-scoring much larger ones on factual consistency, because faithfulness depends on training and decoding choices, not just parameter count. Use the ranked table here rather than assuming the largest or newest model is the most accurate.

Why do the rates here differ from benchmark scores like MMLU?

MMLU, GPQA and SWE-bench measure reasoning, knowledge and coding skill. Hallucination rate measures something different: factual faithfulness when summarising a document you provide. A model can ace reasoning benchmarks and still invent details in a summary, so the two should be read together — not interchangeably.

Are these numbers live or a snapshot?

They are a dated static snapshot taken from the Vectara HHEM leaderboard on 2026-05-11 (scoring model HHEM-2.3), last re-verified on 2026-06-21. The tool makes no network call, so it stays fast and deterministic. For the absolute latest figures, open the leaderboard linked in the sources below.

How do I use the expected-wrong-summaries number?

Enter the number of documents your app will summarise. The tool multiplies each model's hallucination rate by that batch size to estimate how many summaries would contain a fabricated fact. It is a planning estimate based on the benchmark average, not a guarantee for your specific documents.

Should I pick a model on hallucination rate alone?

No. Read it with the answer rate (does the model refuse hard prompts?), plus price and latency from the related cost and speed tools. A low hallucination rate paired with a low answer rate can mean the model dodges difficult documents rather than summarising them faithfully.
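One way to read the two columns together is to set an answer-rate floor first and only then rank by hallucination rate. The sketch below is an illustration under assumptions: the rows are copied from the snapshot table on this page, while the `Row` type, the `shortlist` helper, and the 95% floor are hypothetical choices rather than part of the tool.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    hallucination_rate: float  # percent
    answer_rate: float         # percent


# A few rows copied from the snapshot table on this page.
SNAPSHOT = [
    Row("Finix S1 32B", 1.8, 99.5),
    Row("GPT-5.4 nano", 3.1, 100.0),
    Row("Gemini 2.5 Flash-Lite", 3.3, 99.5),
    Row("Phi-4", 3.7, 80.7),
    Row("Llama 3.3 70B Instruct Turbo", 4.1, 99.5),
    Row("Arctic Instruct", 4.3, 62.7),
]


def shortlist(rows: list[Row], min_answer_rate: float) -> list[Row]:
    """Drop models below the answer-rate floor, then sort by hallucination rate."""
    kept = [r for r in rows if r.answer_rate >= min_answer_rate]
    return sorted(kept, key=lambda r: r.hallucination_rate)


# The 95% floor is an arbitrary illustration, not a recommendation.
for row in shortlist(SNAPSHOT, min_answer_rate=95.0):
    print(f"{row.model}: {row.hallucination_rate}% hallucination, "
          f"{row.answer_rate}% answered")
```

With this floor, Phi-4 (80.7% answered) and Arctic Instruct (62.7% answered) drop out even though their hallucination rates look competitive.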

AI · Model comparison

AI Hallucination Rate Comparison

Compare how often leading AI models invent facts. Pick two LLMs and a document count to see each one's hallucination rate, factual consistency, and the expected number of wrong summaries — every figure taken from the Vectara HHEM leaderboard. No signup, sources cited below.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 21, 2026

Compare AI hallucination rates · HHEM-2.3

Vectara HHEM · snapshot 2026-05-11

Model A

Model B

Documents to summarise (batch size)

How many documents your app will summarise. We estimate how many summaries each model would get factually wrong at that scale.

Quick presets

GPT-5.4 nano (OpenAI) · Fewer

Hallucination rate: 3.1%
Factual consistency: 96.9%
Answer rate: 100.0%
Avg summary: 144.4 words
Expected wrong summaries: 31 of 1,000

Llama 3.3 70B Instruct Turbo

Full leaderboard snapshot

#	Model (vendor)	Hallucination rate	Factual consistency	Answer rate
1	Finix S1 32B Ant Group	1.8%	98.2%	99.5%
2	GPT-5.4 nano OpenAI	3.1%	96.9%	100.0%
3	Gemini 2.5 Flash-Lite Google	3.3%	96.7%	99.5%
4	Phi-4 Microsoft	3.7%	96.3%	80.7%
5	Llama 3.3 70B Instruct Turbo Meta	4.1%	95.9%	99.5%
6	Arctic Instruct Snowflake	4.3%	95.7%	62.7%
7	Gemma 3 12B Google	4.4%	95.6%	97.4%
8	Mistral Large (24.11) Mistral AI	4.5%	95.5%	99.9%
9	Qwen3 8B Alibaba Qwen	4.8%	95.2%	99.9%
10	Nova Pro Amazon	5.1%	94.9%	99.3%
11	Nova 2 Lite Amazon	5.1%	94.9%	99.6%
12	Mistral Small (25.01) Mistral AI	5.1%	94.9%	97.9%
13	Granite 4.0 H Small IBM	5.2%	94.8%	100.0%
14	Gemma 4 26B A4B Google	5.2%	94.8%	99.8%
15	Jamba Mini 2 AI21 Labs	5.3%	94.7%	99.6%
16	DeepSeek V3.2 Exp DeepSeek	5.3%	94.7%	96.6%
17	Qwen3 14B Alibaba Qwen	5.4%	94.6%	99.9%
18	Nova Micro Amazon	5.5%	94.5%	100.0%
19	DeepSeek V3.1 DeepSeek	5.5%	94.5%	94.5%
20	GPT-5.4 mini OpenAI	5.5%	94.5%	100.0%
21	GPT-4.1 OpenAI	5.6%	94.4%	99.9%
22	Qwen3 4B Alibaba Qwen	5.7%	94.3%	99.9%
23	Grok 3 xAI	5.8%	94.2%	93.0%
24	Qwen3 32B Alibaba Qwen	5.9%	94.1%	99.9%
25	Nova Lite Amazon	6.1%	93.9%	99.9%
26	DeepSeek V3 DeepSeek	6.1%	93.9%	97.5%
27	DeepSeek V3.2 DeepSeek	6.3%	93.7%	92.6%
28	Gemma 3 4B Google	6.4%	93.6%	67.3%
29	Command R+ (08-2024) Cohere	6.9%	93.1%	95.0%
30	Trinity Large Preview Arcee AI	6.9%	93.1%	99.0%

Every rate is a published value from the Vectara HHEM leaderboard (HHEM-2.3, snapshot 2026-05-11). Expected counts and relative reduction are arithmetic over those values — no opinion, no scoring.

How it works

Every number on this page comes from the Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard, a widely cited public benchmark for LLM factual faithfulness. It does not measure reasoning or coding skill. It measures one thing: when you hand a model a document and ask for a summary, how often does the summary contain a claim the document never made?

The protocol is fixed and reproducible. Each model is given the same corpus of 7,700+ articles (50–24,000 words each) and asked only to summarise. Vectara's HHEM model then scores every summary for factual consistency against its source. The published figures, scored with HHEM-2.3, are:

Hallucination Rate — the share of answered summaries with at least one unsupported claim. Lower is better.
Factual Consistency Rate — the complement, FCR = 100 − HallucinationRate. This page recomputes it from the hallucination rate as an internal cross-check, so the two columns can never silently disagree.
Answer Rate — how often the model actually produced a summary instead of refusing. A low answer rate can flatter the hallucination rate, because skipped documents are never scored.
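The cross-check between the first two columns might look like the sketch below. It is an assumption about how such a check could be written, not the page's actual code: the `factual_consistency` and `mismatches` helpers are hypothetical, and the three rows are copied from the snapshot table. Rates are kept as strings and parsed with `Decimal` so the comparison is exact rather than subject to binary floating-point rounding.

```python
from decimal import Decimal

# (model, hallucination rate %, factual consistency %) as strings,
# copied from the snapshot table so Decimal keeps them exact.
ROWS = [
    ("Finix S1 32B", "1.8", "98.2"),
    ("GPT-5.4 nano", "3.1", "96.9"),
    ("Trinity Large Preview", "6.9", "93.1"),
]


def factual_consistency(hallucination_rate: str) -> Decimal:
    """FCR = 100 - HallucinationRate, in exact decimal arithmetic."""
    return Decimal(100) - Decimal(hallucination_rate)


def mismatches(rows):
    """Return the models whose published FCR differs from the recomputed one."""
    return [model for model, hr, fcr in rows
            if factual_consistency(hr) != Decimal(fcr)]


print(mismatches(ROWS))  # [] when every row is consistent
```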

From those cited values the tool derives just two figures, with plain arithmetic and no scoring weights:

Expected wrong summaries for a batch of N documents: E = round( (HallucinationRate ÷ 100) × N ). This turns an abstract percentage into a concrete count for the scale you actually operate at.
Relative hallucination reduction of the more accurate model over the other: ((HR_high − HR_low) ÷ HR_high) × 100. A drop from 4.1% to 3.1% is a 24% relative reduction, even though the absolute gap is just one percentage point.
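Both formulas fit in a short sketch. The function names are hypothetical, and the rounding mode is an assumption inferred from the worked example on this page, which rounds 16.5 up to 17. Python's built-in `round` rounds halves to the nearest even number (`round(16.5)` gives 16), so the sketch uses `Decimal` with `ROUND_HALF_UP` instead.

```python
from decimal import Decimal, ROUND_HALF_UP


def expected_wrong(hallucination_rate: str, n_docs: int) -> int:
    """E = round((HallucinationRate / 100) * N), rounding halves up."""
    exact = Decimal(hallucination_rate) / 100 * n_docs
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def relative_reduction(hr_a: str, hr_b: str) -> Decimal:
    """((HR_high - HR_low) / HR_high) * 100, to one decimal place."""
    low, high = sorted((Decimal(hr_a), Decimal(hr_b)))
    if high == 0:
        raise ValueError("relative reduction is undefined when both rates are 0")
    pct = (high - low) / high * 100
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(expected_wrong("3.1", 1000))       # 31
print(expected_wrong("3.3", 500))        # 17 (16.5 rounds up)
print(relative_reduction("4.1", "3.1"))  # 24.4
```

Passing rates as strings keeps values such as 3.3 exact, so the half-way case lands on 16.5 precisely instead of slightly above or below it.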

Because the data is a dated snapshot (2026-05-11), the tool makes no network call and renders instantly. When Vectara refreshes the leaderboard, this snapshot and its LAST_VERIFIED date are updated together.

Worked examples

GPT-5.4 nano vs Llama 3.3 70B Instruct Turbo, 1,000 documents

GPT-5.4 nano: rate 3.1% → round(0.031 × 1000) = 31 wrong summaries
Llama 3.3 70B Instruct Turbo: rate 4.1% → round(0.041 × 1000) = 41 wrong summaries
Fewer wrong summaries: 10
Relative reduction: (4.1 − 3.1) ÷ 4.1 × 100 = 24.4%
Factual-consistency cross-check: 100 − 3.1 = 96.9% and 100 − 4.1 = 95.9% ✓

Gemini 2.5 Flash-Lite vs Grok 3, 500 documents

Gemini 2.5 Flash-Lite: rate 3.3% → round(0.033 × 500) = round(16.5) = 17
Grok 3: rate 5.8% → round(0.058 × 500) = 29
Fewer wrong summaries: 12
Relative reduction: (5.8 − 3.3) ÷ 5.8 × 100 = 43.1%

Edge case — small batch hides the gap (Finix S1 32B, 1 document)

Finix S1 32B: rate 1.8% → round(0.018 × 1) = round(0.018) = 0
Worst snapshot model (6.9%): round(0.069 × 1) = round(0.069) = 0
At N = 1 both round to zero — the rate only bites at scale.
This is why the tool asks for your real batch size instead of judging on a single summary.
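To see where the gap first becomes visible, one can search for the smallest batch size at which the two rounded counts differ. This is a sketch under assumptions: `smallest_distinguishing_batch` is a hypothetical helper, not a feature of the tool, and it reuses the half-up rounding implied by the worked examples.

```python
from decimal import Decimal, ROUND_HALF_UP


def expected_wrong(hallucination_rate: str, n_docs: int) -> int:
    """round((rate / 100) * N) with halves rounded up, as in the examples above."""
    exact = Decimal(hallucination_rate) / 100 * n_docs
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def smallest_distinguishing_batch(hr_a: str, hr_b: str, limit: int = 100_000):
    """Return the smallest N at which the two rounded counts differ, else None."""
    for n in range(1, limit + 1):
        if expected_wrong(hr_a, n) != expected_wrong(hr_b, n):
            return n
    return None


# Finix S1 32B (1.8%) against the worst snapshot rate (6.9%).
print(smallest_distinguishing_batch("1.8", "6.9"))  # 8
```

Below 8 documents both counts round to zero. At 8 documents the 6.9% model reaches 0.552 expected wrong summaries, which rounds to 1, while the 1.8% model is still at 0.144.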

Frequently asked questions

Sources & references

The model rates on this page were last cross-checked against the Vectara HHEM leaderboard on 2026-06-21. Figures are a dated snapshot and are refreshed when Vectara updates the leaderboard. Vendor and model names are trademarks of their respective owners.

Related tools

AI Data Privacy Compare

Does ChatGPT, Claude, Gemini, Copilot, Meta AI, Grok, DeepSeek or Mistral train on your data? A side-by-side reference across consumer, API and enterprise tiers — training, retention, human review, opt-out and Zero-Data-Retention — every cell cited to the official policy.

LLM Benchmark Compare

Compare the top large language models by their published benchmark scores — MMLU-Pro, GPQA Diamond, SWE-bench Verified, HumanEval, AIME, MATH and MMMU. Pick 2–6 models, sort by any benchmark, and get an apples-to-apples composite. Every figure cited from the vendor's own model card.

AI Knowledge Cutoff

Look up any major LLM's training-data knowledge cutoff and release date, straight from the provider's model card, with how stale its knowledge is today.

Spotted a model that should be added, or a number that looks off?

Email me at [email protected] — most fixes ship within 24 hours.