induwara.lk

AI Hallucination Rate Comparison

Compare how often leading AI models invent facts. Pick two LLMs and a document count to see each one's hallucination rate, factual consistency, and the expected number of wrong summaries — every figure taken from the Vectara HHEM leaderboard. No signup, sources cited below.

By Induwara Ashinsana · Updated Jun 21, 2026
Compare AI hallucination rates (HHEM-2.3)
Vectara HHEM · snapshot 2026-05-11

Enter how many documents your app will summarise. The tool estimates how many summaries each model would get factually wrong at that scale.

| Metric | GPT-5.4 nano (OpenAI) | Llama 3.3 70B Instruct Turbo (Meta) |
|---|---|---|
| Hallucination rate | 3.1% | 4.1% |
| Factual consistency | 96.9% | 95.9% |
| Answer rate | 100.0% | 99.5% |
| Avg summary | 144.4 words | 64.6 words |
| Expected wrong summaries | 31 of 1,000 | 41 of 1,000 |
Relative hallucination reduction: 24.4%

GPT-5.4 nano hallucinates 24.4% less often than Llama 3.3 70B Instruct Turbo (3.1% vs 4.1%).

Fewer wrong summaries: 10

Across 1,000 documents, GPT-5.4 nano is expected to produce 10 fewer factually wrong summaries than Llama 3.3 70B Instruct Turbo.

Full leaderboard snapshot

| # | Model | Vendor | Hallucination rate | Factual consistency | Answer rate |
|---|---|---|---|---|---|
| 1 | Finix S1 32B | Ant Group | 1.8% | 98.2% | 99.5% |
| 2 | GPT-5.4 nano | OpenAI | 3.1% | 96.9% | 100.0% |
| 3 | Gemini 2.5 Flash-Lite | Google | 3.3% | 96.7% | 99.5% |
| 4 | Phi-4 | Microsoft | 3.7% | 96.3% | 80.7% |
| 5 | Llama 3.3 70B Instruct Turbo | Meta | 4.1% | 95.9% | 99.5% |
| 6 | Arctic Instruct | Snowflake | 4.3% | 95.7% | 62.7% |
| 7 | Gemma 3 12B | Google | 4.4% | 95.6% | 97.4% |
| 8 | Mistral Large (24.11) | Mistral AI | 4.5% | 95.5% | 99.9% |
| 9 | Qwen3 8B | Alibaba Qwen | 4.8% | 95.2% | 99.9% |
| 10 | Nova Pro | Amazon | 5.1% | 94.9% | 99.3% |
| 11 | Nova 2 Lite | Amazon | 5.1% | 94.9% | 99.6% |
| 12 | Mistral Small (25.01) | Mistral AI | 5.1% | 94.9% | 97.9% |
| 13 | Granite 4.0 H Small | IBM | 5.2% | 94.8% | 100.0% |
| 14 | Gemma 4 26B A4B | Google | 5.2% | 94.8% | 99.8% |
| 15 | Jamba Mini 2 | AI21 Labs | 5.3% | 94.7% | 99.6% |
| 16 | DeepSeek V3.2 Exp | DeepSeek | 5.3% | 94.7% | 96.6% |
| 17 | Qwen3 14B | Alibaba Qwen | 5.4% | 94.6% | 99.9% |
| 18 | Nova Micro | Amazon | 5.5% | 94.5% | 100.0% |
| 19 | DeepSeek V3.1 | DeepSeek | 5.5% | 94.5% | 94.5% |
| 20 | GPT-5.4 mini | OpenAI | 5.5% | 94.5% | 100.0% |
| 21 | GPT-4.1 | OpenAI | 5.6% | 94.4% | 99.9% |
| 22 | Qwen3 4B | Alibaba Qwen | 5.7% | 94.3% | 99.9% |
| 23 | Grok 3 | xAI | 5.8% | 94.2% | 93.0% |
| 24 | Qwen3 32B | Alibaba Qwen | 5.9% | 94.1% | 99.9% |
| 25 | Nova Lite | Amazon | 6.1% | 93.9% | 99.9% |
| 26 | DeepSeek V3 | DeepSeek | 6.1% | 93.9% | 97.5% |
| 27 | DeepSeek V3.2 | DeepSeek | 6.3% | 93.7% | 92.6% |
| 28 | Gemma 3 4B | Google | 6.4% | 93.6% | 67.3% |
| 29 | Command R+ (08-2024) | Cohere | 6.9% | 93.1% | 95.0% |
| 30 | Trinity Large Preview | Arcee AI | 6.9% | 93.1% | 99.0% |

Every rate is a published value from the Vectara HHEM leaderboard (HHEM-2.3, snapshot 2026-05-11). Expected counts and relative reduction are arithmetic over those values — no opinion, no scoring.

How it works

Every number on this page comes from the Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard, the most-cited public benchmark for LLM factual faithfulness. It does not measure reasoning or coding skill — it measures one thing: when you hand a model a document and ask for a summary, how often does the summary contain a claim the document never made?

The protocol is fixed and reproducible. Each model is given the same corpus of 7,700+ articles (50–24,000 words each) and asked only to summarise. Vectara's HHEM model then scores every summary for factual consistency against its source. The published figures, scored with HHEM-2.3, are:

  • Hallucination Rate — the share of answered summaries with at least one unsupported claim. Lower is better.
  • Factual Consistency Rate — the complement, FCR = 100 − HallucinationRate. This page recomputes it from the hallucination rate as an internal cross-check, so the two columns can never silently disagree.
  • Answer Rate — how often the model actually produced a summary instead of refusing. A low answer rate can flatter the hallucination rate, because skipped documents are never scored.
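
The cross-check in the second bullet and the caveat in the third can be sketched in a few lines of Python. This is an illustration only, not the page's actual code: the `Row` type, the `check_row` helper, the 0.05-point tolerance, and the 90% answer-rate threshold are all assumptions made for the example. The three rows are copied from the leaderboard snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    hallucination_rate: float   # percent, as published
    factual_consistency: float  # percent, as published
    answer_rate: float          # percent, as published


# Both constants are hypothetical values chosen for this sketch.
TOLERANCE = 0.05          # allowed gap between published and recomputed FCR
LOW_ANSWER_RATE = 90.0    # below this, the hallucination rate may look flattering


def check_row(row: Row) -> list[str]:
    """Return human-readable warnings for one leaderboard row."""
    warnings = []
    recomputed = 100.0 - row.hallucination_rate  # FCR = 100 - HallucinationRate
    if abs(recomputed - row.factual_consistency) > TOLERANCE:
        warnings.append(f"{row.model}: FCR column disagrees with 100 - HR")
    if row.answer_rate < LOW_ANSWER_RATE:
        warnings.append(f"{row.model}: answer rate {row.answer_rate}% is low")
    return warnings


rows = [
    Row("GPT-5.4 nano", 3.1, 96.9, 100.0),
    Row("Phi-4", 3.7, 96.3, 80.7),
    Row("Arctic Instruct", 4.3, 95.7, 62.7),
]

for row in rows:
    for warning in check_row(row):
        print(warning)
```

With these rows, the consistency check passes for all three models, while Phi-4 and Arctic Instruct are flagged because a sizeable share of their documents was never scored.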

From those cited values the tool derives just two figures, with plain arithmetic and no scoring weights:

  1. Expected wrong summaries for a batch of N documents: E = round( (HallucinationRate ÷ 100) × N ). This turns an abstract percentage into a concrete count for the scale you actually operate at.
  2. Relative hallucination reduction of the more accurate model over the other: ((HR_high − HR_low) ÷ HR_high) × 100. A drop from 4.1% to 3.1% is a 24% relative reduction, even though the absolute gap is just one percentage point.
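
The two formulas above can be written out as a short Python sketch. The function names are hypothetical, and the use of `Decimal` with half-up rounding is an assumption: the page only says `round`, and the worked example that rounds 16.5 to 17 suggests halves round up.

```python
from decimal import Decimal, ROUND_HALF_UP


def expected_wrong(hallucination_rate_pct: str, n_documents: int) -> int:
    """E = round((HallucinationRate / 100) * N), with halves rounded up."""
    exact = Decimal(hallucination_rate_pct) * n_documents / 100
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def relative_reduction(rate_a_pct: str, rate_b_pct: str) -> Decimal:
    """((HR_high - HR_low) / HR_high) * 100, rounded to one decimal place."""
    a, b = Decimal(rate_a_pct), Decimal(rate_b_pct)
    hr_low, hr_high = min(a, b), max(a, b)
    if hr_high == 0:
        # Both rates are zero, so there is nothing to reduce.
        return Decimal("0.0")
    pct = (hr_high - hr_low) / hr_high * 100
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(expected_wrong("3.1", 1000))       # 31
print(expected_wrong("4.1", 1000))       # 41
print(relative_reduction("3.1", "4.1"))  # 24.4
```

Rates are passed as strings so that `Decimal` sees the published value exactly, rather than a binary floating-point approximation of it.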

Because the data is a dated snapshot (2026-05-11), the tool makes no network call and renders instantly. When Vectara refreshes the leaderboard, this snapshot and its LAST_VERIFIED date are updated together.

Worked examples

GPT-5.4 nano vs Llama 3.3 70B Instruct Turbo, 1,000 documents

  1. GPT-5.4 nano: rate 3.1% → round(0.031 × 1000) = 31 wrong summaries
  2. Llama 3.3 70B Instruct Turbo: rate 4.1% → round(0.041 × 1000) = 41 wrong summaries
  3. Fewer wrong summaries: 10
  4. Relative reduction: (4.1 − 3.1) ÷ 4.1 × 100 = 24.4%
  5. Factual-consistency cross-check: 100 − 3.1 = 96.9% and 100 − 4.1 = 95.9% ✓

Gemini 2.5 Flash-Lite vs Grok 3, 500 documents

  1. Gemini 2.5 Flash-Lite: rate 3.3% → round(0.033 × 500) = round(16.5) = 17
  2. Grok 3: rate 5.8% → round(0.058 × 500) = 29
  3. Fewer wrong summaries: 12
  4. Relative reduction: (5.8 − 3.3) ÷ 5.8 × 100 = 43.1%
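
Step 1 of this example lands exactly on a half (16.5), so the result depends on the rounding rule. The page's figure of 17 implies halves round up. Python's built-in `round` rounds halves to the nearest even integer instead, which would give 16. The sketch below contrasts the two; `round_half_up` is a hypothetical helper, not something taken from the tool.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: Decimal) -> int:
    """Round to the nearest integer, sending exact halves upward."""
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


gemini_exact = Decimal("3.3") * 500 / 100   # exactly 16.5
grok_exact = Decimal("5.8") * 500 / 100     # exactly 29

builtin_result = round(gemini_exact)        # half-to-even rule
half_up_result = round_half_up(gemini_exact)

print(builtin_result)                                 # 16
print(half_up_result)                                 # 17
print(round_half_up(grok_exact) - half_up_result)     # 12
```

If you reproduce the page's numbers in another language, check which rule its rounding function uses before comparing counts.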

Edge case — small batch hides the gap (Finix S1 32B, 1 document)

  1. Finix S1 32B: rate 1.8% → round(0.018 × 1) = round(0.018) = 0
  2. Worst snapshot model (6.9%): round(0.069 × 1) = round(0.069) = 0
  3. At N = 1 both round to zero — the rate only bites at scale.
  4. This is why the tool asks for your real batch size instead of judging on a single summary.
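
To see where the gap starts to show, one can sweep the batch size until the two rounded counts differ. The sketch below does that for the two rates in this example. The helper names and the search limit are assumptions for illustration, and half-up rounding is assumed as in the 16.5 to 17 step of the previous example.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def expected_wrong(rate_pct: str, n_documents: int) -> int:
    """round((rate / 100) * N) with halves rounded up."""
    exact = Decimal(rate_pct) * n_documents / 100
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def smallest_distinguishing_batch(
    rate_a_pct: str, rate_b_pct: str, limit: int = 10_000
) -> Optional[int]:
    """Smallest N (up to limit) at which the two expected counts differ."""
    for n in range(1, limit + 1):
        if expected_wrong(rate_a_pct, n) != expected_wrong(rate_b_pct, n):
            return n
    return None  # counts never diverged within the search limit


# Finix S1 32B (1.8%) against the worst snapshot rate (6.9%).
print(smallest_distinguishing_batch("1.8", "6.9"))  # 8
```

Under these assumptions the two models first produce different counts at N = 8, where 6.9% rounds to one wrong summary and 1.8% still rounds to zero.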

Sources & references

The model rates on this page were last cross-checked against the Vectara HHEM leaderboard on 2026-06-21. Figures are a dated snapshot and are refreshed when Vectara updates the leaderboard. Vendor and model names are trademarks of their respective owners.
