Published benchmark results for AI hallucination rates vary widely depending on who is measuring, what task type is being evaluated, and what counts as a 'hallucination' in the methodology. A model that scores well on a medical knowledge benchmark may still produce citation errors on research queries. Context matters as much as the headline number.

What benchmarks typically measure

Most published hallucination benchmarks test factual recall: whether a model correctly answers questions about documented facts. Common benchmarks include TruthfulQA (tests whether models give truthful answers to questions that often prompt misleading responses), MMLU (measures knowledge across academic domains), FActScore (measures factual precision in long-form generated text), and HaluEval (evaluates hallucination in question answering, dialogue, and summarization). Each benchmark captures a different type of error, so a model's performance on one does not necessarily predict its performance on others.

Key findings from published research

Studies from university research groups and AI safety organizations have found error rates for factual queries ranging from around 3% to over 27%, depending heavily on the task domain. Citation fabrication rates are particularly high — some studies have found that when AI models are asked to provide academic references, a substantial portion of the generated citations cannot be verified against published academic databases. Rates vary by query type: general factual questions, domain-specific technical queries, and citation generation all produce different error profiles.
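To illustrate why a single headline number can hide very different error profiles, here is a minimal Python sketch that tallies error rates per query type. The function name, the category labels, and the sample records are all hypothetical and invented for illustration; they are not results from any study.

```python
from collections import defaultdict


def error_rate_by_category(records):
    """Return {category: error_rate} for (category, is_error) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, is_error in records:
        totals[category] += 1
        if is_error:
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}


# Hypothetical, hand-made labels (NOT measured data): each pair is
# (query type, whether a reviewer judged the response to contain an error).
sample = [
    ("general", False), ("general", False), ("general", False), ("general", True),
    ("citation", True), ("citation", False),
]

rates = error_rate_by_category(sample)
overall = sum(is_error for _, is_error in sample) / len(sample)

print(rates)    # per-category error rates
print(overall)  # single aggregate rate across all records
```

In this toy data the aggregate rate sits between the two per-category rates, which is the general pattern: an overall figure says little about the category your own query belongs to.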

Why model scores don't translate to personal risk

A 5% hallucination rate on a benchmark means that, on average, about 1 in 20 responses contains a factual error, and that error may fall on the specific fact you asked about. Benchmarks report averages over populations of queries; your individual query may fall into a category where the error rate is higher. Domain-specific queries, requests for specific citations, and questions about niche topics all tend to have higher error rates than general knowledge questions.
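The 1-in-20 arithmetic can be made concrete with a short Python sketch. It rests on a simplifying assumption not made in the text above: every response has the same error probability and errors are independent of one another. Real queries will not behave this neatly, so treat the numbers as a rough guide. The function names are illustrative.

```python
def expected_errors(rate: float, n_responses: int) -> float:
    """Average number of erroneous responses out of n_responses."""
    return rate * n_responses


def prob_at_least_one_error(rate: float, n_responses: int) -> float:
    """Chance that at least one of n_responses contains an error.

    Assumes independent responses with an identical per-response error rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if n_responses < 0:
        raise ValueError("n_responses must be non-negative")
    return 1.0 - (1.0 - rate) ** n_responses


# A 5% rate over 20 responses: one error expected on average,
# and roughly a 64% chance of seeing at least one.
print(expected_errors(0.05, 20))
print(round(prob_at_least_one_error(0.05, 20), 3))  # 0.642
```

Under these assumptions, even a low per-response rate adds up quickly over repeated use, which is why the headline figure understates the exposure of someone who relies on many responses.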

The case for verification regardless of model choice

No currently available AI model has a zero hallucination rate on practical tasks. The specific rate varies by model, version, query type, and domain — but the appropriate response to any non-zero rate in high-stakes contexts is independent verification. This is true regardless of which model you use.
