AI Hallucination Rates by Model Compared

Hallucination rates vary significantly across models, benchmarks, and question types. Frontier models from OpenAI, Anthropic, and Google typically hallucinate on 3% to 10% of factual questions in controlled benchmarks, while smaller and open-source models range from 8% to 25%. These numbers change dramatically depending on the domain, question type, and whether grounding context is provided, which makes raw model comparisons less useful than understanding what factors drive the rates for your specific use case.

How Hallucination Rates Are Measured

There is no single standard benchmark for hallucination. Different evaluation frameworks measure different aspects of factual accuracy, and a model that scores well on one benchmark may score poorly on another. The most common approaches are factual accuracy benchmarks like TruthfulQA, which test whether models produce correct answers to tricky questions designed to elicit common misconceptions; summarization faithfulness benchmarks like SummEval and FRANK, which measure whether summaries add, omit, or distort information from the source; and RAG-specific benchmarks like RAGAS and ARES, which measure how well models stay grounded in retrieved context.

The numbers that get cited in marketing materials and headlines are usually best-case results from benchmarks chosen to make the model look good. Real-world hallucination rates depend heavily on the domain, the question type, the amount and quality of grounding context, and how "hallucination" is defined for the evaluation. A model that hallucinates on only 3% of general knowledge questions might hallucinate on 20% of domain-specific questions about topics underrepresented in its training data. Treat benchmark numbers as relative indicators (Model A hallucinates less than Model B on this benchmark) rather than absolute predictions of production behavior.
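Because domain-specific measurement matters more than published numbers, a minimal sketch of an in-house evaluation harness is shown below. The `ask_model` function, the `is_supported` check, and the example questions are assumptions standing in for your own model client, verification method, and evaluation data.

```python
# Minimal sketch of a domain-specific hallucination evaluation harness.
# `ask_model` is a placeholder for your own model or API client; the
# questions and reference answers are illustrative, not a real benchmark.

from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    reference: str  # verified ground-truth answer

def ask_model(question: str) -> str:
    """Placeholder: call your model or API here."""
    raise NotImplementedError

def is_supported(answer: str, reference: str) -> bool:
    """Crude containment check. In practice you would use an entailment
    model or human review instead of substring matching."""
    return reference.lower() in answer.lower()

def hallucination_rate(items: list[EvalItem]) -> float:
    errors = 0
    for item in items:
        answer = ask_model(item.question)
        if not is_supported(answer, item.reference):
            errors += 1
    return errors / len(items)

# Example usage with your own domain questions:
# items = [EvalItem("What port does service X listen on?", "8443"), ...]
# print(f"hallucination rate: {hallucination_rate(items):.1%}")
```

The same harness, run separately with and without grounding context in the prompt, also gives you a direct measurement of how much grounding helps on your own questions.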

Factors That Affect Hallucination Rates

Model size and architecture matter, but less than you might expect. Larger models generally hallucinate less than smaller ones because they have more capacity to store accurate patterns. But the relationship is not linear: doubling the model size does not halve the hallucination rate. Frontier models from 2025 and 2026 show diminishing returns on factual accuracy from scale alone, suggesting that architectural improvements and training techniques contribute more than raw parameter count beyond a certain threshold.

Training data quality has a large effect that is difficult for external users to measure. Models trained on higher-quality, more carefully filtered data hallucinate less because they absorb fewer errors and contradictions during training. This is one area where closed-source models from well-funded labs have an advantage: they invest heavily in data curation that open-source training runs often cannot match.

Domain specificity is one of the strongest predictors of hallucination rate. All models perform better on topics well-represented in training data (popular programming languages, well-known historical events, basic science) and worse on specialized topics (niche technical specifications, recent events, domain-specific terminology). For a domain-specific application, the relevant hallucination rate is not the model's general benchmark score but its accuracy on your domain's questions, which you need to measure yourself.

Question type matters enormously. Open-ended questions ("explain how X works") have lower hallucination rates because the model has more room for accurate general statements. Closed-ended factual questions ("what is the exact value of X") have higher rates because precision is required and approximation counts as error. Questions requiring specific names, dates, numbers, and citations have the highest hallucination rates across all models.

Context window usage affects rates significantly. Models hallucinate more when answering from long contexts than from short ones, because attention to specific facts degrades as context length increases. A fact placed in a 2,000-token context is more reliably used than the same fact buried in a 100,000-token context. This is independent of whether the model technically supports long context windows; support and reliable use are different things.
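One way to observe this effect for your own model is to place the same fact at different depths in contexts of different lengths and check whether it is still used, in the style of needle-in-a-haystack tests. The sketch below is a hypothetical harness; the fact, the filler text, and the `ask_model` callable are all assumptions.

```python
# Sketch of a fact-placement test: embed the same fact at different depths
# in contexts of different lengths and check whether the model still uses it.
# `ask_model(prompt)` is a placeholder for your own model client.

FACT = "The Orion cluster's maintenance window is Tuesday 02:00 UTC."  # hypothetical fact
QUESTION = "When is the Orion cluster's maintenance window?"
FILLER = "This paragraph is unrelated background material. " * 50

def build_context(total_chunks: int, fact_position: int) -> str:
    chunks = [FILLER] * total_chunks
    chunks.insert(fact_position, FACT)
    return "\n\n".join(chunks)

def recall_ok(answer: str) -> bool:
    return "tuesday" in answer.lower() and "02:00" in answer

def run_placement_test(ask_model) -> None:
    """ask_model: callable(prompt) -> answer string, supplied by you."""
    for total_chunks in (5, 50, 500):                       # short, medium, long contexts
        for position in (0, total_chunks // 2, total_chunks):  # start, middle, end
            context = build_context(total_chunks, position)
            prompt = (f"{context}\n\nQuestion: {QUESTION}\n"
                      "Answer using only the text above.")
            answer = ask_model(prompt)
            print(total_chunks, position, recall_ok(answer))
```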

Hallucination Rates by Question Type

The most useful way to think about hallucination rates is by question category rather than by model, because the variation between question types is much larger than the variation between models. Factual recall questions (what is X, when did Y happen, who created Z) have the highest hallucination rates because they require precise retrieval from parametric memory. Explanation questions (how does X work, why does Y happen) have lower rates because the model can generate accurate general explanations even when it lacks specific recall. Synthesis questions (compare X and Y, summarize this document) have moderate rates, with most errors coming from misattribution rather than outright fabrication.

Numerical questions deserve special attention because they have consistently high hallucination rates across all models. When asked for specific numbers, dates, percentages, or measurements, models hallucinate at 15% to 30% rates without grounding, compared to 5% to 15% for general factual questions. The model generates numbers that are in the right ballpark but often wrong in the specifics, because it approximates from statistical patterns rather than retrieving exact values. This is why grounding with structured data (knowledge graphs, databases, verified memories with exact values) has such a large impact on numerical accuracy specifically.
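The sketch below illustrates the general pattern of routing numerical questions through a structured lookup before generation rather than letting the model approximate from memory. The metric table, field names, and prompt wording are assumptions for illustration, not a specific product's schema.

```python
# Sketch: ground numerical answers in a structured source instead of letting
# the model approximate. The "database" here is a plain dict; in practice it
# would be a real database, knowledge graph, or verified memory store.

VERIFIED_METRICS = {                      # hypothetical verified values
    ("acme-api", "p99_latency_ms"): 187,
    ("acme-api", "error_rate_pct"): 0.42,
}

def grounded_numeric_prompt(service: str, metric: str, question: str) -> str:
    value = VERIFIED_METRICS.get((service, metric))
    if value is None:
        # No verified value: instruct the model to say so rather than guess.
        return (f"{question}\n\nNo verified value is available. "
                "Say that the value is unknown; do not estimate a number.")
    return (f"{question}\n\nVerified data: {metric} for {service} = {value}. "
            "Answer using this exact value and cite it.")

prompt = grounded_numeric_prompt(
    "acme-api", "p99_latency_ms",
    "What is the current p99 latency of acme-api?")
print(prompt)
```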

The Grounding Effect

The most impactful factor in hallucination rates is not which model you use but whether you provide grounding context. Studies consistently show that adding retrieval grounding reduces hallucination rates by 40% to 70% across all models, which is a larger improvement than switching from any model to any other model. A mid-tier model with good retrieval grounding outperforms a frontier model without grounding on factual accuracy for domain-specific questions.
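As a concrete illustration of retrieval grounding, the sketch below assembles retrieved passages into a constrained prompt. The `retrieve` and `ask_model` callables and the prompt wording are assumptions, not any specific framework's API.

```python
# Sketch of basic retrieval grounding: fetch relevant passages and constrain
# generation to them. `retrieve` and `ask_model` are placeholders for your
# own search layer and model client.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k passages from your retrieval system."""
    raise NotImplementedError

def grounded_answer(query: str, ask_model) -> str:
    passages = retrieve(query)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered passages below. "
        "Cite passage numbers for every claim. If the passages do not "
        "contain the answer, say so instead of guessing.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return ask_model(prompt)
```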

This finding has practical implications for system design. Teams that spend months evaluating which model hallucinates least would get more hallucination reduction from spending that time building a better retrieval system, knowledge graph, or persistent memory layer. The model matters, but the grounding architecture matters more.

The quality of grounding also matters. Naive RAG (basic vector search with unfiltered retrieval) provides some improvement but leaves significant hallucination risk. Advanced grounding with hybrid search, reranking, knowledge graph verification, and confidence-scored memories reduces hallucination rates much further. A system with cognitive-scored retrieval from Adaptive Recall, which ranks context by relevance, recency, confidence, and entity connections, provides higher-quality grounding than a basic vector search, which translates directly to lower hallucination rates in the generated output.
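To make the idea of confidence-scored retrieval concrete, the sketch below shows one hypothetical way relevance, recency, confidence, and entity overlap could be combined into a single ranking score. The weights and field names are illustrative assumptions and are not taken from Adaptive Recall's internals.

```python
# Hypothetical composite scoring for retrieved memories, combining relevance,
# recency, confidence, and entity overlap. Weights and fields are illustrative;
# they do not reflect any specific product's implementation.

import math
import time
from typing import Optional

def composite_score(memory: dict, query_entities: set,
                    relevance: float, now: Optional[float] = None) -> float:
    now = now or time.time()
    age_days = (now - memory["created_at"]) / 86_400
    recency = math.exp(-age_days / 30)            # exponential decay, ~30-day time constant
    confidence = memory.get("confidence", 0.5)    # 0..1, from verification
    overlap = len(query_entities & set(memory.get("entities", [])))
    entity_bonus = min(overlap / 3, 1.0)          # cap the entity contribution

    return (0.5 * relevance +      # semantic similarity from vector search
            0.2 * recency +
            0.2 * confidence +
            0.1 * entity_bonus)

def rank_memories(memories: list, relevances: list, query_entities: set) -> list:
    scored = sorted(zip(memories, relevances),
                    key=lambda mr: composite_score(mr[0], query_entities, mr[1]),
                    reverse=True)
    return [m for m, _ in scored]
```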

The grounding effect compounds over time in systems with persistent memory. As the memory store accumulates verified facts from interactions, the grounding context available for each query becomes richer and more domain-specific. Early queries might have sparse grounding and higher hallucination rates. After hundreds of interactions, the memory store covers most common query topics with high-confidence verified facts, and hallucination rates on those topics drop to near the architectural minimum. This temporal improvement is unique to persistent memory systems and does not occur with static knowledge bases.

Open-Source vs Closed-Source Models

Closed-source frontier models (GPT-4 class, Claude 3 class, Gemini class) generally have lower hallucination rates than open-source models of comparable or smaller size. The gap has narrowed significantly as open-source models have improved, but it persists, particularly for factual questions requiring precise recall. The practical difference depends on your application: for many use cases with good grounding in place, the gap between model families is small enough that cost, latency, and deployment flexibility matter more than raw hallucination rate differences.

Open-source models have the advantage of customization. You can fine-tune an open-source model on your domain data, which reduces domain-specific hallucinations more effectively than general-purpose improvements. A fine-tuned open-source model with domain-specific grounding can achieve hallucination rates comparable to or better than a frontier model without fine-tuning, for questions within the fine-tuning domain.

The trade-offs between open and closed models extend beyond raw accuracy. Closed-source models typically have better instruction following, which means they adhere to grounding constraints more reliably. An open-source model might have a 7% baseline hallucination rate but ignore provided context 15% of the time, while a closed-source model might have a 5% baseline rate and ignore context only 5% of the time. In a grounded system, the closed-source model's better instruction following produces a larger net improvement because it makes better use of the grounding you provide.
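To make that interaction concrete, the short calculation below uses the hypothetical figures above, under the simplifying assumption that an answer grounded in the provided context is correct and that ignoring the context falls back to the baseline hallucination rate.

```python
# Illustrative arithmetic for how instruction following interacts with grounding.
# Simplifying assumption: when the model uses the provided context the answer is
# correct, and when it ignores the context it errs at its baseline rate.

def effective_error_rate(baseline_rate: float, ignore_context_rate: float) -> float:
    return ignore_context_rate * baseline_rate

# Hypothetical figures from the comparison above:
open_model   = effective_error_rate(baseline_rate=0.07, ignore_context_rate=0.15)
closed_model = effective_error_rate(baseline_rate=0.05, ignore_context_rate=0.05)

print(f"open-source model, grounded:   {open_model:.2%}")    # ~1.05%
print(f"closed-source model, grounded: {closed_model:.2%}")  # ~0.25%
```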

What This Means for System Design

If your goal is to minimize hallucinations in a production application, the priority order is: first, build high-quality grounding (retrieval, knowledge graph, persistent memory); second, constrain the model's generation to use the grounding (explicit prompt instructions, citation requirements); third, add detection and verification layers (fact-checking, entailment verification); and fourth, choose the best available model within your cost and latency constraints. This order reflects the actual impact of each factor on production hallucination rates, which differs from the popular assumption that the model choice is the most important factor.
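One way to picture that priority order is as a layered pipeline with the model as a swappable component. The sketch below wires hypothetical grounding, constrained generation, and verification steps together; every callable is a placeholder for your own implementation.

```python
# Sketch of the layered design described above: grounding first, constrained
# generation second, verification third, with the model as a swappable piece.
# All callables are placeholders for your own components.

def answer_with_layers(query: str, retrieve, ask_model, verify) -> str:
    # Layer 1: grounding -- gather context from retrieval, memory, or a knowledge graph.
    passages = retrieve(query)

    # Layer 2: constrained generation -- require the model to stay within the context.
    context = "\n\n".join(passages)
    prompt = (
        "Use only the context below. Cite it for each claim and say "
        "'not in context' if the answer is missing.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    draft = ask_model(prompt)

    # Layer 3: detection and verification -- check the draft against the context.
    if not verify(draft, passages):
        return "I couldn't verify an answer from the available context."
    return draft
```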

Build your evaluation framework early and measure hallucination rates for your specific domain rather than relying on published benchmarks. Your domain-specific rates will differ from general benchmarks, and the model that performs best on general benchmarks may not perform best on your questions. A good evaluation framework also lets you measure the impact of each mitigation layer independently, so you can invest in the layers that provide the most improvement for your specific application.

Reduce hallucinations at the architecture level. Adaptive Recall provides cognitive-scored retrieval, knowledge graph grounding, and confidence-weighted memories that cut fabrication rates regardless of which model you use.
