How to Test AI Output for Factual Accuracy
Before You Start
Accuracy testing for AI is fundamentally different from testing traditional software. Deterministic software either returns the right answer or the wrong one, and a passing test means the code is correct. AI output varies between runs, may be partially correct, and can be accurate on one phrasing of a question while hallucinating on a slightly different phrasing. Your testing framework needs to account for this non-determinism by testing at statistical scale (many questions, multiple runs), measuring partial correctness (not just pass/fail), and covering multiple phrasings of the same underlying question.
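As a minimal sketch of what statistical-scale testing looks like, the snippet below runs every phrasing of a question several times and aggregates the results rather than trusting a single run. The generate_response and is_accurate helpers are placeholders for your own system call and scoring logic, not part of any specific library:

def accuracy_rate(question_variants, ground_truth, runs_per_variant=5):
    # Run each phrasing multiple times; a single run proves nothing
    # about a non-deterministic system.
    outcomes = []
    for variant in question_variants:
        for _ in range(runs_per_variant):
            response = generate_response(variant)  # placeholder: your system call
            outcomes.append(is_accurate(response, ground_truth))  # placeholder: your scorer
    # Fraction of runs judged accurate across all phrasings
    return sum(outcomes) / len(outcomes)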
Step-by-Step Testing Framework
Create a set of at least 100 questions that represent the queries your system handles in production. For each question, provide the verified correct answer, the source of truth for that answer, and any important nuances (acceptable answer variations, time-sensitive qualifiers, required specificity). Draw questions from actual production queries where possible, supplemented with edge cases you want to cover. Organize the dataset by category: factual questions (dates, names, numbers), relationship questions (how A relates to B), process questions (how to do X), and comparison questions (difference between A and B). Each category tends to have different hallucination rates.
# Evaluation dataset structure
eval_dataset = [
    {
        "id": "auth-001",
        "question": "What authentication method does the API use?",
        "ground_truth": "OAuth 2.0 with Bearer tokens",
        "source": "API Documentation, Section 2.1",
        "category": "factual",
        "variants": [
            "How does the API handle authentication?",
            "What auth does the API support?"
        ]
    },
    {
        "id": "limit-001",
        "question": "What is the API rate limit?",
        "ground_truth": "100 requests per minute per API key",
        "source": "API Documentation, Section 5.3",
        "category": "numerical",
        "variants": [
            "How many API calls can I make per minute?",
            "What are the rate limiting rules?"
        ]
    }
]

A single "accuracy" number hides important distinctions between failure types. Track multiple metrics. Claim-level accuracy measures the percentage of factual claims in AI output that are correct. Entity accuracy measures how often the AI gets proper nouns right (names, product names, version numbers). Numerical accuracy measures how often specific numbers, dates, and measurements are correct. Completeness measures how often the AI includes all key facts from the ground truth rather than giving a partial answer. Fabrication rate measures how often the AI adds false claims that were not in the ground truth at all. Each metric tells you something different about where your system fails.
class AccuracyMetrics:
    """Accumulates per-claim counts across an evaluation run."""

    def __init__(self):
        self.claim_correct = 0
        self.claim_total = 0
        self.entity_correct = 0
        self.entity_total = 0
        self.numerical_correct = 0
        self.numerical_total = 0
        self.fabrication_count = 0
        self.completeness_scores = []

    def summary(self):
        # max(..., 1) guards against division by zero on empty runs
        return {
            "claim_accuracy": self.claim_correct / max(self.claim_total, 1),
            "entity_accuracy": self.entity_correct / max(self.entity_total, 1),
            "numerical_accuracy": self.numerical_correct / max(self.numerical_total, 1),
            "fabrication_rate": self.fabrication_count / max(self.claim_total, 1),
            "avg_completeness": sum(self.completeness_scores) / max(len(self.completeness_scores), 1)
        }

For each evaluation question, run the AI system to generate a response, then score the response against the ground truth. Automated scoring works at three levels. First, semantic similarity between the generated answer and the ground truth catches gross inaccuracies. Second, entity extraction and comparison catches specific factual errors (wrong names, wrong numbers, wrong versions). Third, entailment classification checks whether each claim in the generated answer is supported by the ground truth. Combine these into a per-question accuracy score. Use an LLM as an automated judge for nuanced cases where simple matching is insufficient: give the judge the question, ground truth, and generated answer, and ask it to score accuracy on a 1 to 5 scale with justification.
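The entailment check can be sketched with an off-the-shelf NLI model. The snippet below is an illustration under assumptions: it uses Hugging Face transformers with the roberta-large-mnli model (one choice among many), and it assumes you have already split the response into individual claims, for example by sentence:

from transformers import pipeline

# NLI classifier: premise = ground truth, hypothesis = claim from the response
nli = pipeline("text-classification", model="roberta-large-mnli")

def claim_supported(ground_truth, claim):
    out = nli({"text": ground_truth, "text_pair": claim})
    result = out[0] if isinstance(out, list) else out
    # roberta-large-mnli labels: CONTRADICTION, NEUTRAL, ENTAILMENT
    return result["label"] == "ENTAILMENT"

The combined scorer below folds the similarity, entity, and LLM-judge signals into a single per-question score.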
def auto_score(question, ground_truth, ai_response):
    # Semantic similarity check: catches gross inaccuracies.
    # embed and cosine_similarity are placeholders for your embedding stack.
    sim = cosine_similarity(embed(ground_truth), embed(ai_response))

    # Entity comparison: catches wrong names, numbers, and versions.
    # extract_entities is a placeholder that returns a set of entities.
    gt_entities = extract_entities(ground_truth)
    resp_entities = extract_entities(ai_response)
    entity_overlap = len(gt_entities & resp_entities) / max(len(gt_entities), 1)

    # LLM judge for nuanced scoring
    judge_prompt = f"""Score the AI response for factual accuracy
compared to the ground truth. Score 1-5.
Question: {question}
Ground truth: {ground_truth}
AI response: {ai_response}
Score (1=wrong, 5=perfectly accurate):"""
    # Takes the first character of the reply as the score; a production
    # version should parse and validate this more defensively.
    judge_score = int(llm.generate(judge_prompt).strip()[0])

    return {
        "similarity": sim,
        "entity_overlap": entity_overlap,
        "judge_score": judge_score,
        "overall": (sim * 0.2 + entity_overlap * 0.3 +
                    judge_score / 5 * 0.5)
    }

Execute your full evaluation dataset whenever you change the system prompt, update the knowledge base, switch models, or modify retrieval parameters. Store results with timestamps so you can track accuracy trends over time. Set minimum accuracy thresholds that block deployment if a change causes a regression. For example, if your baseline claim accuracy is 92%, set a threshold at 88% that triggers an alert if a change drops accuracy by more than 4 points. Compare not just overall accuracy but per-category accuracy, because a prompt change might improve factual accuracy while degrading numerical accuracy.
from statistics import mean

def run_regression_suite(eval_dataset, system_config, previous_run=None):
    # generate_response, group_and_average, and find_regressions are
    # placeholders for your system call and aggregation helpers;
    # ACCURACY_THRESHOLD is a module-level constant (e.g., 0.88).
    results = []
    for item in eval_dataset:
        response = generate_response(
            question=item["question"],
            config=system_config
        )
        score = auto_score(
            item["question"],
            item["ground_truth"],
            response
        )
        results.append({
            "id": item["id"],
            "category": item["category"],
            "scores": score,
            "response": response
        })

    # Check against thresholds
    overall = mean([r["scores"]["overall"] for r in results])
    by_category = group_and_average(results, "category")
    return {
        "overall_accuracy": overall,
        "by_category": by_category,
        "passed": overall >= ACCURACY_THRESHOLD,
        "regressions": find_regressions(results, previous_run)
    }

Automated testing catches known failure modes. Manual review catches unknown ones. Sample 20 to 50 production responses per week and have a human reviewer classify each factual claim as correct, incorrect, or unverifiable. Track the manual accuracy rate alongside your automated metrics. When manual review uncovers a new type of inaccuracy, add questions targeting that failure mode to your evaluation dataset so future automated runs catch it. Over time, this feedback loop between manual review and automated testing builds a comprehensive test suite that covers the failure modes your specific system actually produces.
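One lightweight way to operationalize the weekly sample, as a sketch: production_log is an assumed list of records with question and response fields, and reviewers fill in the empty claim_labels list by hand.

import random

def sample_for_review(production_log, n=30, seed=None):
    # Draw a reproducible random sample of production responses.
    rng = random.Random(seed)
    sample = rng.sample(production_log, min(n, len(production_log)))
    # Reviewers attach one label per factual claim:
    # "correct", "incorrect", or "unverifiable".
    return [
        {"question": item["question"],
         "response": item["response"],
         "claim_labels": []}
        for item in sample
    ]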
Interpreting Results
Accuracy numbers only mean something in context. A 95% claim accuracy rate sounds good, but if your system processes 10,000 queries per day and the average response contains 8 claims, that 5% error rate means approximately 4,000 incorrect claims reaching users daily. For high-stakes applications, that is far too many. For low-stakes applications, it might be acceptable. Set your accuracy targets based on the cost of errors in your specific domain, not on abstract benchmarks.
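The arithmetic behind that estimate, as a quick check:

queries_per_day = 10_000
claims_per_response = 8
claim_error_rate = 1 - 0.95  # 95% claim accuracy
incorrect_claims_daily = queries_per_day * claims_per_response * claim_error_rate
# 10,000 * 8 * 0.05 = 4,000 incorrect claims reaching users per day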
Pay attention to the distribution of errors, not just the average. A system that is 95% accurate overall but 60% accurate on numerical questions has a specific, fixable problem (numerical hallucination) that the overall metric obscures. Category-level metrics reveal these patterns and point you toward targeted fixes rather than general improvements.
Build AI you can trust and verify. Adaptive Recall provides confidence-scored memories and source attribution that make accuracy testing systematic and reliable.
Get Started Free