
How to Validate Extracted Entities Against a Source

Validating extracted entities means comparing what your extraction system found against what actually exists in the source text. This produces precision (how many extracted entities are correct), recall (how many real entities were found), and an error analysis that tells you exactly what to fix. Without validation, you are building a knowledge graph on a foundation you cannot trust, and errors in entity extraction compound through every downstream system that uses the graph.
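As a quick illustration with made-up numbers: if the extractor reports 20 entities for a passage, 16 of them match real entities in the text, and the passage actually contains 18 entities, then precision is 16/20 = 0.80 and recall is 16/18 ≈ 0.89. F1, the harmonic mean of the two, comes out to roughly 0.84.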

Why Validation Is Not Optional

Entity extraction systems produce errors. NER models miss domain-specific entities they were not trained on. LLMs hallucinate entities that are not in the text. Both systems misclassify entity types, extract partial names, and merge distinct entities that happen to share a name. These errors propagate: a missed entity means missing nodes in the graph, hallucinated entities mean false connections, and wrong types mean unreliable filtering. The only way to know your extraction quality is to measure it.

Validation also reveals where to invest improvement effort. If 80% of your errors are missed domain-specific entities, the fix is broadening your entity type definitions. If 80% are hallucinated entities, the fix is tightening the extraction prompt or raising the confidence threshold. Without error categorization, you are guessing at fixes instead of targeting the actual problems.

Step-by-Step Process

Step 1: Sample extraction results.
Select 100 to 200 text passages and their corresponding extracted entities. Stratify the sample by document type (documentation, chat, code comments) and by entity density (passages with many entities vs few) so the sample represents your actual data distribution. Random sampling is acceptable when data is homogeneous, but stratified sampling gives more reliable metrics when your corpus has diverse content types.
import random

def stratified_sample(passages, extracted, n=150):
    # Group (passage, extraction) pairs by document type
    by_type = {}
    for i, passage in enumerate(passages):
        doc_type = passage.get("source_type", "unknown")
        if doc_type not in by_type:
            by_type[doc_type] = []
        by_type[doc_type].append((passage, extracted[i]))

    # Draw an equal share of the sample from each document type
    sample = []
    per_type = max(1, n // len(by_type))
    for doc_type, items in by_type.items():
        k = min(per_type, len(items))
        sample.extend(random.sample(items, k))

    # Top up from the remaining items if some types had too few passages
    while len(sample) < n:
        remaining = [
            item
            for items in by_type.values()
            for item in items
            if item not in sample
        ]
        if not remaining:
            break
        sample.append(random.choice(remaining))

    return sample[:n]
Step 2: Annotate ground truth.
For each sampled passage, have a human annotator mark every entity present in the text with its correct type and canonical name. This creates the reference set that you compare your extraction results against. Use the same entity type definitions that your extraction system uses so the comparison is fair. One annotator is sufficient for initial validation; use two annotators with agreement measurement for high-stakes domains.
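The exact annotation format is up to you as long as it mirrors your extraction schema. A minimal sketch of what one annotated passage might look like (the field names and type labels here are illustrative assumptions, not required by any tool):

ground_truth = [
    {
        "passage_id": "doc-042-p3",
        "entities": [
            # Each entry records the surface name, the entity type, and the
            # canonical name the extraction should normalize to
            {"name": "PostgreSQL 15", "type": "technology", "canonical": "PostgreSQL"},
            {"name": "payments-service", "type": "service", "canonical": "payments-service"},
        ],
    },
]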
Step 3: Calculate precision and recall.
For each passage, compare the extracted entity set against the ground truth entity set. An entity is a true positive if the extracted name matches a ground truth name (exact or fuzzy match with threshold 0.85) and the type is correct. A false positive is an extracted entity with no ground truth match. A false negative is a ground truth entity with no extraction match.
from difflib import SequenceMatcher

def evaluate(extracted, ground_truth, threshold=0.85):
    tp, fp, fn = 0, 0, 0
    gt_matched = set()

    # Greedy matching: each ground-truth entity can be claimed at most once
    for ext in extracted:
        matched = False
        for i, gt in enumerate(ground_truth):
            if i in gt_matched:
                continue
            name_sim = SequenceMatcher(
                None, ext["name"].lower(), gt["name"].lower()
            ).ratio()
            # Fuzzy name match plus exact type match counts as a true positive
            if name_sim >= threshold and ext["type"] == gt["type"]:
                tp += 1
                gt_matched.add(i)
                matched = True
                break
        if not matched:
            fp += 1  # extracted entity with no ground-truth match

    # Ground-truth entities that were never claimed are false negatives
    fn = len(ground_truth) - len(gt_matched)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0
    )
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}
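The function above scores a single passage. A hedged usage sketch for rolling the per-passage counts up into corpus-level (micro-averaged) metrics, assuming each item in your sample pairs a list of extracted entities with its ground-truth list:

def evaluate_sample(pairs, threshold=0.85):
    # pairs: list of (extracted_entities, ground_truth_entities) tuples
    totals = {"tp": 0, "fp": 0, "fn": 0}
    for extracted, ground_truth in pairs:
        result = evaluate(extracted, ground_truth, threshold)
        for key in totals:
            totals[key] += result[key]

    tp, fp, fn = totals["tp"], totals["fp"], totals["fn"]
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0)
    return {"precision": precision, "recall": recall, "f1": f1, **totals}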
Step 4: Categorize errors.
Every false positive and false negative represents a specific error type. Categorize each error so you can target fixes at the most impactful categories (a tallying sketch follows the category list below).

Common error categories:

Missed entity (false negative): An entity exists in the text but was not extracted. Sub-categories: missed because the entity type was not in the extraction schema, missed because the entity name was unusual or abbreviated, or missed because it appeared in a complex sentence structure.

Hallucinated entity (false positive): The extraction system reported an entity that does not actually appear in the text, or reported something that is not an entity (a general concept extracted as a specific entity).

Wrong type: The entity was found but classified incorrectly (a team name classified as a person, a service classified as a technology).

Wrong boundary: The entity name was partially correct but the extraction captured too much ("the Redis cluster configuration" instead of "Redis") or too little ("Postgres" instead of "PostgreSQL 15").

Normalization failure: The entity was found but the canonical name is wrong or inconsistent with other extractions of the same entity.
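Assigning a category is a judgment call a reviewer makes while looking at the source text, but recording each decision in a structured form makes the counts easy to tally. A minimal sketch; the category labels mirror the list above and the record fields are assumptions:

from collections import Counter

ERROR_CATEGORIES = [
    "missed_entity",
    "hallucinated_entity",
    "wrong_type",
    "wrong_boundary",
    "normalization_failure",
]

def tally_errors(error_log):
    # error_log: list of dicts such as
    # {"passage_id": "doc-042-p3", "error": "fp",
    #  "category": "hallucinated_entity", "note": "..."}
    counts = Counter(item["category"] for item in error_log)
    total = sum(counts.values()) or 1
    # Report categories from most to least frequent with their share of all errors
    return [(cat, n, round(n / total, 2)) for cat, n in counts.most_common()]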

Step 5: Fix the top error categories.
Address the most frequent error types first. For missed entities, broaden entity type definitions in the prompt or lower the confidence threshold. For hallucinations, add explicit instructions like "only extract entities that are explicitly mentioned by name" and raise the confidence threshold. For wrong types, add clearer type definitions with boundary examples. For wrong boundaries, add examples of correct entity boundaries to the prompt.
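One way to operationalize this is to keep a small library of prompt amendments keyed by error category and append the ones that target your most frequent errors. A sketch under that assumption; the wording of each amendment is illustrative, not a tested prompt:

PROMPT_FIXES = {
    "missed_entity": (
        "Also extract internal service names, team names, and project codenames, "
        "even if they look like ordinary words."
    ),
    "hallucinated_entity": (
        "Only extract entities that are explicitly mentioned by name in the text. "
        "Do not infer entities from context."
    ),
    "wrong_type": (
        "A 'service' is a deployable component your organization runs; "
        "a 'technology' is third-party software it is built on."
    ),
    "wrong_boundary": (
        "Extract the entity name only, e.g. 'Redis', not the surrounding phrase "
        "'the Redis cluster configuration'."
    ),
}

def build_prompt(base_prompt, top_categories):
    # Append targeted instructions for the most frequent error categories
    additions = [PROMPT_FIXES[c] for c in top_categories if c in PROMPT_FIXES]
    return base_prompt + "\n\nAdditional extraction rules:\n- " + "\n- ".join(additions)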
Step 6: Build continuous monitoring.
Extraction quality can degrade over time as your document corpus evolves. Set up a weekly or monthly sampling process that measures precision, recall, and F1 on fresh samples. Track these metrics in a dashboard. Set alerts when F1 drops below your threshold (85% for most applications). Retrain or update prompts when quality degrades.
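A hedged sketch of such a periodic check, assuming you already have stratified_sample and evaluate_sample from the earlier steps, plus some annotation process for the fresh sample; the annotate and alert hooks are placeholders for your own tooling:

def periodic_quality_check(passages, extracted, annotate, alert, f1_threshold=0.85):
    # Draw a fresh stratified sample from the current corpus
    sample = stratified_sample(passages, extracted, n=150)
    # annotate() stands in for whatever human-in-the-loop process produces ground truth
    pairs = [(ext, annotate(passage)) for passage, ext in sample]
    metrics = evaluate_sample(pairs)
    if metrics["f1"] < f1_threshold:
        # alert() stands in for your dashboard or paging integration
        alert(f"Extraction F1 dropped to {metrics['f1']:.2f} "
              f"(threshold {f1_threshold})")
    return metrics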
Target metrics: For entity extraction, aim for 85%+ F1 in production. For relationship extraction, 75%+ F1 is realistic. First-pass extraction typically achieves 70 to 80% F1; two or three prompt iterations bring it to 85 to 90%.

Adaptive Recall validates and refines entity extraction continuously as part of its memory consolidation process. Entities are extracted, checked against the existing inventory, and corrected over time as more evidence accumulates.
