
How to Resolve Entity Conflicts and Duplicates

Entity deduplication and conflict resolution is the process of identifying when different extracted names refer to the same real-world entity and merging them into a single canonical entry. "PostgreSQL," "Postgres," "PG," and "the primary database" might all refer to the same thing. Without resolution, your knowledge graph contains multiple disconnected nodes for a single entity, fragmenting the relationships that make graph traversal useful.

Why Deduplication Matters

A knowledge graph with duplicate entities has fragmented connections. If "PostgreSQL" has 10 relationships and "Postgres" has 8 different relationships, but they refer to the same database, your graph is missing the fact that this single entity has 18 connections. Traversal from either name finds only half the relevant information. The problem compounds at scale: a graph with 5% duplicate entities can have 15 to 20% fragmented relationships because each duplicate creates missing connections in both directions.

Step-by-Step Process

Step 1: Detect exact and near duplicates.
Start with the cheapest matching. Case-insensitive exact match catches "Redis" and "redis." Whitespace and punctuation normalization catches "check-out service" and "checkout service." Then apply string similarity (SequenceMatcher, Levenshtein distance) with a threshold of 0.85 to catch minor spelling variations and abbreviation differences.
from difflib import SequenceMatcher
import re

def normalize_name(name):
    # Lowercase, strip punctuation, and collapse whitespace so surface
    # variants like "check-out service" and "Checkout Service" normalize
    # to the same string
    name = name.lower().strip()
    name = re.sub(r'[^a-z0-9\s]', ' ', name)
    name = re.sub(r'\s+', ' ', name)
    return name

def find_near_duplicates(entities, threshold=0.85):
    # Greedy grouping: each entity joins the first existing group whose
    # canonical key is similar enough, otherwise it starts a new group
    groups = {}
    for entity in entities:
        normalized = normalize_name(entity["name"])
        matched = False
        for canonical in groups:
            sim = SequenceMatcher(None, normalized, canonical).ratio()
            if sim >= threshold:
                groups[canonical].append(entity)
                matched = True
                break
        if not matched:
            groups[normalized] = [entity]
    return groups
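To make the grouping concrete, here is a minimal usage sketch (the entity dicts are illustrative; real extraction output will carry more fields):

entities = [
    {"name": "Redis", "aliases": []},
    {"name": "redis", "aliases": []},
    {"name": "check-out service", "aliases": []},
    {"name": "checkout service", "aliases": []},
]

groups = find_near_duplicates(entities)
for canonical, members in groups.items():
    print(canonical, "->", [e["name"] for e in members])
# redis -> ['Redis', 'redis']
# check out service -> ['check-out service', 'checkout service']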
Step 2: Apply alias-based matching.
During extraction, you collected aliases for each entity. Check every alias against every canonical name. If an alias matches an existing canonical name, the entities should be merged. Also check common abbreviation patterns: acronyms (American Broadcasting Company / ABC), version suffixes (React / React 18), and organizational prefixes (Google Cloud Storage / Cloud Storage / GCS).
def match_by_aliases(entities):
    # Index every canonical name, then check each alias against the index;
    # a hit means the two entities should be merged
    name_to_entity = {}
    for entity in entities:
        key = normalize_name(entity["name"])
        name_to_entity[key] = entity
        for alias in entity.get("aliases", []):
            alias_key = normalize_name(alias)
            if alias_key in name_to_entity:
                # Guard against self-matches when an alias repeats
                # the entity's own name
                if name_to_entity[alias_key] is not entity:
                    yield (entity, name_to_entity[alias_key])
            else:
                name_to_entity[alias_key] = entity
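The alias matcher alone does not cover the abbreviation patterns mentioned above. One possible heuristic, sketched here, generates an acronym from each multi-word name and strips trailing version tokens; both helpers are assumptions for illustration, and a curated alias list remains more reliable for names like "GCS":

import re

def acronym_of(name):
    # Initials of a multi-word name,
    # e.g. "American Broadcasting Company" -> "abc"
    words = normalize_name(name).split()
    return "".join(w[0] for w in words) if len(words) > 1 else None

def strip_version_suffix(name):
    # Drop a trailing version token, e.g. "React 18" -> "React"
    return re.sub(r'\s+v?\d+(\.\d+)*$', '', name.strip())

def match_by_acronym(entities):
    # Map each generated acronym to the entity whose name produced it
    acronyms = {}
    for entity in entities:
        acr = acronym_of(entity["name"])
        if acr:
            acronyms[acr] = entity
    # A short name equal to another entity's acronym is a merge candidate
    for entity in entities:
        key = normalize_name(entity["name"])
        if key in acronyms and acronyms[key] is not entity:
            yield (entity, acronyms[key])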
Step 3: Use embedding similarity for harder cases.
String matching misses semantic equivalences like "message queue" and "event bus" or "load balancer" and "traffic distributor." Embed entity names using a sentence embedding model and flag pairs with cosine similarity above 0.90 as potential duplicates for review. This step produces candidates, not automatic merges, because high embedding similarity does not always mean the same entity.
import numpy as np

def find_semantic_duplicates(entities, embeddings, threshold=0.90):
    # Compare every pair of entity-name embeddings by cosine similarity
    # and return candidate pairs sorted from most to least similar
    candidates = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            sim = np.dot(embeddings[i], embeddings[j]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j])
            )
            if sim >= threshold:
                candidates.append((entities[i], entities[j], float(sim)))
    return sorted(candidates, key=lambda x: -x[2])
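find_semantic_duplicates expects precomputed embeddings. A sketch of producing them with the sentence-transformers library, assuming it is installed (the model name is one common choice, not a requirement):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
names = [e["name"] for e in entities]
embeddings = model.encode(names)  # one vector per entity name

# Review the strongest candidates first
for a, b, sim in find_semantic_duplicates(entities, embeddings)[:10]:
    print(f'{a["name"]} <-> {b["name"]}: {sim:.2f}')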
Step 4: Resolve type conflicts.
The same entity might be classified as different types in different contexts. "Kubernetes" might be extracted as "Technology" in one passage and "Infrastructure" in another. Apply majority voting: the type assigned most frequently across all extractions wins. When types are tied, prefer the more specific type (Service over Technology, Infrastructure over Concept). Document your type hierarchy so precedence rules are consistent.
from collections import Counter

def resolve_type(entity_group):
    # Majority vote across all extractions of this entity
    type_counts = Counter(e["type"] for e in entity_group)
    if len(type_counts) == 1:
        return type_counts.most_common(1)[0][0]
    most_common = type_counts.most_common()
    if most_common[0][1] > most_common[1][1]:
        return most_common[0][0]
    # Tie-breaking: prefer more specific types
    TYPE_PRIORITY = [
        "Service", "Infrastructure", "Library", "Organization",
        "Person", "Technology", "Concept",
    ]
    for ptype in TYPE_PRIORITY:
        if ptype in type_counts:
            return ptype
    return most_common[0][0]
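For example, a 2-1 split resolves by majority, while a 1-1 tie falls through to the priority list:

group = [
    {"name": "Kubernetes", "type": "Technology"},
    {"name": "Kubernetes", "type": "Infrastructure"},
    {"name": "Kubernetes", "type": "Infrastructure"},
]
print(resolve_type(group))      # "Infrastructure" wins 2-1
print(resolve_type(group[:2]))  # tie: "Infrastructure" outranks "Technology"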
Step 5: Merge into canonical entries.
For each group of duplicates, create a single canonical entity. Use the most complete name as the canonical name (prefer "PostgreSQL" over "PG"). Merge all alias lists. Take the maximum confidence score. Combine all source document references. Redirect all relationships from duplicate entities to the canonical entity.
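A sketch of the merge, assuming each entity carries id, name, aliases, confidence, and sources fields, and relationships are dicts with source and target ids (adapt the field names to your schema):

def merge_entities(group, relationships):
    # Longest name as a proxy for "most complete": "PostgreSQL" over "PG"
    canonical = max(group, key=lambda e: len(e["name"]))
    duplicate_ids = {e["id"] for e in group if e is not canonical}
    merged = {
        "id": canonical["id"],
        "name": canonical["name"],
        "type": resolve_type(group),
        # Every other surface form survives as an alias
        "aliases": sorted(
            {a for e in group for a in e.get("aliases", [])}
            | {e["name"] for e in group if e is not canonical}
        ),
        "confidence": max(e.get("confidence", 0.0) for e in group),
        "sources": sorted({s for e in group for s in e.get("sources", [])}),
    }
    # Redirect relationship endpoints from duplicates to the canonical id
    for rel in relationships:
        if rel["source"] in duplicate_ids:
            rel["source"] = canonical["id"]
        if rel["target"] in duplicate_ids:
            rel["target"] = canonical["id"]
    return merged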
Step 6: Build a living entity inventory.
Maintain a reference list of all known canonical entities with their aliases. When new text is extracted, check entities against this inventory before creating new nodes. If a newly extracted entity name matches an existing canonical name or alias, link to the existing entity rather than creating a new one. This prevents duplicates from accumulating over time.
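A minimal inventory sketch under the same assumed schema; resolve_or_create links to an existing canonical entity on a name or alias hit and registers the new entity otherwise:

class EntityInventory:
    def __init__(self):
        self.lookup = {}  # normalized name or alias -> canonical entity

    def add(self, entity):
        self.lookup[normalize_name(entity["name"])] = entity
        for alias in entity.get("aliases", []):
            self.lookup[normalize_name(alias)] = entity

    def resolve_or_create(self, extracted):
        # Link to an existing entity if any surface form matches
        for form in [extracted["name"], *extracted.get("aliases", [])]:
            hit = self.lookup.get(normalize_name(form))
            if hit is not None:
                return hit
        # Otherwise register the newly extracted entity as canonical
        self.add(extracted)
        return extracted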
LLM-based disambiguation: For genuinely ambiguous cases (is "Mercury" the planet or the internal service?), batch the candidates and ask an LLM to judge, providing the context passages where each mention appeared. This is expensive for large batches but highly accurate for the hardest 5 to 10% of cases that automated methods cannot resolve.
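A sketch of the batched judgment call; call_llm is a placeholder for whichever LLM client you use, and the context field on each entity is an assumption about where the mention passages were stored:

import json

def build_judgment_prompt(candidates):
    # One prompt covering a small batch of ambiguous pairs,
    # with the passages where each mention appeared
    lines = ["For each numbered pair, decide whether both mentions refer to",
             "the same real-world entity. Reply as a JSON list of booleans."]
    for i, (a, b, _) in enumerate(candidates, 1):
        lines.append(f'{i}. "{a["name"]}" (context: {a.get("context", "")}) '
                     f'vs "{b["name"]}" (context: {b.get("context", "")})')
    return "\n".join(lines)

def judge_with_llm(candidates, call_llm):
    reply = call_llm(build_judgment_prompt(candidates))
    return list(zip(candidates, json.loads(reply)))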

Adaptive Recall handles entity deduplication automatically. As you store memories, entities are matched against the existing inventory and merged when duplicates are detected, so the knowledge graph stays clean without manual resolution.
