How to Resolve Entity Conflicts and Duplicates
Why Deduplication Matters
A knowledge graph with duplicate entities has fragmented connections. If "PostgreSQL" has 10 relationships and "Postgres" has 8 different relationships, but they refer to the same database, your graph is missing the fact that this single entity has 18 connections. Traversal from either name finds only half the relevant information. The problem compounds at scale: a graph with 5% duplicate entities can have 15 to 20% fragmented relationships because each duplicate creates missing connections in both directions.
Step-by-Step Process
Start with the cheapest matching. Case-insensitive exact match catches "Redis" and "redis." Whitespace and punctuation normalization catches "check-out service" and "checkout service." Then apply string similarity (SequenceMatcher, Levenshtein distance) with a threshold of 0.85 to catch minor spelling variations and abbreviation differences.
```python
from difflib import SequenceMatcher
import re

def normalize_name(name):
    name = name.lower().strip()
    name = re.sub(r'[^a-z0-9\s]', ' ', name)
    name = re.sub(r'\s+', ' ', name)
    return name

def find_near_duplicates(entities, threshold=0.85):
    groups = {}
    for entity in entities:
        normalized = normalize_name(entity["name"])
        matched = False
        for canonical in groups:
            sim = SequenceMatcher(None, normalized, canonical).ratio()
            if sim >= threshold:
                groups[canonical].append(entity)
                matched = True
                break
        if not matched:
            groups[normalized] = [entity]
    return groups
```

During extraction, you collected aliases for each entity. Check every alias against every canonical name. If an alias matches an existing canonical name, the entities should be merged. Also check common abbreviation patterns: acronyms (American Broadcasting Company / ABC), version suffixes (React / React 18), and organizational prefixes (Google Cloud Storage / Cloud Storage / GCS).
```python
def match_by_aliases(entities):
    name_to_entity = {}
    for entity in entities:
        key = normalize_name(entity["name"])
        name_to_entity[key] = entity
        for alias in entity.get("aliases", []):
            alias_key = normalize_name(alias)
            if alias_key in name_to_entity:
                # This alias collides with a name or alias seen earlier:
                # the two entities are merge candidates.
                yield (entity, name_to_entity[alias_key])
            else:
                name_to_entity[alias_key] = entity
```

String matching misses semantic equivalences like "message queue" and "event bus" or "load balancer" and "traffic distributor." Embed entity names using a sentence embedding model and flag pairs with cosine similarity above 0.90 as potential duplicates for review. This step produces candidates, not automatic merges, because high embedding similarity does not always mean the same entity.
```python
import numpy as np

def find_semantic_duplicates(entities, embeddings, threshold=0.90):
    candidates = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            # Cosine similarity between the two name embeddings
            sim = np.dot(embeddings[i], embeddings[j]) / (
                np.linalg.norm(embeddings[i]) *
                np.linalg.norm(embeddings[j])
            )
            if sim >= threshold:
                candidates.append((
                    entities[i], entities[j], float(sim)
                ))
    # Highest-similarity pairs first, ready for human review
    return sorted(candidates, key=lambda x: -x[2])
```

The same entity might be classified as different types in different contexts. "Kubernetes" might be extracted as "Technology" in one passage and "Infrastructure" in another. Apply majority voting: the type assigned most frequently across all extractions wins. When types are tied, prefer the more specific type (Service over Technology, Infrastructure over Concept). Document your type hierarchy so precedence rules stay consistent.
```python
from collections import Counter

def resolve_type(entity_group):
    type_counts = Counter(e["type"] for e in entity_group)
    if len(type_counts) == 1:
        return type_counts.most_common(1)[0][0]
    most_common = type_counts.most_common()
    # Clear majority: take the most frequent type.
    if most_common[0][1] > most_common[1][1]:
        return most_common[0][0]
    # Tie-breaking: among the tied top types only, prefer the more specific one.
    TYPE_PRIORITY = [
        "Service", "Infrastructure", "Library",
        "Organization", "Person", "Technology", "Concept"
    ]
    top_count = most_common[0][1]
    tied = {t for t, c in type_counts.items() if c == top_count}
    for ptype in TYPE_PRIORITY:
        if ptype in tied:
            return ptype
    return most_common[0][0]
```

For each group of duplicates, create a single canonical entity. Use the most complete name as the canonical name (prefer "PostgreSQL" over "PG"). Merge all alias lists. Take the maximum confidence score. Combine all source document references. Redirect all relationships from duplicate entities to the canonical entity.
Maintain a reference list of all known canonical entities with their aliases. When new text is extracted, check entities against this inventory before creating new nodes. If a newly extracted entity name matches an existing canonical name or alias, link to the existing entity rather than creating a new one. This prevents duplicates from accumulating over time.
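A minimal sketch of that inventory check, under the same entity-dict assumptions as above; `CanonicalInventory` is a hypothetical class, not a library API, and `normalize_name` is the helper defined earlier (repeated here so the sketch is self-contained).

```python
import re

def normalize_name(name):
    name = name.lower().strip()
    name = re.sub(r'[^a-z0-9\s]', ' ', name)
    name = re.sub(r'\s+', ' ', name)
    return name

class CanonicalInventory:
    """Maps normalized canonical names and aliases to canonical entities."""

    def __init__(self):
        self.lookup = {}  # normalized name or alias -> canonical entity dict

    def register(self, entity):
        """Add a canonical entity and all of its aliases to the lookup table."""
        self.lookup[normalize_name(entity["name"])] = entity
        for alias in entity.get("aliases", []):
            self.lookup[normalize_name(alias)] = entity

    def resolve(self, name):
        """Return the existing canonical entity for this name, or None if new."""
        return self.lookup.get(normalize_name(name))
```

On each extraction pass, call `resolve` first and link to the returned entity when it exists; only create a new node (and `register` it) when `resolve` returns `None`.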
Adaptive Recall handles entity deduplication automatically. As you store memories, entities are matched against the existing inventory, merged when duplicates are detected, and the knowledge graph stays clean without manual resolution.
Try It Free