Coreference Resolution: Why It Matters for NER
The Problem: Graph Fragmentation
Natural language is full of references that point back to previously mentioned entities. Pronouns (it, they, she, he), abbreviated names (the auth service, PG, K8s), definite descriptions (the system, the database, the team), and role references (the maintainer, the lead, the service) all refer to entities that were named more fully elsewhere in the text.
Without coreference resolution, each of these mentions becomes a separate entity in the knowledge graph. A single paragraph about the authentication service might produce four separate nodes: "authentication service," "it," "the auth service," and "the system." Each node gets a subset of the relationships described in the paragraph, but no node gets all of them. Graph traversal from any one of these nodes finds incomplete information.
The magnitude of this problem is significant. In typical technical documentation, 30 to 40% of entity mentions are pronominal references (it, they, this) and another 15 to 20% are abbreviated names or definite descriptions. Without resolution, roughly half of the relationship information extracted from text is attached to orphaned nodes that cannot be reached from the canonical entity node.
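To make the fragmentation concrete, here is a toy illustration with hypothetical extracted edges; the subject strings and relationship labels are invented for the example:

```python
# Hypothetical edges extracted from one paragraph about the authentication
# service, without coreference resolution.
extracted_edges = [
    ("authentication service", "HANDLES", "login requests"),
    ("it", "VALIDATES", "JWT tokens"),
    ("the auth service", "CALLS", "user database"),
    ("the system", "EMITS", "audit logs"),
]

# Each distinct subject string becomes its own graph node, so no single
# node carries all four relationships.
nodes = {subject for subject, _, _ in extracted_edges}
print(len(nodes))  # 4 fragments instead of 1 entity
```

Traversal from "authentication service" finds only the first edge; the other three are stranded on nodes no query would start from.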
Types of Coreference
Pronominal Coreference
"The checkout service processes orders. It sends payment requests to Stripe." The pronoun "it" refers to "checkout service." This is the most common type of coreference and the one that dedicated coreference resolution models handle best. Modern models resolve pronominal coreference at 85 to 90% accuracy in well-structured text.
Abbreviated Names
"PostgreSQL stores order data. Postgres is configured for read replicas." The abbreviated name "Postgres" refers to the same entity as "PostgreSQL." This type requires either an alias list (mapping known abbreviations to canonical names) or semantic understanding that the two names refer to the same technology. NER models do not handle this type; it requires a separate resolution step.
Definite Descriptions
"The platform team maintains three services. The team meets weekly to review deployments." The definite description "the team" refers to "the platform team." This is harder to resolve because "the team" could refer to any team mentioned in a larger document. Resolving definite descriptions requires tracking which entities are currently "in focus" based on discourse structure.
Role References
"Sarah leads the platform team. The lead approved the deployment." The role reference "the lead" refers to Sarah through her role. This type requires understanding both the entity and its role, and is the hardest to resolve automatically.
Approaches to Coreference Resolution
LLM-Based Resolution
The simplest approach for most applications: include coreference resolution instructions in your entity extraction prompt. Tell the LLM to "resolve all pronouns and abbreviated names to their full canonical entity names before extracting entities." Claude and GPT-4 handle pronominal and abbreviated name coreference well within a single passage. The limitation is that coreference across distant passages (where the antecedent was several paragraphs ago) requires the full document to be in context, which increases cost.
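A minimal sketch of how such a prompt might be assembled; the instruction wording is illustrative, not a fixed recipe, and the actual API call to the LLM is omitted:

```python
# Fold coreference resolution instructions into the extraction prompt.
# The exact wording here is an example, not a prescribed prompt.
RESOLUTION_INSTRUCTIONS = (
    "Before extracting entities, resolve all pronouns and abbreviated "
    "names to their full canonical entity names. Never emit 'it', 'they', "
    "or a short form such as 'Postgres' when the full name appears in the text."
)

def build_extraction_prompt(passage: str) -> str:
    return (
        f"{RESOLUTION_INSTRUCTIONS}\n\n"
        "Extract entities and relationships from the following passage:\n\n"
        f"{passage}"
    )

prompt = build_extraction_prompt(
    "The checkout service processes orders. It sends payment requests to Stripe."
)
```

The prompt is then sent to the model of your choice; for long documents, remember that the antecedent must be inside the same context window as the pronoun.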
Dedicated Coreference Models
For high-throughput pipelines, dedicated coreference models run locally without API costs. spaCy's coreferee extension and Hugging Face's coreference models resolve pronominal references at 80 to 87% accuracy. These models run a coreference resolution pass before entity extraction, replacing pronouns with their antecedents so the NER model sees explicit entity names.
Alias-Based Resolution
For abbreviated names, maintain an alias map that links known short forms to canonical names. During entity extraction, look up every extracted name in the alias map and normalize to the canonical form. This handles known abbreviations perfectly but does not help with previously unseen abbreviations. Combining alias-based resolution with LLM-based resolution for novel abbreviations covers most cases.
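A minimal normalization sketch; the alias entries below are illustrative examples, and a production map would be loaded from configuration:

```python
# Map known short forms (lowercased) to canonical names.
ALIASES = {
    "postgres": "PostgreSQL",
    "pg": "PostgreSQL",
    "k8s": "Kubernetes",
    "the auth service": "authentication service",
}

def normalize(name: str) -> str:
    """Return the canonical name for a known alias; pass unknowns through."""
    return ALIASES.get(name.strip().lower(), name)

print(normalize("Postgres"))  # PostgreSQL
print(normalize("K8s"))       # Kubernetes
print(normalize("Redis"))     # Redis (unseen names are unchanged)
```

Because unknown names pass through unchanged, this step is safe to apply to every extracted entity, with the LLM handling whatever the map misses.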
Practical Implementation
The most effective approach for production systems is a three-stage pipeline:
First, run a dedicated coreference model on the text to resolve pronouns. This is fast and cheap. Replace pronouns with their resolved antecedents in the text, or annotate the text with coreference chains.
Second, run entity extraction on the coreference-resolved text. Because pronouns have been replaced with entity names, the extraction step sees explicit entity references and produces fewer orphaned nodes.
Third, run deduplication with alias matching to merge any remaining duplicates from abbreviated names or definite descriptions that the coreference model did not resolve.
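The three stages compose naturally as a function pipeline. This is a structural sketch only: the toy stand-ins below exist to show the data flow, and real components (a coreference model, an NER model, alias-aware dedup) would be plugged in instead:

```python
# Three-stage pipeline: coref resolution -> entity extraction -> dedup.
def run_pipeline(text, resolve_pronouns, extract_entities, deduplicate):
    resolved = resolve_pronouns(text)      # stage 1: cheap local coref pass
    entities = extract_entities(resolved)  # stage 2: NER on explicit names
    return deduplicate(entities)           # stage 3: alias-aware merge

# Toy stand-ins (hypothetical) to demonstrate the flow:
entities = run_pipeline(
    "The checkout service processes orders. It calls Stripe.",
    resolve_pronouns=lambda t: t.replace("It ", "The checkout service "),
    extract_entities=lambda t: [w for w in ["checkout service", "Stripe"] if w in t],
    deduplicate=lambda es: sorted(set(es)),
)
print(entities)  # ['Stripe', 'checkout service']
```

Keeping the stages as separate, swappable functions makes it easy to upgrade one component, for example replacing a local coreference model with an LLM pass, without touching the rest of the pipeline.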
# Example coreference resolution with spaCy + coreferee
# (coreferee targets specific spaCy 3.x versions; check compatibility first)
import spacy
import coreferee

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("coreferee")

def resolve_coreferences(text):
    doc = nlp(text)
    resolved = [token.text for token in doc]
    for chain in doc._.coref_chains:
        # coreferee flags the most specific mention, usually the full name
        canonical = chain.mentions[chain.most_specific_mention_index]
        canonical_text = doc[canonical.token_indexes[0]:canonical.token_indexes[-1] + 1].text
        for mention in chain.mentions:
            if mention is canonical:
                continue
            first, *rest = mention.token_indexes
            resolved[first] = canonical_text  # substitute the canonical name
            for idx in rest:
                resolved[idx] = ""  # blank remaining tokens of the mention
    return " ".join(t for t in resolved if t)

Adaptive Recall handles coreference resolution as part of its entity extraction pipeline. When a memory is stored, references are resolved to canonical entity names before entities are extracted and added to the knowledge graph. This ensures that relationship information expressed through pronouns and abbreviations is correctly attributed to the right entity node, maintaining graph integrity as the memory system grows.
Adaptive Recall resolves coreferences automatically during entity extraction, so your knowledge graph stays clean and connected.
Try It Free