Entity Extraction and NER for AI

Entity extraction is the process of identifying meaningful things in text: people, organizations, technologies, locations, concepts, and the relationships between them. It is the first step in turning unstructured text into structured knowledge that AI systems can reason over. Without entity extraction, a memory system stores text blobs that can only be found by vocabulary overlap. With entity extraction, the system builds a map of what exists, how things connect, and what matters, enabling retrieval that follows relationships instead of matching words.

What Entity Extraction Is and Why It Matters

Entity extraction transforms unstructured text into structured data by identifying the specific things mentioned and how they relate to each other. When a developer writes "the authentication service depends on Redis for session storage and is maintained by the platform team," entity extraction identifies three entities (authentication service, Redis, platform team) and two relationships (depends on for session storage, maintained by). These structured facts become the foundation for knowledge graphs, intelligent retrieval, and reasoning that goes beyond keyword matching.

The reason entity extraction matters for AI applications is that it bridges the gap between how humans write and how machines search. Humans write in natural language full of implicit connections, pronouns, and context that other humans understand effortlessly. Machines need explicit structure. When you store a memory that says "we migrated the checkout flow from Stripe to Braintree last quarter," entity extraction turns that into structured facts: checkout flow uses Braintree (current), checkout flow previously used Stripe (historical), and a migration event occurred. Without this extraction, the only way to find this information is by searching for the exact words, which fails when someone later asks "what payment processor do we use" because neither "payment" nor "processor" appears in the original text.

Entity extraction is the first step in every knowledge graph construction pipeline, every GraphRAG implementation, and every AI system that maintains structured knowledge about its domain. It is also the step where accuracy matters most, because downstream components (relationship extraction, graph traversal, retrieval scoring) amplify both the quality and the errors of the extracted entities. An entity missed during extraction is a node missing from the graph, invisible to traversal no matter how sophisticated the graph algorithms are.

The field has evolved rapidly since 2023. Traditional named entity recognition (NER) models handle well-defined entity types (person, organization, location) with high accuracy on those types but struggle with domain-specific entities. LLM-based extraction handles arbitrary entity types without training data but costs more per document. The practical choice depends on your domain, your volume, and whether your entity types are standard or specialized. Most production systems in 2026 use a hybrid approach: a fast NER model for standard entity types, supplemented by an LLM for domain-specific entities and relationship extraction.

Named Entity Recognition: The Foundation

Named entity recognition is the NLP task of identifying and classifying entities in text into predefined categories. The standard categories come from information extraction research: PERSON (people's names), ORG (organizations), GPE (geopolitical entities like countries and cities), DATE, TIME, MONEY, and QUANTITY. Modern NER models extend this to include PRODUCT, EVENT, LAW, LANGUAGE, and WORK_OF_ART. These categories cover the entity types that appear most frequently in general text.

NER models work by processing text token by token and predicting whether each token is part of an entity and, if so, what type. The standard labeling scheme is BIO (Beginning, Inside, Outside): the first token of an entity gets a B-TYPE label, subsequent tokens get I-TYPE labels, and non-entity tokens get O. For example, in "John Smith works at Acme Corp," "John" is B-PERSON, "Smith" is I-PERSON, "works" and "at" are O, "Acme" is B-ORG, and "Corp" is I-ORG.
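The labeling in that example can be reproduced in a few lines of plain Python. This sketch converts token-level entity spans to BIO labels; the span format `(start, end, type)` with an exclusive end index is an assumption for illustration, not a standard interchange format:

```python
def bio_labels(tokens, entities):
    """Convert (start, end, type) token spans to BIO labels.

    start is inclusive, end is exclusive, matching Python slicing.
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"          # continuation tokens
    return labels

tokens = ["John", "Smith", "works", "at", "Acme", "Corp"]
entities = [(0, 2, "PERSON"), (4, 6, "ORG")]
# → ["B-PERSON", "I-PERSON", "O", "O", "B-ORG", "I-ORG"]
```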

The most widely used NER implementations in 2026 are SpaCy, which provides fast, efficient NER with pre-trained models for multiple languages, and transformer-based models from Hugging Face, which provide higher accuracy at the cost of more computation. SpaCy's en_core_web_trf model achieves 89.8% F1 on the OntoNotes benchmark, while fine-tuned BERT models reach 92 to 93% F1 on the same benchmark. For most applications, SpaCy provides sufficient accuracy with much better throughput, processing thousands of documents per second on a single CPU.

The limitation of traditional NER is its fixed entity types. If your domain uses entities that do not map to standard NER categories, like "microservice," "API endpoint," "deployment environment," or "database table," a pre-trained NER model will not recognize them. You can fine-tune a NER model on annotated examples of your domain-specific entities, which requires 200 to 500 labeled examples per entity type to achieve usable accuracy. Alternatively, you can use LLM-based extraction, which handles arbitrary entity types through prompt engineering rather than training data.

LLM-Based Extraction: The Modern Approach

LLM-based entity extraction uses a large language model to identify entities and their types from a text passage. Instead of training a model on labeled examples, you write a prompt that describes what to extract and the LLM applies its understanding of language and world knowledge to identify entities. This approach became practical with GPT-4 and Claude in 2023, and by 2026 it handles most extraction tasks with accuracy comparable to or better than fine-tuned NER models.

The core advantage of LLM-based extraction is flexibility. You can extract any entity type by describing it in the prompt. "Extract all technologies, services, teams, and infrastructure components" works without any training data. You can adjust the extraction scope by modifying the prompt, add new entity types instantly, and handle ambiguous cases by providing examples in the prompt. This makes LLM-based extraction the pragmatic choice for applications where entity types are domain-specific or evolve over time.

A typical extraction prompt asks the LLM to identify entities, classify their types, provide canonical names, list aliases found in the text, and return the results as structured JSON. The structured output requirement is important because downstream processing (deduplication, graph construction, storage) needs machine-readable data, not free-form text. Claude and GPT-4 both produce reliable JSON output when the prompt specifies the expected format clearly.

EXTRACT_PROMPT = """Extract all entities from the text below. For each entity, return:
- name: canonical full name
- type: one of Person, Organization, Technology, Service, Concept, Location
- aliases: other names used in the text for this entity

Return as a JSON array. Only include entities explicitly mentioned.

Text: {text}"""
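Downstream code then needs to parse that JSON defensively. A minimal sketch, assuming a caller-supplied `call_llm` function (any client that takes a prompt string and returns the model's text; not a real library API):

```python
import json

def extract_entities(text, prompt_template, call_llm):
    """Run an extraction prompt through an LLM and parse the JSON result.

    call_llm is a caller-supplied function: prompt string in, model text out.
    """
    raw = call_llm(prompt_template.format(text=text)).strip()
    # Some models wrap JSON in markdown fences; strip them before parsing.
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    entities = json.loads(raw)
    # Validate the shape downstream stages (dedup, graph construction) expect.
    for entity in entities:
        if not {"name", "type", "aliases"} <= entity.keys():
            raise ValueError(f"malformed entity: {entity}")
    return entities
```

Failing loudly on malformed entities is deliberate: a silently dropped field surfaces much later as a hole in the graph, where it is far harder to diagnose.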

The trade-off is cost and latency. Processing a 1,000-token passage through an LLM for entity extraction costs $0.003 to $0.015 depending on the model, compared to essentially zero for a local NER model. At scale, this difference matters: extracting entities from 100,000 documents costs $300 to $1,500 with an LLM versus under $1 with SpaCy. The accuracy difference on domain-specific entities often justifies the cost, but for standard entity types (people, organizations, locations) where NER models already perform well, the LLM adds cost without proportional accuracy improvement.

The practical approach for most production systems is tiered extraction. Run a fast NER model first to identify standard entity types at near-zero cost. Then run the LLM on passages where domain-specific entities are likely, using the NER results as context to avoid re-extracting what has already been found. This reduces LLM costs by 60 to 80% while maintaining accuracy on domain-specific entities.
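As a sketch, the tiering can gate the LLM pass on a cheap heuristic such as domain-keyword presence. Everything here is illustrative: `ner_extract` and `llm_extract` are caller-supplied stand-ins for the two extraction tiers, and `DOMAIN_HINTS` is an example keyword list you would build for your own domain:

```python
DOMAIN_HINTS = {"service", "api", "endpoint", "deployment", "database", "cluster"}

def tiered_extract(passage, ner_extract, llm_extract):
    """Run fast NER on everything; call the LLM only on domain-flavored passages.

    ner_extract(text) and llm_extract(text, known) are caller-supplied
    functions returning lists of {"name": ..., "type": ...} dicts.
    """
    entities = ner_extract(passage)
    words = {w.strip(".,;:").lower() for w in passage.split()}
    if words & DOMAIN_HINTS:  # cheap trigger for the expensive LLM pass
        known = {e["name"] for e in entities}
        # Pass already-found names so the LLM is not asked to re-extract them.
        for extra in llm_extract(passage, known):
            if extra["name"] not in known:
                entities.append(extra)
    return entities
```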

From Entities to Relationships

Entities by themselves are a list of things. Relationships turn that list into a graph that supports reasoning and traversal. Relationship extraction identifies how entities in the same text connect to each other, producing triples in the form subject-predicate-object. "The checkout service depends on PostgreSQL" becomes (checkout service, depends_on, PostgreSQL). These triples are the building blocks of knowledge graphs.
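In code, triples are just tuples, and a handful of them already forms a traversable graph. A minimal sketch (the entity and predicate names are illustrative):

```python
from collections import defaultdict

def build_graph(triples):
    """Index subject -> [(predicate, object)] for forward traversal."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

triples = [
    ("checkout service", "depends_on", "PostgreSQL"),
    ("checkout service", "maintained_by", "payments team"),
    ("PostgreSQL", "runs_in", "us-east-1"),
]
graph = build_graph(triples)
# graph["checkout service"]
# → [("depends_on", "PostgreSQL"), ("maintained_by", "payments team")]
```

A production graph store adds reverse indexes, edge metadata, and persistence, but the subject-predicate-object shape is the same.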

Relationship extraction is harder than entity extraction because it requires understanding the semantic connection between two entities, not just identifying that they exist. The sentence "John mentioned Redis during the architecture review" contains two entities (John, Redis), but the relationship is weak: John mentioned Redis; he does not maintain, own, or depend on it. A good extraction pipeline distinguishes between meaningful structural relationships (depends_on, maintained_by, stores_data_in) and incidental co-occurrences (mentioned, discussed, referenced).

There are three approaches to relationship extraction, ordered by increasing accuracy and cost. Co-occurrence assumes that entities mentioned in the same sentence or paragraph are related. This produces a connected graph quickly but with noisy, untyped relationships. Pattern-based extraction uses dependency parsing to identify the grammatical relationship between entities in a sentence, then maps syntactic patterns to relationship types. This is more accurate than co-occurrence and does not require LLM calls, but it misses implicit relationships that are not expressed in a single syntactic structure. LLM-based extraction asks the model to identify typed relationships between entities, producing the highest quality triples but at the highest cost per document.
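The co-occurrence baseline is only a few lines: any two entities found in the same passage get an untyped edge. In this sketch, `entities_of` is a caller-supplied extraction function and the edge label `co_occurs_with` is an illustrative placeholder for "related somehow":

```python
from itertools import combinations

def cooccurrence_edges(passages, entities_of):
    """Connect every pair of entities that share a passage.

    entities_of(text) is a caller-supplied function returning entity names.
    Edges are sorted pairs so (a, b) and (b, a) deduplicate to one edge.
    """
    edges = set()
    for passage in passages:
        for a, b in combinations(sorted(set(entities_of(passage))), 2):
            edges.add((a, "co_occurs_with", b))
    return edges
```

The speed of this baseline is exactly why its output is noisy: it produces the same edge for "John maintains Redis" and "John mentioned Redis."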

Predicate normalization is critical regardless of the extraction approach. If your pipeline produces "uses," "utilizes," "relies on," "is built on," and "depends on" as separate predicates for what is essentially the same relationship type, graph traversal becomes unreliable. Either constrain the extraction to a controlled vocabulary of predicates (provide the list in the prompt for LLM-based extraction, or map patterns to canonical predicates for pattern-based extraction) or apply a normalization step after extraction that merges synonymous predicates.
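The post-extraction normalization step can be as simple as a synonym table mapped to a controlled vocabulary. The table below is an illustrative starting point, not a standard:

```python
# Synonymous surface forms mapped to canonical predicates (illustrative).
PREDICATE_MAP = {
    "uses": "depends_on",
    "utilizes": "depends_on",
    "relies on": "depends_on",
    "is built on": "depends_on",
    "depends on": "depends_on",
    "maintained by": "maintained_by",
    "owned by": "maintained_by",
}

def normalize_predicate(predicate):
    """Map a raw predicate to the controlled vocabulary; pass through unknowns."""
    key = predicate.strip().lower().replace("_", " ")
    return PREDICATE_MAP.get(key, key).replace(" ", "_")
```

Unknown predicates pass through unchanged so they surface in quality review rather than being silently dropped.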

Core Challenges in Entity Extraction

Coreference Resolution

Coreference resolution is the task of identifying when different expressions refer to the same entity. "The authentication service processes 10,000 requests per minute. It was built by Sarah's team last year. The auth service uses Redis for session caching." In this passage, "the authentication service," "it," and "the auth service" all refer to the same entity. Without coreference resolution, the extraction pipeline treats them as three separate entities, fragmenting the graph and losing connections.

Modern NER models include basic coreference resolution for pronouns (it, they, he, she), but struggle with abbreviated references ("the auth service" for "authentication service") and definite descriptions ("the system" for a previously mentioned specific system). LLM-based extraction handles coreference more naturally because the model understands context, but it still benefits from explicit instructions to resolve references to their canonical names.

Entity Disambiguation

The same name can refer to different entities depending on context. "Mercury" could be a planet, a chemical element, a car brand, or an internal service named Mercury. Entity disambiguation uses surrounding context to determine which entity is intended. In domain-specific text, this is usually resolvable from context, but it requires the extraction system to maintain awareness of the domain's entity inventory so it can match mentions to known entities when possible.

Nested Entities

Some entities contain other entities. "The New York Stock Exchange" contains the location "New York." "Google Cloud Platform's BigQuery" contains the organization "Google" and the product "Google Cloud Platform" as well as the service "BigQuery." Flat NER models must choose between the outer entity and the inner entity. Nested NER models (or LLM-based extraction with appropriate prompting) can identify both levels, which is important when the relationships of the outer entity differ from those of the inner entity.

Implicit Entities

Not all entities are explicitly named in text. "The deployment failed because the database was full" mentions "the database" without naming it. In a knowledge base where only one database exists, the reference is unambiguous. In a system with dozens of databases, this implicit reference cannot be resolved from the text alone. The best approach is to extract what is explicit and flag implicit references for resolution against the existing entity inventory.

Domain-Specific Extraction

General-purpose NER handles about 60% of entity extraction needs for most applications. The remaining 40% involves domain-specific entities that require specialized extraction. The challenge is that every domain has its own entity types, naming conventions, and relationship patterns.

Medical text extraction must identify diseases, symptoms, medications, procedures, anatomical terms, and lab values, each with specific terminology and abbreviation conventions. The entity "MI" means myocardial infarction in a cardiology note but could mean something entirely different in other domains. Medical NER models like BioBERT and SciSpacy are fine-tuned on biomedical text and achieve 85 to 90% F1 on medical entity types.

Legal text extraction deals with statutes, case references, parties, jurisdictions, legal concepts, and temporal scoping. "Section 230 of the Communications Decency Act" is a single entity that references another entity (the Communications Decency Act). Legal NER requires understanding of citation formats, party naming conventions, and the hierarchical structure of legal references.

Financial text extraction identifies companies, financial instruments, metrics, regulatory bodies, market events, and temporal relationships between them. The challenge is that financial entities change over time (companies merge, instruments are delisted, regulations are amended) and historical accuracy requires temporal awareness during extraction.

For software engineering domains, the relevant entity types are services, APIs, databases, teams, deployment environments, configuration settings, libraries, and programming languages. These entities are poorly covered by general NER models because they do not map to standard categories. LLM-based extraction handles them well because the models have extensive training data from technical documentation, but a controlled vocabulary of entity types in the prompt significantly improves consistency.

Building an Extraction Pipeline

A production entity extraction pipeline has five stages: preprocessing, extraction, normalization, deduplication, and storage. Each stage has specific concerns and failure modes.

Preprocessing prepares text for extraction. This includes splitting documents into passages of appropriate size (500 to 1,000 tokens for LLM-based extraction, up to 5,000 tokens for NER models), handling formatting artifacts (HTML tags, markdown, code blocks), and preserving enough context for coreference resolution. The most common preprocessing mistake is splitting too aggressively, which puts the subject of a relationship in one chunk and the object in another, making the relationship invisible to extraction.
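One way to avoid that over-splitting failure mode is overlapping chunks, so a relationship that straddles a boundary appears intact in at least one chunk. A minimal sketch over a pre-tokenized list; the default sizes are illustrative:

```python
def chunk_tokens(tokens, size=800, overlap=100):
    """Split a token list into overlapping chunks.

    Each chunk shares `overlap` tokens with the previous one, so a sentence
    cut at one boundary is whole in the neighboring chunk.
    """
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                      # last chunk reached the end
        start += size - overlap        # step back by the overlap
    return chunks
```

The cost of overlap is some duplicate extraction, which the deduplication stage absorbs; that trade is almost always worth it.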

Extraction identifies entities and relationships from preprocessed text. In a tiered system, the fast NER pass runs first, extracting standard entity types at high throughput. The LLM pass runs second on passages that contain domain-specific content, extracting specialized entities and typed relationships. The outputs from both passes are merged, with LLM results taking priority when they conflict with NER results on the same span of text.

Normalization standardizes entity names and relationship types. Entity names are normalized to a canonical form: "PostgreSQL," not "Postgres" or "PG" or "the database." Relationship predicates are mapped to a controlled vocabulary. Temporal references are resolved to specific dates or periods. This step is where most quality improvements happen in iterative refinement, because normalization errors compound through the rest of the pipeline.

Deduplication merges entities that refer to the same real-world thing. String similarity handles obvious duplicates (case differences, minor spelling variations). Alias matching catches known alternate names. For harder cases, embedding similarity or LLM-based comparison determines whether two entity names are the same thing. The output is a deduplicated entity set where each real-world entity has one canonical node with associated aliases.
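The string-similarity tier can be sketched with the standard library's `difflib`; the 0.9 threshold is an illustrative starting point to tune against your own entity names:

```python
from difflib import SequenceMatcher

def dedupe(names, threshold=0.9):
    """Greedy merge: each name joins the first canonical entry it closely matches.

    Returns {canonical_name: set_of_aliases}.
    """
    canonical = {}
    for name in names:
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical[name] = set()       # new entity, becomes its own canonical
        else:
            canonical[match].add(name)    # near-duplicate, recorded as alias
    return canonical
```

Character-level similarity will not merge "Postgres" with "PG"; pairs like that need the alias-matching or embedding tiers the paragraph above describes.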

Storage writes the extracted entities and relationships to the graph store. This involves creating entity nodes (or updating existing ones), creating relationship edges with type and confidence metadata, and maintaining provenance links back to the source text that each triple was extracted from. Provenance is important for validation and for understanding why the graph contains a particular fact.

Measuring and Improving Extraction Quality

Extraction quality is measured by precision (what percentage of extracted entities are correct) and recall (what percentage of actual entities were extracted). F1 score is the harmonic mean of precision and recall, providing a single quality metric. For most applications, you want F1 above 85% for entity extraction and above 75% for relationship extraction.

To measure quality, take a sample of 100 to 200 passages, extract entities and relationships automatically, then have a human annotator mark each extraction as correct or incorrect and note any entities or relationships the system missed. This gives you precision, recall, and F1 for both entities and relationships, plus an error analysis that tells you what to fix.
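Once the human-annotated gold set exists, the scoring itself is a few lines. This sketch compares extracted and gold entities as sets of names, a simplification that ignores span boundaries and type labels:

```python
def prf1(extracted, gold):
    """Precision, recall, and F1 over two collections of entity names."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                         # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```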

Common error patterns and their fixes: If precision is low (too many false positives), tighten your extraction prompt to require stronger evidence, or increase the confidence threshold for including entities. If recall is low (too many missed entities), broaden your entity type definitions, lower the confidence threshold, or switch to an LLM-based approach that can find entities a NER model does not recognize. If relationship extraction has low precision, constrain your predicate vocabulary and require the LLM to cite the specific text that supports each relationship.

Adaptive Recall handles entity extraction automatically as part of its memory storage pipeline. When you store a memory through any of the seven tools, entities are extracted, relationships are identified, and the knowledge graph is updated. The extraction uses a tiered approach that balances cost and accuracy, and the results improve over time as the entity inventory grows and disambiguation becomes more reliable. You get production-quality entity extraction without building or maintaining the pipeline.


Add entity extraction to your AI without building the pipeline. Adaptive Recall extracts entities and relationships automatically when you store memories, building a traversable knowledge graph from every interaction.