Domain-Specific NER: Medical, Legal, Financial

General NER models recognize people, organizations, and locations. Domain-specific NER recognizes the entity types that matter for specialized fields: diseases, medications, and procedures in medicine; statutes, parties, and jurisdictions in law; instruments, metrics, and regulatory bodies in finance; and services, APIs, and databases in software engineering. Each domain has unique naming conventions, abbreviation patterns, and entity boundary rules that require specialized models or prompts.

Medical NER

Medical text is dense with specialized entities: disease names (acute myocardial infarction), medications (metoprolol 50mg), procedures (percutaneous coronary intervention), anatomical terms (left anterior descending artery), lab values (troponin 2.4 ng/mL), and medical devices (drug-eluting stent). These entities follow naming conventions that general NER models do not understand. "MI" means myocardial infarction in a cardiology note, not a U.S. state. "CAD" means coronary artery disease, not computer-aided design.

Key challenges: Heavy abbreviation use (CHF, COPD, DM2, A-fib), nested entities ("left anterior descending coronary artery" contains both an anatomical location and a specific vessel), negation ("no evidence of MI" should not extract MI as a diagnosed condition), and temporal relationships ("started metoprolol 3 days ago" requires understanding the temporal anchor).
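The negation and abbreviation challenges can be sketched with a NegEx-style heuristic: expand a known abbreviation, then flag it as negated if a negation cue appears earlier in the same clause. The abbreviation map and cue list below are toy stand-ins, not a clinical lexicon.

```python
import re

# Toy abbreviation map and negation cues -- illustrative, not a clinical lexicon.
ABBREVIATIONS = {"MI": "myocardial infarction", "CHF": "congestive heart failure"}
NEGATION_CUES = re.compile(
    r"\b(no evidence of|denies|negative for|without)\b", re.IGNORECASE
)

def extract_conditions(sentence: str) -> list[dict]:
    """Find abbreviated conditions and flag mentions inside a negated clause."""
    results = []
    for abbr, expansion in ABBREVIATIONS.items():
        for match in re.finditer(rf"\b{abbr}\b", sentence):
            # NegEx-style scope: a cue earlier in the same clause negates the mention.
            clause = re.split(r"[;,.]", sentence[: match.start()])[-1]
            results.append({
                "text": abbr,
                "expansion": expansion,
                "negated": bool(NEGATION_CUES.search(clause)),
            })
    return results
```

On "No evidence of MI; history of CHF." this flags MI as negated but keeps CHF as an active condition, because the semicolon ends the negation scope. Production systems use richer scope rules (NegEx, ConText) rather than clause splitting.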

Available models: BioBERT and PubMedBERT are transformer models pre-trained on biomedical text. scispaCy provides spaCy-compatible biomedical NER models. These achieve 85 to 90% F1 on medical entity types in clinical text. For clinical notes specifically, models trained on the i2b2 and n2c2 challenge datasets handle the abbreviation-heavy style of clinical documentation.

Practical tip: Medical NER almost always requires a terminology mapping step after extraction. Link extracted entities to standard medical ontologies (SNOMED CT, ICD-10, RxNorm) to normalize the many ways clinicians refer to the same condition or medication. This mapping step is as important as the extraction itself for building usable medical knowledge graphs.

Legal NER

Legal text contains entity types that overlap partially with general NER but have domain-specific subtypes and boundary rules. Case citations ("Brown v. Board of Education, 347 U.S. 483 (1954)") are single entities with internal structure. Statutes ("Section 230 of the Communications Decency Act") reference both a section and a containing law. Parties ("the plaintiff," "Defendant Corporation") are role-specific references that require coreference resolution to link to named parties.

Key challenges: Citation parsing (every jurisdiction has different citation formats), hierarchical entity structure (a statute section within an act within a legal code), temporal scoping (laws are amended, repealed, superseded), and cross-referencing (one legal document references dozens of other documents, each of which is an entity).
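The citation-parsing challenge can be illustrated with a simplified regex for U.S. reporter citations. This is a sketch for one common format only; real Bluebook citations vary by jurisdiction and reporter, which is why dedicated citation parsers exist.

```python
import re

# Simplified pattern for "Party v. Party, <vol> <reporter> <page> (<year>)" citations.
CITATION = re.compile(
    r"(?P<case>[A-Z][\w.'\- ]+? v\. [\w.'\- ]+?), "
    r"(?P<volume>\d+) (?P<reporter>[A-Za-z. ]+?) (?P<page>\d+) "
    r"\((?P<year>\d{4})\)"
)

m = CITATION.search("Brown v. Board of Education, 347 U.S. 483 (1954)")
```

Note that the whole citation is one entity with internal structure: the named groups recover the case name, volume, reporter, page, and year as separate fields.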

Available models: Legal-BERT is pre-trained on legal corpora and improves NER accuracy on legal entity types by 3 to 5% over general BERT. Blackstone provides spaCy-based NER for UK legal texts. For U.S. legal texts, models trained on court opinions from CourtListener achieve 82 to 87% F1 on legal entity types.

Practical tip: Legal entity extraction benefits enormously from structured source data. Court opinions, statutes, and regulations often have structured metadata (case numbers, statute identifiers, party names) available in the document headers or database records. Extract these structured fields first, then use NER on the body text for additional entities.
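A sketch of the structured-first approach, assuming a hypothetical opinion header with labeled fields. In practice this metadata usually comes from docket databases or document management systems rather than regex.

```python
import re

# Hypothetical opinion header; real metadata usually comes from docket
# databases rather than regex over body text.
header = "Case No. 21-cv-04567\nPlaintiff: Acme Corp.\nDefendant: Widget LLC"

# Pull structured fields first; run NER on the body text afterward.
case_no = re.search(r"Case No\. (\S+)", header).group(1)
parties = dict(re.findall(r"^(Plaintiff|Defendant): (.+)$", header, re.MULTILINE))
```

The party names extracted here then anchor coreference resolution in the body text, so "the plaintiff" can be linked back to "Acme Corp."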

Financial NER

Financial text references companies, financial instruments (stocks, bonds, derivatives), metrics (P/E ratio, market cap, EBITDA), regulatory bodies (SEC, FINRA, ESMA), market events (IPO, merger, earnings report), and temporal relationships that are critical for accuracy (Q3 2025 revenue vs Q3 2024 revenue). The same company may be referenced by its legal name, trading symbol, or colloquial name in a single document.

Key challenges: Ticker symbol ambiguity (AAPL is Apple, but CRM could be Salesforce or a general reference to customer relationship management), temporal precision (financial data is only meaningful with its time period), nested metrics ("year-over-year revenue growth of 12%" is a single metric entity with multiple components), and the speed at which new entities appear (new companies IPO, new instruments are created, regulations change).
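Ticker ambiguity is usually resolved with context. A toy version: treat an all-caps token as a ticker only when market-related cue words appear nearby. The cue list here is illustrative; real systems use classifiers or entity linkers.

```python
# Toy context-based disambiguation for ambiguous tickers; cue list is illustrative.
TICKER_CUES = {
    "CRM": {"shares", "stock", "NYSE", "earnings", "ticker"},
}

def is_ticker(symbol: str, sentence: str) -> bool:
    """Treat an all-caps token as a ticker only if market context is nearby."""
    words = set(sentence.replace(",", " ").replace(".", " ").split())
    return bool(TICKER_CUES.get(symbol, set()) & words)
```

"CRM shares rose after earnings" reads as the ticker; "we migrated our CRM to a new vendor" does not.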

Available models: FinBERT is pre-trained on financial text and handles financial entity types better than general BERT. Models trained on SEC filings achieve 83 to 88% F1 on financial entity types. For real-time financial text (news, earnings transcripts), LLM-based extraction is often preferred because the entity landscape changes faster than fine-tuned models can keep up.

Practical tip: Financial entity extraction should always include a normalization step that links company mentions to canonical identifiers (CUSIP, ISIN, or LEI codes). This prevents the fragmentation that occurs when the same company is mentioned as "Apple," "Apple Inc.," "AAPL," and "the Cupertino-based tech giant" across different documents.
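A minimal sketch of that normalization step: group raw mentions under a canonical identifier via an alias table. The table and LEI value below are illustrative; production systems resolve against GLEIF LEI data or an internal security master.

```python
from collections import defaultdict

# Illustrative alias table; production systems resolve against GLEIF LEI data
# or an internal security master. The LEI shown is an example value.
ALIASES = {
    "apple": "LEI:HWUPKR0MPOU8FGXBT394",
    "apple inc.": "LEI:HWUPKR0MPOU8FGXBT394",
    "aapl": "LEI:HWUPKR0MPOU8FGXBT394",
}

def group_by_canonical(mentions: list[str]) -> dict[str, list[str]]:
    """Group raw mentions under a canonical ID (or the raw text if unknown)."""
    grouped = defaultdict(list)
    for m in mentions:
        grouped[ALIASES.get(m.strip().lower(), m)].append(m)
    return dict(grouped)
```

Note what happens to a descriptive mention like "the Cupertino-based tech giant": the alias table cannot resolve it, so it stays fragmented until a coreference or linking step handles it.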

Software Engineering NER

Software engineering text references services, APIs, databases, configuration settings, libraries, programming languages, teams, deployment environments, and infrastructure components. These entity types are poorly covered by general NER models because they do not resemble the proper nouns that NER was designed for. A service name like "checkout-api" looks like a hyphenated word, not an entity. A configuration key like MAX_RETRY_COUNT looks like a constant, not an entity name.

Key challenges: Non-standard naming patterns (kebab-case, snake_case, camelCase service names), version-specific references (React 18 vs React 17 may be different entities for compatibility purposes), implicit references in code ("the handler" refers to a specific file or function), and the rapid churn of entities (services are created, renamed, and deprecated continuously).
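The naming-pattern challenge can be turned into a candidate generator: simple regexes that flag code-style identifiers general NER would skip. This is a recall-oriented sketch; the matches still need filtering against a real entity inventory.

```python
import re

# Heuristic patterns for code-style identifiers that general NER misses.
PATTERNS = {
    "kebab-case": re.compile(r"\b[a-z]+(?:-[a-z]+)+\b"),
    "SCREAMING_SNAKE": re.compile(r"\b[A-Z]+(?:_[A-Z]+)+\b"),
    "camelCase": re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]*)+\b"),
}

def find_candidates(text: str) -> dict[str, list[str]]:
    """Return candidate entity mentions grouped by naming convention."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

On "The checkout-api reads MAX_RETRY_COUNT before calling paymentService." this surfaces all three identifiers, each under its convention.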

Available models: No widely adopted pre-trained model exists specifically for software engineering NER. LLM-based extraction is the most practical approach because the entity types change per organization and the naming conventions are too diverse for a single model to cover. Fine-tuning SpaCy on organization-specific labeled data is an alternative for teams with stable entity types and high extraction volume.

Practical tip: For software engineering domains, supplement NER with structured data sources. Service registries, package.json files, infrastructure-as-code definitions, and team directories provide authoritative entity lists that can bootstrap your entity inventory. Run NER on unstructured text to discover relationships and new entities, but use structured sources as the ground truth for what entities exist.
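The ground-truth pattern above reduces to set operations: NER output is split into mentions the registry confirms and candidates queued for review. The sets below are hypothetical stand-ins for a real service registry and extractor output.

```python
# Sketch: structured sources are the ground truth; NER only proposes additions.
# These sets stand in for a real service registry and NER extractor output.
registry = {"checkout-api", "payments-db", "user-service"}
ner_mentions = {"checkout-api", "fraud-detector", "the handler"}

confirmed = ner_mentions & registry    # already known, safe to link
candidates = ner_mentions - registry   # queue for review, don't auto-create
```

This keeps hallucinated or ambiguous mentions ("the handler") out of the entity inventory until a human or a linking step confirms them.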

The Cross-Domain Pattern

Despite their differences, all domain-specific NER implementations share a common pattern: a domain-specific base model (or LLM with domain context) for extraction, a domain terminology mapping step for normalization, and a domain ontology for type hierarchy. If you are building entity extraction for a specialized domain, start with an LLM and a well-defined entity type schema. Add a domain-specific model later if volume justifies it. Always include a normalization step that links extracted names to canonical identifiers.

Adaptive Recall handles software engineering domain extraction out of the box and adapts to other domains through the entity types and relationships that naturally emerge from stored memories. As you store memories about your specific domain, the entity inventory grows organically, and the extraction system learns which entity types and naming patterns are relevant for your use case.

Whatever your domain, Adaptive Recall builds entity knowledge from your memories. The extraction adapts to your entity types as you store information.

Get Started Free