How to Fine-Tune NER for Your Domain
When to Fine-Tune vs When to Use an LLM
Fine-tuning makes sense when you process more than 10,000 documents per day, when your entity types are stable (not changing month to month), and when you need sub-second latency per document. The upfront cost is the labeling effort (20 to 40 hours for a typical domain) and training time (1 to 4 hours on a GPU). The ongoing cost is near zero because the model runs locally.
LLM-based extraction makes sense when your entity types change frequently, when you need to start extracting immediately without a labeling phase, or when your volume is low enough that API costs are manageable. The break-even point for most domains is around 5,000 to 10,000 documents: below that, LLM extraction is cheaper than the labeling effort; above that, the fine-tuned model's zero marginal cost wins.
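For a rough sense of where your own break-even sits, compare the one-time labeling investment against the per-document API cost at your volume. The figures below are illustrative assumptions (annotator rate, per-document LLM cost), not measurements; plug in your own numbers:

# Back-of-envelope break-even estimate (all cost figures are illustrative assumptions)
LABELING_HOURS = 25        # one-time annotation effort, from the 20-40 hour range above
HOURLY_RATE = 60           # assumed fully loaded cost per annotation hour, USD
LLM_COST_PER_DOC = 0.20    # assumed API cost per document for LLM extraction, USD

fixed_cost = LABELING_HOURS * HOURLY_RATE
break_even_docs = fixed_cost / LLM_COST_PER_DOC
print(f"Fine-tuning pays off after roughly {break_even_docs:,.0f} documents")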
Step-by-Step Process
Start with an annotation guide: a document that clearly defines each entity type with examples, edge cases, and explicit boundary decisions. "Technology" is too vague. "Technology: any programming language, framework, library, protocol, or tool mentioned by name, not including general concepts like 'caching' or 'load balancing'" gives annotators clear guidance. Include 5 to 10 examples per type and 5 edge cases that illustrate where types overlap.
Common domain-specific entity types for software engineering:
Entity Types:
- SERVICE: named microservices, APIs, applications (e.g., "checkout-api")
- INFRA: databases, caches, queues, cloud services (e.g., "Redis", "S3")
- TEAM: named teams, departments, squads (e.g., "platform team")
- CONFIG: configuration keys, environment variables, feature flags
- LIBRARY: named packages, SDKs, frameworks (e.g., "React 18", "FastAPI")
- ENDPOINT: API paths, routes (e.g., "/api/v2/orders")
NOT entities:
- Generic concepts: "authentication", "caching", "deployment"
- Unnamed references: "the database", "their API", "the team"

Gather 500 to 1,000 text passages from your actual domain (documentation, code reviews, incident reports, chat logs). Annotate entities in BIO format using a labeling tool. You need at least 200 labeled examples per entity type for usable accuracy, and 500 per type for good accuracy. Split your data into training (80%) and evaluation (20%) sets.
Recommended labeling tools: Prodigy ($490, integrates directly with SpaCy), Label Studio (free, open source, supports NER out of the box), and Doccano (free, simpler interface). Prodigy's active learning feature reduces labeling time by 40 to 60% by prioritizing the most informative examples for annotation.
# BIO format example:
# "The checkout-api depends on Redis for session caching"
# The O
# checkout B-SERVICE
# - I-SERVICE
# api I-SERVICE
# depends O
# on O
# Redis B-INFRA
# for O
# session O
# caching O

Fine-tuning starts from a pre-trained model that already understands language. The choice of base model affects both accuracy and inference speed.
SpaCy with transformers: Use spacy-transformers with a RoBERTa or DeBERTa base model. This gives you SpaCy's efficient pipeline with transformer-level accuracy. Fine-tuning takes 1 to 2 hours on a GPU. Inference is 200 to 500 documents per second on a CPU with the non-transformer SpaCy pipeline, or 20 to 50 documents per second with the transformer pipeline.
Hugging Face token classification: Use a BERT or DeBERTa model with a token classification head. More flexible than SpaCy for custom architectures but requires more engineering for production deployment. Fine-tuning takes 1 to 4 hours on a GPU. Accuracy is slightly higher than SpaCy on complex entity types.
Domain-specific base models: If your domain is biomedical (BioBERT, PubMedBERT), legal (Legal-BERT), or financial (FinBERT), starting from a domain-specific base model improves accuracy by 3 to 5% F1 compared to general BERT. These models understand domain vocabulary and abbreviations better because they were pre-trained on domain text.
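If you take the Hugging Face route, the starting point is a token classification head on top of the base model. A minimal sketch, assuming the entity types from the guide above and DeBERTa v3 as the base checkpoint (both assumptions; swap in a domain-specific model the same way):

# Hugging Face token classification head (sketch; label set and checkpoint are assumptions)
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O",
          "B-SERVICE", "I-SERVICE", "B-INFRA", "I-INFRA", "B-TEAM", "I-TEAM",
          "B-CONFIG", "I-CONFIG", "B-LIBRARY", "I-LIBRARY", "B-ENDPOINT", "I-ENDPOINT"]

checkpoint = "microsoft/deberta-v3-base"   # or a domain-specific base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, tokenize your BIO-labeled examples (aligning labels to subword tokens)
# and train with the standard Trainer API.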
Train on your labeled data for 10 to 30 epochs with early stopping based on evaluation F1. Evaluate on the held-out test set using per-type precision, recall, and F1. Error analysis tells you what to fix: if recall is low for a specific type, add more examples of that type; if precision is low, clarify the annotation guide and relabel ambiguous cases.
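The train.spacy and dev.spacy files referenced below are SpaCy's binary DocBin format. A minimal conversion sketch, assuming your labeling tool exports each example as a (text, spans) pair with character offsets; the variable names and the sample record are placeholders:

# Convert span-annotated examples to DocBin and make an 80/20 split (sketch)
import random
import spacy
from spacy.tokens import DocBin

examples = [
    ("The checkout-api depends on Redis for session caching",
     [(4, 16, "SERVICE"), (28, 33, "INFRA")]),
    # ... the rest of your exported annotations
]

nlp = spacy.blank("en")

def to_docbin(records):
    db = DocBin()
    for text, spans in records:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            if span is not None:       # skip spans that don't align to token boundaries
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    return db

random.shuffle(examples)
split = int(len(examples) * 0.8)
to_docbin(examples[:split]).to_disk("./train.spacy")
to_docbin(examples[split:]).to_disk("./dev.spacy")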
# SpaCy fine-tuning example
# config.cfg contains model architecture and training params
# train.spacy and dev.spacy are your labeled datasets
# Generate a config
python -m spacy init config config.cfg --lang en \
--pipeline ner --optimize accuracy
# Train the model
python -m spacy train config.cfg \
--output ./custom_ner_model \
--paths.train ./train.spacy \
--paths.dev ./dev.spacy \
--training.patience 5 \
--training.max_epochs 30
# Evaluate
python -m spacy evaluate ./custom_ner_model/model-best \
./dev.spacy --output metrics.json

Target metrics for production use: 85% F1 or higher per entity type. If a type consistently falls below 80%, either add more labeled examples, refine the annotation guide for that type, or consider using LLM-based extraction as a fallback for that specific type.
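To make that check repeatable, read the per-type scores back out of the metrics file. This sketch assumes the JSON written by spacy evaluate, which reports per-label precision, recall, and F1 under ents_per_type:

# Flag entity types whose F1 falls below the production target (sketch)
import json

TARGET_F1 = 0.85

with open("metrics.json") as f:
    metrics = json.load(f)

for label, scores in metrics.get("ents_per_type", {}).items():
    f1 = scores["f"]
    f1 = f1 / 100 if f1 > 1 else f1    # normalize in case scores are reported as percentages
    status = "OK" if f1 >= TARGET_F1 else "needs more data or guide fixes"
    print(f"{label:10s} F1={f1:.2%}  {status}")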
Package the trained model as a SpaCy pipeline or a Hugging Face model and load it in your application. Set up quality monitoring by periodically sampling predictions and having a human verify them. Retrain when accuracy drops below your threshold, which typically happens when the domain vocabulary shifts (new services, new technologies, team restructuring).
import spacy

# Load the fine-tuned pipeline produced by spacy train
nlp = spacy.load("./custom_ner_model/model-best")

def extract_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            "name": ent.text,
            "type": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char
        })
    return entities

Skip the training pipeline. Adaptive Recall uses a tuned extraction system that handles entity recognition automatically as you store memories.
Try It Free