How to Fine-Tune NER for Your Domain
When to Fine-Tune vs When to Use an LLM
Fine-tuning makes sense when you process more than 10,000 documents per day, when your entity types are stable (not changing month to month), and when you need sub-second latency per document. The upfront cost is the labeling effort (20 to 40 hours for a typical domain) and training time (1 to 4 hours on a GPU). The ongoing cost is near zero because the model runs locally.
LLM-based extraction makes sense when your entity types change frequently, when you need to start extracting immediately without a labeling phase, or when your volume is low enough that API costs are manageable. The break-even point for most domains is around 5,000 to 10,000 documents: below that, LLM extraction is cheaper than the labeling effort; above that, the fine-tuned model's zero marginal cost wins.
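For a rough sense of where your own break-even sits, compare the one-time labeling investment against the per-document API cost at your volume. The figures below are illustrative assumptions (annotator rate, per-document LLM cost), not measurements; plug in your own numbers:

# Back-of-envelope break-even estimate (all cost figures are illustrative assumptions)
LABELING_HOURS = 25        # one-time annotation effort, from the 20-40 hour range above
HOURLY_RATE = 60           # assumed fully loaded cost per annotation hour, USD
LLM_COST_PER_DOC = 0.20    # assumed API cost per document for LLM extraction, USD

fixed_cost = LABELING_HOURS * HOURLY_RATE
break_even_docs = fixed_cost / LLM_COST_PER_DOC
print(f"Fine-tuning pays off after roughly {break_even_docs:,.0f} documents")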
Step-by-Step Process
Start with an annotation guide: a document that clearly defines each entity type with examples, edge cases, and explicit boundary decisions. "Technology" is too vague. "Technology: any programming language, framework, library, protocol, or tool mentioned by name, not including general concepts like 'caching' or 'load balancing'" gives annotators clear guidance. Include 5 to 10 examples per type and 5 edge cases that illustrate where types overlap.
Common domain-specific entity types for software engineering:
Entity Types:
- SERVICE: named microservices, APIs, applications (e.g., "checkout-api")
- INFRA: databases, caches, queues, cloud services (e.g., "Redis", "S3")
- TEAM: named teams, departments, squads (e.g., "platform team")
- CONFIG: configuration keys, environment variables, feature flags
- LIBRARY: named packages, SDKs, frameworks (e.g., "React 18", "FastAPI")
- ENDPOINT: API paths, routes (e.g., "/api/v2/orders")
NOT entities:
- Generic concepts: "authentication", "caching", "deployment"
- Unnamed references: "the database", "their API", "the team"

Gather 500 to 1,000 text passages from your actual domain (documentation, code reviews, incident reports, chat logs). Annotate entities in BIO format using a labeling tool. You need at least 200 labeled examples per entity type for usable accuracy, and 500 per type for good accuracy. Split your data into training (80%) and evaluation (20%) sets.
Recommended labeling tools: Prodigy ($490, integrates directly with SpaCy), Label Studio (free, open source, supports NER out of the box), and Doccano (free, simpler interface). Prodigy's active learning feature reduces labeling time by 40 to 60% by prioritizing the most informative examples for annotation.
# BIO format example:
# "The checkout-api depends on Redis for session caching"
# The O
# checkout B-SERVICE
# - I-SERVICE
# api I-SERVICE
# depends O
# on O
# Redis B-INFRA
# for O
# session O
# caching O

Fine-tuning starts from a pre-trained model that already understands language. The choice of base model affects both accuracy and inference speed.
SpaCy with transformers: Use spacy-transformers with a RoBERTa or DeBERTa base model. This gives you SpaCy's efficient pipeline with transformer-level accuracy. Fine-tuning takes 1 to 2 hours on a GPU. Inference is 200 to 500 documents per second on a CPU with the non-transformer SpaCy pipeline, or 20 to 50 documents per second with the transformer pipeline.
Hugging Face token classification: Use a BERT or DeBERTa model with a token classification head. More flexible than SpaCy for custom architectures but requires more engineering for production deployment. Fine-tuning takes 1 to 4 hours on a GPU. Accuracy is slightly higher than SpaCy on complex entity types.
Domain-specific base models: If your domain is biomedical (BioBERT, PubMedBERT), legal (Legal-BERT), or financial (FinBERT), starting from a domain-specific base model improves accuracy by 3 to 5% F1 compared to general BERT. These models understand domain vocabulary and abbreviations better because they were pre-trained on domain text.
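If you take the Hugging Face route, the starting point is a token classification head on top of the base model. A minimal sketch, assuming the entity types from the guide above and DeBERTa v3 as the base checkpoint (both assumptions; swap in a domain-specific model the same way):

# Hugging Face token classification head (sketch; label set and checkpoint are assumptions)
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O",
          "B-SERVICE", "I-SERVICE", "B-INFRA", "I-INFRA", "B-TEAM", "I-TEAM",
          "B-CONFIG", "I-CONFIG", "B-LIBRARY", "I-LIBRARY", "B-ENDPOINT", "I-ENDPOINT"]

checkpoint = "microsoft/deberta-v3-base"   # or a domain-specific base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, tokenize your BIO-labeled examples (aligning labels to subword tokens)
# and train with the standard Trainer API.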
Train on your labeled data for 10 to 30 epochs with early stopping based on evaluation F1. Evaluate on the held-out test set using per-type precision, recall, and F1. Error analysis tells you what to fix: if recall is low for a specific type, add more examples of that type; if precision is low, clarify the annotation guide and relabel ambiguous cases.
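The train.spacy and dev.spacy files referenced below are SpaCy's binary DocBin format. A minimal conversion sketch, assuming your labeling tool exports each example as a (text, spans) pair with character offsets; the variable names and the sample record are placeholders:

# Convert span-annotated examples to DocBin and make an 80/20 split (sketch)
import random
import spacy
from spacy.tokens import DocBin

examples = [
    ("The checkout-api depends on Redis for session caching",
     [(4, 16, "SERVICE"), (28, 33, "INFRA")]),
    # ... the rest of your exported annotations
]

nlp = spacy.blank("en")

def to_docbin(records):
    db = DocBin()
    for text, spans in records:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            if span is not None:       # skip spans that don't align to token boundaries
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    return db

random.shuffle(examples)
split = int(len(examples) * 0.8)
to_docbin(examples[:split]).to_disk("./train.spacy")
to_docbin(examples[split:]).to_disk("./dev.spacy")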
# SpaCy fine-tuning example
# config.cfg contains model architecture and training params
# train.spacy and dev.spacy are your labeled datasets
# Generate a config
python -m spacy init config config.cfg --lang en \
--pipeline ner --optimize accuracy
# Train the model
python -m spacy train config.cfg \
--output ./custom_ner_model \
--paths.train ./train.spacy \
--paths.dev ./dev.spacy \
--training.patience 5 \
--training.max_epochs 30
# Evaluate
python -m spacy evaluate ./custom_ner_model/model-best \
./dev.spacy --output metrics.json

Target metrics for production use: 85% F1 or higher per entity type. If a type consistently falls below 80%, either add more labeled examples, refine the annotation guide for that type, or consider using LLM-based extraction as a fallback for that specific type.
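To make that check repeatable, read the per-type scores back out of the metrics file. This sketch assumes the JSON written by spacy evaluate, which reports per-label precision, recall, and F1 under ents_per_type:

# Flag entity types whose F1 falls below the production target (sketch)
import json

TARGET_F1 = 0.85

with open("metrics.json") as f:
    metrics = json.load(f)

for label, scores in metrics.get("ents_per_type", {}).items():
    f1 = scores["f"]
    f1 = f1 / 100 if f1 > 1 else f1    # normalize in case scores are reported as percentages
    status = "OK" if f1 >= TARGET_F1 else "needs more data or guide fixes"
    print(f"{label:10s} F1={f1:.2%}  {status}")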
Package the trained model as a SpaCy pipeline or a Hugging Face model and load it in your application. Set up quality monitoring by periodically sampling predictions and having a human verify them. Retrain when accuracy drops below your threshold, which typically happens when the domain vocabulary shifts (new services, new technologies, team restructuring).
import spacy

# Load the fine-tuned pipeline produced by spacy train
nlp = spacy.load("./custom_ner_model/model-best")

def extract_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            "name": ent.text,
            "type": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char
        })
    return entities

Skip the training pipeline. Adaptive Recall uses a tuned extraction system that handles entity recognition automatically as you store memories.
Try It Free