Can LLMs Do Entity Extraction Without Fine-Tuning?
Zero-Shot Accuracy
In zero-shot entity extraction, the LLM receives a prompt describing the entity types to extract and a text passage, with no labeled examples. Claude and GPT-4 achieve 85 to 88% F1 on standard entity types (person, organization, location) in this mode. Adding 2 to 3 examples of correct extractions in the prompt (few-shot) pushes accuracy to 88 to 92% F1, closing most of the gap with fine-tuned NER models.
On domain-specific entity types, the advantage of LLMs is even stronger. A zero-shot LLM prompt for "microservices, APIs, and databases" achieves 80 to 87% F1 on software engineering text. A general NER model achieves 0% on these types because it has never seen them. A fine-tuned NER model requires 200 to 500 labeled examples to reach 82 to 88% F1, which means days of labeling work before the first extraction.
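To make the zero-shot setup concrete, here is a minimal sketch of what such a call might look like using the OpenAI Python client. The model name, entity type list, and prompt wording are illustrative assumptions, not a prescribed setup; any capable chat model and equivalent client would work the same way.

```python
# Minimal zero-shot extraction sketch. The model name, entity types, and
# prompt wording are illustrative assumptions, not a prescribed setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ENTITY_TYPES = ["Microservice", "API", "Database"]

def extract_entities_zero_shot(text: str) -> list[dict]:
    prompt = (
        f"Extract entities of these types: {', '.join(ENTITY_TYPES)}.\n"
        'Return a JSON array of objects with fields "name" and "type".\n'
        "Return [] if no entities are present.\n\n"
        f"Text:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the model returns bare JSON; production code should validate
    # the output and strip markdown fences if they appear.
    return json.loads(response.choices[0].message.content)

print(extract_entities_zero_shot(
    "The checkout service writes orders to Postgres and calls the payments API."
))
```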
Prompt Engineering for Better Results
The quality of zero-shot LLM extraction depends heavily on the prompt. A vague prompt ("extract entities from this text") produces noisy, inconsistent results. A specific prompt with clear type definitions, output format, and edge case handling produces results that rival fine-tuned models.
Key prompt components: define each entity type with what it includes and excludes, specify the output format as JSON with required fields, instruct the model to use canonical names rather than abbreviations, and add a confidence score requirement so low-confidence extractions can be filtered. Including 2 to 3 examples of correctly formatted output in the prompt improves consistency significantly.
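One way those components might come together is sketched below. The type definitions, JSON field names, and confidence threshold are illustrative assumptions, not a required schema; the point is that each instruction in the prompt maps to one of the components above.

```python
# Illustrative prompt template covering the components above; the type
# definitions, field names, and confidence threshold are assumptions.
EXTRACTION_PROMPT = """\
Extract entities from the text below.

Entity types:
- Microservice: a deployable service with its own runtime. Includes named
  services ("checkout-service"); excludes generic roles ("the backend").
- Database: a named datastore ("Postgres", "Redis"). Excludes generic
  terms like "storage" or "caching".
- API: a named internal or external API. Excludes the bare word "API".

Rules:
- Use canonical names, not abbreviations ("PostgreSQL", not "pg").
- Output a JSON array of objects with fields: "name", "type", "confidence"
  (a number between 0 and 1).
- If no entities are present, output [].

Text:
{text}
"""

def filter_low_confidence(entities: list[dict], threshold: float = 0.7) -> list[dict]:
    """Drop extractions below the confidence threshold before storing them."""
    return [e for e in entities if e.get("confidence", 0.0) >= threshold]
```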
Few-Shot vs Zero-Shot: The Practical Difference
Adding 2 to 3 examples of correct extractions to the prompt (few-shot) typically improves F1 by 3 to 5 percentage points over zero-shot. The examples serve two purposes: they show the model the exact output format you expect, reducing JSON formatting errors, and they calibrate the model's extraction threshold by demonstrating what counts as an entity in your domain. For example, showing that "Redis" is an Infrastructure entity but "caching" is not teaches the model the boundary between entities and concepts.
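A sketch of how those examples might be appended to the prompt follows. The specific examples, the "Infrastructure" type label, and the JSON shape are assumptions chosen to mirror the Redis/caching boundary described above.

```python
# Hypothetical few-shot block placed after the instructions and before the
# input passage. The examples teach both the output format and the
# entity/concept boundary ("Redis" counts, "caching" does not).
FEW_SHOT_EXAMPLES = """\
Example 1:
Text: "We moved session state to Redis to simplify caching."
Output: [{"name": "Redis", "type": "Infrastructure", "confidence": 0.95}]

Example 2:
Text: "The team discussed improving observability next quarter."
Output: []
"""

def build_few_shot_prompt(instructions: str, text: str) -> str:
    """Instructions first, then worked examples, then the passage to extract."""
    return f"{instructions}\n\n{FEW_SHOT_EXAMPLES}\nText:\n{text}"

prompt = build_few_shot_prompt(
    "Extract Infrastructure entities. Return a JSON array with name, type, confidence.",
    "The ingest worker buffers events in Kafka before writing to ClickHouse.",
)
```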
The trade-off is that few-shot examples consume tokens, adding to prompt length and cost. Three examples at 200 tokens each add 600 tokens per API call. For most use cases, this cost is trivial compared to the improvement in extraction quality. If you are optimizing for the lowest possible cost, zero-shot works acceptably. If you are optimizing for quality, few-shot is worth the extra tokens.
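A quick back-of-the-envelope makes the scale of that overhead clear. The per-token price and call volume below are placeholder assumptions; substitute your provider's current pricing.

```python
# Back-of-the-envelope cost of few-shot examples. The per-token price and
# call volume are placeholder assumptions, not quoted rates.
EXAMPLES = 3
TOKENS_PER_EXAMPLE = 200
PRICE_PER_1K_INPUT_TOKENS = 0.003  # USD, placeholder; check your provider

extra_tokens_per_call = EXAMPLES * TOKENS_PER_EXAMPLE          # 600 tokens
extra_cost_per_call = extra_tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS

calls_per_day = 5_000
print(f"Extra cost per call: ${extra_cost_per_call:.5f}")
print(f"Extra cost per day at {calls_per_day} calls: "
      f"${extra_cost_per_call * calls_per_day:.2f}")
```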
Where Zero-Shot Degrades
LLM extraction without fine-tuning performs worst in three situations. First, highly ambiguous domains where entity boundaries are unclear. In semiconductor manufacturing, is "7nm process" an entity or a property? Without domain-specific examples, the LLM guesses inconsistently. Second, very short text passages (under 50 tokens) where there is not enough context for disambiguation. The single word "Mercury" could be a planet, element, or company, and without surrounding text, the LLM has no basis for choosing. Third, entity types that are inherently vague, like "concept" or "topic," where reasonable people would disagree on what counts. These situations benefit from either fine-tuning or very detailed prompt instructions with boundary examples.
When Fine-Tuning Is Still Worth It
Fine-tuning a local NER model is worth the effort in three situations: when you process more than 10,000 documents per day (LLM costs become prohibitive), when you need sub-100ms latency per document (LLMs take seconds), and when your entity types are stable enough that the labeling investment pays off over months of use without needing to relabel.
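Expressed as a rough decision rule, the criteria above might look like the sketch below. The thresholds come from the paragraph; how the three conditions combine (volume or latency pressure, plus stable entity types) is an assumption about intent, not a stated formula.

```python
# Rough decision helper mirroring the three criteria above. The thresholds
# come from the text; the way they combine is an assumption for illustration.
def should_fine_tune(docs_per_day: int,
                     latency_budget_ms: float,
                     entity_types_stable_for_months: bool) -> bool:
    high_volume = docs_per_day > 10_000          # LLM costs become prohibitive
    needs_low_latency = latency_budget_ms < 100  # LLM calls take seconds
    return (high_volume or needs_low_latency) and entity_types_stable_for_months

print(should_fine_tune(docs_per_day=2_000, latency_budget_ms=2_000,
                       entity_types_stable_for_months=True))   # False: stay with LLM
print(should_fine_tune(docs_per_day=50_000, latency_budget_ms=500,
                       entity_types_stable_for_months=True))   # True: fine-tune
```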
For most applications in 2026, the recommendation is to start with LLM-based extraction, validate the results, and only invest in fine-tuning if cost or latency constraints demand it. The LLM gives you a working extraction system in hours. Fine-tuning gives you a cheaper extraction system in weeks. Many teams find that LLM extraction meets their needs indefinitely and never reach the scale where fine-tuning becomes necessary.
Adaptive Recall uses a tuned extraction system that combines the flexibility of LLM-based extraction with optimizations that reduce per-memory cost. Entities are extracted automatically from every stored memory, and the extraction quality improves as the entity inventory provides better context for disambiguation.
No fine-tuning needed. Adaptive Recall extracts entities from every memory automatically using optimized extraction that adapts to your domain.
Try It Free