How to Build a Relationship Extraction Pipeline
Why Relationship Extraction Matters
Entities without relationships are just a list. The value of a knowledge graph comes from the connections between entities, because those connections enable traversal. When someone asks "what is affected if Redis goes down," the answer comes from following depends_on relationships from other services to Redis, not from searching for documents that mention Redis. Without extracted relationships, the graph has no edges and traversal has nothing to follow.
Relationship extraction is harder than entity extraction because it requires understanding the semantic connection between two entities, not just recognizing that they exist. The sentence "the team discussed Redis during the architecture review" contains two entities, but the relationship (discussed) is incidental, not structural. A good pipeline distinguishes between meaningful relationships that should be in the graph and incidental co-occurrences that add noise.
Step-by-Step Process
Start with 10 to 20 relationship types that matter for your domain. Each predicate should represent a distinct, meaningful connection between entities. Providing this vocabulary to the extraction system (whether LLM or pattern-based) dramatically improves consistency compared to letting the system invent predicates freely.
```python
PREDICATES = [
    "depends_on",         # service/system depends on another
    "maintained_by",      # entity is maintained by a person or team
    "stores_data_in",     # service stores data in a database or store
    "communicates_with",  # service communicates with another service
    "implements",         # service implements a concept or pattern
    "part_of",            # entity is a component of a larger entity
    "uses",               # entity uses a technology or library
    "created_by",         # entity was created by a person or team
    "deployed_on",        # service is deployed on infrastructure
    "replaces",           # entity replaces a previous entity
    "configured_by",      # entity is configured by a setting or file
    "documented_in",      # entity is documented in a resource
    "tested_by",          # entity is tested by a test suite or tool
    "monitors",           # entity monitors another entity
    "authenticates_via",  # service authenticates via a mechanism
]
```

Relationship extraction works best when the system already knows which entities exist in the text. Run entity extraction as a separate step (see the LLM extraction guide) and pass the extracted entity list into the relationship extraction prompt. This two-pass approach produces higher quality than extracting entities and relationships in a single prompt, because each pass can focus on its specific task.
The prompt takes two inputs: the list of entities found in the text and the text itself. It asks the LLM to identify typed relationships between entity pairs, using the predicate vocabulary you defined. Requiring the LLM to cite the specific sentence that supports each relationship reduces hallucination.
```python
import json

from anthropic import Anthropic

RELATIONSHIP_PROMPT = """Given these entities found in the text:
{entities}
And these allowed relationship types:
{predicates}
Extract all relationships between the entities from the text below.
For each relationship, return:
- subject: the entity name (must be from the entity list)
- predicate: one of the allowed relationship types
- object: the entity name (must be from the entity list)
- evidence: the specific sentence from the text that supports this
- confidence: 0.0 to 1.0
Only extract relationships that are explicitly stated or strongly implied.
Do not infer relationships that require assumptions beyond the text.
Return as a JSON array of objects.
Text:
{text}"""

client = Anthropic()

def extract_relationships(text, entities, predicates):
    entity_names = [e["name"] for e in entities]
    # str.replace rather than str.format, so literal braces in the
    # input text do not need escaping.
    prompt = (
        RELATIONSHIP_PROMPT
        .replace("{entities}", json.dumps(entity_names))
        .replace("{predicates}", json.dumps(predicates))
        .replace("{text}", text)
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=3000,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # The model occasionally returns malformed JSON; treat it as
        # an empty extraction rather than crashing the pipeline.
        return []
```

Even with a controlled vocabulary in the prompt, the LLM sometimes generates variations or synonyms. Map extracted predicates to your canonical vocabulary: if the LLM returns "relies_on" when your vocabulary has "depends_on," normalize it. Keep a synonym map that grows as you encounter new variations.
```python
PREDICATE_SYNONYMS = {
    "relies_on": "depends_on",
    "relies on": "depends_on",
    "built_by": "created_by",
    "built by": "created_by",
    "owned_by": "maintained_by",
    "owned by": "maintained_by",
    "runs_on": "deployed_on",
    "runs on": "deployed_on",
    "utilizes": "uses",
    "connects_to": "communicates_with",
}

def normalize_predicate(predicate):
    p = predicate.lower().strip()
    return PREDICATE_SYNONYMS.get(p, p)
```

Apply a confidence threshold to remove weak relationships. A threshold of 0.6 to 0.7 works well for most domains. Also filter relationships where the subject or object does not match a known entity (the LLM sometimes invents entity names that were not in the input list). Validate that the evidence sentence actually supports the stated relationship by checking that both entity names appear in the cited sentence.
```python
def filter_relationships(relationships, entity_names, threshold=0.65):
    valid = []
    name_set = {n.lower() for n in entity_names}
    for rel in relationships:
        # Drop low-confidence extractions.
        if rel.get("confidence", 0) < threshold:
            continue
        # Drop triples whose endpoints are not known entities.
        if rel["subject"].lower() not in name_set:
            continue
        if rel["object"].lower() not in name_set:
            continue
        # Check that the cited evidence sentence actually mentions
        # both entities.
        evidence = rel.get("evidence", "").lower()
        if evidence and (rel["subject"].lower() not in evidence
                         or rel["object"].lower() not in evidence):
            continue
        rel["predicate"] = normalize_predicate(rel["predicate"])
        valid.append(rel)
    return valid
```

Write the validated triples to your graph store. Include the confidence score, the evidence sentence, and a reference to the source document. This provenance metadata is essential for debugging incorrect graph traversal results and for updating the graph when source documents change.
```python
from datetime import datetime

def store_triples(triples, source_id, graph_db):
    for triple in triples:
        graph_db.add_triple(
            subject=triple["subject"],
            predicate=triple["predicate"],
            obj=triple["object"],
            metadata={
                "confidence": triple["confidence"],
                "evidence": triple.get("evidence", ""),
                "source": source_id,
                "extracted_at": datetime.utcnow().isoformat(),
            },
        )
```

Handling Edge Cases
Some relationships span multiple sentences. "The checkout service processes credit card payments. To do this, it communicates with Stripe's API." The relationship (checkout service, communicates_with, Stripe) is split across two sentences. Larger chunk sizes and explicit instructions in the prompt to consider multi-sentence context help capture these cases.
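One way to keep such cross-sentence relationships inside a single extraction call is to chunk the text into overlapping multi-sentence windows, so that any pair of adjacent sentences lands together in at least one window. A minimal sketch; `sentence_windows` is an illustrative helper, and the naive regex sentence splitter plus the window size and overlap values are assumptions to tune for your corpus:

```python
import re

def sentence_windows(text, size=3, overlap=1):
    """Split text into overlapping windows of `size` sentences.

    The overlap keeps relationships that span adjacent sentences
    inside at least one window passed to the extractor.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(size - overlap, 1)
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return windows
```

Each window is then fed to the extractor in place of raw fixed-size chunks; deduplicate identical triples extracted from overlapping windows before storing them.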
Negated relationships require special attention. "The analytics service no longer depends on Redis" should not produce a depends_on triple. Add explicit instructions to distinguish between active and negated relationships. You can either exclude negated relationships entirely or store them with a "negated" flag for historical tracking.
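As a safety net behind the prompt instructions, a coarse post-hoc check can flag triples whose evidence sentence contains a negation cue. A sketch under stated assumptions: `flag_negated` and the `NEGATION_CUES` list are illustrative, and the cue list should be tuned per domain:

```python
NEGATION_CUES = ("no longer", "not ", "never ", "stopped", "removed", "deprecated")

def flag_negated(relationships):
    """Mark relationships whose cited evidence contains a negation cue.

    Downstream code can drop flagged triples outright, or store them
    with the flag for historical tracking.
    """
    for rel in relationships:
        evidence = rel.get("evidence", "").lower()
        rel["negated"] = any(cue in evidence for cue in NEGATION_CUES)
    return relationships
```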
Temporal relationships change over time. "We migrated from MySQL to PostgreSQL in Q3" means uses(service, PostgreSQL) is current and uses(service, MySQL) is historical. Capturing the temporal dimension requires the extraction prompt to identify tense and temporal markers, which the LLM handles well when instructed to do so.
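One way to wire this in is to append a temporal instruction to the extraction prompt and partition the results afterward. A hedged sketch: the `temporal_status` field name, the `TEMPORAL_ADDENDUM` string, and `split_by_temporal_status` are assumptions, not part of the base prompt above:

```python
TEMPORAL_ADDENDUM = """
For each relationship, also return:
- temporal_status: "current" if the relationship holds now, or
  "historical" if the text indicates it no longer holds (past tense,
  phrases like "migrated from", "previously", "formerly")
"""

def split_by_temporal_status(relationships):
    """Partition triples by the LLM-emitted temporal_status field.

    Triples without the field are treated as current.
    """
    current = [r for r in relationships
               if r.get("temporal_status", "current") == "current"]
    historical = [r for r in relationships
                  if r.get("temporal_status") == "historical"]
    return current, historical
```

Historical triples can then be stored with distinct metadata rather than discarded, so the graph can answer "what did this service use before" as well as "what does it use now."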
Skip the pipeline engineering. Adaptive Recall extracts relationships automatically as you store memories, maintaining a knowledge graph that stays current with every interaction.
Try It Free