How to Build a Relationship Extraction Pipeline
Why Relationship Extraction Matters
Entities without relationships are just a list. The value of a knowledge graph comes from the connections between entities, because those connections enable traversal. When someone asks "what is affected if Redis goes down," the answer comes from following depends_on relationships from other services to Redis, not from searching for documents that mention Redis. Without extracted relationships, the graph has no edges and traversal has nothing to follow.
Relationship extraction is harder than entity extraction because it requires understanding the semantic connection between two entities, not just recognizing that they exist. The sentence "the team discussed Redis during the architecture review" contains two entities, but the relationship (discussed) is incidental, not structural. A good pipeline distinguishes between meaningful relationships that should be in the graph and incidental co-occurrences that add noise.
Step-by-Step Process
Start with 10 to 20 relationship types that matter for your domain. Each predicate should represent a distinct, meaningful connection between entities. Providing this vocabulary to the extraction system (whether LLM or pattern-based) dramatically improves consistency compared to letting the system invent predicates freely.
```python
PREDICATES = [
    "depends_on",         # service/system depends on another
    "maintained_by",      # entity is maintained by a person or team
    "stores_data_in",     # service stores data in a database or store
    "communicates_with",  # service communicates with another service
    "implements",         # service implements a concept or pattern
    "part_of",            # entity is a component of a larger entity
    "uses",               # entity uses a technology or library
    "created_by",         # entity was created by a person or team
    "deployed_on",        # service is deployed on infrastructure
    "replaces",           # entity replaces a previous entity
    "configured_by",      # entity is configured by a setting or file
    "documented_in",      # entity is documented in a resource
    "tested_by",          # entity is tested by a test suite or tool
    "monitors",           # entity monitors another entity
    "authenticates_via",  # service authenticates via a mechanism
]
```

Relationship extraction works best when the system already knows which entities exist in the text. Run entity extraction as a separate step (see the LLM extraction guide) and pass the extracted entity list into the relationship extraction prompt. This two-pass approach produces higher quality than extracting entities and relationships in a single prompt, because each pass can focus on its specific task.
The prompt takes two inputs: the list of entities found in the text and the text itself. It asks the LLM to identify typed relationships between entity pairs, using the predicate vocabulary you defined. Requiring the LLM to cite the specific sentence that supports each relationship reduces hallucination.
```python
import json

from anthropic import Anthropic

RELATIONSHIP_PROMPT = """Given these entities found in the text:
{entities}
And these allowed relationship types:
{predicates}
Extract all relationships between the entities from the text below.
For each relationship, return:
- subject: the entity name (must be from the entity list)
- predicate: one of the allowed relationship types
- object: the entity name (must be from the entity list)
- evidence: the specific sentence from the text that supports this
- confidence: 0.0 to 1.0
Only extract relationships that are explicitly stated or strongly implied.
Do not infer relationships that require assumptions beyond the text.
Return as a JSON array of objects.
Text:
{text}"""

client = Anthropic()

def extract_relationships(text, entities, predicates):
    entity_names = [e["name"] for e in entities]
    # str.replace rather than str.format, so literal braces in the
    # input text do not need escaping.
    prompt = (
        RELATIONSHIP_PROMPT
        .replace("{entities}", json.dumps(entity_names))
        .replace("{predicates}", json.dumps(predicates))
        .replace("{text}", text)
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=3000,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # The model occasionally returns malformed JSON; treat it as
        # an empty extraction rather than crashing the pipeline.
        return []
```

Even with a controlled vocabulary in the prompt, the LLM sometimes generates variations or synonyms. Map extracted predicates to your canonical vocabulary: if the LLM returns "relies_on" when your vocabulary has "depends_on," normalize it. Keep a synonym map that grows as you encounter new variations.
```python
PREDICATE_SYNONYMS = {
    "relies_on": "depends_on",
    "relies on": "depends_on",
    "built_by": "created_by",
    "built by": "created_by",
    "owned_by": "maintained_by",
    "owned by": "maintained_by",
    "runs_on": "deployed_on",
    "runs on": "deployed_on",
    "utilizes": "uses",
    "connects_to": "communicates_with",
}

def normalize_predicate(predicate):
    p = predicate.lower().strip()
    return PREDICATE_SYNONYMS.get(p, p)
```

Apply a confidence threshold to remove weak relationships. A threshold of 0.6 to 0.7 works well for most domains. Also filter relationships where the subject or object does not match a known entity (the LLM sometimes invents entity names that were not in the input list). Validate that the evidence sentence actually supports the stated relationship by checking that both entity names appear in the cited sentence.
```python
def filter_relationships(relationships, entity_names, threshold=0.65):
    valid = []
    name_set = {n.lower() for n in entity_names}
    for rel in relationships:
        # Drop low-confidence extractions.
        if rel.get("confidence", 0) < threshold:
            continue
        # Drop triples whose endpoints are not known entities.
        if rel["subject"].lower() not in name_set:
            continue
        if rel["object"].lower() not in name_set:
            continue
        # Check that the cited evidence sentence actually mentions
        # both entities.
        evidence = rel.get("evidence", "").lower()
        if evidence and (rel["subject"].lower() not in evidence
                         or rel["object"].lower() not in evidence):
            continue
        rel["predicate"] = normalize_predicate(rel["predicate"])
        valid.append(rel)
    return valid
```

Write the validated triples to your graph store. Include the confidence score, the evidence sentence, and a reference to the source document. This provenance metadata is essential for debugging incorrect graph traversal results and for updating the graph when source documents change.
```python
from datetime import datetime

def store_triples(triples, source_id, graph_db):
    for triple in triples:
        graph_db.add_triple(
            subject=triple["subject"],
            predicate=triple["predicate"],
            obj=triple["object"],
            metadata={
                "confidence": triple["confidence"],
                "evidence": triple.get("evidence", ""),
                "source": source_id,
                "extracted_at": datetime.utcnow().isoformat(),
            },
        )
```

Handling Edge Cases
Some relationships span multiple sentences. "The checkout service processes credit card payments. To do this, it communicates with Stripe's API." The relationship (checkout service, communicates_with, Stripe) is split across two sentences. Larger chunk sizes and explicit instructions in the prompt to consider multi-sentence context help capture these cases.
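One way to keep such cross-sentence relationships inside a single extraction call is to chunk the text into overlapping multi-sentence windows, so that any pair of adjacent sentences lands together in at least one window. A minimal sketch; `sentence_windows` is an illustrative helper, and the naive regex sentence splitter plus the window size and overlap values are assumptions to tune for your corpus:

```python
import re

def sentence_windows(text, size=3, overlap=1):
    """Split text into overlapping windows of `size` sentences.

    The overlap keeps relationships that span adjacent sentences
    inside at least one window passed to the extractor.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(size - overlap, 1)
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return windows
```

Each window is then fed to the extractor in place of raw fixed-size chunks; deduplicate identical triples extracted from overlapping windows before storing them.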
Negated relationships require special attention. "The analytics service no longer depends on Redis" should not produce a depends_on triple. Add explicit instructions to distinguish between active and negated relationships. You can either exclude negated relationships entirely or store them with a "negated" flag for historical tracking.
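As a safety net behind the prompt instructions, a coarse post-hoc check can flag triples whose evidence sentence contains a negation cue. A sketch under stated assumptions: `flag_negated` and the `NEGATION_CUES` list are illustrative, and the cue list should be tuned per domain:

```python
NEGATION_CUES = ("no longer", "not ", "never ", "stopped", "removed", "deprecated")

def flag_negated(relationships):
    """Mark relationships whose cited evidence contains a negation cue.

    Downstream code can drop flagged triples outright, or store them
    with the flag for historical tracking.
    """
    for rel in relationships:
        evidence = rel.get("evidence", "").lower()
        rel["negated"] = any(cue in evidence for cue in NEGATION_CUES)
    return relationships
```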
Temporal relationships change over time. "We migrated from MySQL to PostgreSQL in Q3" means uses(service, PostgreSQL) is current and uses(service, MySQL) is historical. Capturing the temporal dimension requires the extraction prompt to identify tense and temporal markers, which the LLM handles well when instructed to do so.
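One way to wire this in is to append a temporal instruction to the extraction prompt and partition the results afterward. A hedged sketch: the `temporal_status` field name, the `TEMPORAL_ADDENDUM` string, and `split_by_temporal_status` are assumptions, not part of the base prompt above:

```python
TEMPORAL_ADDENDUM = """
For each relationship, also return:
- temporal_status: "current" if the relationship holds now, or
  "historical" if the text indicates it no longer holds (past tense,
  phrases like "migrated from", "previously", "formerly")
"""

def split_by_temporal_status(relationships):
    """Partition triples by the LLM-emitted temporal_status field.

    Triples without the field are treated as current.
    """
    current = [r for r in relationships
               if r.get("temporal_status", "current") == "current"]
    historical = [r for r in relationships
                  if r.get("temporal_status") == "historical"]
    return current, historical
```

Historical triples can then be stored with distinct metadata rather than discarded, so the graph can answer "what did this service use before" as well as "what does it use now."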
Skip the pipeline engineering. Adaptive Recall extracts relationships automatically as you store memories, maintaining a knowledge graph that stays current with every interaction.
Try It Free