How to Build a Knowledge Graph from Text
Why Build a Graph from Text
Most organizational knowledge lives in unstructured text: documentation, meeting notes, support tickets, code comments, chat transcripts. This text contains rich information about how things relate to each other, but that information is implicit. A paragraph about a deployment might mention a service name, a database, a team, and a configuration setting without explicitly connecting them. A knowledge graph makes those connections explicit and traversable.
The alternative to building a graph from text is building one manually, which is accurate but does not scale. A team of knowledge engineers can create a high-quality graph for a small domain, but most organizations have too much knowledge changing too fast for manual approaches to keep up. LLM-based extraction bridges this gap by automating the process at the cost of some accuracy, which can be managed through validation and iterative refinement.
Step-by-Step Process
Segment your documents into chunks that are small enough for the LLM to process reliably but large enough to preserve the context needed for relationship extraction. For entity extraction, chunk size matters less than for embedding because the LLM can identify entities in short passages. For relationship extraction, chunks need to be long enough that both the subject and object of a relationship appear in the same chunk. A good default is 500 to 1,000 tokens per chunk with 100-token overlap between consecutive chunks.
```python
def chunk_text(text, chunk_size=800, overlap=100):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        start = end - overlap
    return chunks
```

If your documents have natural boundaries (sections, paragraphs, pages), chunk along those boundaries rather than splitting mid-sentence. Relationship extraction quality drops significantly when the subject and object of a relationship land in different chunks.
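A boundary-aware variant is a small change. The sketch below greedily packs whole paragraphs into chunks, approximating token counts with word counts (the `max_words` budget and `chunk_by_paragraphs` name are illustrative, not from any library):

```python
def chunk_by_paragraphs(text, max_words=800):
    # Split on blank lines, then pack whole paragraphs into chunks
    # so no paragraph is ever cut mid-sentence.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the budget still becomes its own oversized chunk here; splitting such paragraphs on sentence boundaries is a reasonable refinement.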
Use a structured prompt that asks the LLM to identify all entities and categorize them by type. Common entity types include Person, Organization, Technology, Service, Concept, Location, and any domain-specific types relevant to your application. Ask for normalized names (full name rather than abbreviations or pronouns) so that the same entity is identified consistently across chunks.
```python
import json
from anthropic import Anthropic

client = Anthropic()

ENTITY_PROMPT = """Extract all entities from the following text.
For each entity, provide:
- name: the canonical name (full name, not abbreviations)
- type: one of Person, Organization, Technology, Service, Concept
- aliases: any other names used for this entity in the text
Return as a JSON array.

Text:
{chunk}"""

def extract_entities(chunk):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content":
                   ENTITY_PROMPT.replace("{chunk}", chunk)}]
    )
    return json.loads(response.content[0].text)
```

Extract typed relationships as subject-predicate-object triples. You can do this in the same LLM call as entity extraction or in a separate pass. A separate pass tends to produce higher quality because the model can focus on one task at a time, but it doubles the cost. For most use cases, a combined prompt works well enough.
```python
TRIPLE_PROMPT = """Given these entities: {entities}

Extract all relationships between them from this text.
For each relationship, provide:
- subject: entity name
- predicate: relationship type (e.g., "depends on", "is maintained by")
- object: entity name
- confidence: 0.0 to 1.0
Return as a JSON array of triples.

Text:
{chunk}"""

def extract_triples(chunk, entities):
    entity_names = [e["name"] for e in entities]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content":
                   TRIPLE_PROMPT.replace("{entities}", json.dumps(entity_names))
                                .replace("{chunk}", chunk)}]
    )
    return json.loads(response.content[0].text)
```

The same entity will appear across multiple chunks with different names. "PostgreSQL," "Postgres," "PG," and "the database" might all refer to the same entity. String similarity catches obvious duplicates (Levenshtein distance, case-insensitive matching), and the alias lists from the extraction step help. For harder cases, use the LLM to judge whether two entity names refer to the same thing, batching comparisons for efficiency.
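Batching those harder comparisons keeps the cost to one call per group of candidate pairs. A sketch, reusing the Anthropic client from the extraction step (the prompt wording and `judge_same_entity` helper are illustrative, not a fixed API):

```python
import json

def build_pair_prompt(pairs):
    # One prompt covering several candidate pairs at once.
    lines = [f'{i}. "{a}" vs. "{b}"' for i, (a, b) in enumerate(pairs, 1)]
    return ("For each numbered pair, decide whether the two names refer "
            "to the same entity.\n" + "\n".join(lines) +
            "\nReturn a JSON array of booleans, one per pair.")

def judge_same_entity(client, pairs):
    # client is an anthropic.Anthropic instance, as created earlier.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": build_pair_prompt(pairs)}],
    )
    return json.loads(response.content[0].text)
```

In practice, run the cheap string-similarity pass first and send only the surviving ambiguous pairs through this check.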
```python
from difflib import SequenceMatcher

def find_duplicates(entities, threshold=0.85):
    merged = {}
    for entity in entities:
        name = entity["name"].lower().strip()
        matched = False
        for canonical in merged:
            ratio = SequenceMatcher(None, name, canonical).ratio()
            if ratio >= threshold:
                # Absorb both the matched name and its extracted aliases.
                merged[canonical]["aliases"].add(entity["name"])
                merged[canonical]["aliases"].update(entity.get("aliases", []))
                matched = True
                break
        if not matched:
            merged[name] = {
                "name": entity["name"],
                "type": entity["type"],
                "aliases": set(entity.get("aliases", []))
            }
    return merged
```

Load entities as nodes and relationships as edges into your chosen graph store. For Neo4j, use the Cypher query language. For a relational database, use a triples table with subject, predicate, and object columns. For prototyping, an in-memory structure using Python's NetworkX library works well for graphs under 100,000 nodes.
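The NetworkX route is only a few lines. A sketch (assumes the `networkx` package is installed; `build_graph` is an illustrative helper, not a library function):

```python
import networkx as nx

def build_graph(triples):
    # MultiDiGraph allows parallel edges, so two entities can be
    # connected by several different predicates.
    g = nx.MultiDiGraph()
    for t in triples:
        g.add_edge(t["subject"], t["object"],
                   predicate=t["predicate"],
                   confidence=t.get("confidence", 1.0))
    return g
```

Nodes are created implicitly by `add_edge`, and edge attributes carry the predicate and confidence, mirroring the properties you would set in a graph database.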
```python
# Neo4j example
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def store_triple(tx, subject, predicate, obj, confidence):
    tx.run("""
        MERGE (s:Entity {name: $subject})
        MERGE (o:Entity {name: $object})
        MERGE (s)-[r:RELATES {type: $predicate}]->(o)
        SET r.confidence = $confidence
    """, subject=subject, predicate=predicate,
         object=obj, confidence=confidence)

# triples: the list of dicts produced by extract_triples
with driver.session() as session:
    for triple in triples:
        session.execute_write(store_triple,
                              triple["subject"], triple["predicate"],
                              triple["object"], triple["confidence"])
```

Sample 50 to 100 triples from your graph and verify them against the source text. Track error types: missed entities, hallucinated relationships, inconsistent predicates, unresolved duplicates. Each error type points to a specific improvement in your extraction prompt or post-processing pipeline. Expect 70 to 80% accuracy on the first pass and 90%+ after two or three iterations of prompt refinement.
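A reproducible sampler for that spot-check might look like this (the `sample_for_review` helper and the low-confidence-first ordering are suggestions, not standard tooling):

```python
import random

def sample_for_review(triples, n=50, seed=0):
    # Fixed seed so reviewers see the same sample across runs.
    rng = random.Random(seed)
    sample = rng.sample(triples, min(n, len(triples)))
    # Surface low-confidence triples first; they fail verification
    # most often and expose prompt problems fastest.
    return sorted(sample, key=lambda t: t.get("confidence", 0.0))
```

Recording which error category each failed triple falls into turns the review into a prioritized to-do list for the next prompt iteration.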
Connecting the Graph to Retrieval
A knowledge graph by itself is a database. To improve AI retrieval, you need to connect it to your retrieval pipeline. The most common integration point is between the vector search step and the reranking step. After vector search returns its top-k results, look up entities mentioned in the query, traverse their graph connections, and boost the scores of results that are connected to those entities. This adds structural relevance to the semantic relevance that vector search provides.
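As a sketch of that integration step (the adjacency-dict graph and the flat `boost` value are simplifying assumptions; a production system would query its graph store and tune the weighting):

```python
def boost_with_graph(results, query_entities, graph, boost=0.2, max_hops=1):
    # results: [{"text": ..., "score": ..., "entities": [...]}] from
    # vector search; graph: {entity: set(neighbor entities)}.
    connected = set(query_entities)
    frontier = set(query_entities)
    for _ in range(max_hops):
        # Expand one hop outward from the query entities.
        frontier = {nb for e in frontier
                    for nb in graph.get(e, ())} - connected
        connected |= frontier
    for r in results:
        # Boost results that mention any graph-connected entity.
        if connected & set(r.get("entities", ())):
            r["score"] += boost
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

The effect is that a result mentioning an entity one hop from the query's entities can outrank a slightly more semantically similar but structurally unrelated result.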
Adaptive Recall handles this integration automatically. When you store memories through the MCP tools, entities are extracted and added to the knowledge graph. When you recall memories, the query entities are looked up, activation spreads through the graph, and connected memories receive a score boost. You get the retrieval quality benefits of a knowledge graph without building the extraction pipeline, graph database, or traversal logic.