How to Convert Entities into a Knowledge Graph
Choosing a Graph Storage Backend
The choice of graph storage depends on your scale, query complexity, and operational capacity. There are three practical options in 2026, each with different trade-offs.
Graph databases (Neo4j, Amazon Neptune, Memgraph) are purpose-built for graph storage and traversal. Neo4j uses the property graph model and the Cypher query language, which makes complex traversals intuitive to write. Multi-hop queries, path finding, and community detection are first-class operations. The trade-off is operational complexity: you run a separate database with its own backup, monitoring, and scaling concerns. Neo4j AuraDB (managed) starts at $65/month for production workloads.
Relational triple stores use your existing PostgreSQL or MySQL database with a triples table (subject, predicate, object columns plus metadata). This avoids a new database but makes multi-hop traversal queries verbose. A two-hop query requires a self-join, three hops require two self-joins, and so on. PostgreSQL's recursive CTEs handle this cleanly up to about 100,000 triples. Beyond that, query performance degrades without careful indexing. The advantage is that you already have the infrastructure, monitoring, and backup in place.
In-memory graph libraries (NetworkX, igraph) store the graph in application memory. These are excellent for prototyping and for graphs under 50,000 nodes. Traversal is fast because there is no network round-trip to a database. The limitations are obvious: the graph disappears when the process stops, memory usage grows with graph size, and there is no concurrent access from multiple application instances.
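The NetworkX option can be sketched in a few lines. The entities and predicates below are made-up examples, not output from a real extraction run:

```python
import networkx as nx

# In-memory knowledge graph sketch: nodes carry entity properties,
# directed edges carry the predicate and a confidence score.
G = nx.DiGraph()
G.add_node("Marie Curie", entity_type="person")
G.add_node("Radium", entity_type="substance")
G.add_node("Nobel Prize", entity_type="award")
G.add_edge("Marie Curie", "Radium", predicate="discovered", confidence=0.95)
G.add_edge("Marie Curie", "Nobel Prize", predicate="won", confidence=0.9)
G.add_edge("Radium", "Nobel Prize", predicate="cited_in", confidence=0.6)

# Everything reachable from a starting entity within two hops,
# with no database round-trip.
within_two_hops = nx.single_source_shortest_path_length(
    G, "Marie Curie", cutoff=2)
print(sorted(within_two_hops))
```

Traversal here is a dictionary lookup away, which is why this style works well for prototyping; everything above vanishes when the process exits.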
Step-by-Step Process
For graphs under 100,000 triples where you already run PostgreSQL, use a relational triple store. For graphs over 100,000 triples or where you need complex traversal (community detection, shortest path, pattern matching), use Neo4j. For prototyping, use NetworkX. If you are using Adaptive Recall, graph storage is handled for you.
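That decision rule can be written down as a small function. This is a sketch of the guidance above, not a prescriptive API; choose_backend and its thresholds are illustrative:

```python
def choose_backend(triple_count: int, needs_complex_traversal: bool,
                   prototyping: bool = False) -> str:
    """Sketch of the backend decision rule described above.

    Thresholds mirror the guidance in this section; adjust them
    for your own workload.
    """
    if prototyping:
        return "networkx"      # in-memory, no persistence
    if triple_count > 100_000 or needs_complex_traversal:
        return "neo4j"         # purpose-built graph database
    return "postgresql"        # triples table in existing infrastructure

print(choose_backend(50_000, needs_complex_traversal=False))  # postgresql
```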
Each entity becomes a node with properties. At minimum, store the canonical name, entity type, a list of aliases, and a confidence score. Add a created_at timestamp and a source list (which documents the entity was extracted from) for provenance tracking.
-- PostgreSQL schema for a triple store
CREATE TABLE entities (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    entity_type TEXT NOT NULL,
    aliases TEXT[] DEFAULT '{}',
    confidence REAL DEFAULT 0.8,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE triples (
    id SERIAL PRIMARY KEY,
    subject_id INTEGER REFERENCES entities(id),
    predicate TEXT NOT NULL,
    object_id INTEGER REFERENCES entities(id),
    confidence REAL DEFAULT 0.8,
    evidence TEXT,
    source_doc TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_triples_subject ON triples(subject_id);
CREATE INDEX idx_triples_object ON triples(object_id);
CREATE INDEX idx_triples_predicate ON triples(predicate);

Each relationship becomes a directed edge from subject to object. Store the predicate type, confidence score, the evidence sentence from the source text, and a reference to the source document. The evidence field is critical for debugging incorrect graph data and for knowing when a relationship needs to be updated because its source document changed.
Use upsert logic so that loading the same entity twice updates the existing node rather than creating a duplicate. Merge alias lists when an existing entity is encountered with new aliases. Update the confidence score using the maximum of the old and new values, since a higher-confidence extraction takes precedence.
import psycopg2

def upsert_entity(cursor, name, entity_type, aliases, confidence):
    cursor.execute("""
        INSERT INTO entities (name, entity_type, aliases, confidence)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (name) DO UPDATE SET
            aliases = ARRAY(
                SELECT DISTINCT unnest(
                    entities.aliases || EXCLUDED.aliases
                )
            ),
            confidence = GREATEST(entities.confidence, EXCLUDED.confidence),
            updated_at = NOW()
        RETURNING id
    """, (name, entity_type, aliases, confidence))
    return cursor.fetchone()[0]

For each triple, look up the subject and object entity IDs, then create the edge. Use upsert logic on edges too: if a triple with the same subject, predicate, and object already exists, update the confidence and evidence rather than creating a duplicate edge.
def upsert_triple(cursor, subject_id, predicate, object_id,
                  confidence, evidence, source_doc):
    cursor.execute("""
        INSERT INTO triples
            (subject_id, predicate, object_id, confidence,
             evidence, source_doc)
        VALUES (%s, %s, %s, %s, %s, %s)
        ON CONFLICT ON CONSTRAINT unique_triple DO UPDATE SET
            confidence = GREATEST(triples.confidence, EXCLUDED.confidence),
            evidence = EXCLUDED.evidence,
            source_doc = EXCLUDED.source_doc
    """, (subject_id, predicate, object_id, confidence,
          evidence, source_doc))

The ON CONFLICT clause depends on a unique constraint over the triple columns, created once up front:

ALTER TABLE triples ADD CONSTRAINT unique_triple
    UNIQUE (subject_id, predicate, object_id);
The graph is only useful for retrieval if traversal results can be linked back to the documents or memories they came from. Add a junction table or metadata field that connects entity nodes to the documents they appear in. When graph traversal activates an entity, the system can retrieve the associated documents and include them in the LLM's context.
CREATE TABLE entity_sources (
    entity_id INTEGER REFERENCES entities(id),
    source_type TEXT NOT NULL,  -- 'document', 'memory', 'chunk'
    source_id TEXT NOT NULL,
    mention_count INTEGER DEFAULT 1,
    PRIMARY KEY (entity_id, source_type, source_id)
);

-- Query: find all documents connected to entities
-- reachable from a starting entity within 2 hops
WITH RECURSIVE reachable AS (
    SELECT object_id AS entity_id, 1 AS depth
    FROM triples WHERE subject_id = $1
    UNION
    SELECT t.object_id, r.depth + 1
    FROM triples t
    JOIN reachable r ON t.subject_id = r.entity_id
    WHERE r.depth < 2
)
SELECT DISTINCT es.source_id
FROM entity_sources es
JOIN reachable r ON es.entity_id = r.entity_id;

Neo4j Alternative
If you chose Neo4j instead of PostgreSQL, the loading process uses Cypher's MERGE clause, which handles upsert logic natively.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "password"))

def load_triple(tx, subject, predicate, obj, confidence, evidence):
    tx.run("""
        MERGE (s:Entity {name: $subject})
        MERGE (o:Entity {name: $object})
        MERGE (s)-[r:RELATES {type: $predicate}]->(o)
        SET r.confidence = CASE
                WHEN r.confidence IS NULL THEN $confidence
                WHEN r.confidence < $confidence THEN $confidence
                ELSE r.confidence END,
            r.evidence = $evidence
    """, subject=subject, predicate=predicate,
         object=obj, confidence=confidence, evidence=evidence)

# extracted_triples: the list of triple dicts produced by the extraction step
with driver.session() as session:
    for triple in extracted_triples:
        session.execute_write(load_triple,
            triple["subject"], triple["predicate"],
            triple["object"], triple["confidence"],
            triple.get("evidence", ""))

Adaptive Recall manages the entire graph storage and retrieval pipeline. When you store memories, entities are extracted, merged into the graph, and linked to the source memory. When you recall, graph traversal finds connected entities and boosts the retrieval scores of associated memories. You get a fully connected knowledge graph without managing a graph database.
Turn extracted entities into a live knowledge graph without the infrastructure. Adaptive Recall builds and maintains the graph as you store memories.
Try It Free