How to Build a Knowledge Graph with Neo4j
Why Neo4j for Knowledge Graphs
Neo4j stores data as nodes and relationships natively, which means graph traversal operations (find all neighbors, follow a path, discover connected components) are executed without the join operations that relational databases require. A three-hop traversal in Neo4j is a single query that executes in milliseconds. The same traversal in PostgreSQL requires three self-joins on a triples table, which becomes slow as the table grows. For AI applications where retrieval latency matters, this performance difference is significant.
The property graph model allows you to attach properties to both nodes and relationships. An entity node can carry properties like name, type, created date, confidence score, and source document. A relationship can carry properties like type, strength, evidence text, and extraction date. This richness lets you filter and weight traversal results based on metadata, which is essential for retrieval quality.
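As a rough mental model (not Neo4j's actual storage format), a property-graph record can be sketched in plain Python; the field names and example values below are illustrative, chosen to mirror the examples in this article:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # e.g. "Service"
    properties: dict = field(default_factory=dict)  # name, confidence, ...

@dataclass
class Relationship:
    rel_type: str                                   # e.g. "DEPENDS_ON"
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)  # strength, evidence, ...

# both nodes and the edge between them carry their own metadata
checkout = Node("Service", {"name": "Checkout Service", "confidence": 0.95})
orders_db = Node("Database", {"name": "orders-db"})
dep = Relationship("DEPENDS_ON", checkout, orders_db,
                   {"strength": 1.0, "evidence": "runbook, section 3"})
```

Because the metadata lives on the relationship itself, a traversal can filter on `dep.properties["strength"]` without touching the nodes, which is exactly what the weighting queries later in this article rely on.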
Step-by-Step Setup
For local development, install Neo4j Community Edition via Docker. For production, use Neo4j AuraDB (managed cloud) or deploy Enterprise Edition on your infrastructure. The Community Edition is free and supports everything needed for knowledge graph applications. The main limitation is single-instance only (no clustering), which is fine for graphs under a few million nodes.
# Docker setup for local development
docker run \
  --name neo4j-kg \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your-password \
  -v neo4j-data:/data \
  neo4j:5-community

Access the Neo4j Browser at http://localhost:7474 to verify the installation. The browser provides a visual interface for running Cypher queries and exploring the graph structure, which is invaluable during development.
Define node labels for your entity types and relationship types for your predicates. In Neo4j, node labels are roughly what table names are in a relational database, while relationship types play the role of the foreign keys and join tables that connect rows. Keep labels singular (Service, not Services) and relationship types as uppercase verbs (DEPENDS_ON, not depends_on or DependsOn).
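Extraction pipelines rarely emit predicates in this convention, so it helps to normalize them on the way in. A small helper can do this mechanically (a sketch; `normalize_rel_type` is a hypothetical name, not part of any Neo4j API):

```python
import re

def normalize_rel_type(predicate: str) -> str:
    """Convert a free-form predicate to an UPPER_SNAKE relationship type."""
    # insert an underscore at camelCase boundaries ("DependsOn" -> "Depends_On")
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", predicate.strip())
    # collapse spaces, hyphens, and other separators into underscores
    s = re.sub(r"[^A-Za-z0-9]+", "_", s)
    return s.strip("_").upper()
```

With this in place, "depends on", "depends-on", and "DependsOn" all land on DEPENDS_ON, so the same predicate from different documents maps to one relationship type.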
// Example schema for a software engineering knowledge graph
// Node labels: Service, Database, Person, Team, Technology, Document
// Relationship types: DEPENDS_ON, USES, MAINTAINED_BY, PART_OF,
// DOCUMENTED_IN, AUTHORED_BY
// Create a service with properties
CREATE (s:Service {
  name: 'Checkout Service',
  description: 'Handles payment processing and order creation',
  created: datetime('2026-01-15'),
  confidence: 0.95
})

Add uniqueness constraints on entity names to prevent duplicates during loading. Add indexes on frequently queried properties (name, type, created date) for fast lookups. Indexes are critical for performance because entity linking during query time needs to find nodes by name in milliseconds.
// uniqueness constraint prevents duplicate entities
CREATE CONSTRAINT entity_name IF NOT EXISTS
FOR (e:Service) REQUIRE e.name IS UNIQUE;
CREATE CONSTRAINT db_name IF NOT EXISTS
FOR (d:Database) REQUIRE d.name IS UNIQUE;
CREATE CONSTRAINT person_name IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE;
// full-text index for fuzzy entity matching
CREATE FULLTEXT INDEX entity_search IF NOT EXISTS
FOR (n:Service|Database|Person|Team|Technology)
ON EACH [n.name, n.aliases];

Use MERGE statements to load data idempotently. MERGE creates the node or relationship if it does not exist and matches the existing one if it does. This handles duplicate entities from overlapping text chunks gracefully. Batch your MERGE operations in transactions of 1,000 to 5,000 statements for performance.
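Batching needs nothing more than a plain chunking helper that splits the extracted lists before handing each chunk to execute_write (a sketch; `batched` here is hand-rolled rather than Python 3.12's itertools.batched so it works on older interpreters):

```python
def batched(items, size=1000):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# usage with the loader functions below: one transaction per chunk
# for chunk in batched(extracted_entities, size=2000):
#     session.execute_write(load_entities, chunk)
```

Each chunk becomes one transaction, so a failure only rolls back that chunk rather than the whole load.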
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "your-password")
)

def load_entities(tx, entities):
    for e in entities:
        tx.run("""
            MERGE (n:Entity {name: $name})
            ON CREATE SET n.type = $type,
                n.aliases = $aliases,
                n.created = datetime(),
                n.confidence = 0.8
            ON MATCH SET n.confidence =
                CASE WHEN n.confidence < 0.95
                     THEN n.confidence + 0.05
                     ELSE n.confidence END
        """, name=e["name"], type=e["type"],
             aliases=e.get("aliases", []))

def load_relationships(tx, triples):
    for t in triples:
        tx.run("""
            MATCH (s:Entity {name: $subject})
            MATCH (o:Entity {name: $object})
            MERGE (s)-[r:RELATES {type: $predicate}]->(o)
            ON CREATE SET r.evidence = $evidence,
                r.created = datetime(),
                r.strength = 1.0
            ON MATCH SET r.strength = r.strength + 0.1
        """, subject=t["subject"], predicate=t["predicate"],
             object=t["object"],
             evidence=t.get("evidence", ""))

# extracted_entities and extracted_triples come from your extraction pipeline
with driver.session() as session:
    session.execute_write(load_entities, extracted_entities)
    session.execute_write(load_relationships, extracted_triples)

Build Cypher queries for the access patterns your AI application needs. The most common patterns are: find all neighbors of an entity (one-hop), find entities within N hops (multi-hop), find the shortest path between two entities, and find all entities of a type connected to a starting entity.
// Find everything the checkout service depends on (1 hop)
MATCH (s:Entity {name: 'Checkout Service'})-[:RELATES {type: 'depends_on'}]->(dep)
RETURN dep.name, dep.type

// Find all transitive dependencies (up to 3 hops)
MATCH path = (s:Entity {name: 'Checkout Service'})
    -[:RELATES*1..3 {type: 'depends_on'}]->(dep)
RETURN dep.name, dep.type, length(path) AS depth
ORDER BY depth

// Find how two entities are connected
MATCH path = shortestPath(
  (a:Entity {name: 'Checkout Service'})-[*]-(b:Entity {name: 'Redis'})
)
RETURN path

// Spreading activation: find connected entities with decay
MATCH (start:Entity {name: 'Redis'})-[:RELATES]-(hop1)
RETURN hop1.name AS entity, 1.0 AS activation
UNION
MATCH (start:Entity {name: 'Redis'})-[:RELATES]-(hop1)-[:RELATES]-(hop2)
WHERE hop2 <> start
RETURN hop2.name AS entity, 0.5 AS activation

Integrate Neo4j queries into your retrieval system. Extract entities from the user query, look them up in Neo4j, traverse their connections, and use the connected entities to boost vector search results or retrieve additional context. The Neo4j Python driver handles connection pooling and transaction management.
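The retrieval function below calls extract_query_entities, which this article does not define; in practice it would be an NER model or an LLM call. For a prototype, a substring match against known graph entity names is enough (a sketch; `KNOWN_ENTITIES` is an assumed in-memory list, in a real system you would pull the names from the graph or the full-text index):

```python
# assumed: the entity names already loaded into the graph
KNOWN_ENTITIES = [
    {"name": "Checkout Service"},
    {"name": "Redis"},
    {"name": "Postgres"},
]

def extract_query_entities(query, known=KNOWN_ENTITIES):
    """Naive entity linking: return known entities mentioned in the query."""
    q = query.lower()
    return [e for e in known if e["name"].lower() in q]
```

This deliberately trades recall for simplicity: it misses aliases and typos, which is what the full-text index created earlier is for.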
def graph_augmented_retrieve(query, vector_results):
    entities = extract_query_entities(query)
    graph_entities = set()
    with driver.session() as session:
        for entity in entities:
            result = session.run("""
                MATCH (start:Entity {name: $name})-[r:RELATES]-(connected)
                RETURN connected.name AS name, r.strength AS strength
                ORDER BY r.strength DESC LIMIT 20
            """, name=entity["name"])
            for record in result:
                graph_entities.add(record["name"])
    # boost vector results that mention graph-connected entities
    for result in vector_results:
        for entity in graph_entities:
            if entity.lower() in result["text"].lower():
                result["score"] *= 1.3  # 30% boost
    return sorted(vector_results, key=lambda r: -r["score"])

When Neo4j Is Overkill
Neo4j adds operational complexity: a separate database to host, monitor, back up, and scale. For graphs under 50,000 entities, a simple triples table in PostgreSQL with recursive CTEs handles traversal well enough. For prototyping, an in-memory graph in Python (using NetworkX or a dictionary of adjacency lists) is sufficient. Neo4j earns its complexity when your graph exceeds 100,000 entities, when you need sub-millisecond traversal, or when you need the Cypher query language for complex access patterns.
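For the prototyping case, the dictionary-of-adjacency-lists option is only a few lines. This sketch does a bounded breadth-first traversal, the in-memory equivalent of the RELATES*1..3 query above (entity names here are illustrative):

```python
from collections import deque

# adjacency lists: entity -> entities it points at
graph = {
    "Checkout Service": ["Payments API", "orders-db"],
    "Payments API": ["Redis"],
    "orders-db": [],
    "Redis": [],
}

def neighbors_within(graph, start, max_hops=3):
    """Return {entity: depth} for everything reachable within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # don't expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]  # the start node is not its own neighbor
    return seen
```

The returned depths can stand in for activation decay (for example, weight each entity by 0.5 ** depth). When this dictionary stops fitting in memory, or you need these traversals concurrently from several services, that is the signal to move to Neo4j.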
Adaptive Recall includes a managed knowledge graph as part of its memory system. Entities are extracted during memory storage and stored in a graph that supports spreading activation during recall. If your primary goal is better AI retrieval rather than building a general-purpose graph database, the built-in graph eliminates the need to operate Neo4j separately.
Get graph-powered retrieval without running a graph database. Adaptive Recall manages the knowledge graph for you.
Get Started Free