From Concept to Running System
You've read about knowledge graphs. The architecture makes sense. The benefits are clear. Now: how do you actually build one?
This chapter is a practical guide: not theory, but steps. By the end, you'll have a working knowledge graph you can query.
Week 1: Foundation
Goal: Working database with basic schema and vector search capability.
Day 1-2: Environment Setup
Install PostgreSQL locally or provision a cloud instance, and make sure the pgvector extension package is installed on the server. Create a database for your knowledge graph:
CREATE DATABASE knowledge_graph;
Then, connected to that database, enable the extension (extensions are enabled per database):
CREATE EXTENSION vector;
Day 3-4: Basic Schema
Create core tables:
-- Documents table with content and embedding
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT NOT NULL,
    source TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    embedding vector(1536)
);

-- Entities extracted from documents
CREATE TABLE entities (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Relationships between entities
CREATE TABLE relationships (
    id SERIAL PRIMARY KEY,
    source_entity_id INTEGER REFERENCES entities(id),
    target_entity_id INTEGER REFERENCES entities(id),
    relationship_type TEXT,
    confidence FLOAT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Links between documents
CREATE TABLE document_links (
    id SERIAL PRIMARY KEY,
    source_doc_id INTEGER REFERENCES documents(id),
    target_doc_id INTEGER REFERENCES documents(id),
    link_type TEXT,
    similarity FLOAT,
    created_at TIMESTAMP DEFAULT NOW(),
    -- needed for ON CONFLICT DO NOTHING during link discovery (Week 3)
    UNIQUE (source_doc_id, target_doc_id)
);
-- Document-entity connections
CREATE TABLE document_entities (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    entity_id INTEGER REFERENCES entities(id)
);
Day 5: Create Indices
-- Vector index for semantic search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- Text index for keyword search
CREATE INDEX documents_content_idx ON documents USING GIN (to_tsvector('english', content));
-- Standard indices
CREATE INDEX ON document_links(source_doc_id);
CREATE INDEX ON document_links(target_doc_id);
CREATE INDEX ON document_entities(document_id);
CREATE INDEX ON document_entities(entity_id);
Day 6-7: Basic Ingestion
Write a simple script to load documents:
import psycopg2
from openai import OpenAI

def embed_text(text):
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def insert_document(conn, title, content, source):
    # str() gives the '[x,y,...]' text form that pgvector accepts as a parameter
    embedding = str(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (title, content, source, embedding)
            VALUES (%s, %s, %s, %s)
            RETURNING id
        """, (title, content, source, embedding))
        doc_id = cur.fetchone()[0]  # fetch the id before the cursor closes
    conn.commit()
    return doc_id
Load a few test documents. Verify they're searchable.
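To smoke-test the pipeline, a loader along these lines is enough; the connection string and the notes/ directory are placeholders for your environment:

from pathlib import Path

import psycopg2

# Adjust the DSN for your setup.
conn = psycopg2.connect("dbname=knowledge_graph")

# Load every Markdown file from a test directory (hypothetical path).
for path in Path("notes").glob("*.md"):
    doc_id = insert_document(conn, path.stem, path.read_text(), str(path))
    print(f"Inserted {path.name} as document {doc_id}")

# Quick check: how many documents now have embeddings?
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM documents WHERE embedding IS NOT NULL")
    print("Documents with embeddings:", cur.fetchone()[0])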
Week 2: Search and Retrieval
Goal: Working semantic search and keyword search.
Day 1-2: Semantic Search
def semantic_search(conn, query, limit=10):
    # str() gives the text form that pgvector accepts as a query parameter
    query_embedding = str(embed_text(query))
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, title, content,
                   1 - (embedding <=> %s) as similarity
            FROM documents
            ORDER BY embedding <=> %s
            LIMIT %s
        """, (query_embedding, query_embedding, limit))
        return cur.fetchall()
Test with various queries. Does it find semantically related content?
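One quick check is to run a few queries you already know the answers to and skim the top hits; the queries here are placeholders:

for q in ["vector indexes", "entity extraction", "weekly review"]:
    print(f"\n== {q} ==")
    for doc_id, title, content, similarity in semantic_search(conn, q, limit=3):
        print(f"{similarity:.3f}  {title}")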
Day 3-4: Hybrid Search
def hybrid_search(conn, query, limit=10, semantic_weight=0.7):
    query_embedding = str(embed_text(query))
    # Naive AND query; plainto_tsquery is a more forgiving alternative.
    ts_query = ' & '.join(query.split())
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, title, content,
                   %s * (1 - (embedding <=> %s)) +
                   %s * COALESCE(ts_rank(to_tsvector('english', content),
                                         to_tsquery('english', %s)), 0) as score
            FROM documents
            WHERE embedding <=> %s < 0.5
               OR to_tsvector('english', content) @@ to_tsquery('english', %s)
            ORDER BY score DESC
            LIMIT %s
        """, (semantic_weight, query_embedding, 1 - semantic_weight,
              ts_query, query_embedding, ts_query, limit))
        return cur.fetchall()
Day 5-7: Query Interface
Build a simple CLI or web interface for querying:
while True:
    query = input("Search: ")
    if query == 'quit':
        break
    results = hybrid_search(conn, query)
    for r in results:
        print(f"\n{r[1]} (score: {r[3]:.3f})")
        print(r[2][:200] + "...")
Week 3: Entity Extraction and Linking
Goal: Extract entities and build the graph layer.
Day 1-3: Entity Extraction
Use an LLM to extract entities:
import json

def extract_entities(content):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract entities from this text.
Return JSON array with objects containing 'name' and 'type'.
Types: person, concept, technology, project, organization
Text: {content}"""
        }]
    )
    # Some models wrap JSON in markdown fences; strip those before parsing if needed.
    return json.loads(response.choices[0].message.content)

def process_document_entities(conn, doc_id, content):
    entities = extract_entities(content)
    for entity in entities:
        entity_id = get_or_create_entity(conn, entity['name'], entity['type'])
        link_document_entity(conn, doc_id, entity_id)
Run extraction on all documents.
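The get_or_create_entity and link_document_entity helpers called above aren't defined in this chapter; the sketch below is one minimal way to write them, followed by a loop that processes every stored document:

def get_or_create_entity(conn, name, entity_type):
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM entities WHERE name = %s AND type = %s",
                    (name, entity_type))
        row = cur.fetchone()
        if row:
            return row[0]
        cur.execute("INSERT INTO entities (name, type) VALUES (%s, %s) RETURNING id",
                    (name, entity_type))
        entity_id = cur.fetchone()[0]
    conn.commit()
    return entity_id

def link_document_entity(conn, doc_id, entity_id):
    with conn.cursor() as cur:
        cur.execute("INSERT INTO document_entities (document_id, entity_id) VALUES (%s, %s)",
                    (doc_id, entity_id))
    conn.commit()

# Run extraction over everything already in the documents table.
with conn.cursor() as cur:
    cur.execute("SELECT id, content FROM documents")
    rows = cur.fetchall()
for doc_id, content in rows:
    process_document_entities(conn, doc_id, content)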
Day 4-5: Link Discovery
Find similar documents and create links:
def discover_links(conn, similarity_threshold=0.8):
    with conn.cursor() as cur:
        # Pairwise comparison of all documents; fine at small scale.
        cur.execute("""
            SELECT d1.id, d2.id, 1 - (d1.embedding <=> d2.embedding) as sim
            FROM documents d1, documents d2
            WHERE d1.id < d2.id
              AND d1.embedding <=> d2.embedding < %s
        """, (1 - similarity_threshold,))
        for row in cur.fetchall():
            cur.execute("""
                INSERT INTO document_links
                    (source_doc_id, target_doc_id, link_type, similarity)
                VALUES (%s, %s, 'semantic', %s)
                ON CONFLICT DO NOTHING
            """, (row[0], row[1], row[2]))
    conn.commit()
Day 6-7: Graph Queries
Query the graph:
def get_related_documents(conn, doc_id, hops=2):
    """Get documents within N hops of the given document."""
    with conn.cursor() as cur:
        cur.execute("""
            WITH RECURSIVE connected AS (
                SELECT target_doc_id as id, 1 as depth
                FROM document_links WHERE source_doc_id = %s
                UNION
                SELECT source_doc_id, 1
                FROM document_links WHERE target_doc_id = %s
                UNION
                -- beyond the first hop, follow outgoing links only
                SELECT dl.target_doc_id, c.depth + 1
                FROM document_links dl
                JOIN connected c ON dl.source_doc_id = c.id
                WHERE c.depth < %s
            )
            SELECT DISTINCT d.id, d.title, c.depth
            FROM connected c
            JOIN documents d ON d.id = c.id
            ORDER BY c.depth, d.title
        """, (doc_id, doc_id, hops))
        return cur.fetchall()
Week 4: Integration and Polish
Goal: Usable system for daily use.
Day 1-2: Incremental Updates
Watch for new documents and process them automatically. A simple approach using file modification times:
import time
from pathlib import Path

def sync_directory(conn, directory, last_sync):
    for filepath in Path(directory).glob('**/*.md'):
        if filepath.stat().st_mtime > last_sync:
            content = filepath.read_text()
            doc_id = upsert_document(conn, filepath.name, content, str(filepath))
            process_document_entities(conn, doc_id, content)
    return time.time()
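The upsert_document helper used above isn't defined in this chapter; a minimal sketch, assuming the source path uniquely identifies a document:

def upsert_document(conn, title, content, source):
    """Insert a new document, or refresh an existing one keyed by its source path."""
    embedding = str(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM documents WHERE source = %s", (source,))
        row = cur.fetchone()
        if row:
            cur.execute("""
                UPDATE documents
                SET title = %s, content = %s, embedding = %s, updated_at = NOW()
                WHERE id = %s
            """, (title, content, embedding, row[0]))
            doc_id = row[0]
        else:
            cur.execute("""
                INSERT INTO documents (title, content, source, embedding)
                VALUES (%s, %s, %s, %s)
                RETURNING id
            """, (title, content, source, embedding))
            doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id
Day 3-4: MCP Server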
Expose your knowledge graph through MCP:
from mcp.server import Server

server = Server("knowledge-graph")

# Note: decorator and registration details vary across MCP SDK versions;
# adapt these to the SDK you're using. format_results and
# get_document_by_title are small helpers left to you.
@server.tool("search")
def search(query: str) -> str:
    results = hybrid_search(conn, query)
    return format_results(results)

@server.tool("related")
def related(document_title: str) -> str:
    doc_id = get_document_by_title(conn, document_title)
    results = get_related_documents(conn, doc_id)
    return format_results(results)
Day 5-7: Testing and Refinement
Load your actual notes and documents. Use the system. Fix what's broken. Tune thresholds. Improve entity extraction prompts.
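A lightweight way to guide that tuning is a fixed set of regression queries paired with the document you expect each to surface, re-run after every change; the pairs below are hypothetical:

# Each entry: (query, title of the document you expect in the top results).
CHECKS = [
    ("pgvector index tuning", "Postgres notes"),
    ("weekly review checklist", "Productivity workflow"),
]

for query, expected_title in CHECKS:
    top_titles = [row[1] for row in hybrid_search(conn, query, limit=5)]
    status = "ok" if expected_title in top_titles else "MISS"
    print(f"[{status}] {query} -> {top_titles}")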
Next Steps
After this foundation:
Improve entity extraction: Better prompts, relationship extraction, entity resolution.
Add PageRank: Compute importance scores for documents and entities (a sketch follows this list).
Build better UI: A web interface for browsing the graph.
Automate more: Background workers for processing, scheduled link discovery.
Scale if needed: Read replicas, caching, optimization.
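A sketch of the PageRank step, using networkx over document_links (the library choice is an assumption; a hand-rolled power iteration works too):

import networkx as nx

def compute_document_pagerank(conn):
    # Build a directed graph from stored links and score every document.
    graph = nx.DiGraph()
    with conn.cursor() as cur:
        cur.execute("SELECT source_doc_id, target_doc_id FROM document_links")
        graph.add_edges_from(cur.fetchall())
    return nx.pagerank(graph)  # {document_id: importance score}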
The four-week foundation gets you a working system. Evolution is continuous from there.
Related: Cluster B provides deep technical coverage of each component. C2 and C3 show how this integrates into productivity workflows.