C5

Building Your First Knowledge Graph

Part of Personal Infrastructure

A working knowledge graph can be built in four weeks using PostgreSQL with pgvector, following a structured progression from database schema through entity extraction to MCP integration.

From Concept to Running System

You've read about knowledge graphs. The architecture makes sense. The benefits are clear. Now: how do you actually build one?

This chapter is a practical guide: not theory, but steps. By the end, you'll have a working knowledge graph you can query.


Week 1: Foundation

Goal: Working database with basic schema and vector search capability.

Day 1-2: Environment Setup

Install PostgreSQL locally or provision a cloud instance, and install the pgvector extension package for your platform.

Create a database for your knowledge graph:

sql
CREATE DATABASE knowledge_graph;

Connect to that database and enable the extension (extensions are enabled per database, so run this inside knowledge_graph):

sql
CREATE EXTENSION vector;

Day 3-4: Basic Schema

Create core tables:

sql
-- Documents table with content and embedding
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT NOT NULL,
    source TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    embedding vector(1536)
);

-- Entities extracted from documents
CREATE TABLE entities (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Relationships between entities
CREATE TABLE relationships (
    id SERIAL PRIMARY KEY,
    source_entity_id INTEGER REFERENCES entities(id),
    target_entity_id INTEGER REFERENCES entities(id),
    relationship_type TEXT,
    confidence FLOAT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Links between documents
CREATE TABLE document_links (
    id SERIAL PRIMARY KEY,
    source_doc_id INTEGER REFERENCES documents(id),
    target_doc_id INTEGER REFERENCES documents(id),
    link_type TEXT,
    similarity FLOAT,
    created_at TIMESTAMP DEFAULT NOW(),
    -- prevents duplicate links and lets ON CONFLICT DO NOTHING work during link discovery
    UNIQUE (source_doc_id, target_doc_id)
);

-- Document-entity connections
CREATE TABLE document_entities (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    entity_id INTEGER REFERENCES entities(id)
);

Day 5: Create Indices

sql
-- Vector index for semantic search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Text index for keyword search
CREATE INDEX documents_content_idx ON documents
    USING GIN (to_tsvector('english', content));

-- Standard indices
CREATE INDEX ON document_links(source_doc_id);
CREATE INDEX ON document_links(target_doc_id);
CREATE INDEX ON document_entities(document_id);
CREATE INDEX ON document_entities(entity_id);

Day 6-7: Basic Ingestion

Write a simple script to load documents:

python
import psycopg2
from openai import OpenAI


def embed_text(text):
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def to_pgvector(embedding):
    # pgvector accepts its text format, e.g. '[0.1,0.2,...]', so a Python list
    # can be passed as a string without extra adapters
    return '[' + ','.join(str(x) for x in embedding) + ']'


def insert_document(conn, title, content, source):
    embedding = to_pgvector(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (title, content, source, embedding)
            VALUES (%s, %s, %s, %s)
            RETURNING id
        """, (title, content, source, embedding))
        doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id

Load a few test documents. Verify they're searchable.
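
A quick sanity check might look like this. The file paths and query below are placeholders; point them at a couple of your own notes:

python
# Sanity check: ingest a couple of files, then run a nearest-neighbour query
conn = psycopg2.connect(dbname="knowledge_graph")  # adjust connection details for your setup

for path in ["notes/example-one.md", "notes/example-two.md"]:  # placeholder paths
    with open(path) as f:
        insert_document(conn, title=path, content=f.read(), source=path)

with conn.cursor() as cur:
    probe = to_pgvector(embed_text("test query"))  # placeholder query
    cur.execute("""
        SELECT title, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT 3
    """, (probe, probe))
    print(cur.fetchall())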


Week 2: Search and Retrieval

Goal: Working semantic search and keyword search.

Day 1-2: Semantic Search

python
def semantic_search(conn, query, limit=10):
    # embed_text and to_pgvector come from the Week 1 ingestion script
    query_embedding = to_pgvector(embed_text(query))
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, title, content,
                   1 - (embedding <=> %s::vector) as similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, limit))
        return cur.fetchall()

Test with various queries. Does it find semantically related content?
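
For example, a handful of related phrasings should surface overlapping documents (the queries below are placeholders):

python
# Compare results for different phrasings of the same idea
for q in ["vector databases", "storing embeddings", "semantic similarity search"]:
    print(f"\n== {q}")
    for doc_id, title, content, similarity in semantic_search(conn, q, limit=3):
        print(f"  {similarity:.3f}  {title}")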

Day 3-4: Hybrid Search

python
def hybrid_search(conn, query, limit=10, semantic_weight=0.7):
    query_embedding = to_pgvector(embed_text(query))
    with conn.cursor() as cur:
        # plainto_tsquery handles punctuation in raw queries more robustly
        # than building a tsquery string by hand
        cur.execute("""
            SELECT id, title, content,
                   %s * (1 - (embedding <=> %s::vector)) +
                   %s * COALESCE(ts_rank(to_tsvector('english', content),
                                         plainto_tsquery('english', %s)), 0) as score
            FROM documents
            WHERE embedding <=> %s::vector < 0.5
               OR to_tsvector('english', content) @@ plainto_tsquery('english', %s)
            ORDER BY score DESC
            LIMIT %s
        """, (semantic_weight, query_embedding, 1 - semantic_weight, query,
              query_embedding, query, limit))
        return cur.fetchall()

Day 5-7: Query Interface

Build a simple CLI or web interface for querying:

python
# Adjust connection parameters for your environment
conn = psycopg2.connect(dbname="knowledge_graph")

while True:
    query = input("Search: ")
    if query == 'quit':
        break
    results = hybrid_search(conn, query)
    for r in results:
        print(f"\n{r[1]} (score: {r[3]:.3f})")
        print(r[2][:200] + "...")

Week 3: Entity Extraction and Linking

Goal: Extract entities and build the graph layer.

Day 1-3: Entity Extraction

Use an LLM to extract entities:

python
import json

from openai import OpenAI


def extract_entities(content):
    # Note: assumes the model returns bare JSON without markdown fences
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract entities from this text.
Return JSON array with objects containing 'name' and 'type'.
Types: person, concept, technology, project, organization

Text: {content}"""
        }]
    )
    return json.loads(response.choices[0].message.content)


def process_document_entities(conn, doc_id, content):
    entities = extract_entities(content)
    for entity in entities:
        entity_id = get_or_create_entity(conn, entity['name'], entity['type'])
        link_document_entity(conn, doc_id, entity_id)
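
get_or_create_entity and link_document_entity aren't defined above. Minimal sketches against the Week 1 schema might look like this:

python
def get_or_create_entity(conn, name, entity_type):
    # Sketch: look up an entity by name and type, creating it if missing
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM entities WHERE name = %s AND type = %s",
                    (name, entity_type))
        row = cur.fetchone()
        if row:
            return row[0]
        cur.execute("INSERT INTO entities (name, type) VALUES (%s, %s) RETURNING id",
                    (name, entity_type))
        entity_id = cur.fetchone()[0]
    conn.commit()
    return entity_id


def link_document_entity(conn, doc_id, entity_id):
    # Sketch: record that the document mentions the entity
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO document_entities (document_id, entity_id)
            VALUES (%s, %s)
        """, (doc_id, entity_id))
    conn.commit()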

Run extraction on all documents.
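
One way to do that is a single pass over everything already ingested:

python
def process_all_documents(conn):
    # One-off pass: extract entities for every document already in the table
    with conn.cursor() as cur:
        cur.execute("SELECT id, content FROM documents")
        rows = cur.fetchall()
    for doc_id, content in rows:
        process_document_entities(conn, doc_id, content)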

Day 4-5: Link Discovery

Find similar documents and create links:

python
def discover_links(conn, similarity_threshold=0.8):
    with conn.cursor() as cur:
        cur.execute("""
            SELECT d1.id, d2.id, 1 - (d1.embedding <=> d2.embedding) as sim
            FROM documents d1, documents d2
            WHERE d1.id < d2.id
              AND d1.embedding <=> d2.embedding < %s
        """, (1 - similarity_threshold,))
        for row in cur.fetchall():
            # ON CONFLICT DO NOTHING relies on the UNIQUE(source_doc_id, target_doc_id)
            # constraint, so re-running discovery doesn't duplicate links
            cur.execute("""
                INSERT INTO document_links (source_doc_id, target_doc_id, link_type, similarity)
                VALUES (%s, %s, 'semantic', %s)
                ON CONFLICT DO NOTHING
            """, (row[0], row[1], row[2]))
    conn.commit()

Day 6-7: Graph Queries

Query the graph:

python
def get_related_documents(conn, doc_id, hops=2):
    """Get documents within N hops of the given document."""
    with conn.cursor() as cur:
        cur.execute("""
            WITH RECURSIVE connected AS (
                SELECT target_doc_id as id, 1 as depth
                FROM document_links
                WHERE source_doc_id = %s
                UNION
                SELECT source_doc_id, 1
                FROM document_links
                WHERE target_doc_id = %s
                UNION
                SELECT dl.target_doc_id, c.depth + 1
                FROM document_links dl
                JOIN connected c ON dl.source_doc_id = c.id
                WHERE c.depth < %s
            )
            SELECT DISTINCT d.id, d.title, c.depth
            FROM connected c
            JOIN documents d ON d.id = c.id
            ORDER BY c.depth, d.title
        """, (doc_id, doc_id, hops))
        return cur.fetchall()
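
For instance, assuming a document with id 1 exists:

python
# Show everything within two hops of document 1
for related_id, title, depth in get_related_documents(conn, 1, hops=2):
    print(f"{depth} hop(s) away: {title} (id={related_id})")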

Week 4: Integration and Polish

Goal: A system polished enough for daily use.

Day 1-2: Incremental Updates

Watch for new documents and process them automatically. A simple approach using file modification times:

python
import time
from pathlib import Path


def sync_directory(conn, directory, last_sync):
    for filepath in Path(directory).glob('**/*.md'):
        if filepath.stat().st_mtime > last_sync:
            content = filepath.read_text()
            doc_id = upsert_document(conn, filepath.name, content, str(filepath))
            process_document_entities(conn, doc_id, content)
    return time.time()
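
upsert_document isn't shown above. A minimal sketch that matches existing rows by source path and falls back to insert_document from Week 1:

python
def upsert_document(conn, title, content, source):
    # Sketch: update the existing row for this source path, or insert a new one
    embedding = to_pgvector(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE documents
            SET title = %s, content = %s, embedding = %s, updated_at = NOW()
            WHERE source = %s
            RETURNING id
        """, (title, content, embedding, source))
        row = cur.fetchone()
    conn.commit()
    if row:
        return row[0]
    return insert_document(conn, title, content, source)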

Day 3-4: MCP Server

Expose your knowledge graph through MCP:

python
from mcp.server import Server

server = Server("knowledge-graph")


@server.tool("search")
def search(query: str) -> str:
    results = hybrid_search(conn, query)
    return format_results(results)


@server.tool("related")
def related(document_title: str) -> str:
    doc_id = get_document_by_title(conn, document_title)
    results = get_related_documents(conn, doc_id)
    return format_results(results)
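
The format_results and get_document_by_title helpers aren't defined above; minimal sketches might look like this:

python
def format_results(results):
    # Sketch: render rows as a title plus a short snippet; rows from the graph
    # query carry a depth instead of content, so only slice when it's text
    lines = []
    for row in results:
        title = row[1]
        snippet = row[2][:200] + "..." if len(row) > 2 and isinstance(row[2], str) else ""
        lines.append(f"{title}\n{snippet}".strip())
    return "\n\n".join(lines)


def get_document_by_title(conn, title):
    # Sketch: resolve a title to a document id (returns None if not found)
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM documents WHERE title = %s", (title,))
        row = cur.fetchone()
        return row[0] if row else None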

Day 5-7: Testing and Refinement

Load your actual notes and documents. Use the system. Fix what's broken. Tune thresholds. Improve entity extraction prompts.


Next Steps

After this foundation:

Improve entity extraction: Better prompts, relationship extraction, entity resolution.

Add PageRank: Compute importance scores for documents and entities (see the sketch after this list).

Build better UI: A web interface for browsing the graph.

Automate more: Background workers for processing, scheduled link discovery.

Scale if needed: Read replicas, caching, optimization.
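
As a sketch of the PageRank idea over document links (using networkx, an extra dependency not used elsewhere in this build):

python
import networkx as nx


def compute_document_pagerank(conn):
    # Sketch: build a graph from document_links and score documents by PageRank
    graph = nx.Graph()
    with conn.cursor() as cur:
        cur.execute("SELECT source_doc_id, target_doc_id, similarity FROM document_links")
        for source_id, target_id, similarity in cur.fetchall():
            graph.add_edge(source_id, target_id, weight=similarity or 1.0)
    return nx.pagerank(graph, weight='weight')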

The four-week foundation gets you a working system. Evolution is continuous from there.


Related: Cluster B provides deep technical coverage of each component. C2 and C3 show how this integrates into productivity workflows.