C5

Building Your First Knowledge Graph

Part of Personal Infrastructure

A working knowledge graph can be built in four weeks using PostgreSQL with pgvector, following a structured progression from database schema through entity extraction to MCP integration.

From Concept to Running System

You've read about knowledge graphs. The architecture makes sense. The benefits are clear. Now: how do you actually build one?

This chapter is a practical guide: not theory, but steps. By the end, you'll have a working knowledge graph you can query.


Week 1: Foundation

Goal: Working database with basic schema and vector search capability.

Day 1-2: Environment Setup

Install PostgreSQL locally or provision a cloud instance, and install the pgvector extension package for your platform.

Create a database for your knowledge graph:

sql
CREATE DATABASE knowledge_graph;

Connect to that database and enable the extension (extensions are enabled per database, so run this inside knowledge_graph):

sql
CREATE EXTENSION vector;

Day 3-4: Basic Schema

Create core tables:

sql
-- Documents table with content and embedding
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT NOT NULL,
    source TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    embedding vector(1536)
);

-- Entities extracted from documents
CREATE TABLE entities (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Relationships between entities
CREATE TABLE relationships (
    id SERIAL PRIMARY KEY,
    source_entity_id INTEGER REFERENCES entities(id),
    target_entity_id INTEGER REFERENCES entities(id),
    relationship_type TEXT,
    confidence FLOAT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Links between documents
CREATE TABLE document_links (
    id SERIAL PRIMARY KEY,
    source_doc_id INTEGER REFERENCES documents(id),
    target_doc_id INTEGER REFERENCES documents(id),
    link_type TEXT,
    similarity FLOAT,
    created_at TIMESTAMP DEFAULT NOW(),
    -- prevents duplicate links and lets ON CONFLICT DO NOTHING work during link discovery
    UNIQUE (source_doc_id, target_doc_id)
);

-- Document-entity connections
CREATE TABLE document_entities (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    entity_id INTEGER REFERENCES entities(id)
);

Day 5: Create Indices

sql
-- Vector index for semantic search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Text index for keyword search
CREATE INDEX documents_content_idx ON documents
    USING GIN (to_tsvector('english', content));

-- Standard indices
CREATE INDEX ON document_links(source_doc_id);
CREATE INDEX ON document_links(target_doc_id);
CREATE INDEX ON document_entities(document_id);
CREATE INDEX ON document_entities(entity_id);

Day 6-7: Basic Ingestion

Write a simple script to load documents:

python
import psycopg2
from openai import OpenAI


def embed_text(text):
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def to_pgvector(embedding):
    # pgvector accepts its text format, e.g. '[0.1,0.2,...]', so a Python list
    # can be passed as a string without extra adapters
    return '[' + ','.join(str(x) for x in embedding) + ']'


def insert_document(conn, title, content, source):
    embedding = to_pgvector(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (title, content, source, embedding)
            VALUES (%s, %s, %s, %s)
            RETURNING id
        """, (title, content, source, embedding))
        doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id

Load a few test documents. Verify they're searchable.
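
A quick sanity check might look like this. The file paths and query below are placeholders; point them at a couple of your own notes:

python
# Sanity check: ingest a couple of files, then run a nearest-neighbour query
conn = psycopg2.connect(dbname="knowledge_graph")  # adjust connection details for your setup

for path in ["notes/example-one.md", "notes/example-two.md"]:  # placeholder paths
    with open(path) as f:
        insert_document(conn, title=path, content=f.read(), source=path)

with conn.cursor() as cur:
    probe = to_pgvector(embed_text("test query"))  # placeholder query
    cur.execute("""
        SELECT title, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT 3
    """, (probe, probe))
    print(cur.fetchall())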


Week 2: Search and Retrieval

Goal: Working semantic search and keyword search.

Day 1-2: Semantic Search

python
def semantic_search(conn, query, limit=10):
    # embed_text and to_pgvector come from the Week 1 ingestion script
    query_embedding = to_pgvector(embed_text(query))
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, title, content,
                   1 - (embedding <=> %s::vector) as similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, limit))
        return cur.fetchall()

Test with various queries. Does it find semantically related content?
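
For example, a handful of related phrasings should surface overlapping documents (the queries below are placeholders):

python
# Compare results for different phrasings of the same idea
for q in ["vector databases", "storing embeddings", "semantic similarity search"]:
    print(f"\n== {q}")
    for doc_id, title, content, similarity in semantic_search(conn, q, limit=3):
        print(f"  {similarity:.3f}  {title}")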

Day 3-4: Hybrid Search

python
def hybrid_search(conn, query, limit=10, semantic_weight=0.7):
    query_embedding = to_pgvector(embed_text(query))
    with conn.cursor() as cur:
        # plainto_tsquery handles punctuation in raw queries more robustly
        # than building a tsquery string by hand
        cur.execute("""
            SELECT id, title, content,
                   %s * (1 - (embedding <=> %s::vector)) +
                   %s * COALESCE(ts_rank(to_tsvector('english', content),
                                         plainto_tsquery('english', %s)), 0) as score
            FROM documents
            WHERE embedding <=> %s::vector < 0.5
               OR to_tsvector('english', content) @@ plainto_tsquery('english', %s)
            ORDER BY score DESC
            LIMIT %s
        """, (semantic_weight, query_embedding, 1 - semantic_weight, query,
              query_embedding, query, limit))
        return cur.fetchall()

Day 5-7: Query Interface

Build a simple CLI or web interface for querying:

python
# Adjust connection parameters for your environment
conn = psycopg2.connect(dbname="knowledge_graph")

while True:
    query = input("Search: ")
    if query == 'quit':
        break
    results = hybrid_search(conn, query)
    for r in results:
        print(f"\n{r[1]} (score: {r[3]:.3f})")
        print(r[2][:200] + "...")

Week 3: Entity Extraction and Linking

Goal: Extract entities and build the graph layer.

Day 1-3: Entity Extraction

Use an LLM to extract entities:

python
import json

from openai import OpenAI


def extract_entities(content):
    # Note: assumes the model returns bare JSON without markdown fences
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract entities from this text.
Return JSON array with objects containing 'name' and 'type'.
Types: person, concept, technology, project, organization

Text: {content}"""
        }]
    )
    return json.loads(response.choices[0].message.content)


def process_document_entities(conn, doc_id, content):
    entities = extract_entities(content)
    for entity in entities:
        entity_id = get_or_create_entity(conn, entity['name'], entity['type'])
        link_document_entity(conn, doc_id, entity_id)
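
get_or_create_entity and link_document_entity aren't defined above. Minimal sketches against the Week 1 schema might look like this:

python
def get_or_create_entity(conn, name, entity_type):
    # Sketch: look up an entity by name and type, creating it if missing
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM entities WHERE name = %s AND type = %s",
                    (name, entity_type))
        row = cur.fetchone()
        if row:
            return row[0]
        cur.execute("INSERT INTO entities (name, type) VALUES (%s, %s) RETURNING id",
                    (name, entity_type))
        entity_id = cur.fetchone()[0]
    conn.commit()
    return entity_id


def link_document_entity(conn, doc_id, entity_id):
    # Sketch: record that the document mentions the entity
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO document_entities (document_id, entity_id)
            VALUES (%s, %s)
        """, (doc_id, entity_id))
    conn.commit()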

Run extraction on all documents.
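
One way to do that is a single pass over everything already ingested:

python
def process_all_documents(conn):
    # One-off pass: extract entities for every document already in the table
    with conn.cursor() as cur:
        cur.execute("SELECT id, content FROM documents")
        rows = cur.fetchall()
    for doc_id, content in rows:
        process_document_entities(conn, doc_id, content)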

Day 4-5: Link Discovery

Find similar documents and create links:

python
def discover_links(conn, similarity_threshold=0.8):
    with conn.cursor() as cur:
        cur.execute("""
            SELECT d1.id, d2.id, 1 - (d1.embedding <=> d2.embedding) as sim
            FROM documents d1, documents d2
            WHERE d1.id < d2.id
              AND d1.embedding <=> d2.embedding < %s
        """, (1 - similarity_threshold,))
        for row in cur.fetchall():
            # ON CONFLICT DO NOTHING relies on the UNIQUE(source_doc_id, target_doc_id)
            # constraint, so re-running discovery doesn't duplicate links
            cur.execute("""
                INSERT INTO document_links (source_doc_id, target_doc_id, link_type, similarity)
                VALUES (%s, %s, 'semantic', %s)
                ON CONFLICT DO NOTHING
            """, (row[0], row[1], row[2]))
    conn.commit()

Day 6-7: Graph Queries

Query the graph:

python
def get_related_documents(conn, doc_id, hops=2):
    """Get documents within N hops of the given document."""
    with conn.cursor() as cur:
        cur.execute("""
            WITH RECURSIVE connected AS (
                SELECT target_doc_id as id, 1 as depth
                FROM document_links
                WHERE source_doc_id = %s
                UNION
                SELECT source_doc_id, 1
                FROM document_links
                WHERE target_doc_id = %s
                UNION
                SELECT dl.target_doc_id, c.depth + 1
                FROM document_links dl
                JOIN connected c ON dl.source_doc_id = c.id
                WHERE c.depth < %s
            )
            SELECT DISTINCT d.id, d.title, c.depth
            FROM connected c
            JOIN documents d ON d.id = c.id
            ORDER BY c.depth, d.title
        """, (doc_id, doc_id, hops))
        return cur.fetchall()
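
For instance, assuming a document with id 1 exists:

python
# Show everything within two hops of document 1
for related_id, title, depth in get_related_documents(conn, 1, hops=2):
    print(f"{depth} hop(s) away: {title} (id={related_id})")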

Week 4: Integration and Polish

Goal: A system polished enough for daily use.

Day 1-2: Incremental Updates

Watch for new documents and process them automatically. A simple approach using file modification times:

python
import time
from pathlib import Path


def sync_directory(conn, directory, last_sync):
    for filepath in Path(directory).glob('**/*.md'):
        if filepath.stat().st_mtime > last_sync:
            content = filepath.read_text()
            doc_id = upsert_document(conn, filepath.name, content, str(filepath))
            process_document_entities(conn, doc_id, content)
    return time.time()
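
upsert_document isn't shown above. A minimal sketch that matches existing rows by source path and falls back to insert_document from Week 1:

python
def upsert_document(conn, title, content, source):
    # Sketch: update the existing row for this source path, or insert a new one
    embedding = to_pgvector(embed_text(content))
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE documents
            SET title = %s, content = %s, embedding = %s, updated_at = NOW()
            WHERE source = %s
            RETURNING id
        """, (title, content, embedding, source))
        row = cur.fetchone()
    conn.commit()
    if row:
        return row[0]
    return insert_document(conn, title, content, source)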

Day 3-4: MCP Server

Expose your knowledge graph through MCP:

python
from mcp.server import Server

server = Server("knowledge-graph")


@server.tool("search")
def search(query: str) -> str:
    results = hybrid_search(conn, query)
    return format_results(results)


@server.tool("related")
def related(document_title: str) -> str:
    doc_id = get_document_by_title(conn, document_title)
    results = get_related_documents(conn, doc_id)
    return format_results(results)
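
The format_results and get_document_by_title helpers aren't defined above; minimal sketches might look like this:

python
def format_results(results):
    # Sketch: render rows as a title plus a short snippet; rows from the graph
    # query carry a depth instead of content, so only slice when it's text
    lines = []
    for row in results:
        title = row[1]
        snippet = row[2][:200] + "..." if len(row) > 2 and isinstance(row[2], str) else ""
        lines.append(f"{title}\n{snippet}".strip())
    return "\n\n".join(lines)


def get_document_by_title(conn, title):
    # Sketch: resolve a title to a document id (returns None if not found)
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM documents WHERE title = %s", (title,))
        row = cur.fetchone()
        return row[0] if row else None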

Day 5-7: Testing and Refinement

Load your actual notes and documents. Use the system. Fix what's broken. Tune thresholds. Improve entity extraction prompts.


Next Steps

After this foundation:

Improve entity extraction: Better prompts, relationship extraction, entity resolution.

Add PageRank: Compute importance scores for documents and entities (see the sketch after this list).

Build better UI: A web interface for browsing the graph.

Automate more: Background workers for processing, scheduled link discovery.

Scale if needed: Read replicas, caching, optimization.
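
As a sketch of the PageRank idea over document links (using networkx, an extra dependency not used elsewhere in this build):

python
import networkx as nx


def compute_document_pagerank(conn):
    # Sketch: build a graph from document_links and score documents by PageRank
    graph = nx.Graph()
    with conn.cursor() as cur:
        cur.execute("SELECT source_doc_id, target_doc_id, similarity FROM document_links")
        for source_id, target_id, similarity in cur.fetchall():
            graph.add_edge(source_id, target_id, weight=similarity or 1.0)
    return nx.pagerank(graph, weight='weight')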

The four-week foundation gets you a working system. Evolution is continuous from there.


Related: Cluster B provides deep technical coverage of each component. C2 and C3 show how this integrates into productivity workflows.