Building the Knowledge Graph
From scattered notes to semantic search--the technical and personal journey of building a working knowledge graph with RAG, embeddings, and hybrid retrieval.
My first knowledge graph was a disaster. I had read about "second brain" systems, watched videos about linked notes, and downloaded Obsidian with grand ambitions. Six weeks later, I had 847 notes, no discernible structure, and a search that returned garbage regardless of what I queried.
The problem wasn't the tool. The problem was that I didn't understand what a knowledge graph actually was--or what it was for.
What a Knowledge Graph Actually Is
A knowledge graph connects information in ways that mirror how ideas actually relate. Unlike a file system (hierarchical), a database (tabular), or a document collection (flat), a knowledge graph captures the connections between concepts.
The technical foundation involves three elements:
Nodes: Individual pieces of information--notes, documents, concepts, entities. Each node contains content and metadata.
Edges: Relationships between nodes. These can be typed (is-a, relates-to, contradicts, supports) or untyped (simple links). The edges transform a collection of documents into a graph.
Properties: Attributes attached to nodes and edges--dates, tags, sources, confidence levels.
When you search a knowledge graph, you're not just finding documents that contain keywords. You're traversing a network of relationships to surface information that's semantically connected to your query.
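A minimal sketch of those three elements in Python; the names (Node, Edge, neighbors) are illustrative, not taken from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A single piece of information: a note, document, concept, or entity."""
    id: str
    content: str
    properties: dict = field(default_factory=dict)  # dates, tags, sources, confidence

@dataclass
class Edge:
    """A typed relationship between two nodes."""
    source: str
    target: str
    type: str = "relates-to"  # or is-a, contradicts, supports, ...
    properties: dict = field(default_factory=dict)

def neighbors(node_id: str, edges: list[Edge]) -> list[str]:
    """The basic traversal step behind graph search: follow edges out of a node."""
    return [e.target for e in edges if e.source == node_id]
```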
The Retrieval-Augmented Generation Pipeline
Knowledge graphs become powerful when combined with modern AI through Retrieval-Augmented Generation, or RAG.
RAG treats the knowledge graph as external memory for an AI system. When you ask a question, the system doesn't just search--it works through five steps (sketched in code after the list):
- Processes the query: Understanding intent, expanding terms, sometimes generating multiple search variants
- Retrieves relevant content: Searching the graph for nodes that match semantically
- Ranks and filters: Scoring results by relevance, removing noise
- Assembles context: Combining retrieved content into a coherent package
- Generates a response: Using the retrieved context to inform the AI's answer
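As a rough sketch, that pipeline might look like the following; `expand_query`, `graph.vector_search`, and `model.generate` are stand-ins for whatever query expansion, retrieval, and generation layers a real system provides, not actual APIs:

```python
def answer(question: str, graph, model, top_k: int = 8) -> str:
    # 1. Process the query: expand terms, generate search variants.
    variants = expand_query(question)  # hypothetical helper

    # 2. Retrieve: search the graph for semantically matching nodes.
    candidates = []
    for q in variants:
        candidates.extend(graph.vector_search(q, limit=top_k))

    # 3. Rank and filter: dedupe, score by relevance, keep the best.
    unique = {c.id: c for c in candidates}.values()
    ranked = sorted(unique, key=lambda c: c.score, reverse=True)[:top_k]

    # 4. Assemble context: combine retrieved content into one package.
    context = "\n\n".join(c.content for c in ranked)

    # 5. Generate: the retrieved context informs the model's answer.
    return model.generate(f"Context:\n{context}\n\nQuestion: {question}")
```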
The key insight: RAG treats context as a compiled view over richer underlying state--exactly the pattern we discussed in Chapter 01 for managing AI agent memory.
Vector Embeddings: The Mathematical Foundation
The magic that makes semantic search work is vector embeddings. This is how text becomes mathematically comparable.
An embedding model transforms text into high-dimensional vectors--arrays of numbers that capture semantic meaning. Two pieces of text that mean similar things will have vectors that point in similar directions, even if they use completely different words.
"The quarterly financial report shows revenue growth" and "Income increased according to the fiscal analysis" use different vocabulary but similar concepts. Their embedding vectors will be mathematically close, making them findable through similarity search.
The mathematics involves distance measures:
- Cosine similarity: How aligned two vectors are, regardless of magnitude
- Euclidean distance: The straight-line distance between vectors in high-dimensional space
- Dot product: A combination of alignment and magnitude
Vector search enables a fundamentally different kind of retrieval than keyword matching. Instead of "find documents containing these words," you can query "find documents about this concept."
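A small numerical sketch of the three measures, using toy three-dimensional vectors in place of real embeddings:

```python
import numpy as np

a = np.array([0.2, 0.9, 0.1])    # stand-in embedding for the "quarterly report" sentence
b = np.array([0.25, 0.8, 0.05])  # stand-in embedding for the "fiscal analysis" sentence

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # alignment, ignoring magnitude
euclidean = np.linalg.norm(a - b)                          # straight-line distance
dot = a @ b                                                # alignment and magnitude together

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```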
Building My Knowledge Graph: The Real Story
The theory is clean. The practice is messier.
My current knowledge graph contains over 1,700 notes--Claude conversations, meeting notes, project documentation, research artifacts, personal reflections. It's been through at least four major architectural revisions.
The first version was pure Obsidian: markdown files with wiki-style links. It worked until it didn't. At around 500 notes, search became useless. Queries returned too many results, none of them ranked by relevance. I couldn't find things I knew existed.
The second version added local embedding generation using open-source models. Now I had semantic search, but the quality was inconsistent. Technical notes embedded well; personal reflections embedded poorly. The embedding model didn't understand my domain.
The third version moved to a hybrid search approach: combining vector similarity with traditional keyword matching and graph link analysis. This was the breakthrough. A query about "context management" would find notes that (see the sketch after this list):
- Contained the exact phrase (keyword match)
- Discussed similar concepts (vector similarity)
- Were connected to notes about context (graph traversal)
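A simplified sketch of how those three signals can blend into one ranking. The weights and the three scoring helpers are illustrative placeholders, not the values or functions the system actually uses:

```python
def hybrid_score(note, query, w_keyword=0.3, w_vector=0.5, w_graph=0.2) -> float:
    """Blend keyword, semantic, and link-based relevance into a single score."""
    return (
        w_keyword * keyword_score(note, query)   # exact-phrase / full-text match
        + w_vector * vector_score(note, query)   # embedding similarity
        + w_graph * graph_score(note, query)     # closeness to already-relevant notes
    )

def hybrid_search(notes, query, limit=10):
    return sorted(notes, key=lambda n: hybrid_score(n, query), reverse=True)[:limit]
```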
The current system runs on PostgreSQL with pgvector for embeddings, a custom indexer for incremental updates, and multiple search modes that can be combined based on query type.
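For the vector leg of that stack, a pgvector similarity query looks roughly like this; the `notes` table, its columns, and the connection string are assumptions for illustration, not the actual schema:

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=knowledge")  # assumed connection string
register_vector(conn)  # lets psycopg2 send numpy arrays as pgvector values

def semantic_search(query_embedding: np.ndarray, limit: int = 10):
    """Return the notes whose embeddings are closest to the query embedding."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, 1 - (embedding <=> %s) AS similarity  -- <=> is cosine distance
            FROM notes
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, query_embedding, limit),
        )
        return cur.fetchall()
```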
The Chunking Problem
One of the most underestimated challenges in building a knowledge graph is chunking--how you break documents into pieces for embedding.
Embedding models have input limits. You can't embed a 20,000-word document as a single vector. You have to split it into chunks. How you split matters enormously.
Too small: Chunks lose context. A paragraph about "the system" doesn't embed well if the previous paragraph established what system we're discussing.
Too large: Chunks dilute meaning. If a 3,000-word chunk mentions AI once, the embedding barely registers AI as a topic.
Bad boundaries: Splitting mid-sentence or mid-concept creates chunks that don't represent coherent ideas.
My approach uses semantic boundary detection--looking for natural breakpoints in the content where topics shift. Combined with overlap (each chunk includes some content from the previous chunk for context), this produces embeddings that actually capture what the text is about.
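A condensed sketch of that idea: embed each paragraph, treat a drop in similarity between neighbors as a topic shift, and carry a little of the previous chunk forward as overlap. The `embed` argument is a placeholder for whatever embedding model you use:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk(text: str, embed, breakpoint: float = 0.6, overlap: int = 1) -> list[str]:
    """Split on paragraph boundaries where the topic shifts, keeping overlap for context."""
    paras = [p for p in text.split("\n\n") if p.strip()]
    if not paras:
        return []
    vecs = [embed(p) for p in paras]  # embed() is an assumed helper, not a real API

    chunks, current = [], [paras[0]]
    for prev, cur, para in zip(vecs, vecs[1:], paras[1:]):
        if cosine(prev, cur) < breakpoint:   # similarity drop = natural breakpoint
            chunks.append("\n\n".join(current))
            current = current[-overlap:]     # seed the next chunk with overlap
        current.append(para)
    chunks.append("\n\n".join(current))
    return chunks
```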
Link Discovery: The Graph Part of Knowledge Graph
Embeddings enable semantic search. But a knowledge graph needs explicit connections--edges between nodes that capture relationships.
Some links are explicit: wiki-style references from one note to another. These are high-signal but sparse. You only link what you remember to link.
The more powerful approach is discovered links. Using embedding similarity, the system can suggest: "This note about context windows is semantically close to this note about ADHD working memory. Should they be connected?"
These discovered links surface patterns you didn't consciously notice. They're not always right--the system might connect two notes that happen to use similar vocabulary but discuss unrelated concepts. But when they're right, they reveal the structure hidden in your own thinking.
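A sketch of that discovery step, assuming one embedding per note: compare every pair and suggest the pairs that are semantically close but not yet linked. The threshold here is arbitrary and worth tuning:

```python
import numpy as np
from itertools import combinations

def suggest_links(embeddings: dict[str, np.ndarray],
                  existing: set[tuple[str, str]],
                  threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Propose edges between semantically close notes that aren't already connected."""
    suggestions = []
    for (a, va), (b, vb) in combinations(embeddings.items(), 2):
        if (a, b) in existing or (b, a) in existing:
            continue
        sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        if sim >= threshold:
            suggestions.append((a, b, sim))  # a human still confirms or rejects these
    return suggestions
```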
The Hierarchy Problem
Early in my knowledge graph journey, I tried to impose hierarchy. Notes would be organized into folders: Projects, Research, Personal, Archive. Each folder would have subfolders.
This failed for the same reason all hierarchies fail for complex knowledge: ideas don't fit neatly into categories.
A note about "AI agent memory architecture" belongs in Research. But it also relates to a specific Project. And it connects to Personal reflections about ADHD. Where does it go?
The answer is: everywhere and nowhere. The note exists as a node in the graph. Its connections--to Research topics, Project contexts, Personal insights--are edges. The graph doesn't need folders because the structure is the connections.
Graphs are not hierarchies. Trying to impose hierarchical organization on graph-structured information creates friction and loses the very relationships you're trying to capture.
Building Your First Knowledge Graph: Practical Guidance
If you're starting from zero:
Start with a tool that supports bidirectional links. Obsidian, Logseq, Roam, Notion--the specific tool matters less than the capability to create connections between notes easily.
Write atomically. One note, one concept. If a note covers multiple ideas, split it. Atomic notes embed better and link more precisely.
Link liberally. When you're writing a note and think of a related concept, link it. Even if that linked note doesn't exist yet. Creating the link is more important than filling it immediately.
Don't organize hierarchically. Resist the urge to create folder structures. Use tags if you need categorical metadata, but let the link structure be the primary organization.
Search constantly. The test of a knowledge graph is retrieval. Before writing a new note, search for related content. This reinforces connections and surfaces knowledge you've forgotten.
Iterate the stack. Your first implementation will be wrong. Plan to revise it. The goal isn't a perfect system from day one--it's a system that evolves as you understand your actual usage patterns.
The ADHD Advantage in Graph Building
Building knowledge graphs plays to ADHD strengths.
The act of creating links is inherently pattern-recognition work. You're constantly asking "what does this connect to?" and surfacing non-obvious relationships. This is exactly the kind of associative thinking that ADHD brains do naturally.
Meanwhile, the graph compensates for ADHD weaknesses. Can't remember that note from three months ago? Search finds it. Lost the thread of a project? The graph shows what connects to what. Forgot why you made a decision? The decision log (which is part of the graph) explains.
My knowledge graph is the external memory my brain doesn't have. But building it--creating the links, surfacing the patterns, discovering the structure--uses cognitive abilities that my brain excels at.
Infrastructure that accommodates your limitations while leveraging your strengths isn't compromise. It's optimization.
The Compound Effect
A knowledge graph becomes more valuable over time, and its value grows faster than the notes accumulate.
At 100 notes, search is useful. At 500 notes, discovered links start revealing patterns. At 1,000 notes, querying the graph feels like consulting an expert who's read everything you've written.
At 1,700 notes, where I am now, the graph knows things I don't consciously remember. It surfaces connections I made years ago and forgot. It answers questions by synthesizing across dozens of sources.
Every note added increases the value of every existing note. Every link created strengthens the entire network. And because the graph is external, it doesn't suffer from the limitations that constrain biological memory.
I don't remember what I wrote last year. The graph does.
Go Deeper
Explore related deep-dives from the Constellation Clusters:
- Graph RAG Architecture - How retrieval-augmented generation works with graph structures
- Semantic Search Implementation - Vector embeddings and hybrid retrieval for meaning-based discovery
- Query Patterns and Optimization - Making retrieval efficient as your graph grows