C2

Personal Data Infrastructure

Part of Personal Infrastructure

Building personal knowledge infrastructure with MCP servers, embeddings, and knowledge graphs creates an externalized memory system that compensates for working memory limitations and transforms how you access your own information.

Building My Own Infrastructure

The concepts in this book aren't theoretical. I built them. Personal data infrastructure--a system for managing my own knowledge, context, and productivity--became the test bed for everything I've described.

This isn't a case study from a client engagement. It's my own system. Warts and all. What worked, what didn't, and what I learned building it.


The Starting Point

Like many knowledge workers, I had information scattered everywhere:

  • Notes in multiple apps
  • Documents in cloud storage
  • Code in repositories
  • Bookmarks in browsers
  • Conversations in chat tools

Finding anything required remembering where I put it. Context switching meant losing context. Starting new projects meant reconstructing knowledge I'd developed before.

The ADHD dimension made this worse. Working memory limitations meant more dependence on external systems. Those external systems were fragmented and unreliable.

The goal: a unified infrastructure that could serve as externalized memory. Something I could query naturally and trust to surface what I needed.


Architecture Decisions

The system needed to:

  • Ingest content from multiple sources
  • Index for semantic search
  • Build a knowledge graph of connections
  • Expose querying through natural interfaces
  • Integrate with AI assistants

PostgreSQL with pgvector: Database foundation. Vector search plus relational data plus full-text search in one system.
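As a rough, dependency-free sketch of what a pgvector similarity query computes: rank stored vectors by cosine distance to a query vector (pgvector's `<=>` operator does this in-database, with indexing; the `top_k` helper and sample notes here are purely illustrative).

```python
import math

def cosine_distance(a, b):
    """Cosine distance, the quantity pgvector's <=> operator ranks by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query, rows, k=3):
    """rows: list of (note_id, embedding). Return the k closest note ids."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query, r[1]))
    return [note_id for note_id, _ in ranked[:k]]

notes = [
    ("note-a", [1.0, 0.0]),
    ("note-b", [0.0, 1.0]),
    ("note-c", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], notes, k=2))  # ['note-a', 'note-c']
```

The appeal of pgvector is that this ranking lives next to the relational metadata and full-text index, so one SQL query can combine all three.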

MCP servers: Model Context Protocol servers exposing knowledge graph tools. Let AI assistants query the knowledge base directly.

Note-taking integrations: Sync from Obsidian, Notion, and other tools where content originates.

Embedding pipeline: Process content through embedding models, chunk appropriately, store vectors.
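"Chunk appropriately" hides real decisions. A minimal word-window chunker with overlap looks like this (the sizes are illustrative defaults, not the pipeline's actual parameters; overlap keeps content that straddles a boundary retrievable from both neighboring chunks):

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last window reached the end
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
print(len(chunk_text(doc)))  # 3 chunks: windows starting at words 0, 160, 320
```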

Entity extraction: Use LLMs to identify entities and relationships, build the graph layer.
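The LLM side of extraction is a prompt asking for structured JSON; the code's job is validating what comes back. A sketch of the parsing half, assuming a hypothetical response shape with `entities` and `relations` keys (the actual prompt and schema differ):

```python
import json

# Hypothetical shape of an LLM extraction response.
RESPONSE = '''
{
  "entities": [{"name": "pgvector", "type": "tool"},
               {"name": "PostgreSQL", "type": "tool"}],
  "relations": [{"source": "pgvector", "target": "PostgreSQL",
                 "kind": "extends"}]
}
'''

def parse_extraction(raw):
    """Validate the model's JSON; return (entities, relation triples).

    Relations referencing unknown entities are dropped rather than
    trusted -- LLM output needs this kind of defensive handling.
    """
    data = json.loads(raw)
    entities = {e["name"]: e.get("type", "unknown")
                for e in data.get("entities", [])}
    triples = [(r["source"], r["kind"], r["target"])
               for r in data.get("relations", [])
               if r["source"] in entities and r["target"] in entities]
    return entities, triples

entities, triples = parse_extraction(RESPONSE)
print(triples)  # [('pgvector', 'extends', 'PostgreSQL')]
```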


MCP Development Journey

MCP--the Model Context Protocol--became central to making the infrastructure useful.

The problem MCP solves: AI assistants like Claude are powerful but have no persistent memory between conversations and no access to your private data. MCP lets them connect to custom tools and data sources.

Building my first MCP server: A server that exposes knowledge graph queries. Ask about a topic, get relevant notes. Ask for connections, get graph paths.
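The handler logic behind such a tool can be sketched without the MCP SDK wiring (server setup and tool schemas are omitted; `query_topic` and the in-memory `NOTES` store are hypothetical stand-ins for the real knowledge-base query):

```python
# Toy stand-in for the indexed knowledge base.
NOTES = {
    "pgvector": "Postgres extension for vector similarity search.",
    "mcp": "Protocol for exposing tools and data to AI assistants.",
}

def query_topic(topic: str, limit: int = 5) -> list[dict]:
    """Return notes matching a topic, with enough context (id plus
    excerpt) for the assistant to use the result directly."""
    needle = topic.lower()
    hits = [
        {"id": note_id, "excerpt": text}
        for note_id, text in NOTES.items()
        if needle in note_id.lower() or needle in text.lower()
    ]
    return hits[:limit]

print(query_topic("vector"))  # one hit: the pgvector note
```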

Iteration through use: What queries were helpful? What was missing? The server evolved based on actual usage patterns.

Tool design learnings: Clear descriptions matter. Constrained operations beat general interfaces. Return enough context for the assistant to use results effectively.

Integration with Claude Desktop: Once the MCP server connected, Claude could access my knowledge base naturally. "What do we know about X?" became answerable.
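For reference, connecting a server to Claude Desktop is an entry in its `claude_desktop_config.json` of roughly this shape (server name and path are placeholders):

```json
{
  "mcpServers": {
    "knowledge-graph": {
      "command": "python",
      "args": ["/path/to/server.py"]
    }
  }
}
```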


The Ingest Problem

Getting data into the system was harder than expected:

Format diversity: Markdown, HTML, PDFs, images with text, code files. Each needed different processing.

Metadata preservation: Creation date, source, tags, folder hierarchy. Losing metadata loses context.

Incremental updates: Changed files need re-processing. Deleted files need cleanup. New files need discovery.

Deduplication: Same content in multiple places shouldn't create multiple entries.
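Content hashing addresses both incremental updates and deduplication in one pass: hash the normalized content, skip anything whose hash matches the last run, and collapse identical content across paths. A minimal sketch (the `plan_ingest` helper and its index shape are illustrative, not the actual pipeline):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of normalized content."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def plan_ingest(incoming, index):
    """incoming: {path: text}; index: {path: hash} from the last run.
    Returns (paths to process, paths to clean up)."""
    seen = set()
    to_process = []
    for path, text in incoming.items():
        h = content_hash(text)
        if h in seen:             # same content under another path: dedupe
            continue
        seen.add(h)
        if index.get(path) != h:  # new file, or content changed
            to_process.append(path)
    to_delete = [p for p in index if p not in incoming]
    return to_process, to_delete

index = {"a.md": content_hash("hello"), "gone.md": "stale-hash"}
incoming = {"a.md": "hello", "b.md": "hello", "c.md": "new note"}
print(plan_ingest(incoming, index))  # (['c.md'], ['gone.md'])
```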

Error handling: What happens when parsing fails? Some content is unrecoverable. Graceful degradation needed.

The ingest pipeline became a project in itself. Robust ingestion underlies everything else.


Query Patterns That Emerged

Once the system was running, patterns emerged in how I used it:

"What do I know about X?": Semantic search for a topic. Surface relevant notes regardless of where they live.

"How is X connected to Y?": Graph traversal. Find the relationship path between concepts.

"What was I thinking about last week?": Temporal query combined with content search.

"What's the most important thing about X?": PageRank-weighted retrieval. Central notes first.

"Summarize my notes on X": Retrieve, then synthesize. The AI assistant does the synthesis part.

These patterns informed tool design. The queries people actually make should be easy to express.


Failures and Lessons

Not everything worked:

Over-engineering the schema: Started with elaborate entity types. Simplified drastically after seeing actual usage patterns.

Embedding model changes: Upgraded models, invalidated all embeddings, had to re-process everything. Now I track model versions and plan for this.
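Recording the model identifier next to each vector turns a model upgrade from "wipe everything" into a filtered re-embedding job. A sketch of the idea (the `EMBEDDING_MODEL` name and row shape are hypothetical):

```python
EMBEDDING_MODEL = "text-embedder-v2"  # hypothetical current model id

def stale_rows(rows, current_model=EMBEDDING_MODEL):
    """rows: [{'id': ..., 'model': ...}] -- return ids embedded with a
    different model version, i.e. the ones needing re-embedding."""
    return [r["id"] for r in rows if r["model"] != current_model]

rows = [
    {"id": 1, "model": "text-embedder-v1"},
    {"id": 2, "model": "text-embedder-v2"},
]
print(stale_rows(rows))  # [1]
```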

Graph noise: Aggressive link discovery created too many connections. Finding signal required raising thresholds.

Sync conflicts: Multiple apps writing to the same content created conflicts. Needed conflict resolution policies.

Search quality issues: Some queries returned irrelevant results. Required tuning retrieval and adding feedback loops.

Each failure taught something. The system today is better because of what broke.


Daily Workflow Integration

The infrastructure only matters if it's used. Integration into daily workflow was essential:

Morning context: Query for recent developments, upcoming deadlines, in-progress work.

Research sessions: Query the knowledge base before searching externally. Often the answer is already captured.

End-of-day capture: New notes, decisions, and learnings go into the system. Tomorrow's morning context is today's capture.

Project transitions: When switching projects, query for relevant context. Don't start cold.

Writing support: When writing (like this book), query for relevant notes, sources, and previous thinking.

The habit matters as much as the infrastructure. Systems that aren't used provide no value.


ROI Assessment

Was it worth it?

Time spent building: Substantial. Weeks of development over months.

Time saved: Hard to quantify precisely. Faster context retrieval, less re-searching, more connections surfaced.

Quality improvements: Better recall of past decisions. Fewer mistakes from forgotten context.

Compound effects: The value increased over time as the knowledge base grew.

For my work style and ADHD needs, clearly worth it. But this is personal infrastructure. The investment makes sense for heavy knowledge work. It might not for other contexts.


Open Questions

The system continues evolving. Current questions:

Better entity resolution: Many entities are duplicated with slightly different names. How to merge automatically?

Automated insight generation: Can the system surface interesting patterns proactively, not just reactively?

Multi-user extensions: Could this pattern work for a team, not just an individual?

Mobile access: Currently desktop-only. Mobile would extend usefulness.

Interoperability: Connecting to others' knowledge graphs while preserving privacy.

Building personal infrastructure isn't a one-time project. It's ongoing development of systems that support how you work.


Related: C3 discusses the productivity system layer built on this infrastructure. Cluster B provides technical depth on graph implementation.