The Power of Links
Knowledge isn't just documents. It's how documents connect.
A note about "context windows" linked to a note about "ADHD working memory" creates meaning beyond either note alone. The relationship itself is information: these concepts connect, and the connection matters.
Link discovery automates finding these connections. PageRank and similar algorithms then use link structure to identify the most important nodes. The combination transforms a document collection into a navigable knowledge space.
Explicit vs Discovered Links
Some links are explicit. A document references another by name, cites a source, or includes a hyperlink. These explicit links are easy to capture: just parse the references.
Discovered links are harder but often more valuable. Two documents that never mention each other might discuss closely related concepts. A semantic similarity threshold can surface these connections: if documents A and B have embedding similarity above 0.85, create a link between them.
Explicit links:
- High precision (the author intended the connection)
- Limited recall (only connections the author thought to make)
- Directional (the reference goes one way)
Discovered links:
- Variable precision (depends on threshold and method)
- High recall (finds connections authors missed)
- Often bidirectional (similarity is symmetric)
Effective knowledge graphs include both. Explicit links provide reliable structure. Discovered links fill gaps and surface non-obvious connections.
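Capturing explicit links is mostly parsing. A minimal sketch, assuming documents use markdown-style [text](target) and wiki-style [[target]] references; both patterns and the link-type labels are illustrative:

```python
import re

# Illustrative patterns: markdown links [text](target) and wiki links [[target]].
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)]+)\)")
WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def extract_explicit_links(doc_id, text):
    """Return (source, target, link_type) tuples for references found in text."""
    links = [(doc_id, target, "markdown") for target in MD_LINK.findall(text)]
    links += [(doc_id, target, "wiki") for target in WIKI_LINK.findall(text)]
    return links

# Explicit links are directional: tuple order records who referenced whom.
```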
Link Discovery Methods
Several approaches to discovering implicit links:
Embedding similarity: Compute cosine similarity between document embeddings. Link documents above a threshold (see the sketch below).
Entity co-occurrence: If documents mention the same entities, they might be related. The more entities they share, the stronger the connection.
Citation patterns: Documents that cite similar sources are likely related, even without direct connection.
Temporal proximity: Documents created around the same time on related topics might connect.
Author overlap: Content from the same author often relates to other content from that author.
Collaborative filtering: If users who accessed document A also accessed document B, they might be related.
Each method captures different types of relatedness. Combining methods creates richer link structures.
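To make the first method concrete, a sketch of embedding-similarity discovery with numpy; the array names and the 0.85 threshold are assumptions, and the pairwise comparison is only viable for small collections (use an approximate-nearest-neighbor index at scale):

```python
import numpy as np

def discover_links(doc_ids, embeddings, threshold=0.85):
    """Link every pair of documents whose cosine similarity clears the threshold."""
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    links = []
    for i in range(len(doc_ids)):
        for j in range(i + 1, len(doc_ids)):  # j > i: similarity is symmetric
            if sims[i, j] >= threshold:
                links.append((doc_ids[i], doc_ids[j], float(sims[i, j])))
    return links
```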
Thresholding and Quality
Link discovery produces candidates. Not all candidates should become links.
Too permissive: Everything links to everything. The graph becomes noise. Traversal is useless because every path exists.
Too restrictive: Only obvious links survive. You miss the valuable non-obvious connections that make graphs powerful.
Finding the balance requires experimentation and evaluation:
Similarity threshold: Start around 0.8 for semantic similarity. Adjust based on link quality review.
Top-k connections: Instead of a fixed threshold, keep the k most similar links per document. This guarantees every document some connections without letting the graph explode (see the sketch after this list).
Confidence scoring: Assign confidence to discovered links. Use high-confidence links for traversal, low-confidence links for exploration.
Human validation: Sample discovered links and assess quality. Calculate precision. Adjust thresholds to hit target precision.
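A sketch of the top-k alternative, reusing the similarity matrix from the earlier example; the value of k and the use of raw similarity as the confidence score are assumptions:

```python
import numpy as np

def top_k_links(doc_ids, sims, k=5):
    """Keep the k most similar neighbors per document instead of a fixed cutoff."""
    links = []
    for i, row in enumerate(sims):
        row = row.copy()
        row[i] = -np.inf                 # never link a document to itself
        for j in np.argsort(row)[-k:]:   # indices of the k largest similarities
            if np.isfinite(row[j]):      # guard against tiny collections
                links.append((doc_ids[i], doc_ids[j], float(row[j])))
    return links
```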
PageRank for Knowledge Graphs
PageRank was designed for web links but applies naturally to knowledge graphs. The core idea: important documents are referenced by other important documents.
The algorithm:
- Initialize all documents with equal rank
- Iteratively redistribute rank based on incoming links
- Apply damping factor to handle sinks and cycles
- Converge to stable rank distribution
Interpretation: High PageRank documents are central to the knowledge graph. They're widely referenced and serve as connectors.
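A minimal power-iteration sketch of those steps, assuming the graph is a dict mapping each document id to the ids it links to, with every node present as a key; a fixed iteration count stands in for a proper convergence test:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [linked nodes]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}             # equal initial rank
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:                   # redistribute along links
                    new_rank[target] += share
            else:                                        # sink: spread rank uniformly
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank
```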
Use cases:
- Surface the most important documents on a topic
- Weight retrieval results by document importance
- Identify the core documents in a knowledge domain
- Find authoritative sources for synthesis
Implementation in SQL is possible with recursive CTEs, though dedicated graph processing is faster for large graphs.
Beyond PageRank
PageRank has limitations. Alternative algorithms address different needs:
HITS (Hyperlink-Induced Topic Search): Distinguishes hubs (documents that link to many others) from authorities (documents that many others link to). Sometimes you want hubs, not just authorities.
Personalized PageRank: Bias rank toward certain seed documents. "What's important relative to this starting point?" instead of global importance.
Topic-sensitive PageRank: Different importance scores for different topics. A document might be authoritative for topic A but not topic B.
Weighted PageRank: Links have weights based on strength of connection. Strong links transfer more rank than weak links.
Temporal PageRank: Recent links matter more than old links. Keeps rankings fresh as knowledge evolves.
The choice depends on your use case. General importance? Standard PageRank. Personalized exploration? Personalized PageRank. Topic-specific authority? Topic-sensitive variant.
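Several of these variants are available directly in networkx; a sketch over a toy graph (the node names, edge weights, and seed document are made up):

```python
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("notes/adhd", "notes/working-memory", 0.9),
    ("notes/working-memory", "notes/context-windows", 0.7),
    ("notes/context-windows", "notes/adhd", 0.4),
])

standard = nx.pagerank(G, alpha=0.85)                         # global importance
weighted = nx.pagerank(G, alpha=0.85, weight="weight")        # strong links transfer more rank
personal = nx.pagerank(G, personalization={"notes/adhd": 1})  # importance relative to a seed
hubs, authorities = nx.hits(G)                                # HITS hub/authority scores
```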
Graph Traversal Patterns
With links established, traversal becomes possible:
Single-hop neighbors: What's directly connected to this document? Simple and fast.
Multi-hop paths: What connects to the connections? Finds indirect relationships but risks explosion.
Shortest path: What's the most direct connection between A and B? Reveals relationship structure.
Path finding with constraints: Paths that pass through certain concepts or avoid others.
Subgraph extraction: All nodes within k hops of a starting point. Useful for context building.
Community detection: Groups of densely connected documents. Reveals topic clusters.
Traversal extends retrieval. Instead of "documents similar to query," you get "documents connected to the retrieved context." The graph structure multiplies what each query can surface.
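A sketch of these traversal patterns with networkx, over a toy undirected similarity graph (nodes and edges are illustrative):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("adhd", "working-memory"), ("working-memory", "context-windows"),
    ("context-windows", "llms"), ("llms", "retrieval"), ("adhd", "focus"),
])

neighbors = list(G.neighbors("adhd"))                                    # single-hop
within_two = nx.single_source_shortest_path_length(G, "adhd", cutoff=2)  # multi-hop
path = nx.shortest_path(G, "adhd", "llms")                               # shortest path
subgraph = nx.ego_graph(G, "adhd", radius=2)                             # k-hop subgraph
clusters = greedy_modularity_communities(G)                              # community detection
```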
Link Maintenance
Links require maintenance as knowledge evolves:
Stale links: Source document changes but link remains. Periodic revalidation needed.
Missing links: New documents don't automatically link to existing content. Re-run discovery for new content.
Broken links: Referenced document deleted. Clean up dangling references.
Quality degradation: Embedding models update, changing similarity scores. Links based on old scores may need recalculation.
Link weight updates: As documents change, connection strength may change. Weight updates keep the graph accurate.
Treat links as derived data that needs refresh, not as permanent facts. The maintenance overhead is real but manageable with good tooling.
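A sketch of stale-link detection, assuming each link carries a created_at timestamp and each document a last-modified time; the field names are assumptions:

```python
from datetime import datetime

def find_stale_links(links, doc_updated_at):
    """Return links whose source or target changed after the link was created."""
    stale = []
    for link in links:  # link: {"source": ..., "target": ..., "created_at": datetime}
        for doc in (link["source"], link["target"]):
            if doc_updated_at.get(doc, datetime.min) > link["created_at"]:
                stale.append(link)  # document moved on; re-validate this link
                break
    return stale
```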
Practical Implementation
A practical link discovery pipeline:
Batch discovery: When ingesting new documents, compute similarities against existing documents. Create links above threshold.
Scheduled refresh: Periodically recompute similarities for the full graph. Catch drift and changes.
Quality sampling: Randomly sample links and validate. Track precision over time.
Storage design: Store links in a table with source, target, type, confidence, and timestamp. Index for fast traversal (see the sketch after this list).
Graph export: For heavy graph operations, export to a graph processing framework. Import results back to PostgreSQL.
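A minimal sketch of that storage design, using sqlite3 so the example runs self-contained; the same schema and indexes carry over to PostgreSQL with types adjusted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (
    source     TEXT NOT NULL,
    target     TEXT NOT NULL,
    link_type  TEXT NOT NULL,   -- 'explicit', 'discovered', ...
    confidence REAL NOT NULL,   -- e.g. embedding similarity
    created_at TEXT NOT NULL,
    PRIMARY KEY (source, target, link_type)
);
CREATE INDEX idx_links_source ON links (source);  -- fast forward traversal
CREATE INDEX idx_links_target ON links (target);  -- fast reverse traversal
""")

conn.execute(
    "INSERT INTO links VALUES (?, ?, ?, ?, datetime('now'))",
    ("notes/adhd", "notes/working-memory", "discovered", 0.91),
)
neighbors = conn.execute(
    "SELECT target, confidence FROM links WHERE source = ?", ("notes/adhd",)
).fetchall()
```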
The goal is making links cheap to compute, store, and use. If links are expensive, they won't be maintained. If they aren't maintained, they lose value.