The Freshness Problem
Knowledge grows. Every day brings new notes, updated documents, revised thinking. A knowledge graph that only reflects what existed at initial build becomes stale.
Incremental indexing solves this: processing changes efficiently as they occur rather than rebuilding everything. Add a new note, and it's searchable within seconds. Update an existing document, and the index reflects the changes.
The challenge is doing this without full reprocessing while maintaining consistency and quality.
Change Detection
Before you can process changes, you must detect them:
File modification time: Simple, but timestamps aren't always trustworthy: sync and backup tools can preserve or rewrite mtimes, and clock skew across machines muddies comparisons.
Content hashing: Hash document content. If hash changes, content changed. Reliable but requires reading entire file.
Version control hooks: If documents live in git, commit hooks can trigger indexing. Catches all committed changes.
File watching: Operating system notifications when files change. Real-time, but watchers can drop events during rapid bursts (e.g., inotify queue overflow).
Polling: Periodically scan for changes. Reliable but not immediate.
External triggers: APIs that notify when content updates. Depends on content source.
The right approach depends on your content sources. File-based knowledge bases benefit from file watching plus periodic polling as backup. API-sourced content needs webhook handlers.
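To make this concrete, here is a minimal change-detection sketch that combines an mtime pre-filter with content hashing. The `index_state` mapping, how it gets persisted, and the `.md` glob are assumptions about how you track previously indexed files, not a prescribed layout.

```python
import hashlib
from pathlib import Path

def detect_changes(root: Path, index_state: dict[str, dict]) -> dict[str, list[Path]]:
    """Compare files on disk against previously recorded (mtime, hash) pairs.

    index_state maps path -> {"mtime": float, "sha256": str}; persisting it
    (as JSON, or a database table) is left to the caller.
    """
    changes = {"new": [], "modified": [], "deleted": []}
    seen = set()

    for path in root.rglob("*.md"):          # assumed: a markdown knowledge base
        key = str(path)
        seen.add(key)
        prev = index_state.get(key)
        if prev is None:
            changes["new"].append(path)
            continue
        # Cheap pre-filter: unchanged mtime -> assume unchanged content.
        if path.stat().st_mtime == prev["mtime"]:
            continue
        # mtime changed: confirm with a content hash before reprocessing.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != prev["sha256"]:
            changes["modified"].append(path)

    # Anything previously indexed but no longer on disk was deleted.
    changes["deleted"] = [Path(p) for p in index_state if p not in seen]
    return changes
```

Running this on a polling schedule gives you the reliable backup; a file watcher can call the same logic for individual paths as events arrive.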
Processing New Documents
When a new document arrives:
- Parse content: Extract text, metadata, structure
- Generate embedding: Compute vector representation
- Extract entities: Identify concepts, people, projects mentioned
- Discover links: Compare to existing documents for relationship candidates
- Update indices: Insert into vector index, full-text index, graph
This can happen synchronously (block until complete) or asynchronously (queue for background processing).
Synchronous: Simpler, immediately consistent. But slow documents block the pipeline.
Asynchronous: More responsive, handles volume better. But there's a lag before new content is searchable.
For most personal knowledge systems, asynchronous with short queues works well. The few-second lag is rarely noticeable.
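A minimal asynchronous version of the pipeline might look like the sketch below: an in-memory queue and a background worker thread. The step functions (`parse`, `embed`, `extract_entities`, `discover_links`, `update_indices`) are placeholders for whatever your system implements.

```python
import queue
import threading

work_queue: "queue.Queue[str]" = queue.Queue()

def enqueue(path: str) -> None:
    """Called by change detection; returns immediately."""
    work_queue.put(path)

def worker() -> None:
    """Background worker: drains the queue one document at a time."""
    while True:
        path = work_queue.get()
        try:
            doc = parse(path)                   # extract text, metadata, structure
            vector = embed(doc)                 # compute vector representation
            entities = extract_entities(doc)    # concepts, people, projects
            links = discover_links(doc)         # relationship candidates
            update_indices(doc, vector, entities, links)
        except Exception as exc:
            print(f"failed to index {path}: {exc}")   # or route to a dead letter queue
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```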
Processing Updates
Updates are more complex than inserts:
Full reprocessing: Treat updates as delete + insert. Re-extract everything. Simple but wasteful if changes are small.
Differential processing: Identify what changed. Only recompute affected elements.
Hybrid: Simple content changes trigger differential. Structure changes trigger full reprocess.
Differential processing requires tracking dependencies:
- If content changes, re-embed
- If entities change, update entity index and re-discover links
- If metadata changes, update metadata index
Getting dependencies wrong leads to inconsistency. When in doubt, full reprocess is safer.
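A sketch of that dependency dispatch, assuming the old and new versions of a document are available as objects with `text`, `structure`, `entities`, and `metadata` fields (names chosen for illustration), and that the helper functions exist elsewhere in your pipeline:

```python
def process_update(old_doc, new_doc) -> None:
    """Recompute only what a change actually affects; fall back to a full
    reprocess when the document's structure changed (the hybrid strategy)."""
    if new_doc.structure != old_doc.structure:
        full_reprocess(new_doc)
        return
    if new_doc.text != old_doc.text:
        reembed(new_doc)                              # content changed -> re-embed
        new_entities = extract_entities(new_doc)
        if new_entities != old_doc.entities:          # entities changed -> entity index + links
            update_entity_index(new_doc, new_entities)
            rediscover_links(new_doc)
    if new_doc.metadata != old_doc.metadata:          # metadata changed -> metadata index
        update_metadata_index(new_doc)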
Handling Deletions
Deleted documents require cleanup:
Document removal: Delete from the document store.
Vector removal: Delete from the vector index.
Entity cleanup: Remove entities mentioned only in the deleted document.
Link cleanup: Remove links to/from the deleted document.
Graph cleanup: Recalculate PageRank and other derived metrics.
Deletions can cascade. An entity node might only exist because of one document. A link might only exist because of one connection. Proper cleanup requires dependency tracking or periodic garbage collection.
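A deletion sketch, assuming a schema with `documents`, `embeddings`, `entity_mentions`, `entities`, and `links` tables (all names illustrative) and a psycopg2-style connection where `with conn:` wraps a transaction:

```python
def delete_document(conn, doc_id: int) -> None:
    """Remove a document and everything that only existed because of it,
    in one transaction so readers never see a half-deleted state."""
    with conn:                      # commits on success, rolls back on error
        cur = conn.cursor()
        cur.execute("DELETE FROM embeddings WHERE doc_id = %s", (doc_id,))
        cur.execute("DELETE FROM links WHERE source_doc = %s OR target_doc = %s",
                    (doc_id, doc_id))
        cur.execute("DELETE FROM entity_mentions WHERE doc_id = %s", (doc_id,))
        # Entities with no remaining mentions only existed because of this document.
        cur.execute("""
            DELETE FROM entities e
            WHERE NOT EXISTS (SELECT 1 FROM entity_mentions m WHERE m.entity_id = e.id)
        """)
        cur.execute("DELETE FROM documents WHERE id = %s", (doc_id,))
    # Derived metrics (PageRank and friends) can be recalculated lazily or by a periodic job.
```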
Consistency Models
How consistent must your index be with source content?
Immediate consistency: Updates visible instantly. Requires synchronous processing, so writes carry the highest latency.
Bounded staleness: Updates visible within N seconds/minutes. Asynchronous processing with queue monitoring.
Eventual consistency: Updates visible eventually, no timing guarantee. Simplest to implement, hardest to reason about.
For most knowledge management, bounded staleness around 30-60 seconds is fine. You rarely query for content you just created.
Systems serving real-time use cases (search during editing) need tighter bounds or optimistic display strategies (show unpersisted content locally).
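One way to verify a staleness bound is to check the age of every detected-but-unprocessed change; the `detected_at` timestamps on queued changes are an assumption about what your queue records.

```python
from datetime import datetime, timezone

STALENESS_BOUND_SECONDS = 60    # the bound you claim to provide

def within_bound(pending_changes: list[dict]) -> bool:
    """True if every unprocessed change is younger than the staleness bound."""
    now = datetime.now(timezone.utc)
    for change in pending_changes:
        age = (now - change["detected_at"]).total_seconds()
        if age > STALENESS_BOUND_SECONDS:
            return False        # bound violated: alert, or fall back to synchronous processing
    return True
```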
Batch vs Stream Processing
Batch processing: Accumulate changes, process together. More efficient for large volumes. Introduces delay.
Stream processing: Process changes as they arrive. Lower latency. Higher overhead per document.
Micro-batch: Process in small batches frequently. Balances efficiency and latency.
For growing knowledge bases, micro-batching often hits the sweet spot. Queue changes for a few seconds, then process together. Amortizes overhead without noticeable delay.
The choice matters more at scale. For personal systems with single-digit documents per day, any approach works.
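A micro-batching loop is a small variation on the asynchronous worker shown earlier: drain whatever has accumulated every few seconds and process it together. The five-second window and `process_batch` helper are illustrative choices, not fixed values.

```python
import queue
import time

BATCH_WINDOW_SECONDS = 5

def micro_batch_worker(work_queue: "queue.Queue[str]") -> None:
    """Collect changes for a short window, then process them as one batch."""
    while True:
        time.sleep(BATCH_WINDOW_SECONDS)
        batch: list[str] = []
        while True:
            try:
                batch.append(work_queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            process_batch(batch)        # e.g. embed all documents in one model call
            for _ in batch:
                work_queue.task_done()
```

Batching also amortizes embedding costs: most embedding APIs and local models are cheaper per document when given many documents at once.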
Index Maintenance During Updates
Vector indices and text indices need maintenance as content changes:
HNSW index updates: pgvector HNSW indices can be updated incrementally, but significant churn degrades quality. Periodic rebuilds restore performance.
GIN index updates: PostgreSQL GIN indices buffer new entries in a pending list (fastupdate), so inserts stay fast, but the list must be flushed; VACUUM does this, or call gin_clean_pending_list() directly.
Statistics freshness: Query planner uses statistics. Outdated statistics lead to poor query plans. ANALYZE periodically.
Index bloat: Deletes and updates leave dead tuples. VACUUM reclaims space. Autovacuum helps but may need tuning.
Schedule maintenance during low-usage periods. Weekend rebuilds work for personal systems. Production systems need rolling maintenance strategies.
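A periodic maintenance job might look like the following sketch, run from cron during a low-usage window. It assumes a psycopg2-style connection, illustrative table and index names (`documents`, `documents_embedding_hnsw`), and PostgreSQL 12+ for REINDEX CONCURRENTLY; VACUUM and REINDEX CONCURRENTLY cannot run inside an explicit transaction, hence autocommit.

```python
def run_maintenance(conn) -> None:
    """Periodic index upkeep: reclaim dead tuples, refresh statistics,
    and rebuild the HNSW index to undo churn-related quality loss."""
    conn.autocommit = True          # VACUUM / REINDEX CONCURRENTLY can't run in a transaction
    cur = conn.cursor()
    cur.execute("VACUUM ANALYZE documents;")                              # dead tuples + fresh statistics
    cur.execute("REINDEX INDEX CONCURRENTLY documents_embedding_hnsw;")   # rebuild without blocking reads
```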
Failure Handling
Incremental processing can fail partway through:
Partial document processing: Embedding succeeds but entity extraction fails. Document is partially indexed.
Queue backlogs: Processing falls behind ingestion. Queue grows.
Index corruption: Index gets into bad state. Queries fail or return wrong results.
Mitigation strategies:
Atomic operations: Process documents in transactions. All-or-nothing.
Dead letter queues: Failed documents go to a separate queue for investigation.
Health monitoring: Track queue depth, processing latency, error rates.
Rebuild capability: Keep the option to fully rebuild the index from source documents when needed.
Design for failure from the start. It will happen.
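Combining the atomic-transaction and dead-letter ideas, a sketch using the same psycopg2-style connection and an assumed `dead_letter` table; `index_document` stands in for the pipeline shown earlier.

```python
def process_with_failure_handling(conn, path: str) -> None:
    """Index a document all-or-nothing; on failure, record it for later investigation."""
    try:
        with conn:                              # transaction: commit or roll back as a unit
            index_document(conn, path)          # parse, embed, extract, link, update indices
    except Exception as exc:
        with conn:
            conn.cursor().execute(
                "INSERT INTO dead_letter (path, error, failed_at) VALUES (%s, %s, now())",
                (path, str(exc)),
            )
```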
Observability
You need visibility into incremental processing:
Queue depth: How many documents waiting to process?
Processing latency: How long from change detection to index update?
Error rates: What percentage of documents fail processing?
Index freshness: What's the timestamp of the most recent indexed document?
Coverage: What percentage of source documents are indexed?
Dashboards showing these metrics let you spot problems before they compound. A slowly growing queue is easier to fix than a queue with a week of backlog.
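Most of these metrics reduce to a handful of queries, assuming a database-backed queue (unlike the in-memory sketch earlier) with illustrative `work_queue`, `dead_letter`, and `documents` tables and column names:

```python
def collect_metrics(conn) -> dict:
    """Snapshot the health metrics worth putting on a dashboard."""
    cur = conn.cursor()
    metrics = {}
    cur.execute("SELECT count(*) FROM work_queue WHERE processed_at IS NULL")
    metrics["queue_depth"] = cur.fetchone()[0]
    cur.execute("""SELECT avg(extract(epoch FROM processed_at - detected_at))
                   FROM work_queue WHERE processed_at IS NOT NULL""")
    metrics["avg_latency_seconds"] = cur.fetchone()[0]
    cur.execute("SELECT count(*) FROM dead_letter WHERE failed_at > now() - interval '1 day'")
    metrics["errors_last_24h"] = cur.fetchone()[0]
    cur.execute("SELECT max(indexed_at) FROM documents")
    metrics["index_freshness"] = cur.fetchone()[0]
    return metrics
```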
Scaling Considerations
As document volume grows, incremental processing must scale:
Parallel processing: Multiple workers processing the queue. Requires coordination to avoid conflicts.
Sharding: Partition documents across multiple databases. Each shard processes independently.
Tiered processing: Fast path for simple updates, slow path for complex reprocessing.
Priority queues: Important documents processed first.
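For parallel workers backed by PostgreSQL, `FOR UPDATE SKIP LOCKED` is a common coordination primitive: each worker claims a row no other worker currently holds. A sketch combining it with a priority column, using the same illustrative `work_queue` table:

```python
def claim_and_process_next(conn):
    """Claim one unprocessed job; concurrent workers skip rows others have locked."""
    with conn:                                        # lock held for the whole transaction
        cur = conn.cursor()
        cur.execute("""
            SELECT id, path FROM work_queue
            WHERE processed_at IS NULL
            ORDER BY priority DESC, detected_at       -- priority queue: important documents first
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        """)
        row = cur.fetchone()
        if row is None:
            return None                               # queue empty (or everything claimed)
        job_id, path = row
        process_document(conn, path)                  # fast or slow path, per document
        cur.execute("UPDATE work_queue SET processed_at = now() WHERE id = %s", (job_id,))
        return job_id
```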
For most personal knowledge bases, single-threaded processing is plenty. Think about scaling when you actually need it, not before.