
Production healthcare AI — proof of domain-specific delivery

SaaS Company KB Agent Architecture

You have a sales team that needs instant, accurate answers about your product portfolio. The information exists — scattered across SharePoint, Slack, wikis, PDFs, and the heads of people who have been at the company for a decade. A generic chatbot hallucinates features you do not have. A search engine returns twenty documents and lets the rep figure out which one is current.

The KB Agent is a different approach: a production-ready dual-agent RAG system designed specifically for healthcare product knowledge. It routes queries to specialized agents, retrieves documents through hybrid search, ranks sources by institutional authority, and maintains session context across multi-turn conversations. When facts are disputed, it follows a principled correction protocol that will terminate a session rather than propagate uncertain information downstream.

Why Two Agents Instead of One

Most RAG systems use a single retrieval-generation pipeline for every query. That works until you realize that "What is HBA?" and "Compare our bundled payment capabilities against the top three competitors for an RFP response" require fundamentally different approaches.

The Product Lookup Agent handles fast fact retrieval. It operates within a 2,000-4,000 token context budget, retrieves 3-5 documents, and returns concise answers in 2-3 seconds. When someone on a sales call needs a quick product definition or feature confirmation, this is what responds.

The Product Catalog Expert handles multi-document synthesis. It operates within a 4,000-8,000 token budget and runs in three distinct modes:

Mode            | Purpose                      | Context Docs          | Output Style
----------------|------------------------------|-----------------------|--------------------------
RFP Specs       | Proposal content generation  | 5 full + 8 metadata   | Formal, citation-heavy
Agent Context   | Internal team handoffs       | 6 full + 10 metadata  | Structured briefs
Sales Marketing | Customer-facing messaging    | 5 full + 7 metadata   | Persuasive, proof-backed
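
In configuration terms, the modes reduce to a small lookup table. A minimal sketch, assuming a dict-based config (the variable name and field choices are illustrative, not the project's actual settings):

```python
# Hypothetical mode configuration for the Product Catalog Expert.
# Counts mirror the table above; names and structure are illustrative.
EXPERT_MODES = {
    "rfp_specs":       {"full_docs": 5, "metadata_docs": 8,  "style": "formal, citation-heavy"},
    "agent_context":   {"full_docs": 6, "metadata_docs": 10, "style": "structured brief"},
    "sales_marketing": {"full_docs": 5, "metadata_docs": 7,  "style": "persuasive, proof-backed"},
}
```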

An Intent Router examines each query and routes it to the appropriate agent. When the Lookup Agent encounters comparison keywords ("compare," "vs," "versus"), synthesis requests ("summarize across," "recommend"), RFP language, or multi-product-line queries, it escalates to the Expert. This routing decision happens before retrieval, so you never waste tokens sending a simple lookup through the full Expert pipeline.
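
A minimal sketch of that escalation check, assuming a keyword-and-product-count rule like the one described above (the trigger lists and function name are illustrative):

```python
import re

# Illustrative escalation triggers drawn from the examples above; a real
# router would use a richer rule set or a classifier.
COMPARISON_TERMS = {"compare", "vs", "versus"}
SYNTHESIS_TERMS = {"summarize across", "recommend"}
RFP_TERMS = {"rfp", "proposal"}

def should_escalate(query: str, product_mentions: list[str]) -> bool:
    """Return True if the query should route to the Product Catalog Expert."""
    q = query.lower()
    words = set(re.findall(r"[a-z]+", q))
    if words & COMPARISON_TERMS:
        return True
    if any(phrase in q for phrase in SYNTHESIS_TERMS | RFP_TERMS):
        return True
    # Multi-product-line queries also escalate.
    return len(set(product_mentions)) > 1

# A quick definition stays with the Lookup Agent; a cross-product RFP comparison escalates.
assert not should_escalate("What is HBA?", ["HBA"])
assert should_escalate("Compare HBA vs PHA for an RFP response", ["HBA", "PHA"])
```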

Hybrid Retrieval and Authority Ranking

The retrieval engine combines vector search with knowledge graph traversal. When a query arrives, it first extracts product mentions — both exact matches ("HBA," "PHA," "VBC") and fuzzy matches ("population health" maps to PHA). It includes conversation history in the extraction so follow-up questions inherit context from earlier turns.
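
A sketch of that extraction step, assuming an alias map for the fuzzy matches (the product IDs come from the examples above; the alias table and function name are illustrative):

```python
# Exact acronym matches plus a small alias map for fuzzy phrases.
PRODUCT_IDS = {"HBA", "PHA", "VBC"}
ALIASES = {
    "population health": "PHA",
    "value-based care": "VBC",
}

def extract_products(query: str, history: list[str]) -> set[str]:
    """Pull product mentions from the query plus prior turns so follow-ups inherit context."""
    text = " ".join(history + [query])
    lowered = text.lower()
    found = {pid for pid in PRODUCT_IDS if pid in text}                       # exact matches like "HBA"
    found |= {pid for phrase, pid in ALIASES.items() if phrase in lowered}    # fuzzy phrase matches
    return found

# A follow-up with no explicit product name still resolves via conversation history.
extract_products("How does it handle risk stratification?",
                 ["Tell me about population health"])   # -> {"PHA"}
```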

The vector store uses ChromaDB with all-MiniLM-L6-v2 embeddings and cosine similarity. Documents are indexed with metadata including product ID, authority layer, content type, and market segment. Search returns candidates ranked by semantic similarity, but that ranking alone is insufficient for enterprise knowledge.
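
A minimal sketch of that indexing and search setup, assuming ChromaDB's sentence-transformers embedding function (the collection name, document text, and metadata values are illustrative):

```python
import chromadb
from chromadb.utils import embedding_functions

# Embeddings via all-MiniLM-L6-v2, cosine distance for the HNSW index.
embed = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./kb_index")
collection = client.get_or_create_collection(
    name="product_kb",
    embedding_function=embed,
    metadata={"hnsw:space": "cosine"},
)

# Documents carry the metadata fields described above.
collection.add(
    ids=["doc-001"],
    documents=["HBA supports risk stratification across attributed populations."],
    metadatas=[{
        "product_id": "HBA",
        "authority_layer": "product_catalog",
        "content_type": "capability",
        "market_segment": "health_system",
    }],
)

# Vector search returns semantically ranked candidates; authority re-ranking happens afterward.
results = collection.query(query_texts=["risk stratification"], n_results=5)
```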

Here is why: marketing says "up to 40% cost reduction." The product catalog says "15-25% typical reduction." An RFP template says "demonstrated 30% in pilot." All three are semantically relevant to a cost reduction query. Which one should the agent cite?

The authority hierarchy resolves this:

  • Product Catalog — 100% authority, the source of truth
  • Platform Messaging — 95% authority, application of truth
  • Competitive Intel — 80% authority, contextual comparison
  • RFP Templates — 70% authority, contextual expansion

After vector search, the system re-ranks results using a composite key: (authority_weight, relevance_rank). Authority is the primary sort criterion. Semantic relevance breaks ties within the same authority layer. The weights shift based on the user's goal — for RFP queries, RFP content moves to the top; for sales conversations, marketing messaging takes priority.
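
A sketch of that re-ranking step, assuming candidates arrive as dicts carrying their authority layer and their rank from vector search (the weights mirror the hierarchy above; the goal-based adjustment is illustrative):

```python
AUTHORITY = {
    "product_catalog": 1.00,
    "platform_messaging": 0.95,
    "competitive_intel": 0.80,
    "rfp_template": 0.70,
}

def rerank(candidates: list[dict], goal: str = "default") -> list[dict]:
    """candidates: dicts with 'authority_layer' and 'relevance_rank' (0 = best semantic match)."""
    def weight(doc: dict) -> float:
        w = AUTHORITY.get(doc["authority_layer"], 0.0)
        # Goal-aware shift: RFP queries promote RFP content, sales queries promote messaging.
        if goal == "rfp" and doc["authority_layer"] == "rfp_template":
            w = 1.0
        elif goal == "sales" and doc["authority_layer"] == "platform_messaging":
            w = 1.0
        return w
    # Authority first (descending); semantic rank breaks ties within a layer (ascending).
    return sorted(candidates, key=lambda d: (-weight(d), d["relevance_rank"]))
```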

The Knowledge Graph

The graph schema captures product relationships, buyer context, and competitive positioning across 13 node types:

Product hierarchy: ProductClass → ProductLine → Product → Capability

Buyer hierarchy: MarketSegment → SubMarketSegment → OrganizationType → DecisionRole

Jobs-to-be-done: Job, Pain, Gain (connected to both Capabilities and DecisionRoles)

Competitive: Competitor nodes linked to ProductLines

This structure means the system can traverse from a capability question ("What does our risk stratification do?") through to the buyer pain it addresses ("CFOs worried about performance under downside risk contracts") and the competitive landscape ("How does this compare to Innovaccer's approach?"). The graph does not just find documents — it understands the relationships between them.

Eighteen relationship types connect these nodes: Contains, Includes, Provides, Has, Employs, Causes, Creates, Enables, Relieves, Serves, CompetesWith, DependsOn, and others. The knowledge graph currently holds 300-400 nodes, which is small enough for fast traversal but structured enough to answer complex relational queries.
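
A tiny slice of this graph, sketched with networkx to show how a traversal follows typed relationships (the node names, the specific edges, and the helper function are illustrative):

```python
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("PHA", kind="ProductLine")
g.add_node("risk_stratification", kind="Capability")
g.add_node("downside_risk_exposure", kind="Pain")
g.add_node("cfo", kind="DecisionRole")
g.add_node("Innovaccer", kind="Competitor")

g.add_edge("PHA", "risk_stratification", relation="Provides")
g.add_edge("risk_stratification", "downside_risk_exposure", relation="Relieves")
g.add_edge("cfo", "downside_risk_exposure", relation="Has")
g.add_edge("Innovaccer", "PHA", relation="CompetesWith")

def neighbors_by_relation(node: str, relation: str) -> list[str]:
    """Follow outgoing edges of a given relationship type."""
    return [v for _, v, data in g.out_edges(node, data=True)
            if data["relation"] == relation]

# Capability -> the buyer pain it relieves
neighbors_by_relation("risk_stratification", "Relieves")   # ['downside_risk_exposure']
```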

Session State Management

Sessions are not just conversation history. They are stateful objects that track products referenced, claims made, correction events, and agent transitions. A session progresses through defined states:

Created → Active → (Escalated | Correcting) → Active → (Closed | TimedOut)

The Escalated state occurs when the Lookup Agent hands off to the Expert. The Correcting state triggers when a user disputes a factual claim. Sessions time out after 2.5 hours of inactivity.
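
A sketch of that state machine, assuming an enum plus an allowed-transition table (names follow the flow above; the guard function is illustrative):

```python
from enum import Enum

class SessionState(str, Enum):
    CREATED = "created"
    ACTIVE = "active"
    ESCALATED = "escalated"
    CORRECTING = "correcting"
    CLOSED = "closed"
    TIMED_OUT = "timed_out"

ALLOWED = {
    SessionState.CREATED:    {SessionState.ACTIVE},
    SessionState.ACTIVE:     {SessionState.ESCALATED, SessionState.CORRECTING,
                              SessionState.CLOSED, SessionState.TIMED_OUT},
    SessionState.ESCALATED:  {SessionState.ACTIVE},
    # Correcting can resume the session or, under Branch C, terminate it.
    SessionState.CORRECTING: {SessionState.ACTIVE, SessionState.CLOSED},
}

def transition(current: SessionState, nxt: SessionState) -> SessionState:
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```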

Within each session, the system maintains three memory structures:

  • Artifact Manager: Persists structured outputs (comparison tables, RFP responses) that the user might reference later
  • Scratchpad: Temporary working memory for multi-step synthesis
  • Handoff Brief: Context package prepared when escalating between agents
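
A minimal container for those three structures might look like this (the field types are assumptions for the sketch):

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    artifacts: dict[str, str] = field(default_factory=dict)   # Artifact Manager: named structured outputs
    scratchpad: list[str] = field(default_factory=list)       # working notes for multi-step synthesis
    handoff_brief: str | None = None                          # context package for Lookup -> Expert escalation
```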

The 3-Branch Correction Protocol

This is the architectural decision that most distinguishes the KB Agent from generic chatbots. When a user disputes a factual claim, the system does not apologize and move on. It verifies.

Branch A — User Correct: The system cannot verify its claim against the knowledge base. It acknowledges the error, thanks the user, and resumes the session with corrected context.

Branch B — Agent Correct: The system finds the source document supporting its claim. It cites the specific source and invites the user to verify. If the user accepts, the session resumes.

Branch C — Unresolved: The user rejects the citation in Branch B. The session terminates. The dispute is flagged for human review.

Branch C is the critical design choice. Terminating a session feels aggressive, but the alternative is worse: continuing a conversation where the user and the system disagree about facts means every subsequent response is potentially poisoned. In a healthcare context, where product claims flow into RFP responses and customer commitments, propagating disputed information creates real business risk. The system chooses integrity over conversational continuity.
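
A sketch of the three-branch flow, reduced to a single verification lookup (the type and function names are illustrative, and the caller supplies the KB search and the user prompt as callables):

```python
from enum import Enum

class CorrectionOutcome(Enum):
    USER_CORRECT = "user_correct"      # Branch A
    AGENT_CORRECT = "agent_correct"    # Branch B
    UNRESOLVED = "unresolved"          # Branch C: terminate and flag for review

def handle_dispute(claim: str, find_source, user_accepts_citation) -> CorrectionOutcome:
    source = find_source(claim)                        # search the KB for a document backing the claim
    if source is None:
        return CorrectionOutcome.USER_CORRECT          # acknowledge the error, resume with corrected context
    if user_accepts_citation(source):
        return CorrectionOutcome.AGENT_CORRECT         # cite the source, resume the session
    return CorrectionOutcome.UNRESOLVED                # terminate the session, flag for human review
```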

Query Processing Pipeline

Every query flows through five stages:

  1. Route: The Intent Router classifies the query and checks for goal shifts (the user started asking about pricing but is now asking about competitive positioning)
  2. Retrieve: Product extraction, vector search, graph traversal, authority re-ranking
  3. Compile: The View Compiler assembles the context window with a cache-optimized structure — stable system prompt prefix, semi-stable product tags in the middle, volatile conversation history and retrieved documents at the end
  4. Generate: Claude generates a response with embedded claim tracking
  5. Track: Session state update, provenance logging, analytics capture

The cache optimization in step 3 deserves attention. Anthropic's prompt caching benefits from stable prefixes, so the View Compiler orders content deliberately: the system prompt and KB schema (which rarely change) go first, followed by sorted product tags (which change per-session but stay stable within a session), followed by conversation history and retrieved documents (which change every turn). This ordering means roughly 500 tokens of the context window hit cache on every request.
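
A sketch of that ordering, assuming the compiled context is a simple list of blocks arranged from most to least stable (the function and parameter names are illustrative):

```python
def compile_context(system_prompt: str, kb_schema: str,
                    product_tags: list[str],
                    history: list[str], documents: list[str]) -> list[str]:
    """Assemble context blocks so the stable prefix stays cacheable across turns."""
    return [
        system_prompt,                     # stable: identical for every request
        kb_schema,                         # stable: rarely changes
        "\n".join(sorted(product_tags)),   # semi-stable: fixed within a session
        "\n".join(history),                # volatile: changes every turn
        "\n".join(documents),              # volatile: retrieved per query
    ]
```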

API Surface

The system exposes a FastAPI REST API with nine endpoints:

Endpoint                       | Method | Purpose
-------------------------------|--------|----------------------------
/sessions                      | POST   | Create new session
/sessions                      | GET    | List sessions
/sessions/{id}                 | GET    | Session details
/sessions/{id}                 | DELETE | Delete session
/query                         | POST   | Send message, get response
/stream/query                  | POST   | Stream response via SSE
/history/sessions              | GET    | List with filters
/history/sessions/{id}/export  | GET    | Export JSON/Markdown
/analytics/summary             | GET    | System statistics

The streaming endpoint uses Server-Sent Events with typed event payloads: start, token, citation, and done. This means the client can render citations inline as the response streams rather than waiting for the complete response.
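
A sketch of how such a typed event stream can be emitted with FastAPI and sse-starlette (the event names follow the list above; the generator body and payloads are illustrative):

```python
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.post("/stream/query")
async def stream_query(payload: dict):
    async def events():
        yield {"event": "start", "data": '{"session_id": "..."}'}
        # In the real system, tokens and citations arrive as the LLM generates them.
        for token in ["HBA ", "supports ", "risk ", "stratification."]:
            yield {"event": "token", "data": token}
        yield {"event": "citation", "data": '{"source": "illustrative-doc-id"}'}
        yield {"event": "done", "data": "{}"}
    return EventSourceResponse(events())
```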

Technology Stack and Performance

The system runs on FastAPI with async support, using Claude Sonnet for response generation, ChromaDB for vector search, and all-MiniLM-L6-v2 for embeddings. Pydantic 2.0 handles request/response validation, and SSE-Starlette provides the streaming infrastructure. The test suite comprises 159 automated tests.

Latency breaks down predictably: product extraction takes 5-10ms (regex and fuzzy matching), embedding generation 50-80ms, ChromaDB search 20-50ms (HNSW approximate nearest neighbor), authority re-ranking 1-2ms (in-memory sort), and context assembly 5-10ms. Total retrieval latency lands between 80 and 150ms before the LLM call, which means LLM generation time dominates the user-perceived latency.

What This Architecture Teaches

The KB Agent is not just a product — it is an architecture pattern. The dual-agent routing, authority-based re-ranking, and principled correction protocol are transferable to any domain where accuracy matters more than conversational fluency. If you are building a RAG system for enterprise knowledge, the core lessons are: specialize your agents by query complexity, rank your sources by institutional trust, and decide in advance what happens when facts are disputed. The answers to those questions shape everything downstream.