Skip to content

Hybrid Local/Cloud LLM System: Architecture Guide

The CLAUDE.md that powers a production hybrid routing system - complexity-based scoring from 1-10, automatic model selection across four tiers (local Qwen through cloud Opus), contextual RAG embeddings that improve retrieval by 5-10%, and the architecture that achieves 95-99% cost savings versus all-cloud

7 min read
Depth:

The architectural guide for a hybrid local/cloud LLM system - covering the complexity scoring engine (1-10 scale routing across Qwen 2.5-7B, DeepSeek-Coder, Claude Haiku, Sonnet, and Opus), the RAG system with contextual embeddings (LLM-generated context improves code retrieval by 5-10%), production hardening across 8 phases (circuit breaker, rate limiting, structured logging), the FastMCP server exposing 10 tools to Claude Desktop, and the cost tracking system that validates 95-99% savings.