Hybrid Local/Cloud LLM System README
A production-grade routing system that cuts LLM costs by 95-99%: complexity scoring routes simple queries to free local Ollama models and sends complex reasoning to Claude, with RAG semantic search, real-time monitoring, and 10 MCP tools for Claude Desktop integration
The Hybrid Local/Cloud LLM System: a complete implementation across 8 phases, covering intelligent complexity-based routing (Qwen 2.5-7B for simple queries through Claude Opus for complex reasoning), RAG with contextual embeddings and ChromaDB, production hardening (circuit breaker, retry, security, logging), a real-time Flask monitoring dashboard, a FastMCP server exposing 10 tools to Claude Desktop, and hardware-specific optimization for an RTX 5070 with 12GB of VRAM.
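To make the routing concrete, here is a minimal sketch of the tiered selection logic. The scoring heuristic, thresholds, and model identifiers below are illustrative assumptions, not the system's actual configuration:

```python
# Hypothetical sketch of complexity-based routing: a 1-10 score picks the
# cheapest model tier expected to handle the query. The heuristic,
# thresholds, and model names are illustrative assumptions.

def score_complexity(query: str) -> int:
    """Toy heuristic: longer queries with reasoning keywords score higher."""
    score = 1
    score += min(len(query) // 100, 4)  # length signal, capped at +4
    reasoning_markers = ("why", "prove", "design", "compare", "trade-off")
    score += sum(2 for m in reasoning_markers if m in query.lower())
    return min(score, 10)

def select_model(score: int) -> str:
    """Map the score onto four tiers, local-first to keep costs near zero."""
    if score <= 3:
        return "ollama/qwen2.5:7b"          # free, local
    if score <= 5:
        return "claude-3-5-haiku-latest"    # cheap cloud
    if score <= 8:
        return "claude-sonnet-4-5"          # mid-tier cloud
    return "claude-opus-4-1"                # complex reasoning

query = "Compare the trade-offs between RAG and fine-tuning and design a hybrid."
print(select_model(score_complexity(query)))  # scores 7 -> mid-tier cloud
```

Because most real-world traffic scores in the low tiers and runs locally for free, only the small fraction of genuinely hard queries incurs cloud cost, which is where the 95-99% savings figure comes from.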
Also connected to:
Your LLM can write poetry but can't reliably add two numbers - hybrid architectures solve this by routing each subtask to the system that actually handles it well
The CLAUDE.md that powers a production hybrid routing system: complexity-based scoring from 1-10, automatic model selection across four tiers (local Qwen through cloud Opus), contextual RAG embeddings that improve retrieval by 5-10%, and the architecture that achieves 95-99% cost savings versus an all-cloud deployment
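The contextual-embedding step can be sketched roughly as follows: each chunk gets a short document-level summary prepended before it is embedded into ChromaDB, so retrieval sees the chunk in context rather than in isolation. The collection name, summary text, and chunks below are placeholder assumptions:

```python
# Rough sketch of contextual embeddings with ChromaDB: each chunk is
# prefixed with a document-level summary before embedding, the technique
# credited above with the 5-10% retrieval improvement. Collection name,
# summary, and chunks are illustrative placeholders.
import chromadb

client = chromadb.PersistentClient(path="./rag_store")
collection = client.get_or_create_collection("docs")

doc_summary = "README for a hybrid local/cloud LLM routing system."
chunks = [
    "Complexity scoring routes simple queries to local Ollama models.",
    "Complex reasoning is escalated to Claude Opus.",
]

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    # Contextualized text: summary + chunk, embedded together.
    documents=[f"Context: {doc_summary}\n\n{chunk}" for chunk in chunks],
)

results = collection.query(query_texts=["How are cheap queries handled?"], n_results=1)
print(results["documents"][0][0])
```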
The complete guide to running local LLMs on an RTX 5070 with 12GB VRAM: model recommendations by task type, inference engine comparisons, quantization strategies, Claude Code integration patterns, and the multi-model architecture that handles everything from free coding assistance to privacy-first document processing
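As a quick smoke test of the local tier, a single call against Ollama's default REST endpoint is enough. The quantized model tag below is an assumption about what fits comfortably in 12GB of VRAM:

```python
# Minimal local-inference check against Ollama's REST API (default port
# 11434). The q4-quantized 7B tag is an assumption about what fits in
# 12GB of VRAM alongside the KV cache.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b-instruct-q4_K_M",  # ~4-5GB of weights when quantized
        "prompt": "Summarize what a circuit breaker does in one sentence.",
        "stream": False,  # return the full response as one JSON body
    },
    timeout=120,
)
print(resp.json()["response"])
```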