
Hybrid Local/Cloud LLM System README

A production-grade routing system that cuts LLM costs by 95-99%. Complexity scoring routes simple queries to free local Ollama models and sends complex reasoning to Claude, with RAG semantic search, real-time monitoring, and 10 MCP tools for Claude Desktop integration.
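The core routing idea can be sketched as a small scoring function plus a threshold. This is an illustrative heuristic only: the `complexity_score` signals, the `threshold` value, and the model labels are assumptions for the sketch, not the project's actual scorer.

```python
def complexity_score(query: str) -> float:
    """Score a query from 0.0 to 1.0 using simple lexical signals (illustrative)."""
    signals = [
        len(query.split()) > 50,  # long prompts tend to need more reasoning
        any(k in query.lower()
            for k in ("prove", "derive", "architecture", "trade-off")),  # reasoning cues
        query.count("?") > 1,     # multi-part questions
    ]
    return sum(signals) / len(signals)


def route(query: str, threshold: float = 0.3) -> str:
    """Pick a target model: local Ollama below the threshold, Claude above it."""
    if complexity_score(query) < threshold:
        return "qwen2.5:7b (local Ollama)"
    return "claude-opus (API)"


print(route("What is 2 + 2?"))  # low score, stays on the free local model
```

A real scorer would likely weigh richer features (token count, code presence, retrieval hits), but the shape — score, compare, dispatch — is the same.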


The Hybrid Local/Cloud LLM System - a complete implementation across 8 phases covering intelligent complexity-based routing (Qwen 2.5-7B for simple queries through Claude Opus for complex reasoning), RAG with contextual embeddings and ChromaDB, production hardening (circuit breaker, retry, security, logging), real-time Flask monitoring dashboard, FastMCP server with 10 tools for Claude Desktop, and hardware-specific optimization for RTX 5070 12GB VRAM.