Local Open Source LLM Options
The complete guide to running local LLMs on an RTX 5070 with 12GB VRAM - model recommendations by task type, inference engine comparisons, quantization strategies, Claude Code integration patterns, and the multi-model architecture that handles everything from free coding assistance to privacy-first document processing
A comprehensive survey of local open-source LLM options for high-end consumer hardware - covering model recommendations by VRAM budget and task type (general, coding, reasoning, structured output), inference engines (Ollama, LM Studio, llama.cpp, vLLM), quantization formats and tradeoffs (GGUF, GPTQ, AWQ, EXL2), Claude Code and MCP integration patterns for hybrid orchestration, privacy-first workflows, RAG system architecture, and the multi-model strategy that optimizes quality per dollar across different workload types.
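Since the survey centers on an Ollama-served local stack, here is a minimal sketch of querying such a model through Ollama's default local REST endpoint (http://localhost:11434/api/generate). This is an illustrative assumption, not the guide's own code: the model name qwen2.5-coder:7b is a placeholder, and any quantized model that fits in 12GB VRAM could be substituted.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen2.5-coder:7b"  # hypothetical pick; swap in any model that fits your VRAM budget


def ask_local(prompt: str) -> str:
    """Send one prompt to the locally served model and return its full reply."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask_local("Summarize the tradeoffs of 4-bit GGUF quantization in two sentences."))
```

Because the call is plain HTTP against a local port, the same function works unchanged whether the model was pulled through Ollama, LM Studio's OpenAI-compatible server (with a different URL and payload shape), or a hand-built llama.cpp server.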
Also connected to
Your LLM can write poetry but can't reliably add two numbers - hybrid architectures solve this by routing each subtask to the system that actually handles it well
The CLAUDE.md that powers a production hybrid routing system - complexity-based scoring from 1-10, automatic model selection across four tiers (local Qwen through cloud Opus), contextual RAG embeddings that improve retrieval by 5-10%, and the architecture that achieves 95-99% cost savings versus all-cloud
A production-grade routing system that cuts LLM costs by 95-99% - complexity scoring routes simple queries to free local Ollama models while sending complex reasoning to Claude, with RAG semantic search, real-time monitoring, and 10 MCP tools for Claude Desktop integration
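The connected notes above describe a complexity-scored router that keeps cheap queries on free local models and escalates hard reasoning to Claude. The sketch below shows that idea in outline only: the tier names, thresholds, and scoring heuristic are assumptions for illustration and do not reproduce the actual CLAUDE.md configuration.

```python
import re

# Hypothetical tier table: thresholds and model names are illustrative,
# not the real configuration from the linked routing system.
TIERS = [
    (3, "ollama/qwen2.5:7b"),         # trivial lookups, formatting, boilerplate
    (6, "ollama/qwen2.5-coder:14b"),  # routine coding and summarization
    (8, "claude-sonnet"),             # multi-step reasoning
    (10, "claude-opus"),              # hardest planning and architecture questions
]


def score_complexity(query: str) -> int:
    """Crude 1-10 score: longer queries and reasoning-heavy vocabulary score higher."""
    score = 1
    score += min(4, len(query) // 200)  # length as a rough proxy for effort
    hits = len(re.findall(r"\b(prove|design|architect|trade-?offs?|debug|optimi[sz]e)\b", query, re.I))
    score += min(6, 2 * hits)           # each reasoning keyword raises the stakes
    return min(score, 10)


def route(query: str) -> str:
    """Pick the cheapest tier whose threshold covers the query's complexity score."""
    s = score_complexity(query)
    for threshold, model in TIERS:
        if s <= threshold:
            return model
    return TIERS[-1][1]


if __name__ == "__main__":
    print(route("Rename this variable across the file"))  # stays on the local tier
    print(route("Design a sharded RAG index with failover and prove the consistency trade-offs"))  # escalates to a cloud tier
```

The cost savings claimed in the connected notes come from exactly this shape of decision: most traffic scores low and never leaves the local GPU, so only the small fraction of genuinely hard queries incur cloud API charges.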