A1

Context Windows as Working Memory

Part of AI Agent Patterns

The context window is the fundamental constraint that shapes every architectural decision in agentic systems, mirroring working memory limitations in human cognition.

The Constraint That Defines Everything

Every large language model operates within a context window. This isn't a feature--it's a fundamental constraint. A 128K token window sounds enormous until you realize that a moderately complex codebase can exceed that in its core files alone.

The context window is working memory for AI systems. Everything the model knows about your current task must fit within it. History, instructions, retrieved documents, tool outputs, conversation turns--all competing for the same limited space.

This constraint shapes every architectural decision in agentic systems. It explains why naive chatbot interfaces break down on complex tasks. It reveals why the same model can seem brilliant on simple queries and incompetent on multi-step problems.

Understanding the context window isn't optional for building systems that work. It's the starting point.


What Actually Happens in the Window

When you send a prompt to a language model, you're providing a sequence of tokens that the model uses to predict what comes next. The context window is the maximum number of tokens the model can process in a single forward pass.

Think of it as a stage. Everything the model "sees" must be on stage simultaneously. If the critical information is backstage, it doesn't exist for that prediction.

The window gets populated by:

System instructions: The rules and guidelines that shape behavior. These typically get placed at the beginning and stay fixed.

Conversation history: Previous turns of the interaction. In long conversations, this grows until it must be trimmed or summarized.

Retrieved content: Documents, code files, search results pulled in to inform the response. This is where retrieval-augmented generation lives.

Tool outputs: Results from function calls--API responses, file contents, execution results.

Current input: The immediate request being processed.

Every token in the window costs attention. The model distributes its processing capacity across everything present, so more tokens mean less attention per token. Critical instructions can get diluted when competing with verbose retrieved documents.
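To make the competition concrete, here is a minimal sketch of how those components might be assembled and measured against a window limit. The `ContextSection` type, the `approx_tokens` heuristic (roughly four characters per token), and the example contents are illustrative assumptions, not any particular provider's API.

```python
from dataclasses import dataclass


def approx_tokens(text: str) -> int:
    # Rough heuristic: about four characters per token for English text.
    # Swap in a real tokenizer for anything beyond illustration.
    return max(1, len(text) // 4)


@dataclass
class ContextSection:
    name: str
    content: str


def assemble_window(sections: list[ContextSection], limit: int) -> str:
    """Concatenate sections in order, reporting what each one costs."""
    used = 0
    parts = []
    for section in sections:
        cost = approx_tokens(section.content)
        print(f"{section.name:<22} {cost:>6} tokens")
        used += cost
        parts.append(section.content)
    print(f"{'total':<22} {used:>6} / {limit}")
    if used > limit:
        raise ValueError("window exceeded: trim, summarize, or retrieve less")
    return "\n\n".join(parts)


window = assemble_window(
    [
        ContextSection("system instructions", "You are a careful coding assistant. ..."),
        ContextSection("conversation history", "User: why does login fail? Assistant: ..."),
        ContextSection("retrieved content", "auth.py: def login(user): ..."),
        ContextSection("tool outputs", '{"status": 200, "body": "..."}'),
        ContextSection("current input", "Fix the expired-token path in login()."),
    ],
    limit=128_000,
)
```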


The Attention Distribution Problem

Not all positions in the context window are equal. Research consistently shows that information at the beginning and end of long contexts gets more reliable attention than information in the middle.

This "lost in the middle" phenomenon isn't a bug to be fixed--it's a property of how attention mechanisms work. The model processes tokens through layers of self-attention, and certain positional patterns lead to certain tokens getting more or less processing.

The implications for system design:

Put critical instructions first: System prompts at the beginning maintain influence throughout.

Put actionable context last: The immediate request and most relevant information should be positioned close to where the model will generate its response.

Be skeptical of middle content: Long retrieved passages may look comprehensive but actually contribute less than their token cost suggests.

Consider explicit reminders: For long sessions, periodically restating key instructions helps counteract decay.

This isn't about gaming the system--it's about understanding how the system actually processes information and designing accordingly.
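A minimal sketch of that ordering, assuming retrieved passages arrive already ranked most-relevant-first: rules lead, bulk material sits in the middle, and the key passage plus the immediate request sit at the end. The function name, the character threshold, and the reminder heuristic are illustrative assumptions, not a prescribed recipe.

```python
def order_for_attention(system: str, history: str,
                        retrieved: list[str], request: str) -> str:
    """Arrange the window so reliable positions carry the important content:
    rules at the start, the request and the key passage at the end, and the
    bulk of retrieval in the middle where attention is weakest."""
    key_passage = retrieved[0] if retrieved else ""
    bulk = "\n\n".join(retrieved[1:])
    # For long sessions, restate the rules near the end to counteract decay.
    reminder = f"Reminder of instructions:\n{system}" if len(history) > 40_000 else ""
    parts = [system, history, bulk, key_passage, reminder, request]
    return "\n\n".join(p for p in parts if p)
```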


Working Memory: The Human Parallel

The context window parallels human working memory with striking precision. Cognitive science identifies working memory as the mental system that holds and manipulates information during complex cognitive tasks. It's limited, it decays, and it competes for capacity.

Human working memory typically holds 4-7 chunks of information. The exact number depends on how information is encoded and how chunks are defined. But the fundamental constraint is the same: limited capacity that must be actively managed.

Both systems face the same challenges:

Interference: New information can disrupt processing of existing information. Loading a new document can push out critical context.

Decay: Information that isn't refreshed fades. Long conversations lose early context not because it was deleted but because attention has moved on.

Capacity limits: There's a hard ceiling on how much can be processed simultaneously. Exceeding it doesn't cause graceful degradation--it causes failures.

The solutions that work for AI systems work for human cognition too. External memory systems. Chunking and compression. Strategic loading and unloading of information. These aren't just analogies--they're applications of the same principles to the same type of constraint.


Practical Context Management

Managing context is an engineering discipline. The goal: maximize the signal-to-noise ratio in the window while staying within token limits.

Compression strategies: Summarize long histories. Extract key facts from verbose documents. Represent structured data compactly.

Retrieval strategies: Don't load everything--load what's relevant. Semantic search, keyword matching, and recency filtering all help select appropriate content.

Pruning strategies: Remove what's no longer needed. Old tool outputs, resolved conversation branches, and superseded information all consume tokens without adding value.

Pinning strategies: Certain information must always be present. System instructions, critical constraints, and identity-defining content get protected from rotation.
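The pruning and pinning ideas above can be sketched in a few lines. This is an illustrative policy rather than a production implementation: the `Entry` type, its flags, the oldest-first eviction order, and the four-characters-per-token estimate are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    pinned: bool = False   # system instructions, critical constraints
    stale: bool = False    # superseded info, resolved branches, old tool output


def estimated_cost(entries: list[Entry]) -> int:
    return sum(len(e.text) // 4 for e in entries)  # rough token estimate


def prune(entries: list[Entry], token_budget: int) -> list[Entry]:
    """Drop stale entries first, then the oldest unpinned entries,
    until the estimated cost fits the budget. Pinned entries never leave."""
    kept = [e for e in entries if not e.stale]
    while estimated_cost(kept) > token_budget:
        for i, entry in enumerate(kept):     # oldest-first eviction
            if not entry.pinned:
                del kept[i]
                break
        else:
            break  # only pinned entries remain; cannot shrink further
    return kept
```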

The art is in the balance. Too aggressive on compression and you lose nuance. Too conservative on pruning and you hit limits. Too narrow on retrieval and you miss relevant context. Too broad and you dilute attention.

Production systems typically implement context budgets--allocations of tokens to different categories. Perhaps 10% for system instructions, 30% for retrieved content, 40% for conversation history, 20% for current request. The percentages vary, but the discipline of explicit allocation prevents runaway growth.
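As a sketch of that discipline, assuming the split described above and a 128K window; the percentages and the example numbers are placeholders to tune per application.

```python
# Illustrative per-category token budget for a 128K window.
WINDOW = 128_000
BUDGET = {
    "system instructions":  int(WINDOW * 0.10),
    "retrieved content":    int(WINDOW * 0.30),
    "conversation history": int(WINDOW * 0.40),
    "current request":      int(WINDOW * 0.20),
}


def allowance(category: str, requested: int) -> int:
    """Tokens this category may actually occupy; the excess must be
    summarized, pruned, or left out."""
    return min(requested, BUDGET[category])


for category, requested in [("system instructions", 9_500),
                            ("retrieved content", 61_000),
                            ("conversation history", 40_000),
                            ("current request", 3_000)]:
    kept = allowance(category, requested)
    print(f"{category:<22} requested {requested:>6}, allowed {kept:>6}")
```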


Why This Matters Beyond AI

The context window constraint isn't just about building better AI systems. It's a lens for understanding any system that must operate with limited working state.

Software architectures face this constantly. How much state does a service hold in memory versus query from external storage? How does a streaming system handle backpressure when input rate exceeds processing capacity?

Human organizations face it too. How much context can a new team member hold while onboarding? How do handoffs work when the receiving person can't absorb everything the departing person knew?

The patterns that solve context window management--external storage with selective retrieval, explicit state management, strategic summarization--apply wherever capacity constraints meet complex information needs.

Understanding this constraint deeply changes how you design systems. Not just AI systems. Any system where limited working state must coordinate with rich external resources.


The ADHD Connection

For ADHD brains, the context window isn't a metaphor--it's daily experience. The feeling of losing track of what you were doing mid-task. The way important instructions slip away when new information arrives. The struggle to hold complex multi-step plans in active awareness.

The compensations that work are the same: external memory, strategic retrieval, explicit state management. Writing things down immediately. Keeping critical information visible. Structuring environments to provide the context that working memory can't hold.

This isn't weakness--it's architecture. The constraint is real. The solutions are engineering problems with engineering solutions. Recognizing the parallel doesn't minimize the challenge; it clarifies the path forward.


Related: A4 explores how memory systems extend beyond the immediate context window. D1 examines this parallel from the ADHD perspective specifically.