ACE Comprehensive Reference Specification

AI agents fail in production for a reason that has nothing to do with model capability. The models are smart enough. They can reason, plan, write code, and synthesize information at an impressive level. The failures happen upstream - in the context those models receive when making decisions.

An agent that gets the right context at the right time behaves intelligently. The same agent with bloated, stale, or irrelevant context behaves erratically. This isn't a prompting problem. It's a systems engineering problem. And the discipline that addresses it is Agentic Context Engineering - ACE.

This specification synthesizes research and production patterns from Google's Agent Development Kit, Anthropic's agent frameworks, the Manus autonomous coding agent, and Stanford/SambaNova's ACE research into a unified reference. The patterns described here are not theoretical - they're what the organizations running agents at scale have converged on independently.

The Core Problem: Memory Architecture Failures

Agent failures cluster around two memory architecture problems:

Cross-session amnesia. The agent forgets everything between conversations. Each session starts from scratch with no grounded sense of what was decided, what was tried, or where things stand. The user spends the first portion of every session rebuilding context that existed hours ago.

Within-session degradation. The agent gets worse the longer it runs within a single session. It repeats itself. It forgets constraints it acknowledged ten turns ago. It re-attempts approaches that already failed. The quality at minute 30 is measurably worse than at minute 5.

The counterintuitive insight: larger context windows often make things worse, not better. The problem isn't capacity - it's attention. Every token in the context window competes for the model's limited attention budget. As context grows, the model's ability to focus on what actually matters for the current decision gets stretched thin. Signal gets drowned by accumulation.

The Fundamental Insight: Context as Compiled View

The organizations running agents at production scale have all arrived at the same architectural insight, expressed most clearly by Google's ADK team:

Context is a compiled view over a richer stateful system.

The failed model treats context as a mutable string buffer. Every message, tool call, and result gets appended to a growing transcript. That growing transcript gets passed to the model on each turn. Eventually you hit limits and truncate. This is accumulation, and it breaks down predictably.

The correct model treats every LLM call as a freshly computed projection against durable state. You ask: what's relevant NOW? What instructions apply NOW? Which artifacts matter NOW? Which memories should surface NOW? Then you compile a minimal, high-signal context window from those answers.

The key separation: durable state and per-call views serve different purposes and must evolve independently.

Aspect   | Durable State             | Per-Call View
---------|---------------------------|---------------------------------------
Purpose  | Full record of everything | Minimal view for the current decision
Size     | Can grow arbitrarily      | Must stay small
Lifetime | Persists across calls     | Computed fresh each time
Format   | Structured events         | Optimized for the model

This separation is what enables multi-hour agent loops. The underlying state grows with every turn, but the working context stays small and focused. You can run an agent for hours without degradation because each inference call gets a freshly compiled view, not a bloated transcript.
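A minimal sketch of that separation, assuming a hypothetical DurableState event log and a compile_view projection; the selection logic is deliberately naive and only illustrates the shape of the idea:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str      # "user", "agent", "tool_call", "tool_result", ...
    content: str
    turn: int

@dataclass
class DurableState:
    """Full record of everything; grows with each turn, never sent to the model as-is."""
    events: list[Event] = field(default_factory=list)

def compile_view(state: DurableState, instructions: str,
                 relevant_kinds: set[str], recent_turns: int = 5) -> str:
    """Project a small per-call view from the durable event log."""
    latest = state.events[-1].turn if state.events else 0
    selected = [e for e in state.events
                if e.turn > latest - recent_turns or e.kind in relevant_kinds]
    lines = [instructions] + [f"[{e.kind}] {e.content}" for e in selected]
    return "\n".join(lines)   # computed fresh each call, not accumulated
```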

The Four-Layer Memory Architecture

The compiled view model requires a structured memory system. Production implementations converge on four layers:

Layer 1: Working Context

What actually gets sent to the model on each call. System instructions, selected conversation history, relevant tool outputs, memory retrieval results, artifact references. This layer is computed, not accumulated. It's assembled fresh for each step, kept minimal, and every token must justify its attention cost.

The assembly process: start with a stable prefix (identity, instructions), add context-dependent elements via processors, include only retrieved and selected content, and compile to a model-specific format.
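A rough sketch of that assembly pipeline, assuming hypothetical processor functions that each decide what is relevant for the current call; the character budget is a crude stand-in for real token accounting:

```python
from typing import Callable

# A processor inspects shared state and returns a context fragment, or "" if nothing applies.
Processor = Callable[[dict], str]

def assemble_working_context(state: dict, system_prefix: str,
                             processors: list[Processor],
                             budget_chars: int = 8_000) -> str:
    """Stable prefix first, then processor-selected content, trimmed to a budget."""
    parts = [system_prefix]
    for processor in processors:
        fragment = processor(state)
        if fragment:
            parts.append(fragment)
    context = "\n\n".join(parts)
    return context[:budget_chars]   # crude budget enforcement for the sketch

# Example processors: each answers "what matters NOW?"
recent_history = lambda s: "\n".join(s.get("recent_events", [])[-5:])
memory_hits = lambda s: "\n".join(s.get("retrieved_memories", [])[:3])
```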

Layer 2: Sessions

Structured event logs capturing the full trajectory within a single interaction. The critical word is "structured" - not raw prompt strings, but typed records: user messages, agent replies, tool calls and results, control signals, errors, state transitions.

Why structured matters: structured records are model-agnostic (you can swap models without losing history), filterable (retrieve only relevant events), and compactable (summarize older events without losing their structured metadata).
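One way such typed records might look, sketched with hypothetical event types and fields; the point is the structure, not the specific schema:

```python
from dataclasses import dataclass, asdict
from enum import Enum
import json

class EventType(str, Enum):
    USER_MESSAGE = "user_message"
    AGENT_REPLY = "agent_reply"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    ERROR = "error"

@dataclass
class SessionEvent:
    type: EventType
    payload: dict            # structured content, independent of any model's prompt format

    def to_record(self) -> str:
        return json.dumps(asdict(self))

# Filterable: pull only the errors without replaying the whole transcript.
def errors_only(events: list[SessionEvent]) -> list[SessionEvent]:
    return [e for e in events if e.type is EventType.ERROR]
```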

Layer 3: Memory

Searchable knowledge that persists across sessions. Cross-session insights, extracted patterns, user preferences, domain knowledge. This layer is queried when relevant, not permanently present. The retrieval mechanism determines what memory surfaces into working context for each call.

Memory is where learning happens. When an agent discovers that a particular API has a 100-request rate limit, that becomes a memory entry. Future sessions that interact with that API will retrieve this memory without the agent needing to rediscover it.
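A minimal sketch of this layer, with a keyword search standing in for whatever retrieval mechanism a real system would use (embeddings, BM25, or otherwise); the names and fields are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    topic: str      # e.g. a hypothetical "orders-api"
    insight: str    # e.g. "rate limited at 100 requests"
    source_session: str

class MemoryStore:
    """Cross-session memory: searchable on demand, never loaded wholesale."""
    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def add(self, entry: MemoryEntry) -> None:
        self._entries.append(entry)

    def search(self, query: str, limit: int = 3) -> list[MemoryEntry]:
        # Naive keyword match standing in for real retrieval.
        q = query.lower()
        hits = [e for e in self._entries
                if q in (e.topic + " " + e.insight).lower()]
        return hits[:limit]
```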

Layer 4: Artifacts

Large objects stored by reference - codebases, PDFs, database results, tool outputs. These are never tokenized directly into context. Instead, working context carries references (file paths, artifact IDs) that the agent can fetch on demand. The agent sees what information is available without paying the token cost unless it actually needs the content.

This is how Manus handles tool results: full results stored in the filesystem, compact references in the context window. The agent can fetch full results if needed but doesn't carry them by default. Token savings are enormous.
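A sketch of the reference pattern, assuming a simple filesystem-backed store; a production system would add metadata, eviction, and access control:

```python
from pathlib import Path
import hashlib

class ArtifactStore:
    """Keep large payloads on disk; only a short reference enters working context."""
    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, content: str) -> str:
        artifact_id = hashlib.sha256(content.encode()).hexdigest()[:12]
        (self.root / artifact_id).write_text(content)
        return artifact_id        # the reference carried in context instead of the payload

    def get(self, artifact_id: str) -> str:
        return (self.root / artifact_id).read_text()   # fetched only on demand
```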

Domain Memory: The Persistence Pattern

The four-layer architecture solves within-session problems. Domain memory solves cross-session amnesia.

Domain memory is a persistent, structured representation of work that includes:

Goals - explicit feature lists, requirements, and constraints. What the agent is trying to achieve across sessions, not just within the current one.

State - what's passing, failing, tried, broken, reverted. The ground truth of where things stand, independent of what any single session remembers.

Scaffolding - how to run, test, and extend the system. The practical knowledge that makes an agent productive from the first turn of a new session.

The key distinction: domain memory is NOT a vector database you query. It's the structured artifacts that give agents a lived context. In the GSD methodology, this is the .planning/ directory - PROJECT.md, ROADMAP.md, STATE.md, phase research, plans, and verification reports. Each file serves a specific purpose in reconstructing context for a new session.

Domain memory is what makes the difference between "Claude, let me re-explain what we're building" and "Claude already knows the project, the decisions, and the current state."
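A hypothetical sketch of how a new session could rebuild its opening context from those files; only the file names come from the .planning/ layout described above, the loader itself is illustrative:

```python
from pathlib import Path

CORE_FILES = ["PROJECT.md", "ROADMAP.md", "STATE.md"]

def load_domain_memory(planning_dir: str = ".planning") -> str:
    """Rebuild cross-session context for a new session from structured artifacts."""
    root = Path(planning_dir)
    sections = []
    for name in CORE_FILES:
        path = root / name
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(sections)   # injected into the session's opening working context
```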

Nine Scaling Principles

Production agent systems follow these principles as they scale:

1. Compile, don't accumulate. Never append indefinitely. Compute the relevant view for each call from structured state.

2. Budget attention explicitly. Track token usage across context categories. Set budgets for system instructions, history, retrieval, and response space (see the sketch after this list).

3. Separate storage from presentation. Store everything in structured form. Present only what's relevant in working context.

4. Make memory searchable, not permanent. Memory should be retrieved on demand, not loaded by default. Only load what's relevant to the current task.

5. Reference large objects, don't inline them. File contents, tool outputs, and database results should be accessible by reference, not tokenized into every call.

6. Compress progressively. Recent context gets full fidelity. Older context gets progressively summarized. Ancient context becomes key decisions and outcomes only.

7. Make context assembly observable. You should be able to inspect exactly what went into each LLM call. When something goes wrong, observability is how you diagnose it.

8. Design for context failure. Context will sometimes be wrong, stale, or missing. The system should degrade gracefully, not catastrophically.

9. Enable self-improvement. Agents should update their strategies based on execution feedback without requiring weight retraining. Memory and domain state are the mechanisms for this.
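A sketch of principle 2 in code, with placeholder budget numbers and a crude character-based token estimate standing in for a real tokenizer:

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Per-category token budgets; the numbers are placeholders, not recommendations."""
    system: int = 1_000
    history: int = 4_000
    retrieval: int = 2_000
    response_reserve: int = 1_000   # left unfilled so the model has room to answer

def fit(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Crude length cap via an estimated chars-per-token ratio."""
    max_chars = int(max_tokens * chars_per_token)
    return text if len(text) <= max_chars else text[:max_chars]

def build_context(system: str, history: str, retrieved: str,
                  budget: ContextBudget) -> str:
    return "\n\n".join([
        fit(system, budget.system),
        fit(history, budget.history),
        fit(retrieved, budget.retrieval),
    ])
```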

Nine Failure Modes

Equally important are the failure patterns to avoid:

1. The growing transcript. Appending everything to a context buffer that grows until it's truncated. The most common failure mode and the easiest to avoid.

2. Context rot. Performance degradation as token count increases, even within supported context limits. The model becomes recency-biased, paying attention to whatever appeared most recently regardless of actual importance.

3. Retrieval flooding. Loading too many retrieved results into context. Each result adds tokens that compete for attention. Ten moderately relevant results often perform worse than three highly relevant ones.

4. Instruction dilution. System instructions buried under layers of conversation history and tool outputs. The model "forgets" its instructions not because it can't see them, but because they're lost in the noise.

5. Stale context. Context that was accurate when loaded but has since been invalidated by actions taken during the session. The agent reasons from outdated information.

6. Cross-agent contamination. Context from one agent's session bleeding into another agent's working state. Agent A's debugging context shouldn't influence Agent B's planning context.

7. Ghost memory. The agent "remembers" things that aren't actually in its context - hallucinated memories from patterns in training data. This is especially dangerous when it mimics real domain memory.

8. Context thrashing. Rapidly loading and unloading context as the agent switches between sub-tasks. Each context swap introduces latency and risks losing important state.

9. Premature summarization. Compressing context before it's safe to do so, losing details that turn out to be important later. Summarization should be progressive and reversible.
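To make principle 6 and failure mode 9 concrete, here is a sketch of progressive, reversible compression: older segments are summarized in working context while the full events remain retrievable by reference. The types and function names are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Segment:
    event_ids: list[str]            # references back to the full structured session events
    full_text: str                  # always retained in durable storage
    summary: Optional[str] = None   # filled in once the segment is old enough

def compress_older(segments: list[Segment], keep_recent: int,
                   summarize: Callable[[str], str]) -> None:
    """Summarize all but the most recent segments; originals stay retrievable by ID."""
    older = segments[:-keep_recent] if keep_recent else segments
    for seg in older:
        if seg.summary is None:
            seg.summary = summarize(seg.full_text)

def render(segments: list[Segment]) -> str:
    # Working context shows summaries for old segments, full text for recent ones.
    return "\n".join(seg.summary or seg.full_text for seg in segments)
```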

The Self-Improvement Framework

The most powerful capability that correct memory architecture enables is agent self-improvement - agents that get better at their job without model retraining.

The mechanism is straightforward: after execution, the agent evaluates its performance against success criteria. What worked? What failed? What took longer than expected? These observations become memory entries that inform future decisions.

A Planner agent that notices its estimates are consistently 50% too optimistic adjusts its estimation patterns. An Executor agent that frequently encounters a particular type of integration error learns to check for that condition proactively. A Verifier agent that discovers stubs masquerading as implementations adds stub detection to its verification routine.

This isn't speculative. It's the direct consequence of having persistent memory that carries forward across sessions. Without memory, every improvement must be encoded in prompts or weights. With memory, improvements happen through natural feedback loops.
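A minimal sketch of that feedback loop, assuming a hypothetical memory interface with an add(topic, insight) method; the report fields are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ExecutionReport:
    goal: str
    succeeded: bool
    observations: list[str]    # e.g. "estimate was 2h, actual was 3h"

def reflect(report: ExecutionReport, memory) -> None:
    """Turn post-execution observations into memory entries that future sessions retrieve.
    `memory` is assumed to expose add(topic, insight); that interface is hypothetical."""
    outcome = "worked" if report.succeeded else "failed"
    for note in report.observations:
        memory.add(topic=report.goal, insight=f"{outcome}: {note}")
```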

Platform Implications

An agentic platform that takes ACE seriously must:

Enforce these patterns as intrinsic capabilities, not optional add-ons. Context compilation should be the default, not something developers opt into.

Prevent the common failure modes by design. Growing transcripts should be architecturally impossible. Context budgets should be explicit and enforced.

Enable the capabilities that only exist with correct memory architecture. Self-improvement. Multi-hour loops. Cross-session continuity. These aren't features you bolt on - they emerge from correct context management.

Provide the infrastructure for domain memory, context compilation, and performance observation. The platform handles the plumbing so developers can focus on agent behavior.

Putting It Together

ACE isn't a tool or a library. It's a discipline - a way of thinking about the relationship between AI models and the information they consume. The model is the engine. Context is the fuel. ACE is the fuel injection system that ensures the right fuel reaches the engine at the right time in the right quantity.

The organizations running agents at production scale arrived at these patterns independently because the alternative doesn't work. Naive context management produces agents that look smart in demos and fail in production. Disciplined context engineering produces agents that work reliably at scale.

The choice isn't whether to adopt these patterns. It's whether you discover them through painful experience or through studying what others have already learned.

Related

For practical context management patterns you can implement today, see Context Management Best Practices.

For the multi-tier architecture that implements ACE's memory layers, see Context Management System Architecture.

For the retrieval augmentation patterns that power Layer 3 memory, see ACE Reference: Retrieval Augmentation.