A5

Error Recovery and Resilience

Part of AI Agent Patterns

In agentic systems, failure is the norm rather than the exception, making resilience and recovery primary design considerations from the start.

Failure Is Not Exceptional

In traditional software, errors are exceptional cases. You design for the happy path and handle exceptions when they occur.

In agentic systems, failure is the norm. Tools time out. APIs return errors. The model misunderstands. Context overflows. Something always goes wrong on any sufficiently complex task.

Building resilient agents means treating failure as a primary design consideration, not an afterthought. How does the agent detect failure? How does it recover? How does it avoid the same failure next time?


The Nine Failure Modes Revisited

Chapter 04 catalogued nine failure modes. Let's examine recovery strategies for each:

1. Context Overflow

Detection: Token count exceeds limit or approaches it. Recovery: Aggressive summarization. Prune old content. Offload to memory layer. Prevention: Context budgeting. Streaming summarization throughout the session.
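
A minimal sketch of a context-budget check, assuming a crude characters-per-token estimate and an arbitrary 80% threshold; a real system would use the model's tokenizer and its own compaction policy:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); a real tokenizer is more accurate.
    return len(text) // 4

def enforce_budget(messages: list[str], budget: int, summarize) -> list[str]:
    """Compact the conversation when it approaches the token budget."""
    used = sum(estimate_tokens(m) for m in messages)
    if used < 0.8 * budget:  # assumed threshold: act before the hard limit
        return messages
    # Summarize older content; keep the most recent turns verbatim.
    older, recent = messages[:-5], messages[-5:]
    return [summarize(older)] + recent
```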

2. Lost in the Middle

Detection: Agent ignores or contradicts mid-context content. Recovery: Elevate critical information to beginning or end. Prevention: Strategic information placement. Periodic reminders.

3. Instruction Decay

Detection: Agent behavior drifts from original instructions over long sessions. Recovery: Reinject original instructions. Summarize with instruction emphasis. Prevention: Pinned context that never rotates out.

4. Tool Confusion

Detection: Wrong tool called. Wrong parameters. Results misinterpreted. Recovery: Clearer tool descriptions. Retry with explicit guidance. Prevention: Tool design review. Reduced overlap between tools.

5. Loop Traps

Detection: Same action repeated without progress. Recovery: Force alternative approach. Inject problem-solving strategies. Prevention: Loop detection with intervention triggers.
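
One way to detect a loop trap is to fingerprint recent actions and intervene when the same call keeps recurring without new results. A sketch; the window size and threshold are assumptions:

```python
from collections import Counter

def detect_loop(actions: list[tuple[str, str]], window: int = 8, threshold: int = 3) -> bool:
    """actions: (tool_name, serialized_arguments) pairs, most recent last."""
    recent = actions[-window:]
    counts = Counter(recent)
    # Three identical calls within the last eight turns suggests the agent is stuck.
    return any(count >= threshold for count in counts.values())
```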

6. Hallucination Cascades

Detection: Verifiable facts are wrong. Downstream reasoning built on fabrications. Recovery: Fact-check against tools. Restart from verified state. Prevention: Grounding checks at key decision points.

7. Goal Drift

Detection: Agent optimizes for intermediate metrics, losing sight of the original objective. Recovery: Restate the original goal. Realign evaluation criteria. Prevention: Goal pinning. Periodic objective checks.

8. Permission Failures

Detection: Agent attempts unauthorized actions. Recovery: Inform agent of constraints. Suggest permitted alternatives. Prevention: Clear capability boundaries in system prompt.

9. Recovery Failures

Detection: Agent can't restart after interruption. State is corrupted or lost. Recovery: Checkpoint restoration. Summary-based reconstruction. Prevention: Frequent checkpoints. Recoverable state design.


Error Detection Patterns

You can't recover from errors you don't detect. Detection strategies:

Exit code monitoring: Tools return success/failure status. Act on failures immediately.

Output validation: Check that tool outputs match expected schemas. Malformed results indicate problems.

Semantic validation: Results should make sense in context. A file read that returns "permission denied" isn't a successful read.

Progress monitoring: If the agent runs many turns without advancing toward the goal, something's wrong.

Resource monitoring: Track token usage, API call counts, time elapsed. Anomalies suggest problems.

Human feedback: Sometimes only a human can tell if results are correct. Build in feedback collection points.

The more sophisticated your detection, the earlier you catch problems. Early detection means cheaper recovery.
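
Output validation and semantic validation can both be expressed as checks applied to every tool result before it enters context. A sketch, assuming a hypothetical ToolResult shape:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:            # hypothetical result shape, not a real library type
    tool: str
    exit_code: int
    output: str

def validate(result: ToolResult) -> list[str]:
    """Return a list of problems; an empty list means the result looks usable."""
    problems = []
    if result.exit_code != 0:                          # exit code monitoring
        problems.append(f"{result.tool} returned exit code {result.exit_code}")
    if not result.output.strip():                      # output validation
        problems.append(f"{result.tool} returned empty output")
    if "permission denied" in result.output.lower():   # semantic validation
        problems.append(f"{result.tool} hit a permission error")
    return problems
```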


Recovery Strategies

Once an error is detected, how do you recover?

Retry: Try the same operation again. Works for transient failures--network glitches, rate limits, temporary unavailability.

Retry with backoff: Retry but wait longer between attempts. Prevents hammering struggling systems.

Alternative approach: Try a different method to achieve the same goal. File not readable? Try a different path. API down? Use cached data.

Graceful degradation: Accept partial success. Can't get perfect answer? Provide best available answer with caveats.

Rollback: Restore to a known good state and try a different path. Requires checkpoints.

Escalation: Ask for help. Human intervention, specialized agent, or different model.

Controlled failure: When recovery isn't possible, fail cleanly. Clear error message, preserved state, actionable next steps.

The choice depends on failure type, cost of retry, and criticality of success. Transient failures warrant retry. Systemic failures need alternative approaches. Unrecoverable failures need clean exit.
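
Retry with backoff is the most mechanical of these strategies. A minimal sketch, with the attempt count and delays as assumptions; a production version would also add jitter and distinguish retryable from fatal errors:

```python
import time

def retry_with_backoff(operation, attempts: int = 4, base_delay: float = 1.0):
    """Call operation(); on failure, wait exponentially longer before retrying."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))    # 1s, 2s, 4s, ...
```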


Circuit Breaker Pattern

When a dependency fails repeatedly, stop trying. This is the circuit breaker pattern:

Closed state: Normal operation. Calls proceed through.

Open state: Dependency is failing. Calls fail immediately without attempting.

Half-open state: Test whether the dependency has recovered. If yes, return to closed. If not, return to open.

Circuit breakers prevent cascade failures. When one service is down, hammering it with requests makes everything worse. Failing fast and pursuing alternatives preserves system health.

The agent equivalent: if a tool keeps failing, stop calling it. Note the failure in context. Try alternatives or report the limitation.
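
A minimal circuit breaker sketch; the failure threshold and cool-down period are assumptions, and a production version would track state per tool and count successes in the half-open state:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None            # timestamp when the breaker tripped

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")  # fail fast
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0                # success closes the circuit again
        return result
```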


Idempotency and Safe Retry

Safe retry requires idempotent operations. Calling the same operation twice should have the same effect as calling it once.

Naturally idempotent: Reading a file. Querying a database. Checking status.

Made idempotent: Using unique identifiers for writes. "Create item if not exists" rather than "create item."

Not idempotent: Incrementing a counter. Sending an email. Executing arbitrary code.

For non-idempotent operations, you need additional mechanisms:

  • Track which operations have been attempted
  • Check state before retrying
  • Use transactions where available
  • Accept that some failures can't be safely retried

Designing for idempotency from the start makes recovery much simpler.
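
One common way to make a write idempotent is to attach a caller-supplied key and skip work that has already been done. A sketch, with an in-memory dictionary standing in for whatever durable store the real system uses:

```python
completed: dict[str, str] = {}  # idempotency key -> result (stand-in for durable storage)

def create_item(idempotency_key: str, payload: str) -> str:
    """Safe to retry: the same key always maps to the same outcome."""
    if idempotency_key in completed:
        return completed[idempotency_key]   # already done; don't repeat the side effect
    result = f"created:{payload}"           # placeholder for the real write
    completed[idempotency_key] = result
    return result
```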


Checkpoint Patterns

Checkpoints are snapshots of state that enable recovery to a known good point.

Automatic checkpoints: Save state at regular intervals. Simple but may checkpoint bad states.

Event-triggered checkpoints: Save after significant events--task completion, tool calls, user confirmations. More meaningful recovery points.

Semantic checkpoints: Save when the agent reaches coherent milestones. "Implementation complete" is a better checkpoint than "halfway through function."

Checkpoint storage must be reliable. A checkpoint system that loses checkpoints is worse than no system--you rely on it and it fails you.

Checkpoint restoration must be possible. Saving state is useless if you can't load it and resume. Test restoration regularly.
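
A minimal checkpoint sketch, writing state as JSON to disk; the state shape and file layout are assumptions, and the important property is that restore gets exercised as routinely as save:

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")   # assumed location

def save_checkpoint(label: str, state: dict) -> Path:
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{label}.json"
    path.write_text(json.dumps(state))
    return path

def restore_checkpoint(label: str) -> dict:
    # Restoration must be tested regularly; an unreadable checkpoint is no checkpoint.
    return json.loads((CHECKPOINT_DIR / f"{label}.json").read_text())
```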


Error Budgets

Perfect reliability is impossible and pursuing it has diminishing returns. Error budgets provide a framework:

Define acceptable error rates. Perhaps 95% of tasks should succeed. The 5% budget allows for failures.

Track against the budget. If you're burning budget faster than expected, invest in reliability. If you're under budget, invest elsewhere.

Use budget to make decisions. A new feature that increases error rate might still be worth shipping if you have budget. A reliability improvement might not be worth the cost if you're already under budget.
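
The bookkeeping itself is simple. A sketch, assuming a 95% success target over a rolling window of task outcomes:

```python
def budget_remaining(outcomes: list[bool], target_success_rate: float = 0.95) -> float:
    """outcomes: True for success, False for failure, over the current window."""
    allowed_failures = len(outcomes) * (1 - target_success_rate)
    actual_failures = outcomes.count(False)
    return allowed_failures - actual_failures   # negative means the budget is exhausted

# Example: 2 failures in 100 tasks leaves roughly 3 failures of budget.
# budget_remaining([True] * 98 + [False] * 2)
```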

This isn't an excuse for sloppiness. It's recognition that some failure is inevitable and resources should be allocated strategically.


Resilience Through Simplicity

The most resilient systems are often the simplest. Every component is a potential failure point. Every dependency is a risk.

Reduce moving parts where possible. Do you need that caching layer? Can the agent just re-query?

Favor simple recovery over complex prevention. A system that fails and recovers quickly may outperform one that tries to prevent all failures.

Accept visible failures over hidden degradation. A clear error message is better than silently wrong results.

Complexity serves a purpose, but that purpose should be explicit. If you can't articulate why a component exists, you probably shouldn't have it.


Related: A7 covers how to test resilience. A3 discusses tool failure specifically. A2 addresses failure in multi-agent systems.