negrini.io

The Evaluation Challenge

How do you know if an agent is good?

Traditional software has clear answers. The function returns the right value or it doesn't. Tests pass or fail. Behavior is deterministic.

Agents are probabilistic. The same prompt can produce different outputs. "Good" is often subjective--readable code, helpful explanations, appropriate tone. And agent behavior emerges from the interaction of many components.

Evaluation requires different approaches. Not abandoning rigor--applying rigor differently.

Levels of Evaluation

Agent evaluation operates at multiple levels:

Component level: Does each piece work correctly? Do tools return expected outputs? Does retrieval find relevant documents? Does summarization preserve key information?

Behavior level: Does the agent act appropriately? Does it follow instructions? Does it use tools correctly? Does it know when to stop?

Task level: Does the agent accomplish goals? Can it complete coding tasks? Answer questions accurately? Follow multi-step instructions?

System level: Does the overall system meet requirements? Latency, cost, user satisfaction, error rates.

Each level requires different metrics and methods. Component testing looks like traditional software testing. Behavior testing requires understanding intent. Task testing needs ground truth or human judgment. System testing needs production metrics.

Evaluation Metrics

Common metrics for agent evaluation:

Task success rate: What percentage of tasks complete successfully? Requires defining "success" clearly.

Accuracy: For factual tasks, how often is the agent correct? Requires ground truth data.

Format compliance: Does output match required format? JSON validity, schema compliance, structure adherence.

Instruction following: Did the agent do what was asked? Not just produce a response--produce the right kind of response.

Latency: How long do tasks take? Matters for user experience and system costs.

Token efficiency: How many tokens to accomplish a goal? Fewer is cheaper.

Error recovery rate: When failures occur, how often does the agent recover?

Preference alignment: For subjective tasks, do human raters prefer agent outputs? Often measured through head-to-head comparisons.

Building Evaluation Sets

Good evaluation requires good data:

Representative tasks: Cover the range of real-world usage. Edge cases, common cases, challenging cases.

Ground truth: For objective tasks, correct answers to compare against. Often expensive to create.

Diverse inputs: Vary phrasing, context, complexity. Agents shouldn't only work on training-like examples.

Adversarial examples: Inputs designed to cause failure. Prompt injections, edge cases, ambiguous requests.

Regression triggers: Examples that caught bugs before. Keep them to prevent regression.

Stratification: Balance across difficulty levels, task types, user populations. Unbalanced sets give misleading results.

Building evaluation sets is ongoing work. As usage expands, evaluation must expand too. New failure modes require new test cases.

Testing Approaches

Unit testing: Test components in isolation. Mock dependencies. Fast, focused, limited scope.

Integration testing: Test components working together. Real tools, real retrieval. Slower, broader, catches interaction bugs.

End-to-end testing: Test complete workflows. User input through final output. Slowest, most realistic, hardest to debug when failures occur.

Regression testing: Run existing test suite after changes. Catch breakage.

Continuous evaluation: Production monitoring. Track metrics over time. Detect degradation.

Human evaluation: For subjective quality, human judges rate outputs. Expensive but often necessary.

The testing pyramid applies: many fast unit tests, fewer integration tests, selective end-to-end tests. But the levels differ from traditional software.

Dealing with Non-Determinism

Agents produce different outputs on different runs. Testing must account for this:

Seed control: Fix random seeds where possible. Reproducible runs help debugging.

Multiple runs: For important evaluations, run multiple times and analyze distribution.

Statistical acceptance: Instead of exact match, accept responses in an acceptable range. "Contains required elements" rather than "exact string match."

Semantic comparison: Evaluate meaning rather than exact text. Embeddings can measure semantic similarity.

Property testing: Verify properties outputs should have regardless of variation. "Summary should be shorter than input." "Code should compile."

Non-determinism isn't a bug to be eliminated--it's a property to be managed. Evaluation methods must accommodate it.

Benchmarks and Baselines

Benchmarks provide standard evaluation tasks:

Task-specific benchmarks: Coding (HumanEval, MBPP), question answering (TriviaQA), reasoning (GSM8K).

Agent benchmarks: WebArena for web tasks, SWE-bench for software engineering.

Custom benchmarks: For your specific use case. More relevant but less comparable.

Baselines provide comparison points:

Previous versions: Is the new agent better than the old?

Human performance: How does the agent compare to humans?

Simpler systems: Does the complexity add value?

Commercial alternatives: How does your system compare to available products?

Benchmarks and baselines contextualize results. 80% accuracy means nothing in isolation. 80% compared to 60% baseline is progress. 80% compared to 95% state-of-art is a gap.

Offline vs Online Evaluation

Offline evaluation: Test against fixed datasets before deployment. Controlled, reproducible, limited to anticipated scenarios.

Online evaluation: Measure performance on live traffic after deployment. Real scenarios, noisy data, actual impact.

Both are necessary. Offline evaluation catches obvious problems before users see them. Online evaluation catches problems that only appear in production.

A/B testing bridges them: deploy variations to fractions of traffic, measure differences, roll out winners.

Debugging Agent Failures

When tests fail, diagnosing the cause requires structured investigation:

Reproduce: Get a minimal example that triggers the failure. Remove irrelevant context.

Trace: Log everything that happened. What was the prompt? What did the model return? What tools were called?

Isolate: Which component failed? The model? A tool? The orchestration logic?

Hypothesize: Why did it fail? Generate theories.

Test: Modify the system to test each theory. Change the prompt, swap components, adjust parameters.

Fix: Implement the correction. Add to regression suite.

Agent debugging is like traditional debugging but with more uncertainty. The model is a black box. Failures may be probabilistic. Multiple causes can produce similar symptoms.

Investing in observability pays dividends. The more you can see into system behavior, the faster you can diagnose problems.

Continuous Improvement Loop

Evaluation isn't a one-time activity. It's an ongoing loop:

Deploy: Put the agent into use
Monitor: Track metrics and collect feedback
Analyze: Identify failure patterns and improvement opportunities
Hypothesize: What changes would help?
Experiment: Test changes offline
Validate: Confirm improvements
Deploy: Updated agent returns to step 1

The loop never ends. User needs evolve. Models improve. New failure modes emerge. Continuous evaluation keeps the system aligned with reality.

Related: A5 covers failure recovery that evaluation should test. A6 discusses prompt testing specifically.