Evaluating a chatbot response is comparatively simple: did the answer follow instructions, use the right facts and avoid obvious policy failures? Long-running agents are harder to evaluate: they produce plans, choose tools, update state, make partial decisions and recover from errors long before a final output exists.
An agent can fail even when the final answer looks acceptable. It may have used the wrong source, skipped an approval step, spent too much, leaked unnecessary data, retried an unsafe action or created a result that cannot be reproduced. Evaluation must therefore inspect the path, not only the destination.
Build task suites from real work
Good agent evaluations start with realistic tasks. Use anonymised tickets, representative documents, common edge cases, awkward handoffs and historical failures. Include tasks that should be refused or escalated, not just tasks the agent should complete.
Each task should define success criteria at multiple levels: plan quality, tool selection, data handling, intermediate artefacts, final output, time, cost and escalation behaviour. This makes failures more diagnosable and helps teams improve the workflow instead of endlessly tweaking prompts.
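One way to make these multi-level criteria concrete is to attach them to each task record. The sketch below is illustrative, not a standard schema; the field names, criterion levels and the `refund-017` task are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EvalTask:
    task_id: str
    prompt: str                # anonymised ticket or document excerpt
    expected_outcome: str      # "complete", "refuse" or "escalate"
    # Pass conditions per evaluation level (plan, tools, escalation, cost, ...)
    criteria: dict = field(default_factory=dict)

# Hypothetical example: a task the agent should escalate, not complete.
refund_task = EvalTask(
    task_id="refund-017",
    prompt="Customer requests a refund 45 days after purchase; policy window is 30 days.",
    expected_outcome="escalate",
    criteria={
        "plan_quality": "identifies the policy exception before acting",
        "tool_selection": "reads the policy doc; does not call the refund API",
        "escalation": "hands off to a human with full context",
        "cost_usd_max": 0.50,
    },
)
```

Scoring each level separately means a failed run reports *which* level broke, which is what makes failures diagnosable.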
Trace review is mandatory
Every meaningful agent run should produce a trace that can be reviewed. The trace should show inputs, retrieved context, model decisions, tool calls, errors, retries, approvals and final artefacts. For production systems, traces are the foundation for debugging, audit, incident response and continuous evaluation.
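A minimal trace can be an append-only log of structured events, one JSON line per decision, tool call, retry or approval. The event shape below is a sketch under assumed field names, not a fixed standard.

```python
import json
from datetime import datetime, timezone

def trace_event(run_id, step, kind, payload):
    """Serialize one reviewable event as a JSON line for an append-only trace log."""
    return json.dumps({
        "run_id": run_id,
        "step": step,
        "kind": kind,  # e.g. "tool_call", "retry", "approval", "final_artifact"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })

# Hypothetical tool-call event; tool name and arguments are illustrative.
line = trace_event(
    "run-42", 3, "tool_call",
    {"tool": "search_tickets", "args": {"query": "refund"}, "error": None},
)
```

Because every line is self-describing JSON, the same log serves debugging during development and audit or incident response in production.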
Trace review also supports a useful failure taxonomy. Label problems such as missing context, wrong tool, weak plan, unsafe action, policy miss, citation gap, format drift, excessive retries, escalation failure and cost overrun. Once failures have names, teams can track whether the system is actually improving.
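Once reviewers attach labels from the taxonomy to each run, tracking improvement is a simple tally. The label set below mirrors the list above; the tally function itself is a sketch.

```python
from collections import Counter

# Labels from the failure taxonomy described above.
FAILURE_LABELS = {
    "missing_context", "wrong_tool", "weak_plan", "unsafe_action",
    "policy_miss", "citation_gap", "format_drift", "excessive_retries",
    "escalation_failure", "cost_overrun",
}

def tally_failures(reviewed_runs):
    """Count labelled failures across reviewed traces; reject unknown labels early."""
    counts = Counter()
    for labels in reviewed_runs:
        for label in labels:
            if label not in FAILURE_LABELS:
                raise ValueError(f"unknown failure label: {label}")
            counts[label] += 1
    return counts

# Three hypothetical reviewed runs, one of them clean.
weekly = tally_failures([["wrong_tool"], ["wrong_tool", "cost_overrun"], []])
```

Rejecting unlisted labels keeps the taxonomy closed, so counts stay comparable from week to week.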
Release gates should be boring and strict
Agent changes should not ship because a demo looked good. They should pass regression gates. These gates should include must-pass safety cases, representative business tasks, tool permission checks, cost thresholds, latency thresholds and human-review acceptance criteria.
The point is to create confidence that survives model upgrades, prompt changes, tool changes and data changes. As models become more capable, evaluation has to become more systematic. Otherwise the organisation only learns about regressions after users or customers find them.
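A boring, strict gate can literally be a function that refuses the release on the first violated criterion. This is a minimal sketch with assumed result fields and arbitrary thresholds, not a complete gating system.

```python
def gate_passes(results, max_cost_usd=5.0, max_latency_s=120.0):
    """Return (ok, reason) given per-task results with 'must_pass', 'passed',
    'cost_usd' and 'latency_s' fields. Any single violation blocks the release."""
    for r in results:
        if r["must_pass"] and not r["passed"]:
            return False, f"must-pass case failed: {r['task_id']}"
        if r["cost_usd"] > max_cost_usd:
            return False, f"cost threshold exceeded: {r['task_id']}"
        if r["latency_s"] > max_latency_s:
            return False, f"latency threshold exceeded: {r['task_id']}"
    return True, "all gates passed"

# Hypothetical run: the safety case must pass; the second task may fail without blocking.
ok, reason = gate_passes([
    {"task_id": "safety-01", "must_pass": True, "passed": True,
     "cost_usd": 0.4, "latency_s": 12.0},
    {"task_id": "refund-017", "must_pass": False, "passed": False,
     "cost_usd": 0.3, "latency_s": 9.0},
])
```

Running this same check on every model upgrade, prompt change, tool change and data change is what turns a one-off demo into repeatable confidence.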