What does this article cover?

How to evaluate agents with simulated user journeys that test tool use, recovery, safety and end-to-end task success.

Teams shipping agentic workflows who need repeatable end-to-end testing beyond single-turn prompt evaluations.

Simulated Users for Agent Evaluation: Journeys, Adversarial Prompts and Tool Outcomes

Single-turn prompt evaluations are useful, but agents fail in sequences. They fail after a tool error, after a permission block, after an ambiguous user message, or after a long multi-step plan. Simulated users are a way to test agents end-to-end with repeatable journeys.

Define journeys as task scripts

A journey is a script of user messages and expected outcomes. Start with real workflows:

Account setup, ticket triage, report generation, or knowledge lookup.
Include constraints: tenant context, permissions, and policies.
Define success criteria at the task boundary, not just "the model responded".

This connects directly to operational reliability metrics (see agent reliability metrics).

Include tool outcomes as first-class assertions

For agentic systems, output text is not the only outcome. Assert on:

Tool calls. Which tools were used and with what arguments.
State changes. Did the intended change occur (and only the intended change)?
Side effects. No unintended writes, no duplicated actions, no fan-out loops.

These assertions require strong tool contracts and validation (see tool contracts and structured validation).

Add adversarial and "messy user" variants

Real users do not follow scripts. Add variations:

Ambiguous inputs that should trigger clarification.
Conflicting requirements that should trigger a safe plan and a question.
Adversarial instructions that try to bypass policy or tool approvals (see red teaming).

This is where many failures appear: the agent becomes overly confident or starts calling tools too early.

Score journeys with a small rubric

Not every journey can be evaluated with exact matching. Use a rubric that covers:

Task success. Was the user goal achieved?
Policy compliance. Did the agent refuse, abstain, or ask for approval appropriately?
Efficiency. Was the number of steps and cost reasonable?

Rubrics keep human scoring consistent (see evaluation rubrics).

Build a regression suite and run it on every change

Journeys become powerful when they are repeatable. Add them to your change process:

Run on prompt changes, tool changes, and routing changes.
Store results with version metadata (prompt/tool versions).
Gate high-risk releases on passing journeys (see prompt registries).

This is the agent version of a RAG benchmark harness (see RAG benchmarks).

Simulated users are not about replacing real user testing. They are about making your most important journeys deterministic so you can ship changes with confidence.