Tools are where agents meet reality: APIs time out, permissions fail, schemas change, and dependencies degrade. If your system treats all tool errors the same, you get the worst outcomes: infinite retries, duplicated side effects, user confusion, and high costs.
Good tool error handling is a combination of engineering discipline (contracts, idempotency) and operational discipline (telemetry, runbooks).
Start with an error taxonomy
Agents need machine-readable errors. A practical taxonomy:
- retryable. Timeouts, transient 5xx, dependency instability.
- non_retryable. Validation errors, missing required fields.
- auth. 401/403, missing permissions, expired tokens.
- conflict. Concurrency conflicts, stale versions.
- rate_limited. 429s, quota exhaustion.
This improves both agent behaviour and dashboards (see AI error taxonomy).
Define retry budgets, not infinite retries
Retries should be bounded and intentional:
- Retry count. e.g. 2-3 retries for retryable classes.
- Retry window. Stop retries after a short time budget.
- Per-tool budgets. Tools with side effects get stricter rules.
- Per-run budgets. Cap total retries per agent run (see loop prevention).
Use exponential backoff and jitter
Backoff prevents retry storms. Use exponential backoff with jitter for retryable errors. For rate limits, respect server-provided retry-after hints. Track retry effectiveness: retries that never succeed are waste.
Make tools safe to retry: idempotency and contracts
The safest retry policy is still dangerous if tools are not idempotent. Controls include:
- Idempotency keys. Prevent duplicated writes on retry.
- Strict schemas. Validate inputs deterministically (see structured validation).
- Versioned contracts. Add backward-compatible changes and deprecation windows (see tool contracts).
Use circuit breakers and degradation modes
When a dependency is unhealthy, retries often make it worse. Add circuit breakers:
- Stop calling a failing tool for a short cooldown.
- Switch to alternative routes or modes.
- Use a safe partial answer or handoff path.
These patterns belong in your broader fallback strategy (see fallback and degradation).
Make errors user-visible in a helpful way
Agents should not hide tool errors behind vague replies. A useful user message includes:
- What failed (system name or action category, not internal stack traces).
- What the system tried, and what it will do next.
- How the user can proceed (retry later, provide missing info, request escalation).
For sensitive actions, route to approvals instead of retries (see approvals).
Operationalise with tracing and runbooks
Finally, treat tool errors like any production system:
- Trace runs end-to-end to locate where failures occur (see run tracing).
- Create runbooks for top tool failure modes (see alerting and runbooks).
- Feed incidents back into regression tests (see incident response).
Agents become reliable when tool failures become predictable. That is achieved with clear error classes, bounded retries, safe tool design, and operational feedback loops.