AI systems can be "up" while being wrong. Latency and error-rate monitoring is necessary, but not sufficient. Reliable operations require signals for quality, safety, tool correctness and cost.
Start from failure modes, not metrics
Define the failure modes you must detect and recover from (see error taxonomy). Then map each to an observable signal and a runbook action.
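One way to make that mapping concrete is to keep it in code next to the alerting config, so it is versioned and reviewable. The sketch below assumes a small internal registry; the FailureMode class and the specific modes, signals, and actions are illustrative, not a prescribed taxonomy.

```python
# Illustrative failure-mode registry; names and actions are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    name: str            # from the error taxonomy, e.g. "ungrounded_answer"
    signal: str          # the observable metric or event that detects it
    runbook_action: str  # the first lever an operator should pull

FAILURE_MODE_MAP = [
    FailureMode("provider_timeout",  "p99_latency, provider_error_rate",  "fail over to backup provider"),
    FailureMode("ungrounded_answer", "groundedness_score, citation_rate", "freeze prompt changes; roll back last prompt"),
    FailureMode("retrieval_miss",    "golden_query_recall",               "revert retrieval config; reindex if needed"),
    FailureMode("tool_schema_error", "tool_validation_failure_rate",      "flag off the failing tool"),
    FailureMode("cost_spike",        "tokens_per_task, retry_count",      "cap retries; throttle context size"),
]
```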
Alert classes that matter for AI
- Availability. Provider errors, timeouts, or dependency failures.
- Quality regressions. Groundedness drop, missing citations, higher escalation rate.
- Retrieval regressions. Golden queries failing to retrieve expected sources (see synthetic monitoring).
- Safety regressions. Sensitive disclosures, prompt injection signals, unsafe content (see prompt injection defence).
- Tool failures. Tool error spikes, schema validation failures, idempotency conflicts (see tool reliability).
- Cost incidents. Tokens-per-task spikes, multiplying retries, or budget breaches (see cost anomaly detection).
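These classes can be encoded as declarative rules so that thresholds are reviewed like any other change. In the sketch below, the metric names, thresholds, and windows are assumptions to be tuned per system, not recommended values.

```python
# Illustrative alert rules for the classes above; all values are placeholders.
ALERT_RULES = {
    "availability":         {"metric": "provider_error_rate",       "threshold": 0.05,  "direction": "above", "window_min": 5,  "severity": "page"},
    "quality_regression":   {"metric": "groundedness_score_p50",    "threshold": 0.80,  "direction": "below", "window_min": 60, "severity": "ticket"},
    "retrieval_regression": {"metric": "golden_query_recall",       "threshold": 0.90,  "direction": "below", "window_min": 60, "severity": "ticket"},
    "safety_regression":    {"metric": "unsafe_output_rate",        "threshold": 0.001, "direction": "above", "window_min": 15, "severity": "page"},
    "tool_failure":         {"metric": "tool_validation_error_rate","threshold": 0.02,  "direction": "above", "window_min": 15, "severity": "page"},
    "cost_incident":        {"metric": "tokens_per_task_p95",       "threshold": 20000, "direction": "above", "window_min": 30, "severity": "ticket"},
}
```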
Use SLO-style thresholds and error budgets
AI output has inherent variance, so alerts should focus on trend breaks and sustained degradation, not single failures. Set SLOs for latency, quality and safety, and pace the rate of change with error budgets (see SLO playbooks).
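A minimal sketch of an error-budget burn check, assuming a quality SLO expressed as a target ratio of grounded answers; the target, the measured failure ratio, and the burn-rate thresholds are illustrative.

```python
def budget_burn_rate(failure_ratio: float, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means on pace to spend it exactly."""
    allowed_failure = 1.0 - slo_target
    return failure_ratio / allowed_failure if allowed_failure > 0 else float("inf")

# Example: a groundedness SLO of 95% grounded answers over the current window.
slo_target = 0.95
failure_ratio = 0.08  # 8% ungrounded answers measured this window (hypothetical)
burn = budget_burn_rate(failure_ratio, slo_target)

if burn >= 2.0:
    print("Sustained degradation: page on-call and consider a change freeze.")
elif burn >= 1.0:
    print("Budget burning at or above pace: open a ticket and slow releases.")
```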
Make runbooks actionable with fast levers
A runbook is only useful if operators can act quickly. Common "fast levers" (sketched in code after the list) include:
- Feature flags. Disable expensive or risky paths (tools, reranking) temporarily (see feature flags).
- Routing fallbacks. Switch models/providers when health breaks (see routing and failover).
- Context throttles. Cap retrieved chunks and tool output size (see context engineering).
- Change freeze. Pause risky changes when the system is unstable (see change freeze playbooks).
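A sketch of what those levers can look like behind a single operator surface, assuming simple in-process flags; the flag names, routes, and limits are illustrative, and a real deployment would back them with a feature-flag service and provider health checks.

```python
# Illustrative fast levers: flags, routing fallback, and context throttles.
FLAGS = {
    "tools_enabled": True,      # flag off to disable risky tool paths
    "reranking_enabled": True,  # flag off to cut latency and cost quickly
    "change_freeze": False,     # block risky deploys while the system is unstable
}

ROUTES = ["primary-model", "fallback-model"]  # ordered by preference
CONTEXT_LIMITS = {"max_chunks": 8, "max_tool_output_chars": 4000}

def pick_route(health: dict) -> str:
    """Routing fallback: first route whose provider currently reports healthy."""
    for route in ROUTES:
        if health.get(route, False):
            return route
    return ROUTES[-1]  # degrade to the last option rather than failing closed

# Example: primary provider unhealthy, so traffic shifts to the fallback.
print(pick_route({"primary-model": False, "fallback-model": True}))
```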
Diagnose using decision logs and telemetry
When alerts fire, the first question is what changed: routing rules, prompts, policies, retrieval config, or tool enablement. Decision logs and structured telemetry make that visible (see decision logging and telemetry schema).
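A sketch of a structured decision-log entry; the field set is an assumed schema chosen so an operator can answer "what changed?" from a single log line, not a standard format.

```python
import json
import time
import uuid

def log_decision(request_id: str, **fields) -> str:
    """Emit one JSON line capturing what the system decided and why."""
    entry = {
        "ts": time.time(),
        "request_id": request_id,
        "event_id": str(uuid.uuid4()),
        **fields,
    }
    line = json.dumps(entry, sort_keys=True)
    print(line)  # in production this would go to the telemetry pipeline
    return line

# Example: enough context to reconstruct routing, prompt, and policy state.
log_decision(
    "req-123",
    route="fallback-model",
    prompt_version="2026-01-14",
    retrieval_config="hybrid-v3",
    tools_enabled=False,
    policy_decision="allowed",
)
```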
Turn incidents into tests
After recovery, add the failure to a regression suite or a golden query set so it is detected earlier next time (see prompt regression testing).
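A sketch of capturing an incident as a golden-query check; the retrieve helper, the GOLDEN_QUERIES list, and the example case are hypothetical stand-ins for your own eval harness.

```python
GOLDEN_QUERIES = [
    # Added after an incident where refund-policy answers lost their citation.
    {"query": "What is the refund window?", "must_cite": "refund-policy.md"},
]

def check_golden_queries(retrieve) -> None:
    """Fail loudly if any golden query no longer retrieves its expected source."""
    failures = []
    for case in GOLDEN_QUERIES:
        sources = retrieve(case["query"])  # expected to return a list of source ids
        if case["must_cite"] not in sources:
            failures.append(case["query"])
    assert not failures, f"Golden queries missing expected sources: {failures}"
```

Run the check in CI and on a schedule against the live retrieval stack, so the next regression surfaces before users report it.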
Great AI operations is not about perfect prevention. It is about fast detection, fast recovery, and continuous hardening.