AI systems need more than unit tests. A single change can affect quality, safety and cost without breaking a build. A testing pyramid for AI combines fast checks with deeper evaluation and production monitoring.
Layer 1: Static checks (fast, always-on)
Static checks catch obvious problems before you run any models:
- Schema validation. Tool contracts, structured outputs and required fields (see structured outputs); a sketch follows this list.
- Policy linting. Ensure required policy prompts and guardrails are present (see policy layering).
- Configuration checks. Confirm that routing rules, residency constraints and feature flags are valid (see routing and feature flags).
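As a concrete example of the schema-validation bullet, here is a minimal sketch of a Layer 1 check, assuming tool contracts are stored as JSON files in a `tools/` directory and validated with the jsonschema package; the directory layout, schema fields and CLI behaviour are illustrative, not taken from any particular toolchain.

```python
import json
import sys
from pathlib import Path

from jsonschema import Draft202012Validator

# Hypothetical contract schema: every tool must declare a name, a description
# and an object-typed parameters block.
TOOL_CONTRACT_SCHEMA = {
    "type": "object",
    "required": ["name", "description", "parameters"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "description": {"type": "string", "minLength": 1},
        "parameters": {
            "type": "object",
            "required": ["type", "properties"],
            "properties": {"type": {"const": "object"}},
        },
    },
}


def check_tool_contracts(contract_dir: str) -> list[str]:
    """Return human-readable errors; an empty list means the check passes."""
    validator = Draft202012Validator(TOOL_CONTRACT_SCHEMA)
    errors: list[str] = []
    for path in sorted(Path(contract_dir).glob("*.json")):
        contract = json.loads(path.read_text())
        errors.extend(f"{path.name}: {err.message}" for err in validator.iter_errors(contract))
    return errors


if __name__ == "__main__":
    problems = check_tool_contracts(sys.argv[1] if len(sys.argv) > 1 else "tools")
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI job
```

Because this never calls a model, it can run on every commit alongside linting and unit tests.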
Layer 2: Offline evaluation (repeatable, representative)
Offline evaluation answers: "Is this change better on representative cases?" Use:
- Versioned datasets and scenario suites (see evaluation datasets).
- Rubrics and judges for correctness, groundedness and policy adherence (see rubrics and LLM-as-a-judge); a small harness sketch follows this list.
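Below is a minimal offline-evaluation harness sketch, assuming cases live in a versioned JSONL file with `input` and `rubric` fields, and that `generate` and `judge` callables wrap the system under test and the grading logic; all names, the dataset format and the stub implementations are illustrative.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Callable


def load_cases(dataset_path: str) -> list[dict]:
    """Read a versioned JSONL dataset; each line is one case with input and rubric fields."""
    lines = Path(dataset_path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def evaluate(
    cases: list[dict],
    generate: Callable[[str], str],           # the candidate system under test
    judge: Callable[[str, str, str], float],  # (input, output, rubric) -> score in [0, 1]
) -> dict:
    scores = [judge(c["input"], generate(c["input"]), c["rubric"]) for c in cases]
    return {"mean_score": mean(scores), "n_cases": len(scores)}


if __name__ == "__main__":
    # Stub case, system and judge so the sketch runs end to end; in practice the
    # cases would come from load_cases() on a versioned dataset and the judge
    # would be a rubric-guided LLM call.
    cases = [{"id": "1", "input": "What is our refund window?", "rubric": "Mentions 30 days."}]
    report = evaluate(
        cases,
        generate=lambda q: "Refunds are accepted within 30 days of purchase.",
        judge=lambda q, out, rubric: 1.0 if "30 days" in out else 0.0,
    )
    print(report)  # {'mean_score': 1.0, 'n_cases': 1}
```

The report from a run like this is then compared against the previous version's scores to decide whether the change is actually better.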
Layer 3: Regression suites (stability over time)
Regression suites protect critical workflows. Convert incidents and support tickets into cases that must not regress (see prompt regression testing and support playbooks).
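One way to encode such cases is a pytest-style regression suite. The sketch below assumes a `respond(query)` entry point for the system under test; the stub raises until it is wired to the real service, and the incident IDs, queries and expected phrases are placeholders.

```python
import pytest


def respond(query: str) -> str:
    """Stand-in for the real system under test; wire this to the production entry point."""
    raise NotImplementedError("replace with a call to the deployed assistant")


# Each tuple encodes an incident or support ticket that must never regress.
REGRESSION_CASES = [
    ("INC-1042", "Cancel my subscription", "cancellation link"),
    ("INC-1107", "How long do you retain my data?", "90 days"),
]


@pytest.mark.parametrize("incident_id,query,must_contain", REGRESSION_CASES)
def test_incident_does_not_regress(incident_id, query, must_contain):
    answer = respond(query)
    assert must_contain.lower() in answer.lower(), (
        f"{incident_id}: expected the answer to mention {must_contain!r}"
    )
```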
Layer 4: Canary and experiment gates
Before full rollout, use canaries and controlled experiments with guardrails (see canary rollouts and experimentation).
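A gate can be as simple as comparing canary metrics against the baseline and blocking promotion when a guardrail is breached. The sketch below assumes metric snapshots are available as dictionaries; the metric names and thresholds are illustrative, not prescribed values.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str
    max_relative_drop: float  # e.g. 0.02 allows at most a 2% drop vs the baseline


GUARDRAILS = [
    Guardrail("groundedness", 0.02),
    Guardrail("policy_adherence", 0.00),  # zero tolerance for safety regressions
    Guardrail("task_success", 0.03),
]


def canary_passes(baseline: dict[str, float], canary: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ok, reasons); reasons explain any guardrail breach."""
    breaches = []
    for g in GUARDRAILS:
        base, cand = baseline[g.metric], canary[g.metric]
        drop = (base - cand) / base if base else 0.0
        if drop > g.max_relative_drop:
            breaches.append(f"{g.metric}: dropped {drop:.1%} (limit {g.max_relative_drop:.1%})")
    return (not breaches, breaches)


if __name__ == "__main__":
    ok, reasons = canary_passes(
        baseline={"groundedness": 0.91, "policy_adherence": 0.99, "task_success": 0.78},
        canary={"groundedness": 0.90, "policy_adherence": 0.99, "task_success": 0.74},
    )
    # The task_success drop exceeds its guardrail, so this canary is held back.
    print("promote" if ok else "hold: " + "; ".join(reasons))
```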
Layer 5: Production monitoring (truth)
Production monitoring detects drift and real user impact:
- Synthetic monitoring with golden queries (see synthetic monitoring).
- Safety SLIs and alerting (see safety dashboards and runbooks).
- Cost anomaly detection and budgets (see cost anomalies); a sketch follows this list.
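For the last bullet, a minimal cost-anomaly check might flag a day whose spend sits far outside the recent baseline. The window size, threshold and fallback rule below are illustrative choices, and a real detector would feed alerting and budgets rather than print.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold deviations above the recent mean."""
    if len(history) < 7:  # not enough signal to judge
        return False
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return today > baseline * 1.5  # fallback for a perfectly flat history
    return (today - baseline) / spread > z_threshold


if __name__ == "__main__":
    daily_spend = [102.0, 98.5, 110.2, 105.7, 99.3, 101.8, 104.4]
    print(is_cost_anomaly(daily_spend, today=240.0))  # True: spend jumps far above the baseline
    print(is_cost_anomaly(daily_spend, today=108.0))  # False: within normal variation
```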
The goal is not exhaustive testing. The goal is layered confidence: fast checks prevent obvious mistakes, evaluation catches regressions, and monitoring detects what you could not predict.