"Add guardrails" is the most common instruction in AI delivery, and the least precise. Guardrails are not a single feature. They are a layered system of controls that reduce the probability and impact of unsafe or incorrect outcomes.
This taxonomy is a practical way to name what you are doing, decide what you still need, and assign ownership. It also helps you avoid guardrail theatre: lots of controls that look reassuring, but do not actually reduce risk.
1) Policy guardrails (what is allowed)
Policy guardrails define the rules: what the assistant is allowed to do, what it must refuse, and what actions require extra checks. Good policy guardrails are explicit and testable:
- Scope. What domains and tasks the assistant supports.
- Data boundaries. What data classes are allowed in prompts and outputs.
- Action boundaries. What tools can be used and under what conditions.
Policy is typically implemented through a combination of system instructions, tool routing rules, and enforcement logic (see enterprise guardrails and policy layering).
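One way to make policy explicit and testable is to hold it in a small, typed object that enforcement code can check directly. A minimal sketch, assuming illustrative names like `Policy` and `check_tool_call` rather than any specific framework:

```python
from dataclasses import dataclass

# Illustrative policy object: field and tool names here are assumptions, not a framework API.
@dataclass(frozen=True)
class Policy:
    allowed_domains: frozenset          # scope: domains and tasks the assistant supports
    allowed_data_classes: frozenset     # data boundaries: classes allowed in prompts/outputs
    allowed_tools: frozenset            # action boundaries: tools the assistant may call
    tools_requiring_approval: frozenset # actions that need an extra check

SUPPORT_POLICY = Policy(
    allowed_domains=frozenset({"billing", "product_docs"}),
    allowed_data_classes=frozenset({"public", "internal"}),
    allowed_tools=frozenset({"search_kb", "create_ticket", "issue_refund"}),
    tools_requiring_approval=frozenset({"issue_refund"}),
)

def check_tool_call(policy: Policy, tool: str) -> str:
    """Return an enforcement decision the calling code can act on."""
    if tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.tools_requiring_approval:
        return "needs_approval"
    return "allow"

assert check_tool_call(SUPPORT_POLICY, "issue_refund") == "needs_approval"
assert check_tool_call(SUPPORT_POLICY, "delete_account") == "deny"
```

Because the policy is data rather than prose, it can be versioned, reviewed and unit-tested alongside the system instructions that describe it to the model.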
2) Input guardrails (what comes in)
Input controls reduce the chance that the model is influenced by malicious or sensitive content. Common input guardrails include:
- Prompt injection detection. Flag suspicious patterns and isolate untrusted text.
- PII and secrets scanning. Detect and redact before sending to the model (see PII redaction pipelines).
- Context budgeting. Limit how much external content can be injected into the prompt (see context budgeting).
Input guardrails are especially important for RAG systems where untrusted content can enter via documents and connectors (see connector hardening).
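A minimal sketch of an input screen that combines redaction, injection flagging and context budgeting, assuming a hypothetical `screen_input` helper and a handful of illustrative patterns; production systems would normally use dedicated PII and secret scanners rather than a few regexes:

```python
import re

# Illustrative patterns only; real scanners cover far more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # API-key-like strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like numbers
]
INJECTION_MARKERS = ["ignore previous instructions", "disregard the system prompt"]

def screen_input(text: str, max_chars: int = 8_000) -> dict:
    """Redact secrets, flag likely injection attempts, and enforce a context budget."""
    redacted = text
    for pattern in SECRET_PATTERNS:
        redacted = pattern.sub("[REDACTED]", redacted)
    flagged = any(marker in redacted.lower() for marker in INJECTION_MARKERS)
    return {
        "text": redacted[:max_chars],      # context budgeting: hard cap on injected content
        "suspected_injection": flagged,    # caller decides whether to isolate or drop it
        "truncated": len(redacted) > max_chars,
    }
```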
3) Retrieval guardrails (what evidence is used)
For RAG systems, the most powerful guardrails often sit in the retrieval layer. Examples:
- Permissions filtering. Apply ACLs at retrieval time (see RAG permissions).
- Freshness checks. Prefer recent sources for time-sensitive questions (see freshness evaluation).
- Answerability gates. Decide when to answer, ask, or abstain based on evidence quality (see answerability gates).
Retrieval guardrails reduce hallucinations by improving the evidence, not by trying to "talk the model into" being careful.
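A minimal sketch of retrieval-time guardrails, assuming a hypothetical `Chunk` type with an attached ACL, a timezone-aware `updated_at` field and a retriever relevance score; the thresholds are illustrative, not recommendations:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Chunk:
    text: str
    score: float              # retriever relevance score
    allowed_groups: set       # ACL attached to the source document
    updated_at: datetime      # assumed timezone-aware

def apply_retrieval_guardrails(chunks, user_groups, min_score=0.45,
                               max_age=timedelta(days=365)):
    """Permission-filter, freshness-filter, then gate on evidence quality."""
    now = datetime.now(timezone.utc)
    visible = [c for c in chunks if c.allowed_groups & user_groups]   # ACLs at retrieval time
    fresh = [c for c in visible if now - c.updated_at <= max_age]     # freshness check
    strong = [c for c in fresh if c.score >= min_score]
    if not strong:
        return {"action": "abstain", "evidence": []}   # answerability gate: do not force an answer
    return {"action": "answer", "evidence": strong}
```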
4) Output guardrails (what goes out)
Output controls catch problems after generation. Common output guardrails include:
- Safety filters. Policy checks for harmful or disallowed content.
- PII leakage checks. Scan generated text for sensitive content.
- Grounding checks. Ensure claims are supported by retrieved sources and citations (see structured citations).
Output guardrails should be measurable: you should know how often they trigger and whether they prevent real incidents.
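A minimal sketch of an output screen that checks for leakage and grounding and counts how often each check fires, assuming hypothetical helper names and a single e-mail pattern standing in for a fuller PII scanner:

```python
import re
from collections import Counter

guardrail_triggers = Counter()   # measurability: track how often each check fires

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def screen_output(answer: str, cited_ids: set, retrieved_ids: set) -> dict:
    """Run post-generation checks and record which ones triggered."""
    issues = []
    if EMAIL.search(answer):
        issues.append("pii_leak")
    if not cited_ids or not cited_ids <= retrieved_ids:
        issues.append("ungrounded_citation")   # claims must cite retrieved sources
    for issue in issues:
        guardrail_triggers[issue] += 1
    return {"blocked": bool(issues), "issues": issues}
```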
5) Validation guardrails (structure and constraints)
Where possible, do not rely on free-form text. Use structure and validation to reduce ambiguity:
- Structured outputs. Require JSON with schemas for machine-consumed outputs.
- Deterministic validators. Validate types, enums, ranges and required fields.
- Action planning constraints. Limit tool arguments to known-safe shapes.
Validation is one of the highest leverage controls for agentic systems (see structured validation).
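A minimal sketch of a deterministic validator for a machine-consumed action, assuming an illustrative schema with `action`, `amount` and `reason` fields; anything that fails validation is rejected rather than repaired downstream:

```python
import json

# Illustrative schema for a refund-style action; field names and limits are assumptions.
ALLOWED_ACTIONS = {"refund", "escalate", "no_action"}

def validate_action(raw: str) -> dict:
    """Parse model output as JSON and enforce types, enums and ranges deterministically."""
    data = json.loads(raw)                                   # raises on malformed JSON
    if set(data) != {"action", "amount", "reason"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']}")
    if not isinstance(data["amount"], (int, float)) or not 0 <= data["amount"] <= 500:
        raise ValueError("amount must be a number between 0 and 500")
    if not isinstance(data["reason"], str) or not data["reason"].strip():
        raise ValueError("reason must be a non-empty string")
    return data

validate_action('{"action": "refund", "amount": 42.5, "reason": "duplicate charge"}')
```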
6) Tool guardrails (what actions happen)
If your system can call tools, treat tools as a risk boundary. Tool guardrails include:
- Allowlists and least privilege. Only expose approved tools with scoped credentials (see tool authorisation).
- Approvals. Require human approval for sensitive actions (see agent approvals).
- Tool contracts. Use strict schemas and error taxonomies (see tool contracts).
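A minimal sketch of allowlisting plus approvals at the point of tool execution, assuming a hypothetical `TOOL_REGISTRY` and an explicit `approved` flag set by whatever approval workflow you use:

```python
# Illustrative tool registry; tool names and the approval hook are assumptions.
TOOL_REGISTRY = {
    "search_kb": {"fn": lambda q: f"results for {q!r}", "needs_approval": False},
    "issue_refund": {"fn": lambda amount: f"refunded {amount}", "needs_approval": True},
}

def call_tool(name: str, approved: bool = False, **kwargs):
    """Only allowlisted tools run, and sensitive ones require an explicit approval flag."""
    entry = TOOL_REGISTRY.get(name)
    if entry is None:
        raise PermissionError(f"tool not allowlisted: {name}")
    if entry["needs_approval"] and not approved:
        raise PermissionError(f"tool requires human approval: {name}")
    return entry["fn"](**kwargs)

call_tool("search_kb", q="reset password")          # runs without approval
call_tool("issue_refund", approved=True, amount=42)  # runs only with an approval flag
```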
7) Human review guardrails (when automation is not enough)
Some decisions require human judgment. Human-in-the-loop controls are most effective when they are designed as an operations function:
- Clear triage criteria and queues.
- Fast feedback loops to improve prompts, tools and policies.
- Auditable decisions and evidence.
Human review becomes expensive when it is used as a blanket safety net; it becomes efficient when it is used for clearly defined high-risk cases (see human review operations).
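A minimal sketch of rule-based triage, assuming hypothetical risk signals, thresholds and queue names; the point is that routing criteria are explicit and auditable rather than ad hoc:

```python
# Illustrative triage rules; signals, thresholds and queue names are assumptions.
def triage(case: dict) -> str:
    """Route only clearly defined high-risk cases to human review queues."""
    if case.get("action_type") in {"refund", "account_change"}:
        return "actions_review_queue"
    if case.get("grounding_score", 1.0) < 0.5:
        return "quality_review_queue"
    return "auto_approve"   # everything else ships without manual review
```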
How to choose the right combination
A simple rule: prevent issues as early as you can, and detect what you cannot prevent. Many systems need all layers, but not at the same intensity. Start by mapping your top failure modes (hallucinations, data leakage, unsafe actions) to the guardrail types above, then measure effectiveness in evaluation and production (see testing for AI systems).
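A minimal sketch of that mapping as data, assuming an illustrative set of failure modes and layer names, so coverage gaps can be listed rather than argued about:

```python
# Illustrative mapping from top failure modes to the guardrail layers that address them.
FAILURE_MODE_TO_GUARDRAILS = {
    "hallucination": ["retrieval", "output", "validation"],
    "data_leakage":  ["policy", "input", "output"],
    "unsafe_action": ["policy", "tool", "human_review"],
}

def coverage_gaps(implemented: set) -> dict:
    """For each failure mode, list the recommended layers not yet implemented."""
    return {mode: [layer for layer in layers if layer not in implemented]
            for mode, layers in FAILURE_MODE_TO_GUARDRAILS.items()}

print(coverage_gaps({"policy", "output"}))
```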
When you can name your guardrails, you can manage them. When you can measure them, you can improve them.