In retrieval-augmented generation (RAG), most “LLM hallucinations” are actually retrieval problems. The model cannot cite what it cannot see. If you want consistent quality, you need to measure retrieval explicitly, not just final-answer quality.
Retrieval quality answers a simple question: Did we fetch the right evidence? To measure it, you need a small set of representative queries, a definition of what counts as correct evidence, and a workflow for diagnosing failures.
Build a set of golden queries
Start with 50–200 real user questions. Make sure they cover:
- Common intents (how-to, policy, troubleshooting, definitions).
- Edge cases (ambiguous phrasing, rare products, acronyms).
- High-risk topics (compliance, financial decisions, safety guidance).
For each query, label at least one “acceptable” source document (and ideally the relevant passage). This becomes your baseline dataset for regression tests.
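A minimal sketch of what a golden-query record might look like. The file layout (JSONL) and field names (`query`, `acceptable_doc_ids`, `gold_passage`, `tags`) are illustrative assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class GoldenQuery:
    """One labeled query in the regression dataset (field names are illustrative)."""
    query: str
    acceptable_doc_ids: list[str]                   # any of these counts as correct evidence
    gold_passage: str | None = None                 # optional: the exact supporting span
    tags: list[str] = field(default_factory=list)   # e.g. ["how-to", "high-risk"]

def load_golden_queries(path: str) -> list[GoldenQuery]:
    """Load the golden set from a JSONL file, one record per line."""
    with open(path, encoding="utf-8") as f:
        return [GoldenQuery(**json.loads(line)) for line in f if line.strip()]
```

Keeping the dataset as a flat file in version control makes it easy to review labels in pull requests and to rerun the suite against any commit.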
Use practical metrics (and interpret them correctly)
Common retrieval metrics include:
- Recall@k. Whether the correct source appears in the top k results. High recall is usually the first priority.
- Precision@k. The proportion of top results that are relevant. Useful for reducing noise and token spend.
- MRR / nDCG. Measure ranking quality, i.e. how high the correct sources appear in the result list. Helpful once you have decent recall.
In enterprise settings, recall@k is often the gating metric: if the right document is not retrieved, the answer will degrade or drift into fabrication.
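These metrics are simple to compute per query and then average over the golden suite. A minimal sketch, assuming each retrieved result and each labeled source is identified by a document ID:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0 (average over queries)."""
    return 1.0 if relevant_ids & set(retrieved_ids[:k]) else 0.0

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is retrieved. Averaging gives MRR."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```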
Do failure-mode analysis, not just scoring
When retrieval fails, classify the cause:
- Chunking issues. The relevant information is split across chunks or dropped during chunking (see ingestion pipelines).
- Metadata and filters. Wrong tenant/region filters or missing permissions remove the correct result.
- Query mismatch. The user's wording doesn't match the vocabulary in the index; query rewriting or expansion can help.
- Ranking issues. Relevant content is present but pushed down; rerankers or hybrid search can help.
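One lightweight way to operationalize this is to label each failed golden query with a mode and tally the buckets, so the biggest bucket gets fixed first. The taxonomy and record shape below are assumptions; adjust them to your own pipeline:

```python
from collections import Counter

# Illustrative failure taxonomy; rename or extend to match your pipeline.
FAILURE_MODES = ("chunking", "metadata_filter", "query_mismatch", "ranking", "other")

def summarize_failures(labeled_failures: list[dict]) -> Counter:
    """Tally manually labeled retrieval failures.

    Each item is assumed to look like {"query": "...", "mode": "chunking"}.
    """
    counts = Counter(item["mode"] for item in labeled_failures)
    unknown = set(counts) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"Unrecognized failure modes: {unknown}")
    return counts
```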
Connect retrieval tests to product outcomes
Retrieval metrics should map to user outcomes. For example, low recall often correlates with higher escalation rates or lower citation coverage. Track those leading indicators in production (see AI observability and drift monitoring).
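As one illustration, citation coverage can be tracked as a simple ratio over production traffic. The record shape here is an assumption; the point is that the signal is cheap to compute and moves before user complaints do:

```python
def citation_coverage(answers: list[dict]) -> float:
    """Fraction of production answers that cite at least one retrieved source.

    Each answer is assumed to look like
    {"cited_doc_ids": [...], "retrieved_doc_ids": [...]}.
    A falling value is a leading indicator that retrieval has degraded or drifted.
    """
    if not answers:
        return 0.0
    covered = sum(
        1 for a in answers
        if set(a["cited_doc_ids"]) & set(a["retrieved_doc_ids"])
    )
    return covered / len(answers)
```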
Finally, treat retrieval tests as part of your release pipeline. Every change to chunking, embeddings, or ranking should run the golden query suite, just like unit tests for software.
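A minimal sketch of such a gate as a pytest test, reusing the helpers sketched above (assumed to be collected in a hypothetical `retrieval_eval` module) and a placeholder `search` function standing in for the production retriever. The threshold is illustrative; set it from your current baseline:

```python
# test_retrieval_regression.py -- run on every chunking, embedding, or ranking change.
from retrieval_eval import load_golden_queries, recall_at_k  # helpers sketched earlier

K = 5
RECALL_FLOOR = 0.85  # illustrative threshold derived from the current baseline

def search(query: str, k: int) -> list[str]:
    """Placeholder for the production retriever; returns top-k document IDs."""
    raise NotImplementedError

def test_golden_query_recall():
    golden = load_golden_queries("golden_queries.jsonl")
    per_query = [
        recall_at_k(search(q.query, K), set(q.acceptable_doc_ids), K)
        for q in golden
    ]
    assert sum(per_query) / len(per_query) >= RECALL_FLOOR
```

Failing the build when average recall@k drops below the floor turns retrieval quality from a periodic audit into a continuous guarantee.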