RAG systems change constantly: new documents are ingested, chunking strategies evolve, embedding models are updated, ranking is tuned, and prompts are refined. Without a benchmark harness, every change is a gamble. With a harness, changes are measured and regressions are caught early.
Define what you are evaluating
RAG has two stages: retrieval and answer generation. Treat them separately:
- Retrieval quality. Are the right sources being retrieved?
- Answer quality. Does the assistant use the sources correctly and refuse when it should?
This split helps you diagnose failures faster (see RAG root cause analysis).
Build a golden query set
Start small. A golden set can be 50-200 queries that represent your top intents. For each query, store:
- Query text and optional user context.
- Expected source documents or source types.
- Expected constraints (timeframe, region, product tier).
- Expected "should abstain" cases for unknowns.
Keep it versioned like any other dataset (see evaluation datasets).
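A minimal sketch of one golden-set record, assuming the set is stored as code or JSONL; the field names here are illustrative, not a standard:

```python
# A minimal sketch of one golden-set record; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class GoldenQuery:
    query: str                                                  # query text
    user_context: dict | None = None                            # optional user context (role, region, ...)
    expected_sources: list[str] = field(default_factory=list)   # expected doc IDs or source types
    expected_constraints: dict = field(default_factory=dict)    # e.g. {"timeframe": "2024-Q1", "region": "EU"}
    should_abstain: bool = False                                 # True when the correct behavior is to refuse

EXAMPLE = GoldenQuery(
    query="What is the refund window for the Pro tier in the EU?",
    expected_sources=["refund-policy-eu"],
    expected_constraints={"region": "EU", "product_tier": "Pro"},
)
```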
Measure retrieval quality with objective metrics
Retrieval metrics should be computed from ranked results:
- Recall@k. What fraction of the relevant sources appeared in the top k results?
- MRR (mean reciprocal rank). How high did the first relevant result rank, on average?
- nDCG (normalized discounted cumulative gain). How well were multiple relevant results ordered?
- Coverage. How often do you retrieve at least one source per query?
If retrieval quality is weak, fix retrieval before prompt tuning (see retrieval quality and ranking and relevance).
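As a rough sketch, these metrics can be computed per query from a ranked list of document IDs and a set of labeled relevant IDs (binary relevance assumed), then averaged across the golden set:

```python
# Minimal sketches of the retrieval metrics above, assuming binary relevance
# labels and a ranked list of document IDs per query.
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant sources that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; 0 if none was retrieved. Average this for MRR."""
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """nDCG with binary gains: rewards placing relevant results near the top."""
    dcg = sum(1.0 / math.log2(i + 1) for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

def covered(ranked: list[str]) -> float:
    """1.0 if the query retrieved at least one source, else 0.0; average this for coverage."""
    return 1.0 if ranked else 0.0
```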
Check grounded answering, not just fluency
For the answer stage, focus on groundedness and policy compliance:
- Citation alignment. Do citations actually support the claim (see structured citations)?
- Abstention correctness. Does the assistant refuse or ask for clarification when evidence is missing (see answerability gates)?
- Format validity. If you require a structured output, does it validate (see structured validation)?
Rubrics help keep scoring consistent as your system evolves (see evaluation rubrics).
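The mechanical parts of these checks can be scripted. A minimal sketch follows, assuming the assistant returns JSON with `answer`, `citations` (source IDs), and `abstained` fields; these names are illustrative. Whether a cited passage actually supports the claim usually still needs an LLM or human judge on top of this.

```python
# A minimal sketch of deterministic answer-stage checks; field names are illustrative.
def check_answer(response: dict, retrieved_ids: set[str], should_abstain: bool) -> dict:
    results = {}

    # Format validity: required fields are present with the expected types.
    results["format_valid"] = (
        isinstance(response.get("answer"), str)
        and isinstance(response.get("citations"), list)
        and isinstance(response.get("abstained"), bool)
    )

    # Abstention correctness: refused when it should, answered when it could.
    results["abstention_correct"] = response.get("abstained", False) == should_abstain

    # Citation sanity: every cited ID must come from the retrieved set,
    # and a non-abstaining answer must cite at least one source.
    citations = set(response.get("citations", []))
    results["citations_resolve"] = citations <= retrieved_ids and (should_abstain or bool(citations))

    return results
```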
Run regression tests in CI
Once you have a golden set and metrics, turn it into a gate:
- Run on every retrieval/prompt/model change.
- Compare to a baseline and alert on significant drops.
- Store results with build metadata and versions.
This is the difference between shipping RAG and operating RAG (see AI testing pyramid).
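A minimal sketch of such a gate, assuming per-metric averages are written to JSON files alongside build metadata; the paths, metric names, and thresholds are illustrative:

```python
# A minimal sketch of a CI regression gate comparing current metrics to a baseline.
import json
import sys

THRESHOLDS = {"recall_at_5": 0.02, "mrr": 0.02, "groundedness": 0.03}  # allowed absolute drop

def _load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def gate(baseline_path: str = "baseline.json", current_path: str = "current.json") -> int:
    baseline = _load(baseline_path)["metrics"]
    current = _load(current_path)["metrics"]
    failures = []
    for metric, allowed_drop in THRESHOLDS.items():
        drop = baseline[metric] - current[metric]
        if drop > allowed_drop:
            failures.append(f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f}")
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0  # non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(gate())
```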
Make results easy to interpret
Dashboards should answer simple questions:
- Which intents regressed?
- Did retrieval fail or did the model misuse sources?
- Which sources were missing, stale, or permission-filtered?
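A rough sketch of the aggregation behind those answers, assuming each per-query result records an intent label, a regression flag, and a failure stage (illustrative field names):

```python
# A minimal sketch of the breakdown behind such a dashboard.
from collections import Counter

def breakdown(results: list[dict]) -> dict:
    # Which intents regressed, and where did failures originate ("retrieval" vs "generation")?
    regressed_by_intent = Counter(r["intent"] for r in results if r.get("regressed"))
    failures_by_stage = Counter(r["failure_stage"] for r in results if r.get("failure_stage"))
    return {"regressed_by_intent": regressed_by_intent, "failures_by_stage": failures_by_stage}
```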
A benchmark harness is not a research project. It is a practical tool that lets you improve RAG safely, with confidence and speed.