RAG performance is rarely about one model. It emerges from the interplay of retrieval, context shaping, generation and freshness. A disciplined evaluation loop keeps that system honest.
Measure retrieval first. Track recall@k, MRR and semantic diversity across representative intents. Validate that document chunks carry the right scope and metadata, and detect drift when taxonomies or file formats change.
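A minimal sketch of what that first measurement can look like, assuming each query in the evaluation set is labelled with its relevant document IDs; the function names and shapes are illustrative, not a library API.

```python
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(runs: Dict[str, List[str]],
                       labels: Dict[str, Set[str]],
                       k: int = 10) -> Dict[str, float]:
    """Average recall@k and MRR across a labelled query set."""
    recalls, rrs = [], []
    for query_id, retrieved in runs.items():
        relevant = labels.get(query_id, set())
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    n = max(len(runs), 1)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}
```

Slicing the same averages by intent, rather than reporting one global number, is what surfaces drift when taxonomies or file formats change.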
Layer grounding checks on top. Use automatic judges to score answer faithfulness to citations, completeness and policy compliance. Spot-check with human review on high-risk intents, and keep an adversarial set that includes outdated, conflicting and noisy sources.
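A sketch of one such automatic judge, assuming a generic `llm` callable that takes a prompt and returns the judge model's raw text; the rubric, JSON fields and parsing are placeholders to adapt to your own policy checks.

```python
import json
from typing import Callable, Dict, List

JUDGE_PROMPT = """You are grading a RAG answer.
Citations:
{citations}

Answer:
{answer}

Return JSON with fields:
  "faithful": true/false  (every claim is supported by the citations)
  "complete": true/false  (the answer covers the question's key points)
  "unsupported_claims": list of claims not backed by any citation
"""

def judge_answer(answer: str, citations: List[str],
                 llm: Callable[[str], str]) -> Dict:
    """Score one answer against its citations with an LLM judge.

    `llm` is any callable that maps a prompt string to the model's raw
    response; the judge prompt asks for JSON, which is parsed here.
    """
    prompt = JUDGE_PROMPT.format(
        citations="\n---\n".join(citations), answer=answer)
    return json.loads(llm(prompt))
```

Running the same judge over the adversarial set (outdated, conflicting and noisy sources) is what tells you whether it rewards honest refusals rather than confident fabrication.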
Test freshness and routing. Build scenarios that depend on recent updates, permission changes and multi-tenant isolation. Routing failures often stem from stale indexes, so include index age and ingestion lag as first-class metrics.
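One way to make index age and ingestion lag first-class, assuming you can read the last index build time and the newest timestamps on the source and indexed sides; the budget values are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FreshnessReport:
    index_age_hours: float      # time since the index was last rebuilt
    ingestion_lag_hours: float  # newest source doc vs. newest indexed doc
    stale: bool

def freshness_report(last_index_build: datetime,
                     newest_source_doc: datetime,
                     newest_indexed_doc: datetime,
                     max_age_hours: float = 24.0,
                     max_lag_hours: float = 6.0) -> FreshnessReport:
    """Flag the index as stale when age or ingestion lag exceeds its budget."""
    now = datetime.now(timezone.utc)
    age = (now - last_index_build).total_seconds() / 3600
    lag = (newest_source_doc - newest_indexed_doc).total_seconds() / 3600
    return FreshnessReport(
        index_age_hours=age,
        ingestion_lag_hours=max(lag, 0.0),
        stale=age > max_age_hours or lag > max_lag_hours,
    )
```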
Evaluate the end-to-end experience. Track latency budgets across retrieval, reranking and generation, and correlate them with task success, escalation rates and user edits. Run A/B tests on prompt and tool variations, and guardrail results with refusal and toxicity thresholds.
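A sketch of per-stage latency tracking against budgets, assuming the pipeline can be wrapped stage by stage; the budget figures are placeholders for your own SLOs, and the stage names are illustrative.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Illustrative per-stage budgets in milliseconds; tune to your own SLOs.
BUDGETS_MS = {"retrieval": 150, "rerank": 100, "generation": 2000}

class StageTimer:
    """Record wall-clock time per pipeline stage and compare it to budgets."""

    def __init__(self) -> None:
        self.timings_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

    def over_budget(self) -> Dict[str, float]:
        """Stages whose latency exceeds the budget, with the overshoot in ms."""
        return {name: ms - BUDGETS_MS[name]
                for name, ms in self.timings_ms.items()
                if name in BUDGETS_MS and ms > BUDGETS_MS[name]}
```

Wrapping each stage (`with timer.stage("retrieval"): ...`) and logging `over_budget()` alongside task success makes the latency-versus-quality correlation a query away rather than a forensic exercise.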
Make evaluation continuous. Automate nightly runs on synthetic and captured queries, surface regressions in dashboards, and block deployments when critical metrics fall. Treat evaluation assets—datasets, judges, prompts—as versioned, peer-reviewed artefacts.
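A minimal sketch of a deployment gate over the nightly run, assuming it produces a flat dictionary of metrics; the thresholds shown are illustrative and would live in versioned config alongside the other evaluation assets.

```python
import sys
from typing import Dict

# Illustrative critical thresholds; in practice, versioned and peer-reviewed.
THRESHOLDS = {
    "recall@10": 0.85,
    "faithfulness": 0.95,
    "refusal_rate_max": 0.02,
}

def gate(metrics: Dict[str, float]) -> int:
    """Return a non-zero exit code when any critical metric regresses."""
    failures = []
    if metrics.get("recall@10", 0.0) < THRESHOLDS["recall@10"]:
        failures.append("recall@10 below threshold")
    if metrics.get("faithfulness", 0.0) < THRESHOLDS["faithfulness"]:
        failures.append("faithfulness below threshold")
    if metrics.get("refusal_rate", 1.0) > THRESHOLDS["refusal_rate_max"]:
        failures.append("refusal rate above threshold")
    for reason in failures:
        print(f"BLOCKED: {reason}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    # Wire this into CI so a failing gate blocks the deployment.
    sys.exit(gate({"recall@10": 0.88, "faithfulness": 0.96,
                   "refusal_rate": 0.01}))
```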