Many RAG quality problems are not model problems. They are content problems. Knowledge bases contain duplicates, near-duplicates, and conflicting versions of the same policy or procedure. Retrieval surfaces multiple versions, and the model blends them into a confident but inconsistent answer.
Define what "canonical" means
What counts as canonical is a governance decision, not a retrieval setting. For each content domain, decide which system is the source of truth and how conflicts between documents are resolved (see knowledge base governance).
Use metadata to group variants
Deduplication is easier when metadata is consistent. Useful fields include:
- Document ID and stable source URL/path.
- Version number or effective date.
- Owner and domain taxonomy (see metadata strategy).
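A minimal sketch of how these fields could be used to group variants and pick one canonical version per document. The `DocMeta` record, its field names, and the sample paths are hypothetical; the tie-break (latest effective date, then highest version) is an assumption rather than a rule from any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocMeta:
    doc_id: str            # stable ID shared by every version of a document
    source_path: str       # stable source URL or path
    version: int
    effective_date: date
    owner: str
    domain: str


def group_variants(docs):
    """Group metadata records by their shared document ID."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.doc_id].append(doc)
    return dict(groups)


def latest_variant(variants):
    """Assumed rule: latest effective date wins, version number breaks ties."""
    return max(variants, key=lambda d: (d.effective_date, d.version))


# Illustrative records only.
docs = [
    DocMeta("policy-travel", "/hr/travel-v1.pdf", 1, date(2022, 1, 1), "hr", "policy"),
    DocMeta("policy-travel", "/hr/travel-v2.pdf", 2, date(2023, 6, 1), "hr", "policy"),
    DocMeta("proc-onboarding", "/it/onboarding.md", 1, date(2023, 3, 15), "it", "procedure"),
]

groups = group_variants(docs)
canonical = {doc_id: latest_variant(v) for doc_id, v in groups.items()}
```

Grouping only works if every version carries the same `doc_id`, which is why a stable identifier matters more than any other field in the list.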
Deduplicate at ingestion, not at query time
Query-time filtering helps, but ingestion-time controls keep duplicates out of the index in the first place:
- Exact duplicate detection using hashes.
- Near-duplicate detection for repeated pages and templates.
- Chunk-level filtering so repeated boilerplate is not indexed (see ingestion pipelines).
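The three controls above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a production pipeline: the normalization rules, the 3-word shingles, the 0.8 Jaccard threshold, and the "appears in more than half of the documents" boilerplate rule are all hypothetical choices to tune per corpus. The pairwise comparison is quadratic in the number of documents; larger corpora commonly use MinHash with locality-sensitive hashing instead.

```python
import hashlib
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text):
    """SHA-256 of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word tuples from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_documents(docs, near_threshold=0.8):
    """docs maps key -> text. Keeps the first copy of each exact or near duplicate."""
    kept, seen_hashes, kept_shingles = {}, set(), []
    for key, text in docs.items():
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(text)
        if any(jaccard(sh, other) >= near_threshold for other in kept_shingles):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept[key] = text
    return kept


def drop_boilerplate_chunks(chunks_by_doc, max_doc_fraction=0.5):
    """Drop chunks that occur in more than max_doc_fraction of all documents."""
    doc_count = Counter()
    for chunks in chunks_by_doc.values():
        for digest in {content_hash(c) for c in chunks}:
            doc_count[digest] += 1
    limit = max_doc_fraction * len(chunks_by_doc)
    return {
        doc: [c for c in chunks if doc_count[content_hash(c)] <= limit]
        for doc, chunks in chunks_by_doc.items()
    }


# Illustrative inputs only.
docs = {
    "a": "Employees may expense up to 50 dollars per day for meals while travelling.",
    "b": "Employees  may expense up to 50 dollars per day for meals while travelling.",
    "c": "Employees may expense up to 50 dollars per day for meals while travelling abroad.",
    "d": "Laptops are refreshed every three years.",
}
kept = dedupe_documents(docs)

chunks_by_doc = {
    "a": ["Confidential. Internal use only.", "Meal limit is 50 dollars."],
    "d": ["Confidential. Internal use only.", "Laptops are refreshed every three years."],
    "e": ["Confidential. Internal use only.", "VPN is required off-site."],
}
cleaned = drop_boilerplate_chunks(chunks_by_doc)
```

In this example "b" is dropped as an exact duplicate of "a" once whitespace is normalized, "c" is dropped as a near duplicate, and the repeated confidentiality notice is removed from every document's chunk list.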
Handle conflicts deliberately
Resolving conflicts between documents should not be left to the model. Patterns that work:
- Prefer the newest content that is in effect. Filter by effective date where appropriate.
- Prefer authoritative sources. Use a source ranking or allowlist for high-stakes domains.
- Surface conflicts. When conflicts are detected, show citations for each version and ask the user to clarify (see citations and grounding).
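The patterns above can be combined into a single resolution step over retrieved hits. The sketch below is one possible arrangement, not a prescribed one: the `Hit` record, the `SOURCE_RANK` table, and the source names are hypothetical, and applying authority before recency is an assumption that suits high-stakes domains. Other domains may reasonably reverse the order.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical source ranking: a lower number means more authoritative.
SOURCE_RANK = {"policy-portal": 0, "wiki": 1, "shared-drive": 2}


@dataclass(frozen=True)
class Hit:
    doc_id: str
    source: str
    effective_date: date
    text: str


def source_rank(hit):
    """Unknown sources rank below every listed source."""
    return SOURCE_RANK.get(hit.source, len(SOURCE_RANK))


def resolve(hits, today):
    """Return (winners, conflicts) for a list of retrieved hits."""
    # 1. Ignore content that is not yet in effect.
    in_effect = [h for h in hits if h.effective_date <= today]

    by_doc = {}
    for h in in_effect:
        by_doc.setdefault(h.doc_id, []).append(h)

    winners, conflicts = [], []
    for variants in by_doc.values():
        # 2. Keep only the most authoritative source present.
        best = min(source_rank(v) for v in variants)
        top = [v for v in variants if source_rank(v) == best]
        # 3. Within that source, keep the newest effective date.
        newest = max(v.effective_date for v in top)
        finalists = [v for v in top if v.effective_date == newest]
        # 4. If the finalists still disagree, surface the conflict.
        if len({f.text for f in finalists}) > 1:
            conflicts.append(finalists)
        else:
            winners.append(finalists[0])
    return winners, conflicts


# Illustrative hits only.
hits = [
    Hit("travel", "wiki", date(2024, 1, 1), "Meal limit: 40"),
    Hit("travel", "policy-portal", date(2023, 6, 1), "Meal limit: 50"),
    Hit("travel", "policy-portal", date(2022, 1, 1), "Meal limit: 30"),
    Hit("travel", "policy-portal", date(2030, 1, 1), "Meal limit: 60"),
    Hit("leave", "wiki", date(2023, 1, 1), "20 days"),
    Hit("leave", "wiki", date(2023, 1, 1), "25 days"),
]
winners, conflicts = resolve(hits, today=date(2024, 6, 1))
```

Here the travel policy resolves to the 2023 policy-portal version, while the two leave documents remain an unresolved conflict that the application can present with citations instead of passing both to the model.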
Measure impact on retrieval quality and cost
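Deduplication can improve both quality and unit economics. A cleaner index returns fewer redundant candidates, which reduces the work left for reranking and shrinks the context sent to the model. Track retrieval quality metrics and token usage before and after deduplication to verify the effect (see retrieval quality and FinOps).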
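A before-and-after comparison might look like the following sketch. The metric names, the `canonical_of` mapping, and the sample rankings are hypothetical, and the whitespace word count is only a rough proxy for tokens; a real measurement would use the tokenizer of the model in question.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(1 for q, ranked in results.items() if set(ranked[:k]) & relevant[q])
    return hits / len(results)


def redundancy_at_k(results, canonical_of, k=5):
    """Average share of top-k slots spent on repeats of the same canonical document."""
    total = 0.0
    for ranked in results.values():
        top = ranked[:k]
        if not top:
            continue
        unique = {canonical_of.get(doc, doc) for doc in top}
        total += 1 - len(unique) / len(top)
    return total / len(results)


def approx_context_tokens(chunks):
    """Rough proxy: whitespace-separated words, not real model tokens."""
    return sum(len(chunk.split()) for chunk in chunks)


# Illustrative rankings for two queries, before and after deduplication.
before = {
    "q1": ["travel-v1", "travel-v2", "travel-copy", "laptop"],
    "q2": ["leave-v1", "leave-v2"],
}
after = {
    "q1": ["travel-v2", "laptop"],
    "q2": ["leave-v2"],
}
canonical_of = {
    "travel-v1": "travel", "travel-v2": "travel", "travel-copy": "travel",
    "leave-v1": "leave", "leave-v2": "leave",
}
relevant = {"q1": {"travel-v2"}, "q2": {"leave-v2"}}

print("redundancy before:", redundancy_at_k(before, canonical_of, k=4))
print("redundancy after: ", redundancy_at_k(after, canonical_of, k=4))
print("hit rate before:  ", hit_rate_at_k(before, relevant, k=4))
print("hit rate after:   ", hit_rate_at_k(after, relevant, k=4))
```

In this toy data the hit rate is unchanged while redundancy falls from 0.5 to 0.0, which is the pattern to look for: the same relevant content retrieved with fewer wasted context slots.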
High-performing RAG is as much content engineering as it is model engineering. Canonical sources and deduplication are foundational.