RAG systems fail quietly when ingestion is treated as a one-time ETL job. In production, ingestion is an ongoing pipeline: sources change, permissions change, and chunking strategies evolve. If you cannot reproduce what was indexed and when, you cannot debug quality issues or meet assurance expectations.
Start with source-of-truth agreements
Before any chunking decisions, clarify:
- Authoritative sources. Which systems are allowed to answer which questions.
- Ownership. Named owners for each source, with a path for deprecation.
- Freshness requirements. How quickly updates must appear in the index.
This governance layer is as important as the retrieval algorithm (see knowledge base governance).
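One way to make these agreements checkable is to record them as data that the pipeline reads. The sketch below is illustrative only: `SourceAgreement`, `sources_for_topic`, and `stale_sources` are hypothetical names, and it assumes the pipeline already tracks when each source was last indexed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceAgreement:
    """Hypothetical record of one source-of-truth agreement."""
    source_id: str
    owner: str                 # named owner for the source
    topics: tuple              # question areas this source may answer
    max_staleness: timedelta   # freshness requirement
    deprecated: bool = False   # set when the source is being retired


def sources_for_topic(agreements, topic):
    """Return the non-deprecated sources that are authoritative for a topic."""
    return [a.source_id for a in agreements
            if topic in a.topics and not a.deprecated]


def stale_sources(agreements, last_indexed, now):
    """Return active sources whose last indexing breaches the freshness requirement.

    `last_indexed` maps source_id -> datetime of the last successful ingestion.
    A source that has never been indexed counts as stale.
    """
    stale = []
    for a in agreements:
        if a.deprecated:
            continue
        indexed_at = last_indexed.get(a.source_id)
        if indexed_at is None or now - indexed_at > a.max_staleness:
            stale.append(a.source_id)
    return stale


agreements = [
    SourceAgreement("hr-wiki", "hr-team", ("benefits", "leave"), timedelta(hours=24)),
    SourceAgreement("legacy-faq", "it-team", ("benefits",), timedelta(days=7), deprecated=True),
    SourceAgreement("policy-repo", "legal", ("compliance",), timedelta(hours=1)),
]
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_indexed = {
    "hr-wiki": datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc),
    "policy-repo": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
}
```

With this in place, a scheduled job could call `stale_sources` and alert the named owner when a freshness requirement is breached.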
Chunking is an information architecture decision
Chunking determines what evidence can be retrieved. Practical guidelines:
- Prefer semantic boundaries. Chunk by headings, sections, or logical units rather than by fixed token counts alone.
- Keep stable IDs. Every chunk should have a source ID, version, and location pointer.
- Store context metadata. Title, section headers, dates, owners, and classification tags.
For PDF-heavy environments, plan for OCR and table extraction, and treat extraction quality as a measurable KPI (see document intelligence).
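The guidelines above can be sketched as a heading-based chunker. This is a minimal example under stated assumptions, not a definitive implementation: it assumes the extracted text uses Markdown-style `#` headings, and the `Chunk` fields and the `chunk_by_headings` function are hypothetical names. The chunk ID is derived from the source ID and section path only, so it stays the same when the document is re-ingested under a new version.

```python
import hashlib
import re
from dataclasses import dataclass, field

# Assumption: headings look like "# Title" or "## Section" (Markdown-style).
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


@dataclass
class Chunk:
    chunk_id: str       # stable across versions of the same source
    source_id: str
    version: str
    section_path: list  # e.g. ["Guide", "Setup"]
    start_line: int     # location pointer into the source text
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_headings(source_id, version, text, metadata=None):
    """Split text at headings and attach IDs, location, and context metadata."""
    chunks = []
    seen_paths = {}                 # disambiguates repeated section paths
    path, buf, start = [], [], 1    # heading stack, pending lines, first line

    def flush():
        body = "\n".join(buf).strip()
        if not body:
            return  # skip headings that have no body text
        location = "/".join(path) or "(root)"
        occurrence = seen_paths.get(location, 0)
        seen_paths[location] = occurrence + 1
        raw = f"{source_id}|{location}|{occurrence}"
        chunk_id = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
        chunks.append(Chunk(chunk_id, source_id, version, list(path),
                            start, body, dict(metadata or {})))

    for lineno, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Keep ancestors above this level, then push the new heading.
            path = path[: level - 1] + [match.group(2).strip()]
            buf, start = [], lineno
        else:
            buf.append(line)
    flush()
    return chunks


doc = "# Guide\nintro\n## Setup\nstep one\nstep two\n## Usage\nrun it\n"
chunks_v1 = chunk_by_headings("kb-42", "v1", doc, {"owner": "docs-team"})
chunks_v2 = chunk_by_headings("kb-42", "v2", doc, {"owner": "docs-team"})
```

Deriving the ID from the section path is a design choice: renaming a heading changes the ID, which may or may not be what you want. Sections that exceed the embedding model's input limit would still need a secondary split, which this sketch leaves out.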
Version your corpus
Indexing is not just “content in.” You need a release model:
- Snapshot IDs. Identify which corpus version was used for each response.
- Event-driven ingestion. Where possible, ingest on source change events.
- Rollback paths. If a bad batch is ingested, you should be able to revert quickly to the last known-good snapshot.
This aligns RAG with safe release practices used elsewhere in AI systems (see canary rollouts and incident response).
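As a rough illustration of the release model, the in-memory sketch below publishes snapshots, tags each response with the active snapshot ID, and rolls back by moving a pointer. `CorpusReleases`, `Snapshot`, and `tag_response` are hypothetical names; a real system would persist snapshots and swap index aliases rather than hold sets in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    created_at: datetime
    chunk_ids: frozenset  # which chunks belong to this corpus version


class CorpusReleases:
    """Hypothetical in-memory registry of corpus snapshots."""

    def __init__(self):
        self._snapshots = []
        self._active = -1  # index of the snapshot currently being served

    def publish(self, chunk_ids):
        """Record a new snapshot and make it the active one."""
        snapshot = Snapshot(
            snapshot_id=f"snap-{len(self._snapshots) + 1:04d}",
            created_at=datetime.now(timezone.utc),
            chunk_ids=frozenset(chunk_ids),
        )
        self._snapshots.append(snapshot)
        self._active = len(self._snapshots) - 1
        return snapshot

    @property
    def active(self):
        if self._active < 0:
            raise LookupError("no snapshot has been published")
        return self._snapshots[self._active]

    def rollback(self):
        """Point serving back at the previous snapshot without deleting anything."""
        if self._active <= 0:
            raise LookupError("no earlier snapshot to roll back to")
        self._active -= 1
        return self.active


def tag_response(releases, answer):
    """Attach the active snapshot ID so a response can be traced to a corpus version."""
    return {"answer": answer, "snapshot_id": releases.active.snapshot_id}


releases = CorpusReleases()
releases.publish({"chunk-a", "chunk-b"})
releases.publish({"chunk-a", "chunk-c"})   # suppose this batch turns out to be bad
before = tag_response(releases, "draft answer")
releases.rollback()
after = tag_response(releases, "draft answer")
```

Because rollback only moves a pointer, the bad snapshot remains available for debugging after serving has reverted.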
Add quality checks to ingestion
Common ingestion checks include:
- Duplicate detection and near-duplicate chunk filtering.
- Detection of broken links and missing metadata.
- Classification enforcement (e.g., exclude restricted data from certain indexes).
- Embedding drift detection when the embedding model changes.
Finally, connect ingestion to evaluation. If retrieval recall drops after a chunking change, you want to know before users do (see retrieval quality and evaluation loops).
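Some of the checks listed above can be sketched as a single gate that each batch passes through before indexing. This is an assumption-laden example: the required metadata keys, the chunk dictionary shape, the `run_checks` function, and the 0.8 similarity threshold are all hypothetical, and near-duplicates are approximated with Jaccard similarity over word shingles. The pairwise comparison is quadratic, so a large corpus would need something like MinHash instead.

```python
import re

# Assumption: these metadata keys are mandatory in this pipeline.
REQUIRED_METADATA = ("title", "owner", "classification")


def shingles(text, k=3):
    """Return the set of k-word shingles of a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_checks(chunks, allowed_classifications, near_dup_threshold=0.8):
    """Split a batch into accepted chunks and (chunk_id, reason) rejections."""
    accepted, rejected, kept_shingles = [], [], []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        missing = [key for key in REQUIRED_METADATA if not meta.get(key)]
        if missing:
            rejected.append((chunk["id"], "missing metadata: " + ", ".join(missing)))
            continue
        if meta["classification"] not in allowed_classifications:
            rejected.append((chunk["id"],
                             "classification not allowed: " + meta["classification"]))
            continue
        current = shingles(chunk["text"])
        if any(jaccard(current, other) >= near_dup_threshold for other in kept_shingles):
            rejected.append((chunk["id"], "near-duplicate"))
            continue
        kept_shingles.append(current)
        accepted.append(chunk)
    return accepted, rejected


def _meta(classification="public", owner="docs-team"):
    return {"title": "Passwords", "owner": owner, "classification": classification}


batch = [
    {"id": "c1", "text": "Reset your password from the account settings page.",
     "metadata": _meta()},
    {"id": "c2", "text": "reset your password from the account settings page",
     "metadata": _meta()},
    {"id": "c3", "text": "Contact the help desk for hardware issues.",
     "metadata": _meta(owner="")},
    {"id": "c4", "text": "Salary bands for the engineering organisation.",
     "metadata": _meta(classification="restricted")},
]
accepted, rejected = run_checks(batch, allowed_classifications={"public", "internal"})
```

Returning the rejection reasons alongside the accepted chunks lets the pipeline report them per batch, which gives the evaluation loop something to correlate with retrieval metrics.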
A governed ingestion pipeline turns RAG from a prototype into an enterprise capability.