# `rag/` — Retrieval pipeline (corpus → extraction → embeddings → Chroma)

The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.

Lineage for every artefact below is documented in [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md).

## Build-time scripts

| File | Pipeline stage | What it produces |
| --- | --- | --- |
| `download_corpus.py` | 1. SOURCE → 3. DOWNLOAD | `rag/corpus//*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. |
| `download_retry.py` | 3. DOWNLOAD | Retries the failures from the previous run. |
| `download_regulatory.py` | 3. DOWNLOAD | IRDAI / regulatory PDFs (deferred from v1; see [ADR-017](../70-docs/60-decisions/ADR-017-irdai-corpus-playwright-rescue.md)). |
| `ingest.py` | 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX | The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-tok / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire ([ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md)). |
| `extract.py` | 8. STRUCTURED EXTRACTION | LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). `get_brain_llm()` (Brain Main: Gemini 2.5 Flash primary per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) — a `NimChainLLM` fallback chain (the separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse) across Google → NIM → OpenRouter. Native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/.json` + upserts `policies.duckdb`. |
| `build_kb.py` | 9. SCORECARD → KB MIRROR | Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/.md` tree. |
| `source_map.py` | post-build | Builds `source_map.json` — every chunk → (PDF path, page, span) for citation rendering. |

## Runtime modules

| File | Stage | Notes |
| --- | --- | --- |
| `retrieve.py` | 10. RETRIEVAL | Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by `(query_norm, top_k, sorted policy_ids, sorted insurer_slugs)`. |
| `schema.py` | 8. STRUCTURED EXTRACTION | The 62-field `HealthPolicy` Pydantic schema — single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. |

## Persistent artefacts

| Path | Source of truth | Notes |
| --- | --- | --- |
| `rag/corpus//*.pdf` | insurer CDNs | 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs; dedup to 148 marketplace cards). Not in git — hydrated at Docker build from the companion HF dataset. |
| `rag/extracted/.json` | `extract.py` | 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. |
| `rag/vectors/chroma.sqlite3` + HNSW binaries | `ingest.py` | Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. |
| `rag/policies.duckdb` | `extract.py` | DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. |
| `rag/source_map.json` | `source_map.py` | chunk_id → (pdf_path, page, span) for the citation links shown in the UI. |
| `rag/SCHEMA.md` | hand-written | Field-by-field documentation of the Pydantic schema. |

## Subdirectories

- `corpus/` — raw PDFs (one folder per insurer slug). Generated by `download_corpus.py`. Per-PDF folders intentionally do not carry READMEs.
- `extracted/` — 62-field JSONs. Auto-generated by `extract.py`. Do not edit by hand.
- `vectors/` — Chroma persistent store. Treat as opaque — re-build with `python -m rag.ingest` if corrupted.
- `_hf_dataset_backup/rag/{corpus,extracted,vectors}/` — offline canonical mirror of the companion HF Dataset (`rohitsar567/insurance-bot-data`). See [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md).

## Cold-rebuild

```bash
python -m rag.download_corpus
python -m rag.download_retry
python -m rag.extract
rm -rf rag/vectors
python -m rag.ingest
python -m rag.build_kb
python -m eval.generate_gold
python -m eval.run
```

Total cost from cold: < $2 (BGE local + ~80 LLM extractions). Wall-time: ~30-40 min on a modern laptop.

## Related

- [ADR-004](../70-docs/60-decisions/ADR-004-hybrid-structured-vector.md) — hybrid structured + vector retrieval rationale
- [ADR-011](../70-docs/60-decisions/ADR-011-bge-local-embeddings.md) — why local BGE replaced Voyage
- [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md) — 800/120 chunk-size baseline
- [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md) — code-vs-data repo split
- [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md) — HNSW bloat tripwire (3-layer defence)
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) — end-to-end lineage doc
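
As a note on the `retrieve.py` cache described under Runtime modules: the key `(query_norm, top_k, sorted policy_ids, sorted insurer_slugs)` is order-insensitive in its filter lists, so two calls with the same filters in a different order should hit the same entry. The sketch below illustrates that idea only. `make_cache_key`, `LRUCache`, and the normalisation rule (lowercase plus whitespace collapse) are hypothetical and are not taken from `retrieve.py`; only the cap of 256 and the key shape come from this README.

```python
from collections import OrderedDict
from typing import Any, Hashable, Iterable, Optional, Tuple

CACHE_CAP = 256  # cap stated in the Runtime modules table


def make_cache_key(
    query: str,
    top_k: int,
    policy_ids: Optional[Iterable[str]] = None,
    insurer_slugs: Optional[Iterable[str]] = None,
) -> Tuple[str, int, Tuple[str, ...], Tuple[str, ...]]:
    """Build a hashable key; filter lists are sorted so their order is irrelevant."""
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    query_norm = " ".join(query.lower().split())
    return (
        query_norm,
        top_k,
        tuple(sorted(policy_ids or ())),
        tuple(sorted(insurer_slugs or ())),
    )


class LRUCache:
    """Minimal least-recently-used cache backed by an OrderedDict."""

    def __init__(self, cap: int = CACHE_CAP) -> None:
        self.cap = cap
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cap:
            self._data.popitem(last=False)  # evict the least recently used entry
```

Sorting into tuples keeps the key hashable and makes `["b", "a"]` and `["a", "b"]` equivalent, which is presumably why the README lists the filter components as sorted.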
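
The 800-token / 120-overlap setting listed for `ingest.py` implies a stride of 680 tokens between chunk starts. The sketch below shows that sliding-window arithmetic on an already-tokenised list. It is a simplification under stated assumptions: `chunk_tokens` is a hypothetical helper, not the repo's `chunk_pages`, and it leaves out the sentence-aware boundary handling and the real tokenizer.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 800, overlap: int = 120) -> List[Sequence[T]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # 680 with the README's 800/120 baseline
    chunks: List[Sequence[T]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With these defaults, a 2000-token input yields three chunks starting at tokens 0, 680, and 1360, and the last 120 tokens of each chunk repeat as the first 120 of the next.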