# `rag/`: Retrieval pipeline (corpus → extraction → embeddings → Chroma)
The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.
Lineage for every artefact below is documented in [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md).
## Build-time scripts
| File | Pipeline stage | What it produces |
| --- | --- | --- |
| `download_corpus.py` | 1. SOURCE → 3. DOWNLOAD | `rag/corpus/<insurer>/*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. |
| `download_retry.py` | 3. DOWNLOAD | Retries the failures from the previous run. |
| `download_regulatory.py` | 3. DOWNLOAD | IRDAI / regulatory PDFs (deferred from v1; see [ADR-017](../70-docs/60-decisions/ADR-017-irdai-corpus-playwright-rescue.md)). |
| `ingest.py` | 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX | The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-token / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire ([ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md)). |
| `extract.py` | 8. STRUCTURED EXTRACTION | LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). Uses `get_brain_llm()` (Brain Main: Gemini 2.5 Flash primary per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)), a `NimChainLLM` fallback chain across Google → NIM → OpenRouter. The separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse. Native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/<policy_id>.json` + upserts `policies.duckdb`. |
| `build_kb.py` | 9. SCORECARD → KB MIRROR | Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/<id>.md` tree. |
| `source_map.py` | post-build | Builds `source_map.json`: every chunk → (PDF path, page, span) for citation rendering. |
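The sentence-aware 800/120 chunking used by `ingest.py` can be sketched roughly as below. This is an illustration, not the actual `chunk_pages` implementation: the function name `chunk_text`, the regex sentence splitter, and the use of whitespace-separated words as "tokens" are all assumptions made for the example.

```python
import re
from typing import List


def chunk_text(text: str, max_tokens: int = 800, overlap: int = 120) -> List[str]:
    """Greedy sentence-aware chunking with a token overlap between chunks.

    Assumption: a "token" is a whitespace-separated word. The real pipeline
    may count tokens with a model tokenizer instead.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []  # tokens of the chunk being built
    for sentence in sentences:
        tokens = sentence.split()
        # Close the current chunk if this sentence would overflow it,
        # carrying the last `overlap` tokens into the next chunk.
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(tokens)
        # A sentence that still does not fit is split hard at the budget.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap:]
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Breaking on sentence boundaries keeps clauses intact within a chunk, and the overlap means text near a boundary appears in both neighbouring chunks.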
## Runtime modules
| File | Stage | Notes |
| --- | --- | --- |
| `retrieve.py` | 10. RETRIEVAL | Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by `(query_norm, top_k, sorted policy_ids, sorted insurer_slugs)`. |
| `schema.py` | 8. STRUCTURED EXTRACTION | The 62-field `HealthPolicy` Pydantic schema, the single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. |
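The retrieval cache described for `retrieve.py` could look something like the sketch below. It only illustrates the key shape and the 256-entry LRU eviction. The names `make_key` and `LRUCache` and the lowercase-and-collapse-whitespace normalisation are assumptions, not the module's actual code.

```python
from collections import OrderedDict
from typing import Any, Iterable, Optional, Tuple

CacheKey = Tuple[str, int, Tuple[str, ...], Tuple[str, ...]]


def make_key(
    query: str,
    top_k: int,
    policy_ids: Optional[Iterable[str]] = None,
    insurer_slugs: Optional[Iterable[str]] = None,
) -> CacheKey:
    """Build a hashable key: (query_norm, top_k, sorted ids, sorted slugs)."""
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    query_norm = " ".join(query.lower().split())
    return (
        query_norm,
        top_k,
        tuple(sorted(policy_ids or ())),
        tuple(sorted(insurer_slugs or ())),
    )


class LRUCache:
    """Minimal in-process LRU cache with a fixed capacity."""

    def __init__(self, capacity: int = 256) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[CacheKey, Any]" = OrderedDict()

    def get(self, key: CacheKey) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: CacheKey, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

Sorting the filter lists before building the key means the same filters passed in a different order hit the same cache entry.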
## Persistent artefacts
| Path | Source of truth | Notes |
| --- | --- | --- |
| `rag/corpus/<insurer>/*.pdf` | insurer CDNs | 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs; deduplicated to 148 marketplace cards). Not in git; hydrated at Docker build from the companion HF dataset. |
| `rag/extracted/<policy_id>.json` | `extract.py` | 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. |
| `rag/vectors/chroma.sqlite3` + HNSW binaries | `ingest.py` | Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. |
| `rag/policies.duckdb` | `extract.py` | DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. |
| `rag/source_map.json` | `source_map.py` | chunk_id → (pdf_path, page, span) for the citation links shown in the UI. |
| `rag/SCHEMA.md` | hand-written | Field-by-field documentation of the Pydantic schema. |
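As an illustration of how the chunk_id → (pdf_path, page, span) mapping might be consumed, here is a minimal sketch. The on-disk JSON layout (an object keyed by chunk id, with `pdf_path`, `page`, and `span` fields), the `Citation` type, and the `#page=` link format are all assumptions. Check `source_map.py` for the real shape.

```python
import json
from typing import Dict, NamedTuple, Tuple


class Citation(NamedTuple):
    pdf_path: str
    page: int
    span: Tuple[int, int]  # assumed to be (start, end) offsets


def load_source_map(raw_json: str) -> Dict[str, Citation]:
    """Parse a source map, assuming {chunk_id: {pdf_path, page, span}}."""
    data = json.loads(raw_json)
    return {
        chunk_id: Citation(
            pdf_path=entry["pdf_path"],
            page=int(entry["page"]),
            span=(int(entry["span"][0]), int(entry["span"][1])),
        )
        for chunk_id, entry in data.items()
    }


def citation_link(citation: Citation) -> str:
    """Render a link that opens the PDF at the cited page."""
    return f"{citation.pdf_path}#page={citation.page}"
```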
## Subdirectories
- `corpus/`: raw PDFs (one folder per insurer slug). Generated by `download_corpus.py`. Per-PDF folders intentionally do not carry READMEs.
- `extracted/`: 62-field JSONs. Auto-generated by `extract.py`. Do not edit by hand.
- `vectors/`: Chroma persistent store. Treat as opaque; re-build with `python -m rag.ingest` if corrupted.
- `_hf_dataset_backup/rag/{corpus,extracted,vectors}/`: offline canonical mirror of the companion HF Dataset (`rohitsar567/insurance-bot-data`). See [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md).
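Before re-ingesting, it can be useful to confirm that everything under `corpus/` is really a PDF, in the spirit of the magic-byte sniff that `download_corpus.py` performs. The sketch below is a standalone approximation with hypothetical helper names (`looks_like_pdf`, `find_non_pdfs`), not the downloader's own check. It is deliberately strict and expects `%PDF-` at byte 0.

```python
from pathlib import Path
from typing import List

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the leading bytes carry the PDF header signature."""
    return head.startswith(PDF_MAGIC)


def find_non_pdfs(corpus_dir: Path) -> List[Path]:
    """List *.pdf files under corpus_dir whose first bytes are not '%PDF-'."""
    suspicious: List[Path] = []
    for path in sorted(corpus_dir.rglob("*.pdf")):
        with path.open("rb") as handle:
            if not looks_like_pdf(handle.read(len(PDF_MAGIC))):
                suspicious.append(path)
    return suspicious
```

Something like `find_non_pdfs(Path("rag/corpus"))` would then flag HTML error pages or truncated downloads that were saved with a `.pdf` extension.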
## Cold-rebuild
```bash
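# Stages 1-3: fetch insurer PDFs, then retry any failed downloads
python -m rag.download_corpus
python -m rag.download_retry
# Stage 8: structured extraction (writes rag/extracted/ and policies.duckdb)
python -m rag.extract
# Stages 4-7: drop the old vector store, then parse, chunk, embed, and index
rm -rf rag/vectors
python -m rag.ingest
# Stage 9: scorecards and the kb/policies/ mirror
python -m rag.build_kb
# Regenerate the eval gold set and run the eval suite
python -m eval.generate_gold
python -m eval.run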
```
Total cost from cold: < $2 (BGE local + ~80 LLM extractions). Wall-time: ~30-40 min on a modern laptop.
## Related
- [ADR-004](../70-docs/60-decisions/ADR-004-hybrid-structured-vector.md): hybrid structured + vector retrieval rationale
- [ADR-011](../70-docs/60-decisions/ADR-011-bge-local-embeddings.md): why local BGE replaced Voyage
- [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md): 800/120 chunk-size baseline
- [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md): code-vs-data repo split
- [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md): HNSW bloat tripwire (3-layer defence)
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md): end-to-end lineage doc