	# `rag/` — Retrieval pipeline (corpus → extraction → embeddings → Chroma)

	The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.

	Lineage for every artefact below is documented in [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md).

	## Build-time scripts

	\| File \| Pipeline stage \| What it produces \|
	\| --- \| --- \| --- \|
	\| `download_corpus.py` \| 1. SOURCE → 3. DOWNLOAD \| `rag/corpus/<insurer>/*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. \|
	\| `download_retry.py` \| 3. DOWNLOAD \| Retries the failures from the previous run. \|
	\| `download_regulatory.py` \| 3. DOWNLOAD \| IRDAI / regulatory PDFs (deferred from v1; see [ADR-017](../70-docs/60-decisions/ADR-017-irdai-corpus-playwright-rescue.md)). \|
	\| `ingest.py` \| 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX \| The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-tok / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire ([ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md)). \|
	\| `extract.py` \| 8. STRUCTURED EXTRACTION \| LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). The model comes from `get_brain_llm()` (Brain Main), a `NimChainLLM` fallback chain across Google → NIM → OpenRouter with Gemini 2.5 Flash as primary per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md). The separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse. Uses native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/<policy_id>.json` and upserts into `policies.duckdb`. \|
	\| `build_kb.py` \| 9. SCORECARD → KB MIRROR \| Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/<id>.md` tree. \|
	\| `source_map.py` \| post-build \| Builds `source_map.json` — every chunk → (PDF path, page, span) for citation rendering. \|
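The 800-token / 120-overlap, sentence-aware chunking step in `ingest.py` can be illustrated with a minimal sketch. This is not the repo's `chunk_pages`; the function names, the regex sentence splitter, and the whitespace token count are all assumptions standing in for whatever tokenizer and splitter the real code uses.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the real tokenizer."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 800, overlap: int = 120) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens tokens.

    Each new chunk starts with trailing sentences of the previous chunk,
    up to `overlap` tokens. A single sentence longer than max_tokens is
    kept whole, so that one chunk can exceed the budget.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Walk backwards and carry sentences that fit in the overlap budget.
            carried: List[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if carried_tokens + t > overlap:
                    break
                carried.insert(0, prev)
                carried_tokens += t
            # Drop the overlap if it would push the next chunk over budget.
            if carried_tokens + n > max_tokens:
                carried, carried_tokens = [], 0
            current, current_tokens = carried, carried_tokens
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is carried in whole sentences, the actual overlap is at most 120 tokens and is often smaller; that is the trade-off for never cutting a sentence in half.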

	## Runtime modules

	\| File \| Stage \| Notes \|
	\| --- \| --- \| --- \|
	\| `retrieve.py` \| 10. RETRIEVAL \| Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by `(query_norm, top_k, sorted policy_ids, sorted insurer_slugs)`. \|
	\| `schema.py` \| 8. STRUCTURED EXTRACTION \| The 62-field `HealthPolicy` Pydantic schema — single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. \|
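The cache key described for `retrieve.py` can be sketched as follows. This is an illustration under stated assumptions, not the module's code: `make_cache_key`, `LRUCache`, and the lowercase-and-collapse-whitespace normalisation are hypothetical; only the key shape and the cap of 256 come from the table above.

```python
from collections import OrderedDict
from typing import Any, Iterable, Optional, Tuple

CacheKey = Tuple[str, int, Tuple[str, ...], Tuple[str, ...]]


def make_cache_key(
    query: str,
    top_k: int,
    policy_ids: Optional[Iterable[str]] = None,
    insurer_slugs: Optional[Iterable[str]] = None,
) -> CacheKey:
    """Build a hashable key that ignores filter order and query spacing/case."""
    query_norm = " ".join(query.lower().split())  # assumed normalisation
    return (
        query_norm,
        top_k,
        tuple(sorted(policy_ids or ())),
        tuple(sorted(insurer_slugs or ())),
    )


class LRUCache:
    """Small in-process LRU cache; evicts the least recently used entry."""

    def __init__(self, cap: int = 256) -> None:
        self.cap = cap
        self._data: "OrderedDict[CacheKey, Any]" = OrderedDict()

    def get(self, key: CacheKey) -> Any:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: CacheKey, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cap:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._data)
```

Sorting the filter lists before building the key means `["p1", "p2"]` and `["p2", "p1"]` hit the same cache entry, which is presumably why the key is defined over sorted ids.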

	## Persistent artefacts

	\| Path \| Source of truth \| Notes \|
	\| --- \| --- \| --- \|
	\| `rag/corpus/<insurer>/*.pdf` \| insurer CDNs \| 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs), deduplicated to 148 marketplace cards. Not in git; hydrated at Docker build from the companion HF dataset. \|
	\| `rag/extracted/<policy_id>.json` \| `extract.py` \| 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. \|
	\| `rag/vectors/chroma.sqlite3` + HNSW binaries \| `ingest.py` \| Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. \|
	\| `rag/policies.duckdb` \| `extract.py` \| DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. \|
	\| `rag/source_map.json` \| `source_map.py` \| chunk_id → (pdf_path, page, span) for the citation links shown in the UI. \|
	\| `rag/SCHEMA.md` \| hand-written \| Field-by-field documentation of the Pydantic schema. \|
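To show how a chunk_id → (pdf_path, page, span) mapping can drive citation links, here is a minimal sketch. The on-disk layout of `source_map.json` is not specified above, so the JSON shape, the field names, the example path, and the helper names are all assumptions for illustration.

```python
import json
from typing import Dict, NamedTuple, Tuple


class Citation(NamedTuple):
    pdf_path: str
    page: int
    span: Tuple[int, int]  # assumed to be (start, end) character offsets


# Hypothetical example of the file's contents; the real layout may differ.
EXAMPLE_SOURCE_MAP = """
{
  "chunk-0001": {
    "pdf_path": "rag/corpus/example-insurer/policy.pdf",
    "page": 12,
    "span": [340, 910]
  }
}
"""


def load_source_map(raw: str) -> Dict[str, Citation]:
    """Parse the JSON text into chunk_id -> Citation."""
    data = json.loads(raw)
    return {
        chunk_id: Citation(
            pdf_path=entry["pdf_path"],
            page=int(entry["page"]),
            span=(int(entry["span"][0]), int(entry["span"][1])),
        )
        for chunk_id, entry in data.items()
    }


def citation_link(citation: Citation) -> str:
    """Build a link that opens the PDF at the cited page (#page=N fragment)."""
    return f"{citation.pdf_path}#page={citation.page}"
```

For example, `citation_link(load_source_map(EXAMPLE_SOURCE_MAP)["chunk-0001"])` yields a link to page 12 of the example PDF.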

	## Subdirectories

	- `corpus/` — raw PDFs (one folder per insurer slug). Generated by `download_corpus.py`. Per-PDF folders intentionally do not carry READMEs.
	- `extracted/` — 62-field JSONs. Auto-generated by `extract.py`. Do not edit by hand.
	- `vectors/` — Chroma persistent store. Treat as opaque — re-build with `python -m rag.ingest` if corrupted.
	- `_hf_dataset_backup/rag/{corpus,extracted,vectors}/` — offline canonical mirror of the companion HF Dataset (`rohitsar567/insurance-bot-data`). See [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md).

	## Cold-rebuild

	```bash
	python -m rag.download_corpus
	python -m rag.download_retry
	python -m rag.extract
	rm -rf rag/vectors
	python -m rag.ingest
	python -m rag.build_kb
	python -m eval.generate_gold
	python -m eval.run
	```
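The same sequence can be wrapped in a small driver that stops at the first failing step, so a failed extraction does not go on to wipe and rebuild the vector store. This is a sketch, not a script shipped in the repo: `cold_rebuild`, `run_module`, and `clear_vectors` are hypothetical names, and the module list simply mirrors the commands above.

```python
import shutil
import subprocess
import sys
from typing import Callable, List

REBUILD_STEPS: List[str] = [
    "rag.download_corpus",
    "rag.download_retry",
    "rag.extract",
    "rag.ingest",
    "rag.build_kb",
    "eval.generate_gold",
    "eval.run",
]


def run_module(module: str) -> int:
    """Run `python -m <module>` with the current interpreter; return its exit code."""
    return subprocess.run([sys.executable, "-m", module]).returncode


def clear_vectors() -> None:
    """Equivalent of `rm -rf rag/vectors`."""
    shutil.rmtree("rag/vectors", ignore_errors=True)


def cold_rebuild(
    run: Callable[[str], int] = run_module,
    clear: Callable[[], None] = clear_vectors,
) -> List[str]:
    """Run every step in order; raise on the first non-zero exit code."""
    completed: List[str] = []
    for module in REBUILD_STEPS:
        if module == "rag.ingest":
            clear()  # wipe the old Chroma store just before re-ingesting
        code = run(module)
        if code != 0:
            raise RuntimeError(f"{module} exited with code {code}")
        completed.append(module)
    return completed
```

Calling `cold_rebuild()` from the repo root would run the full sequence; passing a custom `run` or `clear` makes the ordering easy to dry-run without touching any data.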

	Total cost from cold: < $2 (BGE embeddings run locally; ~80 LLM extractions). Wall time: ~30-40 min on a modern laptop.

	## Related

	- [ADR-004](../70-docs/60-decisions/ADR-004-hybrid-structured-vector.md) — hybrid structured + vector retrieval rationale
	- [ADR-011](../70-docs/60-decisions/ADR-011-bge-local-embeddings.md) — why local BGE replaced Voyage
	- [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md) — 800/120 chunk-size baseline
	- [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md) — code-vs-data repo split
	- [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md) — HNSW bloat tripwire (3-layer defence)
	- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) — end-to-end lineage doc