`rag/` — Retrieval pipeline (corpus → extraction → embeddings → Chroma)

The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.

Lineage for every artefact below is documented in kb/AUDIT_TRAIL.md.

Build-time scripts

| File | Pipeline stage | What it produces |
| --- | --- | --- |
| `download_corpus.py` | 1. SOURCE → 3. DOWNLOAD | `rag/corpus/<insurer>/*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. |
| `download_retry.py` | 3. DOWNLOAD | Retries the failures from the previous run. |
| `download_regulatory.py` | 3. DOWNLOAD | IRDAI / regulatory PDFs (deferred from v1; see ADR-017). |
| `ingest.py` | 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX | The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-tok / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire (ADR-029). |
| `extract.py` | 8. STRUCTURED EXTRACTION | LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). Calls `get_brain_llm()` (Brain Main: Gemini 2.5 Flash primary per ADR-040), a `NimChainLLM` fallback chain across Google → NIM → OpenRouter. The separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse. Uses native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/<policy_id>.json` and upserts `policies.duckdb`. |
| `build_kb.py` | 9. SCORECARD → KB MIRROR | Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/<id>.md` tree. |
| `source_map.py` | post-build | Builds `source_map.json`: every chunk → (PDF path, page, span) for citation rendering. |
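
The sentence-aware chunking step in `ingest.py` can be pictured with the sketch below. It is not the project's `chunk_pages`: the function names are hypothetical, tokens are approximated by whitespace-separated words rather than a real tokenizer, and the sentence splitter is deliberately naive. It only illustrates packing whole sentences up to a size target and carrying a trailing overlap into the next chunk.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_tokens: int = 800, overlap: int = 120) -> List[str]:
    """Pack whole sentences into chunks that target max_tokens.

    Assumption: a "token" is a whitespace-separated word. The cap is soft:
    a single sentence longer than max_tokens still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences worth <= overlap tokens.
            tail: List[str] = []
            tail_len = 0
            for prev in reversed(current):
                m = len(prev.split())
                if tail_len + m > overlap:
                    break
                tail.insert(0, prev)
                tail_len += m
            current, current_len = tail, tail_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping on sentence boundaries rather than raw token offsets keeps each chunk readable on its own, at the price of an overlap that is only approximately the configured size.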

Runtime modules

| File | Stage | Notes |
| --- | --- | --- |
| `retrieve.py` | 10. RETRIEVAL | Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by `(query_norm, top_k, sorted policy_ids, sorted insurer_slugs)`. |
| `schema.py` | 8. STRUCTURED EXTRACTION | The 62-field `HealthPolicy` Pydantic schema, the single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. |
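
The cache key described for `retrieve.py` can be sketched as follows. This is an illustration, not the module's code: `make_key` and `LRUCache` are hypothetical names, and the normalisation rule (lowercase plus collapsed whitespace) is an assumption about what `query_norm` means. Sorting the filter lists makes the key independent of the order in which callers pass them.

```python
from collections import OrderedDict
from typing import Any, Iterable, Optional, Tuple

CacheKey = Tuple[str, int, Tuple[str, ...], Tuple[str, ...]]


def make_key(
    query: str,
    top_k: int,
    policy_ids: Optional[Iterable[str]] = None,
    insurer_slugs: Optional[Iterable[str]] = None,
) -> CacheKey:
    """Build a hashable key; filter order and query spacing do not matter."""
    query_norm = " ".join(query.lower().split())  # assumed normalisation
    return (
        query_norm,
        top_k,
        tuple(sorted(policy_ids or ())),
        tuple(sorted(insurer_slugs or ())),
    )


class LRUCache:
    """Minimal in-process LRU cache with a fixed capacity."""

    def __init__(self, cap: int = 256) -> None:
        self.cap = cap
        self._data: "OrderedDict[CacheKey, Any]" = OrderedDict()

    def get(self, key: CacheKey) -> Any:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: CacheKey, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cap:
            self._data.popitem(last=False)  # evict least recently used
```

A caller would check `cache.get(make_key(...))` before querying Chroma and `put` the result afterwards.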

Persistent artefacts

| Path | Source of truth | Notes |
| --- | --- | --- |
| `rag/corpus/<insurer>/*.pdf` | insurer CDNs | 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs; dedup to 148 marketplace cards). Not in git; hydrated at Docker build from the companion HF dataset. |
| `rag/extracted/<policy_id>.json` | `extract.py` | 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. |
| `rag/vectors/chroma.sqlite3` + HNSW binaries | `ingest.py` | Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. |
| `rag/policies.duckdb` | `extract.py` | DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. |
| `rag/source_map.json` | `source_map.py` | chunk_id → (pdf_path, page, span) for the citation links shown in the UI. |
| `rag/SCHEMA.md` | hand-written | Field-by-field documentation of the Pydantic schema. |
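
As a rough picture of how a chunk_id → (pdf_path, page, span) map can back citation links, consider the sketch below. The on-disk JSON layout of `source_map.json` is not documented here, so the shape used (an object keyed by chunk id with `pdf_path`, `page` and `span` fields), the sample entry, and the helper names are all assumptions made for illustration.

```python
import json
from typing import Dict, NamedTuple, Tuple


class Citation(NamedTuple):
    pdf_path: str
    page: int
    span: Tuple[int, int]  # assumed to be (start, end) offsets on the page


def load_source_map(raw: str) -> Dict[str, Citation]:
    """Parse the assumed JSON shape into typed Citation records."""
    entries = json.loads(raw)
    return {
        chunk_id: Citation(
            pdf_path=e["pdf_path"],
            page=int(e["page"]),
            span=(int(e["span"][0]), int(e["span"][1])),
        )
        for chunk_id, e in entries.items()
    }


def citation_link(source_map: Dict[str, Citation], chunk_id: str) -> str:
    """Return a link that opens the PDF at the cited page."""
    c = source_map[chunk_id]
    return f"{c.pdf_path}#page={c.page}"


# Hypothetical sample entry; real ids and paths will differ.
SAMPLE = json.dumps({
    "chunk-0001": {
        "pdf_path": "rag/corpus/example-insurer/sample.pdf",
        "page": 12,
        "span": [340, 910],
    }
})
```

With the sample above, `citation_link(load_source_map(SAMPLE), "chunk-0001")` yields a path ending in `#page=12`, the fragment most PDF viewers use to jump to a page.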

Subdirectories

- `corpus/`: raw PDFs (one folder per insurer slug). Generated by `download_corpus.py`. Per-PDF folders intentionally do not carry READMEs.
- `extracted/`: 62-field JSONs. Auto-generated by `extract.py`. Do not edit by hand.
- `vectors/`: Chroma persistent store. Treat as opaque; rebuild with `python -m rag.ingest` if corrupted.
- `_hf_dataset_backup/rag/{corpus,extracted,vectors}/`: offline canonical mirror of the companion HF Dataset (`rohitsar567/insurance-bot-data`). See ADR-020.

Cold-rebuild

```bash
python -m rag.download_corpus
python -m rag.download_retry
python -m rag.extract
rm -rf rag/vectors
python -m rag.ingest
python -m rag.build_kb
python -m eval.generate_gold
python -m eval.run
```

Total cost from a cold start: under $2 (BGE runs locally, so the spend is the ~80 LLM extractions). Wall time: ~30-40 min on a modern laptop.
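
Because each step consumes the output of the one before it, it can help to stop at the first failure instead of running the remaining commands against stale data. The wrapper below is a minimal sketch of that idea and is not part of the repository: `run_rebuild` and the injectable `runner` parameter are hypothetical, and the `rm -rf` step assumes a POSIX shell environment.

```python
import subprocess
from typing import Callable, List, Sequence

# The cold-rebuild commands, in the order listed above.
STEPS: List[List[str]] = [
    ["python", "-m", "rag.download_corpus"],
    ["python", "-m", "rag.download_retry"],
    ["python", "-m", "rag.extract"],
    ["rm", "-rf", "rag/vectors"],
    ["python", "-m", "rag.ingest"],
    ["python", "-m", "rag.build_kb"],
    ["python", "-m", "eval.generate_gold"],
    ["python", "-m", "eval.run"],
]


def run_rebuild(
    steps: Sequence[Sequence[str]] = STEPS,
    runner: Callable = subprocess.run,
) -> List[str]:
    """Run each step in order; raise on the first non-zero exit code."""
    done: List[str] = []
    for argv in steps:
        result = runner(list(argv))
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(argv)}")
        done.append(" ".join(argv))
    return done
```

Passing a stub `runner` makes it possible to dry-run the sequence without touching the corpus or the vector store.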

- ADR-004: hybrid structured + vector retrieval rationale
- ADR-011: why local BGE replaced Voyage
- ADR-018: 800/120 chunk-size baseline
- ADR-020: code-vs-data repo split
- ADR-029: HNSW bloat tripwire (3-layer defence)
- `kb/AUDIT_TRAIL.md`: end-to-end lineage doc
