rag/ - Retrieval pipeline (corpus → extraction → embeddings → Chroma)

The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.

Lineage for every artefact below is documented in kb/AUDIT_TRAIL.md.

Build-time scripts

| File | Pipeline stage | What it produces |
| --- | --- | --- |
| `download_corpus.py` | 1. SOURCE → 3. DOWNLOAD | `rag/corpus/<insurer>/*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. |
| `download_retry.py` | 3. DOWNLOAD | Retries the failures from the previous run. |
| `download_regulatory.py` | 3. DOWNLOAD | IRDAI / regulatory PDFs (deferred from v1; see ADR-017). |
| `ingest.py` | 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX | The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-tok / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire (ADR-029). |
| `extract.py` | 8. STRUCTURED EXTRACTION | LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). Uses `get_brain_llm()` (Brain Main: Gemini 2.5 Flash primary per ADR-040), a NimChainLLM fallback chain across Google → NIM → OpenRouter; the separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse. Native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/<policy_id>.json` + upserts `policies.duckdb`. |
| `build_kb.py` | 9. SCORECARD → KB MIRROR | Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/<id>.md` tree. |
| `source_map.py` | post-build | Builds `source_map.json`: every chunk → (PDF path, page, span) for citation rendering. |
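To make the 800/120 sentence-aware chunking concrete, here is a minimal sketch of the general technique. It is not the code in `ingest.py`: the function names are hypothetical, the sentence splitter is a naive regex, and whitespace-separated words stand in for tokenizer tokens.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def n_tokens(sentence: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(sentence.split())


def chunk_text(text: str, max_tokens: int = 800, overlap: int = 120) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens tokens.

    Each new chunk starts with the trailing sentences of the previous one,
    carrying at most `overlap` tokens. A single sentence longer than
    max_tokens is kept whole as an oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for sentence in split_sentences(text):
        n = n_tokens(sentence)
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            # Walk backwards to collect the overlap for the next chunk.
            carried: List[str] = []
            carried_size = 0
            for prev in reversed(current):
                p = n_tokens(prev)
                if carried_size + p > overlap:
                    break
                carried.insert(0, prev)
                carried_size += p
            current, size = carried, carried_size
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Cutting only at sentence boundaries means no chunk starts or ends mid-sentence, at the price of chunks that are somewhat shorter than the budget.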

Runtime modules

| File | Stage | Notes |
| --- | --- | --- |
| `retrieve.py` | 10. RETRIEVAL | Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by (query_norm, top_k, sorted policy_ids, sorted insurer_slugs). |
| `schema.py` | 8. STRUCTURED EXTRACTION | The 62-field `HealthPolicy` Pydantic schema: single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. |
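The cache key described for `retrieve.py` can be illustrated with a small sketch. The class and function names below are hypothetical, the normalisation rule (lowercase, collapsed whitespace) is an assumption, and `search` stands in for whatever function performs the actual Chroma query.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List, Optional, Tuple

CacheKey = Tuple[str, int, Tuple[str, ...], Tuple[str, ...]]


def make_key(query: str, top_k: int,
             policy_ids: Optional[Iterable[str]] = None,
             insurer_slugs: Optional[Iterable[str]] = None) -> CacheKey:
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    query_norm = " ".join(query.lower().split())
    # Sorting makes the key independent of the order the filters arrive in.
    return (query_norm, top_k,
            tuple(sorted(policy_ids or ())),
            tuple(sorted(insurer_slugs or ())))


class RetrievalCache:
    """In-process LRU cache around a search callable (hypothetical wrapper)."""

    def __init__(self, search: Callable[..., List[str]], cap: int = 256) -> None:
        self._search = search
        self._cap = cap
        self._store: "OrderedDict[CacheKey, List[str]]" = OrderedDict()

    def get(self, query: str, top_k: int = 5,
            policy_ids: Optional[Iterable[str]] = None,
            insurer_slugs: Optional[Iterable[str]] = None) -> List[str]:
        key = make_key(query, top_k, policy_ids, insurer_slugs)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        result = self._search(query, top_k, policy_ids, insurer_slugs)
        self._store[key] = result
        if len(self._store) > self._cap:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result
```

With a key built this way, two requests that differ only in casing, spacing, or filter order share one cache entry.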

Persistent artefacts

| Path | Source of truth | Notes |
| --- | --- | --- |
| `rag/corpus/<insurer>/*.pdf` | insurer CDNs | 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs; dedup to 148 marketplace cards). Not in git; hydrated at Docker build from the companion HF dataset. |
| `rag/extracted/<policy_id>.json` | `extract.py` | 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. |
| `rag/vectors/chroma.sqlite3` + HNSW binaries | `ingest.py` | Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. |
| `rag/policies.duckdb` | `extract.py` | DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. |
| `rag/source_map.json` | `source_map.py` | chunk_id → (pdf_path, page, span) for the citation links shown in the UI. |
| `rag/SCHEMA.md` | hand-written | Field-by-field documentation of the Pydantic schema. |
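As an illustration of how a chunk_id → (pdf_path, page, span) mapping can drive citation rendering, here is a hedged sketch. The on-disk layout of `source_map.json` is not documented in this README, so the JSON shape, the `Citation` type, and both helper functions are assumptions made for the example.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, NamedTuple, Tuple


class Citation(NamedTuple):
    pdf_path: str
    page: int
    span: Tuple[int, int]  # assumed: (start, end) character offsets on the page


def load_source_map(raw_json: str) -> Dict[str, Citation]:
    # Assumed layout: {"<chunk_id>": {"pdf_path": ..., "page": ..., "span": [start, end]}}
    data = json.loads(raw_json)
    return {
        chunk_id: Citation(
            pdf_path=entry["pdf_path"],
            page=int(entry["page"]),
            span=(int(entry["span"][0]), int(entry["span"][1])),
        )
        for chunk_id, entry in data.items()
    }


def citation_label(citation: Citation) -> str:
    # Short human-readable label: PDF file name plus page number.
    return f"{PurePosixPath(citation.pdf_path).name}, p. {citation.page}"


# Entirely made-up example entry, for illustration only.
EXAMPLE = json.dumps({
    "example-chunk-0003": {
        "pdf_path": "rag/corpus/example-insurer/example-policy.pdf",
        "page": 12,
        "span": [140, 912],
    }
})
```

Calling `citation_label(load_source_map(EXAMPLE)["example-chunk-0003"])` yields a label built from the file name and page of that made-up entry.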

Subdirectories

  • corpus/ - raw PDFs (one folder per insurer slug). Generated by download_corpus.py. Per-PDF folders intentionally do not carry READMEs.
  • extracted/ - 62-field JSONs. Auto-generated by extract.py. Do not edit by hand.
  • vectors/ - Chroma persistent store. Treat as opaque; rebuild with python -m rag.ingest if corrupted.
  • _hf_dataset_backup/rag/{corpus,extracted,vectors}/ - offline canonical mirror of the companion HF Dataset (rohitsar567/insurance-bot-data). See ADR-020.
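The magic-byte sniff applied to files landing in corpus/ can be sketched as follows. This is a generic check, not the code in download_corpus.py: it relies only on the fact that PDF files begin with the bytes `%PDF-`, and the function names are hypothetical.

```python
from pathlib import Path
from typing import List

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def looks_like_pdf(head: bytes) -> bool:
    # True when the first bytes of a download carry the PDF header.
    return head.startswith(PDF_MAGIC)


def find_non_pdfs(corpus_dir: Path) -> List[Path]:
    """Return *.pdf files under corpus_dir whose content is not a PDF.

    Typical culprits are HTML error pages saved under a .pdf name.
    """
    bad: List[Path] = []
    for path in sorted(corpus_dir.rglob("*.pdf")):
        with path.open("rb") as fh:
            if not looks_like_pdf(fh.read(len(PDF_MAGIC))):
                bad.append(path)
    return bad
```

Checking content rather than the file extension catches the case where a server answers with an error page but the response is still written to a `.pdf` path.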

Cold-rebuild

python -m rag.download_corpus
python -m rag.download_retry
python -m rag.extract
rm -rf rag/vectors
python -m rag.ingest
python -m rag.build_kb
python -m eval.generate_gold
python -m eval.run

Total cost from cold: < $2 (BGE local + ~80 LLM extractions). Wall-time: ~30-40 min on a modern laptop.
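For unattended rebuilds, the command sequence above could be wrapped in a small driver that stops at the first failing step. This is a sketch, not a script shipped in the repo: the `rebuild` function and its injectable `run` / `clear` parameters are hypothetical, and it assumes it is launched from the repository root.

```python
import shutil
import subprocess
import sys
from typing import Callable

# Same order as the cold-rebuild commands listed in this README.
MODULES = [
    "rag.download_corpus",
    "rag.download_retry",
    "rag.extract",
    "rag.ingest",
    "rag.build_kb",
    "eval.generate_gold",
    "eval.run",
]
VECTORS_DIR = "rag/vectors"


def run_module(module: str) -> None:
    # check=True raises CalledProcessError, which aborts the rebuild on failure.
    subprocess.run([sys.executable, "-m", module], check=True)


def clear_vectors() -> None:
    # Equivalent of `rm -rf rag/vectors`; a missing directory is not an error.
    shutil.rmtree(VECTORS_DIR, ignore_errors=True)


def rebuild(run: Callable[[str], None] = run_module,
            clear: Callable[[], None] = clear_vectors) -> None:
    for module in MODULES:
        if module == "rag.ingest":
            clear()  # wipe the old Chroma store just before re-ingesting
        run(module)
```

Calling `rebuild()` runs the full sequence; passing stand-ins for `run` and `clear` gives a dry run that only records the order of steps.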

Related

  • ADR-004 - hybrid structured + vector retrieval rationale
  • ADR-011 - why local BGE replaced Voyage
  • ADR-018 - 800/120 chunk-size baseline
  • ADR-020 - code-vs-data repo split
  • ADR-029 - HNSW bloat tripwire (3-layer defence)
  • kb/AUDIT_TRAIL.md - end-to-end lineage doc