rag/ - Retrieval pipeline (corpus → extraction → embeddings → Chroma)
The end-to-end RAG pipeline: download insurer PDFs, parse + chunk, embed with local BGE-small, persist to Chroma, run structured extraction with a Pydantic schema, and serve top-k retrieval at query time.
Lineage for every artefact below is documented in kb/AUDIT_TRAIL.md.
Build-time scripts
| File | Pipeline stage | What it produces |
|---|---|---|
| `download_corpus.py` | 1. SOURCE → 3. DOWNLOAD | `rag/corpus/<insurer>/*.pdf` + `_manifest.json`. HEAD-checks + magic-byte sniff + retry on 403/timeout. |
| `download_retry.py` | 3. DOWNLOAD | Retries the failures from the previous run. |
| `download_regulatory.py` | 3. DOWNLOAD | IRDAI / regulatory PDFs (deferred from v1; see ADR-017). |
| `ingest.py` | 4. PARSE → 5. CHUNK → 6. EMBED → 7. INDEX | The big one. `read_pdf_pages` (pdfplumber) → `chunk_pages` (800-tok / 120 overlap, sentence-aware) → BGE embed → `chromadb.PersistentClient.add(...)`. Carries the in-process HNSW bloat tripwire (ADR-029). |
| `extract.py` | 8. STRUCTURED EXTRACTION | LLM extraction over each PDF using `schema.py::HealthPolicy` (62 fields). `get_brain_llm()` (Brain Main: Gemini 2.5 Flash primary per ADR-040) → a `NimChainLLM` fallback chain (the separate `get_fast_brain_llm()` accessor was removed in the 2026-05-15 three-chain collapse) across Google → NIM → OpenRouter. Native JSON mode (`response_mime_type=application/json` on Gemini, `response_format={"type":"json_object"}` on NIM). Writes `rag/extracted/<policy_id>.json` + upserts `policies.duckdb`. |
| `build_kb.py` | 9. SCORECARD → KB MIRROR | Runs `backend/scorecard.py` per policy and regenerates the human-readable `kb/policies/<id>.md` tree. |
| `source_map.py` | post-build | Builds `source_map.json`: every chunk → (PDF path, page, span) for citation rendering. |
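The sentence-aware chunking with an 800-token window and 120-token overlap described for `ingest.py` can be sketched as follows. This is a minimal illustration, not the actual `chunk_pages` implementation: the function names are hypothetical, tokens are approximated by whitespace-separated words (the real pipeline presumably counts model tokens), and the sentence splitter is deliberately naive.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text: str, max_tokens: int = 800, overlap: int = 120) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens, carrying
    up to `overlap` tokens of trailing sentences into the next chunk.
    A single sentence longer than the remaining budget still yields an
    oversized chunk; this sketch does not split inside sentences."""
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Walk backwards to collect the overlap tail, whole sentences only.
            tail: List[str] = []
            tail_len = 0
            for prev in reversed(current):
                k = count_tokens(prev)
                if tail_len + k > overlap:
                    break
                tail.insert(0, prev)
                tail_len += k
            current, current_len = tail, tail_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping the overlap aligned to sentence boundaries means a clause that straddles two chunks is retrievable from either one, at the cost of some duplicated text in the index.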
Runtime modules
| File | Stage | Notes |
|---|---|---|
| `retrieve.py` | 10. RETRIEVAL | Top-k Chroma query with policy-id / insurer-slug filters. In-process LRU cache (cap 256) keyed by (query_norm, top_k, sorted policy_ids, sorted insurer_slugs). |
| `schema.py` | 8. STRUCTURED EXTRACTION | The 62-field `HealthPolicy` Pydantic schema, the single source of truth for the extracted JSON shape. See `rag/SCHEMA.md` for the field-by-field doc. |
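The cache key used by `retrieve.py` can be illustrated with a small in-process LRU. This is a sketch under assumptions, not the module's code: `make_key`, `LRUCache`, and `get_or_compute` are hypothetical names, and the query normalisation (lowercasing and collapsing whitespace) is a guess at what `query_norm` means.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Iterable, Optional, Tuple

CACHE_CAP = 256  # cap stated for the retrieval cache


def make_key(
    query: str,
    top_k: int,
    policy_ids: Optional[Iterable[str]] = None,
    insurer_slugs: Optional[Iterable[str]] = None,
) -> Tuple:
    # Assumption: normalisation = lowercase + collapsed whitespace.
    query_norm = " ".join(query.lower().split())
    # Sorting makes the key independent of the order filters were passed in.
    return (
        query_norm,
        top_k,
        tuple(sorted(policy_ids or ())),
        tuple(sorted(insurer_slugs or ())),
    )


class LRUCache:
    """Least-recently-used cache backed by an OrderedDict."""

    def __init__(self, cap: int = CACHE_CAP) -> None:
        self.cap = cap
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = compute()  # cache miss: run the real query
        self._data[key] = value
        if len(self._data) > self.cap:
            self._data.popitem(last=False)  # evict the least recently used
        return value
```

Sorting the filter lists before building the key means that two requests with the same filters in a different order share one cache entry.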
Persistent artefacts
| Path | Source of truth | Notes |
|---|---|---|
| `rag/corpus/<insurer>/*.pdf` | insurer CDNs | 206 PDFs (188 product PDFs across 21 insurer slugs + 18 regulatory IRDAI/NHA docs; dedup to 148 marketplace cards). Not in git; hydrated at Docker build from the companion HF dataset. |
| `rag/extracted/<policy_id>.json` | `extract.py` | 201 JSONs, one per policy, conforming to `schema.HealthPolicy`. Generated; never hand-edit. |
| `rag/vectors/chroma.sqlite3` + HNSW binaries | `ingest.py` | Persistent Chroma store. Symlinked to `rag/_hf_dataset_backup/rag/vectors/` for the offline canonical copy. |
| `rag/policies.duckdb` | `extract.py` | DuckDB rollup of the 62-field JSONs; used for SQL-style filters in `backend/main.py`. |
| `rag/source_map.json` | `source_map.py` | chunk_id → (pdf_path, page, span) for the citation links shown in the UI. |
| `rag/SCHEMA.md` | hand-written | Field-by-field documentation of the Pydantic schema. |
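To make the chunk_id → (pdf_path, page, span) mapping concrete, here is a minimal lookup sketch. The on-disk JSON layout is an assumption (one object per chunk id with `pdf_path`, `page`, and `span` keys), and `load_source_map` and `citation_label` are hypothetical helpers; the real `source_map.json` may be shaped differently.

```python
import json
from typing import Dict, Tuple

# (pdf_path, page, (span_start, span_end))
SourceRef = Tuple[str, int, Tuple[int, int]]


def load_source_map(raw_json: str) -> Dict[str, SourceRef]:
    # Hypothetical layout:
    # {"<chunk_id>": {"pdf_path": str, "page": int, "span": [start, end]}}
    raw = json.loads(raw_json)
    return {
        chunk_id: (
            entry["pdf_path"],
            int(entry["page"]),
            (int(entry["span"][0]), int(entry["span"][1])),
        )
        for chunk_id, entry in raw.items()
    }


def citation_label(source_map: Dict[str, SourceRef], chunk_id: str) -> str:
    # Turn a retrieved chunk id into a human-readable citation string.
    pdf_path, page, (start, end) = source_map[chunk_id]
    return f"{pdf_path} p.{page} [{start}:{end}]"
```

In practice the JSON text would come from reading `rag/source_map.json`; passing it in as a string keeps the sketch independent of the file system.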
Subdirectories
- `corpus/`: raw PDFs (one folder per insurer slug). Generated by `download_corpus.py`. Per-PDF folders intentionally do not carry READMEs.
- `extracted/`: 62-field JSONs. Auto-generated by `extract.py`. Do not edit by hand.
- `vectors/`: Chroma persistent store. Treat as opaque; rebuild with `python -m rag.ingest` if corrupted.
- `_hf_dataset_backup/rag/{corpus,extracted,vectors}/`: offline canonical mirror of the companion HF Dataset (`rohitsar567/insurance-bot-data`). See ADR-020.
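Because `corpus/` is filled by an automated downloader, a blocked request can leave an HTML error page saved under a `.pdf` name. The magic-byte sniff mentioned for `download_corpus.py` guards against that; a minimal sketch, with hypothetical function names, might look like this. It is a strict check on the first five bytes, whereas some PDF readers tolerate a few bytes of leading junk.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF header starts with this marker


def looks_like_pdf(data: bytes) -> bool:
    # Strict check: the marker must be the very first bytes of the payload.
    return data.startswith(PDF_MAGIC)


def is_pdf_file(path: Path) -> bool:
    # Read only as many bytes as the marker needs, not the whole file.
    with path.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Running `is_pdf_file` over every file in `corpus/` after a download pass is a cheap way to find responses that should be retried.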
Cold-rebuild
```sh
python -m rag.download_corpus
python -m rag.download_retry
python -m rag.extract
rm -rf rag/vectors
python -m rag.ingest
python -m rag.build_kb
python -m eval.generate_gold
python -m eval.run
```
Total cost from cold: < $2 (BGE local + ~80 LLM extractions). Wall-time: ~30-40 min on a modern laptop.