
data/: Runtime + marketplace data

Three classes of file live here, intentionally side-by-side:

  1. Runtime state: written by the live server during normal operation (profiles/, sessions/, llm_health.json, llm_usage.jsonl).
  2. Pre-computed marketplace data: curated artefacts the server reads on every relevant turn (policy_facts/, premiums/, reviews/).
  3. Source/lineage maps: human-readable manifests recording the source each claim traces back to (corpus_urls.md, regulatory_urls.md, information_source_map.md).

The structured policy schema and PDFs themselves live under rag/. This folder is downstream.

Top-level files

| File | What it is | Owner |
|---|---|---|
| corpus_urls.md | Discovery manifest: every PDF URL ingested into rag/corpus/. | discovery agent / tools/check_link_rot.py |
| regulatory_urls.md | IRDAI / regulatory PDF URLs. See ADR-017. | discovery agent |
| information_source_map.md | Human-readable claim → URL → verdict map. Master audit doc for the Source Methodology directive. Mirror of eval/info_source_map.json. | tools/info_source_map.py |
| llm_health.json | Last per-provider health-probe snapshot (latency, success, last error). Powers the admin tab. | backend/llm_health.py |
| llm_usage.jsonl | Append-only per-call log: provider, model, tokens, latency, success. Aggregated in the admin tab. | backend/main.py |
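
The per-provider aggregation over llm_usage.jsonl could look roughly like the sketch below. It is not the code in backend/main.py: the key names (`provider`, `tokens`, `latency`, `success`) are assumed from the column description above, and `aggregate_usage` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Iterable


def aggregate_usage(lines: Iterable[str]) -> dict[str, dict[str, float]]:
    """Summarise calls, success rate, tokens and mean latency per provider.

    Assumes one JSON object per line with the keys "provider", "tokens",
    "latency" and "success" (assumed names, not confirmed by the source).
    """
    totals = defaultdict(lambda: {"calls": 0, "ok": 0, "tokens": 0, "latency": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in an append-only log
        rec = json.loads(line)
        t = totals[rec["provider"]]
        t["calls"] += 1
        t["ok"] += 1 if rec.get("success") else 0
        t["tokens"] += rec.get("tokens", 0)
        t["latency"] += rec.get("latency", 0.0)
    return {
        provider: {
            "calls": t["calls"],
            "success_rate": t["ok"] / t["calls"],
            "tokens": t["tokens"],
            "mean_latency": t["latency"] / t["calls"],
        }
        for provider, t in totals.items()
    }
```

Because the log is append-only, a file handle can be passed directly, for example `aggregate_usage(open("data/llm_usage.jsonl", encoding="utf-8"))`.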

Subdirectories

| Path | Class | Contents |
|---|---|---|
| profiles/ | runtime | Persistent named-profile JSON store (KI-040). One file per user, normalised-name slug. See data/profiles/README.md. |
| sessions/ | runtime | Per-session conversation state JSONs. Ephemeral; pruned periodically. Currently includes anonymous.json (no-name fallback). |
| policy_facts/ | pre-computed | 256 curated JSONs, one per policy variant. Each field carries {value, unit?, source_pdf_path, source_quote} provenance. The Indian-BFSI-audit-grade machine source; kb/policies/*.md are the human-readable mirror. See _curation_report.md for the three batches that built it. |
| policies/ | pre-computed | Subfolder per insurer with PDFs / supplementary text used for one-off lookups outside the main ingest pipeline. |
| premiums/ | pre-computed | illustrative_premiums.json: sample starting premiums pulled from PolicyBazaar / JoinDitto / Beshak + insurer rate cards (2026-05-13). Refreshed by tools/refresh_premiums.py. Illustrative only per ADR-007. |
| reviews/ | pre-computed | One JSON per insurer with IRDAI claim-settlement metrics, complaints/10K, aggregator sentiment, news tone. Index + leaderboard in reviews/INDEX.md. Source: IRDAI Annual Report 2023-24. |
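
As an illustration of the {value, unit?, source_pdf_path, source_quote} provenance shape used in policy_facts/, here is a minimal check for fields that lack provenance. It assumes every top-level key in a policy JSON maps to one such field object, which the real files may not follow exactly; `missing_provenance` and the sample field names are hypothetical.

```python
# "unit" is optional in the documented shape, so it is not required here.
REQUIRED_KEYS = ("value", "source_pdf_path", "source_quote")


def missing_provenance(policy: dict) -> dict[str, list[str]]:
    """Return {field_name: [missing keys]} for fields with incomplete provenance.

    Assumption: each top-level key of `policy` is a field whose value is a
    dict in the {value, unit?, source_pdf_path, source_quote} shape.
    """
    problems = {}
    for name, field in policy.items():
        if not isinstance(field, dict):
            problems[name] = list(REQUIRED_KEYS)  # not a field object at all
            continue
        # Treat None and empty string as "missing"; 0 and False are valid values.
        missing = [k for k in REQUIRED_KEYS if field.get(k) in (None, "")]
        if missing:
            problems[name] = missing
    return problems
```

An empty result means every field in the record carries a value, a source PDF path, and a supporting quote.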

Provenance + KPIs

| Metric | Value (2026-05-14) | Where to verify |
|---|---|---|
| Curated policy variants | 256 | data/policy_facts/ file count |
| Per-policy avg field completeness | 83.5% (Batch 1) | data/policy_facts/_curation_report.md |
| Information-source-map verdicts | ✅ 798 · ⚠️ 321 · ❌ 0 · ⏳ 1385 | eval/info_source_map.json |
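
The file count and an average completeness figure can be spot-checked with a short script such as the one below. This is a sketch under assumptions: completeness is taken to mean the share of the 48 schema fields (ADR-009) whose `value` is non-null, which may differ from the definition used in _curation_report.md, and `completeness` / `summarise` are hypothetical helpers.

```python
import json
from pathlib import Path


def completeness(policy: dict, expected_fields: int) -> float:
    """Fraction of expected fields that carry a non-null "value" (assumed definition)."""
    filled = sum(
        1
        for field in policy.values()
        if isinstance(field, dict) and field.get("value") is not None
    )
    return filled / expected_fields


def summarise(folder: Path, expected_fields: int = 48) -> tuple[int, float]:
    """Return (number of *.json files, mean completeness across them)."""
    files = sorted(folder.glob("*.json"))
    scores = [
        completeness(json.loads(p.read_text(encoding="utf-8")), expected_fields)
        for p in files
    ]
    mean = sum(scores) / len(scores) if scores else 0.0
    return len(files), mean


if __name__ == "__main__":
    count, mean = summarise(Path("data/policy_facts"))
    print(f"{count} policy variants, mean completeness {mean:.1%}")
```

Only `*.json` files are counted, so _curation_report.md is ignored automatically.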

Related

  • kb/AUDIT_TRAIL.md: end-to-end lineage; data/policy_facts/ is stage 8 output
  • kb/INDEX.md: policy index with completeness % per file
  • ADR-007: pricing is illustrative, never a real quote
  • ADR-009: 19-insurer scope + 48-field schema