
Policy-Facts Extraction Contract (v1, 2026-05-16)

Single source of truth for every fill/verify agent. Deviating from this is a defect.

Goal

Replace every "value": null and every "value": 999/9999 (a poisoned "no-max" sentinel) in 40-data/policy_facts/<insurer>__<product>__<doctype>.json with a real value backed by a verbatim quote from the source PDF, or with an honest, sourced "not stated".

Path mapping (deterministic)

40-data/policy_facts/<insurer>__<product>__<doctype>.json → PDF: rag/corpus/<insurer>/<product>__<doctype>.pdf

Also check the file's own source_pdf_path and max_renewal_age.source_pdf_path. If no PDF exists locally, see "Missing PDF" below; do NOT guess.
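The mapping above is mechanical, so it can be sketched as a small helper. This is a minimal illustration, not part of the contract: the function name `pdf_path_for` and the sample slugs (`acme`, `gold`, `wording`) are hypothetical, and it assumes that none of the three filename parts contains a double underscore.

```python
from pathlib import Path


def pdf_path_for(facts_path: str, corpus_root: str = "rag/corpus") -> str:
    """Map a policy_facts JSON path to the PDF path the contract expects."""
    # The stem has the shape "<insurer>__<product>__<doctype>".
    parts = Path(facts_path).stem.split("__")
    if len(parts) != 3 or not all(parts):
        # Assumption: exactly three non-empty parts; anything else is reported,
        # never guessed.
        raise ValueError(f"unexpected policy_facts filename: {facts_path}")
    insurer, product, doctype = parts
    return f"{corpus_root}/{insurer}/{product}__{doctype}.pdf"


if __name__ == "__main__":
    # Hypothetical slugs, for illustration only.
    print(pdf_path_for("40-data/policy_facts/acme__gold__wording.json"))
```

Raising on an unexpected filename, rather than falling back to a best guess, matches the "do NOT guess" rule.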

Per-cell rule (apply to EVERY field dict that has a value key)

Only touch a cell if its current value is null, 999, or 9999. Leave already-good values untouched (do not "improve" them).
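One way to select the cells this rule targets is a recursive walk over the loaded JSON. The sketch below is hedged: `needs_fill` and `iter_fill_targets` are hypothetical names, the sentinels are assumed to be stored as JSON numbers (not strings), and the walk skips max_renewal_age because the contract excludes that field in its own section.

```python
SENTINELS = (999, 9999)  # the poisoned "no-max" values named in the contract


def needs_fill(cell) -> bool:
    """True if cell is a field dict whose value is null, 999, or 9999."""
    if not isinstance(cell, dict) or "value" not in cell:
        return False
    value = cell["value"]
    if value is None:
        return True
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and value in SENTINELS


def iter_fill_targets(node, path=()):
    """Yield (path, cell) for every cell that may be edited."""
    if isinstance(node, dict):
        # max_renewal_age is skipped entirely, per the contract.
        if needs_fill(node) and (not path or path[-1] != "max_renewal_age"):
            yield path, node
        for key, child in node.items():
            yield from iter_fill_targets(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_fill_targets(child, path + (index,))
```

Cells that already hold a real value are never yielded, so a caller that edits only what this generator returns cannot "improve" good data by accident.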

For each such cell, read the PDF and:

  1. Value found explicitly → set:
    • "value": the real value (int/number/bool/string/list per existing unit)
    • "source_quote": the verbatim sentence/clause from the PDF that states it (≤ 300 chars, copy exactly, do not paraphrase)
    • "source_pdf_path": the correct rag/corpus/...pdf path
    • "_confidence": "high"
  2. Value derivable by direct reading (e.g. table cell, "Annexure A lists 586 day-care procedures") → same as above with "_confidence": "medium" and a source_quote that contains the basis.
  3. Genuinely absent from this document → set:
    • "value": null
    • "source_quote": "not stated in <filename>.pdf"
    • "_confidence": "low"

NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number.
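The three outcomes above can be expressed as two small helpers. This is a sketch under stated assumptions: `fill_found` and `fill_absent` are hypothetical names, the sample quote and path in the usage are invented for illustration, and the "absent" case deliberately sets only the three keys the rule lists.

```python
MAX_QUOTE_CHARS = 300  # limit stated in the per-cell rule


def fill_found(cell: dict, value, quote: str, pdf_path: str,
               derived: bool = False) -> None:
    """Outcome 1 (explicit) or 2 (derived by direct reading)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if value is None or (is_number and value in (999, 9999)):
        raise ValueError("a filled cell needs a real value, never null/999/9999")
    if not quote.strip() or len(quote) > MAX_QUOTE_CHARS:
        raise ValueError("source_quote must be non-empty and at most 300 chars")
    cell["value"] = value
    cell["source_quote"] = quote
    cell["source_pdf_path"] = pdf_path
    cell["_confidence"] = "medium" if derived else "high"


def fill_absent(cell: dict, pdf_filename: str) -> None:
    """Outcome 3: the document genuinely does not state the value."""
    cell["value"] = None
    cell["source_quote"] = f"not stated in {pdf_filename}"
    cell["_confidence"] = "low"


if __name__ == "__main__":
    cell = {"value": 999, "unit": "days"}
    # Hypothetical quote and path, for illustration only.
    fill_found(cell, 30, "A waiting period of 30 days applies.",
               "rag/corpus/acme/gold__wording.pdf")
    print(cell)
```

Both helpers mutate the cell in place and leave other keys (such as a unit) untouched, which keeps key order stable when the file is written back.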

max_renewal_age: DO NOT FILL (field removed from scoring)

SKIP this field entirely. Do not read PDFs for it, do not set it, and do not add a lifelong_renewal key. IRDAI has mandated lifelong renewability for every health-indemnity product since 2020, so it is universal, does not differentiate policies, and has been removed from the scoring model. Leave any existing max_renewal_age cell exactly as-is (even if it is 999/null); it is dead data that will be stripped in the final cleanup pass. Spend zero time on it. Focus on every OTHER null cell.

Boolean fields (*_coverage, *_supported, ambulance_cover, etc.)

value must be true/false (JSON booleans) with a verbatim quote. "Covered under Section X" → true; an exclusions-list mention → false. Unclear → "value": null with "_confidence": "low".
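A check along these lines could catch string or numeric stand-ins for booleans before a file is written. It is only a sketch: `is_boolean_field` and `check_boolean_cell` are hypothetical names, and the pattern list covers just the names given in the heading above, which ends in "etc." and is therefore known to be incomplete.

```python
from fnmatch import fnmatchcase

# Assumption: only the patterns named in the contract heading; extend as needed.
BOOLEAN_PATTERNS = ("*_coverage", "*_supported", "ambulance_cover")


def is_boolean_field(name: str) -> bool:
    return any(fnmatchcase(name, pattern) for pattern in BOOLEAN_PATTERNS)


def check_boolean_cell(cell: dict) -> list:
    """Return a list of problems; an empty list means the cell conforms."""
    problems = []
    value = cell.get("value")
    if value is None:
        # Unclear cases stay null and must be marked low confidence.
        if cell.get("_confidence") != "low":
            problems.append('null value requires "_confidence": "low"')
    elif isinstance(value, bool):
        if not str(cell.get("source_quote") or "").strip():
            problems.append("true/false value requires a verbatim source_quote")
    else:
        problems.append(f"value must be a JSON boolean or null, got {value!r}")
    return problems
```

Strings such as "yes" or "Covered" are rejected rather than coerced, since coercion would hide an unsourced judgment.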

Hard constraints

  • Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code.
  • Output must be valid JSON: load with json.load, write with json.dump(d, f, ensure_ascii=False, indent=2); preserve key order.
  • Every non-null filled cell MUST have a non-empty source_quote that actually occurs in the PDF text. This is verified independently afterward; fabricated quotes will be caught and bounced.
  • No homepage/search URLs anywhere. English-only quotes.
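The JSON and quote constraints above can be sketched as follows. The helper names are hypothetical, and the whitespace normalization in `quote_occurs` is an assumption about how extracted PDF text is compared; the independent verifier mentioned above may be stricter. PDF text extraction itself is out of scope here, so the text is passed in as a string.

```python
import json
import re


def load_facts(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        # json.load returns dicts that keep the file's key order.
        return json.load(f)


def save_facts(path: str, d: dict) -> None:
    with open(path, "w", encoding="utf-8") as f:
        # Exactly the call the contract prescribes; no sort_keys, so the
        # key order of d is preserved.
        json.dump(d, f, ensure_ascii=False, indent=2)


def _collapse_ws(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def quote_occurs(quote: str, pdf_text: str) -> bool:
    """True if the quote appears in the PDF text, ignoring line wrapping."""
    needle = _collapse_ws(quote)
    return bool(needle) and needle in _collapse_ws(pdf_text)
```

Running `quote_occurs` on every filled cell before saving is a cheap self-check; a quote that fails it would presumably also fail the later verification.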

Missing PDF

If the mapped PDF does not exist locally: do not fill from memory. Record the file in your summary under missing_pdf and move on (a separate task is sourcing those PDFs).
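A minimal sketch of that check, assuming paths are relative to the repository root; the function name `split_by_pdf_presence` and the `repo_root` parameter are hypothetical.

```python
from pathlib import Path


def split_by_pdf_presence(pdf_paths, repo_root="."):
    """Partition mapped PDF paths into (present, missing) lists."""
    present, missing = [], []
    for pdf_path in pdf_paths:
        if (Path(repo_root) / pdf_path).is_file():
            present.append(pdf_path)
        else:
            # Goes into the summary's missing_pdf list; the facts file
            # is left untouched.
            missing.append(pdf_path)
    return present, missing
```

Only the files behind the `present` list are processed; the `missing` list is reported as-is.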

Agent output (return exactly this JSON)

{"insurers": ["..."], "files_processed": N, "cells_filled": N,
 "cells_left_null_sourced": N, "lifelong_flagged": N,
 "missing_pdf": ["path", ...], "anomalies": ["short notes"]}