Policy-Facts Extraction Contract (v1 — 2026-05-16)

Single source of truth for every fill/verify agent. Deviating from this is a defect.

Goal

Replace every "value": null and every "value": 999/9999 (a poisoned "no-max" sentinel) in 40-data/policy_facts/<insurer>__<product>__<doctype>.json with a real value backed by a verbatim quote from the source PDF — or an honest, sourced "not stated".

Path mapping (deterministic)

40-data/policy_facts/<insurer>__<product>__<doctype>.json → PDF: rag/corpus/<insurer>/<product>__<doctype>.pdf

Also check the file's own source_pdf_path and max_renewal_age.source_pdf_path. If no PDF exists locally, see "Missing PDF" below; do NOT guess.
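The mapping above can be sketched as a small helper. This is only an illustration: the function name `mapped_pdf_path` is hypothetical, and it assumes the insurer slug itself never contains a double underscore, so the first `__` in the filename separates insurer from product.

```python
from pathlib import PurePosixPath


def mapped_pdf_path(facts_path: str) -> str:
    """Map a policy_facts JSON path to the PDF path the contract expects.

    Assumption: the insurer slug contains no "__", so splitting on the
    first "__" yields (insurer, "<product>__<doctype>").
    """
    stem = PurePosixPath(facts_path).stem  # "<insurer>__<product>__<doctype>"
    insurer, rest = stem.split("__", 1)    # raises ValueError if "__" is absent
    return f"rag/corpus/{insurer}/{rest}.pdf"
```

For a made-up file such as `40-data/policy_facts/acme__gold__wording.json`, this returns `rag/corpus/acme/gold__wording.pdf`. The result is only the expected location; whether the PDF is actually present still has to be checked.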

Per-cell rule (apply to EVERY field dict that has a `value` key)

Only touch a cell if its current value is null, 999, or 9999. Leave already-good values untouched (do not "improve" them).
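A minimal sketch of that selection rule follows. The helper names are hypothetical, and it assumes cells may be nested at any depth inside the JSON document and that `max_renewal_age` (which this contract says to skip) appears as a dict key.

```python
SENTINELS = (999, 9999)          # poisoned "no-max" values named in the Goal
SKIPPED_KEYS = {"max_renewal_age"}  # field the contract says not to fill


def needs_fill(cell: dict) -> bool:
    """True only when the cell's current value is null, 999, or 9999."""
    value = cell["value"]
    if value is None:
        return True
    if isinstance(value, bool):  # bool is an int subclass; never a sentinel
        return False
    return isinstance(value, (int, float)) and value in SENTINELS


def iter_cells(node, path=()):
    """Yield (path, cell) for every dict that has a "value" key."""
    if isinstance(node, dict):
        if "value" in node:
            yield path, node
        for key, child in node.items():
            if key in SKIPPED_KEYS:
                continue  # leave max_renewal_age exactly as-is
            yield from iter_cells(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_cells(child, path + (index,))
```

Filtering `iter_cells(doc)` with `needs_fill` gives the work list for one file; every cell not on that list is left untouched.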

For each such cell, read the PDF and:

Value found explicitly → set:
- "value": the real value (int/number/bool/string/list per existing unit)
- "source_quote": the verbatim sentence/clause from the PDF that states it (≤ 300 chars, copy exactly, do not paraphrase)
- "source_pdf_path": the correct rag/corpus/...pdf path
- "_confidence": "high"

Value derivable by direct reading (e.g. table cell, "Annexure A lists 586 day-care procedures") → same as above, but with "_confidence": "medium" and a source_quote that contains the basis.

Genuinely absent from this document → set:
- "value": null
- "source_quote": "not stated in <filename>.pdf"
- "_confidence": "low"

NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number.
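The three outcomes could be applied with helpers along these lines. This is a sketch, not the project's code: `fill_found` and `mark_not_stated` are hypothetical names, and the guard clauses simply restate the rules above.

```python
from pathlib import PurePosixPath

MAX_QUOTE_CHARS = 300


def fill_found(cell: dict, value, quote: str, pdf_path: str, derived: bool = False) -> None:
    """Record an explicit value ("high") or a directly derived one ("medium")."""
    if value is None or (not isinstance(value, bool) and value in (999, 9999)):
        raise ValueError("a filled cell needs a real value, never null/999/9999")
    if not quote.strip() or len(quote) > MAX_QUOTE_CHARS:
        raise ValueError("source_quote must be non-empty and at most 300 chars")
    # dict.update keeps the position of existing keys and appends new ones.
    cell.update({
        "value": value,
        "source_quote": quote,
        "source_pdf_path": pdf_path,
        "_confidence": "medium" if derived else "high",
    })


def mark_not_stated(cell: dict, pdf_path: str) -> None:
    """Record an honest, sourced null for a value the PDF does not state."""
    filename = PurePosixPath(pdf_path).name  # e.g. "gold__wording.pdf"
    cell.update({
        "value": None,
        "source_quote": f"not stated in {filename}",
        "_confidence": "low",
    })
```

Neither helper checks that the quote really occurs in the PDF; that remains a separate verification step.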

`max_renewal_age` — DO NOT FILL (field removed from scoring)

SKIP this field entirely. Do not read PDFs for it, do not set it, do not add a lifelong_renewal key. Lifelong renewability is mandated by IRDAI for every health-indemnity product (since 2020) — it is universal, so it does not differentiate policies and has been removed from the scoring model. Leave any existing max_renewal_age cell exactly as-is (even if it is 999/null); it is dead data that will be stripped in the final cleanup pass. Spend zero time on it. Focus on every OTHER null cell.

Boolean fields (`_coverage`, `_supported`, `ambulance_cover`, etc.)

value must be true or false (JSON booleans, not the strings "true"/"false"), backed by a verbatim quote. "Covered under Section X" → true; a mention in the exclusions list → false. If the document is unclear → "value": null with "_confidence": "low".
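A small self-check for boolean cells might look like the following. The function name is hypothetical, and deciding which fields count as boolean is left to the caller, since the contract lists them only by example.

```python
def boolean_cell_problems(cell: dict) -> list:
    """Return a list of rule violations for one boolean-type cell."""
    problems = []
    value = cell.get("value")
    if value is None:
        # Unclear in the document: must be a low-confidence null.
        if cell.get("_confidence") != "low":
            problems.append('null boolean must carry "_confidence": "low"')
    elif not isinstance(value, bool):
        problems.append(f"value must be a JSON boolean, got {type(value).__name__}")
    elif not str(cell.get("source_quote") or "").strip():
        problems.append("true/false needs a verbatim source_quote")
    return problems
```

An empty list means the cell passes these checks; it does not prove that the quote supports the value.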

Hard constraints

Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code.
Output must be valid JSON — load with json.load, write with json.dump(d, f, ensure_ascii=False, indent=2); preserve key order.
Every non-null filled cell MUST have a non-empty source_quote that actually occurs in the PDF text. This is verified independently afterward — fabricated quotes will be caught and bounced.
No homepage/search URLs anywhere. English-only quotes.
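The JSON round trip and a quote self-check can be sketched as below. The `json` calls are the ones the contract names; the whitespace normalisation in `quote_occurs` is an assumption about how extracted PDF text should be compared, and the independent verifier may well be stricter. Extracting the PDF text is left to the caller.

```python
import json
import re


def load_facts(path):
    """Load a policy_facts file; dicts keep the file's key order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_facts(path, data):
    """Write the file back exactly as the contract specifies."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def _squash(text: str) -> str:
    """Collapse whitespace runs, since PDF extraction often breaks lines."""
    return re.sub(r"\s+", " ", text).strip()


def quote_occurs(quote: str, pdf_text: str) -> bool:
    """True if the non-empty quote appears in the already-extracted PDF text."""
    needle = _squash(quote)
    return bool(needle) and needle in _squash(pdf_text)
```

Because cells are modified in place and `json.load` returns insertion-ordered dicts, existing keys keep their positions when the file is written back.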

Missing PDF

If the mapped PDF does not exist locally: do not fill from memory. Record the file in your summary under missing_pdf and move on (a separate task is sourcing those PDFs).
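A sketch of that check is shown here. The function name is hypothetical, and it assumes the summary should list the policy_facts file whose PDF is missing; if the missing PDF path is wanted instead, append `pdf_path` rather than `facts_path`.

```python
from pathlib import Path


def split_by_pdf_presence(facts_to_pdf: dict, repo_root: str = "."):
    """Split facts files into those whose mapped PDF exists and those without.

    facts_to_pdf maps each policy_facts path to its mapped PDF path.
    Returns (ready, missing_pdf); files in missing_pdf must not be filled.
    """
    root = Path(repo_root)
    ready, missing_pdf = [], []
    for facts_path, pdf_path in facts_to_pdf.items():
        if (root / pdf_path).is_file():
            ready.append(facts_path)
        else:
            missing_pdf.append(facts_path)  # assumption: record the facts file
    return ready, missing_pdf
```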

Agent output (return exactly this JSON)

{"insurers": ["..."], "files_processed": N, "cells_filled": N,
 "cells_left_null_sourced": N, "lifelong_flagged": N,
 "missing_pdf": ["path", ...], "anomalies": ["short notes"]}
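One way to emit a summary with exactly these keys, in this order, is sketched below. The function name is hypothetical. Defaulting `lifelong_flagged` to 0 is an assumption that follows from the instruction to skip max_renewal_age; the contract does not state what the counter should hold.

```python
import json


def build_summary(insurers, files_processed, cells_filled,
                  cells_left_null_sourced, missing_pdf, anomalies,
                  lifelong_flagged=0) -> str:
    """Serialise the agent summary with the keys the contract lists."""
    summary = {
        "insurers": list(insurers),
        "files_processed": int(files_processed),
        "cells_filled": int(cells_filled),
        "cells_left_null_sourced": int(cells_left_null_sourced),
        "lifelong_flagged": int(lifelong_flagged),  # assumed 0; field is skipped
        "missing_pdf": list(missing_pdf),
        "anomalies": list(anomalies),
    }
    return json.dumps(summary, ensure_ascii=False)
```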