# Policy-Facts Extraction Contract (v1, 2026-05-16)
Single source of truth for every fill/verify agent. Deviating from this is a defect.
## Goal
Replace every `"value": null` and every `"value": 999`/`9999` (a poisoned
"no-max" sentinel) in `40-data/policy_facts/<insurer>__<product>__<doctype>.json`
with a **real value backed by a verbatim quote from the source PDF**, or an
honest, sourced "not stated".
## Path mapping (deterministic)
`40-data/policy_facts/<insurer>__<product>__<doctype>.json`
→ PDF: `rag/corpus/<insurer>/<product>__<doctype>.pdf`
Also check the file's own `source_pdf_path` and `max_renewal_age.source_pdf_path`.
If no PDF exists locally, see "Missing PDF" below β€” do NOT guess.
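
The mapping above can be sketched as a small helper. This is a minimal illustration, not part of the contract: the function name `pdf_path_for` is hypothetical, and it assumes the insurer slug itself never contains `__`, so the first `__` in the filename separates the insurer from the rest.

```python
from pathlib import PurePosixPath


def pdf_path_for(facts_path: str) -> str:
    """Map a policy_facts JSON path to its corpus PDF path (hypothetical helper)."""
    # Stem is "<insurer>__<product>__<doctype>" once the .json suffix is dropped.
    stem = PurePosixPath(facts_path).stem
    # Assumption: the insurer slug contains no "__", so split only once.
    # A filename without "__" raises ValueError here instead of guessing.
    insurer, rest = stem.split("__", 1)
    return f"rag/corpus/{insurer}/{rest}.pdf"
```

The `source_pdf_path` values stored in the file still need to be checked separately, since they may point somewhere other than the deterministic path.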
## Per-cell rule (apply to EVERY field dict that has a `value` key)
Only touch a cell if its current `value` is `null`, `999`, or `9999`.
Leave already-good values untouched (do not "improve" them).
For each such cell, read the PDF and:
1. **Value found explicitly** → set:
   - `"value"`: the real value (int/number/bool/string/list per existing `unit`)
   - `"source_quote"`: the **verbatim** sentence/clause from the PDF that states
     it (≤ 300 chars, copy exactly, do not paraphrase)
   - `"source_pdf_path"`: the correct `rag/corpus/...pdf` path
   - `"_confidence"`: `"high"`
2. **Value derivable by direct reading** (e.g. table cell, "Annexure A lists 586
   day-care procedures") → same as above with `"_confidence": "medium"` and a
   `source_quote` that contains the basis.
3. **Genuinely absent from this document** → set:
   - `"value"`: `null`
   - `"source_quote"`: `"not stated in <filename>.pdf"`
   - `"_confidence"`: `"low"`
NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number.
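
As a rough sketch of the three outcomes above, the helpers below decide whether a cell may be touched and then write the fields the rule names. The function names (`needs_fill`, `fill_found`, `fill_absent`) and the `derived` flag are hypothetical, introduced only for illustration; reading the PDF and choosing the quote remain the agent's job.

```python
from pathlib import PurePosixPath

# Values the contract treats as unfilled: null plus the two "no-max" sentinels.
SENTINELS = (999, 9999)


def needs_fill(cell: dict) -> bool:
    """True only for a field dict whose current value is null, 999, or 9999."""
    return "value" in cell and (cell["value"] is None or cell["value"] in SENTINELS)


def fill_found(cell: dict, value, quote: str, pdf_path: str, derived: bool = False) -> None:
    """Outcomes 1 and 2: a real value backed by a verbatim quote."""
    if not quote or len(quote) > 300:
        raise ValueError("source_quote must be non-empty and at most 300 chars")
    cell["value"] = value
    cell["source_quote"] = quote
    cell["source_pdf_path"] = pdf_path
    # "medium" when the value was derived by direct reading, otherwise "high".
    cell["_confidence"] = "medium" if derived else "high"


def fill_absent(cell: dict, pdf_path: str) -> None:
    """Outcome 3: an honest, sourced null."""
    cell["value"] = None
    cell["source_quote"] = f"not stated in {PurePosixPath(pdf_path).name}"
    cell["_confidence"] = "low"
```

Because `needs_fill` is checked before anything is written, already-good values are never overwritten.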
## `max_renewal_age`: DO NOT FILL (field removed from scoring)
SKIP this field entirely. Do not read PDFs for it, do not set it, do not add a
`lifelong_renewal` key. Lifelong renewability is mandated by IRDAI for every
health-indemnity product (since 2020); it is universal, so it does not
differentiate policies and has been removed from the scoring model. Leave any
existing `max_renewal_age` cell exactly as-is (even if it is `999`/`null`);
it is dead data that will be stripped in the final cleanup pass. Spend zero
time on it. Focus on every OTHER null cell.
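
One way to make the skip mechanical is to exclude the key while walking the file. The generator below is a sketch under an assumption: the contract does not describe how cells are nested, so it walks dicts and lists generically. The name `iter_cells` is hypothetical.

```python
def iter_cells(node, path=()):
    """Yield (path, cell) for every dict with a "value" key, skipping max_renewal_age."""
    if isinstance(node, dict):
        if "value" in node:
            yield path, node
        for key, child in node.items():
            if key == "max_renewal_age":
                continue  # dead data: never read, never modified
            yield from iter_cells(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_cells(child, path + (index,))
```

Since the subtree under `max_renewal_age` is never visited, any `999` or `null` stored there stays exactly as it was.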
## Boolean fields (`*_coverage`, `*_supported`, `ambulance_cover`, etc.)
`value` must be `true`/`false` (JSON booleans) with a verbatim quote. "Covered
under Section X" → true; an exclusions-list mention → false. Unclear → `null`
with `"_confidence": "low"`.
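
A small check along these lines could catch string or integer stand-ins for booleans. It is only a sketch: `is_boolean_field` and `boolean_cell_problems` are hypothetical names, and the pattern tuple covers just the three patterns named in the heading, not whatever else "etc." includes.

```python
from fnmatch import fnmatchcase

# Only the patterns named in the heading; the "etc." is deliberately not guessed.
BOOLEAN_PATTERNS = ("*_coverage", "*_supported", "ambulance_cover")


def is_boolean_field(name: str) -> bool:
    """True if the field name matches one of the listed boolean patterns."""
    return any(fnmatchcase(name, pattern) for pattern in BOOLEAN_PATTERNS)


def boolean_cell_problems(cell: dict) -> list:
    """Return human-readable problems for a boolean cell (empty list means OK)."""
    problems = []
    value = cell.get("value")
    if value is None:
        if cell.get("_confidence") != "low":
            problems.append("null boolean must have _confidence 'low'")
    elif not isinstance(value, bool):
        problems.append(f"value must be a JSON boolean, got {type(value).__name__}")
    elif not cell.get("source_quote"):
        problems.append("boolean value needs a verbatim source_quote")
    return problems
```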
## Hard constraints
- Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code.
- Output must be valid JSON β€” load with `json.load`, write with
`json.dump(d, f, ensure_ascii=False, indent=2)`; preserve key order.
- Every non-null filled cell MUST have a non-empty `source_quote` that actually
occurs in the PDF text. This is verified independently afterward; fabricated
quotes will be caught and bounced.
- No homepage/search URLs anywhere. English-only quotes.
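
The JSON round-trip and the quote check might look like the sketch below. Two things here are assumptions rather than contract: `pdf_text` is taken to be text already extracted from the PDF by whatever tool the agent uses, and whitespace is collapsed before matching because extracted PDF text often wraps lines differently from the source. The independent verifier may apply stricter rules. All function names are hypothetical.

```python
import json
import re


def load_facts(path: str) -> dict:
    """Load a policy_facts file; Python dicts keep insertion order, so key order survives."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_facts(path: str, data: dict) -> None:
    """Write back with the exact json.dump arguments the contract requires."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def _squash(text: str) -> str:
    """Collapse runs of whitespace to single spaces (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip()


def quote_occurs(quote: str, pdf_text: str) -> bool:
    """True if the non-empty quote appears in the PDF text, ignoring whitespace layout."""
    needle = _squash(quote)
    return bool(needle) and needle in _squash(pdf_text)
```

Running `quote_occurs` on every filled cell before saving is a cheap way to avoid a bounce from the later verification pass.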
## Missing PDF
If the mapped PDF does not exist locally: do not fill from memory. Record the
file in your summary under `missing_pdf` and move on (a separate task is
sourcing those PDFs).
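
A pre-flight check in this spirit could separate files that can be processed from those that must be reported. This is a sketch with a hypothetical name (`split_by_pdf_presence`); it assumes the caller supplies `(facts_path, pdf_path)` pairs and that the PDF path is what goes into `missing_pdf`, which the contract leaves ambiguous.

```python
from pathlib import Path


def split_by_pdf_presence(pairs, root="."):
    """Split (facts_path, pdf_path) pairs into processable pairs and missing PDF paths."""
    ready, missing_pdf = [], []
    for facts_path, pdf_path in pairs:
        if (Path(root) / pdf_path).is_file():
            ready.append((facts_path, pdf_path))
        else:
            # Assumption: the mapped PDF path is what gets reported.
            missing_pdf.append(pdf_path)
    return ready, missing_pdf
```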
## Agent output (return exactly this JSON)
```json
{"insurers": ["..."], "files_processed": N, "cells_filled": N,
"cells_left_null_sourced": N, "lifelong_flagged": N,
"missing_pdf": ["path", ...], "anomalies": ["short notes"]}
```
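
To keep the summary to exactly these keys, an agent could build it through a helper such as the hypothetical `build_summary` below. The contract does not define how `lifelong_flagged` is counted, so the sketch simply passes through whatever the caller supplies.

```python
import json

# Keys and order taken from the output template above.
SUMMARY_KEYS = (
    "insurers", "files_processed", "cells_filled",
    "cells_left_null_sourced", "lifelong_flagged",
    "missing_pdf", "anomalies",
)


def build_summary(**fields) -> str:
    """Serialize the agent summary, rejecting missing or unexpected keys."""
    missing = [key for key in SUMMARY_KEYS if key not in fields]
    extra = [key for key in fields if key not in SUMMARY_KEYS]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}; unexpected keys: {extra}")
    return json.dumps({key: fields[key] for key in SUMMARY_KEYS}, ensure_ascii=False)
```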