| # Policy-Facts Extraction Contract (v1, 2026-05-16) | |
| Single source of truth for every fill/verify agent. Deviating from this is a defect. | |
| ## Goal | |
| Replace every `"value": null` and every `"value": 999`/`9999` (a poisoned | |
| "no-max" sentinel) in `40-data/policy_facts/<insurer>__<product>__<doctype>.json` | |
| with a **real value backed by a verbatim quote from the source PDF**, or an | |
| honest, sourced "not stated". | |
| ## Path mapping (deterministic) | |
| `40-data/policy_facts/<insurer>__<product>__<doctype>.json` | |
| → PDF: `rag/corpus/<insurer>/<product>__<doctype>.pdf` | |
| Also check the file's own `source_pdf_path` and `max_renewal_age.source_pdf_path`. | |
| If no PDF exists locally, see "Missing PDF" below; do NOT guess. | |
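The mapping above can be sketched as a small helper. This is a minimal illustration, not part of the contract: the function name `pdf_path_for` is hypothetical, and it assumes the filename stem contains exactly two `__` separators (that is, none of the three slugs contains `__` itself).

```python
from pathlib import Path


def pdf_path_for(facts_path: str, corpus_root: str = "rag/corpus") -> Path:
    """Map <insurer>__<product>__<doctype>.json to its corpus PDF path.

    Assumption: the stem splits into exactly three parts on "__".
    """
    stem = Path(facts_path).stem
    parts = stem.split("__")
    if len(parts) != 3:
        # Refuse to guess when the filename does not follow the pattern.
        raise ValueError(f"unexpected policy_facts filename: {facts_path}")
    insurer, product, doctype = parts
    return Path(corpus_root) / insurer / f"{product}__{doctype}.pdf"
```

The result is only a candidate path. The file's own `source_pdf_path` still needs to be checked, and a path that does not exist locally falls under "Missing PDF".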
| ## Per-cell rule (apply to EVERY field dict that has a `value` key) | |
| Only touch a cell if its current `value` is `null`, `999`, or `9999`. | |
| Leave already-good values untouched (do not "improve" them). | |
| For each such cell, read the PDF and: | |
| 1. **Value found explicitly** → set: | |
| - `"value"`: the real value (int/number/bool/string/list per existing `unit`) | |
| - `"source_quote"`: the **verbatim** sentence/clause from the PDF that states | |
| it (≤ 300 chars, copy exactly, do not paraphrase) | |
| - `"source_pdf_path"`: the correct `rag/corpus/...pdf` path | |
| - `"_confidence"`: `"high"` | |
| 2. **Value derivable by direct reading** (e.g. table cell, "Annexure A lists 586 | |
| day-care procedures") → same as above with `"_confidence": "medium"` and a | |
| `source_quote` that contains the basis. | |
| 3. **Genuinely absent from this document** → set: | |
| - `"value"`: `null` | |
| - `"source_quote"`: `"not stated in <filename>.pdf"` | |
| - `"_confidence"`: `"low"` | |
| NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number. | |
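As a rough sketch of the per-cell rule, the helpers below find the cells that qualify for a fill and apply the "genuinely absent" case. The names `needs_fill`, `iter_fillable`, and `mark_not_stated` are hypothetical. The sketch also assumes that cells may be nested at any depth in the JSON and that `max_renewal_age` is skipped as described in its own section.

```python
SENTINELS = (999, 9999)            # poisoned "no-max" placeholders
SKIP_FIELDS = {"max_renewal_age"}  # assumption: skipped per its own section


def needs_fill(cell: dict) -> bool:
    """True if the cell's current value is null, 999, or 9999."""
    v = cell.get("value")
    if v is None:
        return True
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool) and v in SENTINELS


def iter_fillable(node, path=()):
    """Yield (path, cell) for every dict with a `value` key that needs a fill."""
    if isinstance(node, dict):
        if "value" in node and needs_fill(node):
            yield path, node
        for key, child in node.items():
            if key in SKIP_FIELDS:
                continue
            yield from iter_fillable(child, path + (key,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from iter_fillable(child, path + (i,))


def mark_not_stated(cell: dict, pdf_filename: str) -> None:
    """Case 3: the value is genuinely absent from the document."""
    cell["value"] = None
    cell["source_quote"] = f"not stated in {pdf_filename}"
    cell["_confidence"] = "low"
```

Cells with already-good values never appear in `iter_fillable`, so they cannot be "improved" by accident.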
| ## `max_renewal_age`: DO NOT FILL (field removed from scoring) | |
| SKIP this field entirely. Do not read PDFs for it, do not set it, do not add a | |
| `lifelong_renewal` key. Lifelong renewability is mandated by IRDAI for every | |
| health-indemnity product (since 2020); it is universal, so it does not | |
| differentiate policies and has been removed from the scoring model. Leave any | |
| existing `max_renewal_age` cell exactly as-is (even if it is `999`/`null`); | |
| it is dead data that will be stripped in the final cleanup pass. Spend zero | |
| time on it. Focus on every OTHER null cell. | |
| ## Boolean fields (`*_coverage`, `*_supported`, `ambulance_cover`, etc.) | |
| `value` must be `true`/`false` (JSON booleans) with a verbatim quote. "Covered | |
| under Section X" → true; an exclusions-list mention → false. Unclear → null+low. | |
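A minimal self-check for boolean cells might look like the following. The function name `is_valid_boolean_cell` is hypothetical, and this is only one reading of the rule above: a real JSON boolean needs a non-empty quote, and a null is acceptable only with `"_confidence": "low"`.

```python
def is_valid_boolean_cell(cell: dict) -> bool:
    """Check a boolean field cell against the rule sketched above."""
    v = cell.get("value")
    quote = cell.get("source_quote")
    has_quote = isinstance(quote, str) and quote.strip() != ""
    if v is None:
        # Unclear case: null is allowed only when flagged low-confidence.
        return cell.get("_confidence") == "low"
    # Strings like "true" and ints like 1 are rejected; only real bools pass.
    return isinstance(v, bool) and has_quote
```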
| ## Hard constraints | |
| - Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code. | |
| - Output must be valid JSON: load with `json.load`, write with | |
| `json.dump(d, f, ensure_ascii=False, indent=2)`; preserve key order. | |
| - Every non-null filled cell MUST have a non-empty `source_quote` that actually | |
| occurs in the PDF text. This is verified independently afterward; fabricated | |
| quotes will be caught and bounced. | |
| - No homepage/search URLs anywhere. English-only quotes. | |
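The JSON round-trip and a quote pre-check could be sketched as follows. The helper names are hypothetical. The whitespace-normalised substring match is an assumption about what "occurs in the PDF text" means; the independent verifier may match more strictly. Extracting `pdf_text` from the PDF is out of scope here.

```python
import json
import re


def load_facts(path):
    with open(path, encoding="utf-8") as f:
        # Python dicts keep insertion order, so key order survives the round-trip.
        return json.load(f)


def save_facts(doc, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False, indent=2)


def _normalise(text: str) -> str:
    """Collapse runs of whitespace (PDF extraction often breaks lines mid-sentence)."""
    return re.sub(r"\s+", " ", text).strip()


def quote_occurs(quote: str, pdf_text: str) -> bool:
    """Assumed check: the non-empty quote appears in the PDF text, ignoring whitespace runs."""
    q = _normalise(quote)
    return bool(q) and q in _normalise(pdf_text)
```

Running `quote_occurs` on each filled cell before saving is a cheap way to catch a paraphrased quote before the independent verification does.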
| ## Missing PDF | |
| If the mapped PDF does not exist locally: do not fill from memory. Record the | |
| file in your summary under `missing_pdf` and move on (a separate task is | |
| sourcing those PDFs). | |
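One possible way to separate fillable files from missing-PDF cases before any reading starts is sketched below. The function name `split_by_pdf_presence` is hypothetical, and recording the PDF path (rather than the facts file path) under `missing_pdf` is an assumption based on the `"path"` placeholder in the output schema.

```python
from pathlib import Path


def split_by_pdf_presence(pairs):
    """Split (facts_path, pdf_path) pairs into fillable files and missing PDFs."""
    ready, missing_pdf = [], []
    for facts_path, pdf_path in pairs:
        if Path(pdf_path).is_file():
            ready.append(facts_path)
        else:
            # Do not fill from memory; just record the path and move on.
            missing_pdf.append(str(pdf_path))
    return ready, missing_pdf
```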
| ## Agent output (return exactly this JSON) | |
| ```json | |
| {"insurers": ["..."], "files_processed": N, "cells_filled": N, | |
| "cells_left_null_sourced": N, "lifelong_flagged": N, | |
| "missing_pdf": ["path", ...], "anomalies": ["short notes"]} | |
| ``` | |