# Policy-Facts Extraction Contract (v1 — 2026-05-16) Single source of truth for every fill/verify agent. Deviating from this is a defect. ## Goal Replace every `"value": null` and every `"value": 999`/`9999` (a poisoned "no-max" sentinel) in `40-data/policy_facts/____.json` with a **real value backed by a verbatim quote from the source PDF** — or an honest, sourced "not stated". ## Path mapping (deterministic) `40-data/policy_facts/____.json` → PDF: `rag/corpus//__.pdf` Also check the file's own `source_pdf_path` and `max_renewal_age.source_pdf_path`. If no PDF exists locally, see "Missing PDF" below — do NOT guess. ## Per-cell rule (apply to EVERY field dict that has a `value` key) Only touch a cell if its current `value` is `null`, `999`, or `9999`. Leave already-good values untouched (do not "improve" them). For each such cell, read the PDF and: 1. **Value found explicitly** → set: - `"value"`: the real value (int/number/bool/string/list per existing `unit`) - `"source_quote"`: the **verbatim** sentence/clause from the PDF that states it (≤ 300 chars, copy exactly, do not paraphrase) - `"source_pdf_path"`: the correct `rag/corpus/...pdf` path - `"_confidence"`: `"high"` 2. **Value derivable by direct reading** (e.g. table cell, "Annexure A lists 586 day-care procedures") → same as above with `"_confidence": "medium"` and a `source_quote` that contains the basis. 3. **Genuinely absent from this document** → set: - `"value"`: `null` - `"source_quote"`: `"not stated in .pdf"` - `"_confidence"`: `"low"` NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number. ## `max_renewal_age` — DO NOT FILL (field removed from scoring) SKIP this field entirely. Do not read PDFs for it, do not set it, do not add a `lifelong_renewal` key. Lifelong renewability is mandated by IRDAI for every health-indemnity product (since 2020) — it is universal, so it does not differentiate policies and has been removed from the scoring model. Leave any existing `max_renewal_age` cell exactly as-is (even if it is `999`/`null`); it is dead data that will be stripped in the final cleanup pass. Spend zero time on it. Focus on every OTHER null cell. ## Boolean fields (`*_coverage`, `*_supported`, `ambulance_cover`, etc.) `value` must be `true`/`false` (JSON booleans) with a verbatim quote. "Covered under Section X" → true; an exclusions-list mention → false. Unclear → null+low. ## Hard constraints - Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code. - Output must be valid JSON — load with `json.load`, write with `json.dump(d, f, ensure_ascii=False, indent=2)`; preserve key order. - Every non-null filled cell MUST have a non-empty `source_quote` that actually occurs in the PDF text. This is verified independently afterward — fabricated quotes will be caught and bounced. - No homepage/search URLs anywhere. English-only quotes. ## Missing PDF If the mapped PDF does not exist locally: do not fill from memory. Record the file in your summary under `missing_pdf` and move on (a separate task is sourcing those PDFs). ## Agent output (return exactly this JSON) ```json {"insurers": ["..."], "files_processed": N, "cells_filled": N, "cells_left_null_sourced": N, "lifelong_flagged": N, "missing_pdf": ["path", ...], "anomalies": ["short notes"]} ```