Policy-Facts Extraction Contract (v1, 2026-05-16)
Single source of truth for every fill/verify agent. Deviating from this is a defect.
Goal
Replace every "value": null and every "value": 999/9999 (a poisoned
"no-max" sentinel) in 40-data/policy_facts/<insurer>__<product>__<doctype>.json
with a real value backed by a verbatim quote from the source PDF, or an
honest, sourced "not stated".
Path mapping (deterministic)
40-data/policy_facts/<insurer>__<product>__<doctype>.json
→ PDF: rag/corpus/<insurer>/<product>__<doctype>.pdf
Also check the file's own source_pdf_path and max_renewal_age.source_pdf_path.
If no PDF exists locally, see "Missing PDF" below; do NOT guess.
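The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function names are hypothetical, the insurer slug is assumed to contain no double underscore, and the order in which the fallback paths are tried is an assumption.

```python
from pathlib import Path


def mapped_pdf_path(facts_path):
    """Map <insurer>__<product>__<doctype>.json to rag/corpus/<insurer>/<product>__<doctype>.pdf."""
    stem = Path(facts_path).stem
    # Split on the FIRST "__" only: the remainder keeps "<product>__<doctype>".
    insurer, sep, rest = stem.partition("__")
    if not sep or not insurer or not rest:
        raise ValueError(f"unexpected policy_facts filename: {facts_path}")
    return Path("rag/corpus") / insurer / f"{rest}.pdf"


def candidate_pdf_paths(facts_path, doc):
    """Deterministic path first, then the paths recorded inside the file itself."""
    candidates = [mapped_pdf_path(facts_path)]
    renewal = doc.get("max_renewal_age")
    nested = renewal.get("source_pdf_path") if isinstance(renewal, dict) else None
    for extra in (doc.get("source_pdf_path"), nested):
        if isinstance(extra, str) and extra.strip():
            path = Path(extra)
            if path not in candidates:
                candidates.append(path)
    return candidates


def locate_pdf(facts_path, doc, root="."):
    """Return the first candidate that exists under root, or None (never guess)."""
    for candidate in candidate_pdf_paths(facts_path, doc):
        if (Path(root) / candidate).is_file():
            return candidate
    return None
```

A None result is the signal to record the file under missing_pdf rather than fill anything.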
Per-cell rule (apply to EVERY field dict that has a value key)
Only touch a cell if its current value is null, 999, or 9999.
Leave already-good values untouched (do not "improve" them).
For each such cell, read the PDF and:
- Value found explicitly → set:
  - "value": the real value (int/number/bool/string/list per existing unit)
  - "source_quote": the verbatim sentence/clause from the PDF that states it (≤ 300 chars, copy exactly, do not paraphrase)
  - "source_pdf_path": the correct rag/corpus/...pdf path
  - "_confidence": "high"
- Value derivable by direct reading (e.g. table cell, "Annexure A lists 586 day-care procedures") → same as above with "_confidence": "medium" and a source_quote that contains the basis.
- Genuinely absent from this document → set:
  - "value": null
  - "source_quote": "not stated in <filename>.pdf"
  - "_confidence": "low"
NEVER invent a number. NEVER use 999/9999. A sourced null beats a fake number.
max_renewal_age: DO NOT FILL (field removed from scoring)
SKIP this field entirely. Do not read PDFs for it, do not set it, do not add a
lifelong_renewal key. Lifelong renewability is mandated by IRDAI for every
health-indemnity product (since 2020); it is universal, so it does not
differentiate policies and has been removed from the scoring model. Leave any
existing max_renewal_age cell exactly as-is (even if it is 999/null);
it is dead data that will be stripped in the final cleanup pass. Spend zero
time on it. Focus on every OTHER null cell.
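The per-cell rule, combined with the max_renewal_age exclusion, might look like the sketch below. All helper names are hypothetical, the field names in the usage note are invented for illustration, and reading <filename> as the PDF's base name is an assumption.

```python
from pathlib import Path

SENTINELS = (999, 9999)  # the poisoned "no-max" values named in this contract
MAX_QUOTE_CHARS = 300


def needs_fill(cell):
    """True only when the current value is null, 999, or 9999."""
    value = cell.get("value")
    if value is None:
        return True
    if isinstance(value, bool):  # bool is an int subclass; never a sentinel
        return False
    return isinstance(value, (int, float)) and value in SENTINELS


def iter_fillable_cells(node, path=()):
    """Yield (path, cell) for every dict with a "value" key that needs filling."""
    if isinstance(node, dict):
        if "value" in node and needs_fill(node):
            yield path, node
        for key, child in node.items():
            if key == "max_renewal_age":  # dead data: leave exactly as-is
                continue
            yield from iter_fillable_cells(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_fillable_cells(child, path + (index,))


def mark_found(cell, value, quote, pdf_path, confidence="high"):
    """Explicit value ("high") or value read directly off a table ("medium")."""
    if confidence not in ("high", "medium"):
        raise ValueError("found values are 'high' or 'medium' confidence")
    if needs_fill({"value": value}):
        raise ValueError("a found value is never null, 999, or 9999")
    if not quote.strip() or len(quote) > MAX_QUOTE_CHARS:
        raise ValueError("source_quote must be non-empty and at most 300 chars")
    cell["value"] = value
    cell["source_quote"] = quote
    cell["source_pdf_path"] = pdf_path
    cell["_confidence"] = confidence


def mark_not_stated(cell, pdf_path):
    """Genuinely absent from the document: a sourced null."""
    cell["value"] = None
    cell["source_quote"] = f"not stated in {Path(pdf_path).name}"
    cell["_confidence"] = "low"
```

For a document such as {"room_rent_cap": {"value": None}, "max_renewal_age": {"value": 999}, "copay": {"value": 10}}, only room_rent_cap is yielded: copay is already good, and max_renewal_age is skipped.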
Boolean fields (*_coverage, *_supported, ambulance_cover, etc.)
value must be true/false (JSON booleans) with a verbatim quote. "Covered
under Section X" → true; an exclusions-list mention → false. Unclear → null + low.
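A small self-check for boolean cells could look like this. It is a sketch with a hypothetical function name. It inspects only the cell's shape and deliberately does not try to decide which field names count as boolean.

```python
def boolean_cell_problems(cell):
    """Return a list of rule violations for one boolean cell (empty list = OK)."""
    problems = []
    value = cell.get("value")
    if value is None:
        # Unclear cases stay null and must be marked low confidence.
        if cell.get("_confidence") != "low":
            problems.append("null boolean must carry _confidence 'low'")
    elif isinstance(value, bool):
        if not str(cell.get("source_quote") or "").strip():
            problems.append("true/false needs a verbatim source_quote")
    else:
        # Catches "yes", "covered", 1, 0 and similar non-boolean stand-ins.
        problems.append(f"value must be a JSON boolean or null, got {value!r}")
    return problems
```

Strings such as "yes" and integers such as 1 are reported as violations, because only JSON true/false (Python True/False) or null are accepted.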
Hard constraints
- Edit ONLY policy_facts files for your assigned insurer slugs. Touch no code.
- Output must be valid JSON: load with json.load, write with json.dump(d, f, ensure_ascii=False, indent=2); preserve key order.
- Every non-null filled cell MUST have a non-empty source_quote that actually occurs in the PDF text. This is verified independently afterward; fabricated quotes will be caught and bounced.
- No homepage/search URLs anywhere. English-only quotes.
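The JSON and quote constraints can be pre-checked before handing work back. The sketch below uses hypothetical helper names and assumes the PDF text has already been extracted to a string by some other tool. Collapsing whitespace before matching is also an assumption, since the independent verifier may be stricter.

```python
import json


def load_facts(path):
    """json.load returns dicts in file order, so key order is preserved."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_facts(path, doc):
    """Write with exactly the json.dump arguments the contract names."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False, indent=2)


def _squash(text):
    """Collapse runs of whitespace (PDF extraction often breaks lines mid-sentence)."""
    return " ".join(text.split())


def quote_occurs(quote, pdf_text):
    """True when the non-empty quote appears in the PDF text, ignoring whitespace layout."""
    needle = _squash(quote)
    return bool(needle) and needle in _squash(pdf_text)


def unverified_quotes(node, pdf_text, path=()):
    """Yield paths of non-null cells whose source_quote is missing or not in the PDF."""
    if isinstance(node, dict):
        if node.get("value") is not None and "value" in node:
            quote = node.get("source_quote")
            if not isinstance(quote, str) or not quote_occurs(quote, pdf_text):
                yield path
        for key, child in node.items():
            if key == "max_renewal_age":  # left as-is under this contract
                continue
            yield from unverified_quotes(child, pdf_text, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from unverified_quotes(child, pdf_text, path + (index,))
```

An empty result from unverified_quotes is necessary but not sufficient: it shows the quote text is present, not that the quote supports the value.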
Missing PDF
If the mapped PDF does not exist locally: do not fill from memory. Record the
file in your summary under missing_pdf and move on (a separate task is
sourcing those PDFs).
Agent output (return exactly this JSON)
{"insurers": ["..."], "files_processed": N, "cells_filled": N,
"cells_left_null_sourced": N, "lifelong_flagged": N,
"missing_pdf": ["path", ...], "anomalies": ["short notes"]}