Re-Sourcing Contract v2 (2026-05-16): verbatim provenance pass
Binding contract for the legacy-provenance re-extraction fleet. The earlier
pass filled NULL cells. THIS pass fixes cells that HAVE a value but whose
source_quote is a non-verbatim self-reference ("extracted from PDF data",
"NIM DeepSeek", "regex extracted from PDF text", "rag/extracted structured
JSON", etc.). The two-part verify flagged 2,904 such cells. Goal: every value
on the site traces to a real verbatim source, with zero exceptions.
Which cells to FIX (in your assigned insurers' policy_facts files)
A cell qualifies if ALL of:
- it has a non-null value (not null/""/[]; ignore 999/9999, which are dead),
- it is NOT max_renewal_age (field removed; skip entirely),
- it has NO source_url (url-sourced cells, e.g. day_care/network, are fine; skip),
- its source_quote is empty OR matches a provenance/pipeline note:
  "extracted from PDF", "from extracted PDF data", "NIM DeepSeek",
  "regex extracted from PDF", "rag/extracted", "structured JSON/field",
  "Gx batch extract", "prior pipeline", "see source PDF for verbatim".

Do NOT touch cells whose source_quote is already a real verbatim clause, nor
"not stated …" sourced-nulls (legitimately empty).
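The qualification test above can be sketched as a predicate. This is a minimal sketch, not the fleet's actual tooling: it assumes each cell is a dict with value, source_quote and optional source_url keys, that notes are matched as case-insensitive substrings, and that "structured JSON/field" means either "structured JSON" or "structured field". The name qualifies and the PROVENANCE_NOTES tuple are hypothetical.

```python
# Provenance/pipeline notes listed in the contract, lower-cased.
# ASSUMPTION: case-insensitive substring matching; the contract does not
# specify matching semantics. "Gx" is kept literally as written.
PROVENANCE_NOTES = (
    "extracted from pdf",
    "from extracted pdf data",
    "nim deepseek",
    "regex extracted from pdf",
    "rag/extracted",
    "structured json",
    "structured field",
    "gx batch extract",
    "prior pipeline",
    "see source pdf for verbatim",
)


def qualifies(field_name, cell):
    """Return True if this cell should be re-sourced under the contract."""
    if field_name == "max_renewal_age":      # field removed: skip entirely
        return False
    value = cell.get("value")
    if value in (None, "", []):              # null cells are out of scope here
        return False
    if value in (999, 9999):                 # dead sentinel values: ignore
        return False
    if cell.get("source_url"):               # url-sourced cells are fine
        return False
    quote = (cell.get("source_quote") or "").strip().lower()
    if not quote:                            # empty quote always qualifies
        return True
    return any(note in quote for note in PROVENANCE_NOTES)
```

A cell whose source_quote is already a real clause falls through to the final line and returns False, so it is left untouched.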
For each qualifying cell, open the source_pdf_path PDF and:
- Verbatim clause supports the existing value → set source_quote to that
  exact clause (≤300 chars, copied verbatim), keep value, set _confidence
  high (explicit) or medium (table/derived), keep source_pdf_path.
- PDF states a DIFFERENT value → correct value to what the PDF says, with the
  verbatim clause as source_quote. Never keep a value the source contradicts.
  Never keep the old provenance note.
- Field genuinely absent from the PDF → value: null,
  source_quote: "not stated in <file>.pdf", _confidence: "low".
- Source PDF is image-only / not text-extractable (fitz/pdftotext yields
  < ~400 chars of text, e.g. a scanned brochure) AND no text-bearing
  sibling document exists for that policy → DROP the cell: value: null,
  source_quote: "source document is an image-only scan; not text-extractable (no OCR available)",
  _confidence: "low". (Per owner instruction: drop, do not fabricate, do not
  OCR.) If a text-bearing sibling doc for the SAME policy exists in
  rag/corpus/<insurer>/, you may source from it and update source_pdf_path
  accordingly.
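The four outcomes above can be sketched as a single function. This is an illustration under stated assumptions, not the fleet's implementation: the reviewer has already extracted the PDF text (e.g. with fitz or pdftotext) and located the clause by hand, and passes both in. The names resolve_cell, is_image_only, clause, pdf_value and explicit are hypothetical, and the sibling-document fallback is deliberately left out.

```python
MIN_TEXT_CHARS = 400   # contract threshold is approximate ("< ~400 chars")
MAX_QUOTE_CHARS = 300  # quote length limit from the contract
IMAGE_ONLY_NOTE = ("source document is an image-only scan; "
                   "not text-extractable (no OCR available)")


def is_image_only(extracted_text):
    """True when text extraction yielded too little text to source from."""
    return len(extracted_text.strip()) < MIN_TEXT_CHARS


def resolve_cell(cell, pdf_file, extracted_text,
                 clause=None, pdf_value=None, explicit=True):
    """Return an updated copy of `cell` for one of the four outcomes.

    clause    -- verbatim clause found in the PDF, or None if the field is absent
    pdf_value -- the value that clause states (same as the old value if supported)
    explicit  -- True for an explicit statement, False for table/derived
    """
    out = dict(cell)  # copy; existing keys keep their original order
    if is_image_only(extracted_text):
        # Drop: do not fabricate, do not OCR.
        out.update(value=None, source_quote=IMAGE_ONLY_NOTE, _confidence="low")
    elif clause is None:
        out.update(value=None,
                   source_quote=f"not stated in {pdf_file}",
                   _confidence="low")
    else:
        if len(clause) > MAX_QUOTE_CHARS:
            raise ValueError("source_quote exceeds 300 characters")
        # Covers both "supported" and "corrected": the PDF's value always wins.
        out.update(value=pdf_value, source_quote=clause,
                   _confidence="high" if explicit else "medium")
    return out
```

Because the PDF's value is always written, a contradicted value can never survive, and the old provenance note is overwritten in every branch.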
Hard rules
- NEVER keep "extracted from PDF data"/NIM/regex/etc. as a source_quote.
- NEVER fabricate a quote. NEVER invent a number. NEVER use 999/9999.
- Edit ONLY your assigned insurers' 40-data/policy_facts/*.json. No code.
- Valid JSON via json.dump(d, f, ensure_ascii=False, indent=2), preserve key order.
- Every quote you write MUST be greppable in the PDF text (whitespace-
  normalised); an independent adversarial re-audit will re-open the PDFs.
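The last two rules can be checked mechanically before a file is saved. The sketch below assumes "whitespace-normalised" means collapsing every run of whitespace to a single space on both sides of the comparison; the helper names normalise_ws, quote_is_greppable, load_facts and write_facts are hypothetical.

```python
import json
import re


def normalise_ws(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_greppable(quote, pdf_text):
    """True if the quote occurs verbatim in the PDF text, ignoring whitespace."""
    needle = normalise_ws(quote)
    return bool(needle) and needle in normalise_ws(pdf_text)


def load_facts(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # dicts keep the file's key order (Python 3.7+)


def write_facts(path, data):
    # Same call as the contract specifies; no sort_keys, so order is preserved.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```

Running quote_is_greppable on every new source_quote before write_facts mirrors what the re-audit is described as doing, so failures surface before the file is saved.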
Output (return exactly)
{"insurers":["..."],"files_processed":N,"cells_reverbatim":N,
"values_corrected":N,"cells_nulled_absent":N,"cells_dropped_imageonly":N,
"image_only_pdfs":["path"],"anomalies":["..."]}
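Since the report must be returned exactly in this shape, one way to guard against missing or stray keys is to build it through a small helper. This is only a sketch; build_report and REPORT_KEYS are hypothetical names, and the key list is copied from the template above.

```python
import json

# Keys of the output object, in the order given by the contract.
REPORT_KEYS = (
    "insurers", "files_processed", "cells_reverbatim", "values_corrected",
    "cells_nulled_absent", "cells_dropped_imageonly", "image_only_pdfs",
    "anomalies",
)


def build_report(**fields):
    """Serialise the run report, rejecting missing or unexpected keys."""
    missing = [k for k in REPORT_KEYS if k not in fields]
    extra = [k for k in fields if k not in REPORT_KEYS]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}; unexpected keys: {extra}")
    # Emit keys in contract order regardless of argument order.
    return json.dumps({k: fields[k] for k in REPORT_KEYS}, ensure_ascii=False)
```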