Commit · be54038
Parent(s): 78046e4
Deploy PolicyTrace Hugging Face Space
- config/prompts.yaml +249 -0
- config/settings.yaml +73 -0
- docs/architecture.md +139 -0
- docs/hugging-face.md +70 -0
- requirements-dev.txt +2 -0
- requirements.txt +27 -0
- sample_data/README.md +25 -0
- sample_data/policytrace_demo_pack/manifest.json +13 -0
- scripts/generate_synthetic_policy_pack.py +451 -0
- src/agents.py +530 -0
- src/api.py +372 -0
- src/arbiter.py +268 -0
- src/main.py +223 -0
- src/pipeline.py +131 -0
- src/privacy.py +186 -0
- src/prompts.py +149 -0
- src/provenance.py +424 -0
- src/schema.py +205 -0
- src/settings.py +142 -0
- tests/__init__.py +0 -0
- tests/test_arbiter.py +303 -0
- ui/index.html +13 -0
- ui/package-lock.json +0 -0
- ui/package.json +29 -0
- ui/postcss.config.js +6 -0
- ui/src/App.tsx +16 -0
- ui/src/FieldRow.tsx +201 -0
- ui/src/PDFPane.tsx +229 -0
- ui/src/RecordPane.tsx +174 -0
- ui/src/ReviewDashboard.tsx +93 -0
- ui/src/SessionPage.tsx +82 -0
- ui/src/UploadPage.tsx +210 -0
- ui/src/api.ts +43 -0
- ui/src/assets/ai-toolstack-logo.svg +17 -0
- ui/src/index.css +31 -0
- ui/src/main.tsx +23 -0
- ui/src/store.ts +88 -0
- ui/src/types.ts +170 -0
- ui/src/vite-env.d.ts +1 -0
- ui/tailwind.config.js +26 -0
- ui/tsconfig.app.json +21 -0
- ui/tsconfig.json +7 -0
- ui/tsconfig.node.json +14 -0
- ui/vite.config.ts +16 -0
config/prompts.yaml
ADDED
@@ -0,0 +1,249 @@
# prompts.yaml — Versioned system prompts for the UK Motor Insurance IDP pipeline.
#
# HOW TO USE
# ──────────
# • Change `active_version` to switch all agents to a new prompt set.
# • Add a new top-level key under `prompts:` (e.g. v2) to version a new set.
# • Each version must define keys for every DocumentType value:
# Schedule | Certificate | StatementOfFact | PolicyBooklet | _generic
# • Restart the pipeline after editing this file; no code changes required.

active_version: "v2"

prompts:
  v1:
    Schedule: |
      You are an expert UK motor insurance data extractor specialising in Policy Schedules.

      A Schedule is the most authoritative document for:
      - policy_number, insurer name, policy dates (start_date, expiry_date)
      - Vehicle registration mark (VRM) and make/model
      - Cover type (Comprehensive, TPFT, Third Party Only)
      - ALL excess figures: compulsory, voluntary, windscreen replacement/repair,
        fire & theft. Calculate accidental_damage_total = compulsory + voluntary
        if the total is not explicitly stated.
      - No Claims Bonus (NCB) years and whether it is protected.

      Extract every figure you find. Return null for anything genuinely absent.
      Output ONLY valid JSON matching the requested schema — no commentary.

    Certificate: |
      You are an expert UK motor insurance data extractor specialising in Certificates
      of Motor Insurance.

      A Certificate is the legal authority for:
      - Named drivers: full name, relationship to policyholder, age, and any
        endorsements / restrictions on each driver.
      - Class of use: social/domestic/pleasure, commuting, business use, etc.
      - The period of cover dates and vehicle details as confirmation cross-checks.

      Capture EVERY driver listed, including the proposer/policyholder.
      Output ONLY valid JSON matching the requested schema — no commentary.

    StatementOfFact: |
      You are an expert UK motor insurance data extractor specialising in Statements
      of Fact (also called Proposal Forms or Statement of Insurance).

      A Statement of Fact is authoritative for:
      - Claims history: number of claims in the last N years, dates, types, at-fault status.
      - Motoring convictions / endorsements (SP30, IN10, etc.) for all drivers.
      - Risk details: annual mileage, overnight parking, security devices, modifications.
      - The proposer's occupation, age, years held licence.

      Note these fields in driver restrictions[] and any relevant free-text fields.
      Output ONLY valid JSON matching the requested schema — no commentary.

    PolicyBooklet: |
      You are an expert UK motor insurance data extractor reviewing a Policy Booklet
      (also called Terms & Conditions or Policy Wording).

      A Policy Booklet rarely contains policyholder-specific data. Extract only if
      explicitly stated:
      - Insurer name / underwriter
      - Any default excess or cover-type definitions that clarify ambiguous fields.

      If no policyholder-specific data is present, return a minimal JSON with only
      the insurer field populated (if visible) and nulls elsewhere.
      Output ONLY valid JSON matching the requested schema — no commentary.

    _generic: |
      You are an expert UK motor insurance data extractor.
      Extract all available structured data from the document text provided.
      Output ONLY valid JSON matching the requested schema — no commentary.

  # ── v2 placeholder ─────────────────────────────────────────────────────────
  # Copy v1 keys here and iterate on individual prompts independently.

  v2:
    Schedule: |
      You are an expert UK motor insurance data extractor specialising in Policy Schedules.
      Extract ALL data from the document and populate the UKMotorGoldenRecord schema.

      POLICY HEADER — extract:
      - policy_number, insurer name (full legal name), product_name (e.g. "PolicyTrace Comprehensive Plus")
      - period_of_cover: start_date and expiry_date as ISO-8601 datetime, issue_date as ISO-8601 date

      VEHICLE DETAILS — extract:
      - vrm (registration plate), make (manufacturer), model (full model name including variant/bhp)
      - fuel_type (Electric / Petrol / Diesel / Hybrid), transmission (Automatic / Manual)
      - estimated_value (e.g. "Market Value" or a £ amount)
      - annual_mileage (integer), overnight_postcode, kept_location (e.g. "Drive", "Garage", "Road")
      - security: has_security_device (bool), tracker_fitted (bool), modifications (text or "None")

      DRIVER DETAILS — for EACH named driver extract:
      - name (full name), dob (date of birth as ISO-8601 date YYYY-MM-DD)
      - relationship ("Policyholder" / "Named Driver" / "Spouse" etc.)
      - occupation (job title as stated), license_type ("Full UK" or "UK Provisional")
      - is_main_driver: true only for the main/principal driver
      - specific_excess: any driver-specific additional excess (float), null if none

      COVER AND EXCESSES — extract:
      - cover_type (Comprehensive / TPFT / Third Party Only)
      - class_of_use (verbatim from schedule)
      - driving_other_cars (bool, from schedule if stated)
      - no_claims_discount: years (int), protected (bool)
      - excess_breakdown:
          standard_compulsory: the compulsory excess in £
          voluntary: the voluntary excess in £
          total_accidental_damage: COMPUTE as standard_compulsory + voluntary if not shown
          fire: the fire-specific excess in £ (may differ from theft)
          theft: the theft-specific excess in £ (may differ from fire)
          windscreen_repair: windscreen repair excess in £
          windscreen_replacement: windscreen replacement excess in £
          own_repairer_additional_excess: additional excess for using own repairer in £

      FINANCIAL SUMMARY — extract:
      - total_annual_premium: total annual premium in £ (float)
      - optional_extras: for each extra, use the premium amount (float) if purchased,
        or the string "Not Selected" if not selected/included:
          motor_legal_protection, breakdown_roadside_assistance,
          enhanced_personal_accident, hire_car, key_cover

      ADDITIONAL RISK DATA — extract:
      - home_ownership (e.g. "Homeowner", "Not a Homeowner", "Tenant")
      - children_under_16 (bool)
      - number_of_cars_in_household (int)
      - non_motoring_convictions (bool)
      - endorsements (text, "None" if absent)

      CRITICAL RULES:
      - Fire excess and theft excess are SEPARATE fields — they may have different values.
      - Driver DOBs must be extracted as YYYY-MM-DD dates, not as ages.
      - Return null for any field genuinely absent. Do NOT invent data.
      - Output ONLY valid JSON matching the UKMotorGoldenRecord schema — no commentary.

      FIELD_CITATIONS — populate the `field_citations` dict with a verbatim phrase
      copied EXACTLY from the document for each field you extract.
      Use the dotted field path as the key.
      The phrase must be a verbatim copy of the raw text as it appears in the document —
      do NOT normalise, translate or paraphrase.

      Required citations (include only those you actually populated):
        "policy_header.policy_number" → e.g. "NBM-DEMO-0427"
        "policy_header.insurer" → e.g. "Northbridge Mutual Motor Insurance Ltd"
        "policy_header.period_of_cover.start_date" → e.g. "15/04/2026 at 00:00 hours"
        "policy_header.period_of_cover.expiry_date" → e.g. "14/04/2027 at 23:59 hours"
        "policy_header.period_of_cover.issue_date" → e.g. "16/03/2026"
        "vehicle_details.vrm" → e.g. "ZX24 DEM"
        "vehicle_details.make" → e.g. "Skoda"
        "vehicle_details.model" → e.g. "Enyaq iV 60 62kWh 177.0 bhp"
        "vehicle_details.fuel_type" → e.g. "Electric"
        "vehicle_details.estimated_value" → e.g. "Market Value"
        "vehicle_details.annual_mileage" → e.g. "7,000"
        "vehicle_details.overnight_postcode" → e.g. "ZZ1 1ZZ"
        "vehicle_details.kept_location" → e.g. "Drive"
        "cover_and_excesses.cover_type" → e.g. "Comprehensive"
        "cover_and_excesses.class_of_use" → e.g. "Social, Domestic, Pleasure and Commuting"
        "cover_and_excesses.no_claims_discount.years" → e.g. "2 years"
        "cover_and_excesses.excess_breakdown.standard_compulsory" → e.g. "GBP 395.00"
        "cover_and_excesses.excess_breakdown.voluntary" → e.g. "GBP 200.00"
        "cover_and_excesses.excess_breakdown.windscreen_repair" → e.g. "GBP 15.00"
        "cover_and_excesses.excess_breakdown.windscreen_replacement" → e.g. "GBP 200.00"
        "financial_summary.total_annual_premium" → e.g. "GBP 703.28"
      For each driver[N] (N = 0, 1, 2…):
        "driver_details[N].name" → e.g. "Alex Morgan"
        "driver_details[N].dob" → e.g. "14/03/1991"
        "driver_details[N].occupation" → e.g. "Product Manager"
        "driver_details[N].license_type" → e.g. "Full UK"

    Certificate: |
      You are an expert UK motor insurance data extractor specialising in
      Certificates of Motor Insurance.

      A Certificate of Motor Insurance is the LEGAL document for road use.
      Focus ONLY on what is legally defined in this document.

      POLICY HEADER — extract:
      - policy_number (from the certificate heading)
      - insurer (full legal name as printed on the certificate)
      - period_of_cover: start_date and expiry_date as ISO-8601 datetime

      COVER AND EXCESSES — extract ONLY:
      - class_of_use: copy the EXACT text of the "Limitations as to use" or
        "Class of Use" clause verbatim (e.g. "Social, Domestic, Pleasure and Commuting")
      - driving_other_cars: true if the certificate explicitly grants driving other cars;
        false otherwise

      DRIVER DETAILS — for EACH named person entitled to drive:
      - name (full name as printed), relationship if stated, is_main_driver if the
        main policyholder is identified

      LEAVE AS NULL — do NOT populate these sections from a Certificate:
      - vehicle_details (make, model, fuel_type, transmission, security, mileage, etc.)
      - excess_breakdown (standard_compulsory, voluntary, fire, theft, windscreen, etc.)
      - financial_summary (total_annual_premium, optional_extras)
      - additional_risk_data
      - driver dob, occupation, license_type, specific_excess

      Output ONLY valid JSON matching the UKMotorGoldenRecord schema — no commentary.

      FIELD_CITATIONS — populate the `field_citations` dict with a verbatim phrase
      copied EXACTLY from the document for each field you extract.
      Use the dotted field path as the key.
      The phrase must be a verbatim copy of the raw text as it appears in the document —
      do NOT normalise, translate or paraphrase.

      Required citations (include only those you actually populated):
        "policy_header.policy_number" → e.g. "NBM-DEMO-0427"
        "policy_header.insurer" → e.g. "Northbridge Mutual Motor Insurance Ltd"
        "policy_header.period_of_cover.start_date" → e.g. "15/04/2026 at 00:00 hours"
        "policy_header.period_of_cover.expiry_date" → e.g. "14/04/2027 at 23:59 hours"
        "cover_and_excesses.class_of_use" → e.g. "Social, Domestic, Pleasure and Commuting"
        "cover_and_excesses.cover_type" → e.g. "Comprehensive"
      For each driver[N] (N = 0, 1, 2…):
        "driver_details[N].name" → e.g. "Alex Morgan"

    StatementOfFact: |
      You are an expert UK motor insurance data extractor specialising in Statements
      of Fact (also called Proposal Forms or Statement of Insurance).

      A Statement of Fact is authoritative for:
      - Claims history: number of claims in the last N years, dates, types, at-fault status.
      - Motoring convictions / endorsements (SP30, IN10, etc.) for all drivers.
      - Risk details: annual mileage, overnight parking, security devices, modifications.
      - The proposer's occupation, age, years held licence.

      Extract into the UKMotorGoldenRecord schema wherever fields map cleanly.
      Output ONLY valid JSON matching the requested schema — no commentary.

    PolicyBooklet: |
      You are an expert UK motor insurance data extractor reviewing a Policy Booklet
      (also called Terms & Conditions or Policy Wording).

      A Policy Booklet rarely contains policyholder-specific data. Extract only if
      explicitly stated: insurer name or any policyholder-specific definitions.
      If no policyholder-specific data is present, return an empty UKMotorGoldenRecord.
      Output ONLY valid JSON matching the requested schema — no commentary.

    _generic: |
      You are an expert UK motor insurance data extractor.
      Extract all available structured data from the document text provided.
      Populate the UKMotorGoldenRecord schema as completely as possible.
      Output ONLY valid JSON matching the requested schema — no commentary.
  # <improved certificate prompt>
  # StatementOfFact: |
  # <improved sof prompt>
  # PolicyBooklet: |
  # <improved booklet prompt>
  # _generic: |
  # <improved generic prompt>
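
The versioning scheme described in the `HOW TO USE` comment of `config/prompts.yaml` can be sketched roughly as follows. This is a minimal illustration, not the code in `src/prompts.py`: the `select_prompt` helper, the shortened prompt strings, and the fallback to `_generic` for unrecognised document types are all assumptions made for the example. A plain dict stands in for the parsed YAML so the sketch needs no third-party packages.

```python
from typing import Any

# Hypothetical stand-in for the parsed contents of config/prompts.yaml.
PROMPTS_CONFIG: dict[str, Any] = {
    "active_version": "v2",
    "prompts": {
        "v1": {"Schedule": "v1 schedule prompt", "_generic": "v1 generic prompt"},
        "v2": {"Schedule": "v2 schedule prompt", "_generic": "v2 generic prompt"},
    },
}


def select_prompt(config: dict[str, Any], document_type: str) -> str:
    """Return the system prompt for a document type from the active version."""
    version = config["active_version"]
    try:
        prompt_set = config["prompts"][version]
    except KeyError as exc:
        raise KeyError(
            f"active_version {version!r} is not defined under 'prompts'"
        ) from exc
    # Assumption: unknown document types fall back to the `_generic` prompt.
    return prompt_set.get(document_type, prompt_set["_generic"])


print(select_prompt(PROMPTS_CONFIG, "Schedule"))
print(select_prompt(PROMPTS_CONFIG, "Unknown"))
```

Under this sketch, changing `active_version` is the only edit needed to move every document type to a new prompt set, which matches the intent stated in the file header.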
config/settings.yaml
ADDED
@@ -0,0 +1,73 @@
# settings.yaml — Runtime tuneables for the UK Motor Insurance IDP pipeline.
#
# HOW TO USE
# ──────────
# • Edit values here to tune behaviour without touching Python code.
# • Environment variables take priority over values in this file:
# GROQ_API_KEY — (required) your Groq API secret key
# GROQ_MODEL — overrides llm.model below (set in .env or shell)
# • Restart the pipeline after editing this file.

llm:
  # Model served by Groq. Override at runtime via GROQ_MODEL env var.
  model: "meta-llama/llama-4-scout-17b-16e-instruct"
  # Fast model for document classification. Override via GROQ_CLASSIFIER_MODEL env var.
  classifier_model: "llama-3.1-8b-instant"
  # Number of instructor self-correction retries on Pydantic validation failure.
  max_retries: 2

pii:
  # Minimum Presidio confidence score (0.0–1.0) to trigger redaction.
  score_threshold: 0.5
  # Set to true to also redact DATE_TIME entities (breaks date extraction — use carefully).
  mask_dates: false
  # spaCy language code used by the Presidio NLP engine.
  language: "en"
  # Presidio entity types to redact before sending text to the LLM.
  entities:
    - PERSON
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - UK_NHS
    - UK_NIN # National Insurance Number
    - CREDIT_CARD
    - IBAN_CODE
    - LOCATION # postcodes / addresses
    - IP_ADDRESS
    - URL

pipeline:
  # Default output path for the Golden Record JSON.
  output_path: "../output/golden_record.json"
  # Default logging verbosity: DEBUG | INFO | WARNING | ERROR
  log_level: "INFO"
  # Session directories older than this many days are deleted on API startup. 0 = disabled.
  session_ttl_days: 30

debug:
  # Master switch — set to false to skip all debug artifact writing.
  enabled: true
  # Root folder for debug runs. Each execution creates a timestamped sub-folder.
  output_dir: "./output/debug"
  # Save the raw Markdown produced by docling for each PDF.
  save_markdown: true
  # Save the PII-masked Markdown that is actually sent to the LLM.
  save_masked_markdown: true
  # Save the raw UKMotorPolicy JSON extracted from each document.
  save_extraction_json: true
  # Append a JSONL line per document: prompt size, response time, fields populated.
  save_metrics: true

docling:
  # Disable OCR — UK insurance PDFs are text-based; OCR doubles memory usage per page.
  do_ocr: false
  # Disable deep table-structure recognition to reduce memory pressure on large PDFs.
  do_table_structure: false
  # Maximum pages to process per document type. null = no limit.
  # Policy Booklet is the lowest-priority document (57+ pages) — cap it to save memory.
  max_pages:
    Schedule: null
    Certificate: null
    StatementOfFact: null
    PolicyBooklet: 20
    Unknown: 30
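
The precedence rule stated in the header of `config/settings.yaml` (environment variables win over file values) might look something like the sketch below. The `resolve_model` helper is hypothetical and is not taken from `src/settings.py`; treating an empty or whitespace-only `GROQ_MODEL` as "not set" is also an assumption.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical stand-in for the parsed `llm` section of config/settings.yaml.
SETTINGS: dict[str, Any] = {
    "llm": {"model": "meta-llama/llama-4-scout-17b-16e-instruct"},
}


def resolve_model(
    settings: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the extraction model: GROQ_MODEL if set, else llm.model from the file."""
    env = os.environ if env is None else env
    # Assumption: a blank GROQ_MODEL is treated the same as an unset one.
    override = env.get("GROQ_MODEL", "").strip()
    return override or settings["llm"]["model"]


print(resolve_model(SETTINGS, env={}))
print(resolve_model(SETTINGS, env={"GROQ_MODEL": "llama-3.1-8b-instant"}))
```

Passing `env` explicitly keeps the function easy to test; callers that omit it read from the real process environment.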
docs/architecture.md
ADDED
@@ -0,0 +1,139 @@
# PolicyTrace Architecture

PolicyTrace is built as a two-part application:

- A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
- A React frontend that lets a human reviewer inspect every extracted field against the source PDF.

## Core Flow

```mermaid
sequenceDiagram
    participant User
    participant UI as React UI
    participant API as FastAPI
    participant Docling
    participant LLM as Groq LLM
    participant Arbiter
    participant Prov as Provenance matcher

    User->>UI: Upload PDF pack
    UI->>API: POST /api/process
    API->>Docling: Convert PDFs to Markdown and geometry
    API->>API: Mask selected PII
    API->>LLM: Classify document type
    API->>LLM: Extract typed Golden Record fields
    API->>Arbiter: Merge Schedule and Certificate
    Arbiter-->>API: Golden Record plus conflicts
    API->>Prov: Match fields to PDF text geometry
    Prov-->>API: Field-level provenance
    API-->>UI: Session ID
    UI->>API: GET /api/session/{id}
    API-->>UI: Record, provenance, conflicts
```

## Backend Modules

### `src/agents.py`

Responsible for document-level work:

- Convert PDF to Markdown using Docling.
- Build a Docling geometry corpus for provenance.
- Mask selected PII before LLM calls.
- Classify document type.
- Route text to specialist extraction prompts.
- Return a `UKMotorGoldenRecord` Pydantic model.

### `src/schema.py`

Defines the canonical output contract:

- `UKMotorGoldenRecord`
- policy header
- vehicle details
- driver details
- cover and excesses
- financial summary
- additional risk data
- field provenance
- conflict entries

The schema keeps most fields optional because each source document is only partially authoritative.

### `src/arbiter.py`

Merges Schedule and Certificate records using a hierarchy of truth.

Schedule wins for:

- vehicle details
- cover type
- no claims discount
- excess breakdown
- financial summary
- driver DOB, occupation, licence type

Certificate wins for:

- class of use
- driving other cars
- legal driver entitlement details when present

When two documents disagree, the arbiter records a `ConflictEntry`.
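
The hierarchy of truth described above can be illustrated with a small sketch. It is not the implementation in `src/arbiter.py`: the flat field names, the shape of the `ConflictEntry` dataclass, the `merge_records` function, and the choice to default unlisted fields to the Schedule are all assumptions made to keep the example short.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical flat field names; the real record is nested.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class ConflictEntry:
    """Illustrative conflict record; the real schema may differ."""
    field: str
    schedule_value: Any
    certificate_value: Any
    chosen_source: str


def merge_records(
    schedule: dict[str, Any], certificate: dict[str, Any]
) -> tuple[dict[str, Any], list[ConflictEntry]]:
    """Merge two partial records, preferring the authoritative source per field."""
    merged: dict[str, Any] = {}
    conflicts: list[ConflictEntry] = []
    for field in sorted(set(schedule) | set(certificate)):
        s_val, c_val = schedule.get(field), certificate.get(field)
        # Assumption: any field not listed for the Certificate defaults to the Schedule.
        if field in CERTIFICATE_WINS:
            ranked = [("Certificate", c_val), ("Schedule", s_val)]
        else:
            ranked = [("Schedule", s_val), ("Certificate", c_val)]
        # Take the preferred source unless it has no value for this field.
        chosen_source, value = next(
            ((src, val) for src, val in ranked if val is not None), ranked[0]
        )
        merged[field] = value
        if s_val is not None and c_val is not None and s_val != c_val:
            conflicts.append(ConflictEntry(field, s_val, c_val, chosen_source))
    return merged, conflicts


schedule = {"cover_type": "Comprehensive", "class_of_use": "Social only"}
certificate = {"cover_type": None, "class_of_use": "Social, Domestic, Pleasure and Commuting"}
merged, conflicts = merge_records(schedule, certificate)
print(merged)
print(conflicts)
```

In this sketch a conflict is only recorded when both documents supply a value and the values differ; a field that is simply missing from one document is filled from the other without raising a conflict.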

### `src/provenance.py`

Builds field-level PDF provenance after extraction.

The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.

To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
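
As a rough illustration of citation matching, the sketch below locates a verbatim citation among text segments that carry page and bounding-box data. It is an assumption-heavy stand-in rather than the code in `src/provenance.py`: the `TextSegment` type, the `locate_citation` function, and the `0.8` threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` is swapped in for the `rapidfuzz` dependency listed in `requirements.txt` so the example runs without extra packages.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class TextSegment:
    """Hypothetical unit of PDF text geometry: text plus where it sits on the page."""
    text: str
    page: int
    bbox: tuple[float, float, float, float]


def _normalise(text: str) -> str:
    # Lower-case and collapse whitespace so layout differences do not block a match.
    return " ".join(text.lower().split())


def locate_citation(
    citation: str, segments: list[TextSegment], threshold: float = 0.8
) -> Optional[TextSegment]:
    """Return the best-matching segment for a citation, or None if nothing is close."""
    needle = _normalise(citation)
    if not needle:
        return None
    best, best_score = None, 0.0
    for segment in segments:
        haystack = _normalise(segment.text)
        # An exact substring hit scores 1.0; otherwise fall back to a similarity ratio.
        score = 1.0 if needle in haystack else SequenceMatcher(None, needle, haystack).ratio()
        if score > best_score:
            best, best_score = segment, score
    return best if best_score >= threshold else None


segments = [
    TextSegment("Policy number: NBM-DEMO-0427", 1, (72.0, 90.0, 300.0, 104.0)),
    TextSegment("Total annual premium GBP 703.28", 2, (72.0, 400.0, 320.0, 414.0)),
]
hit = locate_citation("GBP 703.28", segments)
print(hit.page if hit else None)
```

The point of the design is that coordinates always come from the PDF geometry; the model only supplies the phrase to look for.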

### `src/api.py`

FastAPI service for the review UI:

- `GET /api/health`
- `POST /api/process`
- `GET /api/session/{id}`
- `GET /api/pdf/{session_id}/{filename}`
- `PATCH /api/session/{id}/review`
- `GET /api/session/{id}/review-state`
- `DELETE /api/session/{id}`

When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes.

## Frontend Modules

### `ui/src/UploadPage.tsx`

Upload screen for PDF packs.

### `ui/src/SessionPage.tsx`

Loads an existing session from the API so sessions can be opened directly from a URL.

### `ui/src/ReviewDashboard.tsx`

Two-column review layout: PDF viewer on the left, Golden Record fields on the right.

### `ui/src/PDFPane.tsx`

Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields.

### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`

Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
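
The flattening step can be sketched as below. The frontend itself is written in TypeScript, so this Python version only illustrates the idea and is not a port of `RecordPane.tsx`; the `flatten_record` name is hypothetical, and the path format (`section.field`, `driver_details[N].field`) is assumed to follow the dotted field paths used for `field_citations` in the prompts.

```python
from typing import Any, Iterator


def flatten_record(value: Any, prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (dotted_path, leaf_value) pairs for every leaf in a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten_record(child, path)
    elif isinstance(value, list):
        # List items are addressed by index, e.g. driver_details[0].name.
        for index, child in enumerate(value):
            yield from flatten_record(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


record = {
    "policy_header": {"policy_number": "NBM-DEMO-0427"},
    "driver_details": [{"name": "Alex Morgan", "dob": "1991-03-14"}],
}
for path, leaf in flatten_record(record):
    print(path, "=", leaf)
```

Each yielded pair corresponds to one reviewable row. Note that in this sketch empty lists and empty dicts produce no rows, while `None` leaves are kept so that missing values remain visible to a reviewer.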

## Why This Architecture

The system deliberately separates concerns:

- The LLM extracts structured values.
- Pydantic validates the shape.
- The arbiter applies domain-specific source authority.
- Provenance is calculated after extraction instead of trusting the model to invent coordinates.
- The UI keeps humans in the loop where confidence, evidence, or conflicts need review.

That separation is what turns the project from a prompt demo into a deployable workflow.
docs/hugging-face.md
ADDED
@@ -0,0 +1,70 @@
# Hugging Face Spaces Deployment

PolicyTrace should be deployed as a Docker Space because it is a FastAPI plus React application, not a pure Gradio or Streamlit app.

## Deployment Shape

The root `Dockerfile` does this:

1. Builds the React UI with Vite.
2. Installs the Python backend dependencies.
3. Downloads the small spaCy English model used by Presidio.
4. Copies `ui/dist` into the image.
5. Starts FastAPI on port `7860`.
6. Lets FastAPI serve both `/api/*` and the React app.

## Space Settings

Create a new Hugging Face Space:

- SDK: Docker
- Port: `7860`
- Visibility: public or private, depending on your demo plan

Add this secret in the Space settings:

```text
GROQ_API_KEY=your_groq_key
```

Optional secrets or variables:

```text
GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
GROQ_CLASSIFIER_MODEL=llama-3.1-8b-instant
```

## Public Demo Safety

For a public Space, use only the synthetic PDFs in:

```text
sample_data/policytrace_demo_pack/
```

Do not upload real customer documents to a public demo unless you have explicit permission and strong retention controls.

## Storage Notes

Hugging Face Spaces have ephemeral storage by default. This means generated sessions may disappear when the Space restarts.

For a public portfolio demo, ephemeral storage is usually fine. For a persistent review workflow, enable persistent storage or move sessions to an external object store/database.

## Local Docker Test

Before pushing to a Space:

```powershell
docker build -t policytrace .
docker run --rm -p 7860:7860 --env-file .env policytrace
```

Then open:

```text
http://localhost:7860
```

## Linking From This Repo

After the Space is live, add the Space URL to the main `README.md` demo section.
requirements-dev.txt
ADDED
@@ -0,0 +1,2 @@
-r requirements.txt
pytest>=8.2.0
requirements.txt
ADDED
@@ -0,0 +1,27 @@
+# ── Core pipeline ──────────────────────────────────────────────────────────
+docling>=2.5.0
+instructor>=1.3.0
+groq>=0.8.0
+pydantic>=2.7.0
+
+# ── PII masking ────────────────────────────────────────────────────────────
+presidio-analyzer>=2.2.354
+presidio-anonymizer>=2.2.354
+# spaCy model (download separately after install):
+# python -m spacy download en_core_web_lg
+spacy>=3.7.0
+
+# ── Utilities ──────────────────────────────────────────────────────────────
+python-dotenv>=1.0.0 # load GROQ_API_KEY / GROQ_MODEL from .env
+pyyaml>=6.0.0 # parse config/settings.yaml and config/prompts.yaml
+
+# ── API server (Visual Audit UI) ───────────────────────────────────────────
+fastapi>=0.111.0
+uvicorn[standard]>=0.30.0
+python-multipart>=0.0.9 # required by FastAPI for UploadFile
+
+# ── Provenance fuzzy matching ──────────────────────────────────────────────
+rapidfuzz>=3.9.0
+
+# Demo fixture generation
+reportlab>=4.2.0
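Because the spaCy model is downloaded separately from `pip install -r requirements.txt`, a quick preflight check can catch a missing model before the PII masking step fails. This is a minimal sketch, not part of the repository: `missing_packages` is a hypothetical helper, and it relies on the fact that a downloaded spaCy model installs as an importable package of the same name.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # en_core_web_lg becomes an importable package once downloaded.
    missing = missing_packages(["spacy", "en_core_web_lg"])
    if missing:
        print("Missing:", ", ".join(missing))
        print("Hint: python -m spacy download en_core_web_lg")
    else:
        print("spaCy and the en_core_web_lg model are importable.")
```

The check only looks up the package spec, so it does not load the model into memory.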
sample_data/README.md
ADDED
@@ -0,0 +1,25 @@
+# Sample Data
+
+This folder contains synthetic demo documents for PolicyTrace.
+
+The PDFs in `policytrace_demo_pack/` are fictional, text-based UK motor
+insurance documents. They use invented names, policy numbers, vehicle
+registration, insurer branding, address, and risk details. They are safe to use
+in screenshots, demos, blog posts, GitHub examples, and Hugging Face Spaces.
+
+Generated files:
+
+- `Schedule of Insurance - Demo.pdf`
+- `Certificate of Motor Insurance - Demo.pdf`
+- `Statement of Fact - Demo.pdf`
+- `Policy Booklet - Demo.pdf`
+- `manifest.json`
+
+To regenerate the pack from source:
+
+```powershell
+python scripts/generate_synthetic_policy_pack.py
+```
+
+Do not commit real customer PDFs, real policy documents, or local extraction
+outputs to the public repository.
sample_data/policytrace_demo_pack/manifest.json
ADDED
@@ -0,0 +1,13 @@
+{
+  "purpose": "Synthetic demo data for AI Tool Stack PolicyTrace.",
+  "warning": "No real customer, insurer, vehicle, or policy data is included.",
+  "files": [
+    "Schedule of Insurance - Demo.pdf",
+    "Certificate of Motor Insurance - Demo.pdf",
+    "Statement of Fact - Demo.pdf",
+    "Policy Booklet - Demo.pdf"
+  ],
+  "expected_policy_number": "NBM-DEMO-0427",
+  "expected_vrm": "ZX24 DEM",
+  "expected_insurer": "Northbridge Mutual Motor Insurance Ltd"
+}
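The manifest lists the files the demo pack is expected to contain, so it can double as a cheap integrity check. The following is a sketch under that assumption; `check_pack` is a hypothetical helper, not a function from the repository, and it reads only the `files` key shown in the manifest.

```python
import json
from pathlib import Path


def check_pack(pack_dir: Path) -> list[str]:
    """Return the manifest-listed files that are missing from pack_dir."""
    manifest = json.loads((pack_dir / "manifest.json").read_text(encoding="utf-8"))
    return [name for name in manifest["files"] if not (pack_dir / name).is_file()]


if __name__ == "__main__":
    pack = Path("sample_data/policytrace_demo_pack")
    if (pack / "manifest.json").is_file():
        missing = check_pack(pack)
        print("Missing files:", missing or "none")
    else:
        print(f"No manifest found in {pack}; generate the pack first.")
```

A check like this could run in CI after regenerating the pack, before the PDFs are used as test fixtures.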
scripts/generate_synthetic_policy_pack.py
ADDED
@@ -0,0 +1,451 @@
+"""
+Generate a synthetic UK motor insurance PDF pack for demos and tests.
+
+The PDFs are intentionally fictional: invented insurer, logo, names, address,
+policy number, vehicle registration, and risk details. They are text-based PDFs
+so Docling can parse them without OCR.
+
+Run from the repository root:
+
+    python scripts/generate_synthetic_policy_pack.py
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Iterable
+
+from reportlab.lib import colors
+from reportlab.lib.pagesizes import A4
+from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
+from reportlab.lib.units import mm
+from reportlab.platypus import (
+    Paragraph,
+    SimpleDocTemplate,
+    Spacer,
+    Table,
+    TableStyle,
+)
+
+
+OUT_DIR = Path("sample_data/policytrace_demo_pack")
+BRAND_DARK = colors.HexColor("#1F2937")
+BRAND_BLUE = colors.HexColor("#2563EB")
+BRAND_TEAL = colors.HexColor("#008080")
+BRAND_PINK = colors.HexColor("#FCE7F3")
+BRAND_LIGHT = colors.HexColor("#F8FAFC")
+
+POLICY = {
+    "insurer": "Northbridge Mutual Motor Insurance Ltd",
+    "product_name": "PolicyTrace Comprehensive Plus",
+    "policy_number": "NBM-DEMO-0427",
+    "issue_date": "18/03/2026",
+    "start_date": "15/04/2026 at 00:00 hours",
+    "expiry_date": "14/04/2027 at 23:59 hours",
+    "policyholder": "Alex Morgan",
+    "address": "14 Demo Crescent, Sampleton, West Yorkshire, ZZ1 1ZZ",
+    "dob": "14/03/1991",
+    "occupation": "Product Manager",
+    "second_driver": "Priya Shah",
+    "second_driver_dob": "07/08/1995",
+    "second_driver_occupation": "Business Analyst",
+    "third_driver": "Jordan Reed",
+    "third_driver_dob": "11/10/1985",
+    "third_driver_occupation": "Data Administrator",
+    "vrm": "ZX24 DEM",
+    "make": "Skoda",
+    "model": "Enyaq iV 60 62kWh 177.0 bhp",
+    "fuel_type": "Electric",
+    "transmission": "Automatic",
+    "estimated_value": "Market Value",
+    "annual_mileage": "7,000 miles",
+    "overnight_postcode": "ZZ1 1ZZ",
+    "kept_location": "Drive",
+    "security_device": "Yes",
+    "tracker_fitted": "No",
+    "modifications": "No",
+    "cover_type": "Comprehensive",
+    "class_of_use": (
+        "Use for social, domestic and pleasure purposes including commuting "
+        "to and from a permanent place of work."
+    ),
+    "driving_other_cars": "No",
+    "ncb_years": "2 years",
+    "ncb_protected": "No",
+    "standard_compulsory": "GBP 395.00",
+    "voluntary": "GBP 200.00",
+    "total_accidental_damage": "GBP 595.00",
+    "fire": "GBP 395.00",
+    "theft": "GBP 445.00",
+    "windscreen_repair": "GBP 15.00",
+    "windscreen_replacement": "GBP 200.00",
+    "own_repairer": "GBP 200.00",
+    "total_premium": "GBP 703.28",
+    "legal": "GBP 25.40",
+    "breakdown": "GBP 28.07",
+    "personal_accident": "GBP 20.00",
+    "hire_car": "Not selected",
+    "key_cover": "Not selected",
+}
+
+
+def _styles() -> dict[str, ParagraphStyle]:
+    base = getSampleStyleSheet()
+    return {
+        "title": ParagraphStyle(
+            "title",
+            parent=base["Title"],
+            fontName="Helvetica-Bold",
+            fontSize=22,
+            textColor=BRAND_DARK,
+            spaceAfter=14,
+        ),
+        "subtitle": ParagraphStyle(
+            "subtitle",
+            parent=base["Normal"],
+            fontName="Helvetica",
+            fontSize=10,
+            leading=14,
+            textColor=colors.HexColor("#475569"),
+            spaceAfter=10,
+        ),
+        "h2": ParagraphStyle(
+            "h2",
+            parent=base["Heading2"],
+            fontName="Helvetica-Bold",
+            fontSize=13,
+            textColor=BRAND_TEAL,
+            spaceBefore=12,
+            spaceAfter=7,
+        ),
+        "body": ParagraphStyle(
+            "body",
+            parent=base["BodyText"],
+            fontName="Helvetica",
+            fontSize=9,
+            leading=12,
+            textColor=BRAND_DARK,
+            spaceAfter=6,
+        ),
+        "small": ParagraphStyle(
+            "small",
+            parent=base["BodyText"],
+            fontName="Helvetica",
+            fontSize=7,
+            leading=9,
+            textColor=colors.HexColor("#64748B"),
+        ),
+    }
+
+
+def _draw_header(canvas, doc, title: str) -> None:
+    canvas.saveState()
+    width, height = A4
+    canvas.setFillColor(BRAND_DARK)
+    canvas.roundRect(16 * mm, height - 24 * mm, 42 * mm, 11 * mm, 2 * mm, fill=1, stroke=0)
+    canvas.setFillColor(BRAND_TEAL)
+    canvas.circle(22 * mm, height - 18.5 * mm, 2.6 * mm, fill=1, stroke=0)
+    canvas.setFillColor(BRAND_BLUE)
+    canvas.circle(29 * mm, height - 18.5 * mm, 2.6 * mm, fill=1, stroke=0)
+    canvas.setFillColor(colors.white)
+    canvas.setFont("Helvetica-Bold", 6)
+    canvas.drawString(36 * mm, height - 19.5 * mm, "NORTHBRIDGE")
+    canvas.setFillColor(colors.HexColor("#64748B"))
+    canvas.setFont("Helvetica", 7)
+    canvas.drawRightString(width - 16 * mm, height - 18 * mm, title)
+    canvas.setStrokeColor(colors.HexColor("#E2E8F0"))
+    canvas.line(16 * mm, height - 28 * mm, width - 16 * mm, height - 28 * mm)
+    canvas.setFont("Helvetica", 6)
+    canvas.setFillColor(colors.HexColor("#94A3B8"))
+    canvas.drawString(
+        16 * mm,
+        11 * mm,
+        "Synthetic demo document generated for AI Tool Stack PolicyTrace. No real customer or insurer data.",
+    )
+    canvas.drawRightString(width - 16 * mm, 11 * mm, f"Page {doc.page}")
+    canvas.restoreState()
+
+
+def _table(rows: Iterable[Iterable[str]], col_widths: list[float] | None = None) -> Table:
+    data = [[Paragraph(str(cell), _styles()["body"]) for cell in row] for row in rows]
+    table = Table(data, colWidths=col_widths, hAlign="LEFT")
+    table.setStyle(
+        TableStyle(
+            [
+                ("BACKGROUND", (0, 0), (-1, 0), BRAND_LIGHT),
+                ("TEXTCOLOR", (0, 0), (-1, 0), BRAND_DARK),
+                ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
+                ("GRID", (0, 0), (-1, -1), 0.35, colors.HexColor("#CBD5E1")),
+                ("VALIGN", (0, 0), (-1, -1), "TOP"),
+                ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, BRAND_PINK]),
+                ("LEFTPADDING", (0, 0), (-1, -1), 6),
+                ("RIGHTPADDING", (0, 0), (-1, -1), 6),
+                ("TOPPADDING", (0, 0), (-1, -1), 5),
+                ("BOTTOMPADDING", (0, 0), (-1, -1), 5),
+            ]
+        )
+    )
+    return table
+
+
+def _doc(path: Path, title: str):
+    return SimpleDocTemplate(
+        str(path),
+        pagesize=A4,
+        leftMargin=18 * mm,
+        rightMargin=18 * mm,
+        topMargin=32 * mm,
+        bottomMargin=18 * mm,
+        title=title,
+        author="AI Tool Stack",
+    )
+
+
+def build_schedule() -> None:
+    s = _styles()
+    path = OUT_DIR / "Schedule of Insurance - Demo.pdf"
+    story = [
+        Paragraph("Car insurance schedule", s["title"]),
+        Paragraph(
+            "This schedule is a synthetic text-based PDF for the PolicyTrace demo. "
+            "Please check all details carefully and contact Northbridge Mutual if anything is incorrect.",
+            s["subtitle"],
+        ),
+        _table(
+            [
+                ["Policy number", POLICY["policy_number"], "Date of issue", POLICY["issue_date"]],
+                ["Insurer", POLICY["insurer"], "Product", POLICY["product_name"]],
+                ["Period of cover", f"{POLICY['start_date']} - {POLICY['expiry_date']}", "Cover type", POLICY["cover_type"]],
+            ],
+            [33 * mm, 52 * mm, 33 * mm, 52 * mm],
+        ),
+        Paragraph("Policyholder details", s["h2"]),
+        _table(
+            [
+                ["Name", POLICY["policyholder"]],
+                ["Address", POLICY["address"]],
+                ["Date of birth", POLICY["dob"]],
+                ["Occupation", POLICY["occupation"]],
+                ["Children under 16", "Yes"],
+                ["Home ownership status", "Not a Homeowner"],
+                ["Number of cars in household", "1"],
+                ["Access to other vehicles", "No access to any other vehicles"],
+            ],
+            [55 * mm, 115 * mm],
+        ),
+        Paragraph("Vehicle details", s["h2"]),
+        _table(
+            [
+                ["Registration number", POLICY["vrm"], "Make", POLICY["make"]],
+                ["Model", POLICY["model"], "Fuel type", POLICY["fuel_type"]],
+                ["Transmission", POLICY["transmission"], "Estimated value", POLICY["estimated_value"]],
+                ["Annual mileage", POLICY["annual_mileage"], "Overnight postcode", POLICY["overnight_postcode"]],
+                ["Kept location", POLICY["kept_location"], "Security device fitted", POLICY["security_device"]],
+                ["Tracker fitted", POLICY["tracker_fitted"], "Modifications", POLICY["modifications"]],
+            ],
+            [38 * mm, 48 * mm, 38 * mm, 48 * mm],
+        ),
+        Paragraph("Cover and no claims discount", s["h2"]),
+        _table(
+            [
+                ["Class of use", POLICY["class_of_use"]],
+                ["Driving other cars", POLICY["driving_other_cars"]],
+                ["No claims discount", POLICY["ncb_years"]],
+                ["Protected no claims discount", POLICY["ncb_protected"]],
+            ],
+            [55 * mm, 115 * mm],
+        ),
+        Paragraph("Excess breakdown", s["h2"]),
+        _table(
+            [
+                ["Excess type", "Amount"],
+                ["Standard compulsory excess", POLICY["standard_compulsory"]],
+                ["Voluntary excess", POLICY["voluntary"]],
+                ["Total accidental damage excess", POLICY["total_accidental_damage"]],
+                ["Fire excess", POLICY["fire"]],
+                ["Theft excess", POLICY["theft"]],
+                ["Windscreen repair excess", POLICY["windscreen_repair"]],
+                ["Windscreen replacement excess", POLICY["windscreen_replacement"]],
+                ["Own repairer additional excess", POLICY["own_repairer"]],
+            ],
+            [90 * mm, 50 * mm],
+        ),
+        Paragraph("Driver details", s["h2"]),
+        _table(
+            [
+                ["Driver name", "Date of birth", "Relationship", "Occupation", "Licence type", "Main driver", "Specific excess"],
+                [POLICY["policyholder"], POLICY["dob"], "Policyholder", POLICY["occupation"], "Full Licence UK / 2/1 / No", "Yes", ""],
+                [POLICY["second_driver"], POLICY["second_driver_dob"], "Named Driver", POLICY["second_driver_occupation"], "UK Provisional / 1/4 / No", "No", "GBP 200.00"],
+                [POLICY["third_driver"], POLICY["third_driver_dob"], "Named Driver", POLICY["third_driver_occupation"], "Full Licence UK / 5/0 / No", "No", ""],
+            ],
+            [30 * mm, 24 * mm, 24 * mm, 31 * mm, 31 * mm, 18 * mm, 22 * mm],
+        ),
+        Paragraph("Financial summary", s["h2"]),
+        _table(
+            [
+                ["Item", "Premium"],
+                ["Total annual premium", POLICY["total_premium"]],
+                ["Motor legal protection", POLICY["legal"]],
+                ["Breakdown roadside assistance", POLICY["breakdown"]],
+                ["Enhanced personal accident", POLICY["personal_accident"]],
+                ["Hire car", POLICY["hire_car"]],
+                ["Key cover", POLICY["key_cover"]],
+            ],
+            [90 * mm, 50 * mm],
+        ),
+    ]
+    _doc(path, "Schedule of Insurance - Demo").build(
+        story,
+        onFirstPage=lambda c, d: _draw_header(c, d, "Schedule of Insurance"),
+        onLaterPages=lambda c, d: _draw_header(c, d, "Schedule of Insurance"),
+    )
+
+
+def build_certificate() -> None:
+    s = _styles()
+    path = OUT_DIR / "Certificate of Motor Insurance - Demo.pdf"
+    story = [
+        Paragraph("Certificate of Motor Insurance", s["title"]),
+        Paragraph(
+            "This is to certify that a policy of insurance has been issued for the purposes of the Road Traffic Act.",
+            s["subtitle"],
+        ),
+        _table(
+            [
+                ["Policy number", POLICY["policy_number"]],
+                ["Insurer", POLICY["insurer"]],
+                ["Effective from", POLICY["start_date"]],
+                ["Expires", POLICY["expiry_date"]],
+                ["Registration number", POLICY["vrm"]],
+            ],
+            [55 * mm, 115 * mm],
+        ),
+        Paragraph("Persons entitled to drive", s["h2"]),
+        _table(
+            [
+                ["Name", "Entitlement"],
+                [POLICY["policyholder"], "The policyholder may drive the insured vehicle."],
+                [POLICY["second_driver"], "Named driver may drive the insured vehicle."],
+                [POLICY["third_driver"], "Named driver may drive the insured vehicle."],
+            ],
+            [55 * mm, 115 * mm],
+        ),
+        Paragraph("Limitations as to use", s["h2"]),
+        Paragraph(POLICY["class_of_use"], s["body"]),
+        Paragraph("The policy does not provide cover for driving other cars.", s["body"]),
+        Spacer(1, 8),
+        Paragraph(
+            "This certificate is fictional and is provided only as a safe demonstration fixture for the PolicyTrace project.",
+            s["small"],
+        ),
+    ]
+    _doc(path, "Certificate of Motor Insurance - Demo").build(
+        story,
+        onFirstPage=lambda c, d: _draw_header(c, d, "Certificate of Motor Insurance"),
+        onLaterPages=lambda c, d: _draw_header(c, d, "Certificate of Motor Insurance"),
+    )
+
+
+def build_statement_of_fact() -> None:
+    s = _styles()
+    path = OUT_DIR / "Statement of Fact - Demo.pdf"
+    story = [
+        Paragraph("Statement of Fact", s["title"]),
+        Paragraph(
+            "These fictional facts were used to calculate the demo insurance premium.",
+            s["subtitle"],
+        ),
+        _table(
+            [
+                ["Policy number", POLICY["policy_number"]],
+                ["Main driver", POLICY["policyholder"]],
+                ["Annual mileage", POLICY["annual_mileage"]],
+                ["Vehicle kept overnight", POLICY["kept_location"]],
+                ["Overnight postcode", POLICY["overnight_postcode"]],
+                ["Security device fitted", POLICY["security_device"]],
+                ["Tracker fitted", POLICY["tracker_fitted"]],
+                ["Modifications", POLICY["modifications"]],
+                ["Non-motoring convictions", "No"],
+                ["Endorsements", "None"],
+                ["Claims in last five years", "None"],
+            ],
+            [58 * mm, 112 * mm],
+        ),
+    ]
+    _doc(path, "Statement of Fact - Demo").build(
+        story,
+        onFirstPage=lambda c, d: _draw_header(c, d, "Statement of Fact"),
+        onLaterPages=lambda c, d: _draw_header(c, d, "Statement of Fact"),
+    )
+
+
+def build_policy_booklet() -> None:
+    s = _styles()
+    path = OUT_DIR / "Policy Booklet - Demo.pdf"
+    story = [
+        Paragraph("Motor Insurance Policy Booklet", s["title"]),
+        Paragraph(
+            "This booklet describes generic terms for a fictional motor insurance product. "
+            "It intentionally contains little policyholder-specific data.",
+            s["subtitle"],
+        ),
+        Paragraph("What is covered", s["h2"]),
+        Paragraph(
+            "Comprehensive cover may include damage to your vehicle, fire, theft, windscreen cover, "
+            "and third-party liability, subject to the terms and exclusions in this booklet.",
+            s["body"],
+        ),
+        Paragraph("Claims", s["h2"]),
+        Paragraph(
+            "You must tell Northbridge Mutual Motor Insurance Ltd about any accident or loss as soon as possible. "
+            "We may ask for evidence, photographs, repair estimates, or further information.",
+            s["body"],
+        ),
+        Paragraph("General exclusions", s["h2"]),
+        Paragraph(
+            "No cover is provided where the vehicle is used outside the permitted class of use, "
+            "where the driver is not entitled to drive, or where policy information is materially incorrect.",
+            s["body"],
+        ),
+        Paragraph("Complaints", s["h2"]),
+        Paragraph(
+            "If you are unhappy with our service, contact the fictional complaints team at Northbridge Mutual.",
+            s["body"],
+        ),
+    ]
+    _doc(path, "Policy Booklet - Demo").build(
+        story,
+        onFirstPage=lambda c, d: _draw_header(c, d, "Policy Booklet"),
+        onLaterPages=lambda c, d: _draw_header(c, d, "Policy Booklet"),
+    )
+
+
+def write_manifest() -> None:
+    manifest = {
+        "purpose": "Synthetic demo data for AI Tool Stack PolicyTrace.",
+        "warning": "No real customer, insurer, vehicle, or policy data is included.",
+        "files": [
+            "Schedule of Insurance - Demo.pdf",
+            "Certificate of Motor Insurance - Demo.pdf",
+            "Statement of Fact - Demo.pdf",
+            "Policy Booklet - Demo.pdf",
+        ],
+        "expected_policy_number": POLICY["policy_number"],
+        "expected_vrm": POLICY["vrm"],
+        "expected_insurer": POLICY["insurer"],
+    }
+    (OUT_DIR / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
+
+
+def main() -> None:
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    build_schedule()
+    build_certificate()
+    build_statement_of_fact()
+    build_policy_booklet()
+    write_manifest()
+    print(f"Synthetic demo pack written to {OUT_DIR.resolve()}")
+
+
+if __name__ == "__main__":
+    main()
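Review note: the script passes explicit column widths to every table, so it is worth checking them against the page frame. The sketch below does only the arithmetic and does not import ReportLab. The widths are copied from the `_table(...)` calls above, the table labels are my own, and the check ignores any inner frame padding, so treat the result as an approximation. A table wider than the frame may extend past the right margin.

```python
A4_WIDTH_MM = 210.0  # ISO 216 A4 width
MARGIN_MM = 18.0     # leftMargin and rightMargin passed to SimpleDocTemplate

# Column widths (mm) copied from the _table(...) calls in the script.
TABLES = {
    "Policy summary": [33, 52, 33, 52],
    "Policyholder details": [55, 115],
    "Vehicle details": [38, 48, 38, 48],
    "Excess breakdown": [90, 50],
    "Driver details": [30, 24, 24, 31, 31, 18, 22],
    "Statement of Fact": [58, 112],
}


def overflowing_tables(tables, page_width=A4_WIDTH_MM, margin=MARGIN_MM):
    """Return {name: total_width_mm} for tables wider than the nominal frame."""
    usable = page_width - 2 * margin  # 174 mm with the margins above
    return {name: sum(w) for name, w in tables.items() if sum(w) > usable}


print(overflowing_tables(TABLES))  # {'Driver details': 180}
```

On these numbers only the seven-column driver table (180 mm) exceeds the 174 mm nominal frame width; trimming a few millimetres from its widest columns would bring it back inside the margins.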
src/agents.py
ADDED
@@ -0,0 +1,530 @@
+"""
+agents.py — Specialist document extraction agents for UK Motor Insurance.
+
+Architecture
+────────────
+  PDF path
+    → docling (PDF → Markdown)
+    → PIIMasker.mask()
+    → InsuranceExtractionAgents.classify_document() [LLM: llama-3.1-8b-instant]
+    → extract_schedule() | extract_certificate() [LLM: llama-4-scout-17b]
+    → UKMotorGoldenRecord (with source_document provenance)
+"""
+from __future__ import annotations
+
+import json
+import logging
+import os
+import time
+from pathlib import Path
+from typing import Any
+
+import instructor
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from groq import Groq
+from pydantic import ValidationError
+
+from privacy import PIIMasker
+from prompts import PromptRegistry
+from schema import DocumentType, SourceMetadata, UKMotorGoldenRecord
+from settings import settings
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+# Groq clients — extraction (instructor-wrapped) + classifier (raw Groq)
+# ---------------------------------------------------------------------------
+
+
+def _build_extraction_client() -> instructor.Instructor:
+    api_key = os.environ.get("GROQ_API_KEY")
+    if not api_key:
+        raise EnvironmentError(
+            "GROQ_API_KEY environment variable is not set. "
+            "Export it before running the pipeline."
+        )
+    return instructor.from_groq(Groq(api_key=api_key), mode=instructor.Mode.JSON)
+
+
+def _build_groq_client() -> Groq:
+    api_key = os.environ.get("GROQ_API_KEY")
+    if not api_key:
+        raise EnvironmentError(
+            "GROQ_API_KEY environment variable is not set. "
+            "Export it before running the pipeline."
+        )
+    return Groq(api_key=api_key)
+
+
+# Models resolved at import time from settings.yaml / env vars
+_EXTRACTION_MODEL: str = settings.llm.model
+_CLASSIFIER_MODEL: str = settings.llm.classifier_model
+
+
+def _build_docling_converter() -> DocumentConverter:
+    """Build a DocumentConverter configured from settings.docling."""
+    opts = PdfPipelineOptions()
+    opts.do_ocr = settings.docling.do_ocr
+    opts.do_table_structure = settings.docling.do_table_structure
+    return DocumentConverter(
+        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
+    )
+
+# ---------------------------------------------------------------------------
+# Document type classifier (keyword heuristic — fast, zero API calls)
+# ---------------------------------------------------------------------------
+
+_CLASSIFICATION_KEYWORDS: dict[DocumentType, list[str]] = {
+    DocumentType.SCHEDULE: [
+        # Phrases that only appear in a Schedule, not in a Certificate
+        "policy schedule",
+        "schedule of insurance",
+        "schedule number",
+        "premium payable",
+        "compulsory excess",
+        "voluntary excess",
+        "no claims bonus",
+        "ncb",
+        "windscreen excess",
+    ],
+    DocumentType.CERTIFICATE: [
+        # Phrases that are definitive for a Certificate document
+        "certificate of motor insurance",
+        "motor insurance certificate",
+        "certificate number",
+        "persons entitled to drive",
+        "class of use",
+        "road traffic act",
+        "this is to certify",
+    ],
+    DocumentType.STATEMENT_OF_FACT: [
+        "statement of fact",
+        "statement of insurance",
+        "proposal form",
+        "claims history",
+        "motoring convictions",
+        "annual mileage",
+    ],
+    DocumentType.POLICY_BOOKLET: [
+        "policy booklet",
+        "policy wording",
+        "terms and conditions",
+        "what is covered",
+        "general conditions",
+        "complaints procedure",
+    ],
+}
+
+
+def _keyword_classify(text: str) -> str:
+    """Keyword heuristic fallback classifier. Returns DocumentType.value string."""
+    lower = text.lower()
+    scores: dict[DocumentType, int] = {dt: 0 for dt in _CLASSIFICATION_KEYWORDS}
+
+    for doc_type, keywords in _CLASSIFICATION_KEYWORDS.items():
+        for kw in keywords:
+            if kw in lower:
+                scores[doc_type] += 1
+
+    best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
+    return best_type.value if best_score > 0 else DocumentType.UNKNOWN.value
+
+
+def _str_to_doc_type(s: str) -> DocumentType:
+    """Convert a string to DocumentType, falling back to UNKNOWN."""
+    try:
+        return DocumentType(s)
+    except ValueError:
+        return DocumentType.UNKNOWN
+
+
+# ---------------------------------------------------------------------------
+# Extraction failure sentinel
+# ---------------------------------------------------------------------------
+
+
+class ExtractionFailedError(RuntimeError):
+    """
+    Raised when the LLM fails to produce a valid UKMotorGoldenRecord after
+    exhausting all retries. Callers should treat the document as failed and
+    skip it rather than propagating an empty record silently.
+    """
+
+
+# ---------------------------------------------------------------------------
+# InsuranceExtractionAgents
+# ---------------------------------------------------------------------------
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
class InsuranceExtractionAgents:
|
| 162 |
+
"""
|
| 163 |
+
Specialist extraction agents for UK Motor Insurance documents.
|
| 164 |
+
|
| 165 |
+
Uses two LLM models:
|
| 166 |
+
- llama-3.1-8b-instant — fast document type classification
|
| 167 |
+
- llama-4-scout-17b-16e — deep structured extraction (Schedule / Certificate)
|
| 168 |
+
|
| 169 |
+
Parameters
|
| 170 |
+
----------
|
| 171 |
+
masker : PIIMasker | None
|
| 172 |
+
max_retries : int
|
| 173 |
+
prompt_registry : PromptRegistry | None
|
| 174 |
+
debug_dir : Path | None
|
| 175 |
+
"""
|
| 176 |
+
|
| 177 |
+
def __init__(
|
| 178 |
+
self,
|
| 179 |
+
masker: PIIMasker | None = None,
|
| 180 |
+
max_retries: int = settings.llm.max_retries,
|
| 181 |
+
prompt_registry: PromptRegistry | None = None,
|
| 182 |
+
debug_dir: Path | None = None,
|
| 183 |
+
) -> None:
|
| 184 |
+
self._client = _build_extraction_client()
|
| 185 |
+
self._groq = _build_groq_client()
|
| 186 |
+
self._masker = masker or PIIMasker()
|
| 187 |
+
self._max_retries = max_retries
|
| 188 |
+
self._prompts = prompt_registry or PromptRegistry()
|
| 189 |
+
self._converter = _build_docling_converter()
|
| 190 |
+
self._debug_dir = debug_dir
|
| 191 |
+
|
| 192 |
+
# ------------------------------------------------------------------
|
| 193 |
+
# Public API
|
| 194 |
+
# ------------------------------------------------------------------
|
| 195 |
+
|
| 196 |
+
def classify_document(self, markdown_text: str) -> str:
|
| 197 |
+
"""
|
| 198 |
+
Classify the document type with the configured classifier model (settings.llm.classifier_model).
|
| 199 |
+
|
| 200 |
+
The LLM is the primary classifier. If it fails or returns an invalid
|
| 201 |
+
label, the keyword heuristic is used as a fallback. A discrepancy
|
| 202 |
+
between the two is logged as a warning to flag low-confidence cases.
|
| 203 |
+
|
| 204 |
+
Returns one of: "Schedule", "Certificate", "StatementOfFact",
|
| 205 |
+
"PolicyBooklet", "Unknown".
|
| 206 |
+
"""
|
| 207 |
+
keyword_result = _keyword_classify(markdown_text)
|
| 208 |
+
|
| 209 |
+
system_prompt = (
|
| 210 |
+
"You are a UK motor insurance document classifier.\n"
|
| 211 |
+
"Given the document text, respond with EXACTLY one word from:\n"
|
| 212 |
+
"Schedule | Certificate | StatementOfFact | PolicyBooklet | Unknown\n\n"
|
| 213 |
+
"- Schedule: Policy Schedule \u2014 excess figures, premium, NCB, "
|
| 214 |
+
"vehicle details, driver ages/DOBs.\n"
|
| 215 |
+
"- Certificate: Certificate of Motor Insurance \u2014 Road Traffic Act, "
|
| 216 |
+
"'persons entitled to drive', 'class of use'.\n"
|
| 217 |
+
"- StatementOfFact: Statement of Fact / Proposal \u2014 claims history, "
|
| 218 |
+
"convictions, annual mileage.\n"
|
| 219 |
+
"- PolicyBooklet: Policy Booklet / Wording \u2014 terms and conditions, "
|
| 220 |
+
"'what is covered', complaints.\n"
|
| 221 |
+
"- Unknown: Cannot determine.\n\n"
|
| 222 |
+
"Respond with ONLY the single classification word. No punctuation."
|
| 223 |
+
)
|
| 224 |
+
try:
|
| 225 |
+
response = self._groq.chat.completions.create(
|
| 226 |
+
model=_CLASSIFIER_MODEL,
|
| 227 |
+
messages=[
|
| 228 |
+
{"role": "system", "content": system_prompt},
|
| 229 |
+
{
|
| 230 |
+
"role": "user",
|
| 231 |
+
"content": "Classify this document:\n\n" + markdown_text[:4000],
|
| 232 |
+
},
|
| 233 |
+
],
|
| 234 |
+
max_tokens=10,
|
| 235 |
+
temperature=0,
|
| 236 |
+
)
|
| 237 |
+
llm_result = response.choices[0].message.content.strip().split()[0]
|
| 238 |
+
valid = {"Schedule", "Certificate", "StatementOfFact", "PolicyBooklet", "Unknown"}
|
| 239 |
+
if llm_result in valid:
|
| 240 |
+
if llm_result != keyword_result:
|
| 241 |
+
logger.warning(
|
| 242 |
+
"Classifier discrepancy: LLM=%s, keyword=%s "
|
| 243 |
+
"(using LLM result — verify document type)",
|
| 244 |
+
llm_result, keyword_result,
|
| 245 |
+
)
|
| 246 |
+
else:
|
| 247 |
+
logger.debug("Classifier agreement: LLM=%s \u2713", llm_result)
|
| 248 |
+
return llm_result
|
| 249 |
+
logger.warning(
|
| 250 |
+
"LLM classifier returned '%s' \u2014 falling back to keyword heuristic", llm_result
|
| 251 |
+
)
|
| 252 |
+
except Exception as exc: # noqa: BLE001
|
| 253 |
+
logger.warning(
|
| 254 |
+
"LLM classifier failed (%s) \u2014 falling back to keyword heuristic", exc
|
| 255 |
+
)
|
| 256 |
+
return keyword_result
|
| 257 |
+
|
| 258 |
+
def extract_schedule(self, markdown_text: str, filename: str) -> UKMotorGoldenRecord:
|
| 259 |
+
"""
|
| 260 |
+
Extract all financial, vehicle, and driver risk data from a Policy Schedule.
|
| 261 |
+
|
| 262 |
+
Instructs the LLM to:
|
| 263 |
+
- Compute total_accidental_damage = standard_compulsory + voluntary
|
| 264 |
+
- Extract driver DOBs and distinguish Full UK vs UK Provisional licence types
|
| 265 |
+
- Separate fire excess from theft excess (they can differ)
|
| 266 |
+
- Extract own_repairer_additional_excess if present
|
| 267 |
+
- Extract premium breakdown and optional extras (float if purchased,
|
| 268 |
+
"Not Selected" if not)
|
| 269 |
+
"""
|
| 270 |
+
return self._extract(
|
| 271 |
+
markdown_text,
|
| 272 |
+
filename,
|
| 273 |
+
DocumentType.SCHEDULE,
|
| 274 |
+
self._prompts.get(DocumentType.SCHEDULE),
|
| 275 |
+
)
|
| 276 |
+
|
| 277 |
+
def extract_certificate(self, markdown_text: str, filename: str) -> UKMotorGoldenRecord:
|
| 278 |
+
"""
|
| 279 |
+
Extract legal permissions from a Certificate of Motor Insurance.
|
| 280 |
+
|
| 281 |
+
Instructs the LLM to:
|
| 282 |
+
- Extract the exact "Limitations as to use" / class_of_use clause verbatim
|
| 283 |
+
- Extract the policy_number for cross-reference
|
| 284 |
+
- Record driving_other_cars entitlement (true/false)
|
| 285 |
+
- Leave all financial fields (excess, premium, NCB) as null
|
| 286 |
+
"""
|
| 287 |
+
return self._extract(
|
| 288 |
+
markdown_text,
|
| 289 |
+
filename,
|
| 290 |
+
DocumentType.CERTIFICATE,
|
| 291 |
+
self._prompts.get(DocumentType.CERTIFICATE),
|
| 292 |
+
)
|
| 293 |
+
|
| 294 |
+
def process(self, pdf_path: str | Path) -> tuple[UKMotorGoldenRecord, str]:
|
| 295 |
+
"""
|
| 296 |
+
Full pipeline for one PDF: PDF → Markdown → PII mask → classify → extract.
|
| 297 |
+
|
| 298 |
+
Returns
|
| 299 |
+
-------
|
| 300 |
+
tuple[UKMotorGoldenRecord, str]
|
| 301 |
+
The extracted record and the document type string (e.g. "Schedule").
|
| 302 |
+
|
| 303 |
+
Raises
|
| 304 |
+
------
|
| 305 |
+
ExtractionFailedError
|
| 306 |
+
When the LLM fails to extract a valid record after all retries.
|
| 307 |
+
"""
|
| 308 |
+
record, doc_type_str, _ = self._process_internal(Path(pdf_path), build_corpus=False)
|
| 309 |
+
return record, doc_type_str
|
| 310 |
+
|
| 311 |
+
# ------------------------------------------------------------------
|
| 312 |
+
# Private helpers
|
| 313 |
+
# ------------------------------------------------------------------
|
| 314 |
+
|
| 315 |
+
def _process_internal(
|
| 316 |
+
self,
|
| 317 |
+
pdf_path: Path,
|
| 318 |
+
build_corpus: bool,
|
| 319 |
+
) -> tuple[UKMotorGoldenRecord, str, Any]:
|
| 320 |
+
"""
|
| 321 |
+
Unified core pipeline: PDF → Markdown → PII mask → classify → extract,
|
| 322 |
+
optionally building a ProvenanceCorpus from the raw Docling IR.
|
| 323 |
+
|
| 324 |
+
Parameters
|
| 325 |
+
----------
|
| 326 |
+
pdf_path : Path
|
| 327 |
+
build_corpus : bool
|
| 328 |
+
When True, builds a ProvenanceCorpus before PII masking so the
|
| 329 |
+
original text is available for fuzzy matching.
|
| 330 |
+
|
| 331 |
+
Returns
|
| 332 |
+
-------
|
| 333 |
+
tuple[UKMotorGoldenRecord, str, ProvenanceCorpus | None]
|
| 334 |
+
(record, doc_type_str, corpus_or_None)
|
| 335 |
+
|
| 336 |
+
Raises
|
| 337 |
+
------
|
| 338 |
+
ExtractionFailedError
|
| 339 |
+
Propagated from _extract() when the LLM fails after all retries.
|
| 340 |
+
"""
|
| 341 |
+
from provenance import ProvenanceCorpus # local import — avoids circular dep
|
| 342 |
+
|
| 343 |
+
logger.info("Processing%s: %s", " (with provenance)" if build_corpus else "", pdf_path.name)
|
| 344 |
+
|
| 345 |
+
# Pre-classify from filename for page-cap selection (no API call)
|
| 346 |
+
pre_type_str = _keyword_classify(pdf_path.stem)
|
| 347 |
+
pre_doc_type = _str_to_doc_type(pre_type_str)
|
| 348 |
+
logger.debug(" Pre-classified from filename: %s", pre_type_str)
|
| 349 |
+
|
| 350 |
+
# PDF → Markdown + raw DoclingDocument
|
| 351 |
+
markdown, raw_doc = self._pdf_to_markdown_and_doc(pdf_path, pre_doc_type)
|
| 352 |
+
|
| 353 |
+
# Build corpus from original text BEFORE masking (critical for accurate fuzzy match)
|
| 354 |
+
corpus: Any = None
|
| 355 |
+
if build_corpus:
|
| 356 |
+
corpus = ProvenanceCorpus(source_filename=pdf_path.name, doc_type=pre_type_str)
|
| 357 |
+
corpus.add_from_docling(raw_doc, pdf_path.name)
|
| 358 |
+
logger.debug(" Provenance corpus: %d items", len(corpus.items))
|
| 359 |
+
|
| 360 |
+
if self._debug_dir and settings.debug.save_markdown:
|
| 361 |
+
_write_debug(self._debug_dir, f"{pdf_path.name}.md", markdown)
|
| 362 |
+
|
| 363 |
+
# PII mask
|
| 364 |
+
masked_markdown, _token_map = self._masker.mask(markdown)
|
| 365 |
+
if self._debug_dir and settings.debug.save_masked_markdown:
|
| 366 |
+
_write_debug(self._debug_dir, f"{pdf_path.name}.masked.md", masked_markdown)
|
| 367 |
+
|
| 368 |
+
# Classify
|
| 369 |
+
t0 = time.monotonic()
|
| 370 |
+
doc_type_str = self.classify_document(masked_markdown)
|
| 371 |
+
logger.info(" Classified as: %s", doc_type_str)
|
| 372 |
+
|
| 373 |
+
# Route to specialist extractor
|
| 374 |
+
if doc_type_str == "Schedule":
|
| 375 |
+
record = self.extract_schedule(masked_markdown, pdf_path.name)
|
| 376 |
+
elif doc_type_str == "Certificate":
|
| 377 |
+
record = self.extract_certificate(masked_markdown, pdf_path.name)
|
| 378 |
+
else:
|
| 379 |
+
logger.info(" Non-primary type '%s' — running generic extraction", doc_type_str)
|
| 380 |
+
record = self._extract(
|
| 381 |
+
masked_markdown,
|
| 382 |
+
pdf_path.name,
|
| 383 |
+
_str_to_doc_type(doc_type_str),
|
| 384 |
+
self._prompts.get(_str_to_doc_type(doc_type_str)),
|
| 385 |
+
)
|
| 386 |
+
|
| 387 |
+
elapsed = round(time.monotonic() - t0, 3)
|
| 388 |
+
|
| 389 |
+
record.source_document = SourceMetadata(
|
| 390 |
+
document_type=_str_to_doc_type(doc_type_str),
|
| 391 |
+
filename=pdf_path.name,
|
| 392 |
+
)
|
| 393 |
+
|
| 394 |
+
if self._debug_dir and settings.debug.save_extraction_json:
|
| 395 |
+
_write_debug(
|
| 396 |
+
self._debug_dir,
|
| 397 |
+
f"{pdf_path.name}.extraction.json",
|
| 398 |
+
record.model_dump_json(indent=2),
|
| 399 |
+
)
|
| 400 |
+
fc = getattr(record, "field_citations", None) or {}
|
| 401 |
+
logger.info(" field_citations populated by LLM: %d entries", len(fc))
|
| 402 |
+
if fc:
|
| 403 |
+
import json as _json
|
| 404 |
+
_write_debug(
|
| 405 |
+
self._debug_dir,
|
| 406 |
+
f"{pdf_path.name}.field_citations.json",
|
| 407 |
+
_json.dumps(fc, indent=2, ensure_ascii=False),
|
| 408 |
+
)
|
| 409 |
+
|
| 410 |
+
if self._debug_dir and settings.debug.save_metrics:
|
| 411 |
+
metrics: dict = {
|
| 412 |
+
"filename": pdf_path.name,
|
| 413 |
+
"doc_type": doc_type_str,
|
| 414 |
+
"extraction_model": _EXTRACTION_MODEL,
|
| 415 |
+
"classifier_model": _CLASSIFIER_MODEL,
|
| 416 |
+
"response_time_seconds": elapsed,
|
| 417 |
+
}
|
| 418 |
+
if corpus is not None:
|
| 419 |
+
metrics["corpus_items"] = len(corpus.items)
|
| 420 |
+
_append_metrics(self._debug_dir, metrics)
|
| 421 |
+
|
| 422 |
+
return record, doc_type_str, corpus
|
| 423 |
+
|
| 424 |
+
def _pdf_to_markdown(
|
| 425 |
+
self, pdf_path: Path, doc_type: DocumentType = DocumentType.UNKNOWN
|
| 426 |
+
) -> str:
|
| 427 |
+
"""Convert a PDF to Markdown using docling, respecting per-doc-type page caps."""
|
| 428 |
+
markdown, _ = self._pdf_to_markdown_and_doc(pdf_path, doc_type)
|
| 429 |
+
return markdown
|
| 430 |
+
|
| 431 |
+
def _pdf_to_markdown_and_doc(
|
| 432 |
+
self, pdf_path: Path, doc_type: DocumentType = DocumentType.UNKNOWN
|
| 433 |
+
) -> tuple[str, Any]:
|
| 434 |
+
"""Convert PDF to Markdown and also return the raw DoclingDocument for provenance."""
|
| 435 |
+
# Apply page cap during conversion (not just in Markdown export) to prevent
|
| 436 |
+
# Docling's layout model from running out of memory on large PDFs (Policy Booklet).
|
| 437 |
+
max_pg = settings.docling.max_pages.get(doc_type.value)
|
| 438 |
+
convert_kwargs: dict[str, Any] = {}
|
| 439 |
+
if max_pg is not None:
|
| 440 |
+
convert_kwargs["max_num_pages"] = max_pg
|
| 441 |
+
|
| 442 |
+
result = self._converter.convert(str(pdf_path), **convert_kwargs)
|
| 443 |
+
doc = result.document
|
| 444 |
+
markdown = doc.export_to_markdown()
|
| 445 |
+
|
| 446 |
+
if max_pg is not None:
|
| 447 |
+
separator = "\n---\n"
|
| 448 |
+
parts = markdown.split(separator)
|
| 449 |
+
if len(parts) > max_pg:
|
| 450 |
+
logger.info(
|
| 451 |
+
" Page cap applied: %s capped at %d/%d pages",
|
| 452 |
+
pdf_path.name, max_pg, len(parts),
|
| 453 |
+
)
|
| 454 |
+
markdown = separator.join(parts[:max_pg])
|
| 455 |
+
|
| 456 |
+
return markdown, doc
|
| 457 |
+
|
| 458 |
+
def process_with_provenance(
|
| 459 |
+
self, pdf_path: str | Path
|
| 460 |
+
) -> tuple[UKMotorGoldenRecord, str, Any]:
|
| 461 |
+
"""
|
| 462 |
+
Like process() but also returns a ProvenanceCorpus built from the Docling IR.
|
| 463 |
+
|
| 464 |
+
The corpus is constructed *before* PII masking so that the original text
|
| 465 |
+
strings (not masked tokens) are available for fuzzy matching.
|
| 466 |
+
|
| 467 |
+
Returns
|
| 468 |
+
-------
|
| 469 |
+
tuple[UKMotorGoldenRecord, str, ProvenanceCorpus]
|
| 470 |
+
(record, doc_type_str, corpus)
|
| 471 |
+
|
| 472 |
+
Raises
|
| 473 |
+
------
|
| 474 |
+
ExtractionFailedError
|
| 475 |
+
When the LLM fails to extract a valid record after all retries.
|
| 476 |
+
"""
|
| 477 |
+
return self._process_internal(Path(pdf_path), build_corpus=True) # type: ignore[return-value]
|
| 478 |
+
|
| 479 |
+
def _extract(
|
| 480 |
+
self,
|
| 481 |
+
text: str,
|
| 482 |
+
filename: str,
|
| 483 |
+
doc_type: DocumentType,
|
| 484 |
+
system_prompt: str,
|
| 485 |
+
) -> UKMotorGoldenRecord:
|
| 486 |
+
"""Call Groq via instructor to extract a UKMotorGoldenRecord."""
|
| 487 |
+
user_message = (
|
| 488 |
+
"Extract all motor insurance data from the following document text. "
|
| 489 |
+
"Return a JSON object that strictly conforms to the UKMotorGoldenRecord schema.\n\n"
|
| 490 |
+
f"--- DOCUMENT TEXT ---\n{text}\n--- END ---"
|
| 491 |
+
)
|
| 492 |
+
try:
|
| 493 |
+
record: UKMotorGoldenRecord = self._client.chat.completions.create(
|
| 494 |
+
model=_EXTRACTION_MODEL,
|
| 495 |
+
response_model=UKMotorGoldenRecord,
|
| 496 |
+
max_retries=self._max_retries,
|
| 497 |
+
messages=[
|
| 498 |
+
{"role": "system", "content": system_prompt.strip()},
|
| 499 |
+
{"role": "user", "content": user_message},
|
| 500 |
+
],
|
| 501 |
+
)
|
| 502 |
+
except Exception as exc:  # Exception already covers ValidationError
|
| 503 |
+
raise ExtractionFailedError(
|
| 504 |
+
f"Extraction failed for {doc_type.value!r} document '{filename}' "
|
| 505 |
+
f"after {self._max_retries} retries: {exc}"
|
| 506 |
+
) from exc
|
| 507 |
+
return record
|
| 508 |
+
|
| 509 |
+
|
| 510 |
+
# ---------------------------------------------------------------------------
|
| 511 |
+
# Debug helpers (module-level so they can be unit-tested independently)
|
| 512 |
+
# ---------------------------------------------------------------------------
|
| 513 |
+
|
| 514 |
+
|
| 515 |
+
def _write_debug(debug_dir: Path, filename: str, content: str) -> None:
|
| 516 |
+
"""Write a debug artifact to disk; on an I/O error, log a warning and continue."""
|
| 517 |
+
try:
|
| 518 |
+
(debug_dir / filename).write_text(content, encoding="utf-8")
|
| 519 |
+
logger.debug("Debug artifact saved: %s", filename)
|
| 520 |
+
except OSError as exc:
|
| 521 |
+
logger.warning("Could not write debug artifact %s: %s", filename, exc)
|
| 522 |
+
|
| 523 |
+
|
| 524 |
+
def _append_metrics(debug_dir: Path, metrics: dict) -> None:
|
| 525 |
+
"""Append a metrics dict as a JSONL line to extraction_metrics.jsonl."""
|
| 526 |
+
try:
|
| 527 |
+
with (debug_dir / "extraction_metrics.jsonl").open("a", encoding="utf-8") as fh:
|
| 528 |
+
fh.write(json.dumps(metrics) + "\n")
|
| 529 |
+
except OSError as exc:
|
| 530 |
+
logger.warning("Could not write metrics: %s", exc)
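
The classification flow in `classify_document` (LLM first, keyword heuristic as fallback) can be sketched in isolation. This is a minimal, self-contained illustration under stated assumptions, not the project's implementation: `KEYWORDS` is a hypothetical, trimmed-down table, and `llm` is a hypothetical callable standing in for the Groq chat call.

```python
from typing import Callable

VALID_LABELS = {"Schedule", "Certificate", "StatementOfFact", "PolicyBooklet", "Unknown"}

# Hypothetical, trimmed-down keyword table (illustration only).
KEYWORDS: dict[str, list[str]] = {
    "Schedule": ["policy schedule", "voluntary excess"],
    "Certificate": ["certificate of motor insurance", "class of use"],
}


def keyword_classify(text: str) -> str:
    """Score each label by how many of its keywords occur in the text."""
    lower = text.lower()
    scores = {label: sum(kw in lower for kw in kws) for label, kws in KEYWORDS.items()}
    # max() returns the first maximal entry, so ties go to the earlier dict key.
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score > 0 else "Unknown"


def classify(text: str, llm: Callable[[str], str]) -> str:
    """Prefer the LLM label; fall back to the keyword heuristic on any failure."""
    fallback = keyword_classify(text)
    try:
        words = llm(text).strip().split()
        label = words[0] if words else ""
    except Exception:  # stand-in for network or API errors
        return fallback
    return label if label in VALID_LABELS else fallback
```

One consequence of scoring with `max()` is that a tie between two document types resolves to whichever key was inserted first, so the order of the keyword table matters whenever the heuristic is the deciding vote.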
|
src/api.py
ADDED
|
@@ -0,0 +1,372 @@
|
| 1 |
+
"""
|
| 2 |
+
api.py — FastAPI server for the UK Motor Insurance Visual Audit Review UI.
|
| 3 |
+
|
| 4 |
+
Endpoints
|
| 5 |
+
─────────
|
| 6 |
+
GET /api/health
|
| 7 |
+
POST /api/process — upload PDFs, run pipeline, return session_id
|
| 8 |
+
GET /api/session/{id} — full GoldenRecordWithProvenance JSON
|
| 9 |
+
GET /api/pdf/{session_id}/{file} — serve source PDF (path-traversal safe)
|
| 10 |
+
PATCH /api/session/{id}/review — log a verify / override action
|
| 11 |
+
GET /api/session/{id}/review-state — current review state for the session
|
| 12 |
+
|
| 13 |
+
Run (from project root)
|
| 14 |
+
───────────────────────
|
| 15 |
+
uvicorn api:app --app-dir src --reload --port 8000
|
| 16 |
+
|
| 17 |
+
Or directly:
|
| 18 |
+
python src/api.py
|
| 19 |
+
"""
|
| 20 |
+
from __future__ import annotations
|
| 21 |
+
|
| 22 |
+
import json
|
| 23 |
+
import logging
|
| 24 |
+
import sys
|
| 25 |
+
import uuid
|
| 26 |
+
from datetime import datetime
|
| 27 |
+
from pathlib import Path
|
| 28 |
+
from typing import Optional
|
| 29 |
+
|
| 30 |
+
# ── Ensure src/ is on sys.path so sibling modules resolve regardless of CWD ─
|
| 31 |
+
sys.path.insert(0, str(Path(__file__).parent))
|
| 32 |
+
|
| 33 |
+
import uvicorn
|
| 34 |
+
from fastapi import FastAPI, File, HTTPException, UploadFile
|
| 35 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 36 |
+
from fastapi.responses import FileResponse, JSONResponse
|
| 37 |
+
from fastapi.staticfiles import StaticFiles
|
| 38 |
+
from pydantic import BaseModel
|
| 39 |
+
|
| 40 |
+
from agents import InsuranceExtractionAgents
|
| 41 |
+
from pipeline import run_extraction_pipeline
|
| 42 |
+
from privacy import PIIMasker
|
| 43 |
+
from provenance import build_provenance
|
| 44 |
+
from schema import GoldenRecordWithProvenance, UKMotorGoldenRecord
|
| 45 |
+
from settings import settings
|
| 46 |
+
|
| 47 |
+
logger = logging.getLogger(__name__)
|
| 48 |
+
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s — %(message)s")
|
| 49 |
+
|
| 50 |
+
# ---------------------------------------------------------------------------
|
| 51 |
+
# Session storage directory (project_root/output/sessions/<timestamp>_<uuid>/)
|
| 52 |
+
# Debug artifacts directory (project_root/output/debug/run_<timestamp>/)
|
| 53 |
+
# ---------------------------------------------------------------------------
|
| 54 |
+
|
| 55 |
+
_SESSION_DIR = Path(__file__).parent.parent / "output" / "sessions"
|
| 56 |
+
_SESSION_DIR.mkdir(parents=True, exist_ok=True)
|
| 57 |
+
|
| 58 |
+
_DEBUG_DIR = Path(__file__).parent.parent / "output" / "debug"
|
| 59 |
+
_DEBUG_DIR.mkdir(parents=True, exist_ok=True)
|
| 60 |
+
|
| 61 |
+
_STATIC_DIR = Path(__file__).parent.parent / "ui" / "dist"
|
| 62 |
+
|
| 63 |
+
# ---------------------------------------------------------------------------
|
| 64 |
+
# App
|
| 65 |
+
# ---------------------------------------------------------------------------
|
| 66 |
+
|
| 67 |
+
app = FastAPI(
|
| 68 |
+
title="UK Motor Insurance IDP — Visual Audit API",
|
| 69 |
+
version="1.0.0",
|
| 70 |
+
description=(
|
| 71 |
+
"Backend for the Human-in-the-Loop review dashboard. "
|
| 72 |
+
"Runs the extraction pipeline and exposes session-based review endpoints."
|
| 73 |
+
),
|
| 74 |
+
)
|
| 75 |
+
|
| 76 |
+
app.add_middleware(
|
| 77 |
+
CORSMiddleware,
|
| 78 |
+
allow_origins=[
|
| 79 |
+
"http://localhost:5173",
|
| 80 |
+
"http://localhost:5174",
|
| 81 |
+
"http://127.0.0.1:5173",
|
| 82 |
+
"http://127.0.0.1:5174",
|
| 83 |
+
"http://localhost:3000",
|
| 84 |
+
],
|
| 85 |
+
allow_credentials=True,
|
| 86 |
+
allow_methods=["*"],
|
| 87 |
+
allow_headers=["*"],
|
| 88 |
+
)
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
@app.on_event("startup")
|
| 92 |
+
async def _cleanup_old_sessions() -> None:
|
| 93 |
+
"""Remove session directories older than settings.pipeline.session_ttl_days on startup."""
|
| 94 |
+
import shutil
|
| 95 |
+
ttl_days = settings.pipeline.session_ttl_days
|
| 96 |
+
if ttl_days <= 0:
|
| 97 |
+
return
|
| 98 |
+
from datetime import timedelta  # datetime is already imported at module level
|
| 99 |
+
cutoff = datetime.now() - timedelta(days=ttl_days)
|
| 100 |
+
removed = 0
|
| 101 |
+
for session_dir in _SESSION_DIR.iterdir():
|
| 102 |
+
if session_dir.is_dir():
|
| 103 |
+
mtime = datetime.fromtimestamp(session_dir.stat().st_mtime)
|
| 104 |
+
if mtime < cutoff:
|
| 105 |
+
shutil.rmtree(session_dir, ignore_errors=True)
|
| 106 |
+
removed += 1
|
| 107 |
+
if removed:
|
| 108 |
+
logger.info(
|
| 109 |
+
"Startup cleanup: removed %d session(s) older than %d day(s)",
|
| 110 |
+
removed, ttl_days,
|
| 111 |
+
)
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
# ---------------------------------------------------------------------------
|
| 115 |
+
# Helpers
|
| 116 |
+
# ---------------------------------------------------------------------------
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def _get_session_dir(session_id: str) -> Path:
|
| 120 |
+
"""Return session directory or raise 404.
|
| 121 |
+
|
| 122 |
+
Supports both old-style (uuid-only) and new-style (timestamp_uuid) folder names.
|
| 123 |
+
"""
|
| 124 |
+
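# Match the folder name exactly (old style) or by its "_<uuid>" suffix (new style); no glob, so wildcards or partial IDs in session_id cannot match another session
|
| 125 |
+
matches = [p for p in _SESSION_DIR.iterdir() if p.is_dir() and (p.name == session_id or p.name.endswith(f"_{session_id}"))]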
|
| 126 |
+
if matches:
|
| 127 |
+
return matches[0]
|
| 128 |
+
raise HTTPException(status_code=404, detail=f"Session '{session_id}' not found.")
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
def _count_leaves(obj: object) -> int:
|
| 132 |
+
if isinstance(obj, dict):
|
| 133 |
+
return sum(_count_leaves(v) for v in obj.values())
|
| 134 |
+
if isinstance(obj, list):
|
| 135 |
+
return sum(_count_leaves(v) for v in obj)
|
| 136 |
+
return 1
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
# ---------------------------------------------------------------------------
|
| 140 |
+
# Endpoints
|
| 141 |
+
# ---------------------------------------------------------------------------
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
@app.get("/api/health")
|
| 145 |
+
async def health():
|
| 146 |
+
return {"status": "ok", "version": "1.0.0"}
|
| 147 |
+
|
| 148 |
+
|
| 149 |
+
# ── POST /api/process ────────────────────────────────────────────────────────
|
| 150 |
+
|
| 151 |
+
class ProcessResponse(BaseModel):
|
| 152 |
+
session_id: str
|
| 153 |
+
fields_extracted: int
|
| 154 |
+
provenance_coverage: int # number of fields successfully located
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
@app.post("/api/process", response_model=ProcessResponse)
|
| 158 |
+
async def process_documents(files: list[UploadFile] = File(...)):
|
| 159 |
+
"""
|
| 160 |
+
Accept one or more PDF uploads, run the full extraction pipeline, and
|
| 161 |
+
persist a session containing the Golden Record + provenance index.
|
| 162 |
+
|
| 163 |
+
Returns a ``session_id`` which the UI uses for all subsequent requests.
|
| 164 |
+
|
| 165 |
+
Note: This endpoint is synchronous and may take 30–90 seconds depending
|
| 166 |
+
on Groq API response times.
|
| 167 |
+
"""
|
| 168 |
+
if not files:
|
| 169 |
+
raise HTTPException(status_code=400, detail="No files uploaded.")
|
| 170 |
+
|
| 171 |
+
pdf_files = [f for f in files if f.filename and f.filename.lower().endswith(".pdf")]
|
| 172 |
+
if not pdf_files:
|
| 173 |
+
raise HTTPException(status_code=400, detail="Only PDF files are accepted.")
|
| 174 |
+
|
| 175 |
+
# ── Create session directory (timestamp_uuid for easy sorting) ─────────
|
| 176 |
+
run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
|
| 177 |
+
session_id = str(uuid.uuid4())
|
| 178 |
+
session_folder = f"{run_ts}_{session_id}"
|
| 179 |
+
session_dir = _SESSION_DIR / session_folder
|
| 180 |
+
docs_dir = session_dir / "docs"
|
| 181 |
+
docs_dir.mkdir(parents=True, exist_ok=True)
|
| 182 |
+
|
| 183 |
+
# ── Create timestamped debug directory ────────────────────────────────
|
| 184 |
+
debug_dir: Path | None = None
|
| 185 |
+
if settings.debug.enabled:
|
| 186 |
+
debug_dir = _DEBUG_DIR / f"run_{run_ts}"
|
| 187 |
+
debug_dir.mkdir(parents=True, exist_ok=True)
|
| 188 |
+
logger.info("Debug artifacts → %s", debug_dir)
|
| 189 |
+
|
| 190 |
+
# ── Save uploaded PDFs (sanitise filenames) ───────────────────────────
|
| 191 |
+
pdf_paths: list[Path] = []
|
| 192 |
+
for upload in pdf_files:
|
| 193 |
+
safe_name = Path(upload.filename).name # strips directory components
|
| 194 |
+
dest = docs_dir / safe_name
|
| 195 |
+
dest.write_bytes(await upload.read())
|
| 196 |
+
pdf_paths.append(dest)
|
| 197 |
+
|
| 198 |
+
# ── Run pipeline with provenance ──────────────────────────────────────
|
| 199 |
+
masker = PIIMasker(mask_dates=settings.pii.mask_dates)
|
| 200 |
+
agent = InsuranceExtractionAgents(masker=masker, debug_dir=debug_dir)
|
| 201 |
+
|
| 202 |
+
golden, conflicts, corpora = run_extraction_pipeline(
|
| 203 |
+
pdf_paths=pdf_paths,
|
| 204 |
+
agent=agent,
|
| 205 |
+
with_provenance=True,
|
| 206 |
+
)
|
| 207 |
+
|
| 208 |
+
# ── Build provenance index ────────────────────────────────────────────
|
| 209 |
+
provenance_list = build_provenance(golden, corpora)
|
| 210 |
+
|
| 211 |
+
result = GoldenRecordWithProvenance(
|
| 212 |
+
record=golden,
|
| 213 |
+
provenance=provenance_list,
|
| 214 |
+
conflicts=conflicts,
|
| 215 |
+
session_id=session_id,
|
| 216 |
+
)
|
| 217 |
+
|
| 218 |
+
# ── Persist session ────────────────────────────────────────────────
|
| 219 |
+
(session_dir / "result.json").write_text(
|
| 220 |
+
result.model_dump_json(indent=2, exclude_none=True),
|
| 221 |
+
encoding="utf-8",
|
| 222 |
+
)
|
| 223 |
+
(session_dir / "review_state.json").write_text("{}", encoding="utf-8")
|
| 224 |
+
|
| 225 |
+
# Save field_citations sidecar so provenance can be re-built without re-running the LLM.
|
| 226 |
+
# (field_citations is excluded from result.json via Field(exclude=True) on the schema.)
|
| 227 |
+
fc = dict(getattr(golden, "field_citations", None) or {})
|
| 228 |
+
if fc:
|
| 229 |
+
(session_dir / "field_citations.json").write_text(
|
| 230 |
+
json.dumps(fc, indent=2, ensure_ascii=False), encoding="utf-8"
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
flat_fields = _count_leaves(golden.model_dump(exclude_none=True))
|
| 234 |
+
return ProcessResponse(
|
| 235 |
+
session_id=session_id,
|
| 236 |
+
fields_extracted=flat_fields,
|
| 237 |
+
provenance_coverage=len(provenance_list),
|
| 238 |
+
)
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
# ── GET /api/session/{session_id} ────────────────────────────────────────────
|
| 242 |
+
|
| 243 |
+
@app.get("/api/session/{session_id}")
|
| 244 |
+
async def get_session(session_id: str):
|
| 245 |
+
"""Return the full GoldenRecordWithProvenance for this session."""
|
| 246 |
+
session_dir = _get_session_dir(session_id)
|
| 247 |
+
result_file = session_dir / "result.json"
|
| 248 |
+
if not result_file.exists():
|
| 249 |
+
raise HTTPException(status_code=404, detail="Session result not yet available.")
|
| 250 |
+
return JSONResponse(content=json.loads(result_file.read_text(encoding="utf-8")))
|
| 251 |
+
|
| 252 |
+

# ── GET /api/pdf/{session_id}/{filename} ─────────────────────────────────────

@app.get("/api/pdf/{session_id}/{filename}")
async def serve_pdf(session_id: str, filename: str):
    """
    Serve a PDF from the session's docs directory.

    Path traversal is prevented by using only ``Path(filename).name``,
    which strips any directory components from the supplied filename.
    """
    session_dir = _get_session_dir(session_id)
    safe_name = Path(filename).name
    if not safe_name.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files can be served.")
    pdf_path = session_dir / "docs" / safe_name
    if not pdf_path.exists():
        raise HTTPException(status_code=404, detail=f"PDF '{safe_name}' not found in session.")
    return FileResponse(
        str(pdf_path),
        media_type="application/pdf",
        headers={"Content-Disposition": f'inline; filename="{safe_name}"'},
    )


# ── PATCH /api/session/{session_id}/review ───────────────────────────────────

class ReviewUpdate(BaseModel):
    field_path: str
    action: str  # "verify" | "reject" | "override"
    overridden_value: Optional[str] = None
    reviewer: Optional[str] = "anonymous"


@app.patch("/api/session/{session_id}/review")
async def update_review(session_id: str, update: ReviewUpdate):
    """Record a verify, reject, or override action for a specific field."""
    if update.action not in {"verify", "reject", "override"}:
        raise HTTPException(
            status_code=422,
            detail="action must be one of: verify, reject, override",
        )

    session_dir = _get_session_dir(session_id)
    state_file = session_dir / "review_state.json"
    state: dict = json.loads(state_file.read_text(encoding="utf-8")) if state_file.exists() else {}

    state[update.field_path] = {
        "action": update.action,
        "overridden_value": update.overridden_value,
        "reviewer": update.reviewer,
    }
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return {"ok": True, "field_path": update.field_path, "action": update.action}


# ── GET /api/session/{session_id}/review-state ───────────────────────────────

@app.get("/api/session/{session_id}/review-state")
async def get_review_state(session_id: str):
    """Return the current review state (verify/reject/override log) for the session."""
    session_dir = _get_session_dir(session_id)
    state_file = session_dir / "review_state.json"
    if not state_file.exists():
        return JSONResponse(content={})
    return JSONResponse(content=json.loads(state_file.read_text(encoding="utf-8")))


# ── DELETE /api/session/{session_id} ──────────────────────────────────────────

@app.delete("/api/session/{session_id}")
async def delete_session(session_id: str):
    """
    Permanently delete a session directory and all its contents.

    This removes the uploaded PDFs, the Golden Record JSON, the review state,
    and all debug artifacts for this session.
    """
    import shutil
    session_dir = _get_session_dir(session_id)
    shutil.rmtree(session_dir, ignore_errors=True)
    return {"ok": True, "session_id": session_id}


# ---------------------------------------------------------------------------
# Production UI hosting
# ---------------------------------------------------------------------------

if _STATIC_DIR.exists():
    assets_dir = _STATIC_DIR / "assets"
    if assets_dir.exists():
        app.mount("/assets", StaticFiles(directory=str(assets_dir)), name="assets")

    @app.get("/{full_path:path}", include_in_schema=False)
    async def serve_spa(full_path: str):
        """
        Serve the built React app when running as a single production service.

        Vite handles the frontend during local development. In Docker/Hugging
        Face deployments, the Dockerfile builds ui/dist and FastAPI serves it.
        Unknown non-API paths fall back to index.html so /session/{id} works
        after a hard refresh.
        """
        requested = (_STATIC_DIR / full_path).resolve()
        static_root = _STATIC_DIR.resolve()
        if (
            full_path
            and requested.is_file()
            and static_root in requested.parents
        ):
            return FileResponse(str(requested))
        return FileResponse(str(_STATIC_DIR / "index.html"))


# ---------------------------------------------------------------------------
# Dev entrypoint
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    import os

    port = int(os.environ.get("PORT", "8000"))
    uvicorn.run("api:app", host="0.0.0.0", port=port, reload=True)

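Reviewer note on `serve_pdf`: the docstring relies on `Path(filename).name` to drop directory components. The sketch below isolates that idea so it can be checked on its own. `safe_pdf_name` is a hypothetical helper, not part of `api.py`, and it uses `PurePosixPath` rather than `Path` (an assumption made here so the behaviour is the same on every platform).

```python
from pathlib import PurePosixPath
from typing import Optional


def safe_pdf_name(filename: str) -> Optional[str]:
    """Return the bare file name if it looks like a PDF, else None.

    Hypothetical helper for illustration; not part of api.py.
    """
    # .name keeps only the final path component, so any "../" prefix is dropped.
    name = PurePosixPath(filename).name
    if not name.lower().endswith(".pdf"):
        return None
    return name


if __name__ == "__main__":
    print(safe_pdf_name("schedule.pdf"))
    print(safe_pdf_name("../../other_session/docs/secret.PDF"))
    print(safe_pdf_name("../../etc/passwd"))
```

Because only the final component survives, joining the result onto the session's `docs` directory keeps the lookup inside that directory.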
src/arbiter.py
ADDED
|
@@ -0,0 +1,268 @@
"""
arbiter.py — Hierarchy of Truth merge for UK Motor Insurance.

The PolicyArbiter takes one Schedule extraction and one Certificate extraction
and produces a single authoritative UKMotorGoldenRecord.

    Document          Authoritative for
    ────────────────  ──────────────────────────────────────────────────
    Schedule          vehicle_details, excess_breakdown, financial_summary,
                      driver DOB / occupation / license_type, NCB, cover_type
    Certificate       class_of_use, driving_other_cars
"""
from __future__ import annotations

import logging
from typing import Optional

from schema import (
    ConflictEntry,
    CoverAndExcesses,
    Driver,
    ExcessBreakdown,
    NoClaimsDiscount,
    PeriodOfCover,
    PolicyHeader,
    UKMotorGoldenRecord,
)

logger = logging.getLogger(__name__)

# Minimum rapidfuzz token_sort_ratio to consider two driver names a match.
_DRIVER_NAME_MATCH_THRESHOLD = 85


# ---------------------------------------------------------------------------
# PolicyArbiter
# ---------------------------------------------------------------------------


class PolicyArbiter:
    """
    Merges a Schedule extraction and a Certificate extraction into one
    authoritative UKMotorGoldenRecord using the Hierarchy of Truth.

    Usage
    -----
    >>> arbiter = PolicyArbiter()
    >>> golden, conflicts = arbiter.merge_records(
    ...     schedule_record, "Schedule of Insurance (1).pdf",
    ...     certificate_record, "Certificate of Motor Insurance.pdf",
    ... )
    """

    def merge_records(
        self,
        schedule_record: UKMotorGoldenRecord,
        schedule_filename: str,
        certificate_record: UKMotorGoldenRecord,
        certificate_filename: str,
    ) -> tuple[UKMotorGoldenRecord, list[ConflictEntry]]:
        """
        Merge Schedule and Certificate extractions into one Golden Record.

        Schedule is master for: vehicle_details, excess_breakdown,
        financial_summary, driver DOB/occupation/license_type, NCB, cover_type.
        Certificate is master for: class_of_use, driving_other_cars.

        Returns
        -------
        tuple[UKMotorGoldenRecord, list[ConflictEntry]]
            (golden_record, list of fields where the two documents disagreed)
        """
        conflicts: list[ConflictEntry] = []
        merged = UKMotorGoldenRecord()

        # ── Policy header ───────────────────────────────────────────────────
        merged.policy_header = _merge_policy_header(schedule_record, certificate_record, conflicts)

        # ── Vehicle details: Schedule is authoritative ──────────────────────
        merged.vehicle_details = schedule_record.vehicle_details

        # ── Drivers: Schedule has DOB/occupation/licence ────────────────────
        merged.driver_details = _merge_drivers(schedule_record, certificate_record, conflicts)

        # ── Cover and excesses: hybrid ──────────────────────────────────────
        #   class_of_use + driving_other_cars   → Certificate
        #   cover_type + NCB + excess_breakdown → Schedule
        merged.cover_and_excesses = _merge_cover_and_excesses(
            schedule_record, certificate_record, conflicts
        )

        # ── Financial summary: Schedule is authoritative ────────────────────
        merged.financial_summary = schedule_record.financial_summary

        # ── Additional risk data: Schedule is authoritative ─────────────────
        merged.additional_risk_data = schedule_record.additional_risk_data

        # ── Merge field_citations from both source records ──────────────────
        # Schedule wins on key conflicts (consistent with merge hierarchy).
        # Stored on the merged record for provenance matching; excluded from JSON output.
        sched_fc = dict(getattr(schedule_record, "field_citations", None) or {})
        cert_fc = dict(getattr(certificate_record, "field_citations", None) or {})
        merged_fc = {**cert_fc, **sched_fc}
        if merged_fc:
            merged.field_citations = merged_fc

        if conflicts:
            logger.info(
                "Merge conflicts (%d): %s",
                len(conflicts),
                [c.field for c in conflicts],
            )

        logger.info(
            "Merge complete: schedule='%s' + certificate='%s' — %d conflict(s)",
            schedule_filename, certificate_filename, len(conflicts),
        )
        return merged, conflicts


# ---------------------------------------------------------------------------
# Private merge helpers
# ---------------------------------------------------------------------------


def _first(*values):
    """Return the first non-None value, or None if all are None."""
    for v in values:
        if v is not None:
            return v
    return None


def _check_conflict(
    conflicts: list[ConflictEntry],
    field: str,
    sched_val,
    cert_val,
    winner: str,
):
    """
    Detect a conflict between two scalar values, record it, and return the winner's value.

    A conflict is logged only when both values are non-None *and* differ.
    ``winner`` must be ``"schedule"`` or ``"certificate"``.
    """
    if sched_val is not None and cert_val is not None:
        if str(sched_val).strip().lower() != str(cert_val).strip().lower():
            conflicts.append(ConflictEntry(
                field=field,
                schedule_value=str(sched_val),
                certificate_value=str(cert_val),
                winner=winner,
            ))
    if winner == "certificate":
        return _first(cert_val, sched_val)
    return _first(sched_val, cert_val)  # schedule wins (default)


def _find_matching_driver(name: str, candidates: list[Driver]) -> Driver | None:
    """
    Find the best-matching driver from *candidates* using fuzzy name matching.

    Uses ``rapidfuzz.fuzz.token_sort_ratio`` so middle-name or word-order
    differences (e.g. "JOHN A SMITH" vs "SMITH JOHN") still match.
    Returns None when the best score is below ``_DRIVER_NAME_MATCH_THRESHOLD``.
    """
    try:
        from rapidfuzz import fuzz as rfuzz
    except ImportError:
        # Graceful fallback: exact uppercase match (original behaviour)
        upper = name.strip().upper()
        return next((d for d in candidates if d.name.strip().upper() == upper), None)

    best_score = 0
    best_driver: Driver | None = None
    for candidate in candidates:
        score = rfuzz.token_sort_ratio(name.strip(), candidate.name.strip())
        if score > best_score:
            best_score = score
            best_driver = candidate
    return best_driver if best_score >= _DRIVER_NAME_MATCH_THRESHOLD else None


def _merge_policy_header(
    sched: UKMotorGoldenRecord,
    cert: UKMotorGoldenRecord,
    conflicts: list[ConflictEntry],
) -> Optional[PolicyHeader]:
    """Schedule is master; fill any gap from Certificate."""
    sh = sched.policy_header or PolicyHeader()
    ch = cert.policy_header or PolicyHeader()

    poc: Optional[PeriodOfCover] = _first(sh.period_of_cover, ch.period_of_cover)

    return PolicyHeader(
        policy_number=_check_conflict(conflicts, "policy_header.policy_number", sh.policy_number, ch.policy_number, "schedule"),
        insurer=_check_conflict(conflicts, "policy_header.insurer", sh.insurer, ch.insurer, "schedule"),
        product_name=_check_conflict(conflicts, "policy_header.product_name", sh.product_name, ch.product_name, "schedule"),
        period_of_cover=poc,
    )


def _merge_drivers(
    sched: UKMotorGoldenRecord,
    cert: UKMotorGoldenRecord,
    conflicts: list[ConflictEntry],
) -> list[Driver]:
    """
    Schedule drivers are the base (they carry DOB, occupation, license_type).
    For each Schedule driver, fuzzy-match against Certificate drivers and enrich
    with relationship or is_main_driver if the Schedule record lacks them.
    Falls back to the Certificate list when Schedule has no drivers.

    Uses rapidfuzz ``token_sort_ratio`` with an 85-point threshold so minor
    name variations (initials, hyphenation, word order) still merge correctly.
    """
    sched_drivers = sched.driver_details or []
    cert_drivers = cert.driver_details or []

    if not sched_drivers:
        return cert_drivers

    merged: list[Driver] = []
    for sd in sched_drivers:
        cd = _find_matching_driver(sd.name, cert_drivers)

        if cd is not None and sd.is_main_driver != cd.is_main_driver:
            conflicts.append(ConflictEntry(
                field=f"driver_details[{sd.name}].is_main_driver",
                schedule_value=str(sd.is_main_driver),
                certificate_value=str(cd.is_main_driver),
                winner="schedule",
            ))

        merged.append(Driver(
            name=sd.name,
            dob=_first(sd.dob, cd.dob if cd else None),
            relationship=_first(sd.relationship, cd.relationship if cd else None),
            occupation=_first(sd.occupation, cd.occupation if cd else None),
            license_type=_first(sd.license_type, cd.license_type if cd else None),
            # Note: this is a logical OR of both documents, so a truthy Certificate
            # flag still applies when the Schedule flag is falsy, even though the
            # conflict entry above names "schedule" as the winner.
            is_main_driver=sd.is_main_driver or (cd.is_main_driver if cd else False),
            specific_excess=_first(sd.specific_excess, cd.specific_excess if cd else None),
        ))
    return merged


def _merge_cover_and_excesses(
    sched: UKMotorGoldenRecord,
    cert: UKMotorGoldenRecord,
    conflicts: list[ConflictEntry],
) -> Optional[CoverAndExcesses]:
    """
    Hybrid merge:
    - class_of_use, driving_other_cars → Certificate is master
    - cover_type, NCB, excess_breakdown → Schedule is master
    """
    sc = sched.cover_and_excesses or CoverAndExcesses()
    cc = cert.cover_and_excesses or CoverAndExcesses()

    return CoverAndExcesses(
        cover_type=_check_conflict(conflicts, "cover_and_excesses.cover_type", sc.cover_type, cc.cover_type, "schedule"),
        no_claims_discount=_first(sc.no_claims_discount, cc.no_claims_discount),
        excess_breakdown=_first(sc.excess_breakdown, cc.excess_breakdown),
        # Certificate is authoritative for legal-use fields
        class_of_use=_check_conflict(conflicts, "cover_and_excesses.class_of_use", sc.class_of_use, cc.class_of_use, "certificate"),
        driving_other_cars=_check_conflict(conflicts, "cover_and_excesses.driving_other_cars", sc.driving_other_cars, cc.driving_other_cars, "certificate"),
    )

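Reviewer note on `_check_conflict`: the rule it documents (log a conflict only when both values are present and differ, then return the designated winner with a fallback to the other document) can be sketched without the project's schema. `Conflict` and `pick` below are hypothetical stand-ins for `ConflictEntry` and `_check_conflict`, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conflict:
    """Simplified stand-in for the schema's ConflictEntry (assumed shape)."""
    field: str
    schedule_value: str
    certificate_value: str
    winner: str


def pick(conflicts: list, field: str, sched_val: Optional[str],
         cert_val: Optional[str], winner: str) -> Optional[str]:
    """Log a conflict when both values exist and differ, then return the winner's value."""
    if sched_val is not None and cert_val is not None:
        # The comparison ignores case and surrounding whitespace.
        if sched_val.strip().lower() != cert_val.strip().lower():
            conflicts.append(Conflict(field, sched_val, cert_val, winner))
    if winner == "certificate":
        preferred, fallback = cert_val, sched_val
    else:
        preferred, fallback = sched_val, cert_val
    # Fall back to the other document when the preferred one has no value.
    return preferred if preferred is not None else fallback


if __name__ == "__main__":
    log: list = []
    print(pick(log, "cover_type", "Comprehensive", " comprehensive ", "schedule"))
    print(pick(log, "class_of_use", "SDP", "SDP + Commuting", "certificate"))
    print(pick(log, "insurer", None, "Example Insurer", "schedule"))
    print([c.field for c in log])
```

Only the second call records a conflict: the first pair differs in case and whitespace alone, and the third has a value on one side only.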
src/main.py
ADDED
|
@@ -0,0 +1,223 @@
"""
main.py — Agentic orchestrator for UK Motor Insurance IDP.

Usage
-----
    # Process all PDFs in a folder and print the Golden Record:
    python src/main.py --input ./docs --output ./output/golden_record.json

    # Verbose logging:
    python src/main.py --input ./docs --output ./output/golden_record.json --log-level DEBUG

Environment
-----------
    GROQ_API_KEY    Required. Your Groq API key.
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
from datetime import datetime
from pathlib import Path

from agents import InsuranceExtractionAgents
from arbiter import PolicyArbiter
from pipeline import run_extraction_pipeline
from privacy import PIIMasker
from schema import DocumentType, UKMotorGoldenRecord
from settings import settings

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

logger = logging.getLogger("pipeline")


# ---------------------------------------------------------------------------
# Pipeline
# ---------------------------------------------------------------------------


class DocumentPipeline:
    """
    End-to-end agentic pipeline.

    Steps
    -----
    1. Scan *input_dir* for PDF files.
    2. For each PDF: mask PII → classify → extract with specialist agent.
    3. Pass all extractions to PolicyArbiter.
    4. Persist GoldenRecord JSON (with citations and conflict log) to *output_path*.
    """

    # Document-type priority for display ordering (matches arbiter priority)
    _DOC_ORDER = [
        DocumentType.SCHEDULE,
        DocumentType.CERTIFICATE,
        DocumentType.STATEMENT_OF_FACT,
        DocumentType.POLICY_BOOKLET,
        DocumentType.UNKNOWN,
    ]

    def __init__(
        self,
        input_dir: str | Path,
        output_path: str | Path = settings.pipeline.output_path,
        mask_dates: bool = settings.pii.mask_dates,
    ) -> None:
        self.input_dir = Path(input_dir)
        self.output_path = Path(output_path)

        # Create a timestamped debug run directory once per pipeline instance
        run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        self.debug_dir: Path | None = None
        if settings.debug.enabled:
            self.debug_dir = Path(settings.debug.output_dir) / f"run_{run_ts}"
            self.debug_dir.mkdir(parents=True, exist_ok=True)
            logger.info("Debug artifacts → %s", self.debug_dir)

        self._masker = PIIMasker(mask_dates=mask_dates)
        self._agent = InsuranceExtractionAgents(masker=self._masker, debug_dir=self.debug_dir)

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def run(self) -> UKMotorGoldenRecord:
        """Execute the full pipeline and return the UKMotorGoldenRecord."""
        pdfs = self._discover_pdfs()
        if not pdfs:
            raise FileNotFoundError(
                f"No PDF files found in '{self.input_dir}'. "
                "Ensure the folder contains at least one .pdf file."
            )

        logger.info("Found %d PDF(s): %s", len(pdfs), [p.name for p in pdfs])

        # ── Stages 1 + 2: Extract + Arbitrate (shared logic via pipeline.py) ──
        golden, conflicts, _ = run_extraction_pipeline(
            pdf_paths=pdfs,
            agent=self._agent,
            with_provenance=False,
        )

        # ── Stage 3: Persist ──────────────────────────────────────────────
        self._save(golden)
        logger.info("Golden Record saved → %s", self.output_path)

        if conflicts and self.debug_dir:
            # json is already imported at module level; no local re-import needed.
            (self.debug_dir / "conflicts.json").write_text(
                json.dumps([c.model_dump() for c in conflicts], indent=2),
                encoding="utf-8",
            )
            logger.info(
                "Arbiter conflicts (%d) written → %s/conflicts.json",
                len(conflicts), self.debug_dir,
            )

        return golden

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _discover_pdfs(self) -> list[Path]:
        """Return the ``*.pdf`` files in *input_dir*, sorted by filename."""
        if not self.input_dir.is_dir():
            raise NotADirectoryError(f"'{self.input_dir}' is not a directory.")
        return sorted(self.input_dir.glob("*.pdf"), key=lambda p: p.name)

    def _save(self, golden: UKMotorGoldenRecord) -> None:
        self.output_path.parent.mkdir(parents=True, exist_ok=True)
        self.output_path.write_text(golden.model_dump_json(indent=2, exclude_none=True), encoding="utf-8")


# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Agentic UK Motor Insurance IDP Pipeline",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--input", "-i",
        required=True,
        help="Folder containing input PDF documents.",
    )
    parser.add_argument(
        "--output", "-o",
        default=settings.pipeline.output_path,
        help="Output path for the Golden Record JSON.",
    )
    parser.add_argument(
        "--mask-dates",
        action="store_true",
        default=False,
        help="Also redact DATE_TIME entities during PII masking.",
    )
    parser.add_argument(
        "--log-level",
        default=settings.pipeline.log_level,
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging verbosity.",
    )
    return parser.parse_args()


def main() -> None:
    args = _parse_args()

    # ── Logging setup: console + optional file handler ─────────────────────
    log_format = "%(asctime)s [%(levelname)s] %(name)s — %(message)s"
    logging.basicConfig(
        level=args.log_level,
        format=log_format,
        datefmt="%H:%M:%S",
        stream=sys.stdout,
    )
    if settings.debug.enabled:
        run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        log_dir = Path(settings.debug.output_dir) / f"run_{run_ts}"
        log_dir.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(log_dir / "pipeline.log", encoding="utf-8")
        file_handler.setLevel(args.log_level)
        file_handler.setFormatter(logging.Formatter(log_format, datefmt="%H:%M:%S"))
        logging.getLogger().addHandler(file_handler)
        logger.info("Log file: %s", log_dir / "pipeline.log")

    pipeline = DocumentPipeline(
        input_dir=args.input,
        output_path=args.output,
        mask_dates=args.mask_dates,
    )

    golden = pipeline.run()

    # Print a compact summary to stdout
    hdr = golden.policy_header
    veh = golden.vehicle_details
    cov = golden.cover_and_excesses
    drivers = golden.driver_details or []
    print("\n" + "=" * 60)
    print("  GOLDEN RECORD SUMMARY")
    print("=" * 60)
    print(f"  Policy #     : {hdr.policy_number if hdr else 'N/A'}")
    print(f"  Insurer      : {hdr.insurer if hdr else 'N/A'}")
    print(f"  VRM          : {veh.vrm if veh else 'N/A'}")
    # Join only the parts that are present so a missing model does not raise TypeError.
    print(f"  Vehicle      : {' '.join(filter(None, [veh.make, veh.model])) if veh and veh.make else 'N/A'}")
    print(f"  Cover        : {cov.cover_type if cov else 'N/A'}")
    print(f"  Class of use : {cov.class_of_use if cov else 'N/A'}")
    print(f"  Drivers      : {len(drivers)}")
    print("=" * 60)
    print(f"\nFull JSON written to: {args.output}\n")


if __name__ == "__main__":
    main()

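Reviewer note on the debug run directory: `main()` and `DocumentPipeline.__init__` each derive a `run_<timestamp>` name from their own `datetime.now()` call, so if the two calls fall in different seconds, `pipeline.log` and the debug artifacts would land in different directories. A minimal sketch of one way around this is to take the timestamp once and pass it along. `make_run_dir` is a hypothetical helper, not something the commit defines.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(base: Path, started_at: datetime) -> Path:
    """Create (or reuse) the run_<timestamp> directory for one pipeline run.

    Hypothetical helper; the timestamp format matches the one used in main.py.
    """
    run_dir = base / f"run_{started_at.strftime('%Y-%m-%d_%H-%M-%S')}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    started_at = datetime.now()  # taken once, then handed to every consumer
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = make_run_dir(Path(tmp), started_at)
        debug_dir = make_run_dir(Path(tmp), started_at)
        print(log_dir == debug_dir)
```

Passing the same `started_at` to both callers makes the directory name a function of the run rather than of the wall clock at each call site.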
src/pipeline.py
ADDED
|
@@ -0,0 +1,131 @@
"""
pipeline.py — Shared PDF-routing and arbitration logic.

Both the CLI (main.py / DocumentPipeline) and the API (api.py / process_documents)
run the same extraction loop: route PDFs to Schedule/Certificate slots, call
the PolicyArbiter, and return the merged record plus any detected conflicts.

Extracting this logic here eliminates the duplication that previously existed
between those two entry-points and makes the behaviour easy to test in isolation.

Usage
-----
    from pipeline import run_extraction_pipeline

    golden, conflicts, corpora = run_extraction_pipeline(
        pdf_paths=pdf_paths,
        agent=agent,
        with_provenance=True,
    )
"""
from __future__ import annotations

import logging
from pathlib import Path
from typing import Any

from agents import ExtractionFailedError, InsuranceExtractionAgents
from arbiter import PolicyArbiter
from schema import ConflictEntry, DocumentType, UKMotorGoldenRecord

logger = logging.getLogger(__name__)


def run_extraction_pipeline(
    pdf_paths: list[Path],
    agent: InsuranceExtractionAgents,
    *,
    with_provenance: bool = False,
) -> tuple[UKMotorGoldenRecord, list[ConflictEntry], list[Any]]:
    """
    Route PDFs to Schedule/Certificate slots, arbitrate, and return the results.

    Parameters
    ----------
    pdf_paths : list[Path]
        Paths to the PDF documents to process.
    agent : InsuranceExtractionAgents
        Configured extraction agent (carries masker, debug_dir, prompts, etc.).
    with_provenance : bool
        When True, builds and returns ProvenanceCorpus objects for each PDF.
        Set to True when running via the API (Visual Audit UI needs geometry data).
        Set to False for the CLI path (faster, no corpus overhead).

    Returns
    -------
    tuple[UKMotorGoldenRecord, list[ConflictEntry], list[ProvenanceCorpus]]
        * golden_record — the merged authoritative policy record
        * conflicts     — fields where Schedule and Certificate disagreed
        * corpora       — ProvenanceCorpus objects (empty list when with_provenance=False)

    Raises
    ------
    RuntimeError
        When neither a Schedule nor a Certificate could be extracted from any PDF.
    """
    schedule_record: UKMotorGoldenRecord | None = None
    schedule_filename = "unknown_schedule.pdf"
    certificate_record: UKMotorGoldenRecord | None = None
    certificate_filename = "unknown_certificate.pdf"
    corpora: list[Any] = []
    failed: list[str] = []

    for pdf_path in pdf_paths:
        try:
            if with_provenance:
                record, doc_type_str, corpus = agent.process_with_provenance(pdf_path)
                if corpus is not None and corpus.items:
                    corpora.append(corpus)
            else:
                record, doc_type_str = agent.process(pdf_path)

            logger.info("  ✓ %s → %s", pdf_path.name, doc_type_str)

            if doc_type_str == DocumentType.SCHEDULE.value and schedule_record is None:
                schedule_record = record
|
| 86 |
+
schedule_filename = pdf_path.name
|
| 87 |
+
elif doc_type_str == DocumentType.CERTIFICATE.value and certificate_record is None:
|
| 88 |
+
certificate_record = record
|
| 89 |
+
certificate_filename = pdf_path.name
|
| 90 |
+
else:
|
| 91 |
+
logger.info(" ~ %s (%s) — not used in merge", pdf_path.name, doc_type_str)
|
| 92 |
+
|
| 93 |
+
except ExtractionFailedError as exc:
|
| 94 |
+
logger.error(" ✗ Extraction failed for %s: %s", pdf_path.name, exc)
|
| 95 |
+
failed.append(pdf_path.name)
|
| 96 |
+
except Exception as exc: # noqa: BLE001
|
| 97 |
+
logger.error(" ✗ %s failed: %s", pdf_path.name, exc)
|
| 98 |
+
failed.append(pdf_path.name)
|
| 99 |
+
|
| 100 |
+
if failed:
|
| 101 |
+
logger.warning("Skipped %d document(s): %s", len(failed), failed)
|
| 102 |
+
|
| 103 |
+
if schedule_record is None and certificate_record is None:
|
| 104 |
+
raise RuntimeError(
|
| 105 |
+
"No Schedule or Certificate extracted. "
|
| 106 |
+
"Check GROQ_API_KEY and that the PDFs are readable."
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
if schedule_record is None:
|
| 110 |
+
logger.warning("No Schedule found — using empty record as fallback")
|
| 111 |
+
if certificate_record is None:
|
| 112 |
+
logger.warning("No Certificate found — using empty record as fallback")
|
| 113 |
+
|
| 114 |
+
schedule_record = schedule_record or UKMotorGoldenRecord()
|
| 115 |
+
certificate_record = certificate_record or UKMotorGoldenRecord()
|
| 116 |
+
|
| 117 |
+
logger.info("Merging Schedule + Certificate via PolicyArbiter…")
|
| 118 |
+
arbiter = PolicyArbiter()
|
| 119 |
+
golden, conflicts = arbiter.merge_records(
|
| 120 |
+
schedule_record, schedule_filename,
|
| 121 |
+
certificate_record, certificate_filename,
|
| 122 |
+
)
|
| 123 |
+
|
| 124 |
+
if conflicts:
|
| 125 |
+
logger.info(
|
| 126 |
+
"Arbiter detected %d conflict(s): %s",
|
| 127 |
+
len(conflicts),
|
| 128 |
+
[c.field for c in conflicts],
|
| 129 |
+
)
|
| 130 |
+
|
| 131 |
+
return golden, conflicts, corpora
|
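The routing rule in the loop above (the first Schedule and the first Certificate fill their slots, every other document is logged and skipped) can be exercised in isolation with a minimal sketch. The `route_documents` helper is hypothetical and not part of the repository, and the plain string labels are an assumption standing in for `DocumentType` values; the real pipeline stores extracted records rather than filenames.

```python
from __future__ import annotations


def route_documents(classified: list[tuple[str, str]]) -> dict[str, str | None]:
    """Assign filenames to the Schedule/Certificate slots; first match wins.

    `classified` holds (filename, doc_type) pairs in processing order.
    The doc_type labels are illustrative assumptions, not the real enum.
    """
    slots: dict[str, str | None] = {"Schedule": None, "Certificate": None}
    for filename, doc_type in classified:
        # Only fill a slot that exists and is still empty.
        if doc_type in slots and slots[doc_type] is None:
            slots[doc_type] = filename
        # Duplicates, booklets and unknown types fall through and are ignored.
    return slots


routed = route_documents([
    ("booklet.pdf", "PolicyBooklet"),
    ("schedule_a.pdf", "Schedule"),
    ("schedule_b.pdf", "Schedule"),   # second Schedule: not used in the merge
    ("cert.pdf", "Certificate"),
])
print(routed)
```

One consequence of this design is that the result depends on the order of `pdf_paths`: if two Schedules are uploaded, only the first one processed reaches the arbiter.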
src/privacy.py
ADDED
|
@@ -0,0 +1,186 @@
"""
privacy.py: PII detection and masking via Microsoft Presidio.

Entities masked before any text is sent to the LLM:
    PERSON, PHONE_NUMBER, EMAIL_ADDRESS, UK_NHS, UK_NINO,
    CREDIT_CARD, IBAN_CODE, LOCATION, IP_ADDRESS, URL, DATE_TIME (opt-in)

Usage
-----
    masker = PIIMasker()
    clean_text, mapping = masker.mask(raw_markdown)
    # ... call LLM with clean_text ...
    # If you ever need to restore originals:
    restored = masker.restore(llm_output, mapping)
"""
from __future__ import annotations

import re
from typing import Optional

from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

from settings import settings


# ---------------------------------------------------------------------------
# Default entity list (tuned for UK motor insurance documents)
# ---------------------------------------------------------------------------

UK_MOTOR_ENTITIES: list[str] = [
    "PERSON",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "UK_NHS",
    "UK_NINO",  # National Insurance Number (Presidio's entity name)
    "CREDIT_CARD",
    "IBAN_CODE",
    "LOCATION",  # postcodes / addresses
    "IP_ADDRESS",
    "URL",
]

# Sentinel prefix used for replacement tokens so we can detect them reliably
_TOKEN_PREFIX = "MASKED_"


class PIIMasker:
    """
    Stateless masker: call `mask()` to redact PII in a text string.

    Parameters
    ----------
    entities : list[str]
        Presidio entity types to redact. Defaults to ``settings.pii.entities``.
    language : str
        ISO 639-1 language code passed to the Presidio analyzer.
    mask_dates : bool
        When True, DATE_TIME entities are also redacted. Default False
        because insurance documents are date-heavy and stripping them
        would break structured extraction.
    score_threshold : float
        Minimum confidence score (0-1) for a detected entity to be masked.
    """

    def __init__(
        self,
        entities: Optional[list[str]] = None,
        language: str = settings.pii.language,
        mask_dates: bool = settings.pii.mask_dates,
        score_threshold: float = settings.pii.score_threshold,
    ) -> None:
        self._entities = list(entities or settings.pii.entities)
        if mask_dates and "DATE_TIME" not in self._entities:
            self._entities.append("DATE_TIME")

        self._language = language
        self._score_threshold = score_threshold

        # Build NLP engine (spaCy en_core_web_lg preferred; falls back to sm)
        nlp_config = {
            "nlp_engine_name": "spacy",
            "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
        }
        try:
            provider = NlpEngineProvider(nlp_configuration=nlp_config)
            nlp_engine = provider.create_engine()
        except OSError:
            # Fall back to the small model if lg is not installed
            nlp_config["models"][0]["model_name"] = "en_core_web_sm"
            provider = NlpEngineProvider(nlp_configuration=nlp_config)
            nlp_engine = provider.create_engine()

        self._analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=[language])
        self._anonymizer = AnonymizerEngine()

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def mask(self, text: str) -> tuple[str, dict[str, str]]:
        """
        Redact PII in *text* and return (masked_text, token_map).

        token_map maps placeholder tokens back to original values, allowing
        optional restoration after LLM processing.

        Example
        -------
        >>> masked, mapping = masker.mask("John Smith drives AB12 CDE")
        >>> masked
        'MASKED_PERSON_1 drives AB12 CDE'
        >>> mapping
        {'MASKED_PERSON_1': 'John Smith'}
        """
        results: list[RecognizerResult] = self._analyzer.analyze(
            text=text,
            entities=self._entities,
            language=self._language,
            score_threshold=self._score_threshold,
        )

        if not results:
            return text, {}

        # Build per-entity-type counters for unique token names
        counters: dict[str, int] = {}
        token_map: dict[str, str] = {}
        operators: dict[str, OperatorConfig] = {}

        # Sort by position so token numbering is left-to-right and deterministic
        results_sorted = sorted(results, key=lambda r: r.start)

        # We need custom lambda operators to generate named tokens.
        # Presidio's "replace" operator uses a fixed `new_value`; we work
        # around this by building a value map keyed on (entity_type, original).
        original_to_token: dict[tuple[str, str], str] = {}

        for r in results_sorted:
            original = text[r.start : r.end]
            key = (r.entity_type, original)
            if key not in original_to_token:
                counters[r.entity_type] = counters.get(r.entity_type, 0) + 1
                token = f"{_TOKEN_PREFIX}{r.entity_type}_{counters[r.entity_type]}"
                original_to_token[key] = token
                token_map[token] = original

        # Perform replacement manually (Presidio replace operator doesn't
        # support per-occurrence dynamic values in a single pass).
        masked_text = _replace_spans(text, results_sorted, original_to_token)
        return masked_text, token_map

    def restore(self, text: str, token_map: dict[str, str]) -> str:
        """
        Substitute masked tokens back to original PII values.

        This is provided for completeness / testing; in production the LLM
        output is kept masked and stored as-is for GDPR compliance.
        """
        for token in sorted(token_map, key=len, reverse=True):  # longest first: _10 before _1
            text = text.replace(token, token_map[token])
        return text


# ---------------------------------------------------------------------------
# Internal helpers
# ---------------------------------------------------------------------------


def _replace_spans(
    text: str,
    results: list[RecognizerResult],
    original_to_token: dict[tuple[str, str], str],
) -> str:
    """
    Replace PII spans in *text* with their corresponding tokens.
    Processes spans right-to-left to keep offset arithmetic valid.
    """
    chars = list(text)
    for r in sorted(results, key=lambda r: r.start, reverse=True):
        original = text[r.start : r.end]
        token = original_to_token.get((r.entity_type, original), original)
        chars[r.start : r.end] = list(token)
    return "".join(chars)
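The token numbering and right-to-left replacement described in `mask()` and `_replace_spans()` can be tried without installing Presidio. The sketch below is a simplified illustration under stated assumptions: `Span` is a hypothetical stand-in for `RecognizerResult`, the spans are supplied by hand rather than detected, and `restore` substitutes longer tokens first so that a token such as `MASKED_PERSON_10` is not partially rewritten by `MASKED_PERSON_1`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a Presidio RecognizerResult."""
    entity_type: str
    start: int
    end: int


def mask_spans(text: str, spans: list[Span]) -> tuple[str, dict[str, str]]:
    counters: dict[str, int] = {}
    token_for: dict[tuple[str, str], str] = {}
    token_map: dict[str, str] = {}
    # Number tokens left to right; identical values share one token.
    for s in sorted(spans, key=lambda s: s.start):
        key = (s.entity_type, text[s.start:s.end])
        if key not in token_for:
            counters[s.entity_type] = counters.get(s.entity_type, 0) + 1
            token = f"MASKED_{s.entity_type}_{counters[s.entity_type]}"
            token_for[key] = token
            token_map[token] = key[1]
    # Substitute right to left so the offsets of earlier spans stay valid.
    masked = text
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        token = token_for[(s.entity_type, text[s.start:s.end])]
        masked = masked[:s.start] + token + masked[s.end:]
    return masked, token_map


def restore(text: str, token_map: dict[str, str]) -> str:
    # Longest tokens first, so "..._10" is handled before "..._1".
    for token in sorted(token_map, key=len, reverse=True):
        text = text.replace(token, token_map[token])
    return text


sample = "John Smith called John Smith and Jane Doe"
spans = [Span("PERSON", 0, 10), Span("PERSON", 18, 28), Span("PERSON", 33, 41)]
masked, mapping = mask_spans(sample, spans)
print(masked)
print(restore(masked, mapping) == sample)
```

Keying the token map on `(entity_type, original)` means a repeated name always gets the same placeholder, which lets the LLM see that two mentions refer to the same person without seeing the name itself.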
src/prompts.py
ADDED
|
@@ -0,0 +1,149 @@
"""
prompts.py: Versioned prompt registry for the UK Motor Insurance IDP pipeline.

Loads prompt text from prompts.yaml so prompts can be updated, versioned, and
reviewed without touching Python source code.

Usage
-----
    registry = PromptRegistry()  # uses active_version from YAML
    registry = PromptRegistry(version="v2")  # pin to a specific version
    registry = PromptRegistry(config_path="custom.yaml")

    system_prompt = registry.get(DocumentType.SCHEDULE)
    print(registry.active_version)  # → "v1"
    print(registry.available_versions)  # → ["v1"]
"""
from __future__ import annotations

import logging
from pathlib import Path
from typing import Optional

import yaml

from schema import DocumentType

logger = logging.getLogger(__name__)

# Default path: <project_root>/config/prompts.yaml
# Resolved relative to this file's location (src/ → .. → config/)
_DEFAULT_CONFIG = Path(__file__).parent.parent / "config" / "prompts.yaml"

# Maps DocumentType enum values → YAML keys
_DOC_TYPE_TO_KEY: dict[DocumentType, str] = {
    DocumentType.SCHEDULE: "Schedule",
    DocumentType.CERTIFICATE: "Certificate",
    DocumentType.STATEMENT_OF_FACT: "StatementOfFact",
    DocumentType.POLICY_BOOKLET: "PolicyBooklet",
    DocumentType.UNKNOWN: "_generic",
}

_GENERIC_KEY = "_generic"


class PromptRegistry:
    """
    Loads versioned prompts from a YAML file and resolves them by DocumentType.

    Parameters
    ----------
    config_path : str | Path | None
        Path to the YAML file. Defaults to ``config/prompts.yaml`` under the
        project root (one level above this module's ``src/`` directory).
    version : str | None
        Prompt version to activate (e.g. ``"v1"``, ``"v2"``).
        Defaults to the ``active_version`` key in the YAML file.
    """

    def __init__(
        self,
        config_path: Optional[str | Path] = None,
        version: Optional[str] = None,
    ) -> None:
        self._config_path = Path(config_path) if config_path else _DEFAULT_CONFIG
        self._raw = self._load_yaml()
        self._active_version = version or self._raw.get("active_version", "v1")
        self._prompts = self._resolve_version(self._active_version)

        logger.info(
            "PromptRegistry loaded: version=%s, path=%s",
            self._active_version,
            self._config_path,
        )

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    @property
    def active_version(self) -> str:
        """The currently active prompt version string."""
        return self._active_version

    @property
    def available_versions(self) -> list[str]:
        """All version keys defined in the YAML file."""
        return list(self._raw.get("prompts", {}).keys())

    def get(self, doc_type: DocumentType) -> str:
        """
        Return the system prompt for a given DocumentType.

        Falls back to the ``_generic`` prompt if the specific key is missing.
        Raises ``KeyError`` if ``_generic`` is also absent (misconfigured YAML).
        """
        key = _DOC_TYPE_TO_KEY.get(doc_type, _GENERIC_KEY)
        prompt = self._prompts.get(key) or self._prompts.get(_GENERIC_KEY)
        if not prompt:
            raise KeyError(
                f"No prompt found for DocumentType '{doc_type.value}' in version "
                f"'{self._active_version}' of {self._config_path}. "
                f"Ensure '{key}' or '{_GENERIC_KEY}' is defined."
            )
        return prompt.strip()

    def reload(self) -> None:
        """
        Hot-reload prompts from disk without restarting the process.

        Useful in long-running services when prompts.yaml is updated in place.
        """
        self._raw = self._load_yaml()
        self._prompts = self._resolve_version(self._active_version)
        logger.info("PromptRegistry reloaded from %s", self._config_path)

    def switch_version(self, version: str) -> None:
        """
        Switch the active prompt version at runtime.

        Parameters
        ----------
        version : str
            Must be a key present under ``prompts:`` in the YAML file.
        """
        self._prompts = self._resolve_version(version)
        self._active_version = version
        logger.info("PromptRegistry switched to version '%s'", version)

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _load_yaml(self) -> dict:
        if not self._config_path.exists():
            raise FileNotFoundError(
                f"Prompt configuration not found: {self._config_path}"
            )
        with self._config_path.open(encoding="utf-8") as fh:
            return yaml.safe_load(fh) or {}

    def _resolve_version(self, version: str) -> dict[str, str]:
        versions = self._raw.get("prompts", {})
        if version not in versions:
            available = list(versions.keys())
            raise ValueError(
                f"Prompt version '{version}' not found in {self._config_path}. "
                f"Available versions: {available}"
            )
        return versions[version]
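The lookup order used by `PromptRegistry` (pick the version, then the document-specific key, then fall back to `_generic`) can be sketched against an in-memory dictionary. The layout of `RAW` mirrors what the code reads (`active_version` plus a `prompts` mapping keyed by version), but the prompt texts and the `resolve_prompt` helper are hypothetical, and the dictionary stands in for the parsed YAML so the sketch does not need PyYAML.

```python
from __future__ import annotations

# Hypothetical stand-in for the parsed prompts.yaml; the prompt texts are invented.
RAW = {
    "active_version": "v1",
    "prompts": {
        "v1": {
            "Schedule": "Extract the schedule fields.\n",
            "_generic": "Extract any policy fields you can find.\n",
        },
    },
}


def resolve_prompt(raw: dict, key: str, version: str | None = None) -> str:
    """Return the prompt for `key`, falling back to the `_generic` entry."""
    version = version or raw.get("active_version", "v1")
    versions = raw.get("prompts", {})
    if version not in versions:
        raise ValueError(
            f"Prompt version '{version}' not found; available: {list(versions)}"
        )
    prompts = versions[version]
    # A missing or empty document-specific prompt falls through to _generic.
    prompt = prompts.get(key) or prompts.get("_generic")
    if not prompt:
        raise KeyError(f"No prompt for '{key}' and no '_generic' fallback in '{version}'")
    return prompt.strip()


schedule_prompt = resolve_prompt(RAW, "Schedule")
booklet_prompt = resolve_prompt(RAW, "PolicyBooklet")  # no specific key, uses _generic
print(schedule_prompt)
print(booklet_prompt)
```

Failing fast on an unknown version, rather than silently using another one, keeps a typo in a pinned version from changing which prompts reach the model.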
src/provenance.py
ADDED
|
@@ -0,0 +1,424 @@
| 1 |
+
"""
|
| 2 |
+
provenance.py — Post-extraction provenance mapping for the Visual Audit UI.
|
| 3 |
+
|
| 4 |
+
After the LLM extracts a flat Golden Record, this module walks the record and
|
| 5 |
+
fuzzy-matches each extracted value against a ProvenanceCorpus built from the
|
| 6 |
+
Docling document IR. The LLM is never asked to self-report geometry — that
|
| 7 |
+
would cause hallucinations; this module handles localisation as a pure
|
| 8 |
+
post-processing step.
|
| 9 |
+
|
| 10 |
+
Coordinate convention
|
| 11 |
+
─────────────────────
|
| 12 |
+
Docling bbox : PDF space — origin bottom-left, y increases upward, unit = pt
|
| 13 |
+
Stored bbox : Browser % — origin top-left, y increases downward, range 0–100
|
| 14 |
+
|
| 15 |
+
Conversion (per axis):
|
| 16 |
+
x0% = bbox.l / page_width * 100
|
| 17 |
+
y0% = (page_height - bbox.t) / page_height * 100 # top of element
|
| 18 |
+
x1% = bbox.r / page_width * 100
|
| 19 |
+
y1% = (page_height - bbox.b) / page_height * 100 # bottom of element
|
| 20 |
+
"""
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import logging
|
| 24 |
+
import re
|
| 25 |
+
from dataclasses import dataclass
|
| 26 |
+
from typing import Any, Iterator
|
| 27 |
+
|
| 28 |
+
logger = logging.getLogger(__name__)
|
| 29 |
+
|
| 30 |
+
# ── Matching parameters ──────────────────────────────────────────────────────
|
| 31 |
+
_MATCH_THRESHOLD = 78 # minimum rapidfuzz WRatio (0–100) for normalised-value fallback
|
| 32 |
+
_CITATION_THRESHOLD = 88 # minimum partial_ratio for LLM-supplied verbatim citation quotes
|
| 33 |
+
_MIN_VALUE_LEN = 4 # skip matching for values shorter than this (too ambiguous)
|
| 34 |
+
|
| 35 |
+
# Leaf field names whose values are boolean-like and would match too broadly
|
| 36 |
+
_SKIP_LEAF_NAMES = {
|
| 37 |
+
"is_main_driver", "protected", "has_security_device",
|
| 38 |
+
"tracker_fitted", "driving_other_cars",
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
# Top-level section names to skip entirely.
|
| 42 |
+
# `source_document` and `field_citations` are internal provenance fields —
|
| 43 |
+
# they don't contain verbatim PDF values so matching against them is meaningless.
|
| 44 |
+
_SKIP_SECTION_NAMES = {"source_document", "field_citations"}
|
| 45 |
+
|
| 46 |
+
# Document types whose corpora are unreliable for field-level matching.
|
| 47 |
+
# Policy Booklets contain generic boilerplate — matching against them produces
|
| 48 |
+
# false positives for almost every field ("Full", "UK", date digits, etc.).
|
| 49 |
+
_EXCLUDE_FROM_MATCHING: set[str] = {"PolicyBooklet", "Unknown"}
|
| 50 |
+
|
| 51 |
+
# Padding added to each bbox for display. The Docling bbox is a tight text
|
| 52 |
+
# box (~1% page height per line) which is hard to see. We expand it so the
|
| 53 |
+
# highlight is clearly visible without losing positional accuracy.
|
| 54 |
+
_BBOX_PAD_X = 0.4 # % to expand left/right
|
| 55 |
+
_BBOX_PAD_Y = 0.6 # % to expand top/bottom
|
| 56 |
+
_BBOX_MIN_H = 2.0 # % minimum height after padding
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
# ---------------------------------------------------------------------------
|
| 60 |
+
# Corpus data structures
|
| 61 |
+
# ---------------------------------------------------------------------------
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
@dataclass
|
| 65 |
+
class CorpusItem:
|
| 66 |
+
"""One text element from a Docling DoclingDocument, with browser % geometry."""
|
| 67 |
+
|
| 68 |
+
text: str
|
| 69 |
+
page: int
|
| 70 |
+
bbox: list[float] # [x0%, y0%, x1%, y1%] — top-left origin, 0–100
|
| 71 |
+
source_filename: str
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
class ProvenanceCorpus:
|
| 75 |
+
"""All extractable text elements from one PDF, with their page geometry."""
|
| 76 |
+
|
| 77 |
+
def __init__(self, source_filename: str = "", doc_type: str = "Unknown") -> None:
|
| 78 |
+
self.source_filename = source_filename
|
| 79 |
+
self.doc_type = doc_type # e.g. "Schedule", "Certificate", "PolicyBooklet"
|
| 80 |
+
self.items: list[CorpusItem] = []
|
| 81 |
+
|
| 82 |
+
# ------------------------------------------------------------------
|
| 83 |
+
# Public API
|
| 84 |
+
# ------------------------------------------------------------------
|
| 85 |
+
|
| 86 |
+
def add_from_docling(self, doc: Any, filename: str) -> None:
|
| 87 |
+
"""
|
| 88 |
+
Populate the corpus from a Docling DoclingDocument.
|
| 89 |
+
|
| 90 |
+
Safely handles API variations across docling versions — logs a warning
|
| 91 |
+
rather than propagating exceptions, so the calling pipeline stays alive
|
| 92 |
+
even if provenance extraction fails.
|
| 93 |
+
"""
|
| 94 |
+
self.source_filename = filename
|
| 95 |
+
try:
|
| 96 |
+
self._extract_items(doc, filename)
|
| 97 |
+
logger.debug(
|
| 98 |
+
"Corpus '%s': %d items, %d pages",
|
| 99 |
+
filename, len(self.items), self._count_pages(doc),
|
| 100 |
+
)
|
| 101 |
+
except Exception as exc: # noqa: BLE001
|
| 102 |
+
logger.warning(
|
| 103 |
+
"Provenance extraction skipped for '%s': %s", filename, exc
|
| 104 |
+
)
|
| 105 |
+
|
| 106 |
+
# ------------------------------------------------------------------
|
| 107 |
+
# Private helpers
|
| 108 |
+
# ------------------------------------------------------------------
|
| 109 |
+
|
| 110 |
+
def _extract_items(self, doc: Any, filename: str) -> None:
|
| 111 |
+
page_sizes = _build_page_sizes(doc)
|
| 112 |
+
if not page_sizes:
|
| 113 |
+
logger.debug("No page size data for '%s' — provenance skipped", filename)
|
| 114 |
+
return
|
| 115 |
+
|
| 116 |
+
for item in _iter_items(doc):
|
| 117 |
+
text = _item_text(item)
|
| 118 |
+
if not text or len(text) < 2:
|
| 119 |
+
continue
|
| 120 |
+
for prov in getattr(item, "prov", []):
|
| 121 |
+
self._add_prov_item(prov, text, filename, page_sizes)
|
| 122 |
+
|
| 123 |
+
def _add_prov_item(
|
| 124 |
+
self,
|
| 125 |
+
prov: Any,
|
| 126 |
+
text: str,
|
| 127 |
+
filename: str,
|
| 128 |
+
page_sizes: dict[int, tuple[float, float]],
|
| 129 |
+
) -> None:
|
| 130 |
+
page_no = getattr(prov, "page_no", None)
|
| 131 |
+
if page_no is None:
|
| 132 |
+
return
|
| 133 |
+
page_no = int(page_no)
|
| 134 |
+
if page_no not in page_sizes:
|
| 135 |
+
return
|
| 136 |
+
|
| 137 |
+
pw, ph = page_sizes[page_no]
|
| 138 |
+
bbox = getattr(prov, "bbox", None)
|
| 139 |
+
if bbox is None:
|
| 140 |
+
return
|
| 141 |
+
|
| 142 |
+
l = float(getattr(bbox, "l", 0))
|
| 143 |
+
t_v = float(getattr(bbox, "t", ph)) # top in PDF space (high y value)
|
| 144 |
+
r = float(getattr(bbox, "r", pw))
|
| 145 |
+
b = float(getattr(bbox, "b", 0)) # bottom in PDF space (low y value)
|
| 146 |
+
|
| 147 |
+
# Convert: PDF (bottom-left origin, pts) → browser % (top-left origin)
|
| 148 |
+
x0 = _clamp(l / pw * 100)
|
| 149 |
+
y0 = _clamp((ph - t_v) / ph * 100) # top of element in browser coords
|
| 150 |
+
x1 = _clamp(r / pw * 100)
|
| 151 |
+
y1 = _clamp((ph - b) / ph * 100) # bottom of element in browser coords
|
| 152 |
+
|
| 153 |
+
self.items.append(CorpusItem(
|
| 154 |
+
text=text,
|
| 155 |
+
page=page_no,
|
| 156 |
+
bbox=[round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)],
|
| 157 |
+
source_filename=filename,
|
| 158 |
+
))
|
| 159 |
+
|
| 160 |
+
@staticmethod
|
| 161 |
+
def _count_pages(doc: Any) -> int:
|
| 162 |
+
return len(getattr(doc, "pages", {}))
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
# ---------------------------------------------------------------------------
|
| 166 |
+
# Module-level helpers for corpus building
|
| 167 |
+
# ---------------------------------------------------------------------------
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
def _build_page_sizes(doc: Any) -> dict[int, tuple[float, float]]:
|
| 171 |
+
sizes: dict[int, tuple[float, float]] = {}
|
| 172 |
+
for page_no, page_item in getattr(doc, "pages", {}).items():
|
| 173 |
+
size = getattr(page_item, "size", None)
|
| 174 |
+
if size:
|
| 175 |
+
w = float(getattr(size, "width", 0))
|
| 176 |
+
h = float(getattr(size, "height", 0))
|
| 177 |
+
if w > 0 and h > 0:
|
| 178 |
+
sizes[int(page_no)] = (w, h)
|
| 179 |
+
return sizes
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
def _iter_items(doc: Any):
|
| 183 |
+
"""Yield all document items, trying iterate_items() first then .texts/.tables."""
|
| 184 |
+
try:
|
| 185 |
+
for item, _level in doc.iterate_items():
|
| 186 |
+
yield item
|
| 187 |
+
except AttributeError:
|
| 188 |
+
for item in getattr(doc, "texts", []):
|
| 189 |
+
yield item
|
| 190 |
+
for item in getattr(doc, "tables", []):
|
| 191 |
+
yield item
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
def _item_text(item: Any) -> str:
|
| 195 |
+
"""Extract a string from a Docling TextItem or TableItem."""
|
| 196 |
+
text = getattr(item, "text", None)
|
| 197 |
+
if text is not None:
|
| 198 |
+
return str(text).strip()
|
| 199 |
+
# TableItem: concatenate all cell text into one searchable blob
|
| 200 |
+
data = getattr(item, "data", None)
|
| 201 |
+
if data is not None:
|
| 202 |
+
cells = [
|
| 203 |
+
str(getattr(cell, "text", "")).strip()
|
| 204 |
+
for row in getattr(data, "grid", [])
|
| 205 |
+
for cell in row
|
| 206 |
+
]
|
| 207 |
+
return " | ".join(c for c in cells if c)
|
| 208 |
+
return ""
|
| 209 |
+
|
| 210 |
+
|
| 211 |
+
def _clamp(v: float) -> float:
|
| 212 |
+
return max(0.0, min(100.0, v))


# ---------------------------------------------------------------------------
# Field-level provenance builder (main public function)
# ---------------------------------------------------------------------------


def build_provenance(
    record: Any,  # UKMotorGoldenRecord
    corpora: list[ProvenanceCorpus],
) -> list[Any]:  # list[FieldProvenance]
    """
    Walk the Golden Record and fuzzy-match each extracted value against all
    trusted corpora (Schedule, Certificate, StatementOfFact).

    Policy Booklet corpora are excluded — they contain generic boilerplate
    that produces false positives for almost every field value.

    Returns a ``FieldProvenance`` entry for every field that can be located
    at or above the match threshold.  Fields with no good corpus match are
    omitted — the UI shows them as "No location data".
    """
    from schema import FieldProvenance, Location  # local import avoids circular dep

    try:
        from rapidfuzz import fuzz as rfuzz
    except ImportError:
        logger.warning(
            "rapidfuzz not installed — provenance matching disabled. "
            "Run: pip install rapidfuzz"
        )
        return []

    # Filter to trusted corpora only (exclude Policy Booklet and Unknown docs)
    trusted_corpora = [
        c for c in corpora if c.doc_type not in _EXCLUDE_FROM_MATCHING
    ]
    if not trusted_corpora:
        logger.warning(
            "No trusted corpora available — all %d corpus/corpora are excluded "
            "(types: %s). Provenance will be empty.",
            len(corpora),
            [c.doc_type for c in corpora],
        )
        return []

    # LLM-supplied verbatim source quotes: field_path → raw text phrase.
    # These are always preferred over the normalised extracted value because
    # the LLM copies them directly from the document (e.g. "15/04/2026 at 00:00
    # hours" rather than the ISO "2026-04-15T00:00:00" we store in the record).
    citation_map: dict[str, str] = dict(getattr(record, "field_citations", None) or {})
    logger.info("  field_citations from LLM: %d entries", len(citation_map))

    results: list[FieldProvenance] = []
    citation_hits = 0
    # Track assigned positions to avoid two fields pointing to the same corpus item.
    # Key: (source_filename, page, x0, y0) — unpadded, original corpus position.
    used_positions: set[tuple] = set()

    for field_path, value_str in _walk_record(record):
        # Drop only a trailing list index such as "[0]". A plain
        # strip("[]0123456789") would also eat digits that belong to the
        # field name itself (e.g. "children_under_16").
        leaf = re.sub(r"(\[\d+\])+$", "", field_path.split(".")[-1])
        if leaf in _SKIP_LEAF_NAMES:
            continue

        # Prefer the verbatim citation quote; fall back to the normalised value.
        # For ISO dates/datetimes also try UK DD/MM/YYYY format as a secondary fallback.
        search_str = citation_map.get(field_path, value_str)
        alt_search: str | None = None
        if field_path not in citation_map:
            alt_search = _iso_to_uk_date(value_str)

        if len(search_str) < _MIN_VALUE_LEN:
            continue

        using_citation = field_path in citation_map
        # When matching a citation quote use partial_ratio — the quote is a
        # verbatim substring of the document and WRatio penalises length disparity.
        # For normalised fallback values use WRatio to avoid short false matches.
        score_fn = rfuzz.partial_ratio if using_citation else rfuzz.WRatio
        threshold = _CITATION_THRESHOLD if using_citation else _MATCH_THRESHOLD

        # Find best match, preferring positions not yet assigned to another field.
        best_score = 0
        best_item: CorpusItem | None = None
        best_unused_score = 0
        best_unused_item: CorpusItem | None = None

        for corpus in trusted_corpora:
            for item in corpus.items:
                score = score_fn(search_str.lower(), item.text.lower())
                # Also try UK-formatted date if available
                if alt_search and score < threshold:
                    alt_score = rfuzz.partial_ratio(alt_search, item.text.lower())
                    if alt_score > score:
                        score = alt_score
                pos_key = (item.source_filename, item.page, item.bbox[0], item.bbox[1])
                if score > best_score:
                    best_score = score
                    best_item = item
                if score > best_unused_score and pos_key not in used_positions:
                    best_unused_score = score
                    best_unused_item = item

        # Prefer an unused position if it scores at or above threshold,
        # otherwise fall back to best overall (may share a location).
        if best_unused_item is not None and best_unused_score >= threshold:
            chosen_item = best_unused_item
            chosen_score = best_unused_score
        elif best_item is not None and best_score >= threshold:
            chosen_item = best_item
            chosen_score = best_score
        else:
            continue

        pos_key = (chosen_item.source_filename, chosen_item.page, chosen_item.bbox[0], chosen_item.bbox[1])
        used_positions.add(pos_key)

        if using_citation:
            citation_hits += 1
        results.append(FieldProvenance(
            field_path=field_path,
            extracted_value=value_str,
            matched_text=chosen_item.text[:200],  # truncate very long table blobs
            match_score=round(chosen_score / 100.0, 3),
            source_filename=chosen_item.source_filename,
            location=Location(
                page=chosen_item.page,
                bbox=_padded_bbox(chosen_item.bbox),
            ),
        ))

    total = _count_total_fields(record)
    logger.info(
        "Provenance: %d / %d fields located (%d via citation quotes, %d via fuzzy fallback) "
        "— trusted corpora: %s",
        len(results), total,
        citation_hits, len(results) - citation_hits,
        [c.source_filename for c in trusted_corpora],
    )
    return results
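The "prefer an unused position" rule in `build_provenance` is easier to see in isolation. The sketch below is a simplified illustration, not the repository's code: it swaps rapidfuzz for the standard library's `difflib.SequenceMatcher` as a stand-in scorer, uses a list index in place of the `(source_filename, page, x0, y0)` key, and the names `score`, `pick` and the threshold of 80 are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def score(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (difflib here, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick(search: str, items: list[str], used: set[int], threshold: float = 80.0) -> Optional[int]:
    """Return the index of the chosen item, preferring items no field has claimed yet."""
    best: Optional[int] = None
    best_unused: Optional[int] = None
    best_score = best_unused_score = 0.0
    for idx, text in enumerate(items):
        s = score(search, text)
        if s > best_score:
            best, best_score = idx, s
        if s > best_unused_score and idx not in used:
            best_unused, best_unused_score = idx, s
    if best_unused is not None and best_unused_score >= threshold:
        chosen = best_unused
    elif best is not None and best_score >= threshold:
        chosen = best  # every good match is already claimed, so share a location
    else:
        return None
    used.add(chosen)
    return chosen


items = ["£250.00", "£250.00", "Comprehensive"]
used: set[int] = set()
first = pick("£250.00", items, used)       # claims the first occurrence
second = pick("£250.00", items, used)      # moves on to the second occurrence
third = pick("£250.00", items, used)       # both claimed: falls back to best overall
missing = pick("Third Party", items, set())
```

With two identical amounts on a page, two different fields end up highlighting two different places instead of the same one. A third field with the same value shares a location with an earlier field.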


# ---------------------------------------------------------------------------
# Field-walking helpers
# ---------------------------------------------------------------------------


def _walk_record(record: Any) -> Iterator[tuple[str, str]]:
    """Yield (field_path, string_value) for all non-None leaf values in the record."""
    data = record.model_dump(exclude_none=True)
    yield from _walk_dict(data, "")


def _walk_dict(d: dict, prefix: str) -> Iterator[tuple[str, str]]:
    for key, val in d.items():
        # Skip whole sections that produce unreliable or irrelevant matches
        top_key = prefix.split(".")[0].split("[")[0] if prefix else key
        if key in _SKIP_SECTION_NAMES or top_key in _SKIP_SECTION_NAMES:
            continue
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            yield from _walk_dict(val, path)
        elif isinstance(val, list):
            yield from _walk_list(val, path)
        elif val is not None:
            yield path, str(val)


def _walk_list(lst: list, prefix: str) -> Iterator[tuple[str, str]]:
    for i, item in enumerate(lst):
        path = f"{prefix}[{i}]"
        if isinstance(item, dict):
            yield from _walk_dict(item, path)
        elif item is not None:
            yield path, str(item)
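The path convention these helpers produce (dotted keys, `[i]` for list positions) can be shown with a small standalone sketch. It assumes a plain nested dict in place of `record.model_dump(exclude_none=True)` and leaves out the `_SKIP_SECTION_NAMES` filter. The single `walk` function and the sample data are hypothetical. Unlike `_walk_list`, this version also recurses into lists nested directly inside lists.

```python
from typing import Any, Iterator


def walk(node: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (field_path, string_value) for every non-None leaf under node."""
    if isinstance(node, dict):
        for key, val in node.items():
            yield from walk(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from walk(item, f"{prefix}[{i}]")
    elif node is not None:
        # Leaves are stringified so they can be fuzzy-matched against document text.
        yield prefix, str(node)


sample = {
    "vehicle_details": {"vrm": "AB12 XYZ", "annual_mileage": 8000},
    "driver_details": [{"name": "ALICE SMITH", "dob": None}],
}
paths = list(walk(sample))
```

The resulting paths, such as `driver_details[0].name`, have the same shape as the keys the code above looks up in `field_citations`.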


def _count_total_fields(record: Any) -> int:
    data = record.model_dump(exclude_none=True)
    return sum(1 for _ in _walk_dict(data, ""))


# ISO 8601 date/datetime patterns → UK DD/MM/YYYY
_ISO_DATE_RE = re.compile(r'^(\d{4})-(\d{2})-(\d{2})')


def _iso_to_uk_date(value: str) -> str | None:
    """Convert ISO date/datetime string to UK DD/MM/YYYY for document matching.

    Returns the UK-format string (e.g. "15/04/2026") if value looks like an
    ISO date, otherwise returns None.
    """
    m = _ISO_DATE_RE.match(value.strip())
    if m:
        yyyy, mm, dd = m.group(1), m.group(2), m.group(3)
        return f"{dd}/{mm}/{yyyy}"
    return None
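A standalone sketch of the same conversion shows which inputs it accepts. The names `ISO_DATE` and `iso_to_uk` are illustrative copies, not imports from the module. The pattern only checks the shape of the prefix, so it does not validate that the month or day is a real calendar value.

```python
import re
from typing import Optional

ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def iso_to_uk(value: str) -> Optional[str]:
    """Return DD/MM/YYYY when value starts with an ISO-style YYYY-MM-DD, else None."""
    m = ISO_DATE.match(value.strip())
    if not m:
        return None
    yyyy, mm, dd = m.groups()
    return f"{dd}/{mm}/{yyyy}"


from_date = iso_to_uk("2026-04-15")
from_iso_datetime = iso_to_uk("2026-04-15T00:00:00")
from_str_datetime = iso_to_uk("2026-04-15 00:00:00")  # shape produced by str(datetime)
not_a_date = iso_to_uk("AB12 XYZ")
```

Because only the leading date is matched, any time component is dropped, which suits documents that print the date and time separately.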


def _padded_bbox(bbox: list[float]) -> list[float]:
    """Expand a tight Docling text bbox so highlights are clearly visible in the UI."""
    x0, y0, x1, y1 = bbox
    x0 = _clamp(x0 - _BBOX_PAD_X)
    y0 = _clamp(y0 - _BBOX_PAD_Y)
    x1 = _clamp(x1 + _BBOX_PAD_X)
    y1 = _clamp(y1 + _BBOX_PAD_Y)
    # Enforce minimum height so single-line text is always visible
    if (y1 - y0) < _BBOX_MIN_H:
        mid = (y0 + y1) / 2
        y0 = _clamp(mid - _BBOX_MIN_H / 2)
        y1 = _clamp(mid + _BBOX_MIN_H / 2)
    return [round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)]
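To make the padding arithmetic concrete, here is a worked sketch with made-up constants. The real values of `_BBOX_PAD_X`, `_BBOX_PAD_Y` and `_BBOX_MIN_H` are defined elsewhere in the module and are not shown in this part of the diff, so `PAD_X`, `PAD_Y` and `MIN_H` below are purely hypothetical.

```python
PAD_X = 0.5   # hypothetical horizontal padding, in % of page width
PAD_Y = 0.25  # hypothetical vertical padding, in % of page height
MIN_H = 1.5   # hypothetical minimum highlight height, in % of page height


def clamp(v: float) -> float:
    """Keep a percentage coordinate inside the page."""
    return max(0.0, min(100.0, v))


def padded_bbox(bbox: list[float]) -> list[float]:
    """Pad [x0, y0, x1, y1] and grow very flat boxes around their vertical midpoint."""
    x0, y0, x1, y1 = bbox
    x0, y0 = clamp(x0 - PAD_X), clamp(y0 - PAD_Y)
    x1, y1 = clamp(x1 + PAD_X), clamp(y1 + PAD_Y)
    if (y1 - y0) < MIN_H:
        mid = (y0 + y1) / 2
        y0, y1 = clamp(mid - MIN_H / 2), clamp(mid + MIN_H / 2)
    return [round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)]


# A single text line in the middle of the page: padded, then grown to MIN_H.
mid_page = padded_bbox([10.0, 20.0, 30.0, 20.5])
# A line hugging the top edge: clamping at 0 can leave the box shorter than MIN_H.
top_edge = padded_bbox([0.2, 0.1, 5.0, 0.3])
```

Under these assumed constants the first box becomes `[9.5, 19.5, 30.5, 21.0]`. The second box shows that a highlight at the page edge may stay below the minimum height, because the clamp is applied after the box is grown.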
|
src/schema.py
ADDED
|
@@ -0,0 +1,205 @@
"""
schema.py — Canonical Pydantic V2 data models for UK Motor Insurance extraction.

UKMotorGoldenRecord is the top-level output produced by the pipeline.
All sub-model fields are Optional to support partial per-document extractions;
the Arbiter produces the final complete record.

DocumentType and SourceMetadata are internal provenance types excluded from
the serialised Golden Record output (source_document uses Field(exclude=True)).
"""
from __future__ import annotations

from datetime import date, datetime
from enum import Enum
from typing import Dict, List, Optional, Union

from pydantic import BaseModel, Field


# ---------------------------------------------------------------------------
# Internal provenance (not in the serialised output)
# ---------------------------------------------------------------------------


class DocumentType(str, Enum):
    """Source document classification used for provenance and priority routing."""

    SCHEDULE = "Schedule"
    CERTIFICATE = "Certificate"
    STATEMENT_OF_FACT = "StatementOfFact"
    POLICY_BOOKLET = "PolicyBooklet"
    UNKNOWN = "Unknown"


class SourceMetadata(BaseModel):
    """Attached to every extraction so the arbiter can trace data lineage."""

    document_type: DocumentType = DocumentType.UNKNOWN
    filename: str = ""
    page_count: Optional[int] = None


# ---------------------------------------------------------------------------
# Golden Record sub-models
# ---------------------------------------------------------------------------


class PeriodOfCover(BaseModel):
    start_date: Optional[datetime] = None
    expiry_date: Optional[datetime] = None
    issue_date: Optional[date] = None


class PolicyHeader(BaseModel):
    policy_number: Optional[str] = None
    insurer: Optional[str] = None
    product_name: Optional[str] = None
    period_of_cover: Optional[PeriodOfCover] = None


class SecurityDetails(BaseModel):
    has_security_device: Optional[bool] = None
    tracker_fitted: Optional[bool] = None
    modifications: Optional[str] = None


class VehicleDetails(BaseModel):
    vrm: Optional[str] = None
    make: Optional[str] = None
    model: Optional[str] = None
    fuel_type: Optional[str] = None
    transmission: Optional[str] = None
    estimated_value: Optional[str] = None
    annual_mileage: Optional[int] = None
    overnight_postcode: Optional[str] = None
    kept_location: Optional[str] = None
    security: Optional[SecurityDetails] = None


class Driver(BaseModel):
    name: str
    dob: Optional[date] = None
    relationship: Optional[str] = None
    occupation: Optional[str] = None
    license_type: Optional[str] = None
    is_main_driver: bool = False
    specific_excess: Optional[float] = None


class NoClaimsDiscount(BaseModel):
    years: Optional[int] = None
    protected: Optional[bool] = None


class ExcessBreakdown(BaseModel):
    standard_compulsory: Optional[float] = None
    voluntary: Optional[float] = None
    total_accidental_damage: Optional[float] = None
    fire: Optional[float] = None
    theft: Optional[float] = None
    windscreen_repair: Optional[float] = None
    windscreen_replacement: Optional[float] = None
    own_repairer_additional_excess: Optional[float] = None


class CoverAndExcesses(BaseModel):
    cover_type: Optional[str] = None
    class_of_use: Optional[str] = None
    driving_other_cars: Optional[bool] = None
    no_claims_discount: Optional[NoClaimsDiscount] = None
    excess_breakdown: Optional[ExcessBreakdown] = None


class OptionalExtras(BaseModel):
    motor_legal_protection: Optional[Union[float, str]] = None
    breakdown_roadside_assistance: Optional[Union[float, str]] = None
    enhanced_personal_accident: Optional[Union[float, str]] = None
    hire_car: Optional[Union[float, str]] = None
    key_cover: Optional[Union[float, str]] = None


class FinancialSummary(BaseModel):
    total_annual_premium: Optional[float] = None
    optional_extras: Optional[OptionalExtras] = None


class AdditionalRiskData(BaseModel):
    home_ownership: Optional[str] = None
    children_under_16: Optional[bool] = None
    number_of_cars_in_household: Optional[int] = None
    non_motoring_convictions: Optional[bool] = None
    endorsements: Optional[str] = None


# ---------------------------------------------------------------------------
# Top-level Golden Record
# ---------------------------------------------------------------------------


class UKMotorGoldenRecord(BaseModel):
    """
    Final authoritative policy record produced by the Arbiter.

    All section fields are Optional so that partial per-document extractions
    remain valid Pydantic objects.  source_document is internal provenance
    and is excluded from model_dump_json().
    """

    policy_header: Optional[PolicyHeader] = None
    vehicle_details: Optional[VehicleDetails] = None
    driver_details: List[Driver] = Field(default_factory=list)
    cover_and_excesses: Optional[CoverAndExcesses] = None
    financial_summary: Optional[FinancialSummary] = None
    additional_risk_data: Optional[AdditionalRiskData] = None

    # Verbatim source quotes for provenance matching.
    # The LLM populates this mapping field_path → exact phrase copied from the document.
    # Used by provenance.py to locate each field in the PDF even when the extracted
    # value has been normalised (ISO dates, £ amounts, etc.).
    # Excluded from the final serialised output so it doesn't appear in downstream JSON.
    field_citations: Optional[Dict[str, str]] = Field(default=None, exclude=True)

    # Internal provenance — excluded from serialised output
    source_document: Optional[SourceMetadata] = Field(default=None, exclude=True)


# ---------------------------------------------------------------------------
# Provenance and Human-in-the-Loop review models
# ---------------------------------------------------------------------------


class Location(BaseModel):
    """Geometric location of a field's source text, in browser % coords (top-left origin)."""

    page: int
    bbox: List[float]  # [x0%, y0%, x1%, y1%]


class FieldProvenance(BaseModel):
    """Maps one Golden Record field to its source text element in the PDF."""

    field_path: str        # e.g. "vehicle_details.vrm"
    extracted_value: str   # the value produced by the LLM
    matched_text: str      # the corpus snippet that best matches it
    match_score: float     # 0.0–1.0 (1.0 = perfect)
    source_filename: str   # which PDF this came from
    location: Location     # page + bbox in browser % coords


class ConflictEntry(BaseModel):
    """Records a field where Schedule and Certificate held different values."""

    field: str  # dotted field path, e.g. "policy_header.policy_number"
    schedule_value: Optional[str] = None
    certificate_value: Optional[str] = None
    winner: str  # "schedule" | "certificate" | "fallback"


class GoldenRecordWithProvenance(BaseModel):
    """Full pipeline output for the Visual Audit Review UI."""

    record: UKMotorGoldenRecord
    provenance: List[FieldProvenance] = Field(default_factory=list)
    conflicts: List[ConflictEntry] = Field(default_factory=list)
    session_id: Optional[str] = None
|
src/settings.py
ADDED
|
@@ -0,0 +1,142 @@
"""
settings.py — Pipeline configuration loader.

Merges values from config/settings.yaml with environment variable overrides.
Also calls load_dotenv() so importing this module anywhere in the pipeline
is sufficient to activate .env — no separate setup needed.

Precedence (highest → lowest)
──────────────────────────────
1. Environment variables (GROQ_MODEL, etc.)
2. config/settings.yaml
3. Pydantic model field defaults (safety net)

Usage
-----
    from settings import settings

    model   = settings.llm.model        # respects GROQ_MODEL env var
    retries = settings.llm.max_retries
    thresh  = settings.pii.score_threshold
"""
from __future__ import annotations

import logging
import os
from pathlib import Path
from typing import Optional

import yaml
from dotenv import load_dotenv
from pydantic import BaseModel, Field

# Load .env file before anything else reads os.environ
load_dotenv()

logger = logging.getLogger(__name__)

_DEFAULT_CONFIG_PATH = Path(__file__).parent.parent / "config" / "settings.yaml"

# ---------------------------------------------------------------------------
# Sub-models
# ---------------------------------------------------------------------------

_DEFAULT_ENTITIES = [
    "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS",
    "UK_NHS", "UK_NIN", "CREDIT_CARD", "IBAN_CODE",
    "LOCATION", "IP_ADDRESS", "URL",
]


class LLMSettings(BaseModel):
    model: str = "llama-3.3-70b-versatile"
    classifier_model: str = "llama-3.1-8b-instant"
    max_retries: int = 2


class PIISettings(BaseModel):
    score_threshold: float = 0.5
    mask_dates: bool = False
    language: str = "en"
    entities: list[str] = Field(default_factory=lambda: list(_DEFAULT_ENTITIES))


class PipelineSettings(BaseModel):
    output_path: str = "./output/golden_record.json"
    log_level: str = "INFO"
    session_ttl_days: int = 30  # sessions older than this are removed on API startup (0 = disabled)


class DebugSettings(BaseModel):
    enabled: bool = True
    output_dir: str = "./output/debug"
    save_markdown: bool = True
    save_masked_markdown: bool = True
    save_extraction_json: bool = True
    save_metrics: bool = True


class DoclingSettings(BaseModel):
    do_ocr: bool = False
    do_table_structure: bool = False
    # Per-document-type page caps (None = no limit)
    max_pages: dict[str, int | None] = Field(
        default_factory=lambda: {
            "Schedule": None,
            "Certificate": None,
            "StatementOfFact": None,
            "PolicyBooklet": 20,
            "Unknown": 30,
        }
    )


class Settings(BaseModel):
    llm: LLMSettings = Field(default_factory=LLMSettings)
    pii: PIISettings = Field(default_factory=PIISettings)
    pipeline: PipelineSettings = Field(default_factory=PipelineSettings)
    debug: DebugSettings = Field(default_factory=DebugSettings)
    docling: DoclingSettings = Field(default_factory=DoclingSettings)

    @classmethod
    def load(cls, config_path: Optional[str | Path] = None) -> "Settings":
        """
        Load settings from YAML, then apply environment variable overrides.

        Parameters
        ----------
        config_path : str | Path | None
            Path to a settings YAML file.  Defaults to config/settings.yaml.
        """
        path = Path(config_path) if config_path else _DEFAULT_CONFIG_PATH
        data: dict = {}

        if path.exists():
            with path.open(encoding="utf-8") as fh:
                data = yaml.safe_load(fh) or {}
            logger.debug("Settings loaded from %s", path)
        else:
            logger.warning(
                "Settings file not found at %s — using defaults.", path
            )

        instance = cls.model_validate(data)

        # ── Environment variable overrides ─────────────────────────────────
        # GROQ_MODEL wins over both settings.yaml and the Pydantic default.
        if groq_model := os.environ.get("GROQ_MODEL"):
            instance.llm.model = groq_model
            logger.debug("LLM model overridden by GROQ_MODEL env var: %s", groq_model)

        if classifier_model := os.environ.get("GROQ_CLASSIFIER_MODEL"):
            instance.llm.classifier_model = classifier_model
            logger.debug("Classifier model overridden by GROQ_CLASSIFIER_MODEL env var: %s", classifier_model)

        return instance
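The three-level precedence (environment variable, then YAML, then model default) can be sketched with plain dictionaries. This is a simplified illustration that avoids Pydantic, PyYAML and the real process environment. The function name `resolve_llm_settings` is hypothetical, and the sample override values are invented.

```python
DEFAULTS = {"model": "llama-3.3-70b-versatile", "max_retries": 2}


def resolve_llm_settings(yaml_data: dict, env: dict) -> dict:
    """Merge defaults, then YAML values, then the GROQ_MODEL override (highest priority)."""
    merged = {**DEFAULTS, **(yaml_data.get("llm") or {})}
    # Like the walrus check in Settings.load, an empty string counts as "not set".
    if env.get("GROQ_MODEL"):
        merged["model"] = env["GROQ_MODEL"]
    return merged


defaults_only = resolve_llm_settings({}, {})
yaml_wins = resolve_llm_settings({"llm": {"model": "yaml-model"}}, {})
env_wins = resolve_llm_settings({"llm": {"model": "yaml-model"}}, {"GROQ_MODEL": "env-model"})
empty_env = resolve_llm_settings({"llm": {"model": "yaml-model"}}, {"GROQ_MODEL": ""})
```

Passing the environment in as a dict keeps the sketch testable. In the real loader the same role is played by `os.environ` after `load_dotenv()` has run.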


# ---------------------------------------------------------------------------
# Module-level singleton — import this everywhere
# ---------------------------------------------------------------------------

settings = Settings.load()
|
tests/__init__.py
ADDED
|
File without changes
|
tests/test_arbiter.py
ADDED
|
@@ -0,0 +1,303 @@
"""
tests/test_arbiter.py — Unit tests for PolicyArbiter.

These tests exercise the merge logic in isolation using pure fixture data,
with no LLM calls or file I/O.  Run with:

    pytest tests/test_arbiter.py -v

(From project root with the virtual-env activated.)
"""
from __future__ import annotations

import sys
from pathlib import Path

# Allow importing from src/ without installing the package
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

import pytest

from arbiter import PolicyArbiter
from schema import (
    AdditionalRiskData,
    ConflictEntry,
    CoverAndExcesses,
    Driver,
    ExcessBreakdown,
    FinancialSummary,
    NoClaimsDiscount,
    OptionalExtras,
    PeriodOfCover,
    PolicyHeader,
    UKMotorGoldenRecord,
    VehicleDetails,
)


# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------

def _make_schedule(
    policy_number: str = "POL-001",
    insurer: str = "TestInsurer Ltd",
    cover_type: str = "Comprehensive",
    ncb_years: int = 3,
    class_of_use: str | None = None,
    drivers: list[dict] | None = None,
    excess_compulsory: float = 250.0,
    excess_voluntary: float = 150.0,
    premium: float = 600.0,
    vrm: str = "AB12 XYZ",
) -> UKMotorGoldenRecord:
    drv_list = [
        Driver(**d) for d in (drivers or [{"name": "ALICE SMITH", "is_main_driver": True}])
    ]
    return UKMotorGoldenRecord(
        policy_header=PolicyHeader(policy_number=policy_number, insurer=insurer),
        vehicle_details=VehicleDetails(vrm=vrm, make="Toyota", model="Corolla"),
        driver_details=drv_list,
        cover_and_excesses=CoverAndExcesses(
            cover_type=cover_type,
            class_of_use=class_of_use,
            no_claims_discount=NoClaimsDiscount(years=ncb_years, protected=False),
            excess_breakdown=ExcessBreakdown(
                standard_compulsory=excess_compulsory,
                voluntary=excess_voluntary,
                total_accidental_damage=excess_compulsory + excess_voluntary,
            ),
        ),
        financial_summary=FinancialSummary(
            total_annual_premium=premium,
            optional_extras=OptionalExtras(),
        ),
        additional_risk_data=AdditionalRiskData(home_ownership="Owned"),
    )


def _make_certificate(
    policy_number: str = "POL-001",
    class_of_use: str = "Social, Domestic and Pleasure",
    driving_other_cars: bool = False,
    drivers: list[dict] | None = None,
    insurer: str | None = None,
) -> UKMotorGoldenRecord:
    drv_list = [
        Driver(**d) for d in (drivers or [{"name": "ALICE SMITH", "is_main_driver": True}])
    ]
    return UKMotorGoldenRecord(
        policy_header=PolicyHeader(
            policy_number=policy_number,
            insurer=insurer,
        ),
        driver_details=drv_list,
        cover_and_excesses=CoverAndExcesses(
            class_of_use=class_of_use,
            driving_other_cars=driving_other_cars,
        ),
    )
|
| 100 |
+
|
| 101 |
+
|
+# ---------------------------------------------------------------------------
+# Basic merge tests
+# ---------------------------------------------------------------------------
+
+class TestBasicMerge:
+    def test_returns_tuple_with_conflicts_list(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule()
+        cert = _make_certificate()
+        result = arbiter.merge_records(sched, "sched.pdf", cert, "cert.pdf")
+        assert isinstance(result, tuple)
+        golden, conflicts = result
+        assert isinstance(conflicts, list)
+
+    def test_vehicle_details_from_schedule(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(vrm="AB12 XYZ")
+        cert = _make_certificate()
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.vehicle_details is not None
+        assert golden.vehicle_details.vrm == "AB12 XYZ"
+
+    def test_class_of_use_from_certificate(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(class_of_use="Social")  # schedule has one
+        cert = _make_certificate(class_of_use="Social, Domestic and Pleasure")  # cert is master
+        golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.cover_and_excesses.class_of_use == "Social, Domestic and Pleasure"
+
+    def test_cover_type_from_schedule(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(cover_type="Comprehensive")
+        cert = _make_certificate()
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.cover_and_excesses.cover_type == "Comprehensive"
+
+    def test_financial_summary_from_schedule(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(premium=750.0)
+        cert = _make_certificate()
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.financial_summary.total_annual_premium == 750.0
+
+    def test_additional_risk_data_from_schedule(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule()
+        cert = _make_certificate()
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.additional_risk_data.home_ownership == "Owned"
+
+
+# ---------------------------------------------------------------------------
+# One-sided merge (missing Schedule or Certificate)
+# ---------------------------------------------------------------------------
+
+class TestOneSidedMerge:
+    def test_empty_schedule_uses_certificate_drivers(self):
+        arbiter = PolicyArbiter()
+        sched = UKMotorGoldenRecord()  # empty
+        cert = _make_certificate(
+            drivers=[{"name": "BOB JONES", "is_main_driver": True}]
+        )
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert len(golden.driver_details) == 1
+        assert golden.driver_details[0].name == "BOB JONES"
+
+    def test_empty_certificate_still_merges(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule()
+        cert = UKMotorGoldenRecord()  # empty
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.vehicle_details is not None
+        assert golden.cover_and_excesses is not None
+
+    def test_policy_number_fallback_to_certificate(self):
+        arbiter = PolicyArbiter()
+        sched = UKMotorGoldenRecord(policy_header=PolicyHeader(policy_number=None))
+        cert = _make_certificate(policy_number="CERT-999")
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.policy_header.policy_number == "CERT-999"
+
+
+# ---------------------------------------------------------------------------
+# Conflict detection
+# ---------------------------------------------------------------------------
+
+class TestConflictDetection:
+    def test_no_conflicts_when_values_match(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(policy_number="POL-001", insurer="Insurer A")
+        cert = _make_certificate(policy_number="POL-001")
+        _, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        policy_number_conflicts = [c for c in conflicts if c.field == "policy_header.policy_number"]
+        assert policy_number_conflicts == []
+
+    def test_conflict_logged_for_differing_policy_numbers(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(policy_number="POL-001")
+        cert = _make_certificate(policy_number="POL-002")
+        golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        conflict_fields = [c.field for c in conflicts]
+        assert "policy_header.policy_number" in conflict_fields
+        # Schedule wins
+        assert golden.policy_header.policy_number == "POL-001"
+
+    def test_conflict_entry_has_both_values(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(policy_number="SCHED-100")
+        cert = _make_certificate(policy_number="CERT-200")
+        _, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        c = next(x for x in conflicts if x.field == "policy_header.policy_number")
+        assert c.schedule_value == "SCHED-100"
+        assert c.certificate_value == "CERT-200"
+        assert c.winner == "schedule"
+
+    def test_class_of_use_conflict_certificate_wins(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(class_of_use="Social Only")
+        cert = _make_certificate(class_of_use="Social, Domestic and Pleasure")
+        golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        c = next((x for x in conflicts if x.field == "cover_and_excesses.class_of_use"), None)
+        assert c is not None
+        assert c.winner == "certificate"
+        assert golden.cover_and_excesses.class_of_use == "Social, Domestic and Pleasure"
+
+
+# ---------------------------------------------------------------------------
+# Driver merging
+# ---------------------------------------------------------------------------
+
+class TestDriverMerge:
+    def test_exact_name_match_enriches_driver(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(
+            drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "dob": None, "relationship": None}]
+        )
+        cert = _make_certificate(
+            drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "relationship": "Proposer"}]
+        )
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.driver_details[0].relationship == "Proposer"
+
+    def test_fuzzy_name_match_merges(self):
+        """Names with minor differences (e.g. missing middle initial) should still match."""
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(
+            drivers=[{"name": "ALICE J SMITH", "is_main_driver": True}]
+        )
+        cert = _make_certificate(
+            drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "relationship": "Proposer"}]
+        )
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        assert golden.driver_details[0].relationship == "Proposer"
+
+    def test_unmatched_driver_has_no_cert_enrichment(self):
+        """A driver with a completely different name gets no cert data."""
+        arbiter = PolicyArbiter()
+        sched = _make_schedule(
+            drivers=[{"name": "ALICE SMITH", "is_main_driver": True}]
+        )
+        cert = _make_certificate(
+            drivers=[{"name": "BOB JONES", "is_main_driver": True, "relationship": "Spouse"}]
+        )
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        alice = golden.driver_details[0]
+        assert alice.name == "ALICE SMITH"
+        assert alice.relationship is None  # no cert match, so no enrichment
+
+
+# ---------------------------------------------------------------------------
+# field_citations merging
+# ---------------------------------------------------------------------------
+
+class TestFieldCitationsMerge:
+    def test_schedule_citations_win_on_conflict(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule()
+        cert = _make_certificate()
+        sched.field_citations = {
+            "vehicle_details.vrm": "AB12 XYZ",
+            "policy_header.policy_number": "POL-001 from schedule",
+        }
+        cert.field_citations = {
+            "policy_header.policy_number": "POL-001 from cert",
+            "cover_and_excesses.class_of_use": "Social, Domestic and Pleasure",
+        }
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        fc = golden.field_citations or {}
+        # Schedule wins the shared key
+        assert fc.get("policy_header.policy_number") == "POL-001 from schedule"
+        # Cert-only key survives
+        assert fc.get("cover_and_excesses.class_of_use") == "Social, Domestic and Pleasure"
+        # Schedule-only key survives
+        assert fc.get("vehicle_details.vrm") == "AB12 XYZ"
+
+    def test_empty_citations_produce_none(self):
+        arbiter = PolicyArbiter()
+        sched = _make_schedule()
+        cert = _make_certificate()
+        golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+        # Neither side has citations → merged record has None
+        assert golden.field_citations is None
|
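Note on tests/test_arbiter.py: the tests pin down several merge rules, but `src/arbiter.py` itself is not reproduced in this part of the diff. The sketch below is one way those rules could be written, under stated assumptions, and is not the `PolicyArbiter` implementation. The names `Conflict`, `CERTIFICATE_MASTER_FIELDS`, `resolve_field`, `merge_citations` and `names_match`, the 0.8 threshold, and the use of `difflib` as the fuzzy matcher are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Any, Optional

# Assumption drawn from the tests: the certificate is the master for class of use,
# and the schedule wins every other disputed field.
CERTIFICATE_MASTER_FIELDS = {"cover_and_excesses.class_of_use"}


@dataclass
class Conflict:
    """Hypothetical conflict record mirroring the attributes the tests read."""
    field: str
    schedule_value: Any
    certificate_value: Any
    winner: str


def resolve_field(field: str, schedule_value: Any, certificate_value: Any):
    """Return (chosen_value, conflict_or_None) for one field path."""
    # One-sided values fall through without a conflict.
    if schedule_value is None:
        return certificate_value, None
    if certificate_value is None or schedule_value == certificate_value:
        return schedule_value, None
    winner = "certificate" if field in CERTIFICATE_MASTER_FIELDS else "schedule"
    chosen = certificate_value if winner == "certificate" else schedule_value
    return chosen, Conflict(field, schedule_value, certificate_value, winner)


def merge_citations(schedule: Optional[dict], certificate: Optional[dict]) -> Optional[dict]:
    """Merge two citation dicts; schedule entries win on shared keys."""
    # Later unpacking overrides earlier, so the schedule takes precedence.
    merged = {**(certificate or {}), **(schedule or {})}
    # An empty result collapses to None.
    return merged or None


def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy, case-insensitive comparison of two driver names."""
    ratio = SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(resolve_field("policy_header.policy_number", "SCHED-100", "CERT-200"))
    print(merge_citations({"a": "from schedule"}, {"a": "from cert", "b": "from cert"}))
    print(names_match("ALICE J SMITH", "ALICE SMITH"))
```

Keeping the precedence table as data (`CERTIFICATE_MASTER_FIELDS`) rather than branching per field makes it easy to extend if more certificate-master fields turn up. Whether the real arbiter does this is not visible from the tests.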
ui/index.html
ADDED
|
@@ -0,0 +1,13 @@
+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>PolicyTrace — Motor Insurance IDP · AI Tool Stack</title>
+  </head>
+  <body>
+    <div id="root"></div>
+    <script type="module" src="/src/main.tsx"></script>
+  </body>
+</html>
ui/package-lock.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
ui/package.json
ADDED
|
@@ -0,0 +1,29 @@
+{
+  "name": "motor-policy-review-ui",
+  "private": true,
+  "version": "0.1.0",
+  "type": "module",
+  "scripts": {
+    "dev": "vite",
+    "build": "tsc && vite build",
+    "preview": "vite preview"
+  },
+  "dependencies": {
+    "axios": "^1.7.2",
+    "react": "^18.3.0",
+    "react-dom": "^18.3.0",
+    "react-pdf": "^9.1.0",
+    "react-router-dom": "^7.15.1",
+    "zustand": "^4.5.2"
+  },
+  "devDependencies": {
+    "@types/react": "^18.3.3",
+    "@types/react-dom": "^18.3.0",
+    "@vitejs/plugin-react": "^4.3.1",
+    "autoprefixer": "^10.4.19",
+    "postcss": "^8.4.39",
+    "tailwindcss": "^3.4.6",
+    "typescript": "^5.5.3",
+    "vite": "^5.3.4"
+  }
+}
ui/postcss.config.js
ADDED
|
@@ -0,0 +1,6 @@
+export default {
+  plugins: {
+    tailwindcss: {},
+    autoprefixer: {},
+  },
+}
ui/src/App.tsx
ADDED
|
@@ -0,0 +1,16 @@
+import { Route, Routes } from 'react-router-dom'
+import { UploadPage } from './UploadPage'
+import { SessionPage } from './SessionPage'
+
+export default function App() {
+  return (
+    <Routes>
+      <Route path="/" element={<UploadPage />} />
+      <Route path="/session/:sessionId" element={<SessionPage />} />
+      {/* Catch-all: render the upload page for unknown paths (no redirect) */}
+      <Route path="*" element={<UploadPage />} />
+    </Routes>
+  )
+}
+
+
ui/src/FieldRow.tsx
ADDED
|
@@ -0,0 +1,201 @@
+import { type CSSProperties, useState } from 'react'
+import type { FieldEntry, FieldReview } from './types'
+import { useStore } from './store'
+
+interface Props {
+  entry: FieldEntry
+  sessionId: string
+  isActive: boolean
+  review?: FieldReview
+  onClick: () => void
+}
+
+export function FieldRow({ entry, sessionId, isActive, review, onClick }: Props) {
+  const [editing, setEditing] = useState(false)
+  const [editValue, setEditValue] = useState(entry.value ?? '')
+  const verifyField = useStore((s) => s.verifyField)
+  const overrideField = useStore((s) => s.overrideField)
+  const rejectField = useStore((s) => s.rejectField)
+
+  const displayValue = review?.action === 'override' && review.overridden_value != null
+    ? review.overridden_value
+    : entry.value
+
+  const isVerified = review?.action === 'verify'
+  const isRejected = review?.action === 'reject'
+  const isOverridden = review?.action === 'override'
+
+  const borderStyle: CSSProperties = isVerified
+    ? { borderColor: '#16a34a', backgroundColor: '#f0fdf4' }
+    : isRejected
+      ? { borderColor: '#fca5a5', backgroundColor: '#fef2f2' }
+      : isOverridden
+        ? { borderColor: '#2563EB', backgroundColor: '#eff6ff' }
+        : isActive
+          ? { borderColor: '#008080', backgroundColor: '#f0fdfc' }
+          : { borderColor: 'transparent', backgroundColor: '#ffffff' }
+
+  const handleSaveOverride = async () => {
+    await overrideField(sessionId, entry.fieldPath, editValue)
+    setEditing(false)
+  }
+
+  return (
+    <div
+      className="rounded-lg border px-3 py-2 cursor-pointer transition-all hover:shadow-sm"
+      style={borderStyle}
+      onClick={onClick}
+    >
+      <div className="flex items-start gap-2">
+        {/* Label + value */}
+        <div className="flex-1 min-w-0">
+          <div className="flex items-center gap-2 flex-wrap">
+            <span className="text-xs font-semibold text-gray-500 shrink-0">
+              {entry.label}
+            </span>
+            {isVerified && (
+              <span className="inline-flex items-center gap-0.5 text-xs text-green-700 font-medium">
+                <CheckIcon /> Verified
+              </span>
+            )}
+            {isOverridden && (
+              <span className="text-xs text-blue-700 font-medium">Overridden</span>
+            )}
+            {isRejected && (
+              <span className="text-xs text-red-600 font-medium">Flagged</span>
+            )}
+          </div>
+
+          {/* Value */}
+          {editing ? (
+            <div
+              className="flex gap-2 mt-1"
+              onClick={(e) => e.stopPropagation()}
+            >
+              <input
+                autoFocus
+                className="flex-1 text-xs border rounded px-2 py-1 focus:outline-none focus:ring-1 focus:ring-blue-400"
+                value={editValue}
+                onChange={(e) => setEditValue(e.target.value)}
+                onKeyDown={(e) => {
+                  if (e.key === 'Enter') handleSaveOverride()
+                  if (e.key === 'Escape') setEditing(false)
+                }}
+              />
+              <button
+                onClick={handleSaveOverride}
+                className="text-xs px-2 py-1 bg-blue-600 text-white rounded hover:bg-blue-700"
+              >
+                Save
+              </button>
+              <button
+                onClick={() => setEditing(false)}
+                className="text-xs px-2 py-1 bg-gray-200 text-gray-700 rounded hover:bg-gray-300"
+              >
+                Cancel
+              </button>
+            </div>
+          ) : (
+            <p className="text-sm text-gray-800 mt-0.5 truncate">
+              {displayValue ?? (
+                <span className="text-gray-300 italic">Not extracted</span>
+              )}
+            </p>
+          )}
+
+          {/* Provenance source hint — or explicit "no location" notice */}
+          {!editing && (
+            entry.provenance ? (
+              <p className="text-xs text-gray-400 mt-0.5 truncate">
+                {entry.provenance.source_filename} · p.{entry.provenance.location.page} ·{' '}
+                <span className="italic">"{entry.provenance.matched_text.slice(0, 60)}{entry.provenance.matched_text.length > 60 ? '…' : ''}"</span>
+              </p>
+            ) : (
+              <p className="text-xs mt-0.5">
+                <span className="inline-flex items-center gap-1 px-1.5 py-0.5 rounded bg-gray-100 text-gray-400 font-medium">
+                  <span aria-hidden>—</span> No location data
+                </span>
+              </p>
+            )
+          )}
+        </div>
+
+        {/* Right side: confidence badge + action buttons */}
+        <div
+          className="flex items-center gap-1 flex-shrink-0"
+          onClick={(e) => e.stopPropagation()}
+        >
+          {entry.provenance && (
+            <ConfidenceBadge score={entry.provenance.match_score} />
+          )}
+
+          {/* Verify */}
+          <button
+            title="Mark as verified"
+            onClick={() => verifyField(sessionId, entry.fieldPath)}
+            className={`w-7 h-7 rounded flex items-center justify-center text-sm transition-colors ${
+              isVerified
+                ? 'bg-green-500 text-white'
+                : 'bg-gray-100 text-gray-500 hover:bg-green-100 hover:text-green-700'
+            }`}
+          >
+            ✓
+          </button>
+
+          {/* Edit */}
+          <button
+            title="Override value"
+            onClick={() => {
+              setEditValue(displayValue ?? '')
+              setEditing(true)
+            }}
+            className="w-7 h-7 rounded flex items-center justify-center text-sm transition-colors"
+            style={{ backgroundColor: '#f3f4f6', color: '#6b7280' }}
+            onMouseEnter={e => { (e.currentTarget as HTMLElement).style.backgroundColor = '#eff6ff'; (e.currentTarget as HTMLElement).style.color = '#2563EB' }}
+            onMouseLeave={e => { (e.currentTarget as HTMLElement).style.backgroundColor = '#f3f4f6'; (e.currentTarget as HTMLElement).style.color = '#6b7280' }}
+          >
+            ✎
+          </button>
+
+          {/* Flag */}
+          <button
+            title="Flag for review"
+            onClick={() => rejectField(sessionId, entry.fieldPath)}
+            className={`w-7 h-7 rounded flex items-center justify-center text-sm transition-colors ${
+              isRejected
+                ? 'bg-red-500 text-white'
+                : 'bg-gray-100 text-gray-500 hover:bg-red-100 hover:text-red-600'
+            }`}
+          >
+            ⚑
+          </button>
+        </div>
+      </div>
+    </div>
+  )
+}
+
+function ConfidenceBadge({ score }: { score: number }) {
+  const pct = Math.round(score * 100)
+  const colorClasses =
+    pct >= 90
+      ? 'bg-green-100 text-green-700'
+      : pct >= 70
+        ? 'bg-yellow-100 text-yellow-700'
+        : 'bg-red-100 text-red-600'
+
+  return (
+    <span className={`text-xs font-mono px-1.5 py-0.5 rounded ${colorClasses}`}>
+      {pct}%
+    </span>
+  )
+}
+
+function CheckIcon() {
+  return (
+    <svg className="w-3 h-3" viewBox="0 0 12 12" fill="currentColor">
+      <path d="M10 3L5 8.5 2 5.5" stroke="currentColor" strokeWidth="1.5"
+        strokeLinecap="round" strokeLinejoin="round" fill="none" />
+    </svg>
+  )
+}
ui/src/PDFPane.tsx
ADDED
|
@@ -0,0 +1,229 @@
+import { useCallback, useEffect, useMemo, useRef, useState } from 'react'
+import { Document, Page } from 'react-pdf'
+import type { FieldProvenance } from './types'
+import { useStore } from './store'
+import { api } from './api'
+
+interface Props {
+  sessionId: string
+}
+
+export function PDFPane({ sessionId }: Props) {
+  const sessionData = useStore((s) => s.sessionData)
+  const activePdfFile = useStore((s) => s.activePdfFile)
+  const activeProvenance = useStore((s) => s.activeProvenance)
+  const setActivePdf = useStore((s) => s.setActivePdf)
+
+  const [numPages, setNumPages] = useState(0)
+  const [renderedPages, setRenderedPages] = useState<Set<number>>(new Set())
+  const [containerWidth, setContainerWidth] = useState(600)
+  const containerRef = useRef<HTMLDivElement>(null)
+  const pageRefs = useRef<Map<number, HTMLDivElement>>(new Map())
+
+  // Track the pending scroll target (page and PDF file), to avoid re-firing
+  const pendingScrollRef = useRef<{ page: number; pdfFile: string } | null>(null)
+
+  // Unique PDF filenames from provenance
+  const pdfFiles = useMemo(() => {
+    const seen = new Set<string>()
+    return (sessionData?.provenance ?? [])
+      .map((p) => p.source_filename)
+      .filter((f) => { const fresh = !seen.has(f); seen.add(f); return fresh })
+  }, [sessionData?.provenance])
+
+  // Set container width on resize
+  useEffect(() => {
+    const el = containerRef.current
+    if (!el) return
+    const obs = new ResizeObserver(([entry]) => {
+      setContainerWidth(Math.floor(entry.contentRect.width) - 24)
+    })
+    obs.observe(el)
+    setContainerWidth(Math.floor(el.clientWidth) - 24)
+    return () => obs.disconnect()
+  }, [])
+
+  // When active provenance changes: enqueue a scroll request
+  useEffect(() => {
+    if (!activeProvenance) return
+    pendingScrollRef.current = {
+      page: activeProvenance.location.page,
+      pdfFile: activeProvenance.source_filename,
+    }
+    // Reset rendered-pages set when switching documents
+    if (activeProvenance.source_filename !== activePdfFile) {
+      setRenderedPages(new Set())
+    }
+    // Try immediately (page already rendered)
+    tryScroll()
+  }, [activeProvenance]) // eslint-disable-line react-hooks/exhaustive-deps
+
+  // When a page finishes rendering, check if a scroll is pending for it
+  const handlePageRenderSuccess = useCallback((pageNum: number) => {
+    setRenderedPages((prev) => new Set([...prev, pageNum]))
+    const pending = pendingScrollRef.current
+    if (pending && pending.page === pageNum && pending.pdfFile === activePdfFile) {
+      const el = pageRefs.current.get(pageNum)
+      el?.scrollIntoView({ behavior: 'smooth', block: 'center' })
+      pendingScrollRef.current = null
+    }
+  }, [activePdfFile])
+
+  function tryScroll() {
+    const pending = pendingScrollRef.current
+    if (!pending) return
+    if (pending.pdfFile !== activePdfFile) return
+    const el = pageRefs.current.get(pending.page)
+    if (el) {
+      el.scrollIntoView({ behavior: 'smooth', block: 'center' })
+      pendingScrollRef.current = null
+    }
+  }
+
+  // Drop cached page refs when the PDF URL changes
+  const pdfUrl = activePdfFile ? api.pdfUrl(sessionId, activePdfFile) : null
+  const prevPdfUrlRef = useRef<string | null>(null)
+  if (pdfUrl !== prevPdfUrlRef.current) {
+    prevPdfUrlRef.current = pdfUrl
+    // Clear page refs — old page elements are stale after document switch
+    pageRefs.current.clear()
+  }
+
+  // Highlights for the currently displayed PDF
+  const highlights = useMemo((): FieldProvenance[] => {
+    if (!sessionData || !activePdfFile) return []
+    return sessionData.provenance.filter(
+      (p) => p.source_filename === activePdfFile,
+    )
+  }, [sessionData, activePdfFile])
+
+  return (
+    <div className="flex flex-col h-full">
+      {/* PDF file selector */}
+      {pdfFiles.length > 1 && (
+        <div className="flex flex-wrap gap-2 p-3 border-b flex-shrink-0" style={{ backgroundColor: '#1F2937' }}>
+          {pdfFiles.map((f) => (
+            <button
+              key={f}
+              onClick={() => setActivePdf(f)}
+              className="px-3 py-1 rounded text-xs font-medium transition-colors"
+              style={activePdfFile === f
+                ? { backgroundColor: '#008080', color: '#ffffff' }
+                : { backgroundColor: 'rgba(255,255,255,0.08)', border: '1px solid rgba(255,255,255,0.15)', color: 'rgba(255,255,255,0.7)' }
+              }
+            >
+              {f}
+            </button>
+          ))}
+        </div>
+      )}
+
+      {/* PDF scroll area */}
+      <div
+        ref={containerRef}
+        className="flex-1 overflow-y-auto pdf-scroll-container bg-gray-200 p-3 space-y-4"
+      >
+        {pdfUrl ? (
+          <Document
+            file={pdfUrl}
+            onLoadSuccess={({ numPages: n }) => {
+              setNumPages(n)
+              setRenderedPages(new Set())
+            }}
+            loading={<LoadingPlaceholder />}
+            error={<ErrorPlaceholder />}
+          >
+            {Array.from({ length: numPages }, (_, i) => i + 1).map((pageNum) => {
+              const pageHighlights = highlights.filter(
+                (h) => h.location.page === pageNum,
+              )
+              const hasActive =
+                activeProvenance?.location.page === pageNum &&
+                activeProvenance.source_filename === activePdfFile
+
+              return (
+                <div
+                  key={pageNum}
+                  ref={(el) => {
+                    if (el) pageRefs.current.set(pageNum, el)
+                    else pageRefs.current.delete(pageNum)
+                  }}
+                  // Use block + explicit width so the overlay div always matches
+                  // the canvas dimensions exactly (inline-block can shrink-wrap)
+                  style={{ position: 'relative', width: containerWidth }}
+                  className={`rounded shadow-md transition-shadow overflow-hidden ${
+                    hasActive ? 'ring-4 ring-blue-500' : ''
+                  }`}
+                >
+                  <Page
+                    pageNumber={pageNum}
+                    width={containerWidth}
+                    renderTextLayer={false}
+                    renderAnnotationLayer={false}
+                    onRenderSuccess={() => handlePageRenderSuccess(pageNum)}
+                  />
+
+                  {/* Highlight overlay — percentage-based, top-left origin */}
+                  <div
+                    style={{ position: 'absolute', inset: 0, pointerEvents: 'none' }}
+                    aria-hidden
+                  >
+                    {pageHighlights.map((h) => {
+                      const [x0, y0, x1, y1] = h.location.bbox
+                      const isActive = activeProvenance?.field_path === h.field_path
+                      return (
+                        <div
+                          key={h.field_path}
+                          style={{
+                            position: 'absolute',
+                            left: `${x0}%`,
+                            top: `${y0}%`,
+                            width: `${x1 - x0}%`,
+                            height: `${y1 - y0}%`,
+                            background: isActive
+                              ? 'rgba(59, 130, 246, 0.35)'   /* blue-500 fill */
+                              : 'rgba(134, 239, 172, 0.35)', /* green-300 fill */
+                            border: isActive
+                              ? '3px solid rgba(37, 99, 235, 1)'    /* blue-600 solid */
+                              : '2px solid rgba(22, 163, 74, 0.9)', /* green-600 */
+                            borderRadius: 3,
+                            boxShadow: isActive
+                              ? '0 0 0 2px rgba(147, 197, 253, 0.6)' /* blue glow */
+                              : 'none',
+                            transition: 'background 0.15s, border 0.15s',
+                          }}
+                          title={`${h.field_path}: ${h.extracted_value}`}
+                        />
+                      )
+                    })}
+                  </div>
+                </div>
+              )
+            })}
+          </Document>
+        ) : (
+          <div className="flex items-center justify-center h-full text-gray-400 text-sm">
+            No PDF selected
+          </div>
+        )}
+      </div>
+    </div>
+  )
+}
| 213 |
+
|
| 214 |
+
function LoadingPlaceholder() {
|
| 215 |
+
return (
|
| 216 |
+
<div className="flex items-center justify-center p-12 text-gray-400 text-sm">
|
| 217 |
+
Loading PDF…
|
| 218 |
+
</div>
|
| 219 |
+
)
|
| 220 |
+
}
|
| 221 |
+
|
| 222 |
+
function ErrorPlaceholder() {
|
| 223 |
+
return (
|
| 224 |
+
<div className="flex items-center justify-center p-12 text-red-400 text-sm">
|
| 225 |
+
Failed to load PDF.
|
| 226 |
+
</div>
|
| 227 |
+
)
|
| 228 |
+
}
|
| 229 |
+
|
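The highlight overlay in `PDFPane.tsx` positions each box from `h.location.bbox`, which the component treats as `[x0, y0, x1, y1]` percentages of the page with a top-left origin. The standalone sketch below isolates that arithmetic. The names `bboxToStyle`, `PercentBBox`, and `OverlayStyle` are illustrative assumptions and do not appear in the repository.

```typescript
// Hypothetical helper (not part of the repo): mirrors the inline style math
// used for each highlight <div> in PDFPane.
type PercentBBox = [number, number, number, number] // [x0, y0, x1, y1], in percent

interface OverlayStyle {
  left: string
  top: string
  width: string
  height: string
}

function bboxToStyle([x0, y0, x1, y1]: PercentBBox): OverlayStyle {
  return {
    left: `${x0}%`,          // distance from the left edge of the page
    top: `${y0}%`,           // distance from the top edge of the page
    width: `${x1 - x0}%`,    // box width as a share of the page width
    height: `${y1 - y0}%`,   // box height as a share of the page height
  }
}

// Made-up sample box: starts 10% from the left, 20% from the top.
const style = bboxToStyle([10, 20, 60, 25])
console.log(style)
```

Because every value is a percentage of the wrapper, the box should stay aligned when `containerWidth` changes, provided the wrapper and the rendered page keep the same dimensions.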
ui/src/RecordPane.tsx
ADDED
|
@@ -0,0 +1,174 @@
| 1 |
+
import { useMemo } from 'react'
|
| 2 |
+
import type { FieldEntry, GoldenRecord } from './types'
|
| 3 |
+
import { useStore } from './store'
|
| 4 |
+
import { FieldRow } from './FieldRow'
|
| 5 |
+
|
| 6 |
+
interface Props {
|
| 7 |
+
sessionId: string
|
| 8 |
+
}
|
| 9 |
+
|
| 10 |
+
const SECTION_LABELS: Record<string, string> = {
|
| 11 |
+
policy_header: 'Policy Header',
|
| 12 |
+
vehicle_details: 'Vehicle Details',
|
| 13 |
+
driver_details: 'Drivers',
|
| 14 |
+
cover_and_excesses: 'Cover & Excesses',
|
| 15 |
+
financial_summary: 'Financial Summary',
|
| 16 |
+
additional_risk_data: 'Additional Risk Data',
|
| 17 |
+
}
|
| 18 |
+
|
| 19 |
+
export function RecordPane({ sessionId }: Props) {
|
| 20 |
+
const sessionData = useStore((s) => s.sessionData)
|
| 21 |
+
const reviewState = useStore((s) => s.reviewState)
|
| 22 |
+
const activeFieldPath = useStore((s) => s.activeFieldPath)
|
| 23 |
+
const setActiveField = useStore((s) => s.setActiveField)
|
| 24 |
+
|
| 25 |
+
const fieldsBySection = useMemo(() => {
|
| 26 |
+
if (!sessionData) return []
|
| 27 |
+
return flattenRecord(sessionData.record, sessionData.provenance.reduce(
|
| 28 |
+
(acc, p) => { acc[p.field_path] = p; return acc },
|
| 29 |
+
{} as Record<string, import('./types').FieldProvenance>,
|
| 30 |
+
))
|
| 31 |
+
}, [sessionData])
|
| 32 |
+
|
| 33 |
+
if (!sessionData) return null
|
| 34 |
+
|
| 35 |
+
return (
|
| 36 |
+
<div className="flex flex-col h-full">
|
| 37 |
+
{/* Header */}
|
| 38 |
+
<div className="px-5 py-4 border-b flex-shrink-0" style={{ backgroundColor: '#1F2937' }}>
|
| 39 |
+
<h2 className="text-sm font-semibold text-white">Golden Record</h2>
|
| 40 |
+
<p className="text-xs mt-0.5" style={{ color: 'rgba(255,255,255,0.5)' }}>
|
| 41 |
+
Click any field to highlight its source location in the PDF.
|
| 42 |
+
</p>
|
| 43 |
+
</div>
|
| 44 |
+
|
| 45 |
+
{/* Scrollable field list */}
|
| 46 |
+
<div className="flex-1 overflow-y-auto px-4 py-3 space-y-5">
|
| 47 |
+
{fieldsBySection.map(({ section, entries }) => (
|
| 48 |
+
<section key={section}>
|
| 49 |
+
<h3 className="text-xs font-semibold uppercase tracking-wider mb-2 px-1" style={{ color: '#008080' }}>
|
| 50 |
+
{SECTION_LABELS[section] ?? section}
|
| 51 |
+
</h3>
|
| 52 |
+
<div className="space-y-1">
|
| 53 |
+
{entries.map((entry) => (
|
| 54 |
+
<FieldRow
|
| 55 |
+
key={entry.fieldPath}
|
| 56 |
+
entry={entry}
|
| 57 |
+
sessionId={sessionId}
|
| 58 |
+
isActive={activeFieldPath === entry.fieldPath}
|
| 59 |
+
review={reviewState[entry.fieldPath]}
|
| 60 |
+
onClick={() =>
|
| 61 |
+
setActiveField(activeFieldPath === entry.fieldPath ? null : entry)
|
| 62 |
+
}
|
| 63 |
+
/>
|
| 64 |
+
))}
|
| 65 |
+
</div>
|
| 66 |
+
</section>
|
| 67 |
+
))}
|
| 68 |
+
</div>
|
| 69 |
+
</div>
|
| 70 |
+
)
|
| 71 |
+
}
|
| 72 |
+
|
| 73 |
+
// ── Field flattening helpers ───────────────────────────────────────────────
|
| 74 |
+
|
| 75 |
+
interface SectionGroup {
|
| 76 |
+
section: string
|
| 77 |
+
entries: FieldEntry[]
|
| 78 |
+
}
|
| 79 |
+
|
| 80 |
+
function flattenRecord(
|
| 81 |
+
record: GoldenRecord,
|
| 82 |
+
provenanceMap: Record<string, import('./types').FieldProvenance>,
|
| 83 |
+
): SectionGroup[] {
|
| 84 |
+
const groups: SectionGroup[] = []
|
| 85 |
+
|
| 86 |
+
for (const [sectionKey, sectionValue] of Object.entries(record)) {
|
| 87 |
+
if (sectionValue == null) continue
|
| 88 |
+
|
| 89 |
+
const entries: FieldEntry[] = []
|
| 90 |
+
|
| 91 |
+
if (Array.isArray(sectionValue)) {
|
| 92 |
+
// Array-valued section (e.g. driver_details): walk each item with an indexed path
|
| 93 |
+
sectionValue.forEach((item: Record<string, unknown>, idx: number) => {
|
| 94 |
+
walkObject(
|
| 95 |
+
item,
|
| 96 |
+
`${sectionKey}[${idx}]`,
|
| 97 |
+
`Driver ${idx + 1}`,
|
| 98 |
+
entries,
|
| 99 |
+
provenanceMap,
|
| 100 |
+
)
|
| 101 |
+
})
|
| 102 |
+
} else if (typeof sectionValue === 'object') {
|
| 103 |
+
walkObject(
|
| 104 |
+
sectionValue as Record<string, unknown>,
|
| 105 |
+
sectionKey,
|
| 106 |
+
'',
|
| 107 |
+
entries,
|
| 108 |
+
provenanceMap,
|
| 109 |
+
)
|
| 110 |
+
} else {
|
| 111 |
+
entries.push({
|
| 112 |
+
fieldPath: sectionKey,
|
| 113 |
+
label: formatLabel(sectionKey),
|
| 114 |
+
value: String(sectionValue),
|
| 115 |
+
section: sectionKey,
|
| 116 |
+
provenance: provenanceMap[sectionKey],
|
| 117 |
+
})
|
| 118 |
+
}
|
| 119 |
+
|
| 120 |
+
if (entries.length > 0) {
|
| 121 |
+
groups.push({ section: sectionKey, entries })
|
| 122 |
+
}
|
| 123 |
+
}
|
| 124 |
+
|
| 125 |
+
return groups
|
| 126 |
+
}
|
| 127 |
+
|
| 128 |
+
function walkObject(
|
| 129 |
+
obj: Record<string, unknown>,
|
| 130 |
+
pathPrefix: string,
|
| 131 |
+
_labelPrefix: string,
|
| 132 |
+
out: FieldEntry[],
|
| 133 |
+
provenanceMap: Record<string, import('./types').FieldProvenance>,
|
| 134 |
+
) {
|
| 135 |
+
for (const [key, val] of Object.entries(obj)) {
|
| 136 |
+
const path = `${pathPrefix}.${key}`
|
| 137 |
+
|
| 138 |
+
if (val == null) continue
|
| 139 |
+
|
| 140 |
+
if (typeof val === 'object' && !Array.isArray(val)) {
|
| 141 |
+
walkObject(val as Record<string, unknown>, path, key, out, provenanceMap)
|
| 142 |
+
} else if (Array.isArray(val)) {
|
| 143 |
+
val.forEach((item, i) => {
|
| 144 |
+
if (item == null) return
|
| 145 |
+
const iPath = `${path}[${i}]`
|
| 146 |
+
if (typeof item === 'object') {
|
| 147 |
+
walkObject(item as Record<string, unknown>, iPath, key, out, provenanceMap)
|
| 148 |
+
} else {
|
| 149 |
+
out.push({
|
| 150 |
+
fieldPath: iPath,
|
| 151 |
+
label: `${formatLabel(key)} [${i}]`,
|
| 152 |
+
value: String(item),
|
| 153 |
+
section: pathPrefix.split('.')[0],
|
| 154 |
+
provenance: provenanceMap[iPath],
|
| 155 |
+
})
|
| 156 |
+
}
|
| 157 |
+
})
|
| 158 |
+
} else {
|
| 159 |
+
out.push({
|
| 160 |
+
fieldPath: path,
|
| 161 |
+
label: formatLabel(key),
|
| 162 |
+
value: String(val),
|
| 163 |
+
section: pathPrefix.split('.')[0],
|
| 164 |
+
provenance: provenanceMap[path],
|
| 165 |
+
})
|
| 166 |
+
}
|
| 167 |
+
}
|
| 168 |
+
}
|
| 169 |
+
|
| 170 |
+
function formatLabel(key: string): string {
|
| 171 |
+
return key
|
| 172 |
+
.replace(/_/g, ' ')
|
| 173 |
+
.replace(/\b\w/g, (c) => c.toUpperCase())
|
| 174 |
+
}
|
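The helpers in `RecordPane.tsx` build field paths such as `section.key` and `section[0].key`, and those paths are then used to look up entries in the provenance map. A minimal standalone sketch of the same path-building rule follows. `collectLeaves`, the `Leaf` type, and the sample record are assumptions made for illustration; only `formatLabel` is copied from the file.

```typescript
// Hypothetical standalone version of the path-building logic in walkObject.
type Leaf = { fieldPath: string; value: string }

function collectLeaves(value: unknown, path: string, out: Leaf[]): void {
  if (value == null) return // null/undefined values are skipped, as in the original
  if (Array.isArray(value)) {
    // Array items get an index suffix: path[0], path[1], ...
    value.forEach((item, i) => collectLeaves(item, `${path}[${i}]`, out))
  } else if (typeof value === 'object') {
    // Object keys are joined with a dot (no leading dot at the root).
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      collectLeaves(child, path ? `${path}.${key}` : key, out)
    }
  } else {
    out.push({ fieldPath: path, value: String(value) })
  }
}

// Same implementation as formatLabel in RecordPane.tsx.
function formatLabel(key: string): string {
  return key.replace(/_/g, ' ').replace(/\b\w/g, (c) => c.toUpperCase())
}

// Made-up sample record, not taken from the repository's demo pack.
const sampleRecord = {
  policy_header: { policy_number: 'ABC123', insurer: null },
  driver_details: [{ name: 'A. Driver', convictions: ['SP30'] }],
}

const leaves: Leaf[] = []
collectLeaves(sampleRecord, '', leaves)
const paths = leaves.map((l) => l.fieldPath)
console.log(paths)
console.log(formatLabel('cover_and_excesses'))
```

The fallback label for `cover_and_excesses` would be "Cover And Excesses", which is presumably why `SECTION_LABELS` supplies "Cover & Excesses" explicitly.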
ui/src/ReviewDashboard.tsx
ADDED
|
@@ -0,0 +1,93 @@
| 1 |
+
import { PDFPane } from './PDFPane'
|
| 2 |
+
import { RecordPane } from './RecordPane'
|
| 3 |
+
import { useStore } from './store'
|
| 4 |
+
import logoUrl from './assets/ai-toolstack-logo.svg'
|
| 5 |
+
|
| 6 |
+
interface Props {
|
| 7 |
+
sessionId: string
|
| 8 |
+
}
|
| 9 |
+
|
| 10 |
+
export function ReviewDashboard({ sessionId }: Props) {
|
| 11 |
+
const sessionData = useStore((s) => s.sessionData)
|
| 12 |
+
const reviewState = useStore((s) => s.reviewState)
|
| 13 |
+
|
| 14 |
+
const verified = Object.values(reviewState).filter((r) => r.action === 'verify').length
|
| 15 |
+
const overridden = Object.values(reviewState).filter((r) => r.action === 'override').length
|
| 16 |
+
const provTotal = sessionData?.provenance.length ?? 0
|
| 17 |
+
const fieldTotal = sessionData ? _countLeaves(sessionData.record) : 0
|
| 18 |
+
|
| 19 |
+
return (
|
| 20 |
+
<div className="flex flex-col h-screen overflow-hidden" style={{ backgroundColor: '#f1f5f9' }}>
|
| 21 |
+
|
| 22 |
+
{/* ── Top bar ─────────────────────────────────────────────────── */}
|
| 23 |
+
<header className="flex items-center justify-between px-6 py-3 bg-white border-b border-gray-200 shadow-sm z-10 flex-shrink-0">
|
| 24 |
+
<div className="flex items-center gap-4">
|
| 25 |
+
<a href="https://www.ai-toolstack.com/" target="_blank" rel="noopener noreferrer">
|
| 26 |
+
<img src={logoUrl} alt="AI Tool Stack" className="h-6 w-auto" />
|
| 27 |
+
</a>
|
| 28 |
+
{/* Divider */}
|
| 29 |
+
<span className="text-gray-200 select-none">|</span>
|
| 30 |
+
<div className="flex items-center gap-2">
|
| 31 |
+
<svg width="16" height="16" viewBox="0 0 28 28" fill="none" aria-hidden="true">
|
| 32 |
+
<path d="M4 18L14 22L24 18" stroke="#1F2937" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
|
| 33 |
+
<path d="M4 14L14 18L24 14" stroke="#2563EB" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
|
| 34 |
+
<path d="M4 10L14 14L24 10L14 6L4 10Z" stroke="#008080" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
|
| 35 |
+
</svg>
|
| 36 |
+
<span className="text-sm font-semibold" style={{ color: '#1F2937' }}>PolicyTrace</span>
|
| 37 |
+
</div>
|
| 38 |
+
<span className="text-xs text-gray-400 font-mono bg-gray-50 px-2 py-0.5 rounded-lg border border-gray-200">
|
| 39 |
+
{sessionId.slice(0, 8)}…
|
| 40 |
+
</span>
|
| 41 |
+
</div>
|
| 42 |
+
<div className="flex items-center gap-3 text-xs text-gray-500">
|
| 43 |
+
<Stat label="Fields" value={fieldTotal} />
|
| 44 |
+
<StatDivider />
|
| 45 |
+
<Stat label="Located" value={provTotal} />
|
| 46 |
+
<StatDivider />
|
| 47 |
+
<Stat label="Verified" value={verified} color="#16a34a" />
|
| 48 |
+
<StatDivider />
|
| 49 |
+
<Stat label="Overridden" value={overridden} color="#2563EB" />
|
| 50 |
+
</div>
|
| 51 |
+
</header>
|
| 52 |
+
|
| 53 |
+
{/* ── 2-column body ───────────────────────────────────────────── */}
|
| 54 |
+
<div className="flex flex-1 overflow-hidden">
|
| 55 |
+
<div className="w-1/2 border-r border-gray-200 flex flex-col overflow-hidden">
|
| 56 |
+
<PDFPane sessionId={sessionId} />
|
| 57 |
+
</div>
|
| 58 |
+
<div className="w-1/2 flex flex-col overflow-hidden">
|
| 59 |
+
<RecordPane sessionId={sessionId} />
|
| 60 |
+
</div>
|
| 61 |
+
</div>
|
| 62 |
+
</div>
|
| 63 |
+
)
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
function StatDivider() {
|
| 67 |
+
return <span className="text-gray-200 select-none">·</span>
|
| 68 |
+
}
|
| 69 |
+
|
| 70 |
+
function Stat({
|
| 71 |
+
label,
|
| 72 |
+
value,
|
| 73 |
+
color = '#374151',
|
| 74 |
+
}: {
|
| 75 |
+
label: string
|
| 76 |
+
value: number
|
| 77 |
+
color?: string
|
| 78 |
+
}) {
|
| 79 |
+
return (
|
| 80 |
+
<span>
|
| 81 |
+
{label}:{' '}
|
| 82 |
+
<span className="font-semibold" style={{ color }}>{value}</span>
|
| 83 |
+
</span>
|
| 84 |
+
)
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
/** Recursively count leaf values in a nested object or array; primitives and null each count as one leaf (mirrors backend _count_leaves). */
|
| 88 |
+
function _countLeaves(obj: unknown): number {
|
| 89 |
+
if (Array.isArray(obj)) return obj.reduce((acc, v) => acc + _countLeaves(v), 0)
|
| 90 |
+
if (obj && typeof obj === 'object')
|
| 91 |
+
return Object.values(obj).reduce((acc: number, v) => acc + _countLeaves(v), 0)
|
| 92 |
+
return 1
|
| 93 |
+
}
|
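`_countLeaves` in `ReviewDashboard.tsx` returns 1 for anything that is not an array or a non-null object, so a `null` value counts as a field, whereas `walkObject` in `RecordPane.tsx` skips `null` values. The sketch below shows how the two counts can differ. The `skipNull` parameter and the sample data are assumptions added for illustration; whether the backend `_count_leaves` counts nulls is not visible in this diff.

```typescript
// Hypothetical variant of _countLeaves with an opt-in flag to ignore null/undefined.
function countLeaves(value: unknown, skipNull = false): number {
  if (Array.isArray(value)) {
    return value.reduce<number>((acc, v) => acc + countLeaves(v, skipNull), 0)
  }
  if (value !== null && typeof value === 'object') {
    return Object.values(value as Record<string, unknown>).reduce<number>(
      (acc, v) => acc + countLeaves(v, skipNull),
      0,
    )
  }
  if (skipNull && value == null) return 0 // treat missing values as "no field"
  return 1 // primitives (and null, when skipNull is false) count as one leaf
}

// Made-up sample: two null values among five leaves.
const sample = { a: 1, b: null, c: { d: 'x', e: [true, null] } }
const withNulls = countLeaves(sample)
const withoutNulls = countLeaves(sample, true)
console.log(withNulls, withoutNulls)
```

If the "Fields" stat is meant to match the rows shown in the record pane, the null-skipping count may be the closer fit; that is a design choice for the authors rather than a confirmed bug.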
ui/src/SessionPage.tsx
ADDED
|
@@ -0,0 +1,82 @@
| 1 |
+
import { useEffect, useState } from 'react'
|
| 2 |
+
import { useNavigate, useParams } from 'react-router-dom'
|
| 3 |
+
import { api } from './api'
|
| 4 |
+
import { ReviewDashboard } from './ReviewDashboard'
|
| 5 |
+
import { useStore } from './store'
|
| 6 |
+
|
| 7 |
+
/**
|
| 8 |
+
* Route: /session/:sessionId
|
| 9 |
+
*
|
| 10 |
+
* Loads session data from the API on mount so the page survives a hard refresh
|
| 11 |
+
* or a direct link (e.g. from a blog post). If loading fails, an error message is
|
| 12 |
+
* shown with a button back to the upload page; a missing ID redirects there directly.
|
| 13 |
+
*/
|
| 14 |
+
export function SessionPage() {
|
| 15 |
+
const { sessionId } = useParams<{ sessionId: string }>()
|
| 16 |
+
const navigate = useNavigate()
|
| 17 |
+
const setSession = useStore((s) => s.setSession)
|
| 18 |
+
const sessionData = useStore((s) => s.sessionData)
|
| 19 |
+
|
| 20 |
+
const [loading, setLoading] = useState(false)
|
| 21 |
+
const [error, setError] = useState<string | null>(null)
|
| 22 |
+
|
| 23 |
+
useEffect(() => {
|
| 24 |
+
if (!sessionId) {
|
| 25 |
+
navigate('/')
|
| 26 |
+
return
|
| 27 |
+
}
|
| 28 |
+
|
| 29 |
+
// If the store already has data for this exact session (just navigated from
|
| 30 |
+
// the upload page), skip the API call.
|
| 31 |
+
if (sessionData?.session_id === sessionId) return
|
| 32 |
+
|
| 33 |
+
setLoading(true)
|
| 34 |
+
api.getSession(sessionId)
|
| 35 |
+
.then((data) => {
|
| 36 |
+
setSession(data)
|
| 37 |
+
setLoading(false)
|
| 38 |
+
})
|
| 39 |
+
.catch(() => {
|
| 40 |
+
setError(`Session "${sessionId.slice(0, 8)}…" not found or has expired.`)
|
| 41 |
+
setLoading(false)
|
| 42 |
+
})
|
| 43 |
+
}, [sessionId]) // eslint-disable-line react-hooks/exhaustive-deps
|
| 44 |
+
|
| 45 |
+
if (loading) {
|
| 46 |
+
return (
|
| 47 |
+
<div className="min-h-screen flex items-center justify-center" style={{ backgroundColor: '#f8fafc' }}>
|
| 48 |
+
<div className="text-center space-y-3">
|
| 49 |
+
<svg className="animate-spin h-8 w-8 mx-auto" viewBox="0 0 24 24" fill="none"
|
| 50 |
+
style={{ color: '#008080' }}>
|
| 51 |
+
<circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" />
|
| 52 |
+
<path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8v8z" />
|
| 53 |
+
</svg>
|
| 54 |
+
<p className="text-sm text-gray-500">Loading session…</p>
|
| 55 |
+
</div>
|
| 56 |
+
</div>
|
| 57 |
+
)
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
if (error) {
|
| 61 |
+
return (
|
| 62 |
+
<div className="min-h-screen flex items-center justify-center" style={{ backgroundColor: '#f8fafc' }}>
|
| 63 |
+
<div className="text-center space-y-4 max-w-sm">
|
| 64 |
+
<p className="text-sm text-red-600 bg-red-50 border border-red-200 rounded-xl px-4 py-3">
|
| 65 |
+
{error}
|
| 66 |
+
</p>
|
| 67 |
+
<button
|
| 68 |
+
onClick={() => navigate('/')}
|
| 69 |
+
className="text-sm font-medium underline"
|
| 70 |
+
style={{ color: '#2563EB' }}
|
| 71 |
+
>
|
| 72 |
+
← Back to upload
|
| 73 |
+
</button>
|
| 74 |
+
</div>
|
| 75 |
+
</div>
|
| 76 |
+
)
|
| 77 |
+
}
|
| 78 |
+
|
| 79 |
+
if (!sessionId) return null
|
| 80 |
+
|
| 81 |
+
return <ReviewDashboard sessionId={sessionId} />
|
| 82 |
+
}
|
ui/src/UploadPage.tsx
ADDED
|
@@ -0,0 +1,210 @@
| 1 |
+
import { useCallback, useState } from 'react'
|
| 2 |
+
import { useNavigate } from 'react-router-dom'
|
| 3 |
+
import { api } from './api'
|
| 4 |
+
import { useStore } from './store'
|
| 5 |
+
import logoUrl from './assets/ai-toolstack-logo.svg'
|
| 6 |
+
|
| 7 |
+
const BRAND = {
|
| 8 |
+
dark: '#1F2937',
|
| 9 |
+
blue: '#2563EB',
|
| 10 |
+
teal: '#008080',
|
| 11 |
+
} as const
|
| 12 |
+
|
| 13 |
+
export function UploadPage() {
|
| 14 |
+
const navigate = useNavigate()
|
| 15 |
+
const setSession = useStore((s) => s.setSession)
|
| 16 |
+
|
| 17 |
+
const [loading, setLoading] = useState(false)
|
| 18 |
+
const [error, setError] = useState<string | null>(null)
|
| 19 |
+
const [files, setFiles] = useState<File[]>([])
|
| 20 |
+
const [dragOver, setDragOver] = useState(false)
|
| 21 |
+
|
| 22 |
+
const handleFiles = useCallback((incoming: FileList | null) => {
|
| 23 |
+
if (!incoming) return
|
| 24 |
+
const pdfs = Array.from(incoming).filter((f) => f.name.toLowerCase().endsWith('.pdf'))
|
| 25 |
+
setFiles((prev) => {
|
| 26 |
+
const names = new Set(prev.map((f) => f.name))
|
| 27 |
+
return [...prev, ...pdfs.filter((f) => !names.has(f.name))]
|
| 28 |
+
})
|
| 29 |
+
}, [])
|
| 30 |
+
|
| 31 |
+
const removeFile = (name: string) =>
|
| 32 |
+
setFiles((prev) => prev.filter((f) => f.name !== name))
|
| 33 |
+
|
| 34 |
+
const handleSubmit = async () => {
|
| 35 |
+
if (!files.length) return
|
| 36 |
+
setLoading(true)
|
| 37 |
+
setError(null)
|
| 38 |
+
try {
|
| 39 |
+
const resp = await api.processDocuments(files)
|
| 40 |
+
const sessionData = await api.getSession(resp.session_id)
|
| 41 |
+
setSession(sessionData)
|
| 42 |
+
navigate(`/session/${resp.session_id}`)
|
| 43 |
+
} catch (err: unknown) {
|
| 44 |
+
const msg = err instanceof Error ? err.message : 'An unknown error occurred.'
|
| 45 |
+
setError(msg)
|
| 46 |
+
setLoading(false)
|
| 47 |
+
}
|
| 48 |
+
}
|
| 49 |
+
|
| 50 |
+
return (
|
| 51 |
+
<div className="min-h-screen flex flex-col" style={{ backgroundColor: '#f8fafc' }}>
|
| 52 |
+
|
| 53 |
+
{/* ── Top nav ─────────────────────────────────────────────────── */}
|
| 54 |
+
<header className="flex items-center justify-between px-8 py-4 border-b border-gray-200 bg-white">
|
| 55 |
+
<a href="https://www.ai-toolstack.com/" target="_blank" rel="noopener noreferrer">
|
| 56 |
+
<img src={logoUrl} alt="AI Tool Stack" className="h-7 w-auto" />
|
| 57 |
+
</a>
|
| 58 |
+
<span
|
| 59 |
+
className="text-xs font-medium px-2 py-1 rounded-full"
|
| 60 |
+
style={{ backgroundColor: '#f0fdfc', color: BRAND.teal }}
|
| 61 |
+
>
|
| 62 |
+
Beta
|
| 63 |
+
</span>
|
| 64 |
+
</header>
|
| 65 |
+
|
| 66 |
+
{/* ── Hero ────────────────────────────────────────────────────── */}
|
| 67 |
+
<main className="flex-1 flex flex-col items-center justify-center px-8 py-12">
|
| 68 |
+
<div className="w-full max-w-lg">
|
| 69 |
+
|
| 70 |
+
{/* Title */}
|
| 71 |
+
<div className="mb-8 text-center">
|
| 72 |
+
<div className="inline-flex items-center gap-2 mb-4">
|
| 73 |
+
<svg width="28" height="28" viewBox="0 0 28 28" fill="none" aria-hidden="true">
|
| 74 |
+
<path d="M4 18L14 22L24 18" stroke={BRAND.dark} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
|
| 75 |
+
<path d="M4 14L14 18L24 14" stroke={BRAND.blue} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
|
| 76 |
+
<path d="M4 10L14 14L24 10L14 6L4 10Z" stroke={BRAND.teal} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
|
| 77 |
+
</svg>
|
| 78 |
+
<h1 className="text-2xl font-bold tracking-tight" style={{ color: BRAND.dark }}>
|
| 79 |
+
PolicyTrace
|
| 80 |
+
</h1>
|
| 81 |
+
</div>
|
| 82 |
+
<p className="text-sm text-gray-500 leading-relaxed">
|
| 83 |
+
Upload UK motor insurance PDFs — the pipeline classifies, extracts, and merges
|
| 84 |
+
them into a verified Golden Record with full field-level provenance.
|
| 85 |
+
</p>
|
| 86 |
+
</div>
|
| 87 |
+
|
| 88 |
+
{/* Drop zone */}
|
| 89 |
+
<div
|
| 90 |
+
onDragOver={(e) => { e.preventDefault(); setDragOver(true) }}
|
| 91 |
+
onDragLeave={() => setDragOver(false)}
|
| 92 |
+
onDrop={(e) => {
|
| 93 |
+
e.preventDefault()
|
| 94 |
+
setDragOver(false)
|
| 95 |
+
handleFiles(e.dataTransfer.files)
|
| 96 |
+
}}
|
| 97 |
+
onClick={() => document.getElementById('file-input')?.click()}
|
| 98 |
+
className="rounded-2xl border-2 border-dashed p-10 text-center cursor-pointer transition-all"
|
| 99 |
+
style={{
|
| 100 |
+
borderColor: dragOver ? BRAND.blue : '#d1d5db',
|
| 101 |
+
backgroundColor: dragOver ? '#eff6ff' : '#ffffff',
|
| 102 |
+
}}
|
| 103 |
+
>
|
| 104 |
+
<svg
|
| 105 |
+
className="mx-auto mb-3 h-10 w-10 transition-colors"
|
| 106 |
+
fill="none"
|
| 107 |
+
viewBox="0 0 24 24"
|
| 108 |
+
stroke="currentColor"
|
| 109 |
+
style={{ color: dragOver ? BRAND.blue : '#9ca3af' }}
|
| 110 |
+
>
|
| 111 |
+
<path strokeLinecap="round" strokeLinejoin="round" strokeWidth={1.5}
|
| 112 |
+
d="M7 16a4 4 0 01-.88-7.903A5 5 0 1115.9 6L16 6a5 5 0 011 9.9M15 13l-3-3m0 0l-3 3m3-3v12" />
|
| 113 |
+
</svg>
|
| 114 |
+
<p className="text-sm font-medium text-gray-700">
|
| 115 |
+
Drop PDF files here, or{' '}
|
| 116 |
+
<span style={{ color: BRAND.blue }}>click to browse</span>
|
| 117 |
+
</p>
|
| 118 |
+
<p className="text-xs text-gray-400 mt-1">
|
| 119 |
+
Schedule · Certificate · Statement of Fact · Policy Booklet
|
| 120 |
+
</p>
|
| 121 |
+
<input
|
| 122 |
+
id="file-input"
|
| 123 |
+
type="file"
|
| 124 |
+
accept=".pdf"
|
| 125 |
+
multiple
|
| 126 |
+
className="hidden"
|
| 127 |
+
onChange={(e) => handleFiles(e.target.files)}
|
| 128 |
+
/>
|
| 129 |
+
</div>
|
| 130 |
+
|
| 131 |
+
{/* File list */}
|
| 132 |
+
{files.length > 0 && (
|
| 133 |
+
<ul className="mt-4 space-y-2">
|
| 134 |
+
{files.map((f) => (
|
| 135 |
+
<li
|
| 136 |
+
key={f.name}
|
| 137 |
+
className="flex items-center justify-between bg-white border border-gray-200 rounded-xl px-4 py-2.5 text-sm shadow-sm"
|
| 138 |
+
>
|
| 139 |
+
<div className="flex items-center gap-2 min-w-0">
|
| 140 |
+
<span
|
| 141 |
+
className="shrink-0 text-xs font-semibold px-1.5 py-0.5 rounded"
|
| 142 |
+
style={{ backgroundColor: '#fee2e2', color: '#991b1b' }}
|
| 143 |
+
>
|
| 144 |
+
PDF
|
| 145 |
+
</span>
|
| 146 |
+
<span className="text-gray-700 truncate">{f.name}</span>
|
| 147 |
+
</div>
|
| 148 |
+
<button
|
| 149 |
+
onClick={() => removeFile(f.name)}
|
| 150 |
+
className="text-gray-300 hover:text-red-500 ml-3 shrink-0 transition-colors"
|
| 151 |
+
aria-label={`Remove ${f.name}`}
|
| 152 |
+
>
|
| 153 |
+
✕
|
| 154 |
+
</button>
|
| 155 |
+
</li>
|
| 156 |
+
))}
|
| 157 |
+
</ul>
|
| 158 |
+
)}
|
| 159 |
+
|
| 160 |
+
{/* Error */}
|
| 161 |
+
{error && (
|
| 162 |
+
<div className="mt-4 rounded-xl bg-red-50 border border-red-200 p-3 text-sm text-red-700">
|
| 163 |
+
{error}
|
| 164 |
+
</div>
|
| 165 |
+
)}
|
| 166 |
+
|
| 167 |
+
{/* CTA */}
|
| 168 |
+
<button
|
| 169 |
+
onClick={handleSubmit}
|
| 170 |
+
disabled={!files.length || loading}
|
| 171 |
+
className="mt-6 w-full py-3 px-6 rounded-xl font-semibold text-white transition-colors disabled:opacity-50 disabled:cursor-not-allowed"
|
| 172 |
+
style={{ backgroundColor: loading ? BRAND.teal : BRAND.blue }}
|
| 173 |
+
>
|
| 174 |
+
{loading ? (
|
| 175 |
+
<span className="flex items-center justify-center gap-2">
|
| 176 |
+
<svg className="animate-spin h-4 w-4" viewBox="0 0 24 24" fill="none">
|
| 177 |
+
<circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" />
|
| 178 |
+
<path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8v8z" />
|
| 179 |
+
</svg>
|
| 180 |
+
Extracting — this may take 60 s…
|
| 181 |
+
</span>
|
| 182 |
+
) : (
|
| 183 |
+
'Extract & Review'
|
| 184 |
+
)}
|
| 185 |
+
</button>
|
| 186 |
+
|
| 187 |
+
{loading && (
|
| 188 |
+
<p className="text-center text-xs text-gray-400 mt-3">
|
| 189 |
+
Classifying documents · Masking PII · Calling Groq LLM · Building provenance index
|
| 190 |
+
</p>
|
| 191 |
+
)}
|
| 192 |
+
</div>
|
| 193 |
+
</main>
|
| 194 |
+
|
| 195 |
+
{/* ── Footer ──────────────────────────────────────────────────── */}
|
| 196 |
+
<footer className="text-center py-4 text-xs text-gray-400 border-t border-gray-200 bg-white">
|
| 197 |
+
Built on{' '}
|
| 198 |
+
<a
|
| 199 |
+
href="https://www.ai-toolstack.com/"
|
| 200 |
+
target="_blank"
|
| 201 |
+
rel="noopener noreferrer"
|
| 202 |
+
className="underline hover:text-gray-600 transition-colors"
|
| 203 |
+
>
|
| 204 |
+
AI Tool Stack
|
| 205 |
+
</a>{' '}
|
| 206 |
+
· Powered by Groq & Docling
|
| 207 |
+
</footer>
|
| 208 |
+
</div>
|
| 209 |
+
)
|
| 210 |
+
}
|
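`handleFiles` in `UploadPage.tsx` keeps only names ending in `.pdf` (case-insensitive) and drops files whose name is already in the list. The sketch below expresses that rule as a pure function so it can be exercised without a browser. `mergePdfFiles` and `NamedFile` are hypothetical names, and the sketch also dedupes within a single incoming batch, which is a small extension over the original.

```typescript
// Hypothetical pure version of the file-selection rule in handleFiles.
interface NamedFile {
  name: string
}

function mergePdfFiles<T extends NamedFile>(prev: T[], incoming: T[]): T[] {
  const names = new Set(prev.map((f) => f.name))
  const merged = [...prev]
  for (const f of incoming) {
    if (!f.name.toLowerCase().endsWith('.pdf')) continue // keep PDFs only
    if (names.has(f.name)) continue                      // skip names already listed
    names.add(f.name) // extension over the original: dedupe inside the batch too
    merged.push(f)
  }
  return merged
}

// Made-up file names for illustration.
const merged = mergePdfFiles(
  [{ name: 'schedule.pdf' }],
  [
    { name: 'schedule.pdf' },
    { name: 'Certificate.PDF' },
    { name: 'notes.txt' },
    { name: 'Certificate.PDF' },
  ],
)
const mergedNames = merged.map((f) => f.name)
console.log(mergedNames)
```

Matching on the name alone means two different files that share a name cannot both be queued, which is consistent with the list using `f.name` as its React key.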
ui/src/api.ts
ADDED
|
@@ -0,0 +1,43 @@
| 1 |
+
import axios from 'axios'
|
| 2 |
+
import type { ProcessResponse, ReviewAction, ReviewState, SessionData } from './types'
|
| 3 |
+
|
| 4 |
+
const http = axios.create({ baseURL: '/' })
|
| 5 |
+
|
| 6 |
+
export const api = {
|
| 7 |
+
async processDocuments(files: File[]): Promise<ProcessResponse> {
|
| 8 |
+
const form = new FormData()
|
| 9 |
+
for (const f of files) form.append('files', f)
|
| 10 |
+
const { data } = await http.post<ProcessResponse>('/api/process', form, {
|
| 11 |
+
headers: { 'Content-Type': 'multipart/form-data' },
|
| 12 |
+
})
|
| 13 |
+
return data
|
| 14 |
+
},
|
| 15 |
+
|
| 16 |
+
async getSession(sessionId: string): Promise<SessionData> {
|
| 17 |
+
const { data } = await http.get<SessionData>(`/api/session/${sessionId}`)
|
| 18 |
+
return data
|
| 19 |
+
},
|
| 20 |
+
|
| 21 |
+
async getReviewState(sessionId: string): Promise<ReviewState> {
|
| 22 |
+
const { data } = await http.get<ReviewState>(`/api/session/${sessionId}/review-state`)
|
| 23 |
+
return data
|
| 24 |
+
},
|
| 25 |
+
|
| 26 |
+
async updateReview(
|
| 27 |
+
sessionId: string,
|
| 28 |
+
fieldPath: string,
|
| 29 |
+
action: ReviewAction,
|
| 30 |
+
overriddenValue?: string,
|
| 31 |
+
): Promise<void> {
|
| 32 |
+
await http.patch(`/api/session/${sessionId}/review`, {
|
| 33 |
+
field_path: fieldPath,
|
| 34 |
+
action,
|
| 35 |
+
overridden_value: overriddenValue ?? null,
|
| 36 |
+
})
|
| 37 |
+
},
|
| 38 |
+
|
| 39 |
+
/** URL to stream a PDF — used directly by the PDF viewer component */
|
| 40 |
+
pdfUrl(sessionId: string, filename: string): string {
|
| 41 |
+
return `/api/pdf/${sessionId}/${encodeURIComponent(filename)}`
|
| 42 |
+
},
|
| 43 |
+
}
|
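`pdfUrl` in `api.ts` percent-encodes only the filename segment; the session ID is interpolated as-is. The sketch below reproduces the function on its own to show what the encoding does to spaces and slashes. The session ID and filenames are made-up sample values.

```typescript
// Same body as api.pdfUrl, lifted out so it can run standalone.
function pdfUrl(sessionId: string, filename: string): string {
  return `/api/pdf/${sessionId}/${encodeURIComponent(filename)}`
}

// Spaces become %20; parentheses are left untouched by encodeURIComponent.
const withSpaces = pdfUrl('abc123', 'Policy Schedule (2024).pdf')

// A slash inside the filename is escaped, so it cannot add a path segment.
const withSlash = pdfUrl('abc123', 'docs/booklet.pdf')

console.log(withSpaces)
console.log(withSlash)
```

Encoding the filename keeps the route at a fixed number of path segments, which matters if the server matches `/api/pdf/{session_id}/{filename}` positionally.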
ui/src/assets/ai-toolstack-logo.svg
ADDED
|
|
ui/src/index.css
ADDED
|
@@ -0,0 +1,31 @@
| 1 |
+
@tailwind base;
|
| 2 |
+
@tailwind components;
|
| 3 |
+
@tailwind utilities;
|
| 4 |
+
|
| 5 |
+
@layer base {
|
| 6 |
+
body {
|
| 7 |
+
@apply bg-gray-50 text-gray-900;
|
| 8 |
+
font-family: ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
|
| 9 |
+
}
|
| 10 |
+
}
|
| 11 |
+
|
| 12 |
+
/* ── Brand accent utilities ─────────────────────────────────────── */
|
| 13 |
+
.btn-primary {
|
| 14 |
+
@apply py-3 px-6 rounded-xl font-semibold text-white transition-colors;
|
| 15 |
+
background-color: #2563EB;
|
| 16 |
+
}
|
| 17 |
+
.btn-primary:hover { background-color: #1d4ed8; }
|
| 18 |
+
.btn-primary:disabled { @apply opacity-50 cursor-not-allowed; }
|
| 19 |
+
|
| 20 |
+
/* ── react-pdf ──────────────────────────────────────────────────── */
|
| 21 |
+
.react-pdf__Page {
|
| 22 |
+
@apply shadow-md;
|
| 23 |
+
}
|
| 24 |
+
.react-pdf__Page__canvas {
|
| 25 |
+
@apply block;
|
| 26 |
+
}
|
| 27 |
+
|
| 28 |
+
/* Smooth scroll for the PDF pane */
|
| 29 |
+
.pdf-scroll-container {
|
| 30 |
+
scroll-behavior: smooth;
|
| 31 |
+
}
|
ui/src/main.tsx
ADDED
|
@@ -0,0 +1,23 @@
| 1 |
+
import React from 'react'
|
| 2 |
+
import { pdfjs } from 'react-pdf'
|
| 3 |
+
|
| 4 |
+
// Configure pdfjs worker (Vite resolves this at build time)
|
| 5 |
+
pdfjs.GlobalWorkerOptions.workerSrc = new URL(
|
| 6 |
+
'pdfjs-dist/build/pdf.worker.min.mjs',
|
| 7 |
+
import.meta.url,
|
| 8 |
+
).toString()
|
| 9 |
+
|
| 10 |
+
import ReactDOM from 'react-dom/client'
|
| 11 |
+
import { BrowserRouter } from 'react-router-dom'
|
| 12 |
+
import App from './App'
|
| 13 |
+
import './index.css'
|
| 14 |
+
import 'react-pdf/dist/Page/AnnotationLayer.css'
|
| 15 |
+
import 'react-pdf/dist/Page/TextLayer.css'
|
| 16 |
+
|
| 17 |
+
ReactDOM.createRoot(document.getElementById('root')!).render(
|
| 18 |
+
<React.StrictMode>
|
| 19 |
+
<BrowserRouter>
|
| 20 |
+
<App />
|
| 21 |
+
</BrowserRouter>
|
| 22 |
+
</React.StrictMode>,
|
| 23 |
+
)
|
ui/src/store.ts
ADDED
|
@@ -0,0 +1,88 @@
+import { create } from 'zustand'
+import { api } from './api'
+import type { FieldEntry, FieldProvenance, ReviewAction, ReviewState, SessionData } from './types'
+
+interface AppState {
+  // Session data
+  sessionData: SessionData | null
+  reviewState: ReviewState
+
+  // UI state
+  activePdfFile: string | null // filename of the PDF currently displayed
+  activeFieldPath: string | null // field path the user clicked
+  activeProvenance: FieldProvenance | null
+
+  // Actions
+  setSession: (data: SessionData) => void
+  loadReviewState: (sessionId: string) => Promise<void>
+  setActiveField: (entry: FieldEntry | null) => void
+  verifyField: (sessionId: string, fieldPath: string) => Promise<void>
+  overrideField: (sessionId: string, fieldPath: string, newValue: string) => Promise<void>
+  rejectField: (sessionId: string, fieldPath: string) => Promise<void>
+  setActivePdf: (filename: string) => void
+}
+
+export const useStore = create<AppState>((set, get) => ({
+  sessionData: null,
+  reviewState: {},
+  activePdfFile: null,
+  activeFieldPath: null,
+  activeProvenance: null,
+
+  setSession(data) {
+    // Set default active PDF to the first source file found in provenance
+    const firstPdf = data.provenance[0]?.source_filename ?? null
+    set({ sessionData: data, activePdfFile: firstPdf })
+  },
+
+  async loadReviewState(sessionId) {
+    const state = await api.getReviewState(sessionId)
+    set({ reviewState: state })
+  },
+
+  setActiveField(entry) {
+    if (!entry) {
+      set({ activeFieldPath: null, activeProvenance: null })
+      return
+    }
+    const { sessionData } = get()
+    const provenance = sessionData?.provenance.find(
+      (p) => p.field_path === entry.fieldPath,
+    ) ?? null
+
+    set({
+      activeFieldPath: entry.fieldPath,
+      activeProvenance: provenance,
+      // Switch PDF pane to the file that contains this field
+      activePdfFile: provenance?.source_filename ?? get().activePdfFile,
+    })
+  },
+
+  async verifyField(sessionId, fieldPath) {
+    await _applyReview(sessionId, fieldPath, 'verify', undefined, set)
+  },
+
+  async overrideField(sessionId, fieldPath, newValue) {
+    await _applyReview(sessionId, fieldPath, 'override', newValue, set)
+  },
+
+  async rejectField(sessionId, fieldPath) {
+    await _applyReview(sessionId, fieldPath, 'reject', undefined, set)
+  },
+
+  setActivePdf(filename) {
+    set({ activePdfFile: filename })
+  },
+}))
+
+async function _applyReview(
+  sessionId: string,
+  fieldPath: string,
+  action: ReviewAction,
+  overriddenValue: string | undefined,
+  set: (partial: Partial<AppState>) => void,
+) {
+  await api.updateReview(sessionId, fieldPath, action, overriddenValue)
+  const fresh = await api.getReviewState(sessionId)
+  set({ reviewState: fresh })
+}
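Note on `setActiveField` in `ui/src/store.ts`: the PDF pane switches to the file named in the matching provenance entry, and keeps the current file when no entry matches. The sketch below isolates that rule as a pure function so it can be checked without Zustand. `resolveActivePdf` and `ProvenanceLike` are hypothetical names for illustration and are not part of this commit.

```typescript
// Hypothetical, trimmed-down shape: only the two fields the lookup needs.
interface ProvenanceLike {
  field_path: string
  source_filename: string
}

// Returns the file that should be shown for `fieldPath`.
// Falls back to `currentPdf` when no provenance entry matches,
// mirroring `provenance?.source_filename ?? get().activePdfFile` in the store.
function resolveActivePdf(
  provenance: ProvenanceLike[],
  fieldPath: string,
  currentPdf: string | null,
): string | null {
  const match = provenance.find((p) => p.field_path === fieldPath)
  return match?.source_filename ?? currentPdf
}

// Example data (invented for the sketch).
const demoProvenance: ProvenanceLike[] = [
  { field_path: 'policy_header.policy_number', source_filename: 'schedule.pdf' },
  { field_path: 'vehicle_details.vrm', source_filename: 'certificate.pdf' },
]
```

With this data, a click on `vehicle_details.vrm` would move the pane to `certificate.pdf`, while a click on a field with no provenance would leave the pane on whatever file was already open.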
ui/src/types.ts
ADDED
|
@@ -0,0 +1,170 @@
+// ── Geometry ──────────────────────────────────────────────────────────────
+
+export interface Location {
+  page: number
+  /** [x0%, y0%, x1%, y1%] — top-left origin, 0–100 range, percent of page */
+  bbox: [number, number, number, number]
+}
+
+export interface FieldProvenance {
+  field_path: string
+  extracted_value: string
+  matched_text: string
+  /** 0.0–1.0 */
+  match_score: number
+  source_filename: string
+  location: Location
+}
+
+// ── Golden Record sub-types ───────────────────────────────────────────────
+
+export interface PeriodOfCover {
+  start_date?: string
+  expiry_date?: string
+  issue_date?: string
+}
+
+export interface PolicyHeader {
+  policy_number?: string
+  insurer?: string
+  product_name?: string
+  period_of_cover?: PeriodOfCover
+}
+
+export interface SecurityDetails {
+  has_security_device?: boolean
+  tracker_fitted?: boolean
+  modifications?: string
+}
+
+export interface VehicleDetails {
+  vrm?: string
+  make?: string
+  model?: string
+  fuel_type?: string
+  transmission?: string
+  estimated_value?: string
+  annual_mileage?: number
+  overnight_postcode?: string
+  kept_location?: string
+  security?: SecurityDetails
+}
+
+export interface Driver {
+  name: string
+  dob?: string
+  relationship?: string
+  occupation?: string
+  license_type?: string
+  is_main_driver: boolean
+  specific_excess?: number
+}
+
+export interface NoClaimsDiscount {
+  years?: number
+  protected?: boolean
+}
+
+export interface ExcessBreakdown {
+  standard_compulsory?: number
+  voluntary?: number
+  total_accidental_damage?: number
+  fire?: number
+  theft?: number
+  windscreen_repair?: number
+  windscreen_replacement?: number
+  own_repairer_additional_excess?: number
+}
+
+export interface CoverAndExcesses {
+  cover_type?: string
+  class_of_use?: string
+  driving_other_cars?: boolean
+  no_claims_discount?: NoClaimsDiscount
+  excess_breakdown?: ExcessBreakdown
+}
+
+export interface OptionalExtras {
+  motor_legal_protection?: number | string
+  breakdown_roadside_assistance?: number | string
+  enhanced_personal_accident?: number | string
+  hire_car?: number | string
+  key_cover?: number | string
+}
+
+export interface FinancialSummary {
+  total_annual_premium?: number
+  optional_extras?: OptionalExtras
+}
+
+export interface AdditionalRiskData {
+  home_ownership?: string
+  children_under_16?: boolean
+  number_of_cars_in_household?: number
+  non_motoring_convictions?: boolean
+  endorsements?: string
+}
+
+export interface Citations {
+  vehicle_model?: string
+  excess_details?: string
+  class_of_use?: string
+  driver_ages?: string
+  premium_breakdown?: string
+}
+
+export interface GoldenRecord {
+  policy_header?: PolicyHeader
+  vehicle_details?: VehicleDetails
+  driver_details: Driver[]
+  cover_and_excesses?: CoverAndExcesses
+  financial_summary?: FinancialSummary
+  additional_risk_data?: AdditionalRiskData
+  citations?: Citations
+}
+
+export interface ConflictEntry {
+  field: string
+  schedule_value?: string
+  certificate_value?: string
+  winner: 'schedule' | 'certificate' | 'fallback' | string
+}
+
+// ── Session ───────────────────────────────────────────────────────────────
+
+export interface SessionData {
+  record: GoldenRecord
+  provenance: FieldProvenance[]
+  conflicts?: ConflictEntry[]
+  session_id: string
+}
+
+// ── Review state ──────────────────────────────────────────────────────────
+
+export type ReviewAction = 'verify' | 'reject' | 'override'
+
+export interface FieldReview {
+  action: ReviewAction
+  overridden_value?: string
+  reviewer?: string
+}
+
+export type ReviewState = Record<string, FieldReview>
+
+// ── Flat field entry (used by the form panel) ─────────────────────────────
+
+export interface FieldEntry {
+  fieldPath: string
+  label: string
+  value: string | null
+  section: string
+  provenance?: FieldProvenance
+}
+
+// ── API response types ────────────────────────────────────────────────────
+
+export interface ProcessResponse {
+  session_id: string
+  fields_extracted: number
+  provenance_coverage: number
+}
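Note on `Location.bbox` in `ui/src/types.ts`: the box is stored as percentages of the page with a top-left origin, so a highlight overlay has to scale it by the rendered page size. A minimal sketch of that conversion follows, assuming the rendered width and height are known in pixels. `bboxToPixels` and `PixelRect` are hypothetical names used only for illustration; the commit's actual overlay code in `PDFPane.tsx` may do this differently.

```typescript
// Same shape as the `Location` interface in this diff, repeated so the sketch is self-contained.
interface Location {
  page: number
  /** [x0%, y0%, x1%, y1%], top-left origin, 0-100 range */
  bbox: [number, number, number, number]
}

// Hypothetical output type: a CSS-style rectangle in pixels.
interface PixelRect {
  left: number
  top: number
  width: number
  height: number
}

// Scale a percent-based bbox to pixel coordinates for a page rendered
// at `pageWidth` x `pageHeight` pixels.
function bboxToPixels(loc: Location, pageWidth: number, pageHeight: number): PixelRect {
  const [x0, y0, x1, y1] = loc.bbox
  return {
    left: (x0 * pageWidth) / 100,
    top: (y0 * pageHeight) / 100,
    width: ((x1 - x0) * pageWidth) / 100,
    height: ((y1 - y0) * pageHeight) / 100,
  }
}

const demoRect = bboxToPixels({ page: 1, bbox: [10, 20, 60, 30] }, 800, 1000)
```

Because the stored values are relative, the same provenance record stays valid at any zoom level; only `pageWidth` and `pageHeight` change between renders.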
ui/src/vite-env.d.ts
ADDED
|
@@ -0,0 +1 @@
+/// <reference types="vite/client" />
ui/tailwind.config.js
ADDED
|
@@ -0,0 +1,26 @@
+/** @type {import('tailwindcss').Config} */
+export default {
+  content: [
+    "./index.html",
+    "./src/**/*.{js,ts,jsx,tsx}",
+  ],
+  theme: {
+    extend: {
+      colors: {
+        brand: {
+          dark: '#1F2937',
+          blue: '#2563EB',
+          teal: '#008080',
+          50: '#f0fdfc',
+          100: '#ccfbf1',
+          600: '#2563EB',
+          700: '#1d4ed8',
+        },
+      },
+      fontFamily: {
+        sans: ['Inter', 'ui-sans-serif', 'system-ui', 'sans-serif'],
+      },
+    },
+  },
+  plugins: [],
+}
ui/tsconfig.app.json
ADDED
|
@@ -0,0 +1,21 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "useDefineForClassFields": true,
+    "lib": ["ES2020", "DOM", "DOM.Iterable"],
+    "module": "ESNext",
+    "skipLibCheck": true,
+    "moduleResolution": "bundler",
+    "allowImportingTsExtensions": true,
+    "resolveJsonModule": true,
+    "isolatedModules": true,
+    "moduleDetection": "force",
+    "noEmit": true,
+    "jsx": "react-jsx",
+    "strict": true,
+    "noUnusedLocals": false,
+    "noUnusedParameters": false,
+    "noFallthroughCasesInSwitch": true
+  },
+  "include": ["src"]
+}
ui/tsconfig.json
ADDED
|
@@ -0,0 +1,7 @@
+{
+  "files": [],
+  "references": [
+    { "path": "./tsconfig.app.json" },
+    { "path": "./tsconfig.node.json" }
+  ]
+}
ui/tsconfig.node.json
ADDED
|
@@ -0,0 +1,14 @@
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "lib": ["ES2023"],
+    "module": "ESNext",
+    "skipLibCheck": true,
+    "moduleResolution": "bundler",
+    "allowImportingTsExtensions": true,
+    "isolatedModules": true,
+    "moduleDetection": "force",
+    "noEmit": true
+  },
+  "include": ["vite.config.ts"]
+}
ui/vite.config.ts
ADDED
|
@@ -0,0 +1,16 @@
+import { defineConfig } from 'vite'
+import react from '@vitejs/plugin-react'
+
+export default defineConfig({
+  plugins: [react()],
+  server: {
+    port: 5173,
+    proxy: {
+      // Forward all /api/* requests to the FastAPI backend
+      '/api': {
+        target: 'http://localhost:8000',
+        changeOrigin: true,
+      },
+    },
+  },
+})