teja141290 committed on
Commit
be54038
·
1 Parent(s): 78046e4

Deploy PolicyTrace Hugging Face Space

config/prompts.yaml ADDED
@@ -0,0 +1,249 @@
+ # prompts.yaml — Versioned system prompts for the UK Motor Insurance IDP pipeline.
+ #
+ # HOW TO USE
+ # ──────────
+ # • Change `active_version` to switch all agents to a new prompt set.
+ # • Add a new top-level key under `prompts:` (e.g. v2) to version a new set.
+ # • Each version must define keys for every DocumentType value:
+ #     Schedule | Certificate | StatementOfFact | PolicyBooklet | _generic
+ # • Restart the pipeline after editing this file; no code changes required.
+
+ active_version: "v2"
+
+ prompts:
+   v1:
+     Schedule: |
+       You are an expert UK motor insurance data extractor specialising in Policy Schedules.
+
+       A Schedule is the most authoritative document for:
+       - policy_number, insurer name, policy dates (start_date, expiry_date)
+       - Vehicle registration mark (VRM) and make/model
+       - Cover type (Comprehensive, TPFT, Third Party Only)
+       - ALL excess figures: compulsory, voluntary, windscreen replacement/repair,
+         fire & theft. Calculate accidental_damage_total = compulsory + voluntary
+         if the total is not explicitly stated.
+       - No Claims Bonus (NCB) years and whether it is protected.
+
+       Extract every figure you find. Return null for anything genuinely absent.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     Certificate: |
+       You are an expert UK motor insurance data extractor specialising in Certificates
+       of Motor Insurance.
+
+       A Certificate is the legal authority for:
+       - Named drivers: full name, relationship to policyholder, age, and any
+         endorsements / restrictions on each driver.
+       - Class of use: social/domestic/pleasure, commuting, business use, etc.
+       - The period of cover dates and vehicle details as confirmation cross-checks.
+
+       Capture EVERY driver listed, including the proposer/policyholder.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     StatementOfFact: |
+       You are an expert UK motor insurance data extractor specialising in Statements
+       of Fact (also called Proposal Forms or Statement of Insurance).
+
+       A Statement of Fact is authoritative for:
+       - Claims history: number of claims in the last N years, dates, types, at-fault status.
+       - Motoring convictions / endorsements (SP30, IN10, etc.) for all drivers.
+       - Risk details: annual mileage, overnight parking, security devices, modifications.
+       - The proposer's occupation, age, years held licence.
+
+       Note these fields in driver restrictions[] and any relevant free-text fields.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     PolicyBooklet: |
+       You are an expert UK motor insurance data extractor reviewing a Policy Booklet
+       (also called Terms & Conditions or Policy Wording).
+
+       A Policy Booklet rarely contains policyholder-specific data. Extract only if
+       explicitly stated:
+       - Insurer name / underwriter
+       - Any default excess or cover-type definitions that clarify ambiguous fields.
+
+       If no policyholder-specific data is present, return a minimal JSON with only
+       the insurer field populated (if visible) and nulls elsewhere.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     _generic: |
+       You are an expert UK motor insurance data extractor.
+       Extract all available structured data from the document text provided.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+   # ── v2 placeholder ─────────────────────────────────────────────────────────
+   # Copy v1 keys here and iterate on individual prompts independently.
+
+   v2:
+     Schedule: |
+       You are an expert UK motor insurance data extractor specialising in Policy Schedules.
+       Extract ALL data from the document and populate the UKMotorGoldenRecord schema.
+
+       POLICY HEADER — extract:
+       - policy_number, insurer name (full legal name), product_name (e.g. "PolicyTrace Comprehensive Plus")
+       - period_of_cover: start_date and expiry_date as ISO-8601 datetime, issue_date as ISO-8601 date
+
+       VEHICLE DETAILS — extract:
+       - vrm (registration plate), make (manufacturer), model (full model name including variant/bhp)
+       - fuel_type (Electric / Petrol / Diesel / Hybrid), transmission (Automatic / Manual)
+       - estimated_value (e.g. "Market Value" or a £ amount)
+       - annual_mileage (integer), overnight_postcode, kept_location (e.g. "Drive", "Garage", "Road")
+       - security: has_security_device (bool), tracker_fitted (bool), modifications (text or "None")
+
+       DRIVER DETAILS — for EACH named driver extract:
+       - name (full name), dob (date of birth as ISO-8601 date YYYY-MM-DD)
+       - relationship ("Policyholder" / "Named Driver" / "Spouse" etc.)
+       - occupation (job title as stated), license_type ("Full UK" or "UK Provisional")
+       - is_main_driver: true only for the main/principal driver
+       - specific_excess: any driver-specific additional excess (float), null if none
+
+       COVER AND EXCESSES — extract:
+       - cover_type (Comprehensive / TPFT / Third Party Only)
+       - class_of_use (verbatim from schedule)
+       - driving_other_cars (bool, from schedule if stated)
+       - no_claims_discount: years (int), protected (bool)
+       - excess_breakdown:
+           standard_compulsory: the compulsory excess in £
+           voluntary: the voluntary excess in £
+           total_accidental_damage: COMPUTE as standard_compulsory + voluntary if not shown
+           fire: the fire-specific excess in £ (may differ from theft)
+           theft: the theft-specific excess in £ (may differ from fire)
+           windscreen_repair: windscreen repair excess in £
+           windscreen_replacement: windscreen replacement excess in £
+           own_repairer_additional_excess: additional excess for using own repairer in £
+
+       FINANCIAL SUMMARY — extract:
+       - total_annual_premium: total annual premium in £ (float)
+       - optional_extras: for each extra, use the premium amount (float) if purchased,
+         or the string "Not Selected" if not selected/included:
+           motor_legal_protection, breakdown_roadside_assistance,
+           enhanced_personal_accident, hire_car, key_cover
+
+       ADDITIONAL RISK DATA — extract:
+       - home_ownership (e.g. "Homeowner", "Not a Homeowner", "Tenant")
+       - children_under_16 (bool)
+       - number_of_cars_in_household (int)
+       - non_motoring_convictions (bool)
+       - endorsements (text, "None" if absent)
+
+       CRITICAL RULES:
+       - Fire excess and theft excess are SEPARATE fields — they may have different values.
+       - Driver DOBs must be extracted as YYYY-MM-DD dates, not as ages.
+       - Return null for any field genuinely absent. Do NOT invent data.
+       - Output ONLY valid JSON matching the UKMotorGoldenRecord schema — no commentary.
+
+       FIELD_CITATIONS — populate the `field_citations` dict with a verbatim phrase
+       copied EXACTLY from the document for each field you extract.
+       Use the dotted field path as the key.
+       The phrase must be a verbatim copy of the raw text as it appears in the document —
+       do NOT normalise, translate or paraphrase.
+
+       Required citations (include only those you actually populated):
+         "policy_header.policy_number" → e.g. "NBM-DEMO-0427"
+         "policy_header.insurer" → e.g. "Northbridge Mutual Motor Insurance Ltd"
+         "policy_header.period_of_cover.start_date" → e.g. "15/04/2026 at 00:00 hours"
+         "policy_header.period_of_cover.expiry_date" → e.g. "14/04/2027 at 23:59 hours"
+         "policy_header.period_of_cover.issue_date" → e.g. "16/03/2026"
+         "vehicle_details.vrm" → e.g. "ZX24 DEM"
+         "vehicle_details.make" → e.g. "Skoda"
+         "vehicle_details.model" → e.g. "Enyaq iV 60 62kWh 177.0 bhp"
+         "vehicle_details.fuel_type" → e.g. "Electric"
+         "vehicle_details.estimated_value" → e.g. "Market Value"
+         "vehicle_details.annual_mileage" → e.g. "7,000"
+         "vehicle_details.overnight_postcode" → e.g. "ZZ1 1ZZ"
+         "vehicle_details.kept_location" → e.g. "Drive"
+         "cover_and_excesses.cover_type" → e.g. "Comprehensive"
+         "cover_and_excesses.class_of_use" → e.g. "Social, Domestic, Pleasure and Commuting"
+         "cover_and_excesses.no_claims_discount.years" → e.g. "2 years"
+         "cover_and_excesses.excess_breakdown.standard_compulsory" → e.g. "GBP 395.00"
+         "cover_and_excesses.excess_breakdown.voluntary" → e.g. "GBP 200.00"
+         "cover_and_excesses.excess_breakdown.windscreen_repair" → e.g. "GBP 15.00"
+         "cover_and_excesses.excess_breakdown.windscreen_replacement" → e.g. "GBP 200.00"
+         "financial_summary.total_annual_premium" → e.g. "GBP 703.28"
+         For each driver[N] (N = 0, 1, 2…):
+           "driver_details[N].name" → e.g. "Alex Morgan"
+           "driver_details[N].dob" → e.g. "14/03/1991"
+           "driver_details[N].occupation" → e.g. "Product Manager"
+           "driver_details[N].license_type" → e.g. "Full UK"
+
+     Certificate: |
+       You are an expert UK motor insurance data extractor specialising in
+       Certificates of Motor Insurance.
+
+       A Certificate of Motor Insurance is the LEGAL document for road use.
+       Focus ONLY on what is legally defined in this document.
+
+       POLICY HEADER — extract:
+       - policy_number (from the certificate heading)
+       - insurer (full legal name as printed on the certificate)
+       - period_of_cover: start_date and expiry_date as ISO-8601 datetime
+
+       COVER AND EXCESSES — extract ONLY:
+       - class_of_use: copy the EXACT text of the "Limitations as to use" or
+         "Class of Use" clause verbatim (e.g. "Social, Domestic, Pleasure and Commuting")
+       - driving_other_cars: true if the certificate explicitly grants driving other cars;
+         false otherwise
+
+       DRIVER DETAILS — for EACH named person entitled to drive:
+       - name (full name as printed), relationship if stated, is_main_driver if the
+         main policyholder is identified
+
+       LEAVE AS NULL — do NOT populate these sections from a Certificate:
+       - vehicle_details (make, model, fuel_type, transmission, security, mileage, etc.)
+       - excess_breakdown (standard_compulsory, voluntary, fire, theft, windscreen, etc.)
+       - financial_summary (total_annual_premium, optional_extras)
+       - additional_risk_data
+       - driver dob, occupation, license_type, specific_excess
+
+       Output ONLY valid JSON matching the UKMotorGoldenRecord schema — no commentary.
+
+       FIELD_CITATIONS — populate the `field_citations` dict with a verbatim phrase
+       copied EXACTLY from the document for each field you extract.
+       Use the dotted field path as the key.
+       The phrase must be a verbatim copy of the raw text as it appears in the document —
+       do NOT normalise, translate or paraphrase.
+
+       Required citations (include only those you actually populated):
+         "policy_header.policy_number" → e.g. "NBM-DEMO-0427"
+         "policy_header.insurer" → e.g. "Northbridge Mutual Motor Insurance Ltd"
+         "policy_header.period_of_cover.start_date" → e.g. "15/04/2026 at 00:00 hours"
+         "policy_header.period_of_cover.expiry_date" → e.g. "14/04/2027 at 23:59 hours"
+         "cover_and_excesses.class_of_use" → e.g. "Social, Domestic, Pleasure and Commuting"
+         "cover_and_excesses.cover_type" → e.g. "Comprehensive"
+         For each driver[N] (N = 0, 1, 2…):
+           "driver_details[N].name" → e.g. "Alex Morgan"
+
+     StatementOfFact: |
+       You are an expert UK motor insurance data extractor specialising in Statements
+       of Fact (also called Proposal Forms or Statement of Insurance).
+
+       A Statement of Fact is authoritative for:
+       - Claims history: number of claims in the last N years, dates, types, at-fault status.
+       - Motoring convictions / endorsements (SP30, IN10, etc.) for all drivers.
+       - Risk details: annual mileage, overnight parking, security devices, modifications.
+       - The proposer's occupation, age, years held licence.
+
+       Extract into the UKMotorGoldenRecord schema wherever fields map cleanly.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     PolicyBooklet: |
+       You are an expert UK motor insurance data extractor reviewing a Policy Booklet
+       (also called Terms & Conditions or Policy Wording).
+
+       A Policy Booklet rarely contains policyholder-specific data. Extract only if
+       explicitly stated: insurer name or any policyholder-specific definitions.
+       If no policyholder-specific data is present, return an empty UKMotorGoldenRecord.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+
+     _generic: |
+       You are an expert UK motor insurance data extractor.
+       Extract all available structured data from the document text provided.
+       Populate the UKMotorGoldenRecord schema as completely as possible.
+       Output ONLY valid JSON matching the requested schema — no commentary.
+   #     <improved certificate prompt>
+   #   StatementOfFact: |
+   #     <improved sof prompt>
+   #   PolicyBooklet: |
+   #     <improved booklet prompt>
+   #   _generic: |
+   #     <improved generic prompt>
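
The versioning scheme in `config/prompts.yaml` can be illustrated with a small lookup. This is a minimal sketch, not the pipeline's actual loader: `select_prompt` is a hypothetical helper, it works on an already-parsed mapping rather than reading the YAML file, and falling back to `_generic` for an unrecognised document type is an assumption based on the "HOW TO USE" comment.

```python
from typing import Any, Mapping

GENERIC_KEY = "_generic"


def select_prompt(config: Mapping[str, Any], document_type: str) -> str:
    """Return the prompt for document_type from the active version (hypothetical helper)."""
    version = config["active_version"]
    try:
        prompt_set = config["prompts"][version]
    except KeyError as exc:
        raise KeyError(f"active_version {version!r} is not defined under prompts") from exc
    # Assumption: unknown document types fall back to the generic prompt.
    return prompt_set.get(document_type, prompt_set[GENERIC_KEY])


# Tiny stand-in for the parsed YAML; the real prompts are much longer.
example_config = {
    "active_version": "v2",
    "prompts": {
        "v1": {"Schedule": "v1 schedule prompt", "_generic": "v1 generic prompt"},
        "v2": {"Schedule": "v2 schedule prompt", "_generic": "v2 generic prompt"},
    },
}

print(select_prompt(example_config, "Schedule"))
```

Switching `active_version` back to `"v1"` in the mapping would change every lookup at once, which is the behaviour the file header describes.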
config/settings.yaml ADDED
@@ -0,0 +1,73 @@
+ # settings.yaml — Runtime tuneables for the UK Motor Insurance IDP pipeline.
+ #
+ # HOW TO USE
+ # ──────────
+ # • Edit values here to tune behaviour without touching Python code.
+ # • Environment variables take priority over values in this file:
+ #     GROQ_API_KEY — (required) your Groq API secret key
+ #     GROQ_MODEL — overrides llm.model below (set in .env or shell)
+ # • Restart the pipeline after editing this file.
+
+ llm:
+   # Model served by Groq. Override at runtime via GROQ_MODEL env var.
+   model: "meta-llama/llama-4-scout-17b-16e-instruct"
+   # Fast model for document classification. Override via GROQ_CLASSIFIER_MODEL env var.
+   classifier_model: "llama-3.1-8b-instant"
+   # Number of instructor self-correction retries on Pydantic validation failure.
+   max_retries: 2
+
+ pii:
+   # Minimum Presidio confidence score (0.0–1.0) to trigger redaction.
+   score_threshold: 0.5
+   # Set to true to also redact DATE_TIME entities (breaks date extraction — use carefully).
+   mask_dates: false
+   # spaCy language code used by the Presidio NLP engine.
+   language: "en"
+   # Presidio entity types to redact before sending text to the LLM.
+   entities:
+     - PERSON
+     - PHONE_NUMBER
+     - EMAIL_ADDRESS
+     - UK_NHS
+     - UK_NIN # National Insurance Number
+     - CREDIT_CARD
+     - IBAN_CODE
+     - LOCATION # postcodes / addresses
+     - IP_ADDRESS
+     - URL
+
+ pipeline:
+   # Default output path for the Golden Record JSON.
+   output_path: "../output/golden_record.json"
+   # Default logging verbosity: DEBUG | INFO | WARNING | ERROR
+   log_level: "INFO"
+   # Session directories older than this many days are deleted on API startup. 0 = disabled.
+   session_ttl_days: 30
+
+ debug:
+   # Master switch — set to false to skip all debug artifact writing.
+   enabled: true
+   # Root folder for debug runs. Each execution creates a timestamped sub-folder.
+   output_dir: "./output/debug"
+   # Save the raw Markdown produced by docling for each PDF.
+   save_markdown: true
+   # Save the PII-masked Markdown that is actually sent to the LLM.
+   save_masked_markdown: true
+   # Save the raw UKMotorPolicy JSON extracted from each document.
+   save_extraction_json: true
+   # Append a JSONL line per document: prompt size, response time, fields populated.
+   save_metrics: true
+
+ docling:
+   # Disable OCR — UK insurance PDFs are text-based; OCR doubles memory usage per page.
+   do_ocr: false
+   # Disable deep table-structure recognition to reduce memory pressure on large PDFs.
+   do_table_structure: false
+   # Maximum pages to process per document type. null = no limit.
+   # Policy Booklet is the lowest-priority document (57+ pages) — cap it to save memory.
+   max_pages:
+     Schedule: null
+     Certificate: null
+     StatementOfFact: null
+     PolicyBooklet: 20
+     Unknown: 30
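
The header of `config/settings.yaml` says environment variables take priority over file values. A minimal sketch of that precedence for the extraction model follows; `resolve_model` is a hypothetical helper, the settings are passed in as an already-parsed mapping, and treating an empty `GROQ_MODEL` as unset is an assumption rather than documented behaviour.

```python
import os
from typing import Any, Mapping, Optional


def resolve_model(
    settings: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return GROQ_MODEL if set, otherwise llm.model from the settings mapping."""
    env = os.environ if env is None else env
    # Assumption: a blank or whitespace-only override counts as "not set".
    override = env.get("GROQ_MODEL", "").strip()
    return override or settings["llm"]["model"]


settings = {"llm": {"model": "meta-llama/llama-4-scout-17b-16e-instruct"}}
print(resolve_model(settings, env={"GROQ_MODEL": "llama-3.1-8b-instant"}))
```

Passing `env` explicitly keeps the function easy to test; calling `resolve_model(settings)` with no second argument reads the real process environment.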
docs/architecture.md ADDED
@@ -0,0 +1,139 @@
+ # PolicyTrace Architecture
+
+ PolicyTrace is built as a two-part application:
+
+ - A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
+ - A React frontend that lets a human reviewer inspect every extracted field against the source PDF.
+
+ ## Core Flow
+
+ ```mermaid
+ sequenceDiagram
+     participant User
+     participant UI as React UI
+     participant API as FastAPI
+     participant Docling
+     participant LLM as Groq LLM
+     participant Arbiter
+     participant Prov as Provenance matcher
+
+     User->>UI: Upload PDF pack
+     UI->>API: POST /api/process
+     API->>Docling: Convert PDFs to Markdown and geometry
+     API->>API: Mask selected PII
+     API->>LLM: Classify document type
+     API->>LLM: Extract typed Golden Record fields
+     API->>Arbiter: Merge Schedule and Certificate
+     Arbiter-->>API: Golden Record plus conflicts
+     API->>Prov: Match fields to PDF text geometry
+     Prov-->>API: Field-level provenance
+     API-->>UI: Session ID
+     UI->>API: GET /api/session/{id}
+     API-->>UI: Record, provenance, conflicts
+ ```
+
+ ## Backend Modules
+
+ ### `src/agents.py`
+
+ Responsible for document-level work:
+
+ - Convert PDF to Markdown using Docling.
+ - Build a Docling geometry corpus for provenance.
+ - Mask selected PII before LLM calls.
+ - Classify document type.
+ - Route text to specialist extraction prompts.
+ - Return a `UKMotorGoldenRecord` Pydantic model.
+
+ ### `src/schema.py`
+
+ Defines the canonical output contract:
+
+ - `UKMotorGoldenRecord`
+ - policy header
+ - vehicle details
+ - driver details
+ - cover and excesses
+ - financial summary
+ - additional risk data
+ - field provenance
+ - conflict entries
+
+ The schema keeps most fields optional because each source document is only partially authoritative.
+
+ ### `src/arbiter.py`
+
+ Merges Schedule and Certificate records using a hierarchy of truth.
+
+ Schedule wins for:
+
+ - vehicle details
+ - cover type
+ - no claims discount
+ - excess breakdown
+ - financial summary
+ - driver DOB, occupation, licence type
+
+ Certificate wins for:
+
+ - class of use
+ - driving other cars
+ - legal driver entitlement details when present
+
+ When two documents disagree, the arbiter records a `ConflictEntry`.
+
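
The hierarchy of truth described for `src/arbiter.py` can be sketched on flat dictionaries. This is an illustrative simplification, not the project's arbiter: the real record is a nested Pydantic model, and the field names, the `ConflictEntry` attributes, and `merge_records` below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Assumed flat field names; the real schema is nested.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class ConflictEntry:
    """Hypothetical shape; the real ConflictEntry is defined in src/schema.py."""
    field: str
    schedule_value: Any
    certificate_value: Any
    winner: str


def merge_records(
    schedule: Dict[str, Any], certificate: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[ConflictEntry]]:
    merged: Dict[str, Any] = {}
    conflicts: List[ConflictEntry] = []
    for field in sorted(set(schedule) | set(certificate)):
        s_val, c_val = schedule.get(field), certificate.get(field)
        winner = "Certificate" if field in CERTIFICATE_WINS else "Schedule"
        preferred, fallback = (c_val, s_val) if winner == "Certificate" else (s_val, c_val)
        # Take the authoritative value, or the other document's value if it is absent.
        merged[field] = preferred if preferred is not None else fallback
        # Record a conflict only when both documents state different values.
        if s_val is not None and c_val is not None and s_val != c_val:
            conflicts.append(ConflictEntry(field, s_val, c_val, winner))
    return merged, conflicts


schedule = {"cover_type": "Comprehensive", "class_of_use": "Social, Domestic and Pleasure"}
certificate = {"class_of_use": "Social, Domestic, Pleasure and Commuting"}
merged, conflicts = merge_records(schedule, certificate)
print(merged["class_of_use"], len(conflicts))
```

Comparing against `None` rather than truthiness matters here, because a legitimate `False` for `driving_other_cars` must not be treated as missing.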
+ ### `src/provenance.py`
+
+ Builds field-level PDF provenance after extraction.
+
+ The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.
+
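
To make the gap between raw phrases and canonical values concrete, the sketch below converts the two example phrases by hand. In the pipeline this normalisation is done by the LLM; the helpers `parse_cover_datetime` and `parse_gbp_amount` are hypothetical, and using `Decimal` instead of the float the prompts ask for is a choice made here to keep the arithmetic exact.

```python
from datetime import datetime
from decimal import Decimal


def parse_cover_datetime(raw: str) -> str:
    """Turn a phrase like '15/04/2026 at 00:00 hours' into an ISO-8601 datetime string."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y at %H:%M hours").isoformat()


def parse_gbp_amount(raw: str) -> Decimal:
    """Turn a phrase like 'GBP 703.28' into a Decimal amount."""
    cleaned = raw.strip().upper().removeprefix("GBP").replace("£", "").replace(",", "")
    return Decimal(cleaned.strip())


print(parse_cover_datetime("15/04/2026 at 00:00 hours"))
print(parse_gbp_amount("GBP 703.28"))

# The Schedule prompt's rule: total accidental damage = compulsory + voluntary.
total = parse_gbp_amount("GBP 395.00") + parse_gbp_amount("GBP 200.00")
print(total)
```

The last three lines reproduce the excess rule from the Schedule prompt using the demo pack's figures. `str.removeprefix` needs Python 3.9 or newer.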
+ To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
+
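
A rough sketch of citation matching is shown below. It is not the project's matcher: `requirements.txt` lists `rapidfuzz` for provenance fuzzy matching, whereas this example substitutes the standard library's `difflib` so it runs without extra installs. `TextSpan`, `best_span`, and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class TextSpan:
    """Hypothetical stand-in for one text cell from the Docling geometry corpus."""
    text: str
    page: int
    bbox: Tuple[float, float, float, float]


def _normalise(text: str) -> str:
    # Lower-case and collapse whitespace so layout differences do not block a match.
    return " ".join(text.lower().split())


def best_span(
    citation: str, spans: List[TextSpan], min_score: float = 0.8
) -> Optional[TextSpan]:
    """Return the span that best matches a verbatim citation, or None if nothing is close."""
    target = _normalise(citation)
    best, best_score = None, 0.0
    for span in spans:
        candidate = _normalise(span.text)
        if target and target in candidate:
            score = 1.0  # exact containment beats any fuzzy score
        else:
            score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best, best_score = span, score
    return best if best_score >= min_score else None


spans = [
    TextSpan("Policy number: NBM-DEMO-0427", 1, (40.0, 700.0, 300.0, 712.0)),
    TextSpan("Total premium GBP 703.28", 2, (40.0, 500.0, 260.0, 512.0)),
]
match = best_span("GBP 703.28", spans)
print(match.page if match else None)
```

Because the citation is a verbatim copy, a plain substring check resolves most fields; the fuzzy ratio only matters when PDF text extraction splits or reflows the phrase.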
+ ### `src/api.py`
+
+ FastAPI service for the review UI:
+
+ - `GET /api/health`
+ - `POST /api/process`
+ - `GET /api/session/{id}`
+ - `GET /api/pdf/{session_id}/{filename}`
+ - `PATCH /api/session/{id}/review`
+ - `GET /api/session/{id}/review-state`
+ - `DELETE /api/session/{id}`
+
+ When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes.
+
+ ## Frontend Modules
+
+ ### `ui/src/UploadPage.tsx`
+
+ Upload screen for PDF packs.
+
+ ### `ui/src/SessionPage.tsx`
+
+ Loads an existing session from the API so sessions can be opened directly from a URL.
+
+ ### `ui/src/ReviewDashboard.tsx`
+
+ Two-column review layout: PDF viewer on the left, Golden Record fields on the right.
+
+ ### `ui/src/PDFPane.tsx`
+
+ Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields.
+
+ ### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`
+
+ Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
+
+ ## Why This Architecture
+
+ The system deliberately separates concerns:
+
+ - The LLM extracts structured values.
+ - Pydantic validates the shape.
+ - The arbiter applies domain-specific source authority.
+ - Provenance is calculated after extraction instead of trusting the model to invent coordinates.
+ - The UI keeps humans in the loop where confidence, evidence, or conflicts need review.
+
+ That separation is what turns the project from a prompt demo into a deployable workflow.
docs/hugging-face.md ADDED
@@ -0,0 +1,70 @@
+ # Hugging Face Spaces Deployment
+
+ PolicyTrace should be deployed as a Docker Space because it is a FastAPI plus React application, not a pure Gradio or Streamlit app.
+
+ ## Deployment Shape
+
+ The root `Dockerfile` does this:
+
+ 1. Builds the React UI with Vite.
+ 2. Installs the Python backend dependencies.
+ 3. Downloads the small spaCy English model used by Presidio.
+ 4. Copies `ui/dist` into the image.
+ 5. Starts FastAPI on port `7860`.
+ 6. Lets FastAPI serve both `/api/*` and the React app.
+
+ ## Space Settings
+
+ Create a new Hugging Face Space:
+
+ - SDK: Docker
+ - Port: `7860`
+ - Visibility: public or private, depending on your demo plan
+
+ Add this secret in the Space settings:
+
+ ```text
+ GROQ_API_KEY=your_groq_key
+ ```
+
+ Optional secrets or variables:
+
+ ```text
+ GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
+ GROQ_CLASSIFIER_MODEL=llama-3.1-8b-instant
+ ```
+
+ ## Public Demo Safety
+
+ For a public Space, use only the synthetic PDFs in:
+
+ ```text
+ sample_data/policytrace_demo_pack/
+ ```
+
+ Do not upload real customer documents to a public demo unless you have explicit permission and strong retention controls.
+
+ ## Storage Notes
+
+ Hugging Face Spaces have ephemeral storage by default. This means generated sessions may disappear when the Space restarts.
+
+ For a public portfolio demo, ephemeral storage is usually fine. For a persistent review workflow, enable persistent storage or move sessions to an external object store/database.
+
+ ## Local Docker Test
+
+ Before pushing to a Space:
+
+ ```powershell
+ docker build -t policytrace .
+ docker run --rm -p 7860:7860 --env-file .env policytrace
+ ```
+
+ Then open:
+
+ ```text
+ http://localhost:7860
+ ```
+
+ ## Linking From This Repo
+
+ After the Space is live, add the Space URL to the main `README.md` demo section.
requirements-dev.txt ADDED
@@ -0,0 +1,2 @@
+ -r requirements.txt
+ pytest>=8.2.0
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ # ── Core pipeline ──────────────────────────────────────────────────────────
+ docling>=2.5.0
+ instructor>=1.3.0
+ groq>=0.8.0
+ pydantic>=2.7.0
+
+ # ── PII masking ────────────────────────────────────────────────────────────
+ presidio-analyzer>=2.2.354
+ presidio-anonymizer>=2.2.354
+ # spaCy model (download separately after install):
+ #   python -m spacy download en_core_web_lg
+ spacy>=3.7.0
+
+ # ── Utilities ──────────────────────────────────────────────────────────────
+ python-dotenv>=1.0.0 # load GROQ_API_KEY / GROQ_MODEL from .env
+ pyyaml>=6.0.0 # parse config/settings.yaml and config/prompts.yaml
+
+ # ── API server (Visual Audit UI) ───────────────────────────────────────────
+ fastapi>=0.111.0
+ uvicorn[standard]>=0.30.0
+ python-multipart>=0.0.9 # required by FastAPI for UploadFile
+
+ # ── Provenance fuzzy matching ──────────────────────────────────────────────
+ rapidfuzz>=3.9.0
+
+ # Demo fixture generation
+ reportlab>=4.2.0
sample_data/README.md ADDED
@@ -0,0 +1,25 @@
+ # Sample Data
+
+ This folder contains synthetic demo documents for PolicyTrace.
+
+ The PDFs in `policytrace_demo_pack/` are fictional, text-based UK motor
+ insurance documents. They use invented names, policy numbers, vehicle
+ registration, insurer branding, address, and risk details. They are safe to use
+ in screenshots, demos, blog posts, GitHub examples, and Hugging Face Spaces.
+
+ Generated files:
+
+ - `Schedule of Insurance - Demo.pdf`
+ - `Certificate of Motor Insurance - Demo.pdf`
+ - `Statement of Fact - Demo.pdf`
+ - `Policy Booklet - Demo.pdf`
+ - `manifest.json`
+
+ To regenerate the pack from source:
+
+ ```powershell
+ python scripts/generate_synthetic_policy_pack.py
+ ```
+
+ Do not commit real customer PDFs, real policy documents, or local extraction
+ outputs to the public repository.
sample_data/policytrace_demo_pack/manifest.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "purpose": "Synthetic demo data for AI Tool Stack PolicyTrace.",
+   "warning": "No real customer, insurer, vehicle, or policy data is included.",
+   "files": [
+     "Schedule of Insurance - Demo.pdf",
+     "Certificate of Motor Insurance - Demo.pdf",
+     "Statement of Fact - Demo.pdf",
+     "Policy Booklet - Demo.pdf"
+   ],
+   "expected_policy_number": "NBM-DEMO-0427",
+   "expected_vrm": "ZX24 DEM",
+   "expected_insurer": "Northbridge Mutual Motor Insurance Ltd"
+ }
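
One possible use of `manifest.json` is a quick check that every listed PDF is present in the pack folder. The sketch below assumes only the `files` key shown in the manifest; `missing_pack_files` is a hypothetical helper and is not part of the repository.

```python
import json
from pathlib import Path
from typing import List


def missing_pack_files(pack_dir: Path) -> List[str]:
    """Return the names listed under "files" in manifest.json that are absent from pack_dir."""
    manifest = json.loads((pack_dir / "manifest.json").read_text(encoding="utf-8"))
    return [name for name in manifest["files"] if not (pack_dir / name).is_file()]


if __name__ == "__main__":
    pack = Path("sample_data/policytrace_demo_pack")
    if (pack / "manifest.json").is_file():
        print(missing_pack_files(pack))
```

Run from the repository root, an empty list would mean the pack matches its manifest.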
scripts/generate_synthetic_policy_pack.py ADDED
@@ -0,0 +1,451 @@
+ """
+ Generate a synthetic UK motor insurance PDF pack for demos and tests.
+
+ The PDFs are intentionally fictional: invented insurer, logo, names, address,
+ policy number, vehicle registration, and risk details. They are text-based PDFs
+ so Docling can parse them without OCR.
+
+ Run from the repository root:
+
+     python scripts/generate_synthetic_policy_pack.py
+ """
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Iterable
+
+ from reportlab.lib import colors
+ from reportlab.lib.pagesizes import A4
+ from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
+ from reportlab.lib.units import mm
+ from reportlab.platypus import (
+     Paragraph,
+     SimpleDocTemplate,
+     Spacer,
+     Table,
+     TableStyle,
+ )
+
+
+ OUT_DIR = Path("sample_data/policytrace_demo_pack")
+ BRAND_DARK = colors.HexColor("#1F2937")
+ BRAND_BLUE = colors.HexColor("#2563EB")
+ BRAND_TEAL = colors.HexColor("#008080")
+ BRAND_PINK = colors.HexColor("#FCE7F3")
+ BRAND_LIGHT = colors.HexColor("#F8FAFC")
+
+ POLICY = {
+     "insurer": "Northbridge Mutual Motor Insurance Ltd",
+     "product_name": "PolicyTrace Comprehensive Plus",
+     "policy_number": "NBM-DEMO-0427",
+     "issue_date": "18/03/2026",
+     "start_date": "15/04/2026 at 00:00 hours",
+     "expiry_date": "14/04/2027 at 23:59 hours",
+     "policyholder": "Alex Morgan",
+     "address": "14 Demo Crescent, Sampleton, West Yorkshire, ZZ1 1ZZ",
+     "dob": "14/03/1991",
+     "occupation": "Product Manager",
+     "second_driver": "Priya Shah",
+     "second_driver_dob": "07/08/1995",
+     "second_driver_occupation": "Business Analyst",
+     "third_driver": "Jordan Reed",
+     "third_driver_dob": "11/10/1985",
+     "third_driver_occupation": "Data Administrator",
+     "vrm": "ZX24 DEM",
+     "make": "Skoda",
+     "model": "Enyaq iV 60 62kWh 177.0 bhp",
+     "fuel_type": "Electric",
+     "transmission": "Automatic",
+     "estimated_value": "Market Value",
+     "annual_mileage": "7,000 miles",
+     "overnight_postcode": "ZZ1 1ZZ",
+     "kept_location": "Drive",
+     "security_device": "Yes",
+     "tracker_fitted": "No",
+     "modifications": "No",
+     "cover_type": "Comprehensive",
+     "class_of_use": (
+         "Use for social, domestic and pleasure purposes including commuting "
+         "to and from a permanent place of work."
+     ),
+     "driving_other_cars": "No",
+     "ncb_years": "2 years",
+     "ncb_protected": "No",
+     "standard_compulsory": "GBP 395.00",
+     "voluntary": "GBP 200.00",
+     "total_accidental_damage": "GBP 595.00",
+     "fire": "GBP 395.00",
+     "theft": "GBP 445.00",
+     "windscreen_repair": "GBP 15.00",
+     "windscreen_replacement": "GBP 200.00",
+     "own_repairer": "GBP 200.00",
+     "total_premium": "GBP 703.28",
+     "legal": "GBP 25.40",
+     "breakdown": "GBP 28.07",
+     "personal_accident": "GBP 20.00",
+     "hire_car": "Not selected",
+     "key_cover": "Not selected",
+ }
+
+
+ def _styles() -> dict[str, ParagraphStyle]:
+     base = getSampleStyleSheet()
+     return {
+         "title": ParagraphStyle(
+             "title",
+             parent=base["Title"],
+             fontName="Helvetica-Bold",
+             fontSize=22,
+             textColor=BRAND_DARK,
+             spaceAfter=14,
+         ),
+         "subtitle": ParagraphStyle(
+             "subtitle",
+             parent=base["Normal"],
+             fontName="Helvetica",
+             fontSize=10,
+             leading=14,
+             textColor=colors.HexColor("#475569"),
+             spaceAfter=10,
+         ),
+         "h2": ParagraphStyle(
+             "h2",
+             parent=base["Heading2"],
+             fontName="Helvetica-Bold",
+             fontSize=13,
+             textColor=BRAND_TEAL,
+             spaceBefore=12,
+             spaceAfter=7,
+         ),
+         "body": ParagraphStyle(
+             "body",
+             parent=base["BodyText"],
+             fontName="Helvetica",
+             fontSize=9,
+             leading=12,
+             textColor=BRAND_DARK,
+             spaceAfter=6,
+         ),
+         "small": ParagraphStyle(
+             "small",
+             parent=base["BodyText"],
+             fontName="Helvetica",
+             fontSize=7,
+             leading=9,
+             textColor=colors.HexColor("#64748B"),
+         ),
+     }
+
+
+ def _draw_header(canvas, doc, title: str) -> None:
+     canvas.saveState()
+     width, height = A4
+     canvas.setFillColor(BRAND_DARK)
+     canvas.roundRect(16 * mm, height - 24 * mm, 42 * mm, 11 * mm, 2 * mm, fill=1, stroke=0)
+     canvas.setFillColor(BRAND_TEAL)
+     canvas.circle(22 * mm, height - 18.5 * mm, 2.6 * mm, fill=1, stroke=0)
+     canvas.setFillColor(BRAND_BLUE)
+     canvas.circle(29 * mm, height - 18.5 * mm, 2.6 * mm, fill=1, stroke=0)
+     canvas.setFillColor(colors.white)
+     canvas.setFont("Helvetica-Bold", 6)
+     canvas.drawString(36 * mm, height - 19.5 * mm, "NORTHBRIDGE")
+     canvas.setFillColor(colors.HexColor("#64748B"))
+     canvas.setFont("Helvetica", 7)
+     canvas.drawRightString(width - 16 * mm, height - 18 * mm, title)
+     canvas.setStrokeColor(colors.HexColor("#E2E8F0"))
+     canvas.line(16 * mm, height - 28 * mm, width - 16 * mm, height - 28 * mm)
+     canvas.setFont("Helvetica", 6)
+     canvas.setFillColor(colors.HexColor("#94A3B8"))
+     canvas.drawString(
+         16 * mm,
+         11 * mm,
+         "Synthetic demo document generated for AI Tool Stack PolicyTrace. No real customer or insurer data.",
+     )
+     canvas.drawRightString(width - 16 * mm, 11 * mm, f"Page {doc.page}")
+     canvas.restoreState()
+
+
+ def _table(rows: Iterable[Iterable[str]], col_widths: list[float] | None = None) -> Table:
+     data = [[Paragraph(str(cell), _styles()["body"]) for cell in row] for row in rows]
+     table = Table(data, colWidths=col_widths, hAlign="LEFT")
+     table.setStyle(
+         TableStyle(
+             [
+                 ("BACKGROUND", (0, 0), (-1, 0), BRAND_LIGHT),
+                 ("TEXTCOLOR", (0, 0), (-1, 0), BRAND_DARK),
+                 ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
+                 ("GRID", (0, 0), (-1, -1), 0.35, colors.HexColor("#CBD5E1")),
179
+ ("VALIGN", (0, 0), (-1, -1), "TOP"),
180
+ ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, BRAND_PINK]),
181
+ ("LEFTPADDING", (0, 0), (-1, -1), 6),
182
+ ("RIGHTPADDING", (0, 0), (-1, -1), 6),
183
+ ("TOPPADDING", (0, 0), (-1, -1), 5),
184
+ ("BOTTOMPADDING", (0, 0), (-1, -1), 5),
185
+ ]
186
+ )
187
+ )
188
+ return table
189
+
190
+
191
+ def _doc(path: Path, title: str) -> SimpleDocTemplate:
192
+ return SimpleDocTemplate(
193
+ str(path),
194
+ pagesize=A4,
195
+ leftMargin=18 * mm,
196
+ rightMargin=18 * mm,
197
+ topMargin=32 * mm,
198
+ bottomMargin=18 * mm,
199
+ title=title,
200
+ author="AI Tool Stack",
201
+ )
202
+
203
+
204
+ def build_schedule() -> None:
205
+ s = _styles()
206
+ path = OUT_DIR / "Schedule of Insurance - Demo.pdf"
207
+ story = [
208
+ Paragraph("Car insurance schedule", s["title"]),
209
+ Paragraph(
210
+ "This schedule is a synthetic text-based PDF for the PolicyTrace demo. "
211
+ "Please check all details carefully and contact Northbridge Mutual if anything is incorrect.",
212
+ s["subtitle"],
213
+ ),
214
+ _table(
215
+ [
216
+ ["Policy number", POLICY["policy_number"], "Date of issue", POLICY["issue_date"]],
217
+ ["Insurer", POLICY["insurer"], "Product", POLICY["product_name"]],
218
+ ["Period of cover", f"{POLICY['start_date']} - {POLICY['expiry_date']}", "Cover type", POLICY["cover_type"]],
219
+ ],
220
+ [33 * mm, 52 * mm, 33 * mm, 52 * mm],
221
+ ),
222
+ Paragraph("Policyholder details", s["h2"]),
223
+ _table(
224
+ [
225
+ ["Name", POLICY["policyholder"]],
226
+ ["Address", POLICY["address"]],
227
+ ["Date of birth", POLICY["dob"]],
228
+ ["Occupation", POLICY["occupation"]],
229
+ ["Children under 16", "Yes"],
230
+ ["Home ownership status", "Not a Homeowner"],
231
+ ["Number of cars in household", "1"],
232
+ ["Access to other vehicles", "No access to any other vehicles"],
233
+ ],
234
+ [55 * mm, 115 * mm],
235
+ ),
236
+ Paragraph("Vehicle details", s["h2"]),
237
+ _table(
238
+ [
239
+ ["Registration number", POLICY["vrm"], "Make", POLICY["make"]],
240
+ ["Model", POLICY["model"], "Fuel type", POLICY["fuel_type"]],
241
+ ["Transmission", POLICY["transmission"], "Estimated value", POLICY["estimated_value"]],
242
+ ["Annual mileage", POLICY["annual_mileage"], "Overnight postcode", POLICY["overnight_postcode"]],
243
+ ["Kept location", POLICY["kept_location"], "Security device fitted", POLICY["security_device"]],
244
+ ["Tracker fitted", POLICY["tracker_fitted"], "Modifications", POLICY["modifications"]],
245
+ ],
246
+ [38 * mm, 48 * mm, 38 * mm, 48 * mm],
247
+ ),
248
+ Paragraph("Cover and no claims discount", s["h2"]),
249
+ _table(
250
+ [
251
+ ["Class of use", POLICY["class_of_use"]],
252
+ ["Driving other cars", POLICY["driving_other_cars"]],
253
+ ["No claims discount", POLICY["ncb_years"]],
254
+ ["Protected no claims discount", POLICY["ncb_protected"]],
255
+ ],
256
+ [55 * mm, 115 * mm],
257
+ ),
258
+ Paragraph("Excess breakdown", s["h2"]),
259
+ _table(
260
+ [
261
+ ["Excess type", "Amount"],
262
+ ["Standard compulsory excess", POLICY["standard_compulsory"]],
263
+ ["Voluntary excess", POLICY["voluntary"]],
264
+ ["Total accidental damage excess", POLICY["total_accidental_damage"]],
265
+ ["Fire excess", POLICY["fire"]],
266
+ ["Theft excess", POLICY["theft"]],
267
+ ["Windscreen repair excess", POLICY["windscreen_repair"]],
268
+ ["Windscreen replacement excess", POLICY["windscreen_replacement"]],
269
+ ["Own repairer additional excess", POLICY["own_repairer"]],
270
+ ],
271
+ [90 * mm, 50 * mm],
272
+ ),
273
+ Paragraph("Driver details", s["h2"]),
274
+ _table(
275
+ [
276
+ ["Driver name", "Date of birth", "Relationship", "Occupation", "Licence type", "Main driver", "Specific excess"],
277
+ [POLICY["policyholder"], POLICY["dob"], "Policyholder", POLICY["occupation"], "Full Licence UK / 2/1 / No", "Yes", ""],
278
+ [POLICY["second_driver"], POLICY["second_driver_dob"], "Named Driver", POLICY["second_driver_occupation"], "UK Provisional / 1/4 / No", "No", "GBP 200.00"],
279
+ [POLICY["third_driver"], POLICY["third_driver_dob"], "Named Driver", POLICY["third_driver_occupation"], "Full Licence UK / 5/0 / No", "No", ""],
280
+ ],
281
+ [30 * mm, 24 * mm, 24 * mm, 31 * mm, 31 * mm, 18 * mm, 22 * mm],
282
+ ),
283
+ Paragraph("Financial summary", s["h2"]),
284
+ _table(
285
+ [
286
+ ["Item", "Premium"],
287
+ ["Total annual premium", POLICY["total_premium"]],
288
+ ["Motor legal protection", POLICY["legal"]],
289
+ ["Breakdown roadside assistance", POLICY["breakdown"]],
290
+ ["Enhanced personal accident", POLICY["personal_accident"]],
291
+ ["Hire car", POLICY["hire_car"]],
292
+ ["Key cover", POLICY["key_cover"]],
293
+ ],
294
+ [90 * mm, 50 * mm],
295
+ ),
296
+ ]
297
+ _doc(path, "Schedule of Insurance - Demo").build(
298
+ story,
299
+ onFirstPage=lambda c, d: _draw_header(c, d, "Schedule of Insurance"),
300
+ onLaterPages=lambda c, d: _draw_header(c, d, "Schedule of Insurance"),
301
+ )
302
+
303
+
304
+ def build_certificate() -> None:
305
+ s = _styles()
306
+ path = OUT_DIR / "Certificate of Motor Insurance - Demo.pdf"
307
+ story = [
308
+ Paragraph("Certificate of Motor Insurance", s["title"]),
309
+ Paragraph(
310
+ "This is to certify that a policy of insurance has been issued for the purposes of the Road Traffic Act.",
311
+ s["subtitle"],
312
+ ),
313
+ _table(
314
+ [
315
+ ["Policy number", POLICY["policy_number"]],
316
+ ["Insurer", POLICY["insurer"]],
317
+ ["Effective from", POLICY["start_date"]],
318
+ ["Expires", POLICY["expiry_date"]],
319
+ ["Registration number", POLICY["vrm"]],
320
+ ],
321
+ [55 * mm, 115 * mm],
322
+ ),
323
+ Paragraph("Persons entitled to drive", s["h2"]),
324
+ _table(
325
+ [
326
+ ["Name", "Entitlement"],
327
+ [POLICY["policyholder"], "The policyholder may drive the insured vehicle."],
328
+ [POLICY["second_driver"], "Named driver may drive the insured vehicle."],
329
+ [POLICY["third_driver"], "Named driver may drive the insured vehicle."],
330
+ ],
331
+ [55 * mm, 115 * mm],
332
+ ),
333
+ Paragraph("Limitations as to use", s["h2"]),
334
+ Paragraph(POLICY["class_of_use"], s["body"]),
335
+ Paragraph("The policy does not provide cover for driving other cars.", s["body"]),
336
+ Spacer(1, 8),
337
+ Paragraph(
338
+ "This certificate is fictional and is provided only as a safe demonstration fixture for the PolicyTrace project.",
339
+ s["small"],
340
+ ),
341
+ ]
342
+ _doc(path, "Certificate of Motor Insurance - Demo").build(
343
+ story,
344
+ onFirstPage=lambda c, d: _draw_header(c, d, "Certificate of Motor Insurance"),
345
+ onLaterPages=lambda c, d: _draw_header(c, d, "Certificate of Motor Insurance"),
346
+ )
347
+
348
+
349
+ def build_statement_of_fact() -> None:
350
+ s = _styles()
351
+ path = OUT_DIR / "Statement of Fact - Demo.pdf"
352
+ story = [
353
+ Paragraph("Statement of Fact", s["title"]),
354
+ Paragraph(
355
+ "These fictional facts were used to calculate the demo insurance premium.",
356
+ s["subtitle"],
357
+ ),
358
+ _table(
359
+ [
360
+ ["Policy number", POLICY["policy_number"]],
361
+ ["Main driver", POLICY["policyholder"]],
362
+ ["Annual mileage", POLICY["annual_mileage"]],
363
+ ["Vehicle kept overnight", POLICY["kept_location"]],
364
+ ["Overnight postcode", POLICY["overnight_postcode"]],
365
+ ["Security device fitted", POLICY["security_device"]],
366
+ ["Tracker fitted", POLICY["tracker_fitted"]],
367
+ ["Modifications", POLICY["modifications"]],
368
+ ["Non-motoring convictions", "No"],
369
+ ["Endorsements", "None"],
370
+ ["Claims in last five years", "None"],
371
+ ],
372
+ [58 * mm, 112 * mm],
373
+ ),
374
+ ]
375
+ _doc(path, "Statement of Fact - Demo").build(
376
+ story,
377
+ onFirstPage=lambda c, d: _draw_header(c, d, "Statement of Fact"),
378
+ onLaterPages=lambda c, d: _draw_header(c, d, "Statement of Fact"),
379
+ )
380
+
381
+
382
+ def build_policy_booklet() -> None:
383
+ s = _styles()
384
+ path = OUT_DIR / "Policy Booklet - Demo.pdf"
385
+ story = [
386
+ Paragraph("Motor Insurance Policy Booklet", s["title"]),
387
+ Paragraph(
388
+ "This booklet describes generic terms for a fictional motor insurance product. "
389
+ "It intentionally contains little policyholder-specific data.",
390
+ s["subtitle"],
391
+ ),
392
+ Paragraph("What is covered", s["h2"]),
393
+ Paragraph(
394
+ "Comprehensive cover may include damage to your vehicle, fire, theft, windscreen cover, "
395
+ "and third-party liability, subject to the terms and exclusions in this booklet.",
396
+ s["body"],
397
+ ),
398
+ Paragraph("Claims", s["h2"]),
399
+ Paragraph(
400
+ "You must tell Northbridge Mutual Motor Insurance Ltd about any accident or loss as soon as possible. "
401
+ "We may ask for evidence, photographs, repair estimates, or further information.",
402
+ s["body"],
403
+ ),
404
+ Paragraph("General exclusions", s["h2"]),
405
+ Paragraph(
406
+ "No cover is provided where the vehicle is used outside the permitted class of use, "
407
+ "where the driver is not entitled to drive, or where policy information is materially incorrect.",
408
+ s["body"],
409
+ ),
410
+ Paragraph("Complaints", s["h2"]),
411
+ Paragraph(
412
+ "If you are unhappy with our service, contact the fictional complaints team at Northbridge Mutual.",
413
+ s["body"],
414
+ ),
415
+ ]
416
+ _doc(path, "Policy Booklet - Demo").build(
417
+ story,
418
+ onFirstPage=lambda c, d: _draw_header(c, d, "Policy Booklet"),
419
+ onLaterPages=lambda c, d: _draw_header(c, d, "Policy Booklet"),
420
+ )
421
+
422
+
423
+ def write_manifest() -> None:
424
+ manifest = {
425
+ "purpose": "Synthetic demo data for AI Tool Stack PolicyTrace.",
426
+ "warning": "No real customer, insurer, vehicle, or policy data is included.",
427
+ "files": [
428
+ "Schedule of Insurance - Demo.pdf",
429
+ "Certificate of Motor Insurance - Demo.pdf",
430
+ "Statement of Fact - Demo.pdf",
431
+ "Policy Booklet - Demo.pdf",
432
+ ],
433
+ "expected_policy_number": POLICY["policy_number"],
434
+ "expected_vrm": POLICY["vrm"],
435
+ "expected_insurer": POLICY["insurer"],
436
+ }
437
+ (OUT_DIR / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
438
+
439
+
440
+ def main() -> None:
441
+ OUT_DIR.mkdir(parents=True, exist_ok=True)
442
+ build_schedule()
443
+ build_certificate()
444
+ build_statement_of_fact()
445
+ build_policy_booklet()
446
+ write_manifest()
447
+ print(f"Synthetic demo pack written to {OUT_DIR.resolve()}")
448
+
449
+
450
+ if __name__ == "__main__":
451
+ main()
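The generator above stores every monetary value as a display string such as "GBP 395.00", and the Schedule prompt asks the extractor to treat the accidental damage total as compulsory plus voluntary excess. A minimal sketch of how a test could check that the fixture is internally consistent follows. The helper names `parse_gbp` and `excess_total_is_consistent` are hypothetical and are not part of this commit; the sample values are copied from the POLICY dict.

```python
from decimal import Decimal


def parse_gbp(amount: str) -> Decimal:
    """Parse a display string such as 'GBP 395.00' into a Decimal (hypothetical helper)."""
    prefix = "GBP "
    if not amount.startswith(prefix):
        raise ValueError(f"not a GBP amount: {amount!r}")
    # Strip thousands separators so 'GBP 1,250.00' also parses.
    return Decimal(amount[len(prefix):].replace(",", ""))


def excess_total_is_consistent(policy: dict[str, str]) -> bool:
    """Return True when total_accidental_damage equals compulsory + voluntary."""
    expected = parse_gbp(policy["standard_compulsory"]) + parse_gbp(policy["voluntary"])
    return expected == parse_gbp(policy["total_accidental_damage"])


# Values copied from the synthetic POLICY fixture in this file.
sample = {
    "standard_compulsory": "GBP 395.00",
    "voluntary": "GBP 200.00",
    "total_accidental_damage": "GBP 595.00",
}
print(excess_total_is_consistent(sample))
```

Decimal is used rather than float so that currency sums compare exactly.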
src/agents.py ADDED
@@ -0,0 +1,530 @@
1
+ """
2
+ agents.py — Specialist document extraction agents for UK Motor Insurance.
3
+
4
+ Architecture
5
+ ────────────
6
+ PDF path
7
+ → docling (PDF → Markdown)
8
+ → PIIMasker.mask()
9
+ → InsuranceExtractionAgents.classify_document() [LLM: llama-3.1-8b-instant]
10
+ → extract_schedule() | extract_certificate() [LLM: llama-4-scout-17b]
11
+ → UKMotorGoldenRecord (with source_document provenance)
12
+ """
13
+ from __future__ import annotations
14
+
15
+ import json
16
+ import logging
17
+ import os
18
+ import time
19
+ from pathlib import Path
20
+ from typing import Any
21
+
22
+ import instructor
23
+ from docling.datamodel.base_models import InputFormat
24
+ from docling.datamodel.pipeline_options import PdfPipelineOptions
25
+ from docling.document_converter import DocumentConverter, PdfFormatOption
26
+ from groq import Groq
27
+ from pydantic import ValidationError
28
+
29
+ from privacy import PIIMasker
30
+ from prompts import PromptRegistry
31
+ from schema import DocumentType, SourceMetadata, UKMotorGoldenRecord
32
+ from settings import settings
33
+
34
+ logger = logging.getLogger(__name__)
35
+
36
+ # ---------------------------------------------------------------------------
37
+ # Groq clients — extraction (instructor-wrapped) + classifier (raw Groq)
38
+ # ---------------------------------------------------------------------------
39
+
40
+
41
+ def _build_extraction_client() -> instructor.Instructor:
42
+ api_key = os.environ.get("GROQ_API_KEY")
43
+ if not api_key:
44
+ raise EnvironmentError(
45
+ "GROQ_API_KEY environment variable is not set. "
46
+ "Export it before running the pipeline."
47
+ )
48
+ return instructor.from_groq(Groq(api_key=api_key), mode=instructor.Mode.JSON)
49
+
50
+
51
+ def _build_groq_client() -> Groq:
52
+ api_key = os.environ.get("GROQ_API_KEY")
53
+ if not api_key:
54
+ raise EnvironmentError(
55
+ "GROQ_API_KEY environment variable is not set. "
56
+ "Export it before running the pipeline."
57
+ )
58
+ return Groq(api_key=api_key)
59
+
60
+
61
+ # Models resolved at import time from settings.yaml / env vars
62
+ _EXTRACTION_MODEL: str = settings.llm.model
63
+ _CLASSIFIER_MODEL: str = settings.llm.classifier_model
64
+
65
+
66
+ def _build_docling_converter() -> DocumentConverter:
67
+ """Build a DocumentConverter configured from settings.docling."""
68
+ opts = PdfPipelineOptions()
69
+ opts.do_ocr = settings.docling.do_ocr
70
+ opts.do_table_structure = settings.docling.do_table_structure
71
+ return DocumentConverter(
72
+ format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
73
+ )
74
+
75
+ # ---------------------------------------------------------------------------
76
+ # Document type classifier (keyword heuristic — fast, zero API calls)
77
+ # ---------------------------------------------------------------------------
78
+
79
+ _CLASSIFICATION_KEYWORDS: dict[DocumentType, list[str]] = {
80
+ DocumentType.SCHEDULE: [
81
+ # Phrases that only appear in a Schedule, not in a Certificate
82
+ "policy schedule",
83
+ "schedule of insurance",
84
+ "schedule number",
85
+ "premium payable",
86
+ "compulsory excess",
87
+ "voluntary excess",
88
+ "no claims bonus",
89
+ "ncb",
90
+ "windscreen excess",
91
+ ],
92
+ DocumentType.CERTIFICATE: [
93
+ # Phrases that are definitive for a Certificate document
94
+ "certificate of motor insurance",
95
+ "motor insurance certificate",
96
+ "certificate number",
97
+ "persons entitled to drive",
98
+ "class of use",
99
+ "road traffic act",
100
+ "this is to certify",
101
+ ],
102
+ DocumentType.STATEMENT_OF_FACT: [
103
+ "statement of fact",
104
+ "statement of insurance",
105
+ "proposal form",
106
+ "claims history",
107
+ "motoring convictions",
108
+ "annual mileage",
109
+ ],
110
+ DocumentType.POLICY_BOOKLET: [
111
+ "policy booklet",
112
+ "policy wording",
113
+ "terms and conditions",
114
+ "what is covered",
115
+ "general conditions",
116
+ "complaints procedure",
117
+ ],
118
+ }
119
+
120
+
121
+ def _keyword_classify(text: str) -> str:
122
+ """Keyword heuristic fallback classifier. Returns DocumentType.value string."""
123
+ lower = text.lower()
124
+ scores: dict[DocumentType, int] = {dt: 0 for dt in _CLASSIFICATION_KEYWORDS}
125
+
126
+ for doc_type, keywords in _CLASSIFICATION_KEYWORDS.items():
127
+ for kw in keywords:
128
+ if kw in lower:
129
+ scores[doc_type] += 1
130
+
131
+ best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
132
+ return best_type.value if best_score > 0 else DocumentType.UNKNOWN.value
133
+
134
+
135
+ def _str_to_doc_type(s: str) -> DocumentType:
136
+ """Convert a string to DocumentType, falling back to UNKNOWN."""
137
+ try:
138
+ return DocumentType(s)
139
+ except ValueError:
140
+ return DocumentType.UNKNOWN
141
+
142
+
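The keyword heuristic above can be illustrated in isolation. The sketch below is a simplified stand-in, not the module's code: it uses plain string labels instead of the DocumentType enum and a reduced keyword table, both of which are assumptions made to keep the example self-contained.

```python
# Reduced keyword table (assumption: a subset of the phrases listed above).
KEYWORDS: dict[str, list[str]] = {
    "Schedule": ["schedule of insurance", "compulsory excess", "voluntary excess"],
    "Certificate": ["certificate of motor insurance", "persons entitled to drive", "road traffic act"],
    "StatementOfFact": ["statement of fact", "annual mileage"],
    "PolicyBooklet": ["policy booklet", "what is covered"],
}


def keyword_classify(text: str) -> str:
    """Score each label by how many of its keywords occur in the text."""
    lower = text.lower()
    scores = {
        label: sum(kw in lower for kw in keywords)
        for label, keywords in KEYWORDS.items()
    }
    # max() returns the first label with the highest score, so ties are
    # resolved by the insertion order of KEYWORDS.
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score > 0 else "Unknown"


print(keyword_classify("Schedule of Insurance - Demo"))
print(keyword_classify("holiday photos"))
```

Because matching is by substring, short keywords can match inside unrelated words, which is one reason a heuristic like this is better suited to a fallback role than to being the sole classifier.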
143
+ # ---------------------------------------------------------------------------
144
+ # Extraction failure sentinel
145
+ # ---------------------------------------------------------------------------
146
+
147
+
148
+ class ExtractionFailedError(RuntimeError):
149
+ """
150
+ Raised when the LLM fails to produce a valid UKMotorGoldenRecord after
151
+ exhausting all retries. Callers should treat the document as failed and
152
+ skip it rather than propagating an empty record silently.
153
+ """
154
+
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # InsuranceExtractionAgents
158
+ # ---------------------------------------------------------------------------
159
+
160
+
161
+ class InsuranceExtractionAgents:
162
+ """
163
+ Specialist extraction agents for UK Motor Insurance documents.
164
+
165
+ Uses two LLM models, resolved from settings.llm at import time, for example:
166
+ - llama-3.1-8b-instant — fast document type classification
167
+ - llama-4-scout-17b-16e — deep structured extraction (Schedule / Certificate)
168
+
169
+ Parameters
170
+ ----------
171
+ masker : PIIMasker | None
172
+ max_retries : int
173
+ prompt_registry : PromptRegistry | None
174
+ debug_dir : Path | None
175
+ """
176
+
177
+ def __init__(
178
+ self,
179
+ masker: PIIMasker | None = None,
180
+ max_retries: int = settings.llm.max_retries,
181
+ prompt_registry: PromptRegistry | None = None,
182
+ debug_dir: Path | None = None,
183
+ ) -> None:
184
+ self._client = _build_extraction_client()
185
+ self._groq = _build_groq_client()
186
+ self._masker = masker or PIIMasker()
187
+ self._max_retries = max_retries
188
+ self._prompts = prompt_registry or PromptRegistry()
189
+ self._converter = _build_docling_converter()
190
+ self._debug_dir = debug_dir
191
+
192
+ # ------------------------------------------------------------------
193
+ # Public API
194
+ # ------------------------------------------------------------------
195
+
196
+ def classify_document(self, markdown_text: str) -> str:
197
+ """
198
+ Use the classifier model (settings.llm.classifier_model) to classify the document type.
199
+
200
+ The LLM is the primary classifier. If it fails or returns an invalid
201
+ label, the keyword heuristic is used as a fallback. A discrepancy
202
+ between the two is logged as a warning to flag low-confidence cases.
203
+
204
+ Returns one of: "Schedule", "Certificate", "StatementOfFact",
205
+ "PolicyBooklet", "Unknown".
206
+ """
207
+ keyword_result = _keyword_classify(markdown_text)
208
+
209
+ system_prompt = (
210
+ "You are a UK motor insurance document classifier.\n"
211
+ "Given the document text, respond with EXACTLY one word from:\n"
212
+ "Schedule | Certificate | StatementOfFact | PolicyBooklet | Unknown\n\n"
213
+ "- Schedule: Policy Schedule \u2014 excess figures, premium, NCB, "
214
+ "vehicle details, driver ages/DOBs.\n"
215
+ "- Certificate: Certificate of Motor Insurance \u2014 Road Traffic Act, "
216
+ "'persons entitled to drive', 'class of use'.\n"
217
+ "- StatementOfFact: Statement of Fact / Proposal \u2014 claims history, "
218
+ "convictions, annual mileage.\n"
219
+ "- PolicyBooklet: Policy Booklet / Wording \u2014 terms and conditions, "
220
+ "'what is covered', complaints.\n"
221
+ "- Unknown: Cannot determine.\n\n"
222
+ "Respond with ONLY the single classification word. No punctuation."
223
+ )
224
+ try:
225
+ response = self._groq.chat.completions.create(
226
+ model=_CLASSIFIER_MODEL,
227
+ messages=[
228
+ {"role": "system", "content": system_prompt},
229
+ {
230
+ "role": "user",
231
+ "content": "Classify this document:\n\n" + markdown_text[:4000],
232
+ },
233
+ ],
234
+ max_tokens=10,
235
+ temperature=0,
236
+ )
237
+ llm_result = response.choices[0].message.content.strip().split()[0]
238
+ valid = {"Schedule", "Certificate", "StatementOfFact", "PolicyBooklet", "Unknown"}
239
+ if llm_result in valid:
240
+ if llm_result != keyword_result:
241
+ logger.warning(
242
+ "Classifier discrepancy: LLM=%s, keyword=%s "
243
+ "(using LLM result — verify document type)",
244
+ llm_result, keyword_result,
245
+ )
246
+ else:
247
+ logger.debug("Classifier agreement: LLM=%s \u2713", llm_result)
248
+ return llm_result
249
+ logger.warning(
250
+ "LLM classifier returned '%s' \u2014 falling back to keyword heuristic", llm_result
251
+ )
252
+ except Exception as exc: # noqa: BLE001
253
+ logger.warning(
254
+ "LLM classifier failed (%s) \u2014 falling back to keyword heuristic", exc
255
+ )
256
+ return keyword_result
257
+
258
+ def extract_schedule(self, markdown_text: str, filename: str) -> UKMotorGoldenRecord:
259
+ """
260
+ Extract all financial, vehicle, and driver risk data from a Policy Schedule.
261
+
262
+ Instructs the LLM to:
263
+ - Compute total_accidental_damage = standard_compulsory + voluntary
264
+ - Extract driver DOBs and distinguish Full UK vs UK Provisional licence types
265
+ - Separate fire excess from theft excess (they can differ)
266
+ - Extract own_repairer_additional_excess if present
267
+ - Extract premium breakdown and optional extras (float if purchased,
268
+ "Not Selected" if not)
269
+ """
270
+ return self._extract(
271
+ markdown_text,
272
+ filename,
273
+ DocumentType.SCHEDULE,
274
+ self._prompts.get(DocumentType.SCHEDULE),
275
+ )
276
+
277
+ def extract_certificate(self, markdown_text: str, filename: str) -> UKMotorGoldenRecord:
278
+ """
279
+ Extract legal permissions from a Certificate of Motor Insurance.
280
+
281
+ Instructs the LLM to:
282
+ - Extract the exact "Limitations as to use" / class_of_use clause verbatim
283
+ - Extract the policy_number for cross-reference
284
+ - Record driving_other_cars entitlement (true/false)
285
+ - Leave all financial fields (excess, premium, NCB) as null
286
+ """
287
+ return self._extract(
288
+ markdown_text,
289
+ filename,
290
+ DocumentType.CERTIFICATE,
291
+ self._prompts.get(DocumentType.CERTIFICATE),
292
+ )
293
+
294
+ def process(self, pdf_path: str | Path) -> tuple[UKMotorGoldenRecord, str]:
295
+ """
296
+ Full pipeline for one PDF: PDF → Markdown → PII mask → classify → extract.
297
+
298
+ Returns
299
+ -------
300
+ tuple[UKMotorGoldenRecord, str]
301
+ The extracted record and the document type string (e.g. "Schedule").
302
+
303
+ Raises
304
+ ------
305
+ ExtractionFailedError
306
+ When the LLM fails to extract a valid record after all retries.
307
+ """
308
+ record, doc_type_str, _ = self._process_internal(Path(pdf_path), build_corpus=False)
309
+ return record, doc_type_str
310
+
311
+ # ------------------------------------------------------------------
312
+ # Private helpers
313
+ # ------------------------------------------------------------------
314
+
315
+ def _process_internal(
316
+ self,
317
+ pdf_path: Path,
318
+ build_corpus: bool,
319
+ ) -> tuple[UKMotorGoldenRecord, str, Any]:
320
+ """
321
+ Unified core pipeline: PDF → Markdown → PII mask → classify → extract,
322
+ optionally building a ProvenanceCorpus from the raw Docling IR.
323
+
324
+ Parameters
325
+ ----------
326
+ pdf_path : Path
327
+ build_corpus : bool
328
+ When True, builds a ProvenanceCorpus before PII masking so the
329
+ original text is available for fuzzy matching.
330
+
331
+ Returns
332
+ -------
333
+ tuple[UKMotorGoldenRecord, str, ProvenanceCorpus | None]
334
+ (record, doc_type_str, corpus_or_None)
335
+
336
+ Raises
337
+ ------
338
+ ExtractionFailedError
339
+ Propagated from _extract() when the LLM fails after all retries.
340
+ """
341
+ from provenance import ProvenanceCorpus # local import — avoids circular dep
342
+
343
+ logger.info("Processing%s: %s", " (with provenance)" if build_corpus else "", pdf_path.name)
344
+
345
+ # Pre-classify from filename for page-cap selection (no API call)
346
+ pre_type_str = _keyword_classify(pdf_path.stem)
347
+ pre_doc_type = _str_to_doc_type(pre_type_str)
348
+ logger.debug(" Pre-classified from filename: %s", pre_type_str)
349
+
350
+ # PDF → Markdown + raw DoclingDocument
351
+ markdown, raw_doc = self._pdf_to_markdown_and_doc(pdf_path, pre_doc_type)
352
+
353
+ # Build corpus from original text BEFORE masking (critical for accurate fuzzy match)
354
+ corpus: Any = None
355
+ if build_corpus:
356
+ corpus = ProvenanceCorpus(source_filename=pdf_path.name, doc_type=pre_type_str)
357
+ corpus.add_from_docling(raw_doc, pdf_path.name)
358
+ logger.debug(" Provenance corpus: %d items", len(corpus.items))
359
+
360
+ if self._debug_dir and settings.debug.save_markdown:
361
+ _write_debug(self._debug_dir, f"{pdf_path.name}.md", markdown)
362
+
363
+ # PII mask
364
+ masked_markdown, _token_map = self._masker.mask(markdown)
365
+ if self._debug_dir and settings.debug.save_masked_markdown:
366
+ _write_debug(self._debug_dir, f"{pdf_path.name}.masked.md", masked_markdown)
367
+
368
+ # Classify
369
+ t0 = time.monotonic()
370
+ doc_type_str = self.classify_document(masked_markdown)
371
+ logger.info(" Classified as: %s", doc_type_str)
372
+
373
+ # Route to specialist extractor
374
+ if doc_type_str == "Schedule":
375
+ record = self.extract_schedule(masked_markdown, pdf_path.name)
376
+ elif doc_type_str == "Certificate":
377
+ record = self.extract_certificate(masked_markdown, pdf_path.name)
378
+ else:
379
+ logger.info(" Non-primary type '%s' — running generic extraction", doc_type_str)
380
+ record = self._extract(
381
+ masked_markdown,
382
+ pdf_path.name,
383
+ _str_to_doc_type(doc_type_str),
384
+ self._prompts.get(_str_to_doc_type(doc_type_str)),
385
+ )
386
+
387
+ elapsed = round(time.monotonic() - t0, 3)
388
+
389
+ record.source_document = SourceMetadata(
390
+ document_type=_str_to_doc_type(doc_type_str),
391
+ filename=pdf_path.name,
392
+ )
393
+
394
+ if self._debug_dir and settings.debug.save_extraction_json:
395
+ _write_debug(
396
+ self._debug_dir,
397
+ f"{pdf_path.name}.extraction.json",
398
+ record.model_dump_json(indent=2),
399
+ )
400
+ fc = getattr(record, "field_citations", None) or {}
401
+ logger.info(" field_citations populated by LLM: %d entries", len(fc))
402
+ if fc:
403
+ import json as _json
404
+ _write_debug(
405
+ self._debug_dir,
406
+ f"{pdf_path.name}.field_citations.json",
407
+ _json.dumps(fc, indent=2, ensure_ascii=False),
408
+ )
409
+
410
+ if self._debug_dir and settings.debug.save_metrics:
411
+ metrics: dict = {
412
+ "filename": pdf_path.name,
413
+ "doc_type": doc_type_str,
414
+ "extraction_model": _EXTRACTION_MODEL,
415
+ "classifier_model": _CLASSIFIER_MODEL,
416
+ "response_time_seconds": elapsed,
417
+ }
418
+ if corpus is not None:
419
+ metrics["corpus_items"] = len(corpus.items)
420
+ _append_metrics(self._debug_dir, metrics)
421
+
422
+ return record, doc_type_str, corpus
423
+
424
+ def _pdf_to_markdown(
425
+ self, pdf_path: Path, doc_type: DocumentType = DocumentType.UNKNOWN
426
+ ) -> str:
427
+ """Convert a PDF to Markdown using docling, respecting per-doc-type page caps."""
428
+ markdown, _ = self._pdf_to_markdown_and_doc(pdf_path, doc_type)
429
+ return markdown
430
+
431
+ def _pdf_to_markdown_and_doc(
432
+ self, pdf_path: Path, doc_type: DocumentType = DocumentType.UNKNOWN
433
+ ) -> tuple[str, Any]:
434
+ """Convert PDF to Markdown and also return the raw DoclingDocument for provenance."""
435
+ # Apply page cap during conversion (not just in Markdown export) to prevent
436
+ # Docling's layout model from running out of memory on large PDFs (Policy Booklet).
437
+ max_pg = settings.docling.max_pages.get(doc_type.value)
438
+ convert_kwargs: dict[str, Any] = {}
439
+ if max_pg is not None:
440
+ convert_kwargs["max_num_pages"] = max_pg
441
+
442
+ result = self._converter.convert(str(pdf_path), **convert_kwargs)
443
+ doc = result.document
444
+ markdown = doc.export_to_markdown()
445
+
446
+ if max_pg is not None:
447
+ separator = "\n---\n"
448
+ parts = markdown.split(separator)
449
+ if len(parts) > max_pg:
450
+ logger.info(
451
+ " Page cap applied: %s capped at %d/%d pages",
452
+ pdf_path.name, max_pg, len(parts),
453
+ )
454
+ markdown = separator.join(parts[:max_pg])
455
+
456
+ return markdown, doc
457
+
458
+ def process_with_provenance(
459
+ self, pdf_path: str | Path
460
+ ) -> tuple[UKMotorGoldenRecord, str, Any]:
461
+ """
462
+ Like process() but also returns a ProvenanceCorpus built from the Docling IR.
463
+
464
+ The corpus is constructed *before* PII masking so that the original text
465
+ strings (not masked tokens) are available for fuzzy matching.
466
+
467
+ Returns
468
+ -------
469
+ tuple[UKMotorGoldenRecord, str, ProvenanceCorpus]
470
+ (record, doc_type_str, corpus)
471
+
472
+ Raises
473
+ ------
474
+ ExtractionFailedError
475
+ When the LLM fails to extract a valid record after all retries.
476
+ """
477
+ return self._process_internal(Path(pdf_path), build_corpus=True) # type: ignore[return-value]
478
+
479
+ def _extract(
480
+ self,
481
+ text: str,
482
+ filename: str,
483
+ doc_type: DocumentType,
484
+ system_prompt: str,
485
+ ) -> UKMotorGoldenRecord:
486
+ """Call Groq via instructor to extract a UKMotorGoldenRecord."""
487
+ user_message = (
488
+ "Extract all motor insurance data from the following document text. "
489
+ "Return a JSON object that strictly conforms to the UKMotorGoldenRecord schema.\n\n"
490
+ f"--- DOCUMENT TEXT ---\n{text}\n--- END ---"
491
+ )
+         try:
+             record: UKMotorGoldenRecord = self._client.chat.completions.create(
+                 model=_EXTRACTION_MODEL,
+                 response_model=UKMotorGoldenRecord,
+                 max_retries=self._max_retries,
+                 messages=[
+                     {"role": "system", "content": system_prompt.strip()},
+                     {"role": "user", "content": user_message},
+                 ],
+             )
+         except Exception as exc:  # ValidationError is itself an Exception subclass
+             raise ExtractionFailedError(
+                 f"Extraction failed for {doc_type.value!r} document '{filename}' "
+                 f"after {self._max_retries} retries: {exc}"
+             ) from exc
+         return record
508
+
509
+
510
+ # ---------------------------------------------------------------------------
511
+ # Debug helpers (module-level so they can be unit-tested independently)
512
+ # ---------------------------------------------------------------------------
513
+
514
+
515
+ def _write_debug(debug_dir: Path, filename: str, content: str) -> None:
516
+ """Write a debug artifact to disk, silently skipping on any I/O error."""
517
+ try:
518
+ (debug_dir / filename).write_text(content, encoding="utf-8")
519
+ logger.debug("Debug artifact saved: %s", filename)
520
+ except OSError as exc:
521
+ logger.warning("Could not write debug artifact %s: %s", filename, exc)
522
+
523
+
524
+ def _append_metrics(debug_dir: Path, metrics: dict) -> None:
525
+ """Append a metrics dict as a JSONL line to extraction_metrics.jsonl."""
526
+ try:
527
+ with (debug_dir / "extraction_metrics.jsonl").open("a", encoding="utf-8") as fh:
528
+ fh.write(json.dumps(metrics) + "\n")
529
+ except OSError as exc:
530
+ logger.warning("Could not write metrics: %s", exc)
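The page cap in the conversion helper above works purely on the exported Markdown: it splits on a page separator, keeps the first few chunks, and joins them back. A minimal standalone sketch of that step follows. The function name `cap_pages` is hypothetical, and it assumes, as the code above does, that page breaks appear as `\n---\n` in the Markdown export.

```python
from __future__ import annotations


def cap_pages(markdown: str, max_pages: int | None, separator: str = "\n---\n") -> str:
    """Keep at most max_pages separator-delimited chunks of the Markdown text."""
    if max_pages is None:
        # No cap configured: return the text unchanged.
        return markdown
    parts = markdown.split(separator)
    if len(parts) <= max_pages:
        return markdown
    # Re-join only the leading chunks so the separators between them survive.
    return separator.join(parts[:max_pages])


if __name__ == "__main__":
    text = "page one\n---\npage two\n---\npage three"
    print(cap_pages(text, 2))
```

One caveat of this approach is that a horizontal rule inside a page is indistinguishable from a page break, so the cap is approximate rather than exact.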
src/api.py ADDED
@@ -0,0 +1,372 @@
1
+ """
2
+ api.py — FastAPI server for the UK Motor Insurance Visual Audit Review UI.
3
+
4
+ Endpoints
5
+ ─────────
6
+ GET /api/health
7
+ POST /api/process — upload PDFs, run pipeline, return session_id
8
+ GET /api/session/{id} — full GoldenRecordWithProvenance JSON
9
+ GET /api/pdf/{session_id}/{file} — serve source PDF (path-traversal safe)
+ PATCH /api/session/{id}/review — log a verify / reject / override action
+ GET /api/session/{id}/review-state — current review state for the session
+ DELETE /api/session/{id} — delete the session directory and its contents
12
+
13
+ Run (from project root)
14
+ ───────────────────────
15
+ uvicorn api:app --app-dir src --reload --port 8000
16
+
17
+ Or directly:
18
+ python src/api.py
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import json
23
+ import logging
24
+ import sys
25
+ import uuid
26
+ from datetime import datetime
27
+ from pathlib import Path
28
+ from typing import Optional
29
+
30
+ # ── Ensure src/ is on sys.path so sibling modules resolve regardless of CWD ─
31
+ sys.path.insert(0, str(Path(__file__).parent))
32
+
33
+ import uvicorn
34
+ from fastapi import FastAPI, File, HTTPException, UploadFile
35
+ from fastapi.middleware.cors import CORSMiddleware
36
+ from fastapi.responses import FileResponse, JSONResponse
37
+ from fastapi.staticfiles import StaticFiles
38
+ from pydantic import BaseModel
39
+
40
+ from agents import InsuranceExtractionAgents
41
+ from pipeline import run_extraction_pipeline
42
+ from privacy import PIIMasker
43
+ from provenance import build_provenance
44
+ from schema import GoldenRecordWithProvenance, UKMotorGoldenRecord
45
+ from settings import settings
46
+
47
+ logger = logging.getLogger(__name__)
48
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s — %(message)s")
49
+
50
+ # ---------------------------------------------------------------------------
51
+ # Session storage directory (project_root/output/sessions/<timestamp>_<uuid>/)
52
+ # Debug artifacts directory (project_root/output/debug/run_<timestamp>/)
53
+ # ---------------------------------------------------------------------------
54
+
55
+ _SESSION_DIR = Path(__file__).parent.parent / "output" / "sessions"
56
+ _SESSION_DIR.mkdir(parents=True, exist_ok=True)
57
+
58
+ _DEBUG_DIR = Path(__file__).parent.parent / "output" / "debug"
59
+ _DEBUG_DIR.mkdir(parents=True, exist_ok=True)
60
+
61
+ _STATIC_DIR = Path(__file__).parent.parent / "ui" / "dist"
62
+
63
+ # ---------------------------------------------------------------------------
64
+ # App
65
+ # ---------------------------------------------------------------------------
66
+
67
+ app = FastAPI(
68
+ title="UK Motor Insurance IDP — Visual Audit API",
69
+ version="1.0.0",
70
+ description=(
71
+ "Backend for the Human-in-the-Loop review dashboard. "
72
+ "Runs the extraction pipeline and exposes session-based review endpoints."
73
+ ),
74
+ )
75
+
76
+ app.add_middleware(
77
+ CORSMiddleware,
78
+ allow_origins=[
79
+ "http://localhost:5173",
80
+ "http://localhost:5174",
81
+ "http://127.0.0.1:5173",
82
+ "http://127.0.0.1:5174",
83
+ "http://localhost:3000",
84
+ ],
85
+ allow_credentials=True,
86
+ allow_methods=["*"],
87
+ allow_headers=["*"],
88
+ )
89
+
90
+
91
+ @app.on_event("startup")
92
+ async def _cleanup_old_sessions() -> None:
93
+ """Remove session directories older than settings.pipeline.session_ttl_days on startup."""
94
+ import shutil
95
+ ttl_days = settings.pipeline.session_ttl_days
96
+ if ttl_days <= 0:
97
+ return
98
+ from datetime import datetime, timedelta
99
+ cutoff = datetime.now() - timedelta(days=ttl_days)
100
+ removed = 0
101
+ for session_dir in _SESSION_DIR.iterdir():
102
+ if session_dir.is_dir():
103
+ mtime = datetime.fromtimestamp(session_dir.stat().st_mtime)
104
+ if mtime < cutoff:
105
+ shutil.rmtree(session_dir, ignore_errors=True)
106
+ removed += 1
107
+ if removed:
108
+ logger.info(
109
+ "Startup cleanup: removed %d session(s) older than %d day(s)",
110
+ removed, ttl_days,
111
+ )
112
+
113
+
114
+ # ---------------------------------------------------------------------------
115
+ # Helpers
116
+ # ---------------------------------------------------------------------------
117
+
118
+
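+ def _get_session_dir(session_id: str) -> Path:
+     """Return session directory or raise 404.
+
+     Supports both old-style (uuid-only) and new-style (timestamp_uuid) folder names.
+     """
+     # Reject anything that is not a UUID so glob metacharacters ("*", "?", "[")
+     # or partial IDs in the URL cannot match another session's folder.
+     try:
+         uuid.UUID(session_id)
+     except ValueError:
+         raise HTTPException(status_code=404, detail=f"Session '{session_id}' not found.") from None
+     # Matches both "<uuid>" and "<timestamp>_<uuid>" folder names.
+     matches = [p for p in _SESSION_DIR.glob(f"*{session_id}") if p.is_dir()]
+     if matches:
+         return matches[0]
+     raise HTTPException(status_code=404, detail=f"Session '{session_id}' not found.")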
129
+
130
+
131
+ def _count_leaves(obj: object) -> int:
132
+ if isinstance(obj, dict):
133
+ return sum(_count_leaves(v) for v in obj.values())
134
+ if isinstance(obj, list):
135
+ return sum(_count_leaves(v) for v in obj)
136
+ return 1
137
+
138
+
139
+ # ---------------------------------------------------------------------------
140
+ # Endpoints
141
+ # ---------------------------------------------------------------------------
142
+
143
+
144
+ @app.get("/api/health")
145
+ async def health():
146
+ return {"status": "ok", "version": "1.0.0"}
147
+
148
+
149
+ # ── POST /api/process ────────────────────────────────────────────────────────
150
+
151
+ class ProcessResponse(BaseModel):
152
+ session_id: str
153
+ fields_extracted: int
154
+ provenance_coverage: int # number of fields successfully located
155
+
156
+
157
+ @app.post("/api/process", response_model=ProcessResponse)
158
+ async def process_documents(files: list[UploadFile] = File(...)):
159
+ """
160
+ Accept one or more PDF uploads, run the full extraction pipeline, and
161
+ persist a session containing the Golden Record + provenance index.
162
+
163
+ Returns a ``session_id`` which the UI uses for all subsequent requests.
164
+
165
+ Note: This endpoint is synchronous and may take 30–90 seconds depending
166
+ on Groq API response times.
167
+ """
168
+ if not files:
169
+ raise HTTPException(status_code=400, detail="No files uploaded.")
170
+
171
+ pdf_files = [f for f in files if f.filename and f.filename.lower().endswith(".pdf")]
172
+ if not pdf_files:
173
+ raise HTTPException(status_code=400, detail="Only PDF files are accepted.")
174
+
175
+ # ── Create session directory (timestamp_uuid for easy sorting) ─────────
176
+ run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
177
+ session_id = str(uuid.uuid4())
178
+ session_folder = f"{run_ts}_{session_id}"
179
+ session_dir = _SESSION_DIR / session_folder
180
+ docs_dir = session_dir / "docs"
181
+ docs_dir.mkdir(parents=True, exist_ok=True)
182
+
183
+ # ── Create timestamped debug directory ────────────────────────────────
184
+ debug_dir: Path | None = None
185
+ if settings.debug.enabled:
186
+ debug_dir = _DEBUG_DIR / f"run_{run_ts}"
187
+ debug_dir.mkdir(parents=True, exist_ok=True)
188
+ logger.info("Debug artifacts → %s", debug_dir)
189
+
190
+ # ── Save uploaded PDFs (sanitise filenames) ───────────────────────────
191
+ pdf_paths: list[Path] = []
192
+ for upload in pdf_files:
193
+ safe_name = Path(upload.filename).name # strips directory components
194
+ dest = docs_dir / safe_name
195
+ dest.write_bytes(await upload.read())
196
+ pdf_paths.append(dest)
197
+
198
+ # ── Run pipeline with provenance ──────────────────────────────────────
199
+ masker = PIIMasker(mask_dates=settings.pii.mask_dates)
200
+ agent = InsuranceExtractionAgents(masker=masker, debug_dir=debug_dir)
201
+
202
+ golden, conflicts, corpora = run_extraction_pipeline(
203
+ pdf_paths=pdf_paths,
204
+ agent=agent,
205
+ with_provenance=True,
206
+ )
207
+
208
+ # ── Build provenance index ────────────────────────────────────────────
209
+ provenance_list = build_provenance(golden, corpora)
210
+
211
+ result = GoldenRecordWithProvenance(
212
+ record=golden,
213
+ provenance=provenance_list,
214
+ conflicts=conflicts,
215
+ session_id=session_id,
216
+ )
217
+
218
+ # ── Persist session ────────────────────────────────────────────────
219
+ (session_dir / "result.json").write_text(
220
+ result.model_dump_json(indent=2, exclude_none=True),
221
+ encoding="utf-8",
222
+ )
223
+ (session_dir / "review_state.json").write_text("{}", encoding="utf-8")
224
+
225
+ # Save field_citations sidecar so provenance can be re-built without re-running the LLM.
226
+ # (field_citations is excluded from result.json via Field(exclude=True) on the schema.)
227
+ fc = dict(getattr(golden, "field_citations", None) or {})
228
+ if fc:
229
+ (session_dir / "field_citations.json").write_text(
230
+ json.dumps(fc, indent=2, ensure_ascii=False), encoding="utf-8"
231
+ )
232
+
233
+ flat_fields = _count_leaves(golden.model_dump(exclude_none=True))
234
+ return ProcessResponse(
235
+ session_id=session_id,
236
+ fields_extracted=flat_fields,
237
+ provenance_coverage=len(provenance_list),
238
+ )
239
+
240
+
241
+ # ── GET /api/session/{session_id} ────────────────────────────────────────────
242
+
243
+ @app.get("/api/session/{session_id}")
244
+ async def get_session(session_id: str):
245
+ """Return the full GoldenRecordWithProvenance for this session."""
246
+ session_dir = _get_session_dir(session_id)
247
+ result_file = session_dir / "result.json"
248
+ if not result_file.exists():
249
+ raise HTTPException(status_code=404, detail="Session result not yet available.")
250
+ return JSONResponse(content=json.loads(result_file.read_text(encoding="utf-8")))
251
+
252
+
253
+ # ── GET /api/pdf/{session_id}/{filename} ─────────────────────────────────────
254
+
255
+ @app.get("/api/pdf/{session_id}/{filename}")
256
+ async def serve_pdf(session_id: str, filename: str):
257
+ """
258
+ Serve a PDF from the session's docs directory.
259
+
260
+ Path traversal is prevented by using only ``Path(filename).name``,
261
+ which strips any directory components from the supplied filename.
262
+ """
263
+ session_dir = _get_session_dir(session_id)
264
+ safe_name = Path(filename).name
265
+ if not safe_name.lower().endswith(".pdf"):
266
+ raise HTTPException(status_code=400, detail="Only PDF files can be served.")
267
+ pdf_path = session_dir / "docs" / safe_name
268
+ if not pdf_path.exists():
269
+ raise HTTPException(status_code=404, detail=f"PDF '{safe_name}' not found in session.")
270
+ return FileResponse(
271
+ str(pdf_path),
272
+ media_type="application/pdf",
273
+ headers={"Content-Disposition": f'inline; filename="{safe_name}"'},
274
+ )
275
+
276
+
277
+ # ── PATCH /api/session/{session_id}/review ───────────────────────────────────
278
+
279
+ class ReviewUpdate(BaseModel):
280
+ field_path: str
281
+ action: str # "verify" | "reject" | "override"
282
+ overridden_value: Optional[str] = None
283
+ reviewer: Optional[str] = "anonymous"
284
+
285
+
286
+ @app.patch("/api/session/{session_id}/review")
287
+ async def update_review(session_id: str, update: ReviewUpdate):
288
+ """Record a verify, reject, or override action for a specific field."""
289
+ if update.action not in {"verify", "reject", "override"}:
290
+ raise HTTPException(
291
+ status_code=422,
292
+ detail="action must be one of: verify, reject, override",
293
+ )
294
+
295
+ session_dir = _get_session_dir(session_id)
296
+ state_file = session_dir / "review_state.json"
297
+ state: dict = json.loads(state_file.read_text(encoding="utf-8")) if state_file.exists() else {}
298
+
299
+ state[update.field_path] = {
300
+ "action": update.action,
301
+ "overridden_value": update.overridden_value,
302
+ "reviewer": update.reviewer,
303
+ }
304
+ state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
305
+ return {"ok": True, "field_path": update.field_path, "action": update.action}
306
+
307
+
308
+ # ── GET /api/session/{session_id}/review-state ───────────────────────────────
309
+
310
+ @app.get("/api/session/{session_id}/review-state")
311
+ async def get_review_state(session_id: str):
312
+ """Return the current review state (verify/override log) for the session."""
313
+ session_dir = _get_session_dir(session_id)
314
+ state_file = session_dir / "review_state.json"
315
+ if not state_file.exists():
316
+ return JSONResponse(content={})
317
+ return JSONResponse(content=json.loads(state_file.read_text(encoding="utf-8")))
318
+
319
+ # ── DELETE /api/session/{session_id} ──────────────────────────────────────────
320
+
321
+ @app.delete("/api/session/{session_id}")
322
+ async def delete_session(session_id: str):
323
+ """
324
+ Permanently delete a session directory and all its contents.
325
+
+     This removes the uploaded PDFs, the Golden Record JSON, the review state,
+     and the field_citations sidecar. Debug artifacts are written to a separate
+     run directory under output/debug/ and are not removed here.
328
+ """
329
+ import shutil
330
+ session_dir = _get_session_dir(session_id)
331
+ shutil.rmtree(session_dir, ignore_errors=True)
332
+ return {"ok": True, "session_id": session_id}
333
+
334
+
335
+ # ---------------------------------------------------------------------------
336
+ # Production UI hosting
337
+ # ---------------------------------------------------------------------------
338
+
339
+ if _STATIC_DIR.exists():
340
+ assets_dir = _STATIC_DIR / "assets"
341
+ if assets_dir.exists():
342
+ app.mount("/assets", StaticFiles(directory=str(assets_dir)), name="assets")
343
+
344
+ @app.get("/{full_path:path}", include_in_schema=False)
345
+ async def serve_spa(full_path: str):
346
+ """
347
+ Serve the built React app when running as a single production service.
348
+
349
+ Vite handles the frontend during local development. In Docker/Hugging
350
+ Face deployments, the Dockerfile builds ui/dist and FastAPI serves it.
351
+ Unknown non-API paths fall back to index.html so /session/{id} works
352
+ after a hard refresh.
353
+ """
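+         # Unknown API routes should return 404 rather than the SPA shell,
+         # as the docstring promises the fallback only for non-API paths.
+         if full_path == "api" or full_path.startswith("api/"):
+             raise HTTPException(status_code=404, detail="Not found.")
+         requested = (_STATIC_DIR / full_path).resolve()
+         static_root = _STATIC_DIR.resolve()
+         if (
+             full_path
+             and requested.is_file()
+             and static_root in requested.parents
+         ):
+             return FileResponse(str(requested))
+         return FileResponse(str(_STATIC_DIR / "index.html"))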
363
+
364
+ # ---------------------------------------------------------------------------
365
+ # Dev entrypoint
366
+ # ---------------------------------------------------------------------------
367
+
368
+ if __name__ == "__main__":
369
+ import os
370
+
371
+ port = int(os.environ.get("PORT", "8000"))
372
+ uvicorn.run("api:app", host="0.0.0.0", port=port, reload=True)
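The endpoints above rely on two input checks: upload and download file names are reduced to their final path component, and sessions are looked up by a UUID taken from the URL. The sketch below illustrates both checks in isolation using only the standard library. The helper names `safe_pdf_name` and `is_canonical_uuid` are hypothetical and do not appear in the commit, and `PurePosixPath` is used here (instead of `Path`) only so the behaviour is the same on every platform.

```python
from __future__ import annotations

import uuid
from pathlib import PurePosixPath


def safe_pdf_name(filename: str) -> str:
    """Return the bare file name, rejecting anything that is not a .pdf."""
    # .name keeps only the last path component, so "../" prefixes are dropped.
    name = PurePosixPath(filename).name
    if not name.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF: {filename!r}")
    return name


def is_canonical_uuid(value: str) -> bool:
    """True only for the lowercase, hyphenated form that str(uuid.uuid4()) produces."""
    try:
        # Round-tripping rejects braces, URN prefixes, and upper-case variants.
        return str(uuid.UUID(value)) == value
    except ValueError:
        return False


if __name__ == "__main__":
    print(safe_pdf_name("../../etc/passwd.pdf"))
    print(is_canonical_uuid(str(uuid.uuid4())))
    print(is_canonical_uuid("*"))
```

Checking the identifier before it reaches a filesystem pattern keeps wildcard characters out of the lookup; whether to accept only the canonical form, as sketched here, or any form `uuid.UUID` parses is a design choice.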
src/arbiter.py ADDED
@@ -0,0 +1,268 @@
1
+ """
2
+ arbiter.py — Hierarchy of Truth merge for UK Motor Insurance.
3
+
4
+ The PolicyArbiter takes one Schedule extraction and one Certificate extraction
5
+ and produces a single authoritative UKMotorGoldenRecord.
6
+
+ Document          Authoritative for
+ ────────────────  ──────────────────────────────────────────────────
+ Schedule          vehicle_details, excess_breakdown, financial_summary,
+                   driver DOB / occupation / license_type, NCB, cover_type
+ Certificate       class_of_use, driving_other_cars
12
+ """
13
+ from __future__ import annotations
14
+
15
+ import logging
16
+ from typing import Optional
17
+
18
+ from schema import (
19
+ ConflictEntry,
20
+ CoverAndExcesses,
21
+ Driver,
22
+ ExcessBreakdown,
23
+ NoClaimsDiscount,
24
+ PeriodOfCover,
25
+ PolicyHeader,
26
+ UKMotorGoldenRecord,
27
+ )
28
+
29
+ logger = logging.getLogger(__name__)
30
+
31
+ # Minimum rapidfuzz token_sort_ratio to consider two driver names a match.
32
+ _DRIVER_NAME_MATCH_THRESHOLD = 85
33
+
34
+
35
+ # ---------------------------------------------------------------------------
36
+ # PolicyArbiter
37
+ # ---------------------------------------------------------------------------
38
+
39
+
40
+ class PolicyArbiter:
41
+ """
42
+ Merges a Schedule extraction and a Certificate extraction into one
43
+ authoritative UKMotorGoldenRecord using the Hierarchy of Truth.
44
+
45
+ Usage
46
+ -----
47
+ >>> arbiter = PolicyArbiter()
48
+ >>> golden, conflicts = arbiter.merge_records(
49
+ ... schedule_record, "Schedule of Insurance (1).pdf",
50
+ ... certificate_record, "Certificate of Motor Insurance.pdf",
51
+ ... )
52
+ """
53
+
54
+ def merge_records(
55
+ self,
56
+ schedule_record: UKMotorGoldenRecord,
57
+ schedule_filename: str,
58
+ certificate_record: UKMotorGoldenRecord,
59
+ certificate_filename: str,
60
+ ) -> tuple[UKMotorGoldenRecord, list[ConflictEntry]]:
61
+ """
62
+ Merge Schedule and Certificate extractions into one Golden Record.
63
+
64
+ Schedule is master for: vehicle_details, excess_breakdown,
65
+ financial_summary, driver DOB/occupation/license_type, NCB, cover_type.
66
+ Certificate is master for: class_of_use, driving_other_cars.
67
+
68
+ Returns
69
+ -------
70
+ tuple[UKMotorGoldenRecord, list[ConflictEntry]]
71
+ (golden_record, list of fields where the two documents disagreed)
72
+ """
73
+ conflicts: list[ConflictEntry] = []
74
+ merged = UKMotorGoldenRecord()
75
+
76
+ # ── Policy header ───────────────────────────────────────────────────
77
+ merged.policy_header = _merge_policy_header(schedule_record, certificate_record, conflicts)
78
+
79
+ # ── Vehicle details: Schedule is authoritative ──────────────────────
80
+ merged.vehicle_details = schedule_record.vehicle_details
81
+
82
+ # ── Drivers: Schedule has DOB/occupation/licence ────────────────────
83
+ merged.driver_details = _merge_drivers(schedule_record, certificate_record, conflicts)
84
+
85
+ # ── Cover and excesses: hybrid ──────────────────────────────────────
86
+ # class_of_use + driving_other_cars → Certificate
87
+ # cover_type + NCB + excess_breakdown → Schedule
88
+ merged.cover_and_excesses = _merge_cover_and_excesses(
89
+ schedule_record, certificate_record, conflicts
90
+ )
91
+
92
+ # ── Financial summary: Schedule is authoritative ────────────────────
93
+ merged.financial_summary = schedule_record.financial_summary
94
+
95
+ # ── Additional risk data: Schedule is authoritative ─────────────────
96
+ merged.additional_risk_data = schedule_record.additional_risk_data
97
+
98
+ # ── Merge field_citations from both source records ──────────────────
99
+ # Schedule wins on key conflicts (consistent with merge hierarchy).
100
+ # Stored on the merged record for provenance matching; excluded from JSON output.
101
+ sched_fc = dict(getattr(schedule_record, "field_citations", None) or {})
102
+ cert_fc = dict(getattr(certificate_record, "field_citations", None) or {})
103
+ merged_fc = {**cert_fc, **sched_fc}
104
+ if merged_fc:
105
+ merged.field_citations = merged_fc
106
+
107
+ if conflicts:
108
+ logger.info(
109
+ "Merge conflicts (%d): %s",
110
+ len(conflicts),
111
+ [c.field for c in conflicts],
112
+ )
113
+
114
+ logger.info(
115
+ "Merge complete: schedule='%s' + certificate='%s' — %d conflict(s)",
116
+ schedule_filename, certificate_filename, len(conflicts),
117
+ )
118
+ return merged, conflicts
119
+
120
+
121
+ # ---------------------------------------------------------------------------
122
+ # Private merge helpers
123
+ # ---------------------------------------------------------------------------
124
+
125
+
126
+ def _first(*values):
127
+ """Return the first non-None value, or None if all are None."""
128
+ for v in values:
129
+ if v is not None:
130
+ return v
131
+ return None
132
+
133
+
134
+ def _check_conflict(
135
+ conflicts: list[ConflictEntry],
136
+ field: str,
137
+ sched_val,
138
+ cert_val,
139
+ winner: str,
140
+ ):
141
+ """
142
+ Detect a conflict between two scalar values, record it, and return the winner's value.
143
+
144
+ A conflict is logged only when both values are non-None *and* differ.
145
+ ``winner`` must be ``"schedule"`` or ``"certificate"``.
146
+ """
147
+ if sched_val is not None and cert_val is not None:
148
+ if str(sched_val).strip().lower() != str(cert_val).strip().lower():
149
+ conflicts.append(ConflictEntry(
150
+ field=field,
151
+ schedule_value=str(sched_val),
152
+ certificate_value=str(cert_val),
153
+ winner=winner,
154
+ ))
155
+ if winner == "certificate":
156
+ return _first(cert_val, sched_val)
157
+ return _first(sched_val, cert_val) # schedule wins (default)
158
+
159
+
+ def _find_matching_driver(name: str, candidates: list[Driver]) -> Driver | None:
+     """
+     Find the best-matching driver from *candidates* using fuzzy name matching.
+
+     Uses ``rapidfuzz.fuzz.token_sort_ratio`` so middle-name or word-order
+     differences (e.g. "JOHN A SMITH" vs "SMITH JOHN") still match. Names are
+     upper-cased first, so the comparison is case-insensitive.
+     Returns None when the best score is below ``_DRIVER_NAME_MATCH_THRESHOLD``.
+     """
+     upper = name.strip().upper()
+     try:
+         from rapidfuzz import fuzz as rfuzz
+     except ImportError:
+         # Graceful fallback: exact uppercase match (original behaviour)
+         return next((d for d in candidates if d.name.strip().upper() == upper), None)
+
+     best_score = 0
+     best_driver: Driver | None = None
+     for candidate in candidates:
+         # Compare upper-cased names: recent rapidfuzz versions do not
+         # normalise case unless a processor is supplied.
+         score = rfuzz.token_sort_ratio(upper, candidate.name.strip().upper())
+         if score > best_score:
+             best_score = score
+             best_driver = candidate
+     return best_driver if best_score >= _DRIVER_NAME_MATCH_THRESHOLD else None
183
+
184
+
185
+ def _merge_policy_header(
186
+ sched: UKMotorGoldenRecord,
187
+ cert: UKMotorGoldenRecord,
188
+ conflicts: list[ConflictEntry],
189
+ ) -> Optional[PolicyHeader]:
190
+ """Schedule is master; fill any gap from Certificate."""
191
+ sh = sched.policy_header or PolicyHeader()
192
+ ch = cert.policy_header or PolicyHeader()
193
+
194
+ poc: Optional[PeriodOfCover] = _first(sh.period_of_cover, ch.period_of_cover)
195
+
196
+ return PolicyHeader(
197
+ policy_number=_check_conflict(conflicts, "policy_header.policy_number", sh.policy_number, ch.policy_number, "schedule"),
198
+ insurer=_check_conflict(conflicts, "policy_header.insurer", sh.insurer, ch.insurer, "schedule"),
199
+ product_name=_check_conflict(conflicts, "policy_header.product_name", sh.product_name, ch.product_name, "schedule"),
200
+ period_of_cover=poc,
201
+ )
202
+
203
+
204
+ def _merge_drivers(
205
+ sched: UKMotorGoldenRecord,
206
+ cert: UKMotorGoldenRecord,
207
+ conflicts: list[ConflictEntry],
208
+ ) -> list[Driver]:
209
+ """
210
+ Schedule drivers are the base (they carry DOB, occupation, license_type).
211
+ For each Schedule driver, fuzzy-match against Certificate drivers and enrich
212
+ with relationship or is_main_driver if the Schedule record lacks them.
213
+ Falls back to the Certificate list when Schedule has no drivers.
214
+
215
+ Uses rapidfuzz ``token_sort_ratio`` with an 85-point threshold so minor
216
+ name variations (initials, hyphenation, word order) still merge correctly.
217
+ """
218
+ sched_drivers = sched.driver_details or []
219
+ cert_drivers = cert.driver_details or []
220
+
221
+ if not sched_drivers:
222
+ return cert_drivers
223
+
+     merged: list[Driver] = []
+     for sd in sched_drivers:
+         cd = _find_matching_driver(sd.name, cert_drivers)
+
+         if cd is not None and sd.is_main_driver != cd.is_main_driver:
+             conflicts.append(ConflictEntry(
+                 field=f"driver_details[{sd.name}].is_main_driver",
+                 schedule_value=str(sd.is_main_driver),
+                 certificate_value=str(cd.is_main_driver),
+                 # The merged flag below is the logical OR of both sources, so
+                 # the document that says True is the one that actually wins.
+                 winner="schedule" if sd.is_main_driver else "certificate",
+             ))
+
+         merged.append(Driver(
+             name=sd.name,
+             dob=_first(sd.dob, cd.dob if cd else None),
+             relationship=_first(sd.relationship, cd.relationship if cd else None),
+             occupation=_first(sd.occupation, cd.occupation if cd else None),
+             license_type=_first(sd.license_type, cd.license_type if cd else None),
+             is_main_driver=sd.is_main_driver or (cd.is_main_driver if cd else False),
+             specific_excess=_first(sd.specific_excess, cd.specific_excess if cd else None),
+         ))
+     return merged
246
+
247
+
248
+ def _merge_cover_and_excesses(
249
+ sched: UKMotorGoldenRecord,
250
+ cert: UKMotorGoldenRecord,
251
+ conflicts: list[ConflictEntry],
252
+ ) -> Optional[CoverAndExcesses]:
253
+ """
254
+ Hybrid merge:
255
+ - class_of_use, driving_other_cars → Certificate is master
256
+ - cover_type, NCB, excess_breakdown → Schedule is master
257
+ """
258
+ sc = sched.cover_and_excesses or CoverAndExcesses()
259
+ cc = cert.cover_and_excesses or CoverAndExcesses()
260
+
261
+ return CoverAndExcesses(
262
+ cover_type=_check_conflict(conflicts, "cover_and_excesses.cover_type", sc.cover_type, cc.cover_type, "schedule"),
263
+ no_claims_discount=_first(sc.no_claims_discount, cc.no_claims_discount),
264
+ excess_breakdown=_first(sc.excess_breakdown, cc.excess_breakdown),
265
+ # Certificate is authoritative for legal-use fields
266
+ class_of_use=_check_conflict(conflicts, "cover_and_excesses.class_of_use", sc.class_of_use, cc.class_of_use, "certificate"),
267
+ driving_other_cars=_check_conflict(conflicts, "cover_and_excesses.driving_other_cars", sc.driving_other_cars, cc.driving_other_cars, "certificate"),
268
+ )
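The Hierarchy of Truth merge in arbiter.py comes down to one rule per field: record a conflict when both documents state different values, then return the designated winner's value, falling back to the other document when the winner has none. The sketch below shows that rule on plain values. `Conflict` and `resolve` are hypothetical stand-ins for `ConflictEntry` and `_check_conflict`, written only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Conflict:
    field: str
    schedule_value: str
    certificate_value: str
    winner: str


def resolve(field: str, sched_val: Any, cert_val: Any, winner: str, conflicts: list[Conflict]) -> Any:
    """Return the winner's value, falling back to the other source when it is None."""
    both_present = sched_val is not None and cert_val is not None
    # Values are compared case-insensitively after trimming, as in the module above.
    if both_present and str(sched_val).strip().lower() != str(cert_val).strip().lower():
        conflicts.append(Conflict(field, str(sched_val), str(cert_val), winner))
    if winner == "certificate":
        first, second = cert_val, sched_val
    else:
        first, second = sched_val, cert_val
    return first if first is not None else second


if __name__ == "__main__":
    log: list[Conflict] = []
    print(resolve("cover_type", "Comprehensive", "TPFT", "schedule", log))
    print(resolve("class_of_use", None, "SDP", "certificate", log))
    print(log)
```

Because a conflict is logged only when both sides are present, a value that is merely missing from one document fills the gap silently instead of cluttering the review queue.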
src/main.py ADDED
@@ -0,0 +1,223 @@
1
+ """
2
+ main.py — Agentic orchestrator for UK Motor Insurance IDP.
3
+
4
+ Usage
5
+ -----
6
+ # Process all PDFs in a folder and print the Golden Record:
7
+ python src/main.py --input ./docs --output ./output/golden_record.json
8
+
9
+ # Verbose logging:
10
+ python src/main.py --input ./docs --output ./output/golden_record.json --log-level DEBUG
11
+
12
+ Environment
13
+ -----------
14
+ GROQ_API_KEY Required. Your Groq API key.
15
+ """
16
+ from __future__ import annotations
17
+
18
+ import argparse
19
+ import json
20
+ import logging
21
+ import sys
22
+ from datetime import datetime
23
+ from pathlib import Path
24
+
25
+ from agents import InsuranceExtractionAgents
26
+ from arbiter import PolicyArbiter
27
+ from pipeline import run_extraction_pipeline
28
+ from privacy import PIIMasker
29
+ from schema import DocumentType, UKMotorGoldenRecord
30
+ from settings import settings
31
+
32
+ # ---------------------------------------------------------------------------
33
+ # Logging
34
+ # ---------------------------------------------------------------------------
35
+
36
+ logger = logging.getLogger("pipeline")
37
+
38
+
39
+ # ---------------------------------------------------------------------------
40
+ # Pipeline
41
+ # ---------------------------------------------------------------------------
42
+
43
+
44
+ class DocumentPipeline:
45
+ """
46
+ End-to-end agentic pipeline.
47
+
48
+ Steps
49
+ -----
50
+ 1. Scan *input_dir* for PDF files.
51
+ 2. For each PDF: mask PII → classify → extract with specialist agent.
52
+ 3. Pass all extractions to PolicyArbiter.
+     4. Persist the Golden Record JSON to *output_path*. Arbiter conflicts are
+        written to conflicts.json in the debug directory when debug is enabled.
54
+ """
55
+
56
+ # Document-type priority for display ordering (matches arbiter priority)
57
+ _DOC_ORDER = [
58
+ DocumentType.SCHEDULE,
59
+ DocumentType.CERTIFICATE,
60
+ DocumentType.STATEMENT_OF_FACT,
61
+ DocumentType.POLICY_BOOKLET,
62
+ DocumentType.UNKNOWN,
63
+ ]
64
+
65
+ def __init__(
66
+ self,
67
+ input_dir: str | Path,
68
+ output_path: str | Path = settings.pipeline.output_path,
69
+ mask_dates: bool = settings.pii.mask_dates,
70
+ ) -> None:
71
+ self.input_dir = Path(input_dir)
72
+ self.output_path = Path(output_path)
73
+
74
+ # Create a timestamped debug run directory once per pipeline instance
75
+ run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
76
+ self.debug_dir: Path | None = None
77
+ if settings.debug.enabled:
78
+ self.debug_dir = Path(settings.debug.output_dir) / f"run_{run_ts}"
79
+ self.debug_dir.mkdir(parents=True, exist_ok=True)
80
+ logger.info("Debug artifacts → %s", self.debug_dir)
81
+
82
+ self._masker = PIIMasker(mask_dates=mask_dates)
83
+ self._agent = InsuranceExtractionAgents(masker=self._masker, debug_dir=self.debug_dir)
84
+
85
+ # ------------------------------------------------------------------
86
+ # Public API
87
+ # ------------------------------------------------------------------
88
+
89
+ def run(self) -> UKMotorGoldenRecord:
90
+ """Execute the full pipeline and return the UKMotorGoldenRecord."""
91
+ pdfs = self._discover_pdfs()
92
+ if not pdfs:
93
+ raise FileNotFoundError(
94
+ f"No PDF files found in '{self.input_dir}'. "
95
+ "Ensure the folder contains at least one .pdf file."
96
+ )
97
+
98
+ logger.info("Found %d PDF(s): %s", len(pdfs), [p.name for p in pdfs])
99
+
100
+ # ── Stages 1 + 2: Extract + Arbitrate (shared logic via pipeline.py) ──
101
+ golden, conflicts, _ = run_extraction_pipeline(
102
+ pdf_paths=pdfs,
103
+ agent=self._agent,
104
+ with_provenance=False,
105
+ )
106
+
107
+ # ── Stage 3: Persist ──────────────────────────────────────────────
108
+ self._save(golden)
109
+ logger.info("Golden Record saved → %s", self.output_path)
110
+
+         if conflicts and self.debug_dir:
+             # json is already imported at module level; no local alias is needed.
+             (self.debug_dir / "conflicts.json").write_text(
+                 json.dumps([c.model_dump() for c in conflicts], indent=2),
+                 encoding="utf-8",
+             )
+             logger.info(
+                 "Arbiter conflicts (%d) written → %s/conflicts.json",
+                 len(conflicts), self.debug_dir,
+             )
121
+
122
+ return golden
123
+
124
+ # ------------------------------------------------------------------
125
+ # Private helpers
126
+ # ------------------------------------------------------------------
127
+
128
+ def _discover_pdfs(self) -> list[Path]:
129
+ """Return the PDF files in the input folder, sorted alphabetically by filename."""
130
+ if not self.input_dir.is_dir():
131
+ raise NotADirectoryError(f"'{self.input_dir}' is not a directory.")
132
+ return sorted(self.input_dir.glob("*.pdf"), key=lambda p: p.name)
133
+
134
+ def _save(self, golden: UKMotorGoldenRecord) -> None:
135
+ self.output_path.parent.mkdir(parents=True, exist_ok=True)
136
+ self.output_path.write_text(golden.model_dump_json(indent=2, exclude_none=True), encoding="utf-8")
137
+
138
+
139
+ # ---------------------------------------------------------------------------
140
+ # CLI entry point
141
+ # ---------------------------------------------------------------------------
142
+
143
+
144
+ def _parse_args() -> argparse.Namespace:
145
+ parser = argparse.ArgumentParser(
146
+ description="Agentic UK Motor Insurance IDP Pipeline",
147
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter,
148
+ )
149
+ parser.add_argument(
150
+ "--input", "-i",
151
+ required=True,
152
+ help="Folder containing input PDF documents.",
153
+ )
154
+ parser.add_argument(
155
+ "--output", "-o",
156
+ default=settings.pipeline.output_path,
157
+ help="Output path for the Golden Record JSON.",
158
+ )
159
+ parser.add_argument(
160
+ "--mask-dates",
161
+ action="store_true",
162
+ default=False,
163
+ help="Also redact DATE_TIME entities during PII masking.",
164
+ )
165
+ parser.add_argument(
166
+ "--log-level",
167
+ default=settings.pipeline.log_level,
168
+ choices=["DEBUG", "INFO", "WARNING", "ERROR"],
169
+ help="Logging verbosity.",
170
+ )
171
+ return parser.parse_args()
172
+
173
+
174
+ def main() -> None:
175
+ args = _parse_args()
176
+
177
+ # ── Logging setup: console + optional file handler ─────────────────────
178
+ log_format = "%(asctime)s [%(levelname)s] %(name)s — %(message)s"
179
+ logging.basicConfig(
180
+ level=args.log_level,
181
+ format=log_format,
182
+ datefmt="%H:%M:%S",
183
+ stream=sys.stdout,
184
+ )
185
+ if settings.debug.enabled:
186
+ run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
187
+ log_dir = Path(settings.debug.output_dir) / f"run_{run_ts}"
188
+ log_dir.mkdir(parents=True, exist_ok=True)
189
+ file_handler = logging.FileHandler(log_dir / "pipeline.log", encoding="utf-8")
190
+ file_handler.setLevel(args.log_level)
191
+ file_handler.setFormatter(logging.Formatter(log_format, datefmt="%H:%M:%S"))
192
+ logging.getLogger().addHandler(file_handler)
193
+ logger.info("Log file: %s", log_dir / "pipeline.log")
194
+
195
+ pipeline = DocumentPipeline(
196
+ input_dir=args.input,
197
+ output_path=args.output,
198
+ mask_dates=args.mask_dates,
199
+ )
200
+
201
+ golden = pipeline.run()
202
+
203
+ # Print a compact summary to stdout
204
+ hdr = golden.policy_header
205
+ veh = golden.vehicle_details
206
+ cov = golden.cover_and_excesses
207
+ drivers = golden.driver_details or []
208
+ print("\n" + "=" * 60)
209
+ print(" GOLDEN RECORD SUMMARY")
210
+ print("=" * 60)
211
+ print(f" Policy # : {hdr.policy_number if hdr else 'N/A'}")
212
+ print(f" Insurer : {hdr.insurer if hdr else 'N/A'}")
213
+ print(f" VRM : {veh.vrm if veh else 'N/A'}")
214
+ print(f" Vehicle : {' '.join(filter(None, [veh.make, veh.model])) or 'N/A' if veh else 'N/A'}")
215
+ print(f" Cover : {cov.cover_type if cov else 'N/A'}")
216
+ print(f" Class of use : {cov.class_of_use if cov else 'N/A'}")
217
+ print(f" Drivers : {len(drivers)}")
218
+ print("=" * 60)
219
+ print(f"\nFull JSON written to: {args.output}\n")
220
+
221
+
222
+ if __name__ == "__main__":
223
+ main()
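Note on the debug directory: `main()` and `DocumentPipeline.__init__` each call `datetime.now()` to build a `run_<timestamp>` folder name, so `pipeline.log` and `conflicts.json` can end up in different folders when the two calls straddle a second boundary. A minimal sketch of one way to avoid that, assuming a hypothetical `make_run_dir` helper (not part of this commit) whose result would be handed to both the logging setup and the pipeline:

```python
from __future__ import annotations

from datetime import datetime
from pathlib import Path


def make_run_dir(base: str | Path, now: datetime | None = None) -> Path:
    """Build the run_<timestamp> path once so every consumer shares it.

    Hypothetical helper for illustration. It only builds the path;
    the caller decides when to create the directory.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    return Path(base) / f"run_{stamp}"


# Compute the directory once, then derive both artifact paths from it.
run_dir = make_run_dir("debug", now=datetime(2026, 4, 15, 9, 30, 0))
log_file = run_dir / "pipeline.log"
conflicts_file = run_dir / "conflicts.json"
```

With a single `run_dir`, the log file and the conflict report are guaranteed to share a parent folder regardless of how long pipeline construction takes.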
src/pipeline.py ADDED
@@ -0,0 +1,131 @@
1
+ """
2
+ pipeline.py — Shared PDF-routing and arbitration logic.
3
+
4
+ Both the CLI (main.py / DocumentPipeline) and the API (api.py / process_documents)
5
+ run the same extraction loop: route PDFs to Schedule/Certificate slots, call
6
+ the PolicyArbiter, and return the merged record plus any detected conflicts.
7
+
8
+ Extracting this logic here eliminates the duplication that previously existed
9
+ between those two entry-points and makes the behaviour easy to test in isolation.
10
+
11
+ Usage
12
+ -----
13
+ from pipeline import run_extraction_pipeline
14
+
15
+ golden, conflicts, corpora = run_extraction_pipeline(
16
+ pdf_paths=pdf_paths,
17
+ agent=agent,
18
+ with_provenance=True,
19
+ )
20
+ """
21
+ from __future__ import annotations
22
+
23
+ import logging
24
+ from pathlib import Path
25
+ from typing import Any
26
+
27
+ from agents import ExtractionFailedError, InsuranceExtractionAgents
28
+ from arbiter import PolicyArbiter
29
+ from schema import ConflictEntry, DocumentType, UKMotorGoldenRecord
30
+
31
+ logger = logging.getLogger(__name__)
32
+
33
+
34
+ def run_extraction_pipeline(
35
+ pdf_paths: list[Path],
36
+ agent: InsuranceExtractionAgents,
37
+ *,
38
+ with_provenance: bool = False,
39
+ ) -> tuple[UKMotorGoldenRecord, list[ConflictEntry], list[Any]]:
40
+ """
41
+ Route PDFs to Schedule/Certificate slots, arbitrate, and return the results.
42
+
43
+ Parameters
44
+ ----------
45
+ pdf_paths : list[Path]
46
+ Paths to the PDF documents to process.
47
+ agent : InsuranceExtractionAgents
48
+ Configured extraction agent (carries masker, debug_dir, prompts, etc.).
49
+ with_provenance : bool
50
+ When True, builds and returns ProvenanceCorpus objects for each PDF.
51
+ Set to True when running via the API (Visual Audit UI needs geometry data).
52
+ Set to False for the CLI path (faster, no corpus overhead).
53
+
54
+ Returns
55
+ -------
56
+ tuple[UKMotorGoldenRecord, list[ConflictEntry], list[ProvenanceCorpus]]
57
+ * golden_record — the merged authoritative policy record
58
+ * conflicts — fields where Schedule and Certificate disagreed
59
+ * corpora — ProvenanceCorpus objects (empty list when with_provenance=False)
60
+
61
+ Raises
62
+ ------
63
+ RuntimeError
64
+ When neither a Schedule nor a Certificate could be extracted from any PDF.
65
+ """
66
+ schedule_record: UKMotorGoldenRecord | None = None
67
+ schedule_filename = "unknown_schedule.pdf"
68
+ certificate_record: UKMotorGoldenRecord | None = None
69
+ certificate_filename = "unknown_certificate.pdf"
70
+ corpora: list[Any] = []
71
+ failed: list[str] = []
72
+
73
+ for pdf_path in pdf_paths:
74
+ try:
75
+ if with_provenance:
76
+ record, doc_type_str, corpus = agent.process_with_provenance(pdf_path)
77
+ if corpus is not None and corpus.items:
78
+ corpora.append(corpus)
79
+ else:
80
+ record, doc_type_str = agent.process(pdf_path)
81
+
82
+ logger.info(" ✓ %s → %s", pdf_path.name, doc_type_str)
83
+
84
+ if doc_type_str == DocumentType.SCHEDULE.value and schedule_record is None:
85
+ schedule_record = record
86
+ schedule_filename = pdf_path.name
87
+ elif doc_type_str == DocumentType.CERTIFICATE.value and certificate_record is None:
88
+ certificate_record = record
89
+ certificate_filename = pdf_path.name
90
+ else:
91
+ logger.info(" ~ %s (%s) — not used in merge", pdf_path.name, doc_type_str)
92
+
93
+ except ExtractionFailedError as exc:
94
+ logger.error(" ✗ Extraction failed for %s: %s", pdf_path.name, exc)
95
+ failed.append(pdf_path.name)
96
+ except Exception as exc: # noqa: BLE001
97
+ logger.error(" ✗ %s failed: %s", pdf_path.name, exc)
98
+ failed.append(pdf_path.name)
99
+
100
+ if failed:
101
+ logger.warning("Skipped %d document(s): %s", len(failed), failed)
102
+
103
+ if schedule_record is None and certificate_record is None:
104
+ raise RuntimeError(
105
+ "No Schedule or Certificate extracted. "
106
+ "Check GROQ_API_KEY and that the PDFs are readable."
107
+ )
108
+
109
+ if schedule_record is None:
110
+ logger.warning("No Schedule found — using empty record as fallback")
111
+ if certificate_record is None:
112
+ logger.warning("No Certificate found — using empty record as fallback")
113
+
114
+ schedule_record = schedule_record or UKMotorGoldenRecord()
115
+ certificate_record = certificate_record or UKMotorGoldenRecord()
116
+
117
+ logger.info("Merging Schedule + Certificate via PolicyArbiter…")
118
+ arbiter = PolicyArbiter()
119
+ golden, conflicts = arbiter.merge_records(
120
+ schedule_record, schedule_filename,
121
+ certificate_record, certificate_filename,
122
+ )
123
+
124
+ if conflicts:
125
+ logger.info(
126
+ "Arbiter detected %d conflict(s): %s",
127
+ len(conflicts),
128
+ [c.field for c in conflicts],
129
+ )
130
+
131
+ return golden, conflicts, corpora
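The routing loop in `run_extraction_pipeline` fills each slot at most once: the first Schedule and the first Certificate win, and every other document is logged as unused. Because the CLI passes PDFs sorted by filename, "first" there means alphabetically first. A simplified sketch of that rule, using plain strings in place of `DocumentType` values and extracted records (the string values and the `route` helper are assumptions for illustration only):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Slots:
    schedule: str | None = None
    certificate: str | None = None


def route(docs: list[tuple[str, str]]) -> tuple[Slots, list[str]]:
    """Assign (filename, doc_type) pairs to slots; the first of each type wins."""
    slots = Slots()
    unused: list[str] = []
    for name, doc_type in docs:
        if doc_type == "Schedule" and slots.schedule is None:
            slots.schedule = name
        elif doc_type == "Certificate" and slots.certificate is None:
            slots.certificate = name
        else:
            # Duplicates and other document types never reach the merge.
            unused.append(name)
    return slots, unused


slots, unused = route([
    ("a_schedule.pdf", "Schedule"),
    ("b_schedule.pdf", "Schedule"),
    ("certificate.pdf", "Certificate"),
    ("policy_booklet.pdf", "PolicyBooklet"),
])
```

Here `b_schedule.pdf` is ignored even if it is the more recent document, so callers that need a specific Schedule may want to control the input order themselves.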
src/privacy.py ADDED
@@ -0,0 +1,186 @@
1
+ """
2
+ privacy.py — PII detection and masking via Microsoft Presidio.
3
+
4
+ Entities masked before any text is sent to the LLM:
5
+ PERSON, PHONE_NUMBER, EMAIL_ADDRESS, UK_NHS, UK_NIN, CREDIT_CARD,
6
+ IBAN_CODE, LOCATION, IP_ADDRESS, URL, and DATE_TIME (opt-in)
7
+
8
+ Usage
9
+ -----
10
+ masker = PIIMasker()
11
+ clean_text, mapping = masker.mask(raw_markdown)
12
+ # ... call LLM with clean_text ...
13
+ # If you ever need to restore originals:
14
+ restored = masker.restore(llm_output, mapping)
15
+ """
16
+ from __future__ import annotations
17
+
18
+ import re
19
+ from typing import Optional
20
+
21
+ from presidio_analyzer import AnalyzerEngine, RecognizerResult
22
+ from presidio_analyzer.nlp_engine import NlpEngineProvider
23
+ from presidio_anonymizer import AnonymizerEngine
24
+ from presidio_anonymizer.entities import OperatorConfig
25
+
26
+ from settings import settings
27
+
28
+
29
+ # ---------------------------------------------------------------------------
30
+ # Default entity list (tuned for UK motor insurance documents)
31
+ # ---------------------------------------------------------------------------
32
+
33
+ UK_MOTOR_ENTITIES: list[str] = [
34
+ "PERSON",
35
+ "PHONE_NUMBER",
36
+ "EMAIL_ADDRESS",
37
+ "UK_NHS",
38
+ "UK_NIN", # National Insurance Number
39
+ "CREDIT_CARD",
40
+ "IBAN_CODE",
41
+ "LOCATION", # postcodes / addresses
42
+ "IP_ADDRESS",
43
+ "URL",
44
+ ]
45
+
46
+ # Sentinel prefix used for replacement tokens so we can detect them reliably
47
+ _TOKEN_PREFIX = "MASKED_"
48
+
49
+
50
+ class PIIMasker:
51
+ """
52
+ Stateless masker: call `mask()` to redact PII in a text string.
53
+
54
+ Parameters
55
+ ----------
56
+ entities : list[str]
57
+ Presidio entity types to redact. Defaults to ``settings.pii.entities``.
58
+ language : str
59
+ ISO 639-1 language code passed to the Presidio analyzer.
60
+ mask_dates : bool
61
+ When True, DATE_TIME entities are also redacted. Default False
62
+ because insurance documents are date-heavy and stripping them
63
+ would break structured extraction.
64
+ score_threshold : float
65
+ Minimum confidence score (0-1) for a detected entity to be masked.
66
+ """
67
+
68
+ def __init__(
69
+ self,
70
+ entities: Optional[list[str]] = None,
71
+ language: str = settings.pii.language,
72
+ mask_dates: bool = settings.pii.mask_dates,
73
+ score_threshold: float = settings.pii.score_threshold,
74
+ ) -> None:
75
+ self._entities = list(entities or settings.pii.entities)
76
+ if mask_dates and "DATE_TIME" not in self._entities:
77
+ self._entities.append("DATE_TIME")
78
+
79
+ self._language = language
80
+ self._score_threshold = score_threshold
81
+
82
+ # Build NLP engine (spaCy en_core_web_lg preferred; falls back to sm)
83
+ nlp_config = {
84
+ "nlp_engine_name": "spacy",
85
+ "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
86
+ }
87
+ try:
88
+ provider = NlpEngineProvider(nlp_configuration=nlp_config)
89
+ nlp_engine = provider.create_engine()
90
+ except OSError:
91
+ # Fall back to the small model if lg is not installed
92
+ nlp_config["models"][0]["model_name"] = "en_core_web_sm"
93
+ provider = NlpEngineProvider(nlp_configuration=nlp_config)
94
+ nlp_engine = provider.create_engine()
95
+
96
+ self._analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=[language])
97
+ self._anonymizer = AnonymizerEngine()
98
+
99
+ # ------------------------------------------------------------------
100
+ # Public API
101
+ # ------------------------------------------------------------------
102
+
103
+ def mask(self, text: str) -> tuple[str, dict[str, str]]:
104
+ """
105
+ Redact PII in *text* and return (masked_text, token_map).
106
+
107
+ token_map maps placeholder tokens back to original values, allowing
108
+ optional restoration after LLM processing.
109
+
110
+ Example
111
+ -------
112
+ >>> masked, mapping = masker.mask("John Smith drives AB12 CDE")
113
+ >>> masked
114
+ 'MASKED_PERSON_1 drives AB12 CDE'
115
+ >>> mapping
116
+ {'MASKED_PERSON_1': 'John Smith'}
117
+ """
118
+ results: list[RecognizerResult] = self._analyzer.analyze(
119
+ text=text,
120
+ entities=self._entities,
121
+ language=self._language,
122
+ score_threshold=self._score_threshold,
123
+ )
124
+
125
+ if not results:
126
+ return text, {}
127
+
128
+ # Build per-entity-type counters for unique token names
129
+ counters: dict[str, int] = {}
130
+ token_map: dict[str, str] = {}
131
+ operators: dict[str, OperatorConfig] = {}
132
+
133
+ # Sort by position so token numbering is left-to-right and deterministic
134
+ results_sorted = sorted(results, key=lambda r: r.start)
135
+
136
+ # We need custom lambda operators to generate named tokens.
137
+ # Presidio's "replace" operator uses a fixed `new_value`; we work
138
+ # around this by building a value map keyed on (entity_type, original).
139
+ original_to_token: dict[tuple[str, str], str] = {}
140
+
141
+ for r in results_sorted:
142
+ original = text[r.start : r.end]
143
+ key = (r.entity_type, original)
144
+ if key not in original_to_token:
145
+ counters[r.entity_type] = counters.get(r.entity_type, 0) + 1
146
+ token = f"{_TOKEN_PREFIX}{r.entity_type}_{counters[r.entity_type]}"
147
+ original_to_token[key] = token
148
+ token_map[token] = original
149
+
150
+ # Perform replacement manually (Presidio replace operator doesn't
151
+ # support per-occurrence dynamic values in a single pass).
152
+ masked_text = _replace_spans(text, results_sorted, original_to_token)
153
+ return masked_text, token_map
154
+
155
+ def restore(self, text: str, token_map: dict[str, str]) -> str:
156
+ """
157
+ Substitute masked tokens back to original PII values.
158
+
159
+ This is provided for completeness / testing; in production the LLM
160
+ output is kept masked and stored as-is for GDPR compliance.
161
+ """
162
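+ for token, original in sorted(token_map.items(), key=lambda kv: len(kv[0]), reverse=True): # longest first, so MASKED_PERSON_1 cannot clobber MASKED_PERSON_10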
163
+ text = text.replace(token, original)
164
+ return text
165
+
166
+
167
+ # ---------------------------------------------------------------------------
168
+ # Internal helpers
169
+ # ---------------------------------------------------------------------------
170
+
171
+
172
+ def _replace_spans(
173
+ text: str,
174
+ results: list[RecognizerResult],
175
+ original_to_token: dict[tuple[str, str], str],
176
+ ) -> str:
177
+ """
178
+ Replace PII spans in *text* with their corresponding tokens.
179
+ Processes spans right-to-left to keep offset arithmetic valid.
180
+ """
181
+ chars = list(text)
182
+ for r in sorted(results, key=lambda r: r.start, reverse=True):
183
+ original = text[r.start : r.end]
184
+ token = original_to_token.get((r.entity_type, original), original)
185
+ chars[r.start : r.end] = list(token)
186
+ return "".join(chars)
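Two details of the masking round trip are easy to get wrong: spans have to be replaced right-to-left so that earlier offsets stay valid, and tokens have to be restored longest-first because `str.replace` on `MASKED_PERSON_1` also matches the front of `MASKED_PERSON_10`. A minimal standalone sketch of both, assuming non-overlapping spans given as `(start, end, token)` tuples (the function names and sample data are illustrative, not the module's API):

```python
from __future__ import annotations


def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace (start, end, token) spans, working right-to-left.

    Editing from the end keeps the offsets of earlier spans valid.
    Assumes the spans do not overlap.
    """
    for start, end, token in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + token + text[end:]
    return text


def restore(text: str, token_map: dict[str, str]) -> str:
    """Substitute originals back in, longest token first."""
    for token in sorted(token_map, key=len, reverse=True):
        text = text.replace(token, token_map[token])
    return text


masked = replace_spans(
    "John met Jane",
    [(0, 4, "MASKED_PERSON_1"), (9, 13, "MASKED_PERSON_2")],
)

token_map = {"MASKED_PERSON_1": "John Smith", "MASKED_PERSON_10": "Jane Doe"}
restored = restore("MASKED_PERSON_10 called", token_map)

# For comparison: replacing the shorter token first corrupts the longer one.
naive = "MASKED_PERSON_10 called".replace("MASKED_PERSON_1", "John Smith")
```

The prefix collision only appears once a document contains ten or more entities of one type, which is why it tends to slip past small test fixtures.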
src/prompts.py ADDED
@@ -0,0 +1,149 @@
1
+ """
2
+ prompts.py — Versioned prompt registry for the UK Motor Insurance IDP pipeline.
3
+
4
+ Loads prompt text from prompts.yaml so prompts can be updated, versioned, and
5
+ reviewed without touching Python source code.
6
+
7
+ Usage
8
+ -----
9
+ registry = PromptRegistry() # uses active_version from YAML
10
+ registry = PromptRegistry(version="v2") # pin to a specific version
11
+ registry = PromptRegistry(config_path="custom.yaml")
12
+
13
+ system_prompt = registry.get(DocumentType.SCHEDULE)
14
+ print(registry.active_version) # → "v1"
15
+ print(registry.available_versions) # → ["v1"]
16
+ """
17
+ from __future__ import annotations
18
+
19
+ import logging
20
+ from pathlib import Path
21
+ from typing import Optional
22
+
23
+ import yaml
24
+
25
+ from schema import DocumentType
26
+
27
+ logger = logging.getLogger(__name__)
28
+
29
+ # Default path: <project_root>/config/prompts.yaml
30
+ # Resolved relative to this file's location (src/ → .. → config/)
31
+ _DEFAULT_CONFIG = Path(__file__).parent.parent / "config" / "prompts.yaml"
32
+
33
+ # Maps DocumentType enum values → YAML keys
34
+ _DOC_TYPE_TO_KEY: dict[DocumentType, str] = {
35
+ DocumentType.SCHEDULE: "Schedule",
36
+ DocumentType.CERTIFICATE: "Certificate",
37
+ DocumentType.STATEMENT_OF_FACT: "StatementOfFact",
38
+ DocumentType.POLICY_BOOKLET: "PolicyBooklet",
39
+ DocumentType.UNKNOWN: "_generic",
40
+ }
41
+
42
+ _GENERIC_KEY = "_generic"
43
+
44
+
45
+ class PromptRegistry:
46
+ """
47
+ Loads versioned prompts from a YAML file and resolves them by DocumentType.
48
+
49
+ Parameters
50
+ ----------
51
+ config_path : str | Path | None
52
+ Path to the YAML file. Defaults to ``config/prompts.yaml`` under the
53
+ project root (resolved relative to this module).
54
+ version : str | None
55
+ Prompt version to activate (e.g. ``"v1"``, ``"v2"``).
56
+ Defaults to the ``active_version`` key in the YAML file.
57
+ """
58
+
59
+ def __init__(
60
+ self,
61
+ config_path: Optional[str | Path] = None,
62
+ version: Optional[str] = None,
63
+ ) -> None:
64
+ self._config_path = Path(config_path) if config_path else _DEFAULT_CONFIG
65
+ self._raw = self._load_yaml()
66
+ self._active_version = version or self._raw.get("active_version", "v1")
67
+ self._prompts = self._resolve_version(self._active_version)
68
+
69
+ logger.info(
70
+ "PromptRegistry loaded: version=%s, path=%s",
71
+ self._active_version,
72
+ self._config_path,
73
+ )
74
+
75
+ # ------------------------------------------------------------------
76
+ # Public API
77
+ # ------------------------------------------------------------------
78
+
79
+ @property
80
+ def active_version(self) -> str:
81
+ """The currently active prompt version string."""
82
+ return self._active_version
83
+
84
+ @property
85
+ def available_versions(self) -> list[str]:
86
+ """All version keys defined in the YAML file."""
87
+ return list(self._raw.get("prompts", {}).keys())
88
+
89
+ def get(self, doc_type: DocumentType) -> str:
90
+ """
91
+ Return the system prompt for a given DocumentType.
92
+
93
+ Falls back to the ``_generic`` prompt if the specific key is missing.
94
+ Raises ``KeyError`` if ``_generic`` is also absent (misconfigured YAML).
95
+ """
96
+ key = _DOC_TYPE_TO_KEY.get(doc_type, _GENERIC_KEY)
97
+ prompt = self._prompts.get(key) or self._prompts.get(_GENERIC_KEY)
98
+ if not prompt:
99
+ raise KeyError(
100
+ f"No prompt found for DocumentType '{doc_type.value}' in version "
101
+ f"'{self._active_version}' of {self._config_path}. "
102
+ f"Ensure '{key}' or '{_GENERIC_KEY}' is defined."
103
+ )
104
+ return prompt.strip()
105
+
106
+ def reload(self) -> None:
107
+ """
108
+ Hot-reload prompts from disk without restarting the process.
109
+
110
+ Useful in long-running services when prompts.yaml is updated in place.
111
+ """
112
+ self._raw = self._load_yaml()
113
+ self._prompts = self._resolve_version(self._active_version)
114
+ logger.info("PromptRegistry reloaded from %s", self._config_path)
115
+
116
+ def switch_version(self, version: str) -> None:
117
+ """
118
+ Switch the active prompt version at runtime.
119
+
120
+ Parameters
121
+ ----------
122
+ version : str
123
+ Must be a key present under ``prompts:`` in the YAML file.
124
+ """
125
+ self._prompts = self._resolve_version(version)
126
+ self._active_version = version
127
+ logger.info("PromptRegistry switched to version '%s'", version)
128
+
129
+ # ------------------------------------------------------------------
130
+ # Private helpers
131
+ # ------------------------------------------------------------------
132
+
133
+ def _load_yaml(self) -> dict:
134
+ if not self._config_path.exists():
135
+ raise FileNotFoundError(
136
+ f"Prompt configuration not found: {self._config_path}"
137
+ )
138
+ with self._config_path.open(encoding="utf-8") as fh:
139
+ return yaml.safe_load(fh) or {}
140
+
141
+ def _resolve_version(self, version: str) -> dict[str, str]:
142
+ versions = self._raw.get("prompts", {})
143
+ if version not in versions:
144
+ available = list(versions.keys())
145
+ raise ValueError(
146
+ f"Prompt version '{version}' not found in {self._config_path}. "
147
+ f"Available versions: {available}"
148
+ )
149
+ return versions[version]
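The lookup order in `PromptRegistry` is: explicit version argument, else `active_version` from the file, else `"v1"`; then the document-specific key, else `_generic`. A minimal sketch of that resolution on an in-memory dict, assuming the same YAML shape as `config/prompts.yaml` (the `resolve` function and the prompt strings are hypothetical stand-ins):

```python
from __future__ import annotations

RAW = {
    "active_version": "v2",
    "prompts": {
        "v1": {"Schedule": "v1 schedule prompt", "_generic": "v1 generic prompt"},
        "v2": {"Schedule": "v2 schedule prompt\n", "_generic": "v2 generic prompt"},
    },
}


def resolve(raw: dict, doc_key: str, version: str | None = None) -> str:
    """Pick a prompt by version, then by document key with a generic fallback."""
    active = version or raw.get("active_version", "v1")
    versions = raw.get("prompts", {})
    if active not in versions:
        raise ValueError(f"Prompt version '{active}' not found: {list(versions)}")
    prompts = versions[active]
    prompt = prompts.get(doc_key) or prompts.get("_generic")
    if not prompt:
        raise KeyError(f"No prompt for '{doc_key}' and no '_generic' fallback")
    return prompt.strip()


schedule_prompt = resolve(RAW, "Schedule")            # active version from the file
certificate_prompt = resolve(RAW, "Certificate")      # falls back to _generic
pinned_prompt = resolve(RAW, "Schedule", version="v1")
```

One consequence of the `or` fallback is that an empty prompt string is treated the same as a missing key, so a version cannot deliberately blank out a document type.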
src/provenance.py ADDED
@@ -0,0 +1,424 @@
1
+ """
2
+ provenance.py — Post-extraction provenance mapping for the Visual Audit UI.
3
+
4
+ After the LLM extracts a flat Golden Record, this module walks the record and
5
+ fuzzy-matches each extracted value against a ProvenanceCorpus built from the
6
+ Docling document IR. The LLM is never asked to self-report geometry — that
7
+ would cause hallucinations; this module handles localisation as a pure
8
+ post-processing step.
9
+
10
+ Coordinate convention
11
+ ─────────────────────
12
+ Docling bbox : PDF space — origin bottom-left, y increases upward, unit = pt
13
+ Stored bbox : Browser % — origin top-left, y increases downward, range 0–100
14
+
15
+ Conversion (per axis):
16
+ x0% = bbox.l / page_width * 100
17
+ y0% = (page_height - bbox.t) / page_height * 100 # top of element
18
+ x1% = bbox.r / page_width * 100
19
+ y1% = (page_height - bbox.b) / page_height * 100 # bottom of element
20
+ """
21
+ from __future__ import annotations
22
+
23
+ import logging
24
+ import re
25
+ from dataclasses import dataclass
26
+ from typing import Any, Iterator
27
+
28
+ logger = logging.getLogger(__name__)
29
+
30
+ # ── Matching parameters ──────────────────────────────────────────────────────
31
+ _MATCH_THRESHOLD = 78 # minimum rapidfuzz WRatio (0–100) for normalised-value fallback
32
+ _CITATION_THRESHOLD = 88 # minimum partial_ratio for LLM-supplied verbatim citation quotes
33
+ _MIN_VALUE_LEN = 4 # skip matching for values shorter than this (too ambiguous)
34
+
35
+ # Leaf field names whose values are boolean-like and would match too broadly
36
+ _SKIP_LEAF_NAMES = {
37
+ "is_main_driver", "protected", "has_security_device",
38
+ "tracker_fitted", "driving_other_cars",
39
+ }
40
+
41
+ # Top-level section names to skip entirely.
42
+ # `source_document` and `field_citations` are internal provenance fields —
43
+ # they don't contain verbatim PDF values so matching against them is meaningless.
44
+ _SKIP_SECTION_NAMES = {"source_document", "field_citations"}
45
+
46
+ # Document types whose corpora are unreliable for field-level matching.
47
+ # Policy Booklets contain generic boilerplate — matching against them produces
48
+ # false positives for almost every field ("Full", "UK", date digits, etc.).
49
+ _EXCLUDE_FROM_MATCHING: set[str] = {"PolicyBooklet", "Unknown"}
50
+
51
+ # Padding added to each bbox for display. The Docling bbox is a tight text
52
+ # box (~1% page height per line) which is hard to see. We expand it so the
53
+ # highlight is clearly visible without losing positional accuracy.
54
+ _BBOX_PAD_X = 0.4 # % to expand left/right
55
+ _BBOX_PAD_Y = 0.6 # % to expand top/bottom
56
+ _BBOX_MIN_H = 2.0 # % minimum height after padding
57
+
58
+
59
+ # ---------------------------------------------------------------------------
60
+ # Corpus data structures
61
+ # ---------------------------------------------------------------------------
62
+
63
+
64
+ @dataclass
65
+ class CorpusItem:
66
+ """One text element from a Docling DoclingDocument, with browser % geometry."""
67
+
68
+ text: str
69
+ page: int
70
+ bbox: list[float] # [x0%, y0%, x1%, y1%] — top-left origin, 0–100
71
+ source_filename: str
72
+
73
+
74
+ class ProvenanceCorpus:
75
+ """All extractable text elements from one PDF, with their page geometry."""
76
+
77
+ def __init__(self, source_filename: str = "", doc_type: str = "Unknown") -> None:
78
+ self.source_filename = source_filename
79
+ self.doc_type = doc_type # e.g. "Schedule", "Certificate", "PolicyBooklet"
80
+ self.items: list[CorpusItem] = []
81
+
82
+ # ------------------------------------------------------------------
83
+ # Public API
84
+ # ------------------------------------------------------------------
85
+
86
+ def add_from_docling(self, doc: Any, filename: str) -> None:
87
+ """
88
+ Populate the corpus from a Docling DoclingDocument.
89
+
90
+ Safely handles API variations across docling versions — logs a warning
91
+ rather than propagating exceptions, so the calling pipeline stays alive
92
+ even if provenance extraction fails.
93
+ """
94
+ self.source_filename = filename
95
+ try:
96
+ self._extract_items(doc, filename)
97
+ logger.debug(
98
+ "Corpus '%s': %d items, %d pages",
99
+ filename, len(self.items), self._count_pages(doc),
100
+ )
101
+ except Exception as exc: # noqa: BLE001
102
+ logger.warning(
103
+ "Provenance extraction skipped for '%s': %s", filename, exc
104
+ )
105
+
106
+ # ------------------------------------------------------------------
107
+ # Private helpers
108
+ # ------------------------------------------------------------------
109
+
110
+ def _extract_items(self, doc: Any, filename: str) -> None:
111
+ page_sizes = _build_page_sizes(doc)
112
+ if not page_sizes:
113
+ logger.debug("No page size data for '%s' — provenance skipped", filename)
114
+ return
115
+
116
+ for item in _iter_items(doc):
117
+ text = _item_text(item)
118
+ if not text or len(text) < 2:
119
+ continue
120
+ for prov in getattr(item, "prov", []):
121
+ self._add_prov_item(prov, text, filename, page_sizes)
122
+
123
+ def _add_prov_item(
124
+ self,
125
+ prov: Any,
126
+ text: str,
127
+ filename: str,
128
+ page_sizes: dict[int, tuple[float, float]],
129
+ ) -> None:
130
+ page_no = getattr(prov, "page_no", None)
131
+ if page_no is None:
132
+ return
133
+ page_no = int(page_no)
134
+ if page_no not in page_sizes:
135
+ return
136
+
137
+ pw, ph = page_sizes[page_no]
138
+ bbox = getattr(prov, "bbox", None)
139
+ if bbox is None:
140
+ return
141
+
142
+ l = float(getattr(bbox, "l", 0))
143
+ t_v = float(getattr(bbox, "t", ph)) # top in PDF space (high y value)
144
+ r = float(getattr(bbox, "r", pw))
145
+ b = float(getattr(bbox, "b", 0)) # bottom in PDF space (low y value)
146
+
147
+ # Convert: PDF (bottom-left origin, pts) → browser % (top-left origin)
148
+ x0 = _clamp(l / pw * 100)
149
+ y0 = _clamp((ph - t_v) / ph * 100) # top of element in browser coords
150
+ x1 = _clamp(r / pw * 100)
151
+ y1 = _clamp((ph - b) / ph * 100) # bottom of element in browser coords
152
+
153
+ self.items.append(CorpusItem(
154
+ text=text,
155
+ page=page_no,
156
+ bbox=[round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)],
157
+ source_filename=filename,
158
+ ))
159
+
160
+ @staticmethod
161
+ def _count_pages(doc: Any) -> int:
162
+ return len(getattr(doc, "pages", {}))
163
+
164
+
165
+ # ---------------------------------------------------------------------------
166
+ # Module-level helpers for corpus building
167
+ # ---------------------------------------------------------------------------
168
+
169
+
170
+ def _build_page_sizes(doc: Any) -> dict[int, tuple[float, float]]:
171
+ sizes: dict[int, tuple[float, float]] = {}
172
+ for page_no, page_item in getattr(doc, "pages", {}).items():
173
+ size = getattr(page_item, "size", None)
174
+ if size:
175
+ w = float(getattr(size, "width", 0))
176
+ h = float(getattr(size, "height", 0))
177
+ if w > 0 and h > 0:
178
+ sizes[int(page_no)] = (w, h)
179
+ return sizes
180
+
181
+
182
+ def _iter_items(doc: Any):
183
+ """Yield all document items, trying iterate_items() first then .texts/.tables."""
184
+ try:
185
+ for item, _level in doc.iterate_items():
186
+ yield item
187
+ except AttributeError:
188
+ for item in getattr(doc, "texts", []):
189
+ yield item
190
+ for item in getattr(doc, "tables", []):
191
+ yield item
192
+
193
+
194
+ def _item_text(item: Any) -> str:
195
+ """Extract a string from a Docling TextItem or TableItem."""
196
+ text = getattr(item, "text", None)
197
+ if text is not None:
198
+ return str(text).strip()
199
+ # TableItem: concatenate all cell text into one searchable blob
200
+ data = getattr(item, "data", None)
201
+ if data is not None:
202
+ cells = [
203
+ str(getattr(cell, "text", "")).strip()
204
+ for row in getattr(data, "grid", [])
205
+ for cell in row
206
+ ]
207
+ return " | ".join(c for c in cells if c)
208
+ return ""
209
+
210
+
211
+ def _clamp(v: float) -> float:
212
+ return max(0.0, min(100.0, v))
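A worked instance of the coordinate conversion described in the module docstring may help when checking highlights by hand. The sketch below assumes a bottom-left-origin bbox in points, as the module does; `to_browser_pct` is a hypothetical standalone function, and the page size and box values are made up for easy arithmetic.

```python
from __future__ import annotations


def clamp(v: float) -> float:
    """Keep a percentage inside the 0-100 range."""
    return max(0.0, min(100.0, v))


def to_browser_pct(
    l: float, t: float, r: float, b: float, page_w: float, page_h: float
) -> list[float]:
    """Convert a PDF-space bbox (bottom-left origin, pt) to [x0, y0, x1, y1] in %.

    The y axis is flipped: the PDF top edge `t` becomes the browser y0.
    """
    x0 = clamp(l / page_w * 100)
    y0 = clamp((page_h - t) / page_h * 100)
    x1 = clamp(r / page_w * 100)
    y1 = clamp((page_h - b) / page_h * 100)
    return [round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)]


# A 600 x 800 pt page with a text box near the top-left corner.
box = to_browser_pct(l=60, t=760, r=300, b=720, page_w=600, page_h=800)

# A box whose top edge lies above the page is clamped to 0 %.
overflow = to_browser_pct(l=60, t=900, r=300, b=720, page_w=600, page_h=800)
```

In the first case the box spans 10 % to 50 % horizontally and 5 % to 10 % from the top of the page, which matches where a line of text near the top would be drawn in the browser.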
213
+
214
+
215
+ # ---------------------------------------------------------------------------
216
+ # Field-level provenance builder (main public function)
217
+ # ---------------------------------------------------------------------------
218
+
219
+
220
+ def build_provenance(
221
+ record: Any, # UKMotorGoldenRecord
222
+ corpora: list[ProvenanceCorpus],
223
+ ) -> list[Any]: # list[FieldProvenance]
224
+ """
225
+ Walk the Golden Record and fuzzy-match each extracted value against all
226
+ trusted corpora (Schedule, Certificate, StatementOfFact).
227
+
228
+ Policy Booklet corpora are excluded — they contain generic boilerplate
229
+ that produces false positives for almost every field value.
230
+
231
+ Returns a ``FieldProvenance`` entry for every field that can be located
232
+ above the match threshold. Fields with no good corpus match are omitted —
233
+ the UI shows them as "No location data".
234
+ """
235
+ from schema import FieldProvenance, Location # local import avoids circular dep
236
+
237
+ try:
238
+ from rapidfuzz import fuzz as rfuzz
239
+ except ImportError:
240
+ logger.warning(
241
+ "rapidfuzz not installed — provenance matching disabled. "
242
+ "Run: pip install rapidfuzz"
243
+ )
244
+ return []
245
+
246
+ # Filter to trusted corpora only (exclude Policy Booklet and Unknown docs)
247
+ trusted_corpora = [
248
+ c for c in corpora if c.doc_type not in _EXCLUDE_FROM_MATCHING
249
+ ]
250
+ if not trusted_corpora:
251
+ logger.warning(
252
+ "No trusted corpora available — all %d corpus/corpora are excluded "
253
+ "(types: %s). Provenance will be empty.",
254
+ len(corpora),
255
+ [c.doc_type for c in corpora],
256
+ )
257
+ return []
258
+
259
+ # LLM-supplied verbatim source quotes: field_path → raw text phrase.
260
+ # These are always preferred over the normalised extracted value because
261
+ # the LLM copies them directly from the document (e.g. "15/04/2026 at 00:00
262
+ # hours" rather than the ISO "2026-04-15T00:00:00" we store in the record).
263
+ citation_map: dict[str, str] = dict(getattr(record, "field_citations", None) or {})
264
+ logger.info(" field_citations from LLM: %d entries", len(citation_map))
265
+
266
+ results: list[FieldProvenance] = []
267
+ citation_hits = 0
268
+ # Track assigned positions to avoid two fields pointing to the same corpus item.
269
+ # Key: (source_filename, page, x0, y0) — unpadded, original corpus position.
270
+ used_positions: set[tuple] = set()
271
+
272
+ for field_path, value_str in _walk_record(record):
273
+ leaf = re.sub(r"(\[\d+\])+$", "", field_path.split(".")[-1])  # drop list indices only; keeps digits in names such as children_under_16
274
+ if leaf in _SKIP_LEAF_NAMES:
275
+ continue
276
+
277
+ # Prefer the verbatim citation quote; fall back to the normalised value.
278
+ # For ISO dates/datetimes also try UK DD/MM/YYYY format as a secondary fallback.
279
+ search_str = citation_map.get(field_path, value_str)
280
+ alt_search: str | None = None
281
+ if field_path not in citation_map:
282
+ alt_search = _iso_to_uk_date(value_str)
283
+
284
+ if len(search_str) < _MIN_VALUE_LEN:
285
+ continue
286
+
287
+ using_citation = field_path in citation_map
288
+ # When matching a citation quote use partial_ratio — the quote is a
289
+ # verbatim substring of the document and WRatio penalises length disparity.
290
+ # For normalised fallback values use WRatio to avoid short false matches.
291
+ score_fn = rfuzz.partial_ratio if using_citation else rfuzz.WRatio
292
+ threshold = _CITATION_THRESHOLD if using_citation else _MATCH_THRESHOLD
293
+
294
+ # Find best match, preferring positions not yet assigned to another field.
295
+ best_score = 0
296
+ best_item: CorpusItem | None = None
297
+ best_unused_score = 0
298
+ best_unused_item: CorpusItem | None = None
299
+
300
+ for corpus in trusted_corpora:
301
+ for item in corpus.items:
302
+ score = score_fn(search_str.lower(), item.text.lower())
303
+ # Also try UK-formatted date if available
304
+ if alt_search and score < threshold:
305
+ alt_score = rfuzz.partial_ratio(alt_search, item.text.lower())
306
+ if alt_score > score:
307
+ score = alt_score
308
+ pos_key = (item.source_filename, item.page, item.bbox[0], item.bbox[1])
309
+ if score > best_score:
310
+ best_score = score
311
+ best_item = item
312
+ if score > best_unused_score and pos_key not in used_positions:
313
+ best_unused_score = score
314
+ best_unused_item = item
315
+
316
+ # Prefer an unused position if it scores above threshold,
317
+ # otherwise fall back to best overall (may share a location).
318
+ if best_unused_item is not None and best_unused_score >= threshold:
319
+ chosen_item = best_unused_item
320
+ chosen_score = best_unused_score
321
+ elif best_item is not None and best_score >= threshold:
322
+ chosen_item = best_item
323
+ chosen_score = best_score
324
+ else:
325
+ continue
326
+
327
+ pos_key = (chosen_item.source_filename, chosen_item.page, chosen_item.bbox[0], chosen_item.bbox[1])
328
+ used_positions.add(pos_key)
329
+
330
+ if using_citation:
331
+ citation_hits += 1
332
+ results.append(FieldProvenance(
333
+ field_path=field_path,
334
+ extracted_value=value_str,
335
+ matched_text=chosen_item.text[:200], # truncate very long table blobs
336
+ match_score=round(chosen_score / 100.0, 3),
337
+ source_filename=chosen_item.source_filename,
338
+ location=Location(
339
+ page=chosen_item.page,
340
+ bbox=_padded_bbox(chosen_item.bbox),
341
+ ),
342
+ ))
343
+
344
+ total = _count_total_fields(record)
345
+ logger.info(
346
+ "Provenance: %d / %d fields located (%d via citation quotes, %d via fuzzy fallback) "
347
+ "— trusted corpora: %s",
348
+ len(results), total,
349
+ citation_hits, len(results) - citation_hits,
350
+ [c.source_filename for c in trusted_corpora],
351
+ )
352
+ return results
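Review note: the "prefer an unused position, otherwise fall back to the best overall match" rule in `build_provenance` can be illustrated in isolation. The sketch below is not the project's code. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the rapidfuzz scorers, and `Item`, `score`, `pick_item`, and the threshold of 80 are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a corpus item: text plus a position key."""
    text: str
    pos: tuple  # e.g. (source_filename, page, x0, y0)


def score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_item(search: str, items: list, used: set, threshold: float = 80.0):
    """Return the best-matching item, preferring positions not yet claimed."""
    best = best_unused = None
    best_score = best_unused_score = 0.0
    for item in items:
        s = score(search, item.text)
        if s > best_score:
            best, best_score = item, s
        if s > best_unused_score and item.pos not in used:
            best_unused, best_unused_score = item, s
    if best_unused is not None and best_unused_score >= threshold:
        chosen = best_unused
    elif best is not None and best_score >= threshold:
        chosen = best  # every good match is taken, so share a location
    else:
        return None  # nothing above threshold: the field is omitted
    used.add(chosen.pos)
    return chosen


items = [
    Item("AB12 XYZ", ("s.pdf", 1, 10.0, 20.0)),
    Item("AB12 XYZ", ("s.pdf", 2, 10.0, 50.0)),
]
used: set = set()
first = pick_item("AB12 XYZ", items, used)   # claims the page-1 position
second = pick_item("AB12 XYZ", items, used)  # page 1 is taken, so page 2
third = pick_item("AB12 XYZ", items, used)   # both taken: falls back to page 1
```

Tracking both a best-overall and a best-unused candidate in one pass means two fields with the same text (for example a VRM printed on two pages) get distinct highlights where possible, while a field is still located when every matching position has already been claimed.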
353
+
354
+
355
+ # ---------------------------------------------------------------------------
356
+ # Field-walking helpers
357
+ # ---------------------------------------------------------------------------
358
+
359
+
360
+ def _walk_record(record: Any) -> Iterator[tuple[str, str]]:
361
+ """Yield (field_path, string_value) for all non-None leaf values in the record."""
362
+ data = record.model_dump(exclude_none=True)
363
+ yield from _walk_dict(data, "")
364
+
365
+
366
+ def _walk_dict(d: dict, prefix: str) -> Iterator[tuple[str, str]]:
367
+ for key, val in d.items():
368
+ # Skip whole sections that produce unreliable or irrelevant matches
369
+ top_key = prefix.split(".")[0].split("[")[0] if prefix else key
370
+ if key in _SKIP_SECTION_NAMES or top_key in _SKIP_SECTION_NAMES:
371
+ continue
372
+ path = f"{prefix}.{key}" if prefix else key
373
+ if isinstance(val, dict):
374
+ yield from _walk_dict(val, path)
375
+ elif isinstance(val, list):
376
+ yield from _walk_list(val, path)
377
+ elif val is not None:
378
+ yield path, str(val)
379
+
380
+
381
+ def _walk_list(lst: list, prefix: str) -> Iterator[tuple[str, str]]:
382
+ for i, item in enumerate(lst):
383
+ path = f"{prefix}[{i}]"
384
+ if isinstance(item, dict):
385
+ yield from _walk_dict(item, path)
386
+ elif item is not None:
387
+ yield path, str(item)
388
+
389
+
390
+ def _count_total_fields(record: Any) -> int:
391
+ data = record.model_dump(exclude_none=True)
392
+ return sum(1 for _ in _walk_dict(data, ""))
393
+
394
+
395
+ # ISO 8601 date/datetime patterns → UK DD/MM/YYYY
396
+ _ISO_DATE_RE = re.compile(r'^(\d{4})-(\d{2})-(\d{2})')
397
+
398
+
399
+ def _iso_to_uk_date(value: str) -> str | None:
400
+ """Convert ISO date/datetime string to UK DD/MM/YYYY for document matching.
401
+
402
+ Returns the UK-format string (e.g. "15/04/2026") if value looks like an
403
+ ISO date, otherwise returns None.
404
+ """
405
+ m = _ISO_DATE_RE.match(value.strip())
406
+ if m:
407
+ yyyy, mm, dd = m.group(1), m.group(2), m.group(3)
408
+ return f"{dd}/{mm}/{yyyy}"
409
+ return None
410
+
411
+
412
+ def _padded_bbox(bbox: list[float]) -> list[float]:
413
+ """Expand a tight Docling text bbox so highlights are clearly visible in the UI."""
414
+ x0, y0, x1, y1 = bbox
415
+ x0 = _clamp(x0 - _BBOX_PAD_X)
416
+ y0 = _clamp(y0 - _BBOX_PAD_Y)
417
+ x1 = _clamp(x1 + _BBOX_PAD_X)
418
+ y1 = _clamp(y1 + _BBOX_PAD_Y)
419
+ # Enforce minimum height so single-line text is always visible
420
+ if (y1 - y0) < _BBOX_MIN_H:
421
+ mid = (y0 + y1) / 2
422
+ y0 = _clamp(mid - _BBOX_MIN_H / 2)
423
+ y1 = _clamp(mid + _BBOX_MIN_H / 2)
424
+ return [round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)]
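Review note: the field-walking helpers and the ISO-to-UK date fallback are easier to follow with a concrete input. The sketch below is a condensed, standalone variant written for illustration, not the module's code: `walk` folds `_walk_dict` and `_walk_list` into one function and omits the `_SKIP_SECTION_NAMES` filter (that constant is defined outside the lines shown here), and the sample record is made up.

```python
from __future__ import annotations

import re
from typing import Any, Iterator

ISO_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def walk(value: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (field_path, string_value) for every non-None leaf."""
    if isinstance(value, dict):
        for key, val in value.items():
            yield from walk(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from walk(item, f"{prefix}[{i}]")
    elif value is not None:
        yield prefix, str(value)


def iso_to_uk_date(value: str) -> str | None:
    """Return DD/MM/YYYY if value starts with an ISO date, else None."""
    m = ISO_DATE_RE.match(value.strip())
    if m is None:
        return None
    yyyy, mm, dd = m.groups()
    return f"{dd}/{mm}/{yyyy}"


# Made-up record in the shape produced by model_dump().
record = {
    "policy_header": {
        "policy_number": "POL-001",
        "period_of_cover": {
            "start_date": "2026-04-15T00:00:00",
            "expiry_date": None,  # None leaves are skipped
        },
    },
    "driver_details": [{"name": "ALICE SMITH"}],
}
paths = dict(walk(record))
uk_date = iso_to_uk_date(paths["policy_header.period_of_cover.start_date"])
```

Nested keys are joined with dots and list elements get an `[i]` suffix, which is the same path format used as keys in `field_citations`.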
src/schema.py ADDED
@@ -0,0 +1,205 @@
1
+ """
2
+ schema.py — Canonical Pydantic V2 data models for UK Motor Insurance extraction.
3
+
4
+ UKMotorGoldenRecord is the top-level output produced by the pipeline.
5
+ All sub-model fields are Optional to support partial per-document extractions;
6
+ the Arbiter produces the final complete record.
7
+
8
+ DocumentType and SourceMetadata are internal provenance types excluded from
9
+ the serialised Golden Record output (source_document uses Field(exclude=True)).
10
+ """
11
+ from __future__ import annotations
12
+
13
+ from datetime import date, datetime
14
+ from enum import Enum
15
+ from typing import Dict, List, Optional, Union
16
+
17
+ from pydantic import BaseModel, Field
18
+
19
+
20
+ # ---------------------------------------------------------------------------
21
+ # Internal provenance (not in the serialised output)
22
+ # ---------------------------------------------------------------------------
23
+
24
+
25
+ class DocumentType(str, Enum):
26
+ """Source document classification used for provenance and priority routing."""
27
+
28
+ SCHEDULE = "Schedule"
29
+ CERTIFICATE = "Certificate"
30
+ STATEMENT_OF_FACT = "StatementOfFact"
31
+ POLICY_BOOKLET = "PolicyBooklet"
32
+ UNKNOWN = "Unknown"
33
+
34
+
35
+ class SourceMetadata(BaseModel):
36
+ """Attached to every extraction so the arbiter can trace data lineage."""
37
+
38
+ document_type: DocumentType = DocumentType.UNKNOWN
39
+ filename: str = ""
40
+ page_count: Optional[int] = None
41
+
42
+
43
+ # ---------------------------------------------------------------------------
44
+ # Golden Record sub-models
45
+ # ---------------------------------------------------------------------------
46
+
47
+
48
+ class PeriodOfCover(BaseModel):
49
+ start_date: Optional[datetime] = None
50
+ expiry_date: Optional[datetime] = None
51
+ issue_date: Optional[date] = None
52
+
53
+
54
+ class PolicyHeader(BaseModel):
55
+ policy_number: Optional[str] = None
56
+ insurer: Optional[str] = None
57
+ product_name: Optional[str] = None
58
+ period_of_cover: Optional[PeriodOfCover] = None
59
+
60
+
61
+ class SecurityDetails(BaseModel):
62
+ has_security_device: Optional[bool] = None
63
+ tracker_fitted: Optional[bool] = None
64
+ modifications: Optional[str] = None
65
+
66
+
67
+ class VehicleDetails(BaseModel):
68
+ vrm: Optional[str] = None
69
+ make: Optional[str] = None
70
+ model: Optional[str] = None
71
+ fuel_type: Optional[str] = None
72
+ transmission: Optional[str] = None
73
+ estimated_value: Optional[str] = None
74
+ annual_mileage: Optional[int] = None
75
+ overnight_postcode: Optional[str] = None
76
+ kept_location: Optional[str] = None
77
+ security: Optional[SecurityDetails] = None
78
+
79
+
80
+ class Driver(BaseModel):
81
+ name: str
82
+ dob: Optional[date] = None
83
+ relationship: Optional[str] = None
84
+ occupation: Optional[str] = None
85
+ license_type: Optional[str] = None
86
+ is_main_driver: bool = False
87
+ specific_excess: Optional[float] = None
88
+
89
+
90
+ class NoClaimsDiscount(BaseModel):
91
+ years: Optional[int] = None
92
+ protected: Optional[bool] = None
93
+
94
+
95
+ class ExcessBreakdown(BaseModel):
96
+ standard_compulsory: Optional[float] = None
97
+ voluntary: Optional[float] = None
98
+ total_accidental_damage: Optional[float] = None
99
+ fire: Optional[float] = None
100
+ theft: Optional[float] = None
101
+ windscreen_repair: Optional[float] = None
102
+ windscreen_replacement: Optional[float] = None
103
+ own_repairer_additional_excess: Optional[float] = None
104
+
105
+
106
+ class CoverAndExcesses(BaseModel):
107
+ cover_type: Optional[str] = None
108
+ class_of_use: Optional[str] = None
109
+ driving_other_cars: Optional[bool] = None
110
+ no_claims_discount: Optional[NoClaimsDiscount] = None
111
+ excess_breakdown: Optional[ExcessBreakdown] = None
112
+
113
+
114
+ class OptionalExtras(BaseModel):
115
+ motor_legal_protection: Optional[Union[float, str]] = None
116
+ breakdown_roadside_assistance: Optional[Union[float, str]] = None
117
+ enhanced_personal_accident: Optional[Union[float, str]] = None
118
+ hire_car: Optional[Union[float, str]] = None
119
+ key_cover: Optional[Union[float, str]] = None
120
+
121
+
122
+ class FinancialSummary(BaseModel):
123
+ total_annual_premium: Optional[float] = None
124
+ optional_extras: Optional[OptionalExtras] = None
125
+
126
+
127
+ class AdditionalRiskData(BaseModel):
128
+ home_ownership: Optional[str] = None
129
+ children_under_16: Optional[bool] = None
130
+ number_of_cars_in_household: Optional[int] = None
131
+ non_motoring_convictions: Optional[bool] = None
132
+ endorsements: Optional[str] = None
133
+
134
+
135
+ # ---------------------------------------------------------------------------
136
+ # Top-level Golden Record
137
+ # ---------------------------------------------------------------------------
138
+
139
+
140
+ class UKMotorGoldenRecord(BaseModel):
141
+ """
142
+ Final authoritative policy record produced by the Arbiter.
143
+
144
+ All section fields are Optional so that partial per-document extractions
145
+ remain valid Pydantic objects. source_document is internal provenance
146
+ and is excluded from model_dump_json().
147
+ """
148
+
149
+ policy_header: Optional[PolicyHeader] = None
150
+ vehicle_details: Optional[VehicleDetails] = None
151
+ driver_details: List[Driver] = Field(default_factory=list)
152
+ cover_and_excesses: Optional[CoverAndExcesses] = None
153
+ financial_summary: Optional[FinancialSummary] = None
154
+ additional_risk_data: Optional[AdditionalRiskData] = None
155
+
156
+ # Verbatim source quotes for provenance matching.
157
+ # The LLM populates this mapping field_path → exact phrase copied from the document.
158
+ # Used by provenance.py to locate each field in the PDF even when the extracted
159
+ # value has been normalised (ISO dates, £ amounts, etc.).
160
+ # Excluded from the final serialised output so it doesn't appear in downstream JSON.
161
+ field_citations: Optional[Dict[str, str]] = Field(default=None, exclude=True)
162
+
163
+ # Internal provenance — excluded from serialised output
164
+ source_document: Optional[SourceMetadata] = Field(default=None, exclude=True)
165
+
166
+
167
+ # ---------------------------------------------------------------------------
168
+ # Provenance and Human-in-the-Loop review models
169
+ # ---------------------------------------------------------------------------
170
+
171
+
172
+ class Location(BaseModel):
173
+ """Geometric location of a field's source text, in browser % coords (top-left origin)."""
174
+
175
+ page: int
176
+ bbox: List[float] # [x0%, y0%, x1%, y1%]
177
+
178
+
179
+ class FieldProvenance(BaseModel):
180
+ """Maps one Golden Record field to its source text element in the PDF."""
181
+
182
+ field_path: str # e.g. "vehicle_details.vrm"
183
+ extracted_value: str # the value produced by the LLM
184
+ matched_text: str # the corpus snippet that best matches it
185
+ match_score: float # 0.0–1.0 (1.0 = perfect)
186
+ source_filename: str # which PDF this came from
187
+ location: Location # page + bbox in browser % coords
188
+
189
+
190
+ class ConflictEntry(BaseModel):
191
+ """Records a field where Schedule and Certificate held different values."""
192
+
193
+ field: str # dotted field path, e.g. "policy_header.policy_number"
194
+ schedule_value: Optional[str] = None
195
+ certificate_value: Optional[str] = None
196
+ winner: str # "schedule" | "certificate" | "fallback"
197
+
198
+
199
+ class GoldenRecordWithProvenance(BaseModel):
200
+ """Full pipeline output for the Visual Audit Review UI."""
201
+
202
+ record: UKMotorGoldenRecord
203
+ provenance: List[FieldProvenance] = Field(default_factory=list)
204
+ conflicts: List[ConflictEntry] = Field(default_factory=list)
205
+ session_id: Optional[str] = None
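Review note: `Location.bbox` is documented as `[x0%, y0%, x1%, y1%]` in browser percentage coordinates with a top-left origin. The sketch below shows one way such a box could be derived, assuming the source box is given in page units with a bottom-left origin (a common PDF convention). That assumption, the function name `to_browser_percent`, and the sample page size are all hypothetical; the conversion actually used by the pipeline is not part of the lines shown in this diff.

```python
def to_browser_percent(
    left: float,
    bottom: float,
    right: float,
    top: float,
    page_width: float,
    page_height: float,
) -> list[float]:
    """Convert a bottom-left-origin box to [x0%, y0%, x1%, y1%], top-left origin."""
    x0 = 100.0 * left / page_width
    x1 = 100.0 * right / page_width
    # Flip the vertical axis: the box's top edge becomes the smaller y value.
    y0 = 100.0 * (page_height - top) / page_height
    y1 = 100.0 * (page_height - bottom) / page_height
    return [round(x0, 3), round(y0, 3), round(x1, 3), round(y1, 3)]


# Made-up page of 600 x 800 units with a box near the top of the page.
bbox = to_browser_percent(60, 720, 300, 760, page_width=600, page_height=800)
```

Percentages keep the highlight independent of the rendered page size, so the UI can position an overlay with CSS percentages regardless of zoom level.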
src/settings.py ADDED
@@ -0,0 +1,142 @@
1
+ """
2
+ settings.py — Pipeline configuration loader.
3
+
4
+ Merges values from config/settings.yaml with environment variable overrides.
5
+ Also calls load_dotenv() so importing this module anywhere in the pipeline
6
+ is sufficient to activate .env — no separate setup needed.
7
+
8
+ Precedence (highest → lowest)
9
+ ──────────────────────────────
10
+ 1. Environment variables (GROQ_MODEL, etc.)
11
+ 2. config/settings.yaml
12
+ 3. Pydantic model field defaults (safety net)
13
+
14
+ Usage
15
+ -----
16
+ from settings import settings
17
+
18
+ model = settings.llm.model # respects GROQ_MODEL env var
19
+ retries = settings.llm.max_retries
20
+ thresh = settings.pii.score_threshold
21
+ """
22
+ from __future__ import annotations
23
+
24
+ import logging
25
+ import os
26
+ from pathlib import Path
27
+ from typing import Optional
28
+
29
+ import yaml
30
+ from dotenv import load_dotenv
31
+ from pydantic import BaseModel, Field
32
+
33
+ # Load .env file before anything else reads os.environ
34
+ load_dotenv()
35
+
36
+ logger = logging.getLogger(__name__)
37
+
38
+ _DEFAULT_CONFIG_PATH = Path(__file__).parent.parent / "config" / "settings.yaml"
39
+
40
+ # ---------------------------------------------------------------------------
41
+ # Sub-models
42
+ # ---------------------------------------------------------------------------
43
+
44
+ _DEFAULT_ENTITIES = [
45
+ "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS",
46
+ "UK_NHS", "UK_NIN", "CREDIT_CARD", "IBAN_CODE",
47
+ "LOCATION", "IP_ADDRESS", "URL",
48
+ ]
49
+
50
+
51
+ class LLMSettings(BaseModel):
52
+ model: str = "llama-3.3-70b-versatile"
53
+ classifier_model: str = "llama-3.1-8b-instant"
54
+ max_retries: int = 2
55
+
56
+
57
+ class PIISettings(BaseModel):
58
+ score_threshold: float = 0.5
59
+ mask_dates: bool = False
60
+ language: str = "en"
61
+ entities: list[str] = Field(default_factory=lambda: list(_DEFAULT_ENTITIES))
62
+
63
+
64
+ class PipelineSettings(BaseModel):
65
+ output_path: str = "./output/golden_record.json"
66
+ log_level: str = "INFO"
67
+ session_ttl_days: int = 30 # sessions older than this are removed on API startup (0 = disabled)
68
+
69
+
70
+ class DebugSettings(BaseModel):
71
+ enabled: bool = True
72
+ output_dir: str = "./output/debug"
73
+ save_markdown: bool = True
74
+ save_masked_markdown: bool = True
75
+ save_extraction_json: bool = True
76
+ save_metrics: bool = True
77
+
78
+
79
+ class DoclingSettings(BaseModel):
80
+ do_ocr: bool = False
81
+ do_table_structure: bool = False
82
+ # Per-document-type page caps (None = no limit)
83
+ max_pages: dict[str, int | None] = Field(
84
+ default_factory=lambda: {
85
+ "Schedule": None,
86
+ "Certificate": None,
87
+ "StatementOfFact": None,
88
+ "PolicyBooklet": 20,
89
+ "Unknown": 30,
90
+ }
91
+ )
92
+
93
+
94
+ class Settings(BaseModel):
95
+ llm: LLMSettings = Field(default_factory=LLMSettings)
96
+ pii: PIISettings = Field(default_factory=PIISettings)
97
+ pipeline: PipelineSettings = Field(default_factory=PipelineSettings)
98
+ debug: DebugSettings = Field(default_factory=DebugSettings)
99
+ docling: DoclingSettings = Field(default_factory=DoclingSettings)
100
+
101
+ @classmethod
102
+ def load(cls, config_path: Optional[str | Path] = None) -> "Settings":
103
+ """
104
+ Load settings from YAML, then apply environment variable overrides.
105
+
106
+ Parameters
107
+ ----------
108
+ config_path : str | Path | None
109
+ Path to a settings YAML file. Defaults to config/settings.yaml.
110
+ """
111
+ path = Path(config_path) if config_path else _DEFAULT_CONFIG_PATH
112
+ data: dict = {}
113
+
114
+ if path.exists():
115
+ with path.open(encoding="utf-8") as fh:
116
+ data = yaml.safe_load(fh) or {}
117
+ logger.debug("Settings loaded from %s", path)
118
+ else:
119
+ logger.warning(
120
+ "Settings file not found at %s — using defaults.", path
121
+ )
122
+
123
+ instance = cls.model_validate(data)
124
+
125
+ # ── Environment variable overrides ─────────────────────────────────
126
+ # GROQ_MODEL wins over both settings.yaml and the Pydantic default.
127
+ if groq_model := os.environ.get("GROQ_MODEL"):
128
+ instance.llm.model = groq_model
129
+ logger.debug("LLM model overridden by GROQ_MODEL env var: %s", groq_model)
130
+
131
+ if classifier_model := os.environ.get("GROQ_CLASSIFIER_MODEL"):
132
+ instance.llm.classifier_model = classifier_model
133
+ logger.debug("Classifier model overridden by GROQ_CLASSIFIER_MODEL env var: %s", classifier_model)
134
+
135
+ return instance
136
+
137
+
138
+ # ---------------------------------------------------------------------------
139
+ # Module-level singleton — import this everywhere
140
+ # ---------------------------------------------------------------------------
141
+
142
+ settings = Settings.load()
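Review note: the precedence order in the `settings.py` docstring (environment variable, then YAML, then model defaults) can be checked with a small standard-library sketch. This is an illustration only; it does not use Pydantic or PyYAML. `resolve_llm` and `DEFAULTS` are hypothetical names, and the YAML file is represented by an already-parsed dict.

```python
from __future__ import annotations

from typing import Mapping

# Defaults copied from LLMSettings in the diff above.
DEFAULTS = {"model": "llama-3.3-70b-versatile", "max_retries": 2}


def resolve_llm(yaml_data: Mapping, env: Mapping[str, str]) -> dict:
    """Merge defaults < YAML < environment, mirroring Settings.load()."""
    merged = dict(DEFAULTS)
    merged.update(yaml_data.get("llm") or {})
    # An unset or empty GROQ_MODEL is falsy, so it does not override.
    if model := env.get("GROQ_MODEL"):
        merged["model"] = model
    return merged


only_defaults = resolve_llm({}, {})
from_yaml = resolve_llm({"llm": {"model": "model-from-yaml"}}, {})
from_env = resolve_llm(
    {"llm": {"model": "model-from-yaml"}},
    {"GROQ_MODEL": "model-from-env"},
)
```

Passing the environment as an argument instead of reading `os.environ` directly keeps the sketch easy to test; the real loader reads `os.environ` after `load_dotenv()` has run.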
tests/__init__.py ADDED
File without changes
tests/test_arbiter.py ADDED
@@ -0,0 +1,303 @@
1
+ """
2
+ tests/test_arbiter.py — Unit tests for PolicyArbiter.
3
+
4
+ These tests exercise the merge logic in isolation using pure fixture data,
5
+ with no LLM calls or file I/O. Run with:
6
+
7
+ pytest tests/test_arbiter.py -v
8
+
9
+ (From project root with the virtual-env activated.)
10
+ """
11
+ from __future__ import annotations
12
+
13
+ import sys
14
+ from pathlib import Path
15
+
16
+ # Allow importing from src/ without installing the package
17
+ sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
18
+
19
+ import pytest
20
+
21
+ from arbiter import PolicyArbiter
22
+ from schema import (
23
+ AdditionalRiskData,
24
+ ConflictEntry,
25
+ CoverAndExcesses,
26
+ Driver,
27
+ ExcessBreakdown,
28
+ FinancialSummary,
29
+ NoClaimsDiscount,
30
+ OptionalExtras,
31
+ PeriodOfCover,
32
+ PolicyHeader,
33
+ UKMotorGoldenRecord,
34
+ VehicleDetails,
35
+ )
36
+
37
+
38
+ # ---------------------------------------------------------------------------
39
+ # Fixtures
40
+ # ---------------------------------------------------------------------------
41
+
42
+ def _make_schedule(
43
+ policy_number: str = "POL-001",
44
+ insurer: str = "TestInsurer Ltd",
45
+ cover_type: str = "Comprehensive",
46
+ ncb_years: int = 3,
47
+ class_of_use: str | None = None,
48
+ drivers: list[dict] | None = None,
49
+ excess_compulsory: float = 250.0,
50
+ excess_voluntary: float = 150.0,
51
+ premium: float = 600.0,
52
+ vrm: str = "AB12 XYZ",
53
+ ) -> UKMotorGoldenRecord:
54
+ drv_list = [
55
+ Driver(**d) for d in (drivers or [{"name": "ALICE SMITH", "is_main_driver": True}])
56
+ ]
57
+ return UKMotorGoldenRecord(
58
+ policy_header=PolicyHeader(policy_number=policy_number, insurer=insurer),
59
+ vehicle_details=VehicleDetails(vrm=vrm, make="Toyota", model="Corolla"),
60
+ driver_details=drv_list,
61
+ cover_and_excesses=CoverAndExcesses(
62
+ cover_type=cover_type,
63
+ class_of_use=class_of_use,
64
+ no_claims_discount=NoClaimsDiscount(years=ncb_years, protected=False),
65
+ excess_breakdown=ExcessBreakdown(
66
+ standard_compulsory=excess_compulsory,
67
+ voluntary=excess_voluntary,
68
+ total_accidental_damage=excess_compulsory + excess_voluntary,
69
+ ),
70
+ ),
71
+ financial_summary=FinancialSummary(
72
+ total_annual_premium=premium,
73
+ optional_extras=OptionalExtras(),
74
+ ),
75
+ additional_risk_data=AdditionalRiskData(home_ownership="Owned"),
76
+ )
77
+
78
+
79
+ def _make_certificate(
80
+ policy_number: str = "POL-001",
81
+ class_of_use: str = "Social, Domestic and Pleasure",
82
+ driving_other_cars: bool = False,
83
+ drivers: list[dict] | None = None,
84
+ insurer: str | None = None,
85
+ ) -> UKMotorGoldenRecord:
86
+ drv_list = [
87
+ Driver(**d) for d in (drivers or [{"name": "ALICE SMITH", "is_main_driver": True}])
88
+ ]
89
+ return UKMotorGoldenRecord(
90
+ policy_header=PolicyHeader(
91
+ policy_number=policy_number,
92
+ insurer=insurer,
93
+ ),
94
+ driver_details=drv_list,
95
+ cover_and_excesses=CoverAndExcesses(
96
+ class_of_use=class_of_use,
97
+ driving_other_cars=driving_other_cars,
98
+ ),
99
+ )
100
+
101
+
102
+ # ---------------------------------------------------------------------------
103
+ # Basic merge tests
104
+ # ---------------------------------------------------------------------------
105
+
106
+ class TestBasicMerge:
107
+ def test_returns_tuple_with_conflicts_list(self):
108
+ arbiter = PolicyArbiter()
109
+ sched = _make_schedule()
110
+ cert = _make_certificate()
111
+ result = arbiter.merge_records(sched, "sched.pdf", cert, "cert.pdf")
112
+ assert isinstance(result, tuple)
113
+ golden, conflicts = result
114
+ assert isinstance(conflicts, list)
115
+
116
+ def test_vehicle_details_from_schedule(self):
117
+ arbiter = PolicyArbiter()
118
+ sched = _make_schedule(vrm="AB12 XYZ")
119
+ cert = _make_certificate()
120
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
121
+ assert golden.vehicle_details is not None
122
+ assert golden.vehicle_details.vrm == "AB12 XYZ"
123
+
124
+ def test_class_of_use_from_certificate(self):
125
+ arbiter = PolicyArbiter()
126
+ sched = _make_schedule(class_of_use="Social") # schedule has one
127
+ cert = _make_certificate(class_of_use="Social, Domestic and Pleasure") # cert is master
128
+ golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
129
+ assert golden.cover_and_excesses.class_of_use == "Social, Domestic and Pleasure"
130
+
131
+ def test_cover_type_from_schedule(self):
132
+ arbiter = PolicyArbiter()
133
+ sched = _make_schedule(cover_type="Comprehensive")
134
+ cert = _make_certificate()
135
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
136
+ assert golden.cover_and_excesses.cover_type == "Comprehensive"
137
+
138
+ def test_financial_summary_from_schedule(self):
139
+ arbiter = PolicyArbiter()
140
+ sched = _make_schedule(premium=750.0)
141
+ cert = _make_certificate()
142
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
143
+ assert golden.financial_summary.total_annual_premium == 750.0
144
+
145
+ def test_additional_risk_data_from_schedule(self):
146
+ arbiter = PolicyArbiter()
147
+ sched = _make_schedule()
148
+ cert = _make_certificate()
149
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
150
+ assert golden.additional_risk_data.home_ownership == "Owned"
151
+
152
+
153
+ # ---------------------------------------------------------------------------
154
+ # One-sided merge (missing Schedule or Certificate)
155
+ # ---------------------------------------------------------------------------
156
+
157
+ class TestOneSidedMerge:
158
+ def test_empty_schedule_uses_certificate_drivers(self):
159
+ arbiter = PolicyArbiter()
160
+ sched = UKMotorGoldenRecord() # empty
161
+ cert = _make_certificate(
162
+ drivers=[{"name": "BOB JONES", "is_main_driver": True}]
163
+ )
164
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
165
+ assert len(golden.driver_details) == 1
166
+ assert golden.driver_details[0].name == "BOB JONES"
167
+
168
+ def test_empty_certificate_still_merges(self):
169
+ arbiter = PolicyArbiter()
170
+ sched = _make_schedule()
171
+ cert = UKMotorGoldenRecord() # empty
172
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
173
+ assert golden.vehicle_details is not None
174
+ assert golden.cover_and_excesses is not None
175
+
176
+ def test_policy_number_fallback_to_certificate(self):
177
+ arbiter = PolicyArbiter()
178
+ sched = UKMotorGoldenRecord(policy_header=PolicyHeader(policy_number=None))
179
+ cert = _make_certificate(policy_number="CERT-999")
180
+ golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
181
+ assert golden.policy_header.policy_number == "CERT-999"
182
+
183
+
184
+ # ---------------------------------------------------------------------------
185
+ # Conflict detection
186
+ # ---------------------------------------------------------------------------
187
+
188
+ class TestConflictDetection:
189
+ def test_no_conflicts_when_values_match(self):
190
+ arbiter = PolicyArbiter()
191
+ sched = _make_schedule(policy_number="POL-001", insurer="Insurer A")
192
+ cert = _make_certificate(policy_number="POL-001")
193
+ _, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
194
+ policy_number_conflicts = [c for c in conflicts if c.field == "policy_header.policy_number"]
195
+ assert policy_number_conflicts == []
196
+
197
+ def test_conflict_logged_for_differing_policy_numbers(self):
198
+ arbiter = PolicyArbiter()
199
+ sched = _make_schedule(policy_number="POL-001")
200
+ cert = _make_certificate(policy_number="POL-002")
201
+ golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
202
+ conflict_fields = [c.field for c in conflicts]
203
+ assert "policy_header.policy_number" in conflict_fields
204
+ # Schedule wins
205
+ assert golden.policy_header.policy_number == "POL-001"
206
+
207
+ def test_conflict_entry_has_both_values(self):
208
+ arbiter = PolicyArbiter()
209
+ sched = _make_schedule(policy_number="SCHED-100")
210
+ cert = _make_certificate(policy_number="CERT-200")
211
+ _, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
212
+ c = next(x for x in conflicts if x.field == "policy_header.policy_number")
213
+ assert c.schedule_value == "SCHED-100"
214
+ assert c.certificate_value == "CERT-200"
215
+ assert c.winner == "schedule"
216
+
217
+ def test_class_of_use_conflict_certificate_wins(self):
218
+ arbiter = PolicyArbiter()
219
+ sched = _make_schedule(class_of_use="Social Only")
220
+ cert = _make_certificate(class_of_use="Social, Domestic and Pleasure")
221
+ golden, conflicts = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
222
+ c = next((x for x in conflicts if x.field == "cover_and_excesses.class_of_use"), None)
223
+ assert c is not None
224
+ assert c.winner == "certificate"
225
+ assert golden.cover_and_excesses.class_of_use == "Social, Domestic and Pleasure"
226
+
227
+
228
+ # ---------------------------------------------------------------------------
229
+ # Driver merging
+ # ---------------------------------------------------------------------------
+
+ class TestDriverMerge:
+     def test_exact_name_match_enriches_driver(self):
+         arbiter = PolicyArbiter()
+         sched = _make_schedule(
+             drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "dob": None, "relationship": None}]
+         )
+         cert = _make_certificate(
+             drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "relationship": "Proposer"}]
+         )
+         golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+         assert golden.driver_details[0].relationship == "Proposer"
+
+     def test_fuzzy_name_match_merges(self):
+         """Names with minor differences (e.g. missing middle initial) should still match."""
+         arbiter = PolicyArbiter()
+         sched = _make_schedule(
+             drivers=[{"name": "ALICE J SMITH", "is_main_driver": True}]
+         )
+         cert = _make_certificate(
+             drivers=[{"name": "ALICE SMITH", "is_main_driver": True, "relationship": "Proposer"}]
+         )
+         golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+         assert golden.driver_details[0].relationship == "Proposer"
+
+     def test_unmatched_driver_has_no_cert_enrichment(self):
+         """A driver with a completely different name gets no cert data."""
+         arbiter = PolicyArbiter()
+         sched = _make_schedule(
+             drivers=[{"name": "ALICE SMITH", "is_main_driver": True}]
+         )
+         cert = _make_certificate(
+             drivers=[{"name": "BOB JONES", "is_main_driver": True, "relationship": "Spouse"}]
+         )
+         golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+         alice = golden.driver_details[0]
+         assert alice.name == "ALICE SMITH"
+         assert alice.relationship is None  # no cert match, so no enrichment
+
+
+ # ---------------------------------------------------------------------------
+ # field_citations merging
+ # ---------------------------------------------------------------------------
+
+ class TestFieldCitationsMerge:
+     def test_schedule_citations_win_on_conflict(self):
+         arbiter = PolicyArbiter()
+         sched = _make_schedule()
+         cert = _make_certificate()
+         sched.field_citations = {
+             "vehicle_details.vrm": "AB12 XYZ",
+             "policy_header.policy_number": "POL-001 from schedule",
+         }
+         cert.field_citations = {
+             "policy_header.policy_number": "POL-001 from cert",
+             "cover_and_excesses.class_of_use": "Social, Domestic and Pleasure",
+         }
+         golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+         fc = golden.field_citations or {}
+         # Schedule wins the shared key
+         assert fc.get("policy_header.policy_number") == "POL-001 from schedule"
+         # Cert-only key survives
+         assert fc.get("cover_and_excesses.class_of_use") == "Social, Domestic and Pleasure"
+         # Schedule-only key survives
+         assert fc.get("vehicle_details.vrm") == "AB12 XYZ"
+
+     def test_empty_citations_produce_none(self):
+         arbiter = PolicyArbiter()
+         sched = _make_schedule()
+         cert = _make_certificate()
+         golden, _ = arbiter.merge_records(sched, "s.pdf", cert, "c.pdf")
+         # Neither side has citations → merged record has None
+         assert golden.field_citations is None
ui/index.html ADDED
@@ -0,0 +1,13 @@
1
+ <!doctype html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <link rel="icon" type="image/svg+xml" href="/vite.svg" />
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
7
+ <title>PolicyTrace — Motor Insurance IDP · AI Tool Stack</title>
8
+ </head>
9
+ <body>
10
+ <div id="root"></div>
11
+ <script type="module" src="/src/main.tsx"></script>
12
+ </body>
13
+ </html>
ui/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
ui/package.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "name": "motor-policy-review-ui",
3
+ "private": true,
4
+ "version": "0.1.0",
5
+ "type": "module",
6
+ "scripts": {
7
+ "dev": "vite",
8
+ "build": "tsc && vite build",
9
+ "preview": "vite preview"
10
+ },
11
+ "dependencies": {
12
+ "axios": "^1.7.2",
13
+ "react": "^18.3.0",
14
+ "react-dom": "^18.3.0",
15
+ "react-pdf": "^9.1.0",
16
+ "react-router-dom": "^7.15.1",
17
+ "zustand": "^4.5.2"
18
+ },
19
+ "devDependencies": {
20
+ "@types/react": "^18.3.3",
21
+ "@types/react-dom": "^18.3.0",
22
+ "@vitejs/plugin-react": "^4.3.1",
23
+ "autoprefixer": "^10.4.19",
24
+ "postcss": "^8.4.39",
25
+ "tailwindcss": "^3.4.6",
26
+ "typescript": "^5.5.3",
27
+ "vite": "^5.3.4"
28
+ }
29
+ }
ui/postcss.config.js ADDED
@@ -0,0 +1,6 @@
1
+ export default {
2
+ plugins: {
3
+ tailwindcss: {},
4
+ autoprefixer: {},
5
+ },
6
+ }
ui/src/App.tsx ADDED
@@ -0,0 +1,16 @@
1
+ import { Route, Routes } from 'react-router-dom'
2
+ import { UploadPage } from './UploadPage'
3
+ import { SessionPage } from './SessionPage'
4
+
5
+ export default function App() {
6
+ return (
7
+ <Routes>
8
+ <Route path="/" element={<UploadPage />} />
9
+ <Route path="/session/:sessionId" element={<SessionPage />} />
10
+ {/* Catch-all: render the upload page for unknown paths (no redirect, the URL is left unchanged) */}
11
+ <Route path="*" element={<UploadPage />} />
12
+ </Routes>
13
+ )
14
+ }
15
+
16
+
ui/src/FieldRow.tsx ADDED
@@ -0,0 +1,201 @@
1
+ import { type CSSProperties, useState } from 'react'
2
+ import type { FieldEntry, FieldReview } from './types'
3
+ import { useStore } from './store'
4
+
5
+ interface Props {
6
+ entry: FieldEntry
7
+ sessionId: string
8
+ isActive: boolean
9
+ review?: FieldReview
10
+ onClick: () => void
11
+ }
12
+
13
+ export function FieldRow({ entry, sessionId, isActive, review, onClick }: Props) {
14
+ const [editing, setEditing] = useState(false)
15
+ const [editValue, setEditValue] = useState(entry.value ?? '')
16
+ const verifyField = useStore((s) => s.verifyField)
17
+ const overrideField = useStore((s) => s.overrideField)
18
+ const rejectField = useStore((s) => s.rejectField)
19
+
20
+ const displayValue = review?.action === 'override' && review.overridden_value != null
21
+ ? review.overridden_value
22
+ : entry.value
23
+
24
+ const isVerified = review?.action === 'verify'
25
+ const isRejected = review?.action === 'reject'
26
+ const isOverridden = review?.action === 'override'
27
+
28
+ const borderStyle: CSSProperties = isVerified
29
+ ? { borderColor: '#16a34a', backgroundColor: '#f0fdf4' }
30
+ : isRejected
31
+ ? { borderColor: '#fca5a5', backgroundColor: '#fef2f2' }
32
+ : isOverridden
33
+ ? { borderColor: '#2563EB', backgroundColor: '#eff6ff' }
34
+ : isActive
35
+ ? { borderColor: '#008080', backgroundColor: '#f0fdfc' }
36
+ : { borderColor: 'transparent', backgroundColor: '#ffffff' }
37
+
38
+ const handleSaveOverride = async () => {
39
+ await overrideField(sessionId, entry.fieldPath, editValue)
40
+ setEditing(false)
41
+ }
42
+
43
+ return (
44
+ <div
45
+ className="rounded-lg border px-3 py-2 cursor-pointer transition-all hover:shadow-sm"
46
+ style={borderStyle}
47
+ onClick={onClick}
48
+ >
49
+ <div className="flex items-start gap-2">
50
+ {/* Label + value */}
51
+ <div className="flex-1 min-w-0">
52
+ <div className="flex items-center gap-2 flex-wrap">
53
+ <span className="text-xs font-semibold text-gray-500 shrink-0">
54
+ {entry.label}
55
+ </span>
56
+ {isVerified && (
57
+ <span className="inline-flex items-center gap-0.5 text-xs text-green-700 font-medium">
58
+ <CheckIcon /> Verified
59
+ </span>
60
+ )}
61
+ {isOverridden && (
62
+ <span className="text-xs text-blue-700 font-medium">Overridden</span>
63
+ )}
64
+ {isRejected && (
65
+ <span className="text-xs text-red-600 font-medium">Flagged</span>
66
+ )}
67
+ </div>
68
+
69
+ {/* Value */}
70
+ {editing ? (
71
+ <div
72
+ className="flex gap-2 mt-1"
73
+ onClick={(e) => e.stopPropagation()}
74
+ >
75
+ <input
76
+ autoFocus
77
+ className="flex-1 text-xs border rounded px-2 py-1 focus:outline-none focus:ring-1 focus:ring-blue-400"
78
+ value={editValue}
79
+ onChange={(e) => setEditValue(e.target.value)}
80
+ onKeyDown={(e) => {
81
+ if (e.key === 'Enter') handleSaveOverride()
82
+ if (e.key === 'Escape') setEditing(false)
83
+ }}
84
+ />
85
+ <button
86
+ onClick={handleSaveOverride}
87
+ className="text-xs px-2 py-1 bg-blue-600 text-white rounded hover:bg-blue-700"
88
+ >
89
+ Save
90
+ </button>
91
+ <button
92
+ onClick={() => setEditing(false)}
93
+ className="text-xs px-2 py-1 bg-gray-200 text-gray-700 rounded hover:bg-gray-300"
94
+ >
95
+ Cancel
96
+ </button>
97
+ </div>
98
+ ) : (
99
+ <p className="text-sm text-gray-800 mt-0.5 truncate">
100
+ {displayValue ?? (
101
+ <span className="text-gray-300 italic">Not extracted</span>
102
+ )}
103
+ </p>
104
+ )}
105
+
106
+ {/* Provenance source hint — or explicit "no location" notice */}
107
+ {!editing && (
108
+ entry.provenance ? (
109
+ <p className="text-xs text-gray-400 mt-0.5 truncate">
110
+ {entry.provenance.source_filename} · p.{entry.provenance.location.page} ·{' '}
111
+ <span className="italic">"{entry.provenance.matched_text.slice(0, 60)}{entry.provenance.matched_text.length > 60 ? '…' : ''}"</span>
112
+ </p>
113
+ ) : (
114
+ <p className="text-xs mt-0.5">
115
+ <span className="inline-flex items-center gap-1 px-1.5 py-0.5 rounded bg-gray-100 text-gray-400 font-medium">
116
+ <span aria-hidden>—</span> No location data
117
+ </span>
118
+ </p>
119
+ )
120
+ )}
121
+ </div>
122
+
123
+ {/* Right side: confidence badge + action buttons */}
124
+ <div
125
+ className="flex items-center gap-1 flex-shrink-0"
126
+ onClick={(e) => e.stopPropagation()}
127
+ >
128
+ {entry.provenance && (
129
+ <ConfidenceBadge score={entry.provenance.match_score} />
130
+ )}
131
+
132
+ {/* Verify */}
133
+ <button
134
+ title="Mark as verified"
135
+ onClick={() => verifyField(sessionId, entry.fieldPath)}
136
+ className={`w-7 h-7 rounded flex items-center justify-center text-sm transition-colors ${
137
+ isVerified
138
+ ? 'bg-green-500 text-white'
139
+ : 'bg-gray-100 text-gray-500 hover:bg-green-100 hover:text-green-700'
140
+ }`}
141
+ >
142
+
143
+ </button>
144
+
145
+ {/* Edit */}
146
+ <button
147
+ title="Override value"
148
+ onClick={() => {
149
+ setEditValue(displayValue ?? '')
150
+ setEditing(true)
151
+ }}
152
+ className="w-7 h-7 rounded flex items-center justify-center text-sm transition-colors"
153
+ style={{ backgroundColor: '#f3f4f6', color: '#6b7280' }}
154
+ onMouseEnter={e => { (e.currentTarget as HTMLElement).style.backgroundColor = '#eff6ff'; (e.currentTarget as HTMLElement).style.color = '#2563EB' }}
155
+ onMouseLeave={e => { (e.currentTarget as HTMLElement).style.backgroundColor = '#f3f4f6'; (e.currentTarget as HTMLElement).style.color = '#6b7280' }}
156
+ >
157
+
158
+ </button>
159
+
160
+ {/* Flag */}
161
+ <button
162
+ title="Flag for review"
163
+ onClick={() => rejectField(sessionId, entry.fieldPath)}
164
+ className={`w-7 h-7 rounded flex items-center justify-center text-sm transition-colors ${
165
+ isRejected
166
+ ? 'bg-red-500 text-white'
167
+ : 'bg-gray-100 text-gray-500 hover:bg-red-100 hover:text-red-600'
168
+ }`}
169
+ >
170
+
171
+ </button>
172
+ </div>
173
+ </div>
174
+ </div>
175
+ )
176
+ }
177
+
178
+ function ConfidenceBadge({ score }: { score: number }) {
179
+ const pct = Math.round(score * 100)
180
+ const badgeClasses =
+ pct >= 90
+ ? 'bg-green-100 text-green-700'
+ : pct >= 70
+ ? 'bg-yellow-100 text-yellow-700'
+ : 'bg-red-100 text-red-600'
+
+ return (
+ <span className={`text-xs font-mono px-1.5 py-0.5 rounded ${badgeClasses}`}>
189
+ {pct}%
190
+ </span>
191
+ )
192
+ }
193
+
194
+ function CheckIcon() {
195
+ return (
196
+ <svg className="w-3 h-3" viewBox="0 0 12 12" fill="currentColor">
197
+ <path d="M10 3L5 8.5 2 5.5" stroke="currentColor" strokeWidth="1.5"
198
+ strokeLinecap="round" strokeLinejoin="round" fill="none" />
199
+ </svg>
200
+ )
201
+ }
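The confidence badge in FieldRow.tsx buckets a match score into three colour tiers at 90% and 70%. A minimal standalone sketch of that thresholding, assuming scores arrive in the range 0 to 1 as the `match_score` field suggests. The names `confidenceTier` and `TIER_CLASSES` are hypothetical and not part of the committed file:

```typescript
// Hypothetical helper: maps a 0..1 match score to a named tier.
// Thresholds (90 and 70) are taken from ConfidenceBadge; everything else is illustrative.
type Tier = 'high' | 'medium' | 'low'

function confidenceTier(score: number): Tier {
  // Round to a whole percentage first, as the badge does before comparing.
  const pct = Math.round(score * 100)
  if (pct >= 90) return 'high'
  if (pct >= 70) return 'medium'
  return 'low'
}

// Tailwind classes per tier, copied from the badge's three branches.
const TIER_CLASSES: Record<Tier, string> = {
  high: 'bg-green-100 text-green-700',
  medium: 'bg-yellow-100 text-yellow-700',
  low: 'bg-red-100 text-red-600',
}
```

Separating the tier decision from the class lookup would make the thresholds testable without rendering a component; whether that is worth the extra indirection here is a judgment call.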
ui/src/PDFPane.tsx ADDED
@@ -0,0 +1,229 @@
1
+ import { useCallback, useEffect, useMemo, useRef, useState } from 'react'
2
+ import { Document, Page } from 'react-pdf'
3
+ import type { FieldProvenance } from './types'
4
+ import { useStore } from './store'
5
+ import { api } from './api'
6
+
7
+ interface Props {
8
+ sessionId: string
9
+ }
10
+
11
+ export function PDFPane({ sessionId }: Props) {
12
+ const sessionData = useStore((s) => s.sessionData)
13
+ const activePdfFile = useStore((s) => s.activePdfFile)
14
+ const activeProvenance = useStore((s) => s.activeProvenance)
15
+ const setActivePdf = useStore((s) => s.setActivePdf)
16
+
17
+ const [numPages, setNumPages] = useState(0)
18
+ const [renderedPages, setRenderedPages] = useState<Set<number>>(new Set())
19
+ const [containerWidth, setContainerWidth] = useState(600)
20
+ const containerRef = useRef<HTMLDivElement>(null)
21
+ const pageRefs = useRef<Map<number, HTMLDivElement>>(new Map())
22
+
23
+ // Pending scroll target (page + PDF file); cleared once the scroll fires so it does not re-fire
24
+ const pendingScrollRef = useRef<{ page: number; pdfFile: string } | null>(null)
25
+
26
+ // Unique PDF filenames from provenance
27
+ const pdfFiles = useMemo(() => {
28
+ const seen = new Set<string>()
29
+ return (sessionData?.provenance ?? [])
30
+ .map((p) => p.source_filename)
31
+ .filter((f) => { const fresh = !seen.has(f); seen.add(f); return fresh })
32
+ }, [sessionData?.provenance])
33
+
34
+ // Set container width on resize
35
+ useEffect(() => {
36
+ const el = containerRef.current
37
+ if (!el) return
38
+ const obs = new ResizeObserver(([entry]) => {
39
+ setContainerWidth(Math.floor(entry.contentRect.width) - 24)
40
+ })
41
+ obs.observe(el)
42
+ setContainerWidth(Math.floor(el.clientWidth) - 24)
43
+ return () => obs.disconnect()
44
+ }, [])
45
+
46
+ // When active provenance changes: enqueue a scroll request
47
+ useEffect(() => {
48
+ if (!activeProvenance) return
49
+ pendingScrollRef.current = {
50
+ page: activeProvenance.location.page,
51
+ pdfFile: activeProvenance.source_filename,
52
+ }
53
+ // Reset rendered-pages set when switching documents
54
+ if (activeProvenance.source_filename !== activePdfFile) {
55
+ setRenderedPages(new Set())
56
+ }
57
+ // Try immediately (page already rendered)
58
+ tryScroll()
59
+ }, [activeProvenance]) // eslint-disable-line react-hooks/exhaustive-deps
60
+
61
+ // When a page finishes rendering, check if a scroll is pending for it
62
+ const handlePageRenderSuccess = useCallback((pageNum: number) => {
63
+ setRenderedPages((prev) => new Set([...prev, pageNum]))
64
+ const pending = pendingScrollRef.current
65
+ if (pending && pending.page === pageNum && pending.pdfFile === activePdfFile) {
66
+ const el = pageRefs.current.get(pageNum)
67
+ el?.scrollIntoView({ behavior: 'smooth', block: 'center' })
68
+ pendingScrollRef.current = null
69
+ }
70
+ }, [activePdfFile])
71
+
72
+ function tryScroll() {
73
+ const pending = pendingScrollRef.current
74
+ if (!pending) return
75
+ if (pending.pdfFile !== activePdfFile) return
76
+ const el = pageRefs.current.get(pending.page)
77
+ if (el) {
78
+ el.scrollIntoView({ behavior: 'smooth', block: 'center' })
79
+ pendingScrollRef.current = null
80
+ }
81
+ }
82
+
83
+ // Detect a PDF URL change during render and drop the stale page refs
84
+ const pdfUrl = activePdfFile ? api.pdfUrl(sessionId, activePdfFile) : null
85
+ const prevPdfUrlRef = useRef<string | null>(null)
86
+ if (pdfUrl !== prevPdfUrlRef.current) {
87
+ prevPdfUrlRef.current = pdfUrl
88
+ // Clear page refs — old page elements are stale after document switch
89
+ pageRefs.current.clear()
90
+ }
91
+
92
+ // Highlights for the currently displayed PDF
93
+ const highlights = useMemo((): FieldProvenance[] => {
94
+ if (!sessionData || !activePdfFile) return []
95
+ return sessionData.provenance.filter(
96
+ (p) => p.source_filename === activePdfFile,
97
+ )
98
+ }, [sessionData, activePdfFile])
99
+
100
+ return (
101
+ <div className="flex flex-col h-full">
102
+ {/* PDF file selector */}
103
+ {pdfFiles.length > 1 && (
104
+ <div className="flex flex-wrap gap-2 p-3 border-b flex-shrink-0" style={{ backgroundColor: '#1F2937' }}>
105
+ {pdfFiles.map((f) => (
106
+ <button
107
+ key={f}
108
+ onClick={() => setActivePdf(f)}
109
+ className="px-3 py-1 rounded text-xs font-medium transition-colors"
110
+ style={activePdfFile === f
111
+ ? { backgroundColor: '#008080', color: '#ffffff' }
112
+ : { backgroundColor: 'rgba(255,255,255,0.08)', border: '1px solid rgba(255,255,255,0.15)', color: 'rgba(255,255,255,0.7)' }
113
+ }
114
+ >
115
+ {f}
116
+ </button>
117
+ ))}
118
+ </div>
119
+ )}
120
+
121
+ {/* PDF scroll area */}
122
+ <div
123
+ ref={containerRef}
124
+ className="flex-1 overflow-y-auto pdf-scroll-container bg-gray-200 p-3 space-y-4"
125
+ >
126
+ {pdfUrl ? (
127
+ <Document
128
+ file={pdfUrl}
129
+ onLoadSuccess={({ numPages: n }) => {
130
+ setNumPages(n)
131
+ setRenderedPages(new Set())
132
+ }}
133
+ loading={<LoadingPlaceholder />}
134
+ error={<ErrorPlaceholder />}
135
+ >
136
+ {Array.from({ length: numPages }, (_, i) => i + 1).map((pageNum) => {
137
+ const pageHighlights = highlights.filter(
138
+ (h) => h.location.page === pageNum,
139
+ )
140
+ const hasActive =
141
+ activeProvenance?.location.page === pageNum &&
142
+ activeProvenance.source_filename === activePdfFile
143
+
144
+ return (
145
+ <div
146
+ key={pageNum}
147
+ ref={(el) => {
148
+ if (el) pageRefs.current.set(pageNum, el)
149
+ else pageRefs.current.delete(pageNum)
150
+ }}
151
+ // Use block + explicit width so the overlay div always matches
152
+ // the canvas dimensions exactly (inline-block can shrink-wrap)
153
+ style={{ position: 'relative', width: containerWidth }}
154
+ className={`rounded shadow-md transition-shadow overflow-hidden ${
155
+ hasActive ? 'ring-4 ring-blue-500' : ''
156
+ }`}
157
+ >
158
+ <Page
159
+ pageNumber={pageNum}
160
+ width={containerWidth}
161
+ renderTextLayer={false}
162
+ renderAnnotationLayer={false}
163
+ onRenderSuccess={() => handlePageRenderSuccess(pageNum)}
164
+ />
165
+
166
+ {/* Highlight overlay — percentage-based, top-left origin */}
167
+ <div
168
+ style={{ position: 'absolute', inset: 0, pointerEvents: 'none' }}
169
+ aria-hidden
170
+ >
171
+ {pageHighlights.map((h) => {
172
+ const [x0, y0, x1, y1] = h.location.bbox
173
+ const isActive = activeProvenance?.field_path === h.field_path
174
+ return (
175
+ <div
176
+ key={h.field_path}
177
+ style={{
178
+ position: 'absolute',
179
+ left: `${x0}%`,
180
+ top: `${y0}%`,
181
+ width: `${x1 - x0}%`,
182
+ height: `${y1 - y0}%`,
183
+ background: isActive
184
+ ? 'rgba(59, 130, 246, 0.35)' /* blue-500 fill */
185
+ : 'rgba(134, 239, 172, 0.35)', /* green-300 fill */
186
+ border: isActive
187
+ ? '3px solid rgba(37, 99, 235, 1)' /* blue-600 solid */
188
+ : '2px solid rgba(22, 163, 74, 0.9)', /* green-600 */
189
+ borderRadius: 3,
190
+ boxShadow: isActive
191
+ ? '0 0 0 2px rgba(147, 197, 253, 0.6)' /* blue glow */
192
+ : 'none',
193
+ transition: 'background 0.15s, border 0.15s',
194
+ }}
195
+ title={`${h.field_path}: ${h.extracted_value}`}
196
+ />
197
+ )
198
+ })}
199
+ </div>
200
+ </div>
201
+ )
202
+ })}
203
+ </Document>
204
+ ) : (
205
+ <div className="flex items-center justify-center h-full text-gray-400 text-sm">
206
+ No PDF selected
207
+ </div>
208
+ )}
209
+ </div>
210
+ </div>
211
+ )
212
+ }
213
+
214
+ function LoadingPlaceholder() {
215
+ return (
216
+ <div className="flex items-center justify-center p-12 text-gray-400 text-sm">
217
+ Loading PDF…
218
+ </div>
219
+ )
220
+ }
221
+
222
+ function ErrorPlaceholder() {
223
+ return (
224
+ <div className="flex items-center justify-center p-12 text-red-400 text-sm">
225
+ Failed to load PDF.
226
+ </div>
227
+ )
228
+ }
229
+
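The highlight overlay in PDFPane.tsx positions each box from `h.location.bbox`, read as `[x0, y0, x1, y1]` in page percentages with a top-left origin. A small sketch of that conversion as a pure function follows. The name `bboxToOverlayRect` is hypothetical, and the clamping and corner normalisation are additions for illustration that the committed component does not perform:

```typescript
// Assumption: bbox is [x0, y0, x1, y1] in percentages (0-100) of the page,
// measured from the top-left corner, as the overlay comment in PDFPane states.
type BBox = [number, number, number, number]

interface OverlayRect {
  left: string
  top: string
  width: string
  height: string
}

function bboxToOverlayRect([x0, y0, x1, y1]: BBox): OverlayRect {
  // Keep every coordinate inside the page (illustrative safeguard).
  const clamp = (v: number): number => Math.max(0, Math.min(100, v))
  // Normalise so a swapped corner pair still yields a positive-size box.
  const left = clamp(Math.min(x0, x1))
  const right = clamp(Math.max(x0, x1))
  const top = clamp(Math.min(y0, y1))
  const bottom = clamp(Math.max(y0, y1))
  return {
    left: `${left}%`,
    top: `${top}%`,
    width: `${right - left}%`,
    height: `${bottom - top}%`,
  }
}
```

Because the values are percentages, the resulting rectangle scales with the page wrapper regardless of `containerWidth`, which is why the wrapper needs to match the canvas size exactly.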
ui/src/RecordPane.tsx ADDED
@@ -0,0 +1,174 @@
1
+ import { useMemo } from 'react'
2
+ import type { FieldEntry, GoldenRecord } from './types'
3
+ import { useStore } from './store'
4
+ import { FieldRow } from './FieldRow'
5
+
6
+ interface Props {
7
+ sessionId: string
8
+ }
9
+
10
+ const SECTION_LABELS: Record<string, string> = {
11
+ policy_header: 'Policy Header',
12
+ vehicle_details: 'Vehicle Details',
13
+ driver_details: 'Drivers',
14
+ cover_and_excesses: 'Cover & Excesses',
15
+ financial_summary: 'Financial Summary',
16
+ additional_risk_data: 'Additional Risk Data',
17
+ }
18
+
19
+ export function RecordPane({ sessionId }: Props) {
20
+ const sessionData = useStore((s) => s.sessionData)
21
+ const reviewState = useStore((s) => s.reviewState)
22
+ const activeFieldPath = useStore((s) => s.activeFieldPath)
23
+ const setActiveField = useStore((s) => s.setActiveField)
24
+
25
+ const fieldsBySection = useMemo(() => {
26
+ if (!sessionData) return []
27
+ return flattenRecord(sessionData.record, sessionData.provenance.reduce(
28
+ (acc, p) => { acc[p.field_path] = p; return acc },
29
+ {} as Record<string, import('./types').FieldProvenance>,
30
+ ))
31
+ }, [sessionData])
32
+
33
+ if (!sessionData) return null
34
+
35
+ return (
36
+ <div className="flex flex-col h-full">
37
+ {/* Header */}
38
+ <div className="px-5 py-4 border-b flex-shrink-0" style={{ backgroundColor: '#1F2937' }}>
39
+ <h2 className="text-sm font-semibold text-white">Golden Record</h2>
40
+ <p className="text-xs mt-0.5" style={{ color: 'rgba(255,255,255,0.5)' }}>
41
+ Click any field to highlight its source location in the PDF.
42
+ </p>
43
+ </div>
44
+
45
+ {/* Scrollable field list */}
46
+ <div className="flex-1 overflow-y-auto px-4 py-3 space-y-5">
47
+ {fieldsBySection.map(({ section, entries }) => (
48
+ <section key={section}>
49
+ <h3 className="text-xs font-semibold uppercase tracking-wider mb-2 px-1" style={{ color: '#008080' }}>
50
+ {SECTION_LABELS[section] ?? section}
51
+ </h3>
52
+ <div className="space-y-1">
53
+ {entries.map((entry) => (
54
+ <FieldRow
55
+ key={entry.fieldPath}
56
+ entry={entry}
57
+ sessionId={sessionId}
58
+ isActive={activeFieldPath === entry.fieldPath}
59
+ review={reviewState[entry.fieldPath]}
60
+ onClick={() =>
61
+ setActiveField(activeFieldPath === entry.fieldPath ? null : entry)
62
+ }
63
+ />
64
+ ))}
65
+ </div>
66
+ </section>
67
+ ))}
68
+ </div>
69
+ </div>
70
+ )
71
+ }
72
+
73
+ // ── Field flattening helpers ───────────────────────────────────────────────
74
+
75
+ interface SectionGroup {
76
+ section: string
77
+ entries: FieldEntry[]
78
+ }
79
+
80
+ function flattenRecord(
81
+ record: GoldenRecord,
82
+ provenanceMap: Record<string, import('./types').FieldProvenance>,
83
+ ): SectionGroup[] {
84
+ const groups: SectionGroup[] = []
85
+
86
+ for (const [sectionKey, sectionValue] of Object.entries(record)) {
87
+ if (sectionValue == null) continue
88
+
89
+ const entries: FieldEntry[] = []
90
+
91
+ if (Array.isArray(sectionValue)) {
92
+ // driver_details
93
+ sectionValue.forEach((item: Record<string, unknown>, idx: number) => {
94
+ walkObject(
95
+ item,
96
+ `${sectionKey}[${idx}]`,
97
+ `Driver ${idx + 1}`,
98
+ entries,
99
+ provenanceMap,
100
+ )
101
+ })
102
+ } else if (typeof sectionValue === 'object') {
103
+ walkObject(
104
+ sectionValue as Record<string, unknown>,
105
+ sectionKey,
106
+ '',
107
+ entries,
108
+ provenanceMap,
109
+ )
110
+ } else {
111
+ entries.push({
112
+ fieldPath: sectionKey,
113
+ label: formatLabel(sectionKey),
114
+ value: String(sectionValue),
115
+ section: sectionKey,
116
+ provenance: provenanceMap[sectionKey],
117
+ })
118
+ }
119
+
120
+ if (entries.length > 0) {
121
+ groups.push({ section: sectionKey, entries })
122
+ }
123
+ }
124
+
125
+ return groups
126
+ }
127
+
128
+ function walkObject(
129
+ obj: Record<string, unknown>,
130
+ pathPrefix: string,
131
+ _labelPrefix: string,
132
+ out: FieldEntry[],
133
+ provenanceMap: Record<string, import('./types').FieldProvenance>,
134
+ ) {
135
+ for (const [key, val] of Object.entries(obj)) {
136
+ const path = `${pathPrefix}.${key}`
137
+
138
+ if (val == null) continue
139
+
140
+ if (typeof val === 'object' && !Array.isArray(val)) {
141
+ walkObject(val as Record<string, unknown>, path, key, out, provenanceMap)
142
+ } else if (Array.isArray(val)) {
143
+ val.forEach((item, i) => {
144
+ if (item == null) return
145
+ const iPath = `${path}[${i}]`
146
+ if (typeof item === 'object') {
147
+ walkObject(item as Record<string, unknown>, iPath, key, out, provenanceMap)
148
+ } else {
149
+ out.push({
150
+ fieldPath: iPath,
151
+ label: `${formatLabel(key)} [${i}]`,
152
+ value: String(item),
153
+ section: pathPrefix.split('.')[0].replace(/\[\d+\]$/, ''), // strip "[i]" so driver rows map to "driver_details"
154
+ provenance: provenanceMap[iPath],
155
+ })
156
+ }
157
+ })
158
+ } else {
159
+ out.push({
160
+ fieldPath: path,
161
+ label: formatLabel(key),
162
+ value: String(val),
163
+ section: pathPrefix.split('.')[0].replace(/\[\d+\]$/, ''), // strip "[i]" so driver rows map to "driver_details"
164
+ provenance: provenanceMap[path],
165
+ })
166
+ }
167
+ }
168
+ }
169
+
170
+ function formatLabel(key: string): string {
171
+ return key
172
+ .replace(/_/g, ' ')
173
+ .replace(/\b\w/g, (c) => c.toUpperCase())
174
+ }
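The flattening helpers in RecordPane.tsx build field paths of the form `section.key` and `section[i].key`, skipping null values, so that each row can be looked up in the provenance map. A reduced sketch of the same path scheme, assuming plain JSON input. The `Json` type and `flattenPaths` name are hypothetical, and labels, sections and provenance are left out on purpose:

```typescript
// Hypothetical JSON value type used only for this sketch.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Flatten a nested record into { "dotted.path[0].key": "value" } pairs.
function flattenPaths(value: Json, prefix = ''): Record<string, string> {
  const out: Record<string, string> = {}

  const visit = (v: Json, path: string): void => {
    if (v === null) return // nulls are skipped, as in walkObject
    if (Array.isArray(v)) {
      v.forEach((item, i) => visit(item, `${path}[${i}]`))
    } else if (typeof v === 'object') {
      for (const [key, child] of Object.entries(v)) {
        visit(child, path ? `${path}.${key}` : key)
      }
    } else {
      out[path] = String(v) // leaf: stringify like the FieldEntry value
    }
  }

  visit(value, prefix)
  return out
}
```

For example, `flattenPaths({ driver_details: [{ name: 'ALICE SMITH', dob: null }] })` yields a single entry keyed `driver_details[0].name`, which is the key shape the provenance lookup relies on.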
ui/src/ReviewDashboard.tsx ADDED
@@ -0,0 +1,93 @@
1
+ import { PDFPane } from './PDFPane'
2
+ import { RecordPane } from './RecordPane'
3
+ import { useStore } from './store'
4
+ import logoUrl from './assets/ai-toolstack-logo.svg'
5
+
6
+ interface Props {
7
+ sessionId: string
8
+ }
9
+
10
+ export function ReviewDashboard({ sessionId }: Props) {
11
+ const sessionData = useStore((s) => s.sessionData)
12
+ const reviewState = useStore((s) => s.reviewState)
13
+
14
+ const verified = Object.values(reviewState).filter((r) => r.action === 'verify').length
15
+ const overridden = Object.values(reviewState).filter((r) => r.action === 'override').length
16
+ const provTotal = sessionData?.provenance.length ?? 0
17
+ const fieldTotal = sessionData ? _countLeaves(sessionData.record) : 0
18
+
19
+ return (
20
+ <div className="flex flex-col h-screen overflow-hidden" style={{ backgroundColor: '#f1f5f9' }}>
21
+
22
+ {/* ── Top bar ─────────────────────────────────────────────────── */}
23
+ <header className="flex items-center justify-between px-6 py-3 bg-white border-b border-gray-200 shadow-sm z-10 flex-shrink-0">
24
+ <div className="flex items-center gap-4">
25
+ <a href="https://www.ai-toolstack.com/" target="_blank" rel="noopener noreferrer">
26
+ <img src={logoUrl} alt="AI Tool Stack" className="h-6 w-auto" />
27
+ </a>
28
+ {/* Divider */}
29
+ <span className="text-gray-200 select-none">|</span>
30
+ <div className="flex items-center gap-2">
31
+ <svg width="16" height="16" viewBox="0 0 28 28" fill="none" aria-hidden="true">
32
+ <path d="M4 18L14 22L24 18" stroke="#1F2937" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
33
+ <path d="M4 14L14 18L24 14" stroke="#2563EB" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
34
+ <path d="M4 10L14 14L24 10L14 6L4 10Z" stroke="#008080" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round"/>
35
+ </svg>
36
+ <span className="text-sm font-semibold" style={{ color: '#1F2937' }}>PolicyTrace</span>
37
+ </div>
38
+ <span className="text-xs text-gray-400 font-mono bg-gray-50 px-2 py-0.5 rounded-lg border border-gray-200">
39
+ {sessionId.slice(0, 8)}…
40
+ </span>
41
+ </div>
42
+ <div className="flex items-center gap-3 text-xs text-gray-500">
43
+ <Stat label="Fields" value={fieldTotal} />
44
+ <StatDivider />
45
+ <Stat label="Located" value={provTotal} />
46
+ <StatDivider />
47
+ <Stat label="Verified" value={verified} color="#16a34a" />
48
+ <StatDivider />
49
+ <Stat label="Overridden" value={overridden} color="#2563EB" />
50
+ </div>
51
+ </header>
52
+
53
+ {/* ── 2-column body ───────────────────────────────────────────── */}
54
+ <div className="flex flex-1 overflow-hidden">
55
+ <div className="w-1/2 border-r border-gray-200 flex flex-col overflow-hidden">
56
+ <PDFPane sessionId={sessionId} />
57
+ </div>
58
+ <div className="w-1/2 flex flex-col overflow-hidden">
59
+ <RecordPane sessionId={sessionId} />
60
+ </div>
61
+ </div>
62
+ </div>
63
+ )
64
+ }
65
+
66
+ function StatDivider() {
67
+ return <span className="text-gray-200 select-none">·</span>
68
+ }
69
+
70
+ function Stat({
71
+ label,
72
+ value,
73
+ color = '#374151',
74
+ }: {
75
+ label: string
76
+ value: number
77
+ color?: string
78
+ }) {
79
+ return (
80
+ <span>
81
+ {label}:{' '}
82
+ <span className="font-semibold" style={{ color }}>{value}</span>
83
+ </span>
84
+ )
85
+ }
86
+
87
+ /** Recursively count leaf values (null leaves included) in any nested object; mirrors backend _count_leaves. */
88
+ function _countLeaves(obj: unknown): number {
89
+ if (Array.isArray(obj)) return obj.reduce((acc, v) => acc + _countLeaves(v), 0)
90
+ if (obj && typeof obj === 'object')
91
+ return Object.values(obj).reduce((acc: number, v) => acc + _countLeaves(v), 0)
92
+ return 1
93
+ }
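One consequence of `_countLeaves` as written is that `null` counts as a leaf, whereas RecordPane skips null values when it builds rows, so the "Fields" figure can exceed the number of rows shown. A sketch that makes the difference explicit, using a hypothetical `countLeaves` with an opt-in `skipNull` flag that is not part of the committed code:

```typescript
// Hypothetical variant of _countLeaves with an explicit switch for null handling.
function countLeaves(value: unknown, skipNull = false): number {
  if (Array.isArray(value)) {
    return value.reduce((acc: number, v: unknown) => acc + countLeaves(v, skipNull), 0)
  }
  if (value !== null && typeof value === 'object') {
    return Object.values(value).reduce(
      (acc: number, v: unknown) => acc + countLeaves(v, skipNull),
      0,
    )
  }
  // Primitive or null/undefined: the committed function always returns 1 here.
  if (value == null) return skipNull ? 0 : 1
  return 1
}

const sample = { a: 1, b: { c: null, d: [1, 2] } }
const withNulls = countLeaves(sample)          // counts a, c, and both items of d
const withoutNulls = countLeaves(sample, true) // leaves c out
```

With the default flag the behaviour matches the committed function; passing `true` gives a count that lines up with the rows RecordPane renders.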
ui/src/SessionPage.tsx ADDED
@@ -0,0 +1,82 @@
1
+ import { useEffect, useState } from 'react'
2
+ import { useNavigate, useParams } from 'react-router-dom'
3
+ import { api } from './api'
4
+ import { ReviewDashboard } from './ReviewDashboard'
5
+ import { useStore } from './store'
6
+
7
+ /**
8
+ * Route: /session/:sessionId
9
+ *
10
+ * Loads session data from the API on mount so the page survives a hard refresh
11
+ * or a direct link (e.g. from a blog post). If the session ID is not found the
12
+ * user sees an error message with a link back to the upload page.
13
+ */
14
+ export function SessionPage() {
15
+ const { sessionId } = useParams<{ sessionId: string }>()
16
+ const navigate = useNavigate()
17
+ const setSession = useStore((s) => s.setSession)
18
+ const sessionData = useStore((s) => s.sessionData)
19
+
20
+ const [loading, setLoading] = useState(false)
21
+ const [error, setError] = useState<string | null>(null)
22
+
23
+ useEffect(() => {
24
+ if (!sessionId) {
25
+ navigate('/')
26
+ return
27
+ }
28
+
29
+ // If the store already has data for this exact session (just navigated from
30
+ // the upload page), skip the API call.
31
+ if (sessionData?.session_id === sessionId) return
32
+
33
+ setLoading(true)
34
+ api.getSession(sessionId)
35
+ .then((data) => {
36
+ setSession(data)
37
+ setLoading(false)
38
+ })
39
+ .catch(() => {
40
+ setError(`Session "${sessionId.slice(0, 8)}…" not found or has expired.`)
41
+ setLoading(false)
42
+ })
43
+ }, [sessionId]) // eslint-disable-line react-hooks/exhaustive-deps
44
+
45
+ if (loading) {
46
+ return (
47
+ <div className="min-h-screen flex items-center justify-center" style={{ backgroundColor: '#f8fafc' }}>
48
+ <div className="text-center space-y-3">
49
+ <svg className="animate-spin h-8 w-8 mx-auto" viewBox="0 0 24 24" fill="none"
50
+ style={{ color: '#008080' }}>
51
+ <circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" />
52
+ <path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8v8z" />
53
+ </svg>
54
+ <p className="text-sm text-gray-500">Loading session…</p>
55
+ </div>
56
+ </div>
57
+ )
58
+ }
59
+
60
+ if (error) {
61
+ return (
62
+ <div className="min-h-screen flex items-center justify-center" style={{ backgroundColor: '#f8fafc' }}>
63
+ <div className="text-center space-y-4 max-w-sm">
64
+ <p className="text-sm text-red-600 bg-red-50 border border-red-200 rounded-xl px-4 py-3">
65
+ {error}
66
+ </p>
67
+ <button
68
+ onClick={() => navigate('/')}
69
+ className="text-sm font-medium underline"
70
+ style={{ color: '#2563EB' }}
71
+ >
72
+ ← Back to upload
73
+ </button>
74
+ </div>
75
+ </div>
76
+ )
77
+ }
78
+
79
+ if (!sessionId) return null
80
+
81
+ return <ReviewDashboard sessionId={sessionId} />
82
+ }
ui/src/UploadPage.tsx ADDED
@@ -0,0 +1,210 @@
+ import { useCallback, useState } from 'react'
+ import { useNavigate } from 'react-router-dom'
+ import { api } from './api'
+ import { useStore } from './store'
+ import logoUrl from './assets/ai-toolstack-logo.svg'
+
+ const BRAND = {
+   dark: '#1F2937',
+   blue: '#2563EB',
+   teal: '#008080',
+ } as const
+
+ export function UploadPage() {
+   const navigate = useNavigate()
+   const setSession = useStore((s) => s.setSession)
+
+   const [loading, setLoading] = useState(false)
+   const [error, setError] = useState<string | null>(null)
+   const [files, setFiles] = useState<File[]>([])
+   const [dragOver, setDragOver] = useState(false)
+
+   // Keep only PDFs and skip any file whose name is already in the list.
+   const handleFiles = useCallback((incoming: FileList | null) => {
+     if (!incoming) return
+     const pdfs = Array.from(incoming).filter((f) => f.name.toLowerCase().endsWith('.pdf'))
+     setFiles((prev) => {
+       const names = new Set(prev.map((f) => f.name))
+       return [...prev, ...pdfs.filter((f) => !names.has(f.name))]
+     })
+   }, [])
+
+   const removeFile = (name: string) =>
+     setFiles((prev) => prev.filter((f) => f.name !== name))
+
+   // Upload the files, fetch the resulting session, then open the review page.
+   const handleSubmit = async () => {
+     if (!files.length) return
+     setLoading(true)
+     setError(null)
+     try {
+       const resp = await api.processDocuments(files)
+       const sessionData = await api.getSession(resp.session_id)
+       setSession(sessionData)
+       navigate(`/session/${resp.session_id}`)
+     } catch (err: unknown) {
+       const msg = err instanceof Error ? err.message : 'An unknown error occurred.'
+       setError(msg)
+       setLoading(false)
+     }
+   }
+
+   return (
+     <div className="min-h-screen flex flex-col" style={{ backgroundColor: '#f8fafc' }}>
+
+       {/* ── Top nav ─────────────────────────────────────────────────── */}
+       <header className="flex items-center justify-between px-8 py-4 border-b border-gray-200 bg-white">
+         <a href="https://www.ai-toolstack.com/" target="_blank" rel="noopener noreferrer">
+           <img src={logoUrl} alt="AI Tool Stack" className="h-7 w-auto" />
+         </a>
+         <span
+           className="text-xs font-medium px-2 py-1 rounded-full"
+           style={{ backgroundColor: '#f0fdfc', color: BRAND.teal }}
+         >
+           Beta
+         </span>
+       </header>
+
+       {/* ── Hero ────────────────────────────────────────────────────── */}
+       <main className="flex-1 flex flex-col items-center justify-center px-8 py-12">
+         <div className="w-full max-w-lg">
+
+           {/* Title */}
+           <div className="mb-8 text-center">
+             <div className="inline-flex items-center gap-2 mb-4">
+               <svg width="28" height="28" viewBox="0 0 28 28" fill="none" aria-hidden="true">
+                 <path d="M4 18L14 22L24 18" stroke={BRAND.dark} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
+                 <path d="M4 14L14 18L24 14" stroke={BRAND.blue} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
+                 <path d="M4 10L14 14L24 10L14 6L4 10Z" stroke={BRAND.teal} strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
+               </svg>
+               <h1 className="text-2xl font-bold tracking-tight" style={{ color: BRAND.dark }}>
+                 PolicyTrace
+               </h1>
+             </div>
+             <p className="text-sm text-gray-500 leading-relaxed">
+               Upload UK motor insurance PDFs — the pipeline classifies, extracts, and merges
+               them into a verified Golden Record with full field-level provenance.
+             </p>
+           </div>
+
+           {/* Drop zone */}
+           <div
+             onDragOver={(e) => { e.preventDefault(); setDragOver(true) }}
+             onDragLeave={() => setDragOver(false)}
+             onDrop={(e) => {
+               e.preventDefault()
+               setDragOver(false)
+               handleFiles(e.dataTransfer.files)
+             }}
+             onClick={() => document.getElementById('file-input')?.click()}
+             className="rounded-2xl border-2 border-dashed p-10 text-center cursor-pointer transition-all"
+             style={{
+               borderColor: dragOver ? BRAND.blue : '#d1d5db',
+               backgroundColor: dragOver ? '#eff6ff' : '#ffffff',
+             }}
+           >
+             <svg
+               className="mx-auto mb-3 h-10 w-10 transition-colors"
+               fill="none"
+               viewBox="0 0 24 24"
+               stroke="currentColor"
+               style={{ color: dragOver ? BRAND.blue : '#9ca3af' }}
+             >
+               <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={1.5}
+                     d="M7 16a4 4 0 01-.88-7.903A5 5 0 1115.9 6L16 6a5 5 0 011 9.9M15 13l-3-3m0 0l-3 3m3-3v12" />
+             </svg>
+             <p className="text-sm font-medium text-gray-700">
+               Drop PDF files here, or{' '}
+               <span style={{ color: BRAND.blue }}>click to browse</span>
+             </p>
+             <p className="text-xs text-gray-400 mt-1">
+               Schedule · Certificate · Statement of Fact · Policy Booklet
+             </p>
+             <input
+               id="file-input"
+               type="file"
+               accept=".pdf"
+               multiple
+               className="hidden"
+               onChange={(e) => {
+                 handleFiles(e.target.files)
+                 // Reset so re-selecting a previously removed file fires onChange again
+                 e.target.value = ''
+               }}
+             />
+           </div>
+
+           {/* File list */}
+           {files.length > 0 && (
+             <ul className="mt-4 space-y-2">
+               {files.map((f) => (
+                 <li
+                   key={f.name}
+                   className="flex items-center justify-between bg-white border border-gray-200 rounded-xl px-4 py-2.5 text-sm shadow-sm"
+                 >
+                   <div className="flex items-center gap-2 min-w-0">
+                     <span
+                       className="shrink-0 text-xs font-semibold px-1.5 py-0.5 rounded"
+                       style={{ backgroundColor: '#fee2e2', color: '#991b1b' }}
+                     >
+                       PDF
+                     </span>
+                     <span className="text-gray-700 truncate">{f.name}</span>
+                   </div>
+                   <button
+                     onClick={() => removeFile(f.name)}
+                     className="text-gray-300 hover:text-red-500 ml-3 shrink-0 transition-colors"
+                     aria-label={`Remove ${f.name}`}
+                   >
+                     ×
+                   </button>
+                 </li>
+               ))}
+             </ul>
+           )}
+
+           {/* Error */}
+           {error && (
+             <div className="mt-4 rounded-xl bg-red-50 border border-red-200 p-3 text-sm text-red-700">
+               {error}
+             </div>
+           )}
+
+           {/* CTA */}
+           <button
+             onClick={handleSubmit}
+             disabled={!files.length || loading}
+             className="mt-6 w-full py-3 px-6 rounded-xl font-semibold text-white transition-colors disabled:opacity-50 disabled:cursor-not-allowed"
+             style={{ backgroundColor: loading ? BRAND.teal : BRAND.blue }}
+           >
+             {loading ? (
+               <span className="flex items-center justify-center gap-2">
+                 <svg className="animate-spin h-4 w-4" viewBox="0 0 24 24" fill="none">
+                   <circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" />
+                   <path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8v8z" />
+                 </svg>
+                 Extracting — this may take 60 s…
+               </span>
+             ) : (
+               'Extract & Review'
+             )}
+           </button>
+
+           {loading && (
+             <p className="text-center text-xs text-gray-400 mt-3">
+               Classifying documents · Masking PII · Calling Groq LLM · Building provenance index
+             </p>
+           )}
+         </div>
+       </main>
+
+       {/* ── Footer ──────────────────────────────────────────────────── */}
+       <footer className="text-center py-4 text-xs text-gray-400 border-t border-gray-200 bg-white">
+         Built on{' '}
+         <a
+           href="https://www.ai-toolstack.com/"
+           target="_blank"
+           rel="noopener noreferrer"
+           className="underline hover:text-gray-600 transition-colors"
+         >
+           AI Tool Stack
+         </a>{' '}
+         · Powered by Groq &amp; Docling
+       </footer>
+     </div>
+   )
+ }
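The filter-and-merge step inside `handleFiles` can be tried outside React. The sketch below is a hypothetical pure helper (`mergePdfFiles` and `NamedFile` are illustrative names, not part of this commit) that assumes files are identified by name only, as in the component. Unlike the component, it also skips duplicate names within a single incoming batch.

```typescript
// Hypothetical helper: a pure version of the filter-and-merge step in handleFiles.
// NamedFile stands in for the browser File type so the sketch runs outside the DOM.
interface NamedFile {
  name: string
}

function mergePdfFiles<T extends NamedFile>(current: T[], incoming: T[]): T[] {
  // Keep only names ending in ".pdf", compared case-insensitively.
  const pdfs = incoming.filter((f) => f.name.toLowerCase().endsWith('.pdf'))
  // Track names already present, including ones added earlier in this batch.
  const seen = new Set(current.map((f) => f.name))
  const result = [...current]
  for (const f of pdfs) {
    if (!seen.has(f.name)) {
      seen.add(f.name)
      result.push(f)
    }
  }
  return result
}

const example = mergePdfFiles(
  [{ name: 'schedule.pdf' }],
  [{ name: 'schedule.pdf' }, { name: 'Certificate.PDF' }, { name: 'notes.txt' }],
)
console.log(example.map((f) => f.name)) // [ 'schedule.pdf', 'Certificate.PDF' ]
```

Deduplicating by name alone means two different files that share a filename collapse into one entry; whether that is acceptable depends on how the backend keys uploads, which this diff does not show.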
ui/src/api.ts ADDED
@@ -0,0 +1,43 @@
+ import axios from 'axios'
+ import type { ProcessResponse, ReviewAction, ReviewState, SessionData } from './types'
+
+ const http = axios.create({ baseURL: '/' })
+
+ export const api = {
+   async processDocuments(files: File[]): Promise<ProcessResponse> {
+     const form = new FormData()
+     for (const f of files) form.append('files', f)
+     const { data } = await http.post<ProcessResponse>('/api/process', form, {
+       headers: { 'Content-Type': 'multipart/form-data' },
+     })
+     return data
+   },
+
+   async getSession(sessionId: string): Promise<SessionData> {
+     const { data } = await http.get<SessionData>(`/api/session/${sessionId}`)
+     return data
+   },
+
+   async getReviewState(sessionId: string): Promise<ReviewState> {
+     const { data } = await http.get<ReviewState>(`/api/session/${sessionId}/review-state`)
+     return data
+   },
+
+   async updateReview(
+     sessionId: string,
+     fieldPath: string,
+     action: ReviewAction,
+     overriddenValue?: string,
+   ): Promise<void> {
+     await http.patch(`/api/session/${sessionId}/review`, {
+       field_path: fieldPath,
+       action,
+       overridden_value: overriddenValue ?? null,
+     })
+   },
+
+   /** URL to stream a PDF; used directly by the PDF viewer component */
+   pdfUrl(sessionId: string, filename: string): string {
+     return `/api/pdf/${sessionId}/${encodeURIComponent(filename)}`
+   },
+ }
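To see why `pdfUrl` passes the filename through `encodeURIComponent`, the following standalone sketch copies that one function and applies it to made-up filenames (the session IDs and names are illustrative assumptions). Characters such as spaces, `#`, and `/` would otherwise be read as URL syntax rather than as part of the filename.

```typescript
// Mirrors api.pdfUrl from the diff; the inputs below are made-up examples.
function pdfUrl(sessionId: string, filename: string): string {
  return `/api/pdf/${sessionId}/${encodeURIComponent(filename)}`
}

// A space becomes %20 and "#" becomes %23, so the fragment marker is not misread.
console.log(pdfUrl('abc123', 'Policy Schedule #2.pdf'))
// /api/pdf/abc123/Policy%20Schedule%20%232.pdf

// A slash in the name becomes %2F instead of adding a path segment.
console.log(pdfUrl('abc123', 'a/b.pdf'))
// /api/pdf/abc123/a%2Fb.pdf
```

The session ID is interpolated unencoded, which is fine only if the backend issues URL-safe IDs; the diff does not show how they are generated.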
ui/src/assets/ai-toolstack-logo.svg ADDED
ui/src/index.css ADDED
@@ -0,0 +1,31 @@
+ @tailwind base;
+ @tailwind components;
+ @tailwind utilities;
+
+ @layer base {
+   body {
+     @apply bg-gray-50 text-gray-900;
+     font-family: ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+   }
+ }
+
+ /* ── Brand accent utilities ─────────────────────────────────────── */
+ .btn-primary {
+   @apply py-3 px-6 rounded-xl font-semibold text-white transition-colors;
+   background-color: #2563EB;
+ }
+ .btn-primary:hover { background-color: #1d4ed8; }
+ .btn-primary:disabled { @apply opacity-50 cursor-not-allowed; }
+
+ /* ── react-pdf ──────────────────────────────────────────────────── */
+ .react-pdf__Page {
+   @apply shadow-md;
+ }
+ .react-pdf__Page__canvas {
+   @apply block;
+ }
+
+ /* Smooth scroll for the PDF pane */
+ .pdf-scroll-container {
+   scroll-behavior: smooth;
+ }
ui/src/main.tsx ADDED
@@ -0,0 +1,23 @@
+ import React from 'react'
+ import ReactDOM from 'react-dom/client'
+ import { BrowserRouter } from 'react-router-dom'
+ import { pdfjs } from 'react-pdf'
+ import App from './App'
+ import './index.css'
+ import 'react-pdf/dist/Page/AnnotationLayer.css'
+ import 'react-pdf/dist/Page/TextLayer.css'
+
+ // Configure pdfjs worker (Vite resolves this at build time)
+ pdfjs.GlobalWorkerOptions.workerSrc = new URL(
+   'pdfjs-dist/build/pdf.worker.min.mjs',
+   import.meta.url,
+ ).toString()
+
+ ReactDOM.createRoot(document.getElementById('root')!).render(
+   <React.StrictMode>
+     <BrowserRouter>
+       <App />
+     </BrowserRouter>
+   </React.StrictMode>,
+ )
ui/src/store.ts ADDED
@@ -0,0 +1,88 @@
+ import { create } from 'zustand'
+ import { api } from './api'
+ import type { FieldEntry, FieldProvenance, ReviewAction, ReviewState, SessionData } from './types'
+
+ interface AppState {
+   // Session data
+   sessionData: SessionData | null
+   reviewState: ReviewState
+
+   // UI state
+   activePdfFile: string | null      // filename of the PDF currently displayed
+   activeFieldPath: string | null    // field path the user clicked
+   activeProvenance: FieldProvenance | null
+
+   // Actions
+   setSession: (data: SessionData) => void
+   loadReviewState: (sessionId: string) => Promise<void>
+   setActiveField: (entry: FieldEntry | null) => void
+   verifyField: (sessionId: string, fieldPath: string) => Promise<void>
+   overrideField: (sessionId: string, fieldPath: string, newValue: string) => Promise<void>
+   rejectField: (sessionId: string, fieldPath: string) => Promise<void>
+   setActivePdf: (filename: string) => void
+ }
+
+ export const useStore = create<AppState>((set, get) => ({
+   sessionData: null,
+   reviewState: {},
+   activePdfFile: null,
+   activeFieldPath: null,
+   activeProvenance: null,
+
+   setSession(data) {
+     // Set default active PDF to the first source file found in provenance
+     const firstPdf = data.provenance[0]?.source_filename ?? null
+     set({ sessionData: data, activePdfFile: firstPdf })
+   },
+
+   async loadReviewState(sessionId) {
+     const state = await api.getReviewState(sessionId)
+     set({ reviewState: state })
+   },
+
+   setActiveField(entry) {
+     if (!entry) {
+       set({ activeFieldPath: null, activeProvenance: null })
+       return
+     }
+     const { sessionData } = get()
+     const provenance = sessionData?.provenance.find(
+       (p) => p.field_path === entry.fieldPath,
+     ) ?? null
+
+     set({
+       activeFieldPath: entry.fieldPath,
+       activeProvenance: provenance,
+       // Switch PDF pane to the file that contains this field
+       activePdfFile: provenance?.source_filename ?? get().activePdfFile,
+     })
+   },
+
+   async verifyField(sessionId, fieldPath) {
+     await _applyReview(sessionId, fieldPath, 'verify', undefined, set)
+   },
+
+   async overrideField(sessionId, fieldPath, newValue) {
+     await _applyReview(sessionId, fieldPath, 'override', newValue, set)
+   },
+
+   async rejectField(sessionId, fieldPath) {
+     await _applyReview(sessionId, fieldPath, 'reject', undefined, set)
+   },
+
+   setActivePdf(filename) {
+     set({ activePdfFile: filename })
+   },
+ }))
+
+ // Send one review action, then re-fetch the full review state so the UI matches the server.
+ async function _applyReview(
+   sessionId: string,
+   fieldPath: string,
+   action: ReviewAction,
+   overriddenValue: string | undefined,
+   set: (partial: Partial<AppState>) => void,
+ ) {
+   await api.updateReview(sessionId, fieldPath, action, overriddenValue)
+   const fresh = await api.getReviewState(sessionId)
+   set({ reviewState: fresh })
+ }
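The PDF-switching rule in `setActiveField` (show the file that holds the clicked field, otherwise keep the current one) can be isolated as a pure function. This is a minimal sketch under assumed names: `resolveActivePdf` and `ProvenanceLite` are hypothetical and only carry the two properties the rule needs.

```typescript
// Hypothetical, trimmed-down stand-in for FieldProvenance from types.ts.
interface ProvenanceLite {
  field_path: string
  source_filename: string
}

// Returns the file containing the field, or the current file when no provenance matches.
function resolveActivePdf(
  provenance: ProvenanceLite[],
  fieldPath: string,
  currentPdf: string | null,
): string | null {
  const match = provenance.find((p) => p.field_path === fieldPath)
  return match?.source_filename ?? currentPdf
}

// Made-up sample data for illustration only.
const sampleProvenance: ProvenanceLite[] = [
  { field_path: 'policy_header.policy_number', source_filename: 'schedule.pdf' },
  { field_path: 'vehicle_details.vrm', source_filename: 'certificate.pdf' },
]

console.log(resolveActivePdf(sampleProvenance, 'vehicle_details.vrm', 'schedule.pdf'))
// certificate.pdf
console.log(resolveActivePdf(sampleProvenance, 'financial_summary.total_annual_premium', 'schedule.pdf'))
// schedule.pdf
```

Falling back to the current file keeps the viewer stable when a field has no provenance entry, which matches the `?? get().activePdfFile` branch in the store.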
ui/src/types.ts ADDED
@@ -0,0 +1,170 @@
+ // ── Geometry ──────────────────────────────────────────────────────────────
+
+ export interface Location {
+   page: number
+   /** [x0%, y0%, x1%, y1%]: top-left origin, 0–100 range, percent of page */
+   bbox: [number, number, number, number]
+ }
+
+ export interface FieldProvenance {
+   field_path: string
+   extracted_value: string
+   matched_text: string
+   /** 0.0–1.0 */
+   match_score: number
+   source_filename: string
+   location: Location
+ }
+
+ // ── Golden Record sub-types ───────────────────────────────────────────────
+
+ export interface PeriodOfCover {
+   start_date?: string
+   expiry_date?: string
+   issue_date?: string
+ }
+
+ export interface PolicyHeader {
+   policy_number?: string
+   insurer?: string
+   product_name?: string
+   period_of_cover?: PeriodOfCover
+ }
+
+ export interface SecurityDetails {
+   has_security_device?: boolean
+   tracker_fitted?: boolean
+   modifications?: string
+ }
+
+ export interface VehicleDetails {
+   vrm?: string
+   make?: string
+   model?: string
+   fuel_type?: string
+   transmission?: string
+   estimated_value?: string
+   annual_mileage?: number
+   overnight_postcode?: string
+   kept_location?: string
+   security?: SecurityDetails
+ }
+
+ export interface Driver {
+   name: string
+   dob?: string
+   relationship?: string
+   occupation?: string
+   license_type?: string
+   is_main_driver: boolean
+   specific_excess?: number
+ }
+
+ export interface NoClaimsDiscount {
+   years?: number
+   protected?: boolean
+ }
+
+ export interface ExcessBreakdown {
+   standard_compulsory?: number
+   voluntary?: number
+   total_accidental_damage?: number
+   fire?: number
+   theft?: number
+   windscreen_repair?: number
+   windscreen_replacement?: number
+   own_repairer_additional_excess?: number
+ }
+
+ export interface CoverAndExcesses {
+   cover_type?: string
+   class_of_use?: string
+   driving_other_cars?: boolean
+   no_claims_discount?: NoClaimsDiscount
+   excess_breakdown?: ExcessBreakdown
+ }
+
+ export interface OptionalExtras {
+   motor_legal_protection?: number | string
+   breakdown_roadside_assistance?: number | string
+   enhanced_personal_accident?: number | string
+   hire_car?: number | string
+   key_cover?: number | string
+ }
+
+ export interface FinancialSummary {
+   total_annual_premium?: number
+   optional_extras?: OptionalExtras
+ }
+
+ export interface AdditionalRiskData {
+   home_ownership?: string
+   children_under_16?: boolean
+   number_of_cars_in_household?: number
+   non_motoring_convictions?: boolean
+   endorsements?: string
+ }
+
+ export interface Citations {
+   vehicle_model?: string
+   excess_details?: string
+   class_of_use?: string
+   driver_ages?: string
+   premium_breakdown?: string
+ }
+
+ export interface GoldenRecord {
+   policy_header?: PolicyHeader
+   vehicle_details?: VehicleDetails
+   driver_details: Driver[]
+   cover_and_excesses?: CoverAndExcesses
+   financial_summary?: FinancialSummary
+   additional_risk_data?: AdditionalRiskData
+   citations?: Citations
+ }
+
+ export interface ConflictEntry {
+   field: string
+   schedule_value?: string
+   certificate_value?: string
+   winner: 'schedule' | 'certificate' | 'fallback' | string
+ }
+
+ // ── Session ───────────────────────────────────────────────────────────────
+
+ export interface SessionData {
+   record: GoldenRecord
+   provenance: FieldProvenance[]
+   conflicts?: ConflictEntry[]
+   session_id: string
+ }
+
+ // ── Review state ──────────────────────────────────────────────────────────
+
+ export type ReviewAction = 'verify' | 'reject' | 'override'
+
+ export interface FieldReview {
+   action: ReviewAction
+   overridden_value?: string
+   reviewer?: string
+ }
+
+ export type ReviewState = Record<string, FieldReview>
+
+ // ── Flat field entry (used by the form panel) ─────────────────────────────
+
+ export interface FieldEntry {
+   fieldPath: string
+   label: string
+   value: string | null
+   section: string
+   provenance?: FieldProvenance
+ }
+
+ // ── API response types ────────────────────────────────────────────────────
+
+ export interface ProcessResponse {
+   session_id: string
+   fields_extracted: number
+   provenance_coverage: number
+ }
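`Location.bbox` is documented as `[x0%, y0%, x1%, y1%]` with a top-left origin and a 0–100 range. One plausible way a viewer could turn that into a highlight rectangle is sketched below; `bboxToPixels` and `PixelRect` are hypothetical names, and the rendered page size is assumed to be known in pixels. The commit's actual overlay code is not shown in this part of the diff.

```typescript
// Percent-based box as described on Location.bbox: [x0%, y0%, x1%, y1%], top-left origin.
type PercentBBox = [number, number, number, number]

// Hypothetical output shape, suitable for absolutely positioning an overlay element.
interface PixelRect {
  left: number
  top: number
  width: number
  height: number
}

function bboxToPixels(bbox: PercentBBox, pageWidth: number, pageHeight: number): PixelRect {
  const [x0, y0, x1, y1] = bbox
  return {
    // Multiply before dividing so whole-number inputs stay exact.
    left: (x0 * pageWidth) / 100,
    top: (y0 * pageHeight) / 100,
    width: ((x1 - x0) * pageWidth) / 100,
    height: ((y1 - y0) * pageHeight) / 100,
  }
}

// A box spanning 10%–60% horizontally and 20%–25% vertically on an 800×1000 px page.
console.log(bboxToPixels([10, 20, 60, 25], 800, 1000))
// { left: 80, top: 200, width: 400, height: 50 }
```

Storing percentages rather than pixels lets the same provenance record be reused at any zoom level, since only `pageWidth` and `pageHeight` change.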
ui/src/vite-env.d.ts ADDED
@@ -0,0 +1 @@
+ /// <reference types="vite/client" />
ui/tailwind.config.js ADDED
@@ -0,0 +1,26 @@
+ /** @type {import('tailwindcss').Config} */
+ export default {
+   content: [
+     "./index.html",
+     "./src/**/*.{js,ts,jsx,tsx}",
+   ],
+   theme: {
+     extend: {
+       colors: {
+         brand: {
+           dark: '#1F2937',
+           blue: '#2563EB',
+           teal: '#008080',
+           50: '#f0fdfc',
+           100: '#ccfbf1',
+           600: '#2563EB',
+           700: '#1d4ed8',
+         },
+       },
+       fontFamily: {
+         sans: ['Inter', 'ui-sans-serif', 'system-ui', 'sans-serif'],
+       },
+     },
+   },
+   plugins: [],
+ }
ui/tsconfig.app.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "compilerOptions": {
+     "target": "ES2020",
+     "useDefineForClassFields": true,
+     "lib": ["ES2020", "DOM", "DOM.Iterable"],
+     "module": "ESNext",
+     "skipLibCheck": true,
+     "moduleResolution": "bundler",
+     "allowImportingTsExtensions": true,
+     "resolveJsonModule": true,
+     "isolatedModules": true,
+     "moduleDetection": "force",
+     "noEmit": true,
+     "jsx": "react-jsx",
+     "strict": true,
+     "noUnusedLocals": false,
+     "noUnusedParameters": false,
+     "noFallthroughCasesInSwitch": true
+   },
+   "include": ["src"]
+ }
ui/tsconfig.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "files": [],
+   "references": [
+     { "path": "./tsconfig.app.json" },
+     { "path": "./tsconfig.node.json" }
+   ]
+ }
ui/tsconfig.node.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "compilerOptions": {
+     "target": "ES2022",
+     "lib": ["ES2023"],
+     "module": "ESNext",
+     "skipLibCheck": true,
+     "moduleResolution": "bundler",
+     "allowImportingTsExtensions": true,
+     "isolatedModules": true,
+     "moduleDetection": "force",
+     "noEmit": true
+   },
+   "include": ["vite.config.ts"]
+ }
ui/vite.config.ts ADDED
@@ -0,0 +1,16 @@
+ import { defineConfig } from 'vite'
+ import react from '@vitejs/plugin-react'
+
+ export default defineConfig({
+   plugins: [react()],
+   server: {
+     port: 5173,
+     proxy: {
+       // Forward all /api/* requests to the FastAPI backend
+       '/api': {
+         target: 'http://localhost:8000',
+         changeOrigin: true,
+       },
+     },
+   },
+ })