# PolicyTrace Architecture

PolicyTrace is built as a two-part application:

- A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
- A React frontend that lets a human reviewer inspect every extracted field against the source PDF.

## Core Flow

```mermaid
sequenceDiagram
    participant User
    participant UI as React UI
    participant API as FastAPI
    participant Docling
    participant LLM as Groq LLM
    participant Arbiter
    participant Prov as Provenance matcher

    User->>UI: Upload PDF pack
    UI->>API: POST /api/process
    API->>Docling: Convert PDFs to Markdown and geometry
    API->>API: Mask selected PII
    API->>LLM: Classify document type
    API->>LLM: Extract typed Golden Record fields
    API->>Arbiter: Merge Schedule and Certificate
    Arbiter-->>API: Golden Record plus conflicts
    API->>Prov: Match fields to PDF text geometry
    Prov-->>API: Field-level provenance
    API-->>UI: Session ID
    UI->>API: GET /api/session/{id}
    API-->>UI: Record, provenance, conflicts
```

## Backend Modules

### `src/agents.py`

Responsible for document-level work:

- Convert PDF to Markdown using Docling.
- Build a Docling geometry corpus for provenance.
- Mask selected PII before LLM calls.
- Classify document type.
- Route text to specialist extraction prompts.
- Return a `UKMotorGoldenRecord` Pydantic model.
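
The masking step can be pictured as a small text filter applied before any prompt is built. The following is a minimal sketch, not the project's implementation: the function name `mask_pii`, the two patterns, and the placeholder tokens are all assumptions, and the real module may mask different PII types.

```python
import re

# Hypothetical patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
UK_MOBILE = re.compile(r"(?:\+44\s?|0)7\d{3}\s?\d{6}")


def mask_pii(text: str) -> str:
    """Replace selected PII with placeholder tokens before an LLM call."""
    text = EMAIL.sub("[EMAIL]", text)
    text = UK_MOBILE.sub("[PHONE]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or 07700 900123")
print(masked)
```

Only the text sent to the LLM needs masking in a design like this; the geometry corpus used for provenance can stay untouched.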

### `src/schema.py`

Defines the canonical output contract:

- `UKMotorGoldenRecord`
- policy header
- vehicle details
- driver details
- cover and excesses
- financial summary
- additional risk data
- field provenance
- conflict entries

The schema keeps most fields optional because each source document is only partially authoritative.
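
To illustrate why optional fields matter, here is a rough sketch in which standard-library dataclasses stand in for the Pydantic models. The class and field names (`VehicleDetails`, `CoverDetails`, `GoldenRecordSketch`, `registration`, and so on) are hypothetical and not taken from `src/schema.py`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VehicleDetails:
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class CoverDetails:
    cover_type: Optional[str] = None
    class_of_use: Optional[str] = None


@dataclass
class GoldenRecordSketch:
    # Every section defaults to "nothing known yet".
    vehicle: VehicleDetails = field(default_factory=VehicleDetails)
    cover: CoverDetails = field(default_factory=CoverDetails)


# A record extracted from one document may fill only a few fields.
certificate_record = GoldenRecordSketch(
    cover=CoverDetails(class_of_use="Social, domestic and pleasure")
)
print(certificate_record.vehicle.registration)  # field left unset
```

Because missing values are represented as `None` rather than rejected, a partial record from one document can still be validated and later merged with another.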

### `src/arbiter.py`

Merges Schedule and Certificate records using a hierarchy of truth.

Schedule wins for:

- vehicle details
- cover type
- no claims discount
- excess breakdown
- financial summary
- driver date of birth, occupation, and licence type

Certificate wins for:

- class of use
- driving other cars
- legal driver entitlement details when present

When two documents disagree, the arbiter records a `ConflictEntry`.
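
The hierarchy of truth can be sketched as a per-field source preference. This is an illustrative sketch under stated assumptions: records are plain dictionaries, the field names are hypothetical, `Conflict` is a simplified stand-in for the project's `ConflictEntry`, and treating the Schedule as the default for unlisted fields is an assumption.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical field names for which the Certificate is authoritative.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class Conflict:
    field: str
    schedule_value: Any
    certificate_value: Any
    chosen_source: str


def merge_records(schedule: dict, certificate: dict):
    """Merge two partial records and report fields where they disagree."""
    merged, conflicts = {}, []
    for name in sorted(set(schedule) | set(certificate)):
        values = {"schedule": schedule.get(name), "certificate": certificate.get(name)}
        preferred = "certificate" if name in CERTIFICATE_WINS else "schedule"
        fallback = "schedule" if preferred == "certificate" else "certificate"
        # Use the authoritative source, falling back when it has no value.
        chosen = preferred if values[preferred] is not None else fallback
        merged[name] = values[chosen]
        both_present = all(v is not None for v in values.values())
        if both_present and values["schedule"] != values["certificate"]:
            conflicts.append(
                Conflict(name, values["schedule"], values["certificate"], chosen)
            )
    return merged, conflicts


schedule = {
    "cover_type": "Comprehensive",
    "class_of_use": "Social, domestic and pleasure",
    "no_claims_discount_years": 5,
}
certificate = {
    "cover_type": "Third Party",
    "class_of_use": "Social, domestic, pleasure and commuting",
}
merged, conflicts = merge_records(schedule, certificate)
for conflict in conflicts:
    print(conflict.field, "->", conflict.chosen_source)
```

Recording the conflict even when the hierarchy resolves it keeps the disagreement visible to the human reviewer instead of silently discarding one value.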

### `src/provenance.py`

Builds field-level PDF provenance after extraction.

The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.

To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
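
The citation-matching idea can be sketched as a whitespace-insensitive substring search over text spans that carry page and bounding-box data. The `TextSpan` type, `locate_citation` function, field names, and sample spans below are assumptions for illustration; the actual Docling geometry structures and matching logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextSpan:
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so minor layout noise still matches."""
    return " ".join(text.split()).casefold()


def locate_citation(citation: str, spans: list) -> Optional[TextSpan]:
    """Return the first span whose text contains the verbatim citation."""
    needle = normalise(citation)
    if not needle:
        return None
    for span in spans:
        if needle in normalise(span.text):
            return span
    return None


# Made-up geometry corpus for one page.
spans = [
    TextSpan(1, (72.0, 100.0, 300.0, 112.0), "Total premium: GBP 703.28"),
    TextSpan(1, (72.0, 120.0, 400.0, 132.0), "Cover starts: 15/04/2026  at 00:00 hours"),
]
field_citations = {"start_date": "15/04/2026 at 00:00 hours"}

hit = locate_citation(field_citations["start_date"], spans)
miss = locate_citation("2026-04-15", spans)  # canonical value, not in the PDF text
print(hit.bbox if hit else None, miss)
```

The canonical ISO date finds nothing, while the verbatim citation resolves to a bounding box, which is the gap the hidden citations are meant to close.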

### `src/api.py`

FastAPI service for the review UI:

- `GET /api/health`
- `POST /api/process`
- `GET /api/session/{id}`
- `GET /api/pdf/{session_id}/{filename}`
- `PATCH /api/session/{id}/review`
- `GET /api/session/{id}/review-state`
- `DELETE /api/session/{id}`

When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes.

## Frontend Modules

### `ui/src/UploadPage.tsx`

Upload screen for PDF packs.

### `ui/src/SessionPage.tsx`

Loads an existing session from the API so sessions can be opened directly from a URL.

### `ui/src/ReviewDashboard.tsx`

Two-column review layout: PDF viewer on the left, Golden Record fields on the right.

### `ui/src/PDFPane.tsx`

Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields.

### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`

Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
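
The flattening step amounts to walking the nested record and emitting one row per leaf value. The components themselves are TypeScript; the Python below only illustrates the transformation, and the function name, dotted-path convention, and sample record are assumptions.

```python
def flatten_record(record: dict, prefix: str = "") -> list:
    """Turn a nested record into (dotted_path, value) rows, one per leaf."""
    rows = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as vehicle or cover details.
            rows.extend(flatten_record(value, path))
        else:
            rows.append((path, value))
    return rows


rows = flatten_record(
    {
        "policy_number": "X1",
        "vehicle": {"make": "Ford", "registration": None},
    }
)
for path, value in rows:
    print(path, "=", value)
```

Keeping unset leaves as rows means a reviewer can still see, and fill in, fields the extraction left empty.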

## Why This Architecture

The system deliberately separates concerns:

- The LLM extracts structured values.
- Pydantic validates the shape.
- The arbiter applies domain-specific source authority.
- Provenance is calculated after extraction instead of trusting the model to invent coordinates.
- The UI keeps humans in the loop where confidence, evidence, or conflicts need review.

That separation is what turns the project from a prompt demo into a deployable workflow.