# PolicyTrace Architecture
PolicyTrace is built as a two-part application:
- A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
- A React frontend that lets a human reviewer inspect every extracted field against the source PDF.
## Core Flow
```mermaid
sequenceDiagram
participant User
participant UI as React UI
participant API as FastAPI
participant Docling
participant LLM as Groq LLM
participant Arbiter
participant Prov as Provenance matcher
User->>UI: Upload PDF pack
UI->>API: POST /api/process
API->>Docling: Convert PDFs to Markdown and geometry
API->>API: Mask selected PII
API->>LLM: Classify document type
API->>LLM: Extract typed Golden Record fields
API->>Arbiter: Merge Schedule and Certificate
Arbiter-->>API: Golden Record plus conflicts
API->>Prov: Match fields to PDF text geometry
Prov-->>API: Field-level provenance
API-->>UI: Session ID
UI->>API: GET /api/session/{id}
API-->>UI: Record, provenance, conflicts
```
## Backend Modules
### `src/agents.py`
Responsible for document-level work:
- Convert PDF to Markdown using Docling.
- Build a Docling geometry corpus for provenance.
- Mask selected PII before LLM calls.
- Classify document type.
- Route text to specialist extraction prompts.
- Return a `UKMotorGoldenRecord` Pydantic model.
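
The masking step can be pictured with a small regex pass. This is only a sketch: the source does not say which PII categories are masked or how, so the patterns, the placeholder format, and the `mask_pii` name are all assumptions for illustration.

```python
import re

# Hypothetical patterns: the document does not list which PII is masked.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each matched span with a bracketed label before any LLM call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com at SW1A 1AA")
```

Masking before the LLM call keeps the selected values out of the prompt, while the original text remains available to the backend for provenance matching.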
### `src/schema.py`
Defines the canonical output contract:
- `UKMotorGoldenRecord`
  - policy header
  - vehicle details
  - driver details
  - cover and excesses
  - financial summary
  - additional risk data
  - field provenance
  - conflict entries
The schema keeps most fields optional because each source document is only partially authoritative.
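
To illustrate the "mostly optional" design, here is a minimal sketch that uses standard-library dataclasses as a stand-in for the project's Pydantic models. The class and field names below are hypothetical; the real `UKMotorGoldenRecord` may be shaped differently.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PolicyHeaderSketch:
    # Every field defaults to None: a single document may not state it.
    policy_number: Optional[str] = None
    start_date: Optional[date] = None


@dataclass
class VehicleSketch:
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class GoldenRecordSketch:
    policy_header: PolicyHeaderSketch = field(default_factory=PolicyHeaderSketch)
    vehicle: VehicleSketch = field(default_factory=VehicleSketch)
    class_of_use: Optional[str] = None


# A record extracted from a Certificate alone might carry class of use
# but say nothing about the vehicle's make.
from_certificate = GoldenRecordSketch(class_of_use="Social, domestic and pleasure")
```

Because missing values are `None` rather than validation errors, a partial record from one document can still be built and later merged with a record from another.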
### `src/arbiter.py`
Merges Schedule and Certificate records using a hierarchy of truth.
Schedule wins for:
- vehicle details
- cover type
- no claims discount
- excess breakdown
- financial summary
- driver DOB, occupation, licence type
Certificate wins for:
- class of use
- driving other cars
- legal driver entitlement details when present
When two documents disagree, the arbiter records a `ConflictEntry`.
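
The hierarchy of truth can be sketched over flat dictionaries. The field names, the `ConflictSketch` shape, and the rule that unlisted fields default to the Schedule are assumptions made for this example, not the actual contents of `src/arbiter.py`.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical field names for the Certificate-wins group.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class ConflictSketch:
    field: str
    schedule_value: Any
    certificate_value: Any
    winner: str


def merge_records(schedule: dict, certificate: dict):
    """Merge two flat records; return (merged, conflicts)."""
    merged: dict = {}
    conflicts: list = []
    for name in sorted(set(schedule) | set(certificate)):
        s_val, c_val = schedule.get(name), certificate.get(name)
        # Assumption: anything not listed for the Certificate goes to the Schedule.
        winner = "certificate" if name in CERTIFICATE_WINS else "schedule"
        preferred, fallback = (c_val, s_val) if winner == "certificate" else (s_val, c_val)
        # Fall back to the other document when the authoritative one is silent.
        merged[name] = preferred if preferred is not None else fallback
        if s_val is not None and c_val is not None and s_val != c_val:
            conflicts.append(ConflictSketch(name, s_val, c_val, winner))
    return merged, conflicts
```

In this sketch a conflict is recorded only when both documents state a value and the values differ; a field present in just one document is merged silently.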
### `src/provenance.py`
Builds field-level PDF provenance after extraction.
The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.
To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
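
A minimal sketch of the matching idea follows, assuming the geometry corpus is a list of text spans with a page number and bounding box. `TextSpan`, `normalise`, and `locate_citation` are hypothetical names, and the real matcher may well be fuzzier than a normalised substring search.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextSpan:
    page: int
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    text: str


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so layout noise does not block a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def locate_citation(citation: str, corpus: List[TextSpan]) -> Optional[TextSpan]:
    """Return the first span whose text contains the verbatim citation."""
    needle = normalise(citation)
    if not needle:
        return None
    for span in corpus:
        if needle in normalise(span.text):
            return span
    return None


corpus = [
    TextSpan(1, (72.0, 100.0, 300.0, 112.0), "Cover starts:  15/04/2026 at 00:00 hours"),
    TextSpan(2, (72.0, 200.0, 200.0, 212.0), "Total premium GBP 703.28"),
]
```

With this corpus, searching for the canonical value `2026-04-15` finds nothing, while the verbatim citation `15/04/2026 at 00:00 hours` resolves to the span on page 1, which is the gap the hidden citations are meant to close.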
### `src/api.py`
FastAPI service for the review UI:
- `GET /api/health`
- `POST /api/process`
- `GET /api/session/{id}`
- `GET /api/pdf/{session_id}/{filename}`
- `PATCH /api/session/{id}/review`
- `GET /api/session/{id}/review-state`
- `DELETE /api/session/{id}`
When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes.
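
Supporting direct `/session/{id}` refreshes usually means falling back to `index.html` for any non-API path that is not a real file, so the client-side router can take over. The helper below is a framework-free sketch of that decision; `resolve_frontend_path` is a hypothetical name and not necessarily how `src/api.py` wires it into FastAPI.

```python
from pathlib import Path
from typing import Optional


def resolve_frontend_path(request_path: str, dist_dir: Path) -> Optional[Path]:
    """Pick the file to serve for a non-API request, or None if not applicable."""
    if not dist_dir.is_dir():
        return None  # no production build, so the API serves JSON only
    if request_path.startswith("/api/"):
        return None  # leave API routes to their own handlers
    root = dist_dir.resolve()
    candidate = (root / request_path.lstrip("/")).resolve()
    # Serve real build assets, but never anything outside the build directory.
    if candidate.is_file() and root in candidate.parents:
        return candidate
    # Anything else (e.g. /session/abc123) is a client-side route.
    return root / "index.html"
```

Returning `index.html` for unknown paths lets a reviewer bookmark or refresh a session URL without the server needing to know the frontend's routes.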
## Frontend Modules
### `ui/src/UploadPage.tsx`
Upload screen for PDF packs.
### `ui/src/SessionPage.tsx`
Loads an existing session from the API so sessions can be opened directly from a URL.
### `ui/src/ReviewDashboard.tsx`
Two-column review layout: PDF viewer on the left, Golden Record fields on the right.
### `ui/src/PDFPane.tsx`
Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields.
### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`
Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
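
The flattening itself lives in the TypeScript frontend; the Python sketch below only illustrates the idea of turning a nested record into one row per leaf field. The dotted path format and the `flatten_record` name are assumptions.

```python
from typing import Any, Dict, List, Tuple


def flatten_record(record: Dict[str, Any], prefix: str = "") -> List[Tuple[str, Any]]:
    """Walk a nested record and return (path, value) rows for each leaf field."""
    rows: List[Tuple[str, Any]] = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten_record(value, path))
        elif isinstance(value, list):
            for index, item in enumerate(value):
                item_path = f"{path}[{index}]"
                if isinstance(item, dict):
                    rows.extend(flatten_record(item, item_path))
                else:
                    rows.append((item_path, item))
        else:
            rows.append((path, value))
    return rows


rows = flatten_record({"policy": {"number": "P1"}, "drivers": [{"name": "A"}]})
```

A stable path per row gives each verify, override, or flag action an unambiguous field to attach to.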
## Why This Architecture
The system deliberately separates concerns:
- The LLM extracts structured values.
- Pydantic validates the shape.
- The arbiter applies domain-specific source authority.
- Provenance is calculated after extraction instead of trusting the model to invent coordinates.
- The UI keeps humans in the loop where confidence, evidence, or conflicts need review.
That separation is what turns the project from a prompt demo into a deployable workflow.