PolicyTrace Architecture
PolicyTrace is built as a two-part application:
- A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
- A React frontend that lets a human reviewer inspect every extracted field against the source PDF.
Core Flow
sequenceDiagram
participant User
participant UI as React UI
participant API as FastAPI
participant Docling
participant LLM as Groq LLM
participant Arbiter
participant Prov as Provenance matcher
User->>UI: Upload PDF pack
UI->>API: POST /api/process
API->>Docling: Convert PDFs to Markdown and geometry
API->>API: Mask selected PII
API->>LLM: Classify document type
API->>LLM: Extract typed Golden Record fields
API->>Arbiter: Merge Schedule and Certificate
Arbiter-->>API: Golden Record plus conflicts
API->>Prov: Match fields to PDF text geometry
Prov-->>API: Field-level provenance
API-->>UI: Session ID
UI->>API: GET /api/session/{id}
API-->>UI: Record, provenance, conflicts
Backend Modules
src/agents.py
Responsible for document-level work:
- Convert PDF to Markdown using Docling.
- Build a Docling geometry corpus for provenance.
- Mask selected PII before LLM calls.
- Classify document type.
- Route text to specialist extraction prompts.
- Return a UKMotorGoldenRecord Pydantic model.
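As an illustration of the masking step, here is a minimal sketch using regular expressions. The source does not say which PII categories are masked or how, so the two patterns (email addresses and UK-style phone numbers), the placeholder format, and the function name `mask_pii` are all assumptions.

```python
import re

# Hypothetical patterns: the source does not list which PII is masked.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b0\d{4}\s?\d{6}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or 07700 900123.")
print(masked)
```

Typed placeholders keep the sentence structure intact, so a downstream prompt can still tell that a contact detail was present without seeing its value.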
src/schema.py
Defines the canonical output contract:
UKMotorGoldenRecord
- policy header
- vehicle details
- driver details
- cover and excesses
- financial summary
- additional risk data
- field provenance
- conflict entries
The schema keeps most fields optional because each source document is only partially authoritative.
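The optional-by-default idea can be sketched as follows. To stay dependency-free, this uses standard-library dataclasses as a stand-in for Pydantic, and every class and field name here is hypothetical rather than taken from the real schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VehicleDetailsSketch:
    # Hypothetical fields; each defaults to None so a partial document still loads.
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class GoldenRecordSketch:
    policy_number: Optional[str] = None
    class_of_use: Optional[str] = None
    vehicle: VehicleDetailsSketch = field(default_factory=VehicleDetailsSketch)


# A partial record, as a single source document might produce.
partial = GoldenRecordSketch(policy_number="POL-001", class_of_use="Social, domestic and pleasure")
print(partial.vehicle.make)  # None: this document did not supply the field
```

Because missing values are `None` rather than errors, a later merge step can tell "not stated in this document" apart from "stated and different".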
src/arbiter.py
Merges Schedule and Certificate records using a hierarchy of truth.
Schedule wins for:
- vehicle details
- cover type
- no claims discount
- excess breakdown
- financial summary
- driver DOB, occupation, licence type
Certificate wins for:
- class of use
- driving other cars
- legal driver entitlement details when present
When two documents disagree, the arbiter records a ConflictEntry.
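A minimal sketch of the hierarchy-of-truth merge, assuming flat dictionaries of extracted values. The field names, the shape of the conflict entry, and the `merge` function are illustrative assumptions; the real arbiter works on the nested Golden Record.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical field names for the Certificate-priority group described above.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class ConflictEntrySketch:
    # Stand-in for ConflictEntry; the real attributes are not given in the source.
    field_name: str
    schedule_value: Any
    certificate_value: Any
    chosen_source: str


def merge(schedule: dict, certificate: dict):
    """Merge two partial records, preferring the authoritative source per field."""
    merged, conflicts = {}, []
    for name in sorted(set(schedule) | set(certificate)):
        s, c = schedule.get(name), certificate.get(name)
        preferred = "certificate" if name in CERTIFICATE_WINS else "schedule"
        first, second = (c, s) if preferred == "certificate" else (s, c)
        # Fall back to the other document when the preferred one is silent.
        merged[name] = first if first is not None else second
        if s is not None and c is not None and s != c:
            conflicts.append(ConflictEntrySketch(name, s, c, preferred))
    return merged, conflicts


schedule = {"cover_type": "Comprehensive", "class_of_use": "SDP", "ncd_years": 5}
certificate = {"cover_type": "Third Party", "class_of_use": "SDP plus commuting"}
merged, conflicts = merge(schedule, certificate)
```

In this example the Schedule decides `cover_type`, the Certificate decides `class_of_use`, and both disagreements are recorded for the reviewer instead of being silently resolved.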
src/provenance.py
Builds field-level PDF provenance after extraction.
The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like 15/04/2026 at 00:00 hours or GBP 703.28.
To bridge that gap, prompts ask the LLM to also provide hidden field_citations: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
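The citation-matching idea can be sketched as below. The `TextSpan` structure is a hypothetical stand-in for an entry in the Docling geometry corpus, and whitespace- and case-insensitive substring matching is an assumed strategy; the real matcher may be more tolerant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextSpan:
    # Hypothetical stand-in for one text item with its page and bounding box.
    page: int
    text: str
    bbox: tuple  # (x0, y0, x1, y1)


def normalise(s: str) -> str:
    """Lower-case and collapse whitespace so layout differences do not block a match."""
    return " ".join(s.lower().split())


def locate(citation: str, corpus: list) -> Optional[TextSpan]:
    """Return the first span whose text contains the verbatim citation."""
    needle = normalise(citation)
    if not needle:
        return None
    for span in corpus:
        if needle in normalise(span.text):
            return span
    return None


corpus = [
    TextSpan(1, "Policy start:  15/04/2026 at 00:00 hours", (72.0, 100.0, 320.0, 112.0)),
    TextSpan(2, "Total premium GBP 703.28", (72.0, 400.0, 250.0, 412.0)),
]
field_citations = {"start_date": "15/04/2026 at 00:00 hours", "total_premium": "GBP 703.28"}
provenance = {name: locate(phrase, corpus) for name, phrase in field_citations.items()}
```

Matching on the verbatim phrase rather than the canonical value is what lets an ISO date in the record point back to the raw date string on the page.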
src/api.py
FastAPI service for the review UI:
- GET /api/health
- POST /api/process
- GET /api/session/{id}
- GET /api/pdf/{session_id}/{filename}
- PATCH /api/session/{id}/review
- GET /api/session/{id}/review-state
- DELETE /api/session/{id}
When ui/dist exists, the API also serves the production React app and supports direct /session/{id} refreshes.
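The process-then-fetch pattern behind the session endpoints can be sketched without a web framework. The in-memory `SessionStore` below is an assumption for illustration only; the source does not describe how sessions are actually persisted.

```python
import uuid
from typing import Optional


class SessionStore:
    """In-memory stand-in for session storage (hypothetical, not the real backend)."""

    def __init__(self) -> None:
        self._sessions = {}

    def create(self, record: dict, provenance: dict, conflicts: list) -> str:
        """Store one processed pack and return the ID the UI would later request."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {
            "record": record,
            "provenance": provenance,
            "conflicts": conflicts,
        }
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> bool:
        return self._sessions.pop(session_id, None) is not None


store = SessionStore()
sid = store.create({"cover_type": "Comprehensive"}, {}, [])
session = store.get(sid)
```

Returning only an ID from processing and serving the payload from a separate read keeps a session addressable by URL, which is what makes direct /session/{id} refreshes possible.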
Frontend Modules
ui/src/UploadPage.tsx
Upload screen for PDF packs.
ui/src/SessionPage.tsx
Loads an existing session from the API so sessions can be opened directly from a URL.
ui/src/ReviewDashboard.tsx
Two-column review layout: PDF viewer on the left, Golden Record fields on the right.
ui/src/PDFPane.tsx
Renders PDFs with react-pdf, overlays provenance boxes, and scrolls to selected fields.
ui/src/RecordPane.tsx and ui/src/FieldRow.tsx
Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
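The flattening logic can be sketched as follows. It is written in Python to match the other examples here, although the real components are TypeScript; the dotted-path convention and the `status` key are assumptions.

```python
def flatten(record: dict, prefix: str = "") -> list:
    """Turn a nested record into one row per leaf value, keyed by a dotted path."""
    rows = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten(value, path))
        else:
            # Hypothetical review status; every row starts unreviewed.
            rows.append({"path": path, "value": value, "status": "unreviewed"})
    return rows


rows = flatten({"policy": {"number": "POL-001"}, "vehicle": {"make": "Ford", "year": 2019}})
paths = [row["path"] for row in rows]
```

A stable path per row gives verify, override, and flag actions a single key to attach to, and the same key can look up that field's provenance box.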
Why This Architecture
The system deliberately separates concerns:
- The LLM extracts structured values.
- Pydantic validates the shape.
- The arbiter applies domain-specific source authority.
- Provenance is calculated after extraction instead of trusting the model to invent coordinates.
- The UI keeps humans in the loop where confidence, evidence, or conflicts need review.
That separation is what turns the project from a prompt demo into a deployable workflow.