| # PolicyTrace Architecture | |
| PolicyTrace is built as a two-part application: | |
| - A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage. | |
| - A React frontend that lets a human reviewer inspect every extracted field against the source PDF. | |
| ## Core Flow | |
| ```mermaid | |
| sequenceDiagram | |
| participant User | |
| participant UI as React UI | |
| participant API as FastAPI | |
| participant Docling | |
| participant LLM as Groq LLM | |
| participant Arbiter | |
| participant Prov as Provenance matcher | |
| User->>UI: Upload PDF pack | |
| UI->>API: POST /api/process | |
| API->>Docling: Convert PDFs to Markdown and geometry | |
| API->>API: Mask selected PII | |
| API->>LLM: Classify document type | |
| API->>LLM: Extract typed Golden Record fields | |
| API->>Arbiter: Merge Schedule and Certificate | |
| Arbiter-->>API: Golden Record plus conflicts | |
| API->>Prov: Match fields to PDF text geometry | |
| Prov-->>API: Field-level provenance | |
| API-->>UI: Session ID | |
| UI->>API: GET /api/session/{id} | |
| API-->>UI: Record, provenance, conflicts | |
| ``` | |
| ## Backend Modules | |
| ### `src/agents.py` | |
| Responsible for document-level work: | |
| - Convert PDF to Markdown using Docling. | |
| - Build a Docling geometry corpus for provenance. | |
| - Mask selected PII before LLM calls. | |
| - Classify document type. | |
| - Route text to specialist extraction prompts. | |
| - Return a `UKMotorGoldenRecord` Pydantic model. | |
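The masking step above can be pictured as a regex pass that swaps sensitive values for typed placeholders before any text reaches the LLM. This is a minimal sketch only: the document does not say which PII is "selected", so the `PII_PATTERNS` table and the `mask_pii` name are assumptions made for illustration.

```python
import re

# Hypothetical patterns: the actual set of masked PII is not specified here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "UK_PHONE": re.compile(r"(?:\+44\s?|\b0)\d{2,4}[\s-]?\d{3,4}[\s-]?\d{3,4}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or 020 7946 0958")
```

Typed placeholders keep the sentence structure intact, so the extraction prompts still see where a contact detail sat without seeing its value.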
| ### `src/schema.py` | |
| Defines the canonical output contract: | |
| - `UKMotorGoldenRecord` | |
| - policy header | |
| - vehicle details | |
| - driver details | |
| - cover and excesses | |
| - financial summary | |
| - additional risk data | |
| - field provenance | |
| - conflict entries | |
| The schema keeps most fields optional because each source document is only partially authoritative. | |
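To illustrate why optional fields matter, here is a small sketch of a partially filled record. It uses standard-library dataclasses in place of Pydantic to stay dependency-free, and every class and field name below (`GoldenRecordSketch`, `registration`, `class_of_use`, and so on) is hypothetical rather than taken from `src/schema.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VehicleDetails:
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class CoverDetails:
    cover_type: Optional[str] = None
    class_of_use: Optional[str] = None


@dataclass
class GoldenRecordSketch:
    policy_number: Optional[str] = None
    vehicle: VehicleDetails = field(default_factory=VehicleDetails)
    cover: CoverDetails = field(default_factory=CoverDetails)


# Suppose one document yields class of use but says nothing about cover type.
# The record is still valid; the gaps stay None until another source fills them.
partial = GoldenRecordSketch(
    policy_number="POL-0001",  # made-up value
    cover=CoverDetails(class_of_use="Social, domestic and pleasure"),
)
```

Because every leaf defaults to `None`, a record built from a single document can be constructed without placeholder values, and the arbiter can tell "absent" apart from "extracted as empty".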
| ### `src/arbiter.py` | |
| Merges Schedule and Certificate records using a hierarchy of truth. | |
| Schedule wins for: | |
| - vehicle details | |
| - cover type | |
| - no claims discount | |
| - excess breakdown | |
| - financial summary | |
| - driver DOB, occupation, licence type | |
| Certificate wins for: | |
| - class of use | |
| - driving other cars | |
| - legal driver entitlement details when present | |
| When two documents disagree, the arbiter records a `ConflictEntry`. | |
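The precedence rules above can be sketched as a field-by-field merge. The following is an illustration under assumptions, not the code in `src/arbiter.py`: records are modelled as flat dictionaries, the field names are hypothetical, and `Conflict` is a stand-in for the project's `ConflictEntry`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Hypothetical flat field names; every other field defaults to the Schedule.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class Conflict:  # stand-in for ConflictEntry
    field_name: str
    schedule_value: Any
    certificate_value: Any
    chosen_source: str


def merge(schedule: dict, certificate: dict) -> tuple[dict, list[Conflict]]:
    """Merge two extracted records and log every disagreement."""
    merged: dict = {}
    conflicts: list[Conflict] = []
    for name in sorted(schedule.keys() | certificate.keys()):
        s, c = schedule.get(name), certificate.get(name)
        preferred = "certificate" if name in CERTIFICATE_WINS else "schedule"
        other = "schedule" if preferred == "certificate" else "certificate"
        first, second = (c, s) if preferred == "certificate" else (s, c)
        # Fall back to the other document when the authoritative one is silent.
        chosen = preferred if first is not None else other
        merged[name] = first if first is not None else second
        if s is not None and c is not None and s != c:
            conflicts.append(Conflict(name, s, c, chosen))
    return merged, conflicts


merged, conflicts = merge(
    {"cover_type": "Comprehensive", "class_of_use": "SDP"},
    {"cover_type": "Third Party", "class_of_use": "SDP and commuting",
     "driving_other_cars": True},
)
```

In this example the merged record takes `cover_type` from the Schedule and `class_of_use` from the Certificate, and both disagreements are logged so a reviewer can see the losing value as well as the winning one.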
| ### `src/provenance.py` | |
| Builds field-level PDF provenance after extraction. | |
| The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`. | |
| To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry. | |
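A rough sketch of the citation-matching idea follows. The `TextCell` shape is a hypothetical simplification of Docling's text geometry, and plain substring matching after whitespace normalisation is an assumed strategy; the real matcher may be fuzzier.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextCell:  # hypothetical shape for one piece of positioned PDF text
    page: int
    bbox: tuple[float, float, float, float]  # left, top, right, bottom
    text: str


def normalise(s: str) -> str:
    """Lower-case and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", s).strip().lower()


def locate(citation: str, cells: list[TextCell]) -> Optional[TextCell]:
    """Return the first cell whose text contains the verbatim citation."""
    needle = normalise(citation)
    if not needle:
        return None
    for cell in cells:
        if needle in normalise(cell.text):
            return cell
    return None


cells = [
    TextCell(1, (72.0, 140.0, 320.0, 152.0), "Cover starts 15/04/2026 at 00:00 hours"),
    TextCell(2, (72.0, 410.0, 260.0, 422.0), "Total premium:  GBP 703.28"),
]
hit = locate("GBP 703.28", cells)
```

Searching for the canonical value `703.28` alone might also succeed here, but a canonical date such as `2026-04-15` would never appear in the PDF text, which is why the verbatim citation is the thing that gets matched.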
| ### `src/api.py` | |
| FastAPI service for the review UI: | |
| - `GET /api/health` | |
| - `POST /api/process` | |
| - `GET /api/session/{id}` | |
| - `GET /api/pdf/{session_id}/{filename}` | |
| - `PATCH /api/session/{id}/review` | |
| - `GET /api/session/{id}/review-state` | |
| - `DELETE /api/session/{id}` | |
| When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes. | |
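For orientation, a client for two of the session endpoints might look like the sketch below, using only the standard library. The base URL is an assumption (a local development server), and the code assumes the session endpoint returns JSON; the response fields are not documented here, so the parsed body is returned as-is.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: local development server


def session_url(session_id: str, suffix: str = "") -> str:
    """Build a session endpoint URL, escaping the ID for use in a path."""
    return f"{BASE_URL}/api/session/{quote(session_id, safe='')}{suffix}"


def get_session(session_id: str) -> dict:
    """GET /api/session/{id} and parse the body, assumed to be JSON."""
    with urlopen(session_url(session_id)) as resp:
        return json.load(resp)


def delete_session(session_id: str) -> int:
    """DELETE /api/session/{id} and return the HTTP status code."""
    req = Request(session_url(session_id), method="DELETE")
    with urlopen(req) as resp:
        return resp.status
```

Calling `get_session("abc123")` against a running instance would fetch the record, provenance, and conflicts for that session in one request.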
| ## Frontend Modules | |
| ### `ui/src/UploadPage.tsx` | |
| Upload screen for PDF packs. | |
| ### `ui/src/SessionPage.tsx` | |
| Loads an existing session from the API so sessions can be opened directly from a URL. | |
| ### `ui/src/ReviewDashboard.tsx` | |
| Two-column review layout: PDF viewer on the left, Golden Record fields on the right. | |
| ### `ui/src/PDFPane.tsx` | |
| Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields. | |
| ### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx` | |
| Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions. | |
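The flattening step can be illustrated as a recursive walk that turns a nested record into one row per leaf value. The frontend does this in TypeScript; the sketch below is in Python for consistency with the backend examples, and the dotted-path format and sample field names are assumptions.

```python
from __future__ import annotations

from typing import Any, Iterator


def flatten(record: dict[str, Any], prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf in a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, list):
            # List items get an index so each driver or excess has its own row.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    yield from flatten(item, f"{path}[{i}]")
                else:
                    yield f"{path}[{i}]", item
        else:
            yield path, value


rows = list(flatten({
    "vehicle": {"registration": "AB12 CDE"},
    "drivers": [{"name": "A. Driver"}],
    "premium": 703.28,
}))
```

A stable path per row gives the verify, override, and flag actions a key that can also be used to look up the matching provenance box.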
| ## Why This Architecture | |
| The system deliberately separates concerns: | |
| - The LLM extracts structured values. | |
| - Pydantic validates the shape. | |
| - The arbiter applies domain-specific source authority. | |
| - Provenance is calculated after extraction instead of trusting the model to invent coordinates. | |
| - The UI keeps humans in the loop where confidence, evidence, or conflicts need review. | |
| That separation is what turns the project from a prompt demo into a deployable workflow. | |