# PolicyTrace Architecture
PolicyTrace is built as a two-part application:
- A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
- A React frontend that lets a human reviewer inspect every extracted field against the source PDF.
## Core Flow
```mermaid
sequenceDiagram
participant User
participant UI as React UI
participant API as FastAPI
participant Docling
participant LLM as Groq LLM
participant Arbiter
participant Prov as Provenance matcher
User->>UI: Upload PDF pack
UI->>API: POST /api/process
API->>Docling: Convert PDFs to Markdown and geometry
API->>API: Mask selected PII
API->>LLM: Classify document type
API->>LLM: Extract typed Golden Record fields
API->>Arbiter: Merge Schedule and Certificate
Arbiter-->>API: Golden Record plus conflicts
API->>Prov: Match fields to PDF text geometry
Prov-->>API: Field-level provenance
API-->>UI: Session ID
UI->>API: GET /api/session/{id}
API-->>UI: Record, provenance, conflicts
```
## Backend Modules
### `src/agents.py`
Responsible for document-level work:
- Convert PDF to Markdown using Docling.
- Build a Docling geometry corpus for provenance.
- Mask selected PII before LLM calls.
- Classify document type.
- Route text to specialist extraction prompts.
- Return a `UKMotorGoldenRecord` Pydantic model.
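
As an illustration of the masking step, here is a minimal sketch of regex-based PII masking applied to the converted text before it reaches the LLM. The source does not say which PII categories are masked or how, so the `PII_PATTERNS` table, the `mask_pii` name, and the placeholder format are all assumptions made for this example.

```python
import re

# Hypothetical patterns: which PII PolicyTrace actually masks is not documented here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "UK_PHONE": re.compile(r"(?:\+44\s?|\b0)\d{4}\s?\d{6}\b"),
}


def mask_pii(text: str) -> str:
    """Replace every match with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or 07700 900123 about policy PT-001.")
print(masked)
```

Masking with a labelled placeholder rather than deleting the text keeps the sentence structure intact, which may help the downstream extraction prompts.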
### `src/schema.py`
Defines the canonical output contract:
- `UKMotorGoldenRecord`
  - policy header
  - vehicle details
  - driver details
  - cover and excesses
  - financial summary
  - additional risk data
  - field provenance
  - conflict entries
The schema keeps most fields optional because each source document covers, and is authoritative for, only part of the record.
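
To show what mostly-optional fields look like in practice, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for the Pydantic models so that it runs without dependencies. The class and field names (`GoldenRecordSketch`, `registration`, `class_of_use`, and so on) are hypothetical and not taken from `src/schema.py`.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Optional


@dataclass
class VehicleDetails:
    # Hypothetical fields; every one defaults to None.
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class CoverDetails:
    cover_type: Optional[str] = None
    class_of_use: Optional[str] = None
    start_date: Optional[date] = None


@dataclass
class GoldenRecordSketch:
    vehicle: VehicleDetails = field(default_factory=VehicleDetails)
    cover: CoverDetails = field(default_factory=CoverDetails)


# A record extracted from one document alone is still valid:
# anything that document does not state simply stays None.
certificate_record = GoldenRecordSketch(
    cover=CoverDetails(class_of_use="Social, domestic and pleasure")
)
print(asdict(certificate_record))
```

Because unset fields are `None` rather than missing, a later merge step can tell "this document was silent" apart from "this document disagreed".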
### `src/arbiter.py`
Merges Schedule and Certificate records using a hierarchy of truth.
Schedule wins for:
- vehicle details
- cover type
- no claims discount
- excess breakdown
- financial summary
- driver DOB, occupation, licence type
Certificate wins for:
- class of use
- driving other cars
- legal driver entitlement details when present
When two documents disagree, the arbiter records a `ConflictEntry`.
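
The hierarchy of truth can be sketched as a per-field precedence rule. This is an illustration only: it works on flat dictionaries rather than the nested record, the field names are hypothetical, the attributes on `ConflictEntry` are assumed, and falling back to the other document when the authoritative one is silent is a design assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical flat field names for which the Certificate is authoritative.
# Everything else defaults to the Schedule.
CERTIFICATE_WINS = {"class_of_use", "driving_other_cars"}


@dataclass
class ConflictEntry:
    # Assumed shape; the real model lives in src/schema.py.
    field_name: str
    schedule_value: Any
    certificate_value: Any
    winner: str


def arbitrate(schedule: dict, certificate: dict):
    """Merge two flat records and log every disagreement."""
    merged, conflicts = {}, []
    for name in sorted(set(schedule) | set(certificate)):
        s_val, c_val = schedule.get(name), certificate.get(name)
        winner = "certificate" if name in CERTIFICATE_WINS else "schedule"
        first, second = (c_val, s_val) if winner == "certificate" else (s_val, c_val)
        # Assumption: use the other document when the authoritative one is silent.
        merged[name] = first if first is not None else second
        if s_val is not None and c_val is not None and s_val != c_val:
            conflicts.append(ConflictEntry(name, s_val, c_val, winner))
    return merged, conflicts


schedule = {"cover_type": "Comprehensive", "class_of_use": "SDP", "ncd_years": 5}
certificate = {"cover_type": "Third Party", "class_of_use": "SDP and commuting"}
merged, conflicts = arbitrate(schedule, certificate)
print(merged)
print(conflicts)
```

Recording the losing value alongside the winner means a reviewer can see the disagreement instead of having it silently resolved.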
### `src/provenance.py`
Builds field-level PDF provenance after extraction.
The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.
To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
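
A minimal sketch of the matching idea follows, assuming the geometry corpus can be reduced to a list of text spans with a page number and bounding box. The `TextSpan` type, the `locate` helper, the field keys, and the sample lines are invented for illustration and are not the Docling API or the project's real data.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextSpan:
    # Assumed corpus item: raw PDF text plus where it sits on the page.
    text: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1)


def normalise(s: str) -> str:
    """Collapse whitespace and ignore case so small layout differences still match."""
    return re.sub(r"\s+", " ", s).strip().casefold()


def locate(citation: str, corpus: list) -> Optional[TextSpan]:
    """Return the first span whose text contains the verbatim citation, if any."""
    needle = normalise(citation)
    if not needle:
        return None
    for span in corpus:
        if needle in normalise(span.text):
            return span
    return None


# Made-up sample lines built around the raw phrases mentioned above.
corpus = [
    TextSpan("Period of cover: 15/04/2026 at 00:00  hours", 1, (72.0, 640.0, 410.0, 654.0)),
    TextSpan("Total premium GBP 703.28", 2, (72.0, 300.0, 260.0, 314.0)),
]
field_citations = {
    "policy.start_date": "15/04/2026 at 00:00 hours",
    "financial.total": "GBP 703.28",
}
provenance = {name: locate(cite, corpus) for name, cite in field_citations.items()}
print(provenance)
```

A citation that cannot be found yields `None`, which gives the review UI a natural signal that a field has no supporting evidence.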
### `src/api.py`
FastAPI service for the review UI:
- `GET /api/health`
- `POST /api/process`
- `GET /api/session/{id}`
- `GET /api/pdf/{session_id}/{filename}`
- `PATCH /api/session/{id}/review`
- `GET /api/session/{id}/review-state`
- `DELETE /api/session/{id}`
When `ui/dist` exists, the API also serves the production React app and supports direct `/session/{id}` refreshes.
## Frontend Modules
### `ui/src/UploadPage.tsx`
Upload screen for PDF packs.
### `ui/src/SessionPage.tsx`
Loads an existing session from the API so sessions can be opened directly from a URL.
### `ui/src/ReviewDashboard.tsx`
Two-column review layout: PDF viewer on the left, Golden Record fields on the right.
### `ui/src/PDFPane.tsx`
Renders PDFs with `react-pdf`, overlays provenance boxes, and scrolls to selected fields.
### `ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`
Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
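
The flattening idea can be sketched as a recursive walk that turns a nested record into `(path, value)` rows. The components themselves are TypeScript. This sketch is in Python, like the backend, purely to show the idea. The dotted-path format and the sample record are assumptions.

```python
def flatten(record, prefix=""):
    """Walk dicts and lists, returning one (path, value) row per leaf."""
    rows = []
    if isinstance(record, dict):
        for key, value in record.items():
            rows.extend(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            rows.extend(flatten(value, f"{prefix}[{index}]"))
    else:
        # Leaf value (including None): one reviewable row.
        rows.append((prefix, record))
    return rows


# Hypothetical record shape with made-up values.
record = {
    "vehicle": {"registration": "AB12 CDE"},
    "drivers": [{"name": "A. Driver", "licence_type": None}],
}
rows = flatten(record)
for path, value in rows:
    print(path, "=", value)
```

Keeping `None` leaves as rows means unextracted fields still appear in the list, so a reviewer can fill them in rather than never seeing them.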
## Why This Architecture
The system deliberately separates concerns:
- The LLM extracts structured values.
- Pydantic validates the shape.
- The arbiter applies domain-specific source authority.
- Provenance is calculated after extraction instead of trusting the model to invent coordinates.
- The UI keeps humans in the loop where confidence, evidence, or conflicts need review.
That separation is what turns the project from a prompt demo into a deployable workflow.