
teja141290

Deploy PolicyTrace Hugging Face Space

be54038 5 days ago

PolicyTrace Architecture

PolicyTrace is built as a two-part application:

A Python backend that performs PDF conversion, extraction, arbitration, provenance matching, and session storage.
A React frontend that lets a human reviewer inspect every extracted field against the source PDF.

Core Flow

sequenceDiagram
    participant User
    participant UI as React UI
    participant API as FastAPI
    participant Docling
    participant LLM as Groq LLM
    participant Arbiter
    participant Prov as Provenance matcher

    User->>UI: Upload PDF pack
    UI->>API: POST /api/process
    API->>Docling: Convert PDFs to Markdown and geometry
    API->>API: Mask selected PII
    API->>LLM: Classify document type
    API->>LLM: Extract typed Golden Record fields
    API->>Arbiter: Merge Schedule and Certificate
    Arbiter-->>API: Golden Record plus conflicts
    API->>Prov: Match fields to PDF text geometry
    Prov-->>API: Field-level provenance
    API-->>UI: Session ID
    UI->>API: GET /api/session/{id}
    API-->>UI: Record, provenance, conflicts

Backend Modules

`src/agents.py`

Responsible for document-level work:

Convert PDF to Markdown using Docling.
Build a Docling geometry corpus for provenance.
Mask selected PII before LLM calls.
Classify document type.
Route text to specialist extraction prompts.
Return a UKMotorGoldenRecord Pydantic model.
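
As an illustration of the masking step listed above, here is a minimal sketch that swaps PII for numbered placeholders before text is sent to an LLM. The source does not say which PII categories are masked or how, so the patterns, the `mask_pii` name, and the placeholder format are all assumptions.

```python
import re

# Hypothetical patterns: the categories masked by PolicyTrace are not documented here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+44\s?|0)\d{4}\s?\d{6}(?!\d)"),
}


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder.

    Returns the masked text plus a placeholder -> original mapping, so the
    original values can be restored after the LLM call if needed.
    """
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token

        text = pattern.sub(substitute, text)
    return text, mapping


masked, mapping = mask_pii("Contact jane.doe@example.com at SW1A 1AA or 07700 900123.")
print(masked)
```

Keeping the mapping on the backend means the unmasked values never reach the model, while the pipeline can still restore them afterwards.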

`src/schema.py`

Defines the canonical output contract:

UKMotorGoldenRecord
policy header
vehicle details
driver details
cover and excesses
financial summary
additional risk data
field provenance
conflict entries

The schema keeps most fields optional because each source document is only partially authoritative.
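
To make the "mostly optional" design concrete, the sketch below builds a partial record the way a single source document might populate it, then lists what is still missing. It uses standard-library dataclasses as a stand-in for Pydantic so that it runs without dependencies. The class and field names (`GoldenRecordSketch`, `policy_number`, and so on) are hypothetical and not taken from `src/schema.py`.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from datetime import date
from typing import Optional


@dataclass
class PolicyHeader:
    policy_number: Optional[str] = None
    start_date: Optional[date] = None


@dataclass
class VehicleDetails:
    registration: Optional[str] = None
    make: Optional[str] = None


@dataclass
class GoldenRecordSketch:
    # Every leaf defaults to None so a record built from one document is still valid.
    policy_header: PolicyHeader = field(default_factory=PolicyHeader)
    vehicle: VehicleDetails = field(default_factory=VehicleDetails)
    class_of_use: Optional[str] = None


def missing_fields(record, prefix: str = "") -> list[str]:
    """Return dotted paths of all leaf fields that are still None."""
    missing: list[str] = []
    for f in fields(record):
        value = getattr(record, f.name)
        path = f"{prefix}{f.name}"
        if is_dataclass(value):
            missing.extend(missing_fields(value, prefix=f"{path}."))
        elif value is None:
            missing.append(path)
    return missing


# A Certificate-style record: it knows the policy number and class of use only.
certificate = GoldenRecordSketch(
    policy_header=PolicyHeader(policy_number="ABC123"),
    class_of_use="Social, domestic and pleasure",
)
print(missing_fields(certificate))
```

A partial record like this is valid on its own, and the gaps are what a later merge with the other document is expected to fill.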

`src/arbiter.py`

Merges Schedule and Certificate records using a hierarchy of truth: for each field, one document type is treated as the authoritative source.

Schedule wins for:

vehicle details
cover type
no claims discount
excess breakdown
financial summary
driver DOB, occupation, licence type

Certificate wins for:

class of use
driving other cars
legal driver entitlement details when present

When the two documents disagree on a field, the arbiter records a ConflictEntry.
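
The merge rule can be sketched as a per-field authority lookup. This is a simplified illustration over flat dictionaries, not the code in `src/arbiter.py`. The field names, the `ConflictEntry` attributes, and the choice to default unlisted fields to the Schedule are assumptions.

```python
from dataclasses import dataclass
from typing import Any

# Which document wins per field, following the lists above. Field names are hypothetical.
FIELD_AUTHORITY = {
    "cover_type": "schedule",
    "no_claims_discount_years": "schedule",
    "total_premium": "schedule",
    "class_of_use": "certificate",
    "driving_other_cars": "certificate",
}


@dataclass
class ConflictEntry:
    field_name: str
    schedule_value: Any
    certificate_value: Any
    chosen_source: str


def merge_records(schedule: dict, certificate: dict) -> tuple[dict, list[ConflictEntry]]:
    """Merge two flat records, logging a conflict wherever both sides disagree."""
    merged: dict = {}
    conflicts: list[ConflictEntry] = []
    for name in sorted(set(schedule) | set(certificate)):
        s_val, c_val = schedule.get(name), certificate.get(name)
        if s_val is None or c_val is None:
            # Only one document has a value, so there is nothing to arbitrate.
            merged[name] = s_val if s_val is not None else c_val
            continue
        # Assumption: fields without an explicit rule fall back to the Schedule.
        winner = FIELD_AUTHORITY.get(name, "schedule")
        merged[name] = c_val if winner == "certificate" else s_val
        if s_val != c_val:
            conflicts.append(ConflictEntry(name, s_val, c_val, winner))
    return merged, conflicts


schedule = {"cover_type": "Comprehensive", "class_of_use": "SDP", "total_premium": 703.28}
certificate = {"cover_type": "Third Party", "class_of_use": "SDP + commuting", "driving_other_cars": True}
merged, conflicts = merge_records(schedule, certificate)
print(merged)
print([c.field_name for c in conflicts])
```

Recording the losing value alongside the winner keeps the disagreement visible to the reviewer instead of silently discarding it.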

`src/provenance.py`

Builds field-level PDF provenance after extraction.

The LLM returns canonical values, such as ISO dates and numeric amounts, but PDF text usually contains raw phrases like `15/04/2026 at 00:00 hours` or `GBP 703.28`.

To bridge that gap, prompts ask the LLM to also provide hidden `field_citations`: verbatim phrases copied from the source document. These citations are excluded from the final serialised record but used for matching against Docling text geometry.
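
A minimal sketch of the citation-matching idea follows. It assumes the geometry corpus can be reduced to a list of text spans with a page number and bounding box. `TextSpan`, `locate_citation`, and the whitespace- and case-insensitive substring match are illustrative choices, not Docling's API or the actual logic in `src/provenance.py`.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextSpan:
    page: int
    bbox: tuple[float, float, float, float]  # assumed (left, top, right, bottom)
    text: str


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so PDF layout quirks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def locate_citation(citation: str, corpus: list[TextSpan]) -> Optional[TextSpan]:
    """Return the first span whose text contains the verbatim citation, if any."""
    needle = normalise(citation)
    if not needle:
        return None
    for span in corpus:
        if needle in normalise(span.text):
            return span
    return None


corpus = [
    TextSpan(1, (72.0, 700.0, 300.0, 712.0), "Period of cover from  15/04/2026 at 00:00 hours"),
    TextSpan(2, (72.0, 540.0, 260.0, 552.0), "Total premium: GBP 703.28"),
]
# Hypothetical citations keyed by field name, as the LLM might return them.
field_citations = {"start_date": "15/04/2026 at 00:00 hours", "total_premium": "GBP 703.28"}
provenance = {name: locate_citation(text, corpus) for name, text in field_citations.items()}
print({name: (span.page, span.bbox) for name, span in provenance.items() if span})
```

Because coordinates come from the parsed PDF and not from the model, a citation that cannot be found yields no provenance at all, never a fabricated box.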

`src/api.py`

FastAPI service for the review UI:

GET /api/health
POST /api/process
GET /api/session/{id}
GET /api/pdf/{session_id}/{filename}
PATCH /api/session/{id}/review
GET /api/session/{id}/review-state
DELETE /api/session/{id}

When ui/dist exists, the API also serves the production React app and supports direct /session/{id} refreshes.
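
The refresh behaviour can be pictured as a three-way routing decision. The sketch below models only that decision as a pure function. It is not the FastAPI code in `src/api.py`, and `resolve_request` and its return values are hypothetical.

```python
def resolve_request(path: str, dist_files: set[str]) -> str:
    """Decide how a request path is served.

    Returns "api" for API routes, the relative file path for a built asset,
    and "index.html" for anything else so the React router can handle it.
    """
    if path == "/api" or path.startswith("/api/"):
        return "api"
    relative = path.lstrip("/")
    if relative in dist_files:
        return relative
    # Client-side routes such as /session/{id} are not files on disk,
    # so they fall back to the app's entry page.
    return "index.html"


dist = {"index.html", "assets/app.js"}  # assumed contents of ui/dist
for request_path in ["/api/health", "/assets/app.js", "/session/abc123", "/"]:
    print(request_path, "->", resolve_request(request_path, dist))
```

The fallback is what lets a reviewer bookmark or refresh a session URL: the server returns the app shell and the frontend loads the session from the API.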

Frontend Modules

`ui/src/UploadPage.tsx`

Upload screen for PDF packs.

`ui/src/SessionPage.tsx`

Loads an existing session from the API so sessions can be opened directly from a URL.

`ui/src/ReviewDashboard.tsx`

Two-column review layout: PDF viewer on the left, Golden Record fields on the right.

`ui/src/PDFPane.tsx`

Renders PDFs with react-pdf, overlays provenance boxes, and scrolls to selected fields.

`ui/src/RecordPane.tsx` and `ui/src/FieldRow.tsx`

Flatten the nested Golden Record into reviewable field rows with verify, override, and flag actions.
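
The flattening step can be illustrated as a recursive walk that turns nested objects and lists into path/value rows. The sketch is written in Python, the backend's language, although the components themselves are TypeScript. The dotted-path format and the sample field names are assumptions.

```python
from typing import Any, Iterator


def flatten_record(record: dict[str, Any], prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (path, value) rows for every leaf in a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_record(value, path)
        elif isinstance(value, list):
            # Index list items so each driver, excess, etc. gets its own rows.
            for index, item in enumerate(value):
                item_path = f"{path}[{index}]"
                if isinstance(item, dict):
                    yield from flatten_record(item, item_path)
                else:
                    yield item_path, item
        else:
            yield path, value


record = {
    "policy": {"number": "ABC123"},
    "drivers": [{"name": "A. Driver", "licence_type": "Full UK"}],
    "excess": 250,
}
for path, value in flatten_record(record):
    print(f"{path}: {value}")
```

A stable path per row gives each verify, override, or flag action an unambiguous target, and the same path can key into the field-level provenance.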

Why This Architecture

The system deliberately separates concerns:

The LLM extracts structured values.
Pydantic validates the shape.
The arbiter applies domain-specific source authority.
Provenance is calculated after extraction instead of trusting the model to invent coordinates.
The UI keeps humans in the loop where confidence, evidence, or conflicts need review.

That separation is what turns the project from a prompt demo into a deployable workflow.
