G.U.I.D.E.: Technical Architecture & Specification

1. Overview

G.U.I.D.E. (Grievance Utility for Information Extraction, Drafting and Enrichment) is a four-layer, spec-driven system built for consumer complaint resolution. Every component has a clear contract; layers communicate through defined interfaces so each piece can be tested and replaced independently.

┌─────────────────────────────────────────────────────────────────┐
│                        GRADIO FRONTEND                          │
│  Chat · Verify (HITL) · Documents · Draft · Escalation · About  │
└──────────────────────────┬──────────────────────────────────────┘
                           │ HTTP (REST)
┌──────────────────────────▼──────────────────────────────────────┐
│                       FASTAPI BACKEND                           │
│   /api/session  /api/message  /api/upload                       │
│   /api/session/{id}/validate-entities   /api/status             │
└──┬──────────────────┬───────────────────────────────────────────┘
   │                  │
   │ Step 1 (local)   │ Step 2 (external API, after redaction)
   │                  │
┌──▼──────────────┐  ┌▼────────────────────────────────────────────┐
│  PRESIDIO       │  │  CLAUDE MANAGED AGENT (CMA)                 │
│  PIIRedactor    │  │                                             │
│  (runs locally) │  │  Tools:                                     │
│                 │  │  - classify_domain()  ──► DomainClassifier  │
│  Redacts:       │  │  - extract_entities() ──► EvidenceNER       │
│  PERSON         │  │  - process_document() ──► OCR / ViT + NER   │
│  PHONE_NUMBER   │  │  - draft_complaint()  ──► Claude (internal) │
│  EMAIL_ADDRESS  │  │  - recommend_action() ──► NextActionPredict │
│  CREDIT_CARD    │  │                                             │
│  IN_AADHAAR     │  │  HITL gate: agent pauses before drafting    │
│  IN_PAN         │  │  and requests user confirmation of entities │
│  IBAN_CODE      │  │                                             │
│  ...            │  │  Memory: per-user session state             │
└─────────────────┘  └─────────────────────────────────────────────┘
                                      │
                         ┌────────────▼─────────────────────────────┐
                         │   DEEP LEARNING LAYER                    │
                         │                                          │
                         │  1. DomainClassifier (DistilBERT)        │
                         │  2. EvidenceNER      (DistilBERT tokens) │
                         │  3. DocumentViT      (ViT image encoder) │
                         │  4. NextActionPredictor (MLP)            │
                         └──────────────────────────────────────────┘
                                      │
                         ┌────────────▼──────────────┐
                         │   DOCUMENT PROCESSOR      │
                         │  Tesseract OCR            │
                         │  pdfplumber (PDF parse)   │
                         │  PIL (image pre-process)  │
                         │  ViT (image understanding)│
                         └───────────────────────────┘

2. Component Specifications

2.0 Privacy Preprocessing: Microsoft Presidio

This layer runs locally before any message is forwarded to Claude or any external service. It is the first step in the pipeline.

| Attribute | Value |
| --- | --- |
| Library | presidio-analyzer + presidio-anonymizer |
| NLP engine | spaCy en_core_web_lg (local, no network call) |
| Trigger | Every /api/session/{id}/message call |
| Entity types | PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, IBAN_CODE, US_BANK_NUMBER, IN_AADHAAR, IN_PAN, IN_VEHICLE_REGISTRATION |
| Replacement | Each detected span → `<ENTITY_TYPE>` placeholder |
| Output | RedactionResult(redacted_text, pii_types_found, pii_redacted) |
| Failure mode | On any error, the original text is returned unchanged (fail-open, so the pipeline is never blocked) |

Why local?
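The user's name, account number, and Aadhaar UID must not leave the machine running the backend in plaintext. Because Presidio runs in-process (in the same Python server as the API), redaction happens before any bytes are written to the connection to the Anthropic API. This guarantee holds only when redaction succeeds: under the fail-open failure mode listed above, a Presidio error forwards the original text.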

What the DL models receive:
The local DL models (DomainClassifier, EvidenceNER) are invoked by Claude through tool calls, so they also receive redacted text. EvidenceNER remains effective because it mainly targets structural patterns (amounts, dates, reference IDs) that Presidio does not redact.
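
The redaction contract described in this section (placeholder replacement, a RedactionResult, fail-open on errors) can be sketched without Presidio. The example below is a simplified stand-in and not the project's PIIRedactor: it swaps Presidio's recognizers for two illustrative regular expressions, and the function name `redact` and the pattern table are assumptions. Only the RedactionResult field names come from the specification.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RedactionResult:
    redacted_text: str
    pii_types_found: list = field(default_factory=list)
    pii_redacted: bool = False


# Illustrative patterns only (assumption); Presidio combines NER with
# many tuned recognizers and is far more thorough than this.
_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


def redact(text):
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    try:
        found = []
        redacted = text
        for entity_type, pattern in _PATTERNS.items():
            redacted, count = pattern.subn(f"<{entity_type}>", redacted)
            if count:
                found.append(entity_type)
        return RedactionResult(redacted, found, bool(found))
    except Exception:
        # Fail-open, as specified: return the original text unchanged.
        return RedactionResult(text, [], False)
```

Calling `redact("My phone is 9876543210.")` yields the text `My phone is <PHONE_NUMBER>.` with `pii_redacted` set, which is the value the Chat tab would use to decide whether to show the privacy badge.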


2.1 Deep Learning Layer

2.1.1 DomainClassifier

| Attribute | Value |
| --- | --- |
| Architecture | distilbert-base-uncased + linear classification head |
| Task | Multi-class text classification |
| Classes (6) | ecommerce, telecom, banking, cibil, insurance, general |
| Training data | CFPB Consumer Complaint Database (3M+ rows); one-time download from Kaggle. Save as data/raw/complaints.csv. |
| Script | python -m src.classifier.train --cfpb_csv data/raw/complaints.csv --output_dir models/domain_classifier |
| Input | Redacted complaint text (string, max 512 tokens) |
| Output | DomainResult(domain: str, confidence: float, all_probs: dict, low_confidence: bool) |
| Confidence threshold | 0.50; results below this set low_confidence=True |
| Low-confidence path | Agent asks user one clarifying domain question; does not proceed until user confirms |
| Keyword fallback | Used when no checkpoint exists; always returns confidence=0.0, low_confidence=True |
| Fine-tune time | ~30 min CPU / ~5 min GPU (T4) |
| Library | HuggingFace transformers + datasets |

Why DistilBERT?
DistilBERT is about 40% smaller and 60% faster than BERT-base while retaining roughly 97% of its language-understanding performance, as reported by its authors. For a project with limited compute, that makes it a practical starting point. The CFPB product labels are remapped onto our 6 classes before training.

Low-confidence handling:
general is the intentional catch-all class: the model never returns an error, only a domain and a probability. However, a low probability on every class (for example, when the complaint text is too short or ambiguous) means the winning domain is unreliable. When confidence < 0.50, the low_confidence flag is set and the CMA agent pauses to ask the user one clarifying question ("Is this about e-commerce, telecom, banking, credit score, insurance, or other?") before continuing. The user's answer overrides the model's suggestion and is stored with domain_source = "user_confirmed" so later tools know the domain is authoritative.
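
The thresholding and override rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's predict.py: the helper names `result_from_logits` and `resolve_domain`, and the source labels "model" and "needs_clarification", are hypothetical. The DomainResult fields, the 0.50 threshold, and "user_confirmed" come from the specification.

```python
import math
from dataclasses import dataclass

DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]
CONFIDENCE_THRESHOLD = 0.50


@dataclass
class DomainResult:
    domain: str
    confidence: float
    all_probs: dict
    low_confidence: bool


def result_from_logits(logits):
    """Softmax the six class logits and flag an unreliable winner."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = {d: e / total for d, e in zip(DOMAINS, exps)}
    domain = max(probs, key=probs.get)
    confidence = probs[domain]
    return DomainResult(domain, confidence, probs,
                        confidence < CONFIDENCE_THRESHOLD)


def resolve_domain(result, user_answer=None):
    """Return (domain, domain_source); the user's answer overrides the model."""
    if user_answer is not None:
        return user_answer, "user_confirmed"
    if result.low_confidence:
        return None, "needs_clarification"  # hypothetical sentinel
    return result.domain, "model"  # hypothetical source label
```

With flat logits every class gets probability 1/6, so `low_confidence` is set and `resolve_domain` returns no domain until the user answers the clarifying question.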


2.1.2 EvidenceNER

| Attribute | Value |
| --- | --- |
| Architecture | distilbert-base-uncased with token classification head |
| Task | Named Entity Recognition (NER) on complaint text |
| Entity types | ORG, AMOUNT, DATE, REF_ID, ACCOUNT, PERSON |
| Training | ~4,000 synthetic complaint sentences generated in-memory by src/ner/train.py (no download needed). Optionally augmented with CoNLL-2003 via HuggingFace if internet is available (maps PER→PERSON, ORG→ORG; discards LOC/MISC). |
| Script | python -m src.ner.train --output_dir models/evidence_ner |
| Input | Redacted text (from user or OCR) |
| Output | List of {text, label, start, end, confidence} spans |

Entities and their use in drafting:

| Entity | Example | Used for |
| --- | --- | --- |
| ORG | "Flipkart", "HDFC Bank" | Complaint addressee |
| AMOUNT | "₹4,299", "Rs. 1,200" | Financial loss quantified |
| DATE | "12 March 2024", "last Tuesday" | Incident timeline |
| REF_ID | "Order #OD-2930291", "TXN123" | Evidence reference |
| ACCOUNT | "XXXX-1234", "loan account" | Dispute target |
| PERSON | "customer care executive" | Named witness/contact |
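
One way to turn the span list into drafting inputs is to keep the most confident span per label and map it to the role given in the table above. The sketch below assumes this selection rule; the helper names and the draft field keys (`addressee`, `financial_loss`, and so on) are hypothetical and only the span schema comes from the specification.

```python
def best_span_per_label(spans):
    """Keep the highest-confidence span for each entity label."""
    best = {}
    for span in spans:
        current = best.get(span["label"])
        if current is None or span["confidence"] > current["confidence"]:
            best[span["label"]] = span
    return best


# Which draft field each label feeds, following the table above.
# The key names on the right are illustrative assumptions.
DRAFT_FIELD = {
    "ORG": "addressee",
    "AMOUNT": "financial_loss",
    "DATE": "incident_timeline",
    "REF_ID": "evidence_reference",
    "ACCOUNT": "dispute_target",
    "PERSON": "contact",
}


def to_draft_context(spans):
    """Build a {draft_field: text} dict from EvidenceNER-style spans."""
    return {
        DRAFT_FIELD[label]: span["text"]
        for label, span in best_span_per_label(spans).items()
        if label in DRAFT_FIELD
    }
```

Picking a single span per label keeps the HITL summary short; a real implementation may instead prefer to show all candidates and let the user choose.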

2.1.3 DocumentViT

| Attribute | Value |
| --- | --- |
| Architecture | Vision Transformer (google/vit-base-patch16-224 fine-tuned) |
| Task | Structured evidence extraction from document images |
| Input | Scanned receipt / bill / screenshot (PIL Image) |
| Output | List of {text, label, confidence} spans (same schema as NER) |
| When used | After OCR; ViT runs as a complementary pass on image-type docs |
| Library | HuggingFace transformers (ViTForImageClassification + custom head) |

Why ViT alongside OCR?
Tesseract OCR excels at clean printed text but struggles with handwriting, logos, and table structures. The ViT model, fine-tuned on receipt and bill images, directly classifies image regions and extracts amount/date/provider fields, which is especially useful for blurry screenshots and poorly scanned documents.


2.1.4 NextActionPredictor

| Attribute | Value |
| --- | --- |
| Architecture | 2-hidden-layer MLP (12-dim input → 64 → 64 → 6) |
| Input | 12-dim feature vector: domain one-hot (6) + entity flags (5) + prior_contact (1) |
| Output | Ranked list of {action, authority, url, confidence} |
| Actions | 6 classes: company_support, nch, trai, rbi_ombudsman, irdai, legal |
| Training data | ~6,000 synthetic (domain, entity_flags, prior_contact → action) examples generated in-memory from DOMAIN_ACTION_PRIORS; no download needed. Trains in < 30 seconds on CPU. |
| Script | python -m src.next_action.train --output_dir models/next_action |
| Fallback | If no checkpoint exists, DOMAIN_ACTION_PRIORS rule-based mapping is used so the pipeline always works. |
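
The 12-dimensional input can be assembled as below. This is a sketch of the layout only: the specification names build_feature_vector() but not its signature, and it does not say which five entity labels are flagged, so both the signature and the ENTITY_FLAGS order here are assumptions.

```python
DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]

# Assumed flag set and order; the spec only says there are five entity flags.
ENTITY_FLAGS = ["ORG", "AMOUNT", "DATE", "REF_ID", "ACCOUNT"]


def build_feature_vector(domain, entity_labels, prior_contact):
    """Return domain one-hot (6) + entity flags (5) + prior_contact (1)."""
    one_hot = [1.0 if d == domain else 0.0 for d in DOMAINS]
    flags = [1.0 if label in entity_labels else 0.0 for label in ENTITY_FLAGS]
    return one_hot + flags + [1.0 if prior_contact else 0.0]
```

For a banking complaint with an organisation and an amount extracted and prior contact made, the vector has a 1.0 at index 2, at indices 6 and 7, and at index 11.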

Escalation routing logic:

| Domain | Primary Authority | Secondary |
| --- | --- | --- |
| E-commerce | Company support → NCH | Consumer Forum |
| Telecom | Company support → TRAI | NCH |
| Banking | Company support → RBI BO | Banking Ombudsman |
| CIBIL | Bureau direct → RBI BO | SEBI (if investment) |
| Insurance | Company support → IRDAI | Insurance Ombudsman |
| General | Company support → NCH | Consumer Forum |
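
The routing table lends itself to a plain lookup, which is roughly what a rule-based fallback could look like. The sketch below transcribes the table; the dictionary name, the function name, and the rule that a user with prior contact skips the first step are assumptions, and the actual contents of DOMAIN_ACTION_PRIORS are not shown in this document.

```python
# Transcribed from the routing table above (display names, not action IDs).
ESCALATION_ROUTES = {
    "ecommerce": {"primary": ["Company support", "NCH"], "secondary": "Consumer Forum"},
    "telecom": {"primary": ["Company support", "TRAI"], "secondary": "NCH"},
    "banking": {"primary": ["Company support", "RBI BO"], "secondary": "Banking Ombudsman"},
    "cibil": {"primary": ["Bureau direct", "RBI BO"], "secondary": "SEBI (if investment)"},
    "insurance": {"primary": ["Company support", "IRDAI"], "secondary": "Insurance Ombudsman"},
    "general": {"primary": ["Company support", "NCH"], "secondary": "Consumer Forum"},
}


def escalation_path(domain, prior_contact=False):
    """Return the ordered escalation steps for a domain."""
    route = ESCALATION_ROUTES.get(domain, ESCALATION_ROUTES["general"])
    steps = route["primary"] + [route["secondary"]]
    # Assumption: a user who already contacted the first party skips step one.
    return steps[1:] if prior_contact else steps
```

Unknown domains fall back to the general route, mirroring the catch-all role of the general class.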

2.2 Claude Managed Agent (CMA)

The CMA is the orchestration layer. It maintains per-user session state (conversation history, extracted entities, uploaded docs, draft versions) and decides at each turn which tool to invoke.

Key constraint: Claude only ever sees redacted text (PII replaced with <ENTITY_TYPE> placeholders by Presidio before the API call). This is documented in the system prompt so Claude knows not to try to recover original values.

Agent System Prompt Summary

You are G.U.I.D.E., an expert consumer complaint assistant.
PII has already been redacted locally; work with placeholders as-is.

Rules:
1. Always classify the domain first using classify_domain().
   • If low_confidence=false (≥ 0.50): store domain and proceed.
   • If low_confidence=true (< 0.50 or keyword fallback): ask the user ONE
     clarifying question ("Is this about e-commerce, telecom, banking, credit
     score, insurance, or other?") before continuing. Store domain_source=
     "user_confirmed" when the domain comes from the user.
   • If classify_domain() errors: same clarifying question as above.
2. Ask ONE targeted follow-up question at a time if information is missing.
3. If documents are uploaded, always run process_document before drafting.
4. HITL gate: Before calling draft_complaint, present extracted details
   as a numbered summary and ask the user to confirm them.  Wait for
   [USER CONFIRMED] message before proceeding.
5. Never draft until domain, provider, date, amount, prior contact, and
   desired resolution are all known AND user-confirmed.
6. Generate drafts in formal English: Subject / To / Body / From.
7. Always recommend the next escalation step with specific portal URLs.

Tool Specifications

| Tool | Input | Output | Calls |
| --- | --- | --- | --- |
| classify_domain | complaint_text: str | DomainResult | DL DomainClassifier |
| extract_entities | text: str | List[Entity] | DL EvidenceNER |
| process_document | file_path: str | {raw_text, entities} | OCR + ViT + EvidenceNER |
| draft_complaint | complaint_context: dict | ComplaintDraft | Claude (internal) |
| recommend_action | domain: str, entities: dict | List[EscalationAction] | DL NextAction |
| store_memory | key: str, value: any | None | CMA Memory Store |
| get_memory | key: str | any | CMA Memory Store |
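
A tool row from this table can be expressed as a JSON Schema definition plus a dispatcher. The sketch below uses the name / description / input_schema layout of the Anthropic Messages API; whether src/agent/tools.py uses exactly this layout is an assumption, as are the description text and the shape of the `handlers` argument passed to execute_tool().

```python
CLASSIFY_DOMAIN_TOOL = {
    "name": "classify_domain",
    "description": "Classify a redacted complaint into one of six domains.",
    "input_schema": {
        "type": "object",
        "properties": {"complaint_text": {"type": "string"}},
        "required": ["complaint_text"],
    },
}


def execute_tool(name, tool_input, handlers):
    """Dispatch a tool call to its handler by name.

    `handlers` maps tool names to callables (hypothetical wiring).
    Unknown tools return an error dict instead of raising, so the
    agent loop can report the problem back to the model.
    """
    handler = handlers.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**tool_input)
```

Keeping the dispatcher independent of the models makes it easy to test with stub handlers before any checkpoint exists.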

CMA Decision Flow (per user turn)

User message received
        │
        ▼  ── PRESIDIO (API layer, before agent) ───────────────
  PII redacted locally → redacted_text forwarded to Claude
        │
        ▼
  Is domain known? ──No──► call classify_domain() ──► store in memory
        │
       Yes
        │
        ▼
  Are minimum fields complete? ──No──► ask ONE follow-up question
  (provider, date, amount, ref)
        │
       Yes
        │
        ▼
  Was a document uploaded? ──Yes──► call process_document() ──► merge entities
        │
       No
        │
        ▼  ── HITL GATE ──────────────────────────────────────────
  Present extracted details summary → ask user to confirm
        │
  Wait for [USER CONFIRMED] message (from /validate-entities endpoint)
        │
        ▼
  Has user confirmed entities? ──Yes──► call draft_complaint() ──► show draft
        │
        ▼
  Has user asked next steps? ──Yes──► call recommend_action() ──► show escalation
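
The flow above can be read as a pure function from session state to the next step. The sketch below is an illustration under assumptions: in the real system Claude makes this decision from the system prompt, and the state keys (`pending_documents`, `entities_confirmed`, `asked_next_steps`) and return labels are hypothetical.

```python
# Minimum fields named in the flow diagram.
REQUIRED_FIELDS = ("provider", "date", "amount", "ref")


def next_step(state):
    """Return the next action for one user turn, following the flow above."""
    if not state.get("domain"):
        return "classify_domain"
    missing = [f for f in REQUIRED_FIELDS if not state.get(f)]
    if missing:
        return f"ask_follow_up:{missing[0]}"  # ONE question at a time
    if state.get("pending_documents"):
        return "process_document"
    if not state.get("entities_confirmed"):
        return "present_summary_and_wait"  # HITL gate
    if not state.get("draft"):
        return "draft_complaint"
    if state.get("asked_next_steps"):
        return "recommend_action"
    return "idle"
```

Writing the flow this way makes the HITL gate explicit: no state reaches `draft_complaint` unless `entities_confirmed` is set.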

2.3 Human-in-the-Loop (HITL) Validation

After all required fields are collected and before draft generation, the system pauses and requires explicit user confirmation of the extracted entities.

| Step | Component | Description |
| --- | --- | --- |
| 1 | CMA | Presents extracted entities as a numbered summary in chat |
| 2 | Frontend | Populates the Verify Entities tab with pre-filled editable fields |
| 3 | User | Reviews, edits any incorrect value, and clicks "Confirm & Generate Draft" |
| 4 | API | POST /api/session/{id}/validate-entities sends verified entities to CMA |
| 5 | CMA | Receives [USER CONFIRMED] message and calls draft_complaint() |

Why HITL?
PII redaction replaces some values with placeholders (e.g., a name becomes <PERSON>), and model extraction can be wrong or incomplete. The HITL step lets the user correct extracted values and supply the readable text that should appear in the final draft in place of a placeholder, improving both accuracy and trust in the generated complaint.
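
Step 4 of the table has to turn the verified form fields into a message the agent recognises. A minimal sketch follows; the `[USER CONFIRMED]` marker is from the specification, while the JSON payload after it and both helper names are assumptions about how the API might encode the entities.

```python
import json


def build_confirmation_message(verified_entities):
    """Format the message forwarded to the agent after the user confirms.

    Empty fields are dropped and surrounding whitespace is trimmed.
    """
    cleaned = {
        key: value.strip()
        for key, value in verified_entities.items()
        if value and value.strip()
    }
    payload = json.dumps(cleaned, ensure_ascii=False, sort_keys=True)
    return "[USER CONFIRMED]\n" + payload


def is_confirmation(message):
    """True when a message carries the HITL confirmation marker."""
    return message.startswith("[USER CONFIRMED]")
```

Sorting keys keeps the message deterministic, which helps when comparing draft versions generated from the same confirmed entities.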


2.4 Document Processor

| Feature | Implementation |
| --- | --- |
| PDF parsing | pdfplumber (text-native PDFs) |
| Image OCR | pytesseract + Pillow (pre-process) |
| ViT pass | google/vit-base-patch16-224 fine-tuned on receipt/bill images |
| Pre-process | Greyscale → adaptive threshold → deskew |
| Output | Clean extracted text + NER entities |
| Formats | PDF, PNG, JPG, JPEG, WEBP |
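
The supported formats imply a routing decision per upload: text-native PDFs go to pdfplumber, images go through pre-processing, OCR, and the ViT pass. The sketch below shows only that routing; the function name and the stage labels are hypothetical, and the stages themselves (pytesseract, Pillow, ViT) are not invoked here.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def choose_pipeline(file_path):
    """Return the ordered processing stages for an uploaded file."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return ["pdfplumber", "ner"]
    if suffix in IMAGE_SUFFIXES:
        return ["preprocess", "tesseract", "vit", "ner"]
    raise ValueError(f"unsupported format: {suffix or file_path}")
```

Rejecting unknown suffixes up front lets the upload endpoint return a clear error before any model is loaded.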

2.5 FastAPI Backend

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/health | GET | Health check (all components) |
| /api/session/create | POST | Create new CMA session |
| /api/session/{id}/message | POST | Send message → Presidio redact → agent |
| /api/session/{id}/upload | POST | Upload a document to session |
| /api/session/{id}/validate-entities | POST | HITL: submit user-confirmed entities |
| /api/session/{id}/history | GET | Retrieve conversation history |
| /api/classify | POST | Direct DL classification (debug) |
| /api/extract | POST | Direct NER extraction (debug) |
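
A client for the message endpoint can be built with the standard library alone. The sketch below only constructs the request; the JSON field name `message` is an assumption (the Pydantic schemas are not shown in this document), and the base URL is the local address given in the setup section.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def message_request(session_id, text):
    """Build a POST request for /api/session/{id}/message.

    The body field name "message" is hypothetical; check schemas.py
    for the real request model.
    """
    body = json.dumps({"message": text}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/session/{session_id}/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the backend is running, the request can be sent with `urllib.request.urlopen(message_request(session_id, text))`.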

2.6 Gradio Frontend

Tabs:

  1. Chat: Conversational interface; shows a 🔒 privacy badge when PII is redacted
  2. Verify Entities: HITL panel with editable entity fields + "Confirm & Generate Draft"
  3. Documents: Drag-and-drop upload; shows extracted entities
  4. Complaint Draft: Rendered complaint with copy/download
  5. Escalation Guide: Recommended authorities with portal links
  6. About: Architecture diagram, model cards, tech stack

3. Technology Stack

| Layer | Technology | Reason |
| --- | --- | --- |
| Launcher | start.py (stdlib only) | Single script; trains all models then starts servers |
| Privacy | Microsoft Presidio + spaCy | Local PII redaction, no cloud call |
| DL Models | HuggingFace Transformers | Industry standard for NLP + ViT |
| Classifier data | CFPB dataset (Kaggle, one-time) | 3M+ real complaints, public license |
| NER data | Synthetic in-memory | Template-generated; no download required |
| NextAction data | Synthetic in-memory | Generated from domain priors; no download |
| Agent | Anthropic CMA (default claude-sonnet-4-6, set via GUIDE_MODEL) | Stateful, tool-using agent |
| Backend | FastAPI + Uvicorn | Async, fast, OpenAPI auto-docs |
| Frontend | Gradio 4.x | ML-native UI, file upload, chat |
| OCR | pytesseract + pdfplumber | Proven, open-source |
| ViT doc model | HuggingFace ViT | Image-based evidence extraction |
| Env | Python 3.10+ | Required by CMA SDK |
| Config | python-dotenv | Secure API key management |
| Notebooks (optional) | Jupyter | EDA and demo only; not required to run system |

4. Data Flow: End to End

User types: "Rahul Sharma - Flipkart hasn't refunded ₹4,299 for order OD-123
             cancelled 3 weeks ago. My phone is 9876543210."
                │
                ▼  ── LOCAL ONLY ─────────────────────────────────────────────
        Presidio PIIRedactor detects: PERSON("Rahul Sharma"), PHONE_NUMBER("9876543210")
        Redacted: "<PERSON> - Flipkart hasn't refunded ₹4,299 for order OD-123
                   cancelled 3 weeks ago. My phone is <PHONE_NUMBER>."
                │
                ▼  ── EXTERNAL API ────────────────────────────────────────────
        FastAPI → CMA session with redacted text
                │
        Claude CMA agent processes redacted message
                │
                ├──► classify_domain(...)
                │         └──► DomainClassifier → {domain: "ecommerce", conf: 0.97}
                │
                ├──► extract_entities(...)
                │         └──► EvidenceNER → [ORG:"Flipkart", AMOUNT:"₹4,299",
                │                             REF_ID:"OD-123", DATE:"3 weeks ago"]
                │
                └──► HITL gate: "I have extracted the following, please confirm:
                                  - Company: Flipkart
                                  - Amount: ₹4,299
                                  - Order ID: OD-123
                                  - Date: 3 weeks ago
                                 Is this correct?"
                │
        User reviews in "Verify Entities" tab → edits if needed → clicks Confirm
                │
                ▼
        POST /validate-entities → [USER CONFIRMED] → draft_complaint()
                │
        ComplaintDraft generated and shown in Draft tab
                │
        recommend_action(domain="ecommerce") → [NCH, Consumer Forum]
                │
        Gradio renders: Draft · Evidence table · Escalation panel

5. Project Directory Structure

Project_ResolveAI/
│
├── start.py                       ← SINGLE ENTRY POINT: trains all models then starts servers
│
├── docs/
│   ├── abstract.md                ← project abstract (G.U.I.D.E.)
│   └── architecture.md            ← this file (spec)
│
├── src/
│   ├── __init__.py
│   ├── privacy/                   # Presidio PII redaction (runs before any external call)
│   │   ├── __init__.py
│   │   └── redactor.py            ← PIIRedactor singleton
│   │
│   ├── classifier/                # DL Domain Classifier (DistilBERT)
│   │   ├── model.py               ← DomainClassifier, CFPB_PRODUCT_MAP, LABEL2ID
│   │   ├── dataset.py             ← load_cfpb_csv(), clean_complaint_text(), ComplaintDataset
│   │   ├── train.py               ← CLI: --cfpb_csv, --output_dir, --epochs, --batch_size
│   │   └── predict.py             ← DomainPredictor singleton, classify_domain()
│   │
│   ├── ner/                       # DL NER model (DistilBERT token classifier)
│   │   ├── model.py               ← EvidenceNER, NER_LABELS, NER_LABEL2ID
│   │   ├── train.py               ← Generates synthetic data in-memory; CLI: --output_dir
│   │   └── predict.py             ← NERPredictor singleton, extract_entities()
│   │
│   ├── next_action/               # Next-action MLP predictor
│   │   ├── model.py               ← NextActionMLP, DOMAIN_ACTION_PRIORS, build_feature_vector()
│   │   ├── train.py               ← Generates synthetic features; CLI: --output_dir, --epochs
│   │   └── predict.py             ← NextActionPredictor singleton (MLP or rule-based fallback)
│   │
│   ├── document_processor/        # OCR + PDF parsing + ViT
│   │   ├── ocr.py                 ← Tesseract + pdfplumber pipeline
│   │   └── vit_extractor.py       ← ViT-based image evidence extraction
│   │
│   ├── agent/                     # Claude CMA integration
│   │   ├── tools.py               ← Tool definitions (JSON Schema) + execute_tool()
│   │   ├── prompts.py             ← SYSTEM_PROMPT (HITL rule 6, privacy context)
│   │   └── session.py             ← AgentManager singleton, send_message()
│   │
│   └── api/                       # FastAPI application
│       ├── main.py                ← Lifespan: Presidio → DL models → CMA agent
│       ├── routes.py              ← /message (Presidio→agent), /validate-entities (HITL)
│       └── schemas.py             ← Pydantic models incl. HITL + pii_redacted fields
│
├── ui/
│   └── app.py                     ← Gradio: Chat · Verify · Docs · Draft · Escalation · About
│
├── notebooks/                     # OPTIONAL: EDA and interactive demos only
│   ├── 01_data_exploration.ipynb  ← Explore CFPB dataset, save processed CSV
│   ├── 02_classifier_training.ipynb
│   ├── 04_cma_agent_demo.ipynb
│   └── 05_end_to_end_demo.ipynb
│
├── data/
│   ├── raw/                       ← Place CFPB complaints.csv here (one-time download)
│   ├── processed/                 ← Output of EDA notebook; not required for training
│   └── sample_complaints/         ← Synthetic domain-specific CSVs for augmentation
│
├── models/                        ← Created by training; populated by start.py
│   ├── domain_classifier/         ← best_model.pt + tokenizer files
│   ├── evidence_ner/              ← best_model.pt + tokenizer files
│   ├── document_vit/              ← ViT checkpoint
│   └── next_action/               ← best_model.pt
│
├── CLAUDE.md                      ← Guidance for Claude Code
├── requirements.txt               ← presidio-analyzer, presidio-anonymizer, spacy, torch, etc.
├── .env.example                   ← Template: copy to .env and add ANTHROPIC_API_KEY
├── .gitignore
└── README.md

6. Setup and Running

Step 1: Get an Anthropic API key

  1. Go to https://console.anthropic.com → Sign up (free tier available)
  2. Navigate to API Keys → Create Key
  3. Copy the key (shown only once)
  4. In the project root, copy the template and fill in your key:
    cp .env.example .env
    # then edit .env:
    ANTHROPIC_API_KEY=sk-ant-...
    
    Never commit .env to git (listed in .gitignore).

Step 2: Install dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_lg      # Presidio NLP model (local, ~750 MB)

Step 3: Download CFPB data (first run only)

The DomainClassifier requires the CFPB Consumer Complaint Database:

  • Download from Kaggle: consumer-complaint-database dataset
  • Save the CSV to data/raw/complaints.csv
  • Size: ~600 MB; one-time download; not committed to git

This is only needed to train the classifier. The NER and NextAction models generate their training data in-memory automatically.

Step 4: Run (single command)

# First run (trains all models, then starts servers):
python start.py --cfpb_csv data/raw/complaints.csv

# After first run (models already trained, skip training):
python start.py --no-train

# Force retrain everything:
python start.py --cfpb_csv data/raw/complaints.csv --train

# Train only (no servers):
python start.py --cfpb_csv data/raw/complaints.csv --train-only

When running, start.py will:

  1. Validate .env and ANTHROPIC_API_KEY
  2. Train DomainClassifier (~30 min CPU / ~5 min GPU T4); skipped if checkpoint exists
  3. Train EvidenceNER (~10 min CPU); skipped if checkpoint exists
  4. Train NextActionMLP (< 30 sec CPU); skipped if checkpoint exists
  5. Start FastAPI at http://localhost:8000 (Swagger docs at /docs)
  6. Start Gradio UI at http://localhost:7860

Both servers print to the same terminal, prefixed with [API] or [UI]. Press Ctrl+C to stop everything cleanly.
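
The "skipped if checkpoint exists" behaviour in steps 2 to 4 amounts to a file-existence check per model. The sketch below is an assumption about how start.py might decide what to train: the checkpoint paths follow the models/ layout in the directory structure, while the function name and its `force` and `no_train` parameters (standing in for --train and --no-train) are hypothetical.

```python
from pathlib import Path

# Checkpoint locations taken from the models/ directory layout.
CHECKPOINTS = {
    "domain_classifier": "models/domain_classifier/best_model.pt",
    "evidence_ner": "models/evidence_ner/best_model.pt",
    "next_action": "models/next_action/best_model.pt",
}


def models_to_train(root, force=False, no_train=False):
    """List the models that still need training under the project root."""
    if no_train:
        return []
    return [
        name
        for name, rel_path in CHECKPOINTS.items()
        if force or not (Path(root) / rel_path).exists()
    ]
```

On a fresh clone every model is returned; after a successful first run the list is empty unless retraining is forced.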


7. Development Phases

| Phase | Deliverable | Status |
| --- | --- | --- |
| 0 | Abstract submitted (G.U.I.D.E.) | Done ✓ |
| 1 | Architecture + spec (docs/architecture.md) | Done ✓ |
| 2 | Project scaffold + environment setup | Done ✓ |
| 3 | Presidio PII redaction layer (src/privacy/) | Done ✓ |
| 4 | DL: DomainClassifier (model, dataset, train) | Done ✓ |
| 5 | DL: EvidenceNER + NextActionMLP (model, train) | Done ✓ |
| 6 | DL: ViT document extractor (fine-tuning) | In progress |
| 7 | CMA agent + tools integration | Done ✓ |
| 8 | Document processor (OCR + ViT pipeline) | Done ✓ |
| 9 | FastAPI backend (HITL endpoint, schemas) | Done ✓ |
| 10 | Gradio UI (Verify tab, privacy badge, HITL) | Done ✓ |
| 11 | Single launcher (start.py) + CLAUDE.md | Done ✓ |
| 12 | Integration testing + demo notebooks | Remaining |
| 13 | Final report + presentation | Remaining |