G.U.I.D.E. — Technical Architecture & Specification

1. Overview

G.U.I.D.E. (Grievance Utility for Information Extraction, Drafting and Enrichment) is a four-layer, spec-driven system built for consumer complaint resolution. Every component has a clear contract; layers communicate through defined interfaces so each piece can be tested and replaced independently.

┌─────────────────────────────────────────────────────────────────┐
│                        GRADIO FRONTEND                          │
│  Chat · Verify (HITL) · Documents · Draft · Escalation · About │
└──────────────────────────┬──────────────────────────────────────┘
                           │ HTTP (REST)
┌──────────────────────────▼──────────────────────────────────────┐
│                       FASTAPI BACKEND                           │
│   /api/session  /api/message  /api/upload                       │
│   /api/session/{id}/validate-entities   /api/status             │
└──┬──────────────────┬────────────────────────────────────────────┘
   │                  │
   │ Step 1 (local)   │ Step 2 (external API — after redaction)
   │                  │
┌──▼──────────────┐  ┌▼────────────────────────────────────────────┐
│  PRESIDIO       │  │  CLAUDE MANAGED AGENT (CMA)                 │
│  PIIRedactor    │  │                                             │
│  (runs locally) │  │  Tools:                                     │
│                 │  │  - classify_domain()  ──► DomainClassifier  │
│  Redacts:       │  │  - extract_entities() ──► EvidenceNER       │
│  PERSON         │  │  - process_document() ──► OCR / ViT + NER  │
│  PHONE_NUMBER   │  │  - draft_complaint()  ──► Claude (internal) │
│  EMAIL_ADDRESS  │  │  - recommend_action() ──► NextActionPredict │
│  CREDIT_CARD    │  │                                             │
│  IN_AADHAAR     │  │  HITL gate: agent pauses before drafting    │
│  IN_PAN         │  │  and requests user confirmation of entities │
│  IBAN_CODE      │  │                                             │
│  ...            │  │  Memory: per-user session state             │
└─────────────────┘  └─────────────────────────────────────────────┘
                                      │
                         ┌────────────▼─────────────────────────────┐
                         │   DEEP LEARNING LAYER                    │
                         │                                          │
                         │  1. DomainClassifier (DistilBERT)        │
                         │  2. EvidenceNER      (DistilBERT tokens) │
                         │  3. DocumentViT      (ViT image encoder) │
                         │  4. NextActionPredictor (MLP)            │
                         └──────────────────────────────────────────┘
                                      │
                         ┌────────────▼─────────────┐
                         │   DOCUMENT PROCESSOR     │
                         │  Tesseract OCR            │
                         │  pdfplumber (PDF parse)   │
                         │  PIL (image pre-process)  │
                         │  ViT (image understanding)│
                         └──────────────────────────┘

2. Component Specifications

2.0 Privacy Preprocessing — Microsoft Presidio

This layer runs locally before any message is forwarded to Claude or any external service. It is the first step in the pipeline.

Attribute	Value
Library	`presidio-analyzer` + `presidio-anonymizer`
NLP engine	spaCy `en_core_web_lg` (local, no network call)
Trigger	Every `/api/session/{id}/message` call
Entity types	`PERSON`, `PHONE_NUMBER`, `EMAIL_ADDRESS`, `CREDIT_CARD`, `IBAN_CODE`, `US_BANK_NUMBER`, `IN_AADHAAR`, `IN_PAN`, `IN_VEHICLE_REGISTRATION`
Replacement	Each detected span → `<ENTITY_TYPE>` placeholder
Output	`RedactionResult(redacted_text, pii_types_found, pii_redacted)`
Failure mode	On any error, original text is returned unchanged (fail-open so pipeline is never blocked)

Why local?
The user's name, account number, and Aadhaar UID must never leave the host running G.U.I.D.E. in plaintext. Running Presidio in-process (in the same Python server as the API) ensures redaction happens before any request bytes are sent to the Anthropic API.

What the DL models receive:
The local DL models (DomainClassifier, EvidenceNER) are called by Claude through tool calls — they therefore also receive redacted text. The EvidenceNER model is still effective because it targets structural patterns (amounts, dates, reference IDs) that are not redacted by Presidio.
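
The contract described in this section (placeholder replacement, the `RedactionResult` fields, and the fail-open behaviour) can be illustrated without Presidio installed. The sketch below swaps Presidio's analyzer for two hand-written regular expressions, so the patterns, the `redact` function name, and the field types are assumptions for illustration only and not the project's `PIIRedactor`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RedactionResult:
    redacted_text: str
    pii_types_found: list = field(default_factory=list)
    pii_redacted: bool = False


# Illustrative stand-ins for Presidio recognizers (assumed patterns, not Presidio's).
_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


def redact(text: str) -> RedactionResult:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    try:
        found = []
        redacted = text
        for entity_type, pattern in _PATTERNS.items():
            redacted, count = pattern.subn(f"<{entity_type}>", redacted)
            if count:
                found.append(entity_type)
        return RedactionResult(redacted, found, bool(found))
    except Exception:
        # Fail-open, as specified in the table: return the input unchanged.
        return RedactionResult(text, [], False)


result = redact("My phone is 9876543210, mail me at rahul@example.com.")
print(result.redacted_text)
```

Because the function is fail-open, callers that need a hard privacy guarantee would have to check `pii_redacted` or log the failure separately; the sketch does not attempt that.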

2.1 Deep Learning Layer

2.1.1 DomainClassifier

Attribute	Value
Architecture	`distilbert-base-uncased` + linear classification head
Task	Multi-class text classification
Classes (6)	`ecommerce`, `telecom`, `banking`, `cibil`, `insurance`, `general`
Training data	CFPB Consumer Complaint Database (3M+ rows) — one-time download from Kaggle. Save as `data/raw/complaints.csv`.
Script	`python -m src.classifier.train --cfpb_csv data/raw/complaints.csv --output_dir models/domain_classifier`
Input	Redacted complaint text (string, max 512 tokens)
Output	`DomainResult(domain: str, confidence: float, all_probs: dict, low_confidence: bool)`
Confidence threshold	`0.50` — results below this set `low_confidence=True`
Low-confidence path	Agent asks user one clarifying domain question; does not proceed until user confirms
Keyword fallback	Used when no checkpoint exists; always returns `confidence=0.0`, `low_confidence=True`
Fine-tune time	~30 min CPU / ~5 min GPU (T4)
Library	HuggingFace `transformers` + `datasets`

Why DistilBERT?
DistilBERT is about 40% smaller and 60% faster than BERT-base while retaining roughly 97% of its language-understanding performance, as reported by its authors. For a project with limited compute, it is a practical starting point. The CFPB product labels map onto our 6 classes after label remapping.

Low-confidence handling:
`general` is the intentional catch-all class: the model never returns an error, only a domain and a probability. However, when no class receives a high probability (e.g., the complaint text is too short or ambiguous), the winning domain is unreliable. When `confidence < 0.50`, the `low_confidence` flag is set and the CMA agent pauses to ask the user one clarifying question ("Is this about e-commerce, telecom, banking, credit score, insurance, or other?") before continuing. The user's answer overrides the model's suggestion and is stored with `domain_source = "user_confirmed"` so later tools know the domain is authoritative.
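
The threshold rule and the user override can be sketched as follows. This is a minimal illustration assuming the `DomainResult` fields listed in the table; the helper names `to_domain_result` and `resolve_domain`, and the `"model"` and `"pending_user"` values for `domain_source`, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.50


@dataclass
class DomainResult:
    domain: str
    confidence: float
    all_probs: dict
    low_confidence: bool


def to_domain_result(all_probs: dict) -> DomainResult:
    """Pick the most probable class and flag it when below the threshold."""
    domain = max(all_probs, key=all_probs.get)
    confidence = all_probs[domain]
    return DomainResult(domain, confidence, all_probs,
                        confidence < CONFIDENCE_THRESHOLD)


def resolve_domain(result: DomainResult, user_answer: Optional[str] = None) -> dict:
    """Return what the agent would store; a user answer overrides the model."""
    if user_answer is not None:
        return {"domain": user_answer, "domain_source": "user_confirmed"}
    if result.low_confidence:
        # Hypothetical marker: the agent must ask its clarifying question first.
        return {"domain": None, "domain_source": "pending_user"}
    return {"domain": result.domain, "domain_source": "model"}  # hypothetical value


probs = {"ecommerce": 0.35, "telecom": 0.30, "banking": 0.15,
         "cibil": 0.05, "insurance": 0.05, "general": 0.10}
ambiguous = to_domain_result(probs)
print(resolve_domain(ambiguous))
print(resolve_domain(ambiguous, user_answer="telecom"))
```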

2.1.2 EvidenceNER

Attribute	Value
Architecture	`distilbert-base-uncased` with token classification head
Task	Named Entity Recognition (NER) on complaint text
Entity types	`ORG`, `AMOUNT`, `DATE`, `REF_ID`, `ACCOUNT`, `PERSON`
Training	~4,000 synthetic complaint sentences generated in-memory by `src/ner/train.py` (no download needed). Optionally augmented with CoNLL-2003 via HuggingFace if internet is available (maps PER→PERSON, ORG→ORG; discards LOC/MISC).
Script	`python -m src.ner.train --output_dir models/evidence_ner`
Input	Redacted text (from user or OCR)
Output	List of `{text, label, start, end, confidence}` spans
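
The optional CoNLL-2003 augmentation relies on the tag mapping stated in the Training row. A minimal sketch of that remapping is shown below; it assumes BIO-style tags on both sides, which this document does not confirm, and the function name `remap_conll_tags` is hypothetical.

```python
# Mapping stated in the table: PER -> PERSON, ORG -> ORG; LOC and MISC are discarded.
CONLL_TO_GUIDE = {"PER": "PERSON", "ORG": "ORG"}


def remap_conll_tags(tags: list) -> list:
    """Convert CoNLL-2003 BIO tags to the EvidenceNER label set (BIO assumed)."""
    remapped = []
    for tag in tags:
        if tag == "O":
            remapped.append("O")
            continue
        prefix, _, entity = tag.partition("-")
        target = CONLL_TO_GUIDE.get(entity)
        # Discarded entity types become plain "outside" tokens.
        remapped.append(f"{prefix}-{target}" if target else "O")
    return remapped


tokens = ["Rahul", "Sharma", "called", "HDFC", "Bank", "from", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
print(list(zip(tokens, remap_conll_tags(tags))))
```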

Entities and their use in drafting:

Entity	Example	Used for
ORG	"Flipkart", "HDFC Bank"	Complaint addressee
AMOUNT	"₹4,299", "Rs. 1,200"	Financial loss quantified
DATE	"12 March 2024", "last Tuesday"	Incident timeline
REF_ID	"Order #OD-2930291", "TXN123"	Evidence reference
ACCOUNT	"XXXX-1234", "loan account"	Dispute target
PERSON	"customer care executive"	Named witness/contact

2.1.3 DocumentViT

Attribute	Value
Architecture	Vision Transformer (`google/vit-base-patch16-224` fine-tuned)
Task	Structured evidence extraction from document images
Input	Scanned receipt / bill / screenshot (PIL Image)
Output	List of `{text, label, confidence}` spans (same schema as NER)
When used	After OCR; ViT runs as a complementary pass on image-type docs
Library	HuggingFace `transformers` (`ViTForImageClassification` + custom head)

Why ViT alongside OCR?
Tesseract OCR excels at clean printed text but struggles with handwriting, logos, and table structures. The ViT model, fine-tuned on receipt and bill images, directly classifies image regions and extracts amount/date/provider fields — especially useful for blurry screenshots and poorly-scanned documents.

2.1.4 NextActionPredictor

Attribute	Value
Architecture	2-hidden-layer MLP (12-dim input → 64 → 64 → 6)
Input	12-dim feature vector: domain one-hot (6) + entity flags (5) + prior_contact (1)
Output	Ranked list of `{action, authority, url, confidence}`
Actions	6 classes: `company_support`, `nch`, `trai`, `rbi_ombudsman`, `irdai`, `legal`
Training data	~6,000 synthetic (domain, entity_flags, prior_contact → action) examples generated in-memory from `DOMAIN_ACTION_PRIORS`; no download needed. Trains in < 30 seconds on CPU.
Script	`python -m src.next_action.train --output_dir models/next_action`
Fallback	If no checkpoint exists, `DOMAIN_ACTION_PRIORS` rule-based mapping is used so the pipeline always works.
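
To make the 12-dimensional input concrete, here is one possible encoding. The table only gives the sizes (6 + 5 + 1), so the choice of which five entity labels get a flag, the ordering, and the function signature are all assumptions; the project's own `build_feature_vector()` may differ.

```python
DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]
# Assumption: the five flags cover these labels; the spec does not list them.
FLAG_LABELS = ["ORG", "AMOUNT", "DATE", "REF_ID", "ACCOUNT"]


def build_feature_vector(domain: str, entity_labels: set, prior_contact: bool) -> list:
    """Concatenate domain one-hot (6), entity flags (5), and prior_contact (1)."""
    one_hot = [1.0 if d == domain else 0.0 for d in DOMAINS]
    flags = [1.0 if label in entity_labels else 0.0 for label in FLAG_LABELS]
    return one_hot + flags + [1.0 if prior_contact else 0.0]


vec = build_feature_vector("ecommerce", {"ORG", "AMOUNT", "REF_ID", "DATE"},
                           prior_contact=True)
print(len(vec), vec)
```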

Escalation routing logic:

Domain	Primary Authority	Secondary
E-commerce	Company support → NCH	Consumer Forum
Telecom	Company support → TRAI	NCH
Banking	Company support → RBI BO	Banking Ombudsman
CIBIL	Bureau direct → RBI BO	SEBI (if investment)
Insurance	Company support → IRDAI	Insurance Ombudsman
General	Company support → NCH	Consumer Forum
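
The rule-based fallback can be pictured as a lookup over the Primary Authority column. The sketch below transcribes that column into the six action classes; the dictionary shape, the mapping of "Bureau direct" to `company_support`, and the `fallback_actions` name are assumptions, since the real `DOMAIN_ACTION_PRIORS` structure is not shown in this document.

```python
# Primary chain per domain, transcribed from the routing table above.
# Assumptions: "Bureau direct" is mapped to the company_support class, and the
# list-of-actions shape is illustrative, not the real DOMAIN_ACTION_PRIORS.
PRIMARY_CHAIN = {
    "ecommerce": ["company_support", "nch"],
    "telecom": ["company_support", "trai"],
    "banking": ["company_support", "rbi_ombudsman"],
    "cibil": ["company_support", "rbi_ombudsman"],
    "insurance": ["company_support", "irdai"],
    "general": ["company_support", "nch"],
}


def fallback_actions(domain: str) -> list:
    """Return the rule-based escalation chain; unknown domains use 'general'."""
    return list(PRIMARY_CHAIN.get(domain, PRIMARY_CHAIN["general"]))


print(fallback_actions("telecom"))
```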

2.2 Claude Managed Agent (CMA)

The CMA is the orchestration layer. It maintains per-user session state (conversation history, extracted entities, uploaded docs, draft versions) and decides at each turn which tool to invoke.

Key constraint: Claude only ever sees redacted text (PII replaced with <ENTITY_TYPE> placeholders by Presidio before the API call). This is documented in the system prompt so Claude knows not to try to recover original values.

Agent System Prompt Summary

You are G.U.I.D.E., an expert consumer complaint assistant.
PII has already been redacted locally — work with placeholders as-is.

Rules:
1. Always classify the domain first using classify_domain().
   • If low_confidence=false (≥ 0.50): store domain and proceed.
   • If low_confidence=true (< 0.50 or keyword fallback): ask the user ONE
     clarifying question ("Is this about e-commerce, telecom, banking, credit
     score, insurance, or other?") before continuing. Store domain_source=
     "user_confirmed" when the domain comes from the user.
   • If classify_domain() errors: same clarifying question as above.
2. Ask ONE targeted follow-up question at a time if information is missing.
3. If documents are uploaded, always run process_document before drafting.
4. HITL gate: Before calling draft_complaint, present extracted details
   as a numbered summary and ask the user to confirm them.  Wait for
   [USER CONFIRMED] message before proceeding.
5. Never draft until domain, provider, date, amount, prior contact, and
   desired resolution are all known AND user-confirmed.
6. Generate drafts in formal English: Subject / To / Body / From.
7. Always recommend the next escalation step with specific portal URLs.

Tool Specifications

Tool	Input	Output	Calls
`classify_domain`	`complaint_text: str`	`DomainResult`	DL DomainClassifier
`extract_entities`	`text: str`	`List[Entity]`	DL EvidenceNER
`process_document`	`file_path: str`	`{raw_text, entities}`	OCR + ViT + EvidenceNER
`draft_complaint`	`complaint_context: dict`	`ComplaintDraft`	Claude (internal)
`recommend_action`	`domain: str, entities: dict`	`List[EscalationAction]`	DL NextAction
`store_memory`	`key: str, value: any`	`None`	CMA Memory Store
`get_memory`	`key: str`	`any`	CMA Memory Store
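
As an illustration of how one row of this table might translate into code, the sketch below shows a tool definition in the Anthropic tool format (`name`, `description`, `input_schema`) and a small dispatcher. The description text, the stub handler, and the error payload shape are assumptions; only the tool name, its `complaint_text` input, and the `execute_tool()` name come from this document.

```python
import json

# Anthropic-style tool definition: name, description, JSON Schema for the input.
CLASSIFY_DOMAIN_TOOL = {
    "name": "classify_domain",
    "description": "Classify redacted complaint text into one of six domains.",
    "input_schema": {
        "type": "object",
        "properties": {"complaint_text": {"type": "string"}},
        "required": ["complaint_text"],
    },
}


def _classify_domain_stub(complaint_text: str) -> dict:
    # Placeholder for the DomainClassifier call; mirrors the keyword-fallback
    # output described earlier (confidence 0.0, low_confidence True).
    return {"domain": "general", "confidence": 0.0,
            "all_probs": {}, "low_confidence": True}


_HANDLERS = {"classify_domain": _classify_domain_stub}


def execute_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call by name and return a JSON string for the agent."""
    handler = _HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report failures as data so the agent can fall back to asking the user.
        return json.dumps({"error": str(exc)})


print(execute_tool("classify_domain", {"complaint_text": "<PERSON> was overcharged"}))
```

Returning errors as JSON rather than raising fits the system prompt rule that a `classify_domain()` error leads to the same clarifying question as a low-confidence result.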

CMA Decision Flow (per user turn)

User message received
        │
        ▼  ── PRESIDIO (API layer, before agent) ───────────────
  PII redacted locally → redacted_text forwarded to Claude
        │
        ▼
  Is domain known? ──No──► call classify_domain() ──► store in memory
        │
       Yes
        │
        ▼
  Are minimum fields complete? ──No──► ask ONE follow-up question
  (provider, date, amount, ref)
        │
       Yes
        │
        ▼
  Was a document uploaded? ──Yes──► call process_document() ──► merge entities
        │
       No
        │
        ▼  ── HITL GATE ──────────────────────────────────────────
  Present extracted details summary → ask user to confirm
        │
  Wait for [USER CONFIRMED] message (from /validate-entities endpoint)
        │
        ▼
  Has user confirmed entities? ──Yes──► call draft_complaint() ──► show draft
        │
        ▼
  Has user asked next steps? ──Yes──► call recommend_action() ──► show escalation
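
The flow above can also be read as a pure function from session state to the next step. The sketch below is one way to express it; the `SessionState` fields and the returned step names are hypothetical, and in the real system the decision is made by the agent following its system prompt rather than by hard-coded branching.

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_FIELDS = ("provider", "date", "amount", "ref")  # as listed in the flow


@dataclass
class SessionState:
    domain: Optional[str] = None
    fields: dict = field(default_factory=dict)
    pending_documents: list = field(default_factory=list)
    user_confirmed: bool = False
    draft_shown: bool = False
    asked_next_steps: bool = False


def next_step(state: SessionState) -> str:
    """Walk the decision flow top to bottom and return the first pending step."""
    if state.domain is None:
        return "classify_domain"
    missing = [name for name in REQUIRED_FIELDS if not state.fields.get(name)]
    if missing:
        return f"ask_follow_up:{missing[0]}"  # one question at a time
    if state.pending_documents:
        return "process_document"
    if not state.user_confirmed:
        return "hitl_confirm"  # HITL gate: never draft before confirmation
    if not state.draft_shown:
        return "draft_complaint"
    if state.asked_next_steps:
        return "recommend_action"
    return "wait_for_user"


state = SessionState(domain="ecommerce", fields={"provider": "Flipkart"})
print(next_step(state))
```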

2.3 Human-in-the-Loop (HITL) Validation

After all required fields are collected and before draft generation, the system pauses and requires explicit user confirmation of the extracted entities.

Step	Component	Description
1	CMA	Presents extracted entities as a numbered summary in chat
2	Frontend	Populates the Verify Entities tab with pre-filled editable fields
3	User	Reviews, edits any incorrect value, and clicks "Confirm & Generate Draft"
4	API	`POST /api/session/{id}/validate-entities` sends verified entities to CMA
5	CMA	Receives `[USER CONFIRMED]` message and calls `draft_complaint()`

Why HITL?
PII redaction replaces some values with placeholders (e.g., a name becomes <PERSON>). The HITL step lets the user supply the correct readable label (e.g., "HDFC Bank" rather than just <ORG>) that will appear in the final draft, improving both accuracy and trust in the generated complaint.
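
A minimal sketch of steps 3 to 5, assuming entities are passed around as a flat dictionary: user edits override extracted values, and the result is wrapped in a `[USER CONFIRMED]` message. The function names and the numbered message body are hypothetical; only the `[USER CONFIRMED]` marker comes from this document.

```python
def apply_user_confirmation(extracted: dict, confirmed: dict) -> dict:
    """User-edited values win; fields left blank keep the extracted value."""
    merged = dict(extracted)
    merged.update({k: v for k, v in confirmed.items() if v not in (None, "")})
    return merged


def confirmation_message(entities: dict) -> str:
    """Build the message the agent waits for (body format is an assumption)."""
    lines = [f"{i}. {key}: {value}"
             for i, (key, value) in enumerate(entities.items(), 1)]
    return "[USER CONFIRMED]\n" + "\n".join(lines)


extracted = {"provider": "Flipkart", "amount": "Rs. 4,299", "date": "3 weeks ago"}
edited = {"date": "12 March 2024", "amount": ""}
final_entities = apply_user_confirmation(extracted, edited)
print(confirmation_message(final_entities))
```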

2.4 Document Processor

Feature	Implementation
PDF parsing	`pdfplumber` (text-native PDFs)
Image OCR	`pytesseract` + `Pillow` (pre-process)
ViT pass	`google/vit-base-patch16-224` fine-tuned on receipt/bill images
Pre-process	Greyscale → adaptive threshold → deskew
Output	Clean extracted text + NER entities
Formats	PDF, PNG, JPG, JPEG, WEBP
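
The table implies a different processing path for PDFs and for images. A small routing sketch is shown below; the stage names are hypothetical labels for the components listed above, not function names from `src/document_processor/`.

```python
from pathlib import Path

PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def choose_pipeline(file_path: str) -> list:
    """Return the ordered stages for a file, based only on its extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return ["pdfplumber", "ner"]
    if suffix in IMAGE_SUFFIXES:
        # Pre-process (greyscale, threshold, deskew), then OCR, then the ViT pass.
        return ["preprocess", "tesseract_ocr", "vit", "ner"]
    raise ValueError(f"unsupported format: {suffix or file_path}")


print(choose_pipeline("receipt.JPG"))
```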

2.5 FastAPI Backend

Endpoint	Method	Description
`/api/health`	GET	Health check (all components)
`/api/session/create`	POST	Create new CMA session
`/api/session/{id}/message`	POST	Send message → Presidio redact → agent
`/api/session/{id}/upload`	POST	Upload a document to session
`/api/session/{id}/validate-entities`	POST	HITL: submit user-confirmed entities
`/api/session/{id}/history`	GET	Retrieve conversation history
`/api/classify`	POST	Direct DL classification (debug)
`/api/extract`	POST	Direct NER extraction (debug)
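
For orientation, a client call to the message endpoint might be built as below using only the standard library. The request is constructed but not sent; the JSON field name `message` and the base URL constant are assumptions, so check `src/api/schemas.py` for the actual request model.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # FastAPI address used in the setup section


def build_message_request(session_id: str, text: str) -> Request:
    """Prepare a POST to /api/session/{id}/message without sending it."""
    body = json.dumps({"message": text}).encode("utf-8")  # field name assumed
    return Request(
        f"{BASE_URL}/api/session/{session_id}/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_message_request("abc123", "Flipkart has not refunded my order.")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once the backend is running.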

2.6 Gradio Frontend

Tabs:

Chat — Conversational interface; shows 🔒 privacy badge when PII is redacted
Verify Entities — HITL panel: editable entity fields + "Confirm & Generate Draft"
Documents — Drag-and-drop upload; shows extracted entities
Complaint Draft — Rendered complaint with copy/download
Escalation Guide — Recommended authorities with portal links
About — Architecture diagram, model cards, tech stack

3. Technology Stack

Layer	Technology	Reason
Launcher	`start.py` (stdlib only)	Single script — trains all models then starts servers
Privacy	Microsoft Presidio + spaCy	Local PII redaction, no cloud call
DL Models	HuggingFace Transformers	Industry standard for NLP + ViT
Classifier data	CFPB dataset (Kaggle, one-time)	3M+ real complaints, public license
NER data	Synthetic in-memory	Template-generated; no download required
NextAction data	Synthetic in-memory	Generated from domain priors; no download
Agent	Anthropic CMA (default `claude-sonnet-4-6`, set via `GUIDE_MODEL`)	Stateful, tool-using agent
Backend	FastAPI + Uvicorn	Async, fast, OpenAPI auto-docs
Frontend	Gradio 4.x	ML-native UI, file upload, chat
OCR	pytesseract + pdfplumber	Proven, open-source
ViT doc model	HuggingFace ViT	Image-based evidence extraction
Env	Python 3.10+	Required by CMA SDK
Config	python-dotenv	Secure API key management
Notebooks (optional)	Jupyter	EDA and demo only; not required to run system

4. Data Flow — End to End

User types: "Rahul Sharma — Flipkart hasn't refunded ₹4,299 for order OD-123
             cancelled 3 weeks ago. My phone is 9876543210."
                │
                ▼  ── LOCAL ONLY ─────────────────────────────────────────────
        Presidio PIIRedactor detects: PERSON("Rahul Sharma"), PHONE("9876543210")
        Redacted: "<PERSON> — Flipkart hasn't refunded ₹4,299 for order OD-123
                   cancelled 3 weeks ago. My phone is <PHONE_NUMBER>."
                │
                ▼  ── EXTERNAL API ────────────────────────────────────────────
        FastAPI → CMA session with redacted text
                │
        Claude CMA agent processes redacted message
                │
                ├──► classify_domain(...)
                │         └──► DomainClassifier → {domain: "ecommerce", conf: 0.97}
                │
                ├──► extract_entities(...)
                │         └──► EvidenceNER → [ORG:"Flipkart", AMOUNT:"₹4,299",
                │                             REF_ID:"OD-123", DATE:"3 weeks ago"]
                │
                └──► HITL gate: "I have extracted the following — please confirm:
                                  - Company: Flipkart
                                  - Amount: ₹4,299
                                  - Order ID: OD-123
                                  - Date: 3 weeks ago
                                 Is this correct?"
                │
        User reviews in "Verify Entities" tab → edits if needed → clicks Confirm
                │
                ▼
        POST /validate-entities → [USER CONFIRMED] → draft_complaint()
                │
        ComplaintDraft generated and shown in Draft tab
                │
        recommend_action(domain="ecommerce") → [NCH, Consumer Forum]
                │
        Gradio renders: Draft · Evidence table · Escalation panel

5. Project Directory Structure

Project_ResolveAI/
│
├── start.py                       ← SINGLE ENTRY POINT — trains all models then starts servers
│
├── docs/
│   ├── abstract.md                ← project abstract (G.U.I.D.E.)
│   └── architecture.md            ← this file (spec)
│
├── src/
│   ├── __init__.py
│   ├── privacy/                   # Presidio PII redaction (runs before any external call)
│   │   ├── __init__.py
│   │   └── redactor.py            ← PIIRedactor singleton
│   │
│   ├── classifier/                # DL Domain Classifier (DistilBERT)
│   │   ├── model.py               ← DomainClassifier, CFPB_PRODUCT_MAP, LABEL2ID
│   │   ├── dataset.py             ← load_cfpb_csv(), clean_complaint_text(), ComplaintDataset
│   │   ├── train.py               ← CLI: --cfpb_csv, --output_dir, --epochs, --batch_size
│   │   └── predict.py             ← DomainPredictor singleton, classify_domain()
│   │
│   ├── ner/                       # DL NER model (DistilBERT token classifier)
│   │   ├── model.py               ← EvidenceNER, NER_LABELS, NER_LABEL2ID
│   │   ├── train.py               ← Generates synthetic data in-memory; CLI: --output_dir
│   │   └── predict.py             ← NERPredictor singleton, extract_entities()
│   │
│   ├── next_action/               # Next-action MLP predictor
│   │   ├── model.py               ← NextActionMLP, DOMAIN_ACTION_PRIORS, build_feature_vector()
│   │   ├── train.py               ← Generates synthetic features; CLI: --output_dir, --epochs
│   │   └── predict.py             ← NextActionPredictor singleton (MLP or rule-based fallback)
│   │
│   ├── document_processor/        # OCR + PDF parsing + ViT
│   │   ├── ocr.py                 ← Tesseract + pdfplumber pipeline
│   │   └── vit_extractor.py       ← ViT-based image evidence extraction
│   │
│   ├── agent/                     # Claude CMA integration
│   │   ├── tools.py               ← Tool definitions (JSON Schema) + execute_tool()
│   │   ├── prompts.py             ← SYSTEM_PROMPT (HITL rule 4, privacy context)
│   │   └── session.py             ← AgentManager singleton, send_message()
│   │
│   └── api/                       # FastAPI application
│       ├── main.py                ← Lifespan: Presidio → DL models → CMA agent
│       ├── routes.py              ← /message (Presidio→agent), /validate-entities (HITL)
│       └── schemas.py             ← Pydantic models incl. HITL + pii_redacted fields
│
├── ui/
│   └── app.py                     ← Gradio: Chat · Verify · Docs · Draft · Escalation · About
│
├── notebooks/                     # OPTIONAL — EDA and interactive demos only
│   ├── 01_data_exploration.ipynb  ← Explore CFPB dataset, save processed CSV
│   ├── 02_classifier_training.ipynb
│   ├── 04_cma_agent_demo.ipynb
│   └── 05_end_to_end_demo.ipynb
│
├── data/
│   ├── raw/                       ← Place CFPB complaints.csv here (one-time download)
│   ├── processed/                 ← Output of EDA notebook; not required for training
│   └── sample_complaints/         ← Synthetic domain-specific CSVs for augmentation
│
├── models/                        ← Created by training; populated by start.py
│   ├── domain_classifier/         ← best_model.pt + tokenizer files
│   ├── evidence_ner/              ← best_model.pt + tokenizer files
│   ├── document_vit/              ← ViT checkpoint
│   └── next_action/               ← best_model.pt
│
├── CLAUDE.md                      ← Guidance for Claude Code
├── requirements.txt               ← presidio-analyzer, presidio-anonymizer, spacy, torch, etc.
├── .env.example                   ← Template — copy to .env and add ANTHROPIC_API_KEY
├── .gitignore
└── README.md

6. Setup and Running

Step 1 — Get an Anthropic API key

1. Go to https://console.anthropic.com → Sign up (free tier available)
2. Navigate to API Keys → Create Key
3. Copy the key (shown only once)
4. In the project root, copy the template and fill in your key:
```
cp .env.example .env
# then edit .env:
ANTHROPIC_API_KEY=sk-ant-...
```
Never commit .env to git (listed in .gitignore).

Step 2 — Install dependencies

```
pip install -r requirements.txt
python -m spacy download en_core_web_lg      # Presidio NLP model (local, ~750 MB)
```

Step 3 — Download CFPB data (first run only)

The DomainClassifier requires the CFPB Consumer Complaint Database:

Download from Kaggle: consumer-complaint-database dataset
Save the CSV to data/raw/complaints.csv
Size: ~600 MB; one-time download; not committed to git

This is only needed to train the classifier. The NER and NextAction models generate their training data in-memory automatically.

Step 4 — Run (single command)

```
# First run: trains all models, then starts servers
python start.py --cfpb_csv data/raw/complaints.csv

# After first run: models already trained, skip training
python start.py --no-train

# Force retrain everything:
python start.py --cfpb_csv data/raw/complaints.csv --train

# Train only (no servers):
python start.py --cfpb_csv data/raw/complaints.csv --train-only
```

When running, start.py will:

1. Validate `.env` and `ANTHROPIC_API_KEY`
2. Train DomainClassifier (~30 min CPU / ~5 min GPU T4); skipped if checkpoint exists
3. Train EvidenceNER (~10 min CPU); skipped if checkpoint exists
4. Train NextActionMLP (< 30 sec CPU); skipped if checkpoint exists
5. Start FastAPI at http://localhost:8000 (Swagger docs at /docs)
6. Start Gradio UI at http://localhost:7860

Both servers print to the same terminal, prefixed with [API] or [UI]. Press Ctrl+C to stop everything cleanly.
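
The interaction between the flags and the "skipped if checkpoint exists" rule can be sketched with `argparse` and `pathlib`. This is an illustration of the documented behaviour, not the contents of `start.py`: the `plan` function, the returned dictionary, and the assumption that each model is detected via `best_model.pt` are hypothetical.

```python
import argparse
from pathlib import Path

# Checkpoint paths follow the directory layout in section 5. The ViT checkpoint
# is omitted because its training is not among the steps listed above.
CHECKPOINTS = {
    "domain_classifier": "models/domain_classifier/best_model.pt",
    "evidence_ner": "models/evidence_ner/best_model.pt",
    "next_action": "models/next_action/best_model.pt",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="start.py")
    parser.add_argument("--cfpb_csv")
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--no-train", dest="no_train", action="store_true")
    parser.add_argument("--train-only", dest="train_only", action="store_true")
    return parser


def plan(args: argparse.Namespace, root: str = ".") -> dict:
    """Decide which models to train and whether to start the servers."""
    to_train = []
    if not args.no_train:
        for name, rel_path in CHECKPOINTS.items():
            # --train forces retraining; otherwise train only missing models.
            if args.train or not (Path(root) / rel_path).exists():
                to_train.append(name)
    return {"train": to_train, "start_servers": not args.train_only}


print(plan(build_parser().parse_args(["--no-train"])))
```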

7. Development Phases

Phase	Deliverable	Status
0	Abstract submitted (G.U.I.D.E.)	Done ✓
1	Architecture + spec (`docs/architecture.md`)	Done ✓
2	Project scaffold + environment setup	Done ✓
3	Presidio PII redaction layer (`src/privacy/`)	Done ✓
4	DL: DomainClassifier — model, dataset, train	Done ✓
5	DL: EvidenceNER + NextActionMLP — model, train	Done ✓
6	DL: ViT document extractor (fine-tuning)	In progress
7	CMA agent + tools integration	Done ✓
8	Document processor (OCR + ViT pipeline)	Done ✓
9	FastAPI backend (HITL endpoint, schemas)	Done ✓
10	Gradio UI (Verify tab, privacy badge, HITL)	Done ✓
11	Single launcher (`start.py`) + CLAUDE.md	Done ✓
12	Integration testing + demo notebooks	Remaining
13	Final report + presentation	Remaining