# G.U.I.D.E.: Technical Architecture & Specification
## 1. Overview
G.U.I.D.E. (Grievance Utility for Information Extraction, Drafting and Enrichment)
is a **four-layer, spec-driven system** built for consumer complaint resolution.
Every component has a clear contract; layers communicate through defined interfaces
so each piece can be tested and replaced independently.
```
┌──────────────────────────────────────────────────────────────────┐
│                         GRADIO FRONTEND                          │
│  Chat · Verify (HITL) · Documents · Draft · Escalation · About   │
└─────────────────────────────────┬────────────────────────────────┘
                                  │ HTTP (REST)
┌─────────────────────────────────▼────────────────────────────────┐
│                         FASTAPI BACKEND                          │
│  /api/session   /api/message   /api/upload                       │
│  /api/session/{id}/validate-entities   /api/status               │
└──┬──────────────────┬────────────────────────────────────────────┘
   │                  │
   │ Step 1 (local)   │ Step 2 (external API, after redaction)
   │                  │
┌──▼───────────────┐ ┌▼───────────────────────────────────────────────┐
│ PRESIDIO         │ │ CLAUDE MANAGED AGENT (CMA)                     │
│ PIIRedactor      │ │                                                │
│ (runs locally)   │ │ Tools:                                         │
│                  │ │  - classify_domain()   ──► DomainClassifier    │
│ Redacts:         │ │  - extract_entities()  ──► EvidenceNER         │
│   PERSON         │ │  - process_document()  ──► OCR / ViT + NER     │
│   PHONE_NUMBER   │ │  - draft_complaint()   ──► Claude (internal)   │
│   EMAIL_ADDRESS  │ │  - recommend_action()  ──► NextActionPredictor │
│   CREDIT_CARD    │ │                                                │
│   IN_AADHAAR     │ │ HITL gate: agent pauses before drafting        │
│   IN_PAN         │ │ and requests user confirmation of entities     │
│   IBAN_CODE      │ │                                                │
│   ...            │ │ Memory: per-user session state                 │
└──────────────────┘ └────────────────────────────────────────────────┘
                                  │
                     ┌────────────▼─────────────────────────────────┐
                     │ DEEP LEARNING LAYER                          │
                     │                                              │
                     │ 1. DomainClassifier     (DistilBERT)         │
                     │ 2. EvidenceNER          (DistilBERT tokens)  │
                     │ 3. DocumentViT          (ViT image encoder)  │
                     │ 4. NextActionPredictor  (MLP)                │
                     └──────────────────────────────────────────────┘
                                  │
                     ┌────────────▼──────────────┐
                     │ DOCUMENT PROCESSOR        │
                     │ Tesseract OCR             │
                     │ pdfplumber (PDF parse)    │
                     │ PIL (image pre-process)   │
                     │ ViT (image understanding) │
                     └───────────────────────────┘
```
---
## 2. Component Specifications
### 2.0 Privacy Preprocessing: Microsoft Presidio
This layer runs **locally** before any message is forwarded to Claude or any
external service. It is the first step in the pipeline.
| Attribute | Value |
|--------------|-----------------------------------------------------------|
| Library | `presidio-analyzer` + `presidio-anonymizer` |
| NLP engine | spaCy `en_core_web_lg` (local, no network call) |
| Trigger | Every `/api/session/{id}/message` call |
| Entity types | `PERSON`, `PHONE_NUMBER`, `EMAIL_ADDRESS`, `CREDIT_CARD`, `IBAN_CODE`, `US_BANK_NUMBER`, `IN_AADHAAR`, `IN_PAN`, `IN_VEHICLE_REGISTRATION` |
| Replacement | Each detected span → `<ENTITY_TYPE>` placeholder |
| Output | `RedactionResult(redacted_text, pii_types_found, pii_redacted)` |
| Failure mode | On any error, the original text is returned unchanged (fail-open, so the pipeline is never blocked; in that case unredacted text continues downstream) |
**Why local?**
The user's name, account number, and Aadhaar UID must never leave the device
in plaintext. Running Presidio in-process (same Python server) ensures redaction
happens before TCP bytes are written to the Anthropic API.
**What the DL models receive:**
The local DL models (DomainClassifier, EvidenceNER) are called by Claude through
tool calls, so they also receive redacted text. The EvidenceNER model is
still effective because it targets structural patterns (amounts, dates, reference
IDs) that are not redacted by Presidio.
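
The redaction contract in the table above (placeholder replacement, the `RedactionResult` fields, fail-open on error) can be sketched without Presidio installed. In this minimal illustration the two regular expressions are stand-ins for Presidio's recognizers and cover only two of the listed entity types; the `redact` function name is hypothetical.

```python
import re
from dataclasses import dataclass, field

# Stand-in detectors for illustration only; the real layer uses Presidio recognizers.
_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


@dataclass
class RedactionResult:
    redacted_text: str
    pii_types_found: list = field(default_factory=list)
    pii_redacted: bool = False


def redact(text: str) -> RedactionResult:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    try:
        found = []
        redacted = text
        for entity_type, pattern in _PATTERNS.items():
            redacted, count = pattern.subn(f"<{entity_type}>", redacted)
            if count:
                found.append(entity_type)
        return RedactionResult(redacted, found, bool(found))
    except Exception:
        # Fail-open, as in the spec table: return the input unchanged.
        return RedactionResult(text, [], False)


result = redact("My phone is 9876543210, mail me at rahul@example.com.")
print(result.redacted_text)
```

Callers can check `pii_redacted` to decide whether to show the privacy badge in the chat UI.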
---
### 2.1 Deep Learning Layer
#### 2.1.1 DomainClassifier
| Attribute | Value |
|-----------------|----------------------------------------------------|
| Architecture | `distilbert-base-uncased` + linear classification head |
| Task | Multi-class text classification |
| Classes (6) | `ecommerce`, `telecom`, `banking`, `cibil`, `insurance`, `general` |
| Training data | CFPB Consumer Complaint Database (3M+ rows); one-time download from Kaggle. Save as `data/raw/complaints.csv`. |
| Script | `python -m src.classifier.train --cfpb_csv data/raw/complaints.csv --output_dir models/domain_classifier` |
| Input | Redacted complaint text (string, max 512 tokens) |
| Output | `DomainResult(domain: str, confidence: float, all_probs: dict, low_confidence: bool)` |
| Confidence threshold | `0.50`; results below this set `low_confidence=True` |
| Low-confidence path | Agent asks user one clarifying domain question; does not proceed until user confirms |
| Keyword fallback | Used when no checkpoint exists; always returns `confidence=0.0`, `low_confidence=True` |
| Fine-tune time | ~30 min CPU / ~5 min GPU (T4) |
| Library | HuggingFace `transformers` + `datasets` |
**Why DistilBERT?**
DistilBERT is about 40% smaller and 60% faster than BERT-base while retaining
roughly 97% of its language-understanding performance, as reported by its authors.
For a project with limited compute, it is a practical starting point.
The CFPB dataset maps naturally to our 6 classes after label remapping.
**Low-confidence handling:**
`general` is the intentional catch-all class: the model never returns an error, only a
domain + probability. However, a low probability on all classes (e.g., the complaint text
is too short or ambiguous) means the winning domain is unreliable. When `confidence < 0.50`
the `low_confidence` flag is set and the CMA agent pauses to ask the user one clarifying
question ("Is this about e-commerce, telecom, banking, credit score, insurance, or other?")
before continuing. The user's answer overrides the model's suggestion and is stored with
`domain_source = "user_confirmed"` so later tools know the domain is authoritative.
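
The thresholding and user override described above can be sketched as follows. The `DomainResult` fields and the `0.50` threshold come from the spec table; the helper names and the `"model"` value for `domain_source` are assumptions made for this illustration.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.50  # from the spec table


@dataclass
class DomainResult:
    domain: str
    confidence: float
    all_probs: dict
    low_confidence: bool


def to_domain_result(all_probs: dict, threshold: float = CONFIDENCE_THRESHOLD) -> DomainResult:
    """Pick the most probable class and flag it when it falls below the threshold."""
    domain = max(all_probs, key=all_probs.get)
    confidence = all_probs[domain]
    return DomainResult(domain, confidence, all_probs, confidence < threshold)


def resolve_domain(result: DomainResult, user_answer=None) -> tuple:
    """Return (domain, domain_source); a user answer overrides the model."""
    if user_answer is not None:
        return user_answer, "user_confirmed"
    return result.domain, "model"  # "model" is a hypothetical label


ambiguous = to_domain_result({
    "ecommerce": 0.35, "telecom": 0.30, "banking": 0.15,
    "cibil": 0.05, "insurance": 0.05, "general": 0.10,
})
print(ambiguous.domain, ambiguous.low_confidence)
```

Because the flag is derived from the winning probability alone, a short or ambiguous complaint triggers the clarifying question even though a domain is still returned.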
---
#### 2.1.2 EvidenceNER
| Attribute | Value |
|--------------|------------------------------------------------------------|
| Architecture | `distilbert-base-uncased` with token classification head |
| Task | Named Entity Recognition (NER) on complaint text |
| Entity types | `ORG`, `AMOUNT`, `DATE`, `REF_ID`, `ACCOUNT`, `PERSON` |
| Training | ~4,000 synthetic complaint sentences generated in-memory by `src/ner/train.py` (no download needed). Optionally augmented with CoNLL-2003 via HuggingFace if internet is available (maps PER→PERSON, ORG→ORG; discards LOC/MISC). |
| Script | `python -m src.ner.train --output_dir models/evidence_ner` |
| Input | Redacted text (from user or OCR) |
| Output | List of `{text, label, start, end, confidence}` spans |
**Entities and their use in drafting:**
| Entity | Example | Used for |
|----------|---------------------------------|-----------------------------|
| ORG | "Flipkart", "HDFC Bank" | Complaint addressee |
| AMOUNT | "₹4,299", "Rs. 1,200" | Financial loss quantified |
| DATE | "12 March 2024", "last Tuesday" | Incident timeline |
| REF_ID | "Order #OD-2930291", "TXN123" | Evidence reference |
| ACCOUNT | "XXXX-1234", "loan account" | Dispute target |
| PERSON | "customer care executive" | Named witness/contact |
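
When several spans share a label, the drafting step needs one value per field. The spec does not say how duplicates are resolved; the sketch below assumes the highest-confidence span wins, using the `{text, label, start, end, confidence}` schema from the table above. The function name is hypothetical.

```python
def best_span_per_label(spans: list) -> dict:
    """Keep the text of the highest-confidence span for each entity label."""
    best = {}
    for span in spans:
        current = best.get(span["label"])
        if current is None or span["confidence"] > current["confidence"]:
            best[span["label"]] = span
    return {label: span["text"] for label, span in best.items()}


spans = [
    {"text": "Flipkart", "label": "ORG", "start": 0, "end": 8, "confidence": 0.98},
    {"text": "Rs. 1,200", "label": "AMOUNT", "start": 30, "end": 39, "confidence": 0.61},
    {"text": "₹4,299", "label": "AMOUNT", "start": 50, "end": 56, "confidence": 0.93},
]
fields = best_span_per_label(spans)
print(fields)
```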
---
#### 2.1.3 DocumentViT
| Attribute | Value |
|--------------|---------------------------------------------------------------|
| Architecture | Vision Transformer (`google/vit-base-patch16-224` fine-tuned)|
| Task | Structured evidence extraction from document images |
| Input | Scanned receipt / bill / screenshot (PIL Image) |
| Output | List of `{text, label, confidence}` spans (same schema as NER)|
| When used | After OCR; ViT runs as a complementary pass on image-type docs|
| Library | HuggingFace `transformers` (`ViTForImageClassification` + custom head) |
**Why ViT alongside OCR?**
Tesseract OCR excels at clean printed text but struggles with handwriting, logos,
and table structures. The ViT model, fine-tuned on receipt and bill images, directly
classifies image regions and extracts amount/date/provider fields, which is especially
useful for blurry screenshots and poorly scanned documents.
---
#### 2.1.4 NextActionPredictor
| Attribute | Value |
|---------------|-------------------------------------------------------------------|
| Architecture | 2-hidden-layer MLP (12-dim input → 64 → 64 → 6) |
| Input | 12-dim feature vector: domain one-hot (6) + entity flags (5) + prior_contact (1) |
| Output | Ranked list of `{action, authority, url, confidence}` |
| Actions | 6 classes: `company_support`, `nch`, `trai`, `rbi_ombudsman`, `irdai`, `legal` |
| Training data | ~6,000 synthetic (domain, entity_flags, prior_contact → action) examples generated in-memory from `DOMAIN_ACTION_PRIORS`; no download needed. Trains in < 30 seconds on CPU. |
| Script | `python -m src.next_action.train --output_dir models/next_action` |
| Fallback | If no checkpoint exists, `DOMAIN_ACTION_PRIORS` rule-based mapping is used so the pipeline always works. |
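
The 12-dimensional input can be assembled as shown in this sketch. The 6 + 5 + 1 layout is from the table above; which five entity types become flags is not stated in the spec, so `ENTITY_FLAGS` and the function signature are assumptions rather than the contents of `src/next_action/model.py`.

```python
DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]
# Assumption: the spec says "entity flags (5)" without naming the five types.
ENTITY_FLAGS = ["ORG", "AMOUNT", "DATE", "REF_ID", "ACCOUNT"]


def build_feature_vector(domain: str, entity_labels: set, prior_contact: bool) -> list:
    """Concatenate domain one-hot (6), entity presence flags (5), prior_contact (1)."""
    one_hot = [1.0 if d == domain else 0.0 for d in DOMAINS]
    flags = [1.0 if label in entity_labels else 0.0 for label in ENTITY_FLAGS]
    return one_hot + flags + [1.0 if prior_contact else 0.0]


vec = build_feature_vector("ecommerce", {"ORG", "AMOUNT", "REF_ID"}, prior_contact=True)
print(len(vec), vec)
```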
**Escalation routing logic:**
| Domain | Primary Authority | Secondary |
|-----------|---------------------------|--------------------|
| E-commerce| Company support → NCH | Consumer Forum |
| Telecom | Company support → TRAI | NCH |
| Banking | Company support → RBI BO | Banking Ombudsman |
| CIBIL | Bureau direct → RBI BO | SEBI (if investment)|
| Insurance | Company support → IRDAI | Insurance Ombudsman|
| General | Company support → NCH | Consumer Forum |
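
A rule-based fallback along the lines of the routing table might look like the sketch below. The mapping is derived from the table for illustration and is not the project's actual `DOMAIN_ACTION_PRIORS`; mapping "Bureau direct" to `company_support` and skipping company support after prior contact are both assumptions.

```python
# Hypothetical priors read off the routing table; the real mapping may differ.
ROUTING_PRIORS = {
    "ecommerce": ["company_support", "nch"],
    "telecom": ["company_support", "trai"],
    "banking": ["company_support", "rbi_ombudsman"],
    "cibil": ["company_support", "rbi_ombudsman"],  # "Bureau direct" treated as company_support
    "insurance": ["company_support", "irdai"],
    "general": ["company_support", "nch"],
}


def recommend_fallback(domain: str, prior_contact: bool) -> list:
    """Return ordered actions; drop company support if the user already tried it."""
    actions = ROUTING_PRIORS.get(domain, ROUTING_PRIORS["general"])
    if prior_contact and len(actions) > 1:
        actions = [a for a in actions if a != "company_support"]
    return list(actions)


print(recommend_fallback("telecom", prior_contact=False))
print(recommend_fallback("telecom", prior_contact=True))
```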
---
### 2.2 Claude Managed Agent (CMA)
The CMA is the orchestration layer. It maintains **per-user session state**
(conversation history, extracted entities, uploaded docs, draft versions) and
decides at each turn which tool to invoke.
**Key constraint:** Claude only ever sees **redacted text** (PII replaced with
`<ENTITY_TYPE>` placeholders by Presidio before the API call). This is documented
in the system prompt so Claude knows not to try to recover original values.
#### Agent System Prompt Summary
```
You are G.U.I.D.E., an expert consumer complaint assistant.
PII has already been redacted locally; work with placeholders as-is.
Rules:
1. Always classify the domain first using classify_domain().
   • If low_confidence=false (≥ 0.50): store domain and proceed.
   • If low_confidence=true (< 0.50 or keyword fallback): ask the user ONE
     clarifying question ("Is this about e-commerce, telecom, banking, credit
     score, insurance, or other?") before continuing. Store domain_source=
     "user_confirmed" when the domain comes from the user.
   • If classify_domain() errors: same clarifying question as above.
2. Ask ONE targeted follow-up question at a time if information is missing.
3. If documents are uploaded, always run process_document before drafting.
4. HITL gate: Before calling draft_complaint, present extracted details
as a numbered summary and ask the user to confirm them. Wait for
[USER CONFIRMED] message before proceeding.
5. Never draft until domain, provider, date, amount, prior contact, and
desired resolution are all known AND user-confirmed.
6. Generate drafts in formal English: Subject / To / Body / From.
7. Always recommend the next escalation step with specific portal URLs.
```
#### Tool Specifications
| Tool | Input | Output | Calls |
|-------------------|------------------------------|---------------------------------|-------------------|
| `classify_domain` | `complaint_text: str` | `DomainResult` | DL DomainClassifier |
| `extract_entities`| `text: str` | `List[Entity]` | DL EvidenceNER |
| `process_document`| `file_path: str` | `{raw_text, entities}` | OCR + ViT + EvidenceNER |
| `draft_complaint` | `complaint_context: dict` | `ComplaintDraft` | Claude (internal) |
| `recommend_action`| `domain: str, entities: dict`| `List[EscalationAction]` | DL NextAction |
| `store_memory` | `key: str, value: any` | `None` | CMA Memory Store |
| `get_memory` | `key: str` | `any` | CMA Memory Store |
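
As a rough illustration of how one row of this table could be expressed, the sketch below defines `classify_domain` in the `name` / `description` / `input_schema` shape used for Anthropic tool definitions and routes calls through an `execute_tool` dispatcher. The handler is a stub that mimics the keyword-fallback result; the real wiring in `src/agent/tools.py` may differ.

```python
import json

CLASSIFY_DOMAIN_TOOL = {
    "name": "classify_domain",
    "description": "Classify redacted complaint text into one of six domains.",
    "input_schema": {
        "type": "object",
        "properties": {"complaint_text": {"type": "string"}},
        "required": ["complaint_text"],
    },
}


def _classify_domain_stub(complaint_text: str) -> dict:
    # Placeholder handler; the real tool calls the DomainClassifier.
    return {"domain": "general", "confidence": 0.0, "all_probs": {}, "low_confidence": True}


_HANDLERS = {"classify_domain": _classify_domain_stub}


def execute_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call by name and return the result as a JSON string."""
    handler = _HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report the failure to the agent instead of raising inside the loop.
        return json.dumps({"error": str(exc)})


print(execute_tool("classify_domain", {"complaint_text": "<PERSON> was overcharged"}))
```

Returning errors as JSON keeps the agent loop alive, which fits rule 1 of the system prompt: a failing `classify_domain()` leads to the clarifying question rather than a crash.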
#### CMA Decision Flow (per user turn)
```
User message received
        │
        ▼  ── PRESIDIO (API layer, before agent) ──────────────────
PII redacted locally → redacted_text forwarded to Claude
        │
        ▼
Is domain known? ──No──► call classify_domain() ──► store in memory
        │
       Yes
        │
        ▼
Are minimum fields complete? ──No──► ask ONE follow-up question
        │                            (provider, date, amount, ref)
       Yes
        │
        ▼
Was a document uploaded? ──Yes──► call process_document() ──► merge entities
        │
        No
        │
        ▼  ── HITL GATE ───────────────────────────────────────────
Present extracted details summary → ask user to confirm
        │
Wait for [USER CONFIRMED] message (from /validate-entities endpoint)
        │
        ▼
Has user confirmed entities? ──Yes──► call draft_complaint() ──► show draft
        │
        ▼
Has user asked next steps? ──Yes──► call recommend_action() ──► show escalation
```
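
The flow above can be read as a pure function from session state to the next step. In practice Claude makes this decision from the system prompt, so the sketch below is only a deterministic illustration of the same ordering; every state key is a hypothetical name.

```python
REQUIRED_FIELDS = ("provider", "date", "amount", "ref")


def next_step(state: dict) -> str:
    """Return the next step for one user turn, following the decision flow."""
    if not state.get("domain"):
        return "classify_domain"
    missing = [name for name in REQUIRED_FIELDS if not state.get(name)]
    if missing:
        return f"ask_follow_up:{missing[0]}"  # one question at a time
    if state.get("pending_documents"):
        return "process_document"
    if not state.get("entities_confirmed"):
        return "hitl_confirm"  # HITL gate: wait for [USER CONFIRMED]
    if not state.get("draft"):
        return "draft_complaint"
    if state.get("asked_next_steps"):
        return "recommend_action"
    return "idle"


state = {"domain": "ecommerce", "provider": "Flipkart", "date": "3 weeks ago"}
print(next_step(state))
```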
---
### 2.3 Human-in-the-Loop (HITL) Validation
After all required fields are collected and before draft generation, the system
pauses and requires explicit user confirmation of the extracted entities.
| Step | Component | Description |
|------|-----------|-------------|
| 1 | CMA | Presents extracted entities as a numbered summary in chat |
| 2 | Frontend | Populates the **Verify Entities** tab with pre-filled editable fields |
| 3 | User | Reviews, edits any incorrect value, and clicks "Confirm & Generate Draft" |
| 4 | API | `POST /api/session/{id}/validate-entities` sends verified entities to CMA |
| 5 | CMA | Receives `[USER CONFIRMED]` message and calls `draft_complaint()` |
**Why HITL?**
PII redaction replaces some values with placeholders (e.g., a name becomes
`<PERSON>`). The HITL step lets the user supply the correct readable label
(e.g., "HDFC Bank" rather than just `<ORG>`) that will appear in the final draft,
improving both accuracy and trust in the generated complaint.
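
A minimal sketch of steps 3 to 5: user edits are merged over the extracted values and turned into the confirmation message for the agent. Only the `[USER CONFIRMED]` marker comes from the spec; the function names, field keys, and the numbered-line layout are assumptions.

```python
def apply_user_edits(extracted: dict, edits: dict) -> dict:
    """User-supplied values win; blank edits leave the extracted value in place."""
    confirmed = dict(extracted)
    for key, value in edits.items():
        if value is not None and str(value).strip():
            confirmed[key] = str(value).strip()
    return confirmed


def build_confirmation_message(confirmed: dict) -> str:
    """Format the confirmed entities as the message sent to the agent."""
    lines = [f"{i}. {key}: {value}" for i, (key, value) in enumerate(confirmed.items(), start=1)]
    return "[USER CONFIRMED]\n" + "\n".join(lines)


extracted = {"provider": "<ORG>", "amount": "₹4,299"}
confirmed = apply_user_edits(extracted, {"provider": "HDFC Bank", "amount": ""})
message = build_confirmation_message(confirmed)
print(message)
```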
---
### 2.4 Document Processor
| Feature | Implementation |
|--------------|---------------------------------------|
| PDF parsing | `pdfplumber` (text-native PDFs) |
| Image OCR | `pytesseract` + `Pillow` (pre-process)|
| ViT pass | `google/vit-base-patch16-224` fine-tuned on receipt/bill images |
| Pre-process | Greyscale → adaptive threshold → deskew |
| Output | Clean extracted text + NER entities |
| Formats | PDF, PNG, JPG, JPEG, WEBP |
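
How an uploaded file might be routed to the stages in this table is sketched below. The supported formats are from the table; the stage names and the `choose_pipeline` helper are hypothetical and say nothing about how `src/document_processor/` is actually organised.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def choose_pipeline(file_path: str) -> list:
    """Return the ordered processing stages for a file, based on its extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return ["pdfplumber", "ner"]
    if suffix in IMAGE_SUFFIXES:
        return ["preprocess", "tesseract", "vit", "ner"]
    raise ValueError(f"unsupported format: {suffix or file_path}")


print(choose_pipeline("uploads/bill.PDF"))
print(choose_pipeline("uploads/receipt.webp"))
```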
---
### 2.5 FastAPI Backend
| Endpoint | Method | Description |
|---------------------------------------|--------|----------------------------------------|
| `/api/health` | GET | Health check (all components) |
| `/api/session/create` | POST | Create new CMA session |
| `/api/session/{id}/message` | POST | Send message → Presidio redact → agent |
| `/api/session/{id}/upload` | POST | Upload a document to session |
| `/api/session/{id}/validate-entities` | POST | HITL: submit user-confirmed entities |
| `/api/session/{id}/history` | GET | Retrieve conversation history |
| `/api/classify` | POST | Direct DL classification (debug) |
| `/api/extract` | POST | Direct NER extraction (debug) |
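
For a quick check of the message endpoint from a script, a request can be built with the standard library as sketched here. The path comes from the table and the port from the setup section; the `message` field name in the JSON body is an assumption, since the Pydantic schemas are not reproduced in this document.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_message_request(session_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/session/{id}/message."""
    body = json.dumps({"message": text}).encode("utf-8")  # field name is an assumption
    return urllib.request.Request(
        url=f"{BASE_URL}/api/session/{session_id}/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_message_request("abc123", "Flipkart has not refunded my order.")
print(req.get_method(), req.full_url)
# Sending requires the backend to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```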
---
### 2.6 Gradio Frontend
Tabs:
1. **Chat**: Conversational interface; shows 🔒 privacy badge when PII is redacted
2. **Verify Entities**: HITL panel with editable entity fields + "Confirm & Generate Draft"
3. **Documents**: Drag-and-drop upload; shows extracted entities
4. **Complaint Draft**: Rendered complaint with copy/download
5. **Escalation Guide**: Recommended authorities with portal links
6. **About**: Architecture diagram, model cards, tech stack
---
## 3. Technology Stack
| Layer | Technology | Reason |
|--------------------|--------------------------------|-----------------------------------------------|
| Launcher | `start.py` (stdlib only) | Single script: trains all models, then starts servers |
| Privacy | Microsoft Presidio + spaCy | Local PII redaction, no cloud call |
| DL Models | HuggingFace Transformers | Industry standard for NLP + ViT |
| Classifier data | CFPB dataset (Kaggle, one-time)| 3M+ real complaints, public license |
| NER data | Synthetic in-memory | Template-generated; no download required |
| NextAction data | Synthetic in-memory | Generated from domain priors; no download |
| Agent | Anthropic CMA (default `claude-sonnet-4-6`, set via `GUIDE_MODEL`) | Stateful, tool-using agent |
| Backend | FastAPI + Uvicorn | Async, fast, OpenAPI auto-docs |
| Frontend | Gradio 4.x | ML-native UI, file upload, chat |
| OCR | pytesseract + pdfplumber | Proven, open-source |
| ViT doc model | HuggingFace ViT | Image-based evidence extraction |
| Env | Python 3.10+ | Required by CMA SDK |
| Config | python-dotenv | Secure API key management |
| Notebooks (optional) | Jupyter | EDA and demo only; not required to run system |
---
## 4. Data Flow: End to End
```
User types: "Rahul Sharma - Flipkart hasn't refunded ₹4,299 for order OD-123
             cancelled 3 weeks ago. My phone is 9876543210."
        │
        ▼  ── LOCAL ONLY ──────────────────────────────────────────
Presidio PIIRedactor detects: PERSON("Rahul Sharma"), PHONE_NUMBER("9876543210")
Redacted: "<PERSON> - Flipkart hasn't refunded ₹4,299 for order OD-123
           cancelled 3 weeks ago. My phone is <PHONE_NUMBER>."
        │
        ▼  ── EXTERNAL API ────────────────────────────────────────
FastAPI → CMA session with redacted text
        │
Claude CMA agent processes redacted message
        │
        ├──► classify_domain(...)
        │       └──► DomainClassifier → {domain: "ecommerce", conf: 0.97}
        │
        ├──► extract_entities(...)
        │       └──► EvidenceNER → [ORG:"Flipkart", AMOUNT:"₹4,299",
        │                           REF_ID:"OD-123", DATE:"3 weeks ago"]
        │
        └──► HITL gate: "I have extracted the following, please confirm:
                          - Company: Flipkart
                          - Amount: ₹4,299
                          - Order ID: OD-123
                          - Date: 3 weeks ago
                          Is this correct?"
        │
User reviews in "Verify Entities" tab → edits if needed → clicks Confirm
        │
        ▼
POST /validate-entities → [USER CONFIRMED] → draft_complaint()
        │
ComplaintDraft generated and shown in Draft tab
        │
recommend_action(domain="ecommerce") → [NCH, Consumer Forum]
        │
Gradio renders: Draft · Evidence table · Escalation panel
```
---
## 5. Project Directory Structure
```
Project_ResolveAI/
│
├── start.py                      ← SINGLE ENTRY POINT: trains all models then starts servers
│
├── docs/
│   ├── abstract.md               ← project abstract (G.U.I.D.E.)
│   └── architecture.md           ← this file (spec)
│
├── src/
│   ├── __init__.py
│   ├── privacy/                  # Presidio PII redaction (runs before any external call)
│   │   ├── __init__.py
│   │   └── redactor.py           ← PIIRedactor singleton
│   │
│   ├── classifier/               # DL Domain Classifier (DistilBERT)
│   │   ├── model.py              ← DomainClassifier, CFPB_PRODUCT_MAP, LABEL2ID
│   │   ├── dataset.py            ← load_cfpb_csv(), clean_complaint_text(), ComplaintDataset
│   │   ├── train.py              ← CLI: --cfpb_csv, --output_dir, --epochs, --batch_size
│   │   └── predict.py            ← DomainPredictor singleton, classify_domain()
│   │
│   ├── ner/                      # DL NER model (DistilBERT token classifier)
│   │   ├── model.py              ← EvidenceNER, NER_LABELS, NER_LABEL2ID
│   │   ├── train.py              ← Generates synthetic data in-memory; CLI: --output_dir
│   │   └── predict.py            ← NERPredictor singleton, extract_entities()
│   │
│   ├── next_action/              # Next-action MLP predictor
│   │   ├── model.py              ← NextActionMLP, DOMAIN_ACTION_PRIORS, build_feature_vector()
│   │   ├── train.py              ← Generates synthetic features; CLI: --output_dir, --epochs
│   │   └── predict.py            ← NextActionPredictor singleton (MLP or rule-based fallback)
│   │
│   ├── document_processor/       # OCR + PDF parsing + ViT
│   │   ├── ocr.py                ← Tesseract + pdfplumber pipeline
│   │   └── vit_extractor.py      ← ViT-based image evidence extraction
│   │
│   ├── agent/                    # Claude CMA integration
│   │   ├── tools.py              ← Tool definitions (JSON Schema) + execute_tool()
│   │   ├── prompts.py            ← SYSTEM_PROMPT (HITL rule 6, privacy context)
│   │   └── session.py            ← AgentManager singleton, send_message()
│   │
│   └── api/                      # FastAPI application
│       ├── main.py               ← Lifespan: Presidio → DL models → CMA agent
│       ├── routes.py             ← /message (Presidio→agent), /validate-entities (HITL)
│       └── schemas.py            ← Pydantic models incl. HITL + pii_redacted fields
│
├── ui/
│   └── app.py                    ← Gradio: Chat · Verify · Docs · Draft · Escalation · About
│
├── notebooks/                    # OPTIONAL: EDA and interactive demos only
│   ├── 01_data_exploration.ipynb ← Explore CFPB dataset, save processed CSV
│   ├── 02_classifier_training.ipynb
│   ├── 04_cma_agent_demo.ipynb
│   └── 05_end_to_end_demo.ipynb
│
├── data/
│   ├── raw/                      ← Place CFPB complaints.csv here (one-time download)
│   ├── processed/                ← Output of EDA notebook; not required for training
│   └── sample_complaints/        ← Synthetic domain-specific CSVs for augmentation
│
├── models/                       ← Created by training; populated by start.py
│   ├── domain_classifier/        ← best_model.pt + tokenizer files
│   ├── evidence_ner/             ← best_model.pt + tokenizer files
│   ├── document_vit/             ← ViT checkpoint
│   └── next_action/              ← best_model.pt
│
├── CLAUDE.md                     ← Guidance for Claude Code
├── requirements.txt              ← presidio-analyzer, presidio-anonymizer, spacy, torch, etc.
├── .env.example                  ← Template: copy to .env and add ANTHROPIC_API_KEY
├── .gitignore
└── README.md
```
---
## 6. Setup and Running
### Step 1: Get an Anthropic API key
1. Go to https://console.anthropic.com → Sign up (free tier available)
2. Navigate to **API Keys** → **Create Key**
3. Copy the key (shown only once)
4. In the project root, copy the template and fill in your key:
```
cp .env.example .env
# then edit .env:
ANTHROPIC_API_KEY=sk-ant-...
```
Never commit `.env` to git (listed in `.gitignore`).
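
Since `start.py` is described as stdlib-only, the key check can be done without `python-dotenv`. The sketch below parses simple `KEY=value` lines and rejects a missing or placeholder key; the function names and the placeholder test are assumptions, not the launcher's actual code.

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=value lines, ignoring blanks and # comments."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env: dict) -> str:
    """Return the key from the parsed file or the process environment, or raise."""
    key = env.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_API_KEY", "")
    if not key or key.endswith("..."):
        raise RuntimeError("ANTHROPIC_API_KEY is missing; copy .env.example to .env and set it")
    return key


print(require_api_key({"ANTHROPIC_API_KEY": "sk-ant-example"}))
```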
### Step 2: Install dependencies
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_lg # Presidio NLP model (local, ~750 MB)
```
### Step 3: Download CFPB data (first run only)
The DomainClassifier requires the CFPB Consumer Complaint Database:
- Download from Kaggle: `consumer-complaint-database` dataset
- Save the CSV to `data/raw/complaints.csv`
- Size: ~600 MB; one-time download; not committed to git
This is only needed to train the classifier. The NER and NextAction models generate their training data in-memory automatically.
### Step 4: Run (single command)
```bash
# First run (trains all models, then starts servers):
python start.py --cfpb_csv data/raw/complaints.csv
# After first run (models already trained, skip training):
python start.py --no-train
# Force retrain everything:
python start.py --cfpb_csv data/raw/complaints.csv --train
# Train only (no servers):
python start.py --cfpb_csv data/raw/complaints.csv --train-only
```
When running, `start.py` will:
1. Validate `.env` and `ANTHROPIC_API_KEY`
2. Train **DomainClassifier** (~30 min CPU / ~5 min GPU T4); skipped if checkpoint exists
3. Train **EvidenceNER** (~10 min CPU); skipped if checkpoint exists
4. Train **NextActionMLP** (< 30 sec CPU); skipped if checkpoint exists
5. Start **FastAPI** at `http://localhost:8000` (Swagger docs at `/docs`)
6. Start **Gradio UI** at `http://localhost:7860`
Both servers print to the same terminal, prefixed with `[API]` or `[UI]`. Press **Ctrl+C** to stop everything cleanly.
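
The flag handling and the "skipped if checkpoint exists" rule could be sketched as below. The four flags and the `best_model.pt` paths are taken from this document; treating the three mode flags as mutually exclusive, and letting `--train-only` train only missing checkpoints, are assumptions about `start.py`.

```python
import argparse
from pathlib import Path

CHECKPOINTS = {
    "domain_classifier": Path("models/domain_classifier/best_model.pt"),
    "evidence_ner": Path("models/evidence_ner/best_model.pt"),
    "next_action": Path("models/next_action/best_model.pt"),
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train models, then start servers")
    parser.add_argument("--cfpb_csv", type=Path)
    mode = parser.add_mutually_exclusive_group()  # exclusivity is an assumption
    mode.add_argument("--no-train", action="store_true")
    mode.add_argument("--train", action="store_true")
    mode.add_argument("--train-only", action="store_true")
    return parser.parse_args(argv)


def models_to_train(args, exists=Path.exists) -> list:
    """Decide which models need training for this run."""
    if args.no_train:
        return []
    if args.train:
        return list(CHECKPOINTS)  # force retrain everything
    return [name for name, path in CHECKPOINTS.items() if not exists(path)]


args = parse_args(["--cfpb_csv", "data/raw/complaints.csv", "--train"])
print(models_to_train(args))
```

Passing `exists` as a parameter keeps the decision testable without touching the filesystem.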
---
## 7. Development Phases
| Phase | Deliverable | Status |
|-------|------------------------------------------------|--------------|
| 0 | Abstract submitted (G.U.I.D.E.) | Done ✓ |
| 1 | Architecture + spec (`docs/architecture.md`) | Done ✓ |
| 2 | Project scaffold + environment setup | Done ✓ |
| 3 | Presidio PII redaction layer (`src/privacy/`) | Done ✓ |
| 4 | DL: DomainClassifier (model, dataset, train) | Done ✓ |
| 5 | DL: EvidenceNER + NextActionMLP (model, train) | Done ✓ |
| 6 | DL: ViT document extractor (fine-tuning) | In progress |
| 7 | CMA agent + tools integration | Done ✓ |
| 8 | Document processor (OCR + ViT pipeline) | Done ✓ |
| 9 | FastAPI backend (HITL endpoint, schemas) | Done ✓ |
| 10 | Gradio UI (Verify tab, privacy badge, HITL) | Done ✓ |
| 11 | Single launcher (`start.py`) + CLAUDE.md | Done ✓ |
| 12 | Integration testing + demo notebooks | Remaining |
| 13 | Final report + presentation | Remaining |