# G.U.I.D.E.: Technical Architecture & Specification

## 1. Overview

G.U.I.D.E. (Grievance Utility for Information Extraction, Drafting and Enrichment)
is a **four-layer, spec-driven system** built for consumer complaint resolution.
Every component has a clear contract; layers communicate through defined interfaces
so each piece can be tested and replaced independently.

```
┌─────────────────────────────────────────────────────────────────┐
│                         GRADIO FRONTEND                         │
│  Chat · Verify (HITL) · Documents · Draft · Escalation · About  │
└──────────────────────────┬──────────────────────────────────────┘
                           │ HTTP (REST)
┌──────────────────────────▼──────────────────────────────────────┐
│                         FASTAPI BACKEND                         │
│  /api/session   /api/message   /api/upload                      │
│  /api/session/{id}/validate-entities   /api/status              │
└──┬──────────────────┬───────────────────────────────────────────┘
   │                  │
   │ Step 1 (local)   │ Step 2 (external API, after redaction)
   │                  │
┌──▼──────────────┐  ┌▼──────────────────────────────────────────────┐
│ PRESIDIO        │  │ CLAUDE MANAGED AGENT (CMA)                    │
│ PIIRedactor     │  │                                               │
│ (runs locally)  │  │ Tools:                                        │
│                 │  │ - classify_domain()  ──► DomainClassifier     │
│ Redacts:        │  │ - extract_entities() ──► EvidenceNER          │
│   PERSON        │  │ - process_document() ──► OCR / ViT + NER      │
│   PHONE_NUMBER  │  │ - draft_complaint()  ──► Claude (internal)    │
│   EMAIL_ADDRESS │  │ - recommend_action() ──► NextActionPredictor  │
│   CREDIT_CARD   │  │                                               │
│   IN_AADHAAR    │  │ HITL gate: agent pauses before drafting       │
│   IN_PAN        │  │ and requests user confirmation of entities    │
│   IBAN_CODE     │  │                                               │
│   ...           │  │ Memory: per-user session state                │
└─────────────────┘  └────────────┬──────────────────────────────────┘
                                  │
                     ┌────────────▼─────────────────────────────────┐
                     │             DEEP LEARNING LAYER              │
                     │                                              │
                     │  1. DomainClassifier    (DistilBERT)         │
                     │  2. EvidenceNER         (DistilBERT tokens)  │
                     │  3. DocumentViT         (ViT image encoder)  │
                     │  4. NextActionPredictor (MLP)                │
                     └────────────┬─────────────────────────────────┘
                                  │
                     ┌────────────▼───────────────┐
                     │     DOCUMENT PROCESSOR     │
                     │ Tesseract OCR              │
                     │ pdfplumber (PDF parse)     │
                     │ PIL (image pre-process)    │
                     │ ViT (image understanding)  │
                     └────────────────────────────┘
```

---

## 2. Component Specifications

### 2.0 Privacy Preprocessing: Microsoft Presidio

This layer runs **locally** before any message is forwarded to Claude or any
external service. It is the first step in the pipeline.

| Attribute    | Value |
|--------------|-------|
| Library      | `presidio-analyzer` + `presidio-anonymizer` |
| NLP engine   | spaCy `en_core_web_lg` (local, no network call) |
| Trigger      | Every `/api/session/{id}/message` call |
| Entity types | `PERSON`, `PHONE_NUMBER`, `EMAIL_ADDRESS`, `CREDIT_CARD`, `IBAN_CODE`, `US_BANK_NUMBER`, `IN_AADHAAR`, `IN_PAN`, `IN_VEHICLE_REGISTRATION` |
| Replacement  | Each detected span → `<ENTITY_TYPE>` placeholder |
| Output       | `RedactionResult(redacted_text, pii_types_found, pii_redacted)` |
| Failure mode | On any error, the original text is returned unchanged (fail-open, so the pipeline is never blocked) |

**Why local?**

The user's name, account number, and Aadhaar UID must never leave the device
in plaintext. Running Presidio in-process (in the same Python server) ensures that
redaction happens before any bytes are written to the Anthropic API.

**What the DL models receive:**

The local DL models (DomainClassifier, EvidenceNER) are called by Claude through
tool calls, so they also receive redacted text. The EvidenceNER model is
still effective because it targets structural patterns (amounts, dates, reference
IDs) that Presidio does not redact.
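
The replacement and fail-open behaviour described above can be sketched as follows. This is a minimal illustration, not the project's `PIIRedactor`: the `regex_detector` function is a hypothetical stand-in for Presidio's analyzer so the example runs with the standard library only, and the `redact` signature is an assumption. Only the `RedactionResult` field names come from the spec.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A detected span: (start, end, entity_type). In the real system these would
# come from Presidio; a regex stand-in keeps this sketch self-contained.
Span = Tuple[int, int, str]


@dataclass
class RedactionResult:
    redacted_text: str
    pii_types_found: List[str] = field(default_factory=list)
    pii_redacted: bool = False


def regex_detector(text: str) -> List[Span]:
    """Hypothetical detector: finds 10-digit phone numbers and emails only."""
    patterns = {
        "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+\.\w+",
        "PHONE_NUMBER": r"\b\d{10}\b",
    }
    spans: List[Span] = []
    for entity_type, pattern in patterns.items():
        spans += [(m.start(), m.end(), entity_type) for m in re.finditer(pattern, text)]
    return spans


def redact(text: str, detect: Callable[[str], List[Span]] = regex_detector) -> RedactionResult:
    """Replace each detected span with <ENTITY_TYPE>; fail open on any error."""
    try:
        # Replace right-to-left so earlier offsets stay valid.
        # Overlapping spans are not handled in this sketch.
        spans = sorted(detect(text), key=lambda s: s[0], reverse=True)
        redacted = text
        for start, end, entity_type in spans:
            redacted = redacted[:start] + f"<{entity_type}>" + redacted[end:]
        return RedactionResult(redacted, sorted({s[2] for s in spans}), bool(spans))
    except Exception:
        # Fail-open, as the spec states: return the original text unchanged.
        return RedactionResult(text, [], False)
```

Under this policy a detector error lets the unredacted text continue down the pipeline, so a caller may want to log the case where `pii_redacted` is `False` because of an exception rather than because nothing was found.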

---

### 2.1 Deep Learning Layer

#### 2.1.1 DomainClassifier

| Attribute            | Value |
|----------------------|-------|
| Architecture         | `distilbert-base-uncased` + linear classification head |
| Task                 | Multi-class text classification |
| Classes (6)          | `ecommerce`, `telecom`, `banking`, `cibil`, `insurance`, `general` |
| Training data        | CFPB Consumer Complaint Database (3M+ rows); one-time download from Kaggle. Save as `data/raw/complaints.csv`. |
| Script               | `python -m src.classifier.train --cfpb_csv data/raw/complaints.csv --output_dir models/domain_classifier` |
| Input                | Redacted complaint text (string, max 512 tokens) |
| Output               | `DomainResult(domain: str, confidence: float, all_probs: dict, low_confidence: bool)` |
| Confidence threshold | `0.50`; results below this set `low_confidence=True` |
| Low-confidence path  | Agent asks user one clarifying domain question; does not proceed until user confirms |
| Keyword fallback     | Used when no checkpoint exists; always returns `confidence=0.0`, `low_confidence=True` |
| Fine-tune time       | ~30 min CPU / ~5 min GPU (T4) |
| Library              | HuggingFace `transformers` + `datasets` |

**Why DistilBERT?**

DistilBERT is about 40% smaller and 60% faster than BERT-base while retaining roughly
97% of its language-understanding performance, as reported by the DistilBERT authors.
For a project with limited compute, it is a practical starting point.
The CFPB dataset maps naturally to our 6 classes after label remapping.

**Low-confidence handling:**

`general` is the intentional catch-all class: the model never returns an error, only a
domain and a probability. However, a low probability on all classes (e.g., because the
complaint text is too short or ambiguous) means the winning domain is unreliable. When
`confidence < 0.50`, the `low_confidence` flag is set and the CMA agent pauses to ask the
user one clarifying question ("Is this about e-commerce, telecom, banking, credit score,
insurance, or other?") before continuing. The user's answer overrides the model's
suggestion and is stored with `domain_source = "user_confirmed"` so later tools know the
domain is authoritative.
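
The threshold and override logic can be illustrated with a short sketch. The `DomainResult` fields and the `0.50` threshold come from the table above; the helper names `to_domain_result` and `resolve_domain`, the returned dict shape, and the `"model"` value for `domain_source` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CONFIDENCE_THRESHOLD = 0.50  # from the spec
DOMAINS = ("ecommerce", "telecom", "banking", "cibil", "insurance", "general")


@dataclass
class DomainResult:
    domain: str
    confidence: float
    all_probs: Dict[str, float]
    low_confidence: bool


def to_domain_result(all_probs: Dict[str, float]) -> DomainResult:
    """Pick the most probable class and flag it if it falls below the threshold."""
    domain = max(all_probs, key=all_probs.get)
    confidence = all_probs[domain]
    return DomainResult(domain, confidence, all_probs, confidence < CONFIDENCE_THRESHOLD)


def resolve_domain(result: DomainResult, user_answer: Optional[str] = None) -> Dict[str, str]:
    """Decide what to store in session memory, or signal that the agent must ask."""
    if user_answer is not None:
        if user_answer not in DOMAINS:
            raise ValueError(f"unknown domain: {user_answer}")
        # The user's answer always overrides the model's suggestion.
        return {"domain": user_answer, "domain_source": "user_confirmed"}
    if result.low_confidence:
        return {"action": "ask_clarifying_question"}
    # "model" is a hypothetical label for a domain the classifier chose itself.
    return {"domain": result.domain, "domain_source": "model"}
```

For example, a result whose top probability is 0.40 yields `{"action": "ask_clarifying_question"}` until the user supplies an answer.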

---

#### 2.1.2 EvidenceNER

| Attribute    | Value |
|--------------|-------|
| Architecture | `distilbert-base-uncased` with token classification head |
| Task         | Named Entity Recognition (NER) on complaint text |
| Entity types | `ORG`, `AMOUNT`, `DATE`, `REF_ID`, `ACCOUNT`, `PERSON` |
| Training     | ~4,000 synthetic complaint sentences generated in-memory by `src/ner/train.py` (no download needed). Optionally augmented with CoNLL-2003 via HuggingFace if internet is available (maps PER→PERSON, ORG→ORG; discards LOC/MISC). |
| Script       | `python -m src.ner.train --output_dir models/evidence_ner` |
| Input        | Redacted text (from user or OCR) |
| Output       | List of `{text, label, start, end, confidence}` spans |

**Entities and their use in drafting:**

| Entity   | Example                         | Used for                    |
|----------|---------------------------------|-----------------------------|
| ORG      | "Flipkart", "HDFC Bank"         | Complaint addressee         |
| AMOUNT   | "₹4,299", "Rs. 1,200"           | Financial loss quantified   |
| DATE     | "12 March 2024", "last Tuesday" | Incident timeline           |
| REF_ID   | "Order #OD-2930291", "TXN123"   | Evidence reference          |
| ACCOUNT  | "XXXX-1234", "loan account"     | Dispute target              |
| PERSON   | "customer care executive"       | Named witness/contact       |

---

#### 2.1.3 DocumentViT

| Attribute    | Value |
|--------------|-------|
| Architecture | Vision Transformer (`google/vit-base-patch16-224`, fine-tuned) |
| Task         | Structured evidence extraction from document images |
| Input        | Scanned receipt / bill / screenshot (PIL Image) |
| Output       | List of `{text, label, confidence}` spans (same schema as NER) |
| When used    | After OCR; ViT runs as a complementary pass on image-type docs |
| Library      | HuggingFace `transformers` (`ViTForImageClassification` + custom head) |

**Why ViT alongside OCR?**

Tesseract OCR excels at clean printed text but struggles with handwriting, logos,
and table structures. The ViT model, fine-tuned on receipt and bill images, directly
classifies image regions and extracts amount/date/provider fields, which is especially
useful for blurry screenshots and poorly scanned documents.

---

#### 2.1.4 NextActionPredictor

| Attribute     | Value |
|---------------|-------|
| Architecture  | 2-hidden-layer MLP (12-dim input → 64 → 64 → 6) |
| Input         | 12-dim feature vector: domain one-hot (6) + entity flags (5) + prior_contact (1) |
| Output        | Ranked list of `{action, authority, url, confidence}` |
| Actions       | 6 classes: `company_support`, `nch`, `trai`, `rbi_ombudsman`, `irdai`, `legal` |
| Training data | ~6,000 synthetic (domain, entity_flags, prior_contact → action) examples generated in-memory from `DOMAIN_ACTION_PRIORS`; no download needed. Trains in < 30 seconds on CPU. |
| Script        | `python -m src.next_action.train --output_dir models/next_action` |
| Fallback      | If no checkpoint exists, the rule-based `DOMAIN_ACTION_PRIORS` mapping is used so the pipeline always works. |

**Escalation routing logic:**

| Domain     | Primary Authority         | Secondary            |
|------------|---------------------------|----------------------|
| E-commerce | Company support → NCH     | Consumer Forum       |
| Telecom    | Company support → TRAI    | NCH                  |
| Banking    | Company support → RBI BO  | Banking Ombudsman    |
| CIBIL      | Bureau direct → RBI BO    | SEBI (if investment) |
| Insurance  | Company support → IRDAI   | Insurance Ombudsman  |
| General    | Company support → NCH     | Consumer Forum       |
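
To make the 12-dimensional input and the rule-based fallback concrete, here is a hedged sketch. It is not the project's `build_feature_vector()` or `DOMAIN_ACTION_PRIORS`. Three things are assumptions: which five entity types become flags (the spec lists six entity types but only five flags), the mapping of "Bureau direct" to `company_support`, and the rule that prior contact skips the first step.

```python
from typing import Dict, List

DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]

# Assumption: the spec does not say which five entity types are used as flags.
ENTITY_FLAGS = ["ORG", "AMOUNT", "DATE", "REF_ID", "ACCOUNT"]

# Primary escalation path per domain, transcribed from the routing table.
# Assumption: "Bureau direct" is represented by the company_support action.
DOMAIN_ACTION_PRIORS: Dict[str, List[str]] = {
    "ecommerce": ["company_support", "nch"],
    "telecom": ["company_support", "trai"],
    "banking": ["company_support", "rbi_ombudsman"],
    "cibil": ["company_support", "rbi_ombudsman"],
    "insurance": ["company_support", "irdai"],
    "general": ["company_support", "nch"],
}


def build_feature_vector(domain: str, entities: Dict[str, object], prior_contact: bool) -> List[float]:
    """Domain one-hot (6) + entity-present flags (5) + prior_contact (1) = 12 dims."""
    one_hot = [1.0 if d == domain else 0.0 for d in DOMAINS]
    flags = [1.0 if entities.get(name) else 0.0 for name in ENTITY_FLAGS]
    return one_hot + flags + [1.0 if prior_contact else 0.0]


def fallback_actions(domain: str, prior_contact: bool) -> List[str]:
    """Rule-based ranking used when no MLP checkpoint is available."""
    actions = DOMAIN_ACTION_PRIORS.get(domain, DOMAIN_ACTION_PRIORS["general"])
    # Assumption: a user who already contacted the company skips that step.
    if prior_contact and len(actions) > 1:
        return actions[1:]
    return list(actions)
```

A banking complaint with only an amount extracted and prior contact made would set index 2 of the one-hot block, the `AMOUNT` flag, and the final element.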

---

### 2.2 Claude Managed Agent (CMA)

The CMA is the orchestration layer. It maintains **per-user session state**
(conversation history, extracted entities, uploaded docs, draft versions) and
decides at each turn which tool to invoke.

**Key constraint:** Claude only ever sees **redacted text** (PII replaced with
`<ENTITY_TYPE>` placeholders by Presidio before the API call). This is documented
in the system prompt so Claude knows not to try to recover original values.

#### Agent System Prompt Summary

```
You are G.U.I.D.E., an expert consumer complaint assistant.
PII has already been redacted locally; work with placeholders as-is.
Rules:
1. Always classify the domain first using classify_domain().
   • If low_confidence=false (≥ 0.50): store domain and proceed.
   • If low_confidence=true (< 0.50 or keyword fallback): ask the user ONE
     clarifying question ("Is this about e-commerce, telecom, banking, credit
     score, insurance, or other?") before continuing. Store domain_source=
     "user_confirmed" when the domain comes from the user.
   • If classify_domain() errors: same clarifying question as above.
2. Ask ONE targeted follow-up question at a time if information is missing.
3. If documents are uploaded, always run process_document before drafting.
4. HITL gate: Before calling draft_complaint, present extracted details
   as a numbered summary and ask the user to confirm them. Wait for
   [USER CONFIRMED] message before proceeding.
5. Never draft until domain, provider, date, amount, prior contact, and
   desired resolution are all known AND user-confirmed.
6. Generate drafts in formal English: Subject / To / Body / From.
7. Always recommend the next escalation step with specific portal URLs.
```

#### Tool Specifications

| Tool               | Input                         | Output                   | Calls                   |
|--------------------|-------------------------------|--------------------------|-------------------------|
| `classify_domain`  | `complaint_text: str`         | `DomainResult`           | DL DomainClassifier     |
| `extract_entities` | `text: str`                   | `List[Entity]`           | DL EvidenceNER          |
| `process_document` | `file_path: str`              | `{raw_text, entities}`   | OCR + ViT + EvidenceNER |
| `draft_complaint`  | `complaint_context: dict`     | `ComplaintDraft`         | Claude (internal)       |
| `recommend_action` | `domain: str, entities: dict` | `List[EscalationAction]` | DL NextAction           |
| `store_memory`     | `key: str, value: any`        | `None`                   | CMA Memory Store        |
| `get_memory`       | `key: str`                    | `any`                    | CMA Memory Store        |
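
As an illustration of how one row of this table might be expressed, the sketch below defines `classify_domain` in the JSON Schema tool format that the Anthropic API accepts (`name`, `description`, `input_schema`) and dispatches calls by name. The description text, the stub handler, the `TOOL_HANDLERS` registry, and the error payload shape are assumptions; the project's actual `execute_tool()` in `src/agent/tools.py` may differ.

```python
import json
from typing import Any, Callable, Dict

# Tool definition in the Anthropic tool-use format.
CLASSIFY_DOMAIN_TOOL = {
    "name": "classify_domain",
    "description": "Classify a redacted complaint into one of six consumer domains.",
    "input_schema": {
        "type": "object",
        "properties": {"complaint_text": {"type": "string"}},
        "required": ["complaint_text"],
    },
}


def _classify_stub(complaint_text: str) -> Dict[str, Any]:
    """Placeholder for the DomainClassifier call; mimics a low-confidence result."""
    return {"domain": "general", "confidence": 0.0, "all_probs": {}, "low_confidence": True}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_HANDLERS: Dict[str, Callable[..., Any]] = {"classify_domain": _classify_stub}


def execute_tool(name: str, tool_input: Dict[str, Any]) -> str:
    """Run the named tool and return a JSON string suitable for a tool result."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report failures to the agent instead of raising, so the agent
        # can fall back to asking the user.
        return json.dumps({"error": str(exc)})
```

Returning errors as data rather than raising fits rule 1 of the system prompt, where a failing `classify_domain()` leads to the clarifying question instead of a crash.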

#### CMA Decision Flow (per user turn)

```
User message received
        │
        ▼  ── PRESIDIO (API layer, before agent) ──────────────
PII redacted locally → redacted_text forwarded to Claude
        │
        ▼
Is domain known? ──No──► call classify_domain() ──► store in memory
        │
       Yes
        │
        ▼
Are minimum fields complete? ──No──► ask ONE follow-up question
(provider, date, amount, ref)
        │
       Yes
        │
        ▼
Was a document uploaded? ──Yes──► call process_document() ──► merge entities
        │
        No
        │
        ▼  ── HITL GATE ────────────────────────────────────────
Present extracted details summary → ask user to confirm
        │
Wait for [USER CONFIRMED] message (from /validate-entities endpoint)
        │
        ▼
Has user confirmed entities? ──Yes──► call draft_complaint() ──► show draft
        │
        ▼
Has user asked next steps? ──Yes──► call recommend_action() ──► show escalation
```
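
The flow above can be read as a small decision function over session state. The sketch below is one way to express it, not the agent's actual logic (which lives in the system prompt and Claude's tool choices). All state keys (`fields`, `pending_documents`, `user_confirmed`, `draft`, `asked_next_steps`) and the returned step names are hypothetical.

```python
from typing import Any, Dict

# The "minimum fields" named in the flow diagram.
REQUIRED_FIELDS = ("provider", "date", "amount", "ref")


def next_step(state: Dict[str, Any]) -> str:
    """Return the next action for this turn, checking the gates in flow order."""
    if not state.get("domain"):
        return "classify_domain"
    fields = state.get("fields", {})
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        # Ask about exactly one missing field per turn.
        return f"ask_follow_up:{missing[0]}"
    if state.get("pending_documents"):
        return "process_document"
    if not state.get("user_confirmed"):
        # HITL gate: never draft before the user confirms the entities.
        return "present_summary_and_wait"
    if not state.get("draft"):
        return "draft_complaint"
    if state.get("asked_next_steps"):
        return "recommend_action"
    return "idle"
```

Checking the gates in a fixed order means that drafting is unreachable until the domain, the minimum fields, any uploaded documents, and the user confirmation have all been handled.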

---

### 2.3 Human-in-the-Loop (HITL) Validation

After all required fields are collected and before draft generation, the system
pauses and requires explicit user confirmation of the extracted entities.

| Step | Component | Description |
|------|-----------|-------------|
| 1    | CMA       | Presents extracted entities as a numbered summary in chat |
| 2    | Frontend  | Populates the **Verify Entities** tab with pre-filled editable fields |
| 3    | User      | Reviews, edits any incorrect value, and clicks "Confirm & Generate Draft" |
| 4    | API       | `POST /api/session/{id}/validate-entities` sends verified entities to CMA |
| 5    | CMA       | Receives `[USER CONFIRMED]` message and calls `draft_complaint()` |

**Why HITL?**

PII redaction replaces some values with placeholders (e.g., a name becomes
`<PERSON>`). The HITL step lets the user supply the correct readable value
(e.g., the actual name rather than just `<PERSON>`) that will appear in the final
draft, and correct any entity the models extracted wrongly. This improves both
accuracy and trust in the generated complaint.
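
Steps 3 to 5 amount to merging the user's edits over the extracted values and handing the result to the agent. The sketch below shows one possible shape for that hand-off. The function name and the exact layout of the message body are assumptions; the spec only states that the agent waits for a message marked `[USER CONFIRMED]`.

```python
from typing import Dict


def build_confirmation_message(extracted: Dict[str, str], user_edits: Dict[str, str]) -> str:
    """Merge non-empty user edits over extracted entities and format the HITL message."""
    # Blank edits are ignored so an untouched form field keeps the extracted value.
    edits = {key: value for key, value in user_edits.items() if value.strip()}
    confirmed = {**extracted, **edits}
    lines = [f"{i}. {key}: {value}" for i, (key, value) in enumerate(confirmed.items(), 1)]
    return "[USER CONFIRMED]\n" + "\n".join(lines)
```

Because the user's values win on conflict, whatever appears in the draft is what the user last saw and approved in the **Verify Entities** tab.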

---

### 2.4 Document Processor

| Feature      | Implementation |
|--------------|----------------|
| PDF parsing  | `pdfplumber` (text-native PDFs) |
| Image OCR    | `pytesseract` + `Pillow` (pre-process) |
| ViT pass     | `google/vit-base-patch16-224` fine-tuned on receipt/bill images |
| Pre-process  | Greyscale → adaptive threshold → deskew |
| Output       | Clean extracted text + NER entities |
| Formats      | PDF, PNG, JPG, JPEG, WEBP |

---

### 2.5 FastAPI Backend

| Endpoint                              | Method | Description                            |
|---------------------------------------|--------|----------------------------------------|
| `/api/health`                         | GET    | Health check (all components)          |
| `/api/session/create`                 | POST   | Create new CMA session                 |
| `/api/session/{id}/message`           | POST   | Send message → Presidio redact → agent |
| `/api/session/{id}/upload`            | POST   | Upload a document to session           |
| `/api/session/{id}/validate-entities` | POST   | HITL: submit user-confirmed entities   |
| `/api/session/{id}/history`           | GET    | Retrieve conversation history          |
| `/api/classify`                       | POST   | Direct DL classification (debug)       |
| `/api/extract`                        | POST   | Direct NER extraction (debug)          |

---

### 2.6 Gradio Frontend

Tabs:

1. **Chat**: Conversational interface; shows a privacy badge when PII is redacted
2. **Verify Entities**: HITL panel with editable entity fields + "Confirm & Generate Draft"
3. **Documents**: Drag-and-drop upload; shows extracted entities
4. **Complaint Draft**: Rendered complaint with copy/download
5. **Escalation Guide**: Recommended authorities with portal links
6. **About**: Architecture diagram, model cards, tech stack

---

## 3. Technology Stack

| Layer                | Technology                      | Reason |
|----------------------|---------------------------------|--------|
| Launcher             | `start.py` (stdlib only)        | Single script: trains all models, then starts servers |
| Privacy              | Microsoft Presidio + spaCy      | Local PII redaction, no cloud call |
| DL Models            | HuggingFace Transformers        | Industry standard for NLP + ViT |
| Classifier data      | CFPB dataset (Kaggle, one-time) | 3M+ real complaints, public license |
| NER data             | Synthetic in-memory             | Template-generated; no download required |
| NextAction data      | Synthetic in-memory             | Generated from domain priors; no download |
| Agent                | Anthropic CMA (default `claude-sonnet-4-6`, set via `GUIDE_MODEL`) | Stateful, tool-using agent |
| Backend              | FastAPI + Uvicorn               | Async, fast, OpenAPI auto-docs |
| Frontend             | Gradio 4.x                      | ML-native UI, file upload, chat |
| OCR                  | pytesseract + pdfplumber        | Proven, open-source |
| ViT doc model        | HuggingFace ViT                 | Image-based evidence extraction |
| Env                  | Python 3.10+                    | Required by CMA SDK |
| Config               | python-dotenv                   | Secure API key management |
| Notebooks (optional) | Jupyter                         | EDA and demo only; not required to run system |

---

## 4. Data Flow: End to End

```
User types: "Rahul Sharma - Flipkart hasn't refunded ₹4,299 for order OD-123
             cancelled 3 weeks ago. My phone is 9876543210."
        │
        ▼  ── LOCAL ONLY ──────────────────────────────────────────
Presidio PIIRedactor detects: PERSON("Rahul Sharma"), PHONE_NUMBER("9876543210")
Redacted: "<PERSON> - Flipkart hasn't refunded ₹4,299 for order OD-123
           cancelled 3 weeks ago. My phone is <PHONE_NUMBER>."
        │
        ▼  ── EXTERNAL API ────────────────────────────────────────
FastAPI → CMA session with redacted text
        │
Claude CMA agent processes redacted message
        │
        ├──► classify_domain(...)
        │        └──► DomainClassifier → {domain: "ecommerce", conf: 0.97}
        │
        ├──► extract_entities(...)
        │        └──► EvidenceNER → [ORG:"Flipkart", AMOUNT:"₹4,299",
        │                            REF_ID:"OD-123", DATE:"3 weeks ago"]
        │
        └──► HITL gate: "I have extracted the following. Please confirm:
                           - Company: Flipkart
                           - Amount: ₹4,299
                           - Order ID: OD-123
                           - Date: 3 weeks ago
                         Is this correct?"
        │
User reviews in "Verify Entities" tab → edits if needed → clicks Confirm
        │
        ▼
POST /validate-entities → [USER CONFIRMED] → draft_complaint()
        │
ComplaintDraft generated and shown in Draft tab
        │
recommend_action(domain="ecommerce") → [NCH, Consumer Forum]
        │
Gradio renders: Draft · Evidence table · Escalation panel
```

---

## 5. Project Directory Structure

```
Project_ResolveAI/
│
├── start.py                    ← SINGLE ENTRY POINT: trains all models, then starts servers
│
├── docs/
│   ├── abstract.md             ← project abstract (G.U.I.D.E.)
│   └── architecture.md         ← this file (spec)
│
├── src/
│   ├── __init__.py
│   ├── privacy/                # Presidio PII redaction (runs before any external call)
│   │   ├── __init__.py
│   │   └── redactor.py         ← PIIRedactor singleton
│   │
│   ├── classifier/             # DL Domain Classifier (DistilBERT)
│   │   ├── model.py            ← DomainClassifier, CFPB_PRODUCT_MAP, LABEL2ID
│   │   ├── dataset.py          ← load_cfpb_csv(), clean_complaint_text(), ComplaintDataset
│   │   ├── train.py            ← CLI: --cfpb_csv, --output_dir, --epochs, --batch_size
│   │   └── predict.py          ← DomainPredictor singleton, classify_domain()
│   │
│   ├── ner/                    # DL NER model (DistilBERT token classifier)
│   │   ├── model.py            ← EvidenceNER, NER_LABELS, NER_LABEL2ID
│   │   ├── train.py            ← Generates synthetic data in-memory; CLI: --output_dir
│   │   └── predict.py          ← NERPredictor singleton, extract_entities()
│   │
│   ├── next_action/            # Next-action MLP predictor
│   │   ├── model.py            ← NextActionMLP, DOMAIN_ACTION_PRIORS, build_feature_vector()
│   │   ├── train.py            ← Generates synthetic features; CLI: --output_dir, --epochs
│   │   └── predict.py          ← NextActionPredictor singleton (MLP or rule-based fallback)
│   │
│   ├── document_processor/     # OCR + PDF parsing + ViT
│   │   ├── ocr.py              ← Tesseract + pdfplumber pipeline
│   │   └── vit_extractor.py    ← ViT-based image evidence extraction
│   │
│   ├── agent/                  # Claude CMA integration
│   │   ├── tools.py            ← Tool definitions (JSON Schema) + execute_tool()
│   │   ├── prompts.py          ← SYSTEM_PROMPT (HITL rule 4, privacy context)
│   │   └── session.py          ← AgentManager singleton, send_message()
│   │
│   └── api/                    # FastAPI application
│       ├── main.py             ← Lifespan: Presidio → DL models → CMA agent
│       ├── routes.py           ← /message (Presidio→agent), /validate-entities (HITL)
│       └── schemas.py          ← Pydantic models incl. HITL + pii_redacted fields
│
├── ui/
│   └── app.py                  ← Gradio: Chat · Verify · Docs · Draft · Escalation · About
│
├── notebooks/                  # OPTIONAL: EDA and interactive demos only
│   ├── 01_data_exploration.ipynb   ← Explore CFPB dataset, save processed CSV
│   ├── 02_classifier_training.ipynb
│   ├── 04_cma_agent_demo.ipynb
│   └── 05_end_to_end_demo.ipynb
│
├── data/
│   ├── raw/                    ← Place CFPB complaints.csv here (one-time download)
│   ├── processed/              ← Output of EDA notebook; not required for training
│   └── sample_complaints/      ← Synthetic domain-specific CSVs for augmentation
│
├── models/                     ← Created by training; populated by start.py
│   ├── domain_classifier/      ← best_model.pt + tokenizer files
│   ├── evidence_ner/           ← best_model.pt + tokenizer files
│   ├── document_vit/           ← ViT checkpoint
│   └── next_action/            ← best_model.pt
│
├── CLAUDE.md                   ← Guidance for Claude Code
├── requirements.txt            ← presidio-analyzer, presidio-anonymizer, spacy, torch, etc.
├── .env.example                ← Template: copy to .env and add ANTHROPIC_API_KEY
├── .gitignore
└── README.md
```

---

## 6. Setup and Running

### Step 1: Get an Anthropic API key

1. Go to https://console.anthropic.com → Sign up (free tier available)
2. Navigate to **API Keys** → **Create Key**
3. Copy the key (shown only once)
4. In the project root, copy the template and fill in your key:

```
cp .env.example .env
# then edit .env:
ANTHROPIC_API_KEY=sk-ant-...
```

Never commit `.env` to git (it is listed in `.gitignore`).

### Step 2: Install dependencies

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_lg # Presidio NLP model (local, ~750 MB)
```

### Step 3: Download CFPB data (first run only)

The DomainClassifier requires the CFPB Consumer Complaint Database:

- Download from Kaggle: `consumer-complaint-database` dataset
- Save the CSV to `data/raw/complaints.csv`
- Size: ~600 MB; one-time download; not committed to git

This is only needed to train the classifier. The NER and NextAction models generate their training data in-memory automatically.

### Step 4: Run (single command)

```bash
# First run: trains all models, then starts servers
python start.py --cfpb_csv data/raw/complaints.csv
# After first run: models already trained, skip training
python start.py --no-train
# Force retrain everything:
python start.py --cfpb_csv data/raw/complaints.csv --train
# Train only (no servers):
python start.py --cfpb_csv data/raw/complaints.csv --train-only
```

When running, `start.py` will:

1. Validate `.env` and `ANTHROPIC_API_KEY`
2. Train **DomainClassifier** (~30 min CPU / ~5 min GPU T4); skipped if checkpoint exists
3. Train **EvidenceNER** (~10 min CPU); skipped if checkpoint exists
4. Train **NextActionMLP** (< 30 sec CPU); skipped if checkpoint exists
5. Start **FastAPI** at `http://localhost:8000` (Swagger docs at `/docs`)
6. Start **Gradio UI** at `http://localhost:7860`

Both servers print to the same terminal, prefixed with `[API]` or `[UI]`. Press **Ctrl+C** to stop everything cleanly.

---

## 7. Development Phases

| Phase | Deliverable                                     | Status      |
|-------|-------------------------------------------------|-------------|
| 0     | Abstract submitted (G.U.I.D.E.)                 | Done        |
| 1     | Architecture + spec (`docs/architecture.md`)    | Done        |
| 2     | Project scaffold + environment setup            | Done        |
| 3     | Presidio PII redaction layer (`src/privacy/`)   | Done        |
| 4     | DL: DomainClassifier (model, dataset, train)    | Done        |
| 5     | DL: EvidenceNER + NextActionMLP (model, train)  | Done        |
| 6     | DL: ViT document extractor (fine-tuning)        | In progress |
| 7     | CMA agent + tools integration                   | Done        |
| 8     | Document processor (OCR + ViT pipeline)         | Done        |
| 9     | FastAPI backend (HITL endpoint, schemas)        | Done        |
| 10    | Gradio UI (Verify tab, privacy badge, HITL)     | Done        |
| 11    | Single launcher (`start.py`) + CLAUDE.md        | Done        |
| 12    | Integration testing + demo notebooks            | Remaining   |
| 13    | Final report + presentation                     | Remaining   |