G.U.I.D.E.: Technical Architecture & Specification
1. Overview
G.U.I.D.E. (Grievance Utility for Information Extraction, Drafting and Enrichment) is a four-layer, spec-driven system built for consumer complaint resolution. Every component has a clear contract; layers communicate through defined interfaces so each piece can be tested and replaced independently.
┌────────────────────────────────────────────────────────────────────┐
│                          GRADIO FRONTEND                           │
│  Chat · Verify (HITL) · Documents · Draft · Escalation · About     │
└──────────────────────────┬─────────────────────────────────────────┘
                           │ HTTP (REST)
┌──────────────────────────▼─────────────────────────────────────────┐
│                          FASTAPI BACKEND                           │
│  /api/session   /api/message   /api/upload                         │
│  /api/session/{id}/validate-entities   /api/status                 │
└──┬──────────────────┬──────────────────────────────────────────────┘
   │                  │
   │ Step 1 (local)   │ Step 2 (external API, after redaction)
   │                  │
┌──▼──────────────┐  ┌▼──────────────────────────────────────────────┐
│ PRESIDIO        │  │ CLAUDE MANAGED AGENT (CMA)                    │
│ PIIRedactor     │  │                                               │
│ (runs locally)  │  │ Tools:                                        │
│                 │  │ - classify_domain()  ──► DomainClassifier     │
│ Redacts:        │  │ - extract_entities() ──► EvidenceNER          │
│   PERSON        │  │ - process_document() ──► OCR / ViT + NER      │
│   PHONE_NUMBER  │  │ - draft_complaint()  ──► Claude (internal)    │
│   EMAIL_ADDRESS │  │ - recommend_action() ──► NextActionPredictor  │
│   CREDIT_CARD   │  │                                               │
│   IN_AADHAAR    │  │ HITL gate: agent pauses before drafting       │
│   IN_PAN        │  │ and requests user confirmation of entities    │
│   IBAN_CODE     │  │                                               │
│   ...           │  │ Memory: per-user session state                │
└─────────────────┘  └────────────┬──────────────────────────────────┘
                                  │
                     ┌────────────▼─────────────────────────────────┐
                     │             DEEP LEARNING LAYER              │
                     │                                              │
                     │  1. DomainClassifier    (DistilBERT)         │
                     │  2. EvidenceNER         (DistilBERT tokens)  │
                     │  3. DocumentViT         (ViT image encoder)  │
                     │  4. NextActionPredictor (MLP)                │
                     └────────────┬─────────────────────────────────┘
                                  │
                     ┌────────────▼───────────────┐
                     │     DOCUMENT PROCESSOR     │
                     │  Tesseract OCR             │
                     │  pdfplumber (PDF parse)    │
                     │  PIL (image pre-process)   │
                     │  ViT (image understanding) │
                     └────────────────────────────┘
2. Component Specifications
2.0 Privacy Preprocessing: Microsoft Presidio
This layer runs locally before any message is forwarded to Claude or any external service. It is the first step in the pipeline.
| Attribute | Value |
|---|---|
| Library | presidio-analyzer + presidio-anonymizer |
| NLP engine | spaCy en_core_web_lg (local, no network call) |
| Trigger | Every /api/session/{id}/message call |
| Entity types | PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, IBAN_CODE, US_BANK_NUMBER, IN_AADHAAR, IN_PAN, IN_VEHICLE_REGISTRATION |
| Replacement | Each detected span → <ENTITY_TYPE> placeholder |
| Output | RedactionResult(redacted_text, pii_types_found, pii_redacted) |
| Failure mode | On any error, original text is returned unchanged (fail-open so pipeline is never blocked) |
Why local?
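The user's name, account number, and Aadhaar UID should not leave the host
in plaintext. Running Presidio in-process (same Python server) means redaction
happens before any bytes are written to the Anthropic API. The one exception is
the fail-open path listed in the table above: if Presidio raises an error, the
original text is forwarded unredacted.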
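A minimal sketch of the replace-and-fail-open behaviour described above. It does not call Presidio: `regex_detector` is a hypothetical stand-in that only finds emails and 10-digit phone numbers, and the `redact` signature is an assumption, not the actual `PIIRedactor` API. Only the `RedactionResult` field names come from the spec. Overlapping spans are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (start, end, entity_type); a real deployment would get these spans from Presidio.
Span = Tuple[int, int, str]


@dataclass
class RedactionResult:
    redacted_text: str
    pii_types_found: List[str] = field(default_factory=list)
    pii_redacted: bool = False


def regex_detector(text: str) -> List[Span]:
    """Hypothetical stand-in detector: emails and 10-digit phone numbers only."""
    patterns = {
        "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+\.[\w.-]+",
        "PHONE_NUMBER": r"\b\d{10}\b",
    }
    spans: List[Span] = []
    for entity_type, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), entity_type))
    return spans


def redact(text: str, detect: Callable[[str], List[Span]] = regex_detector) -> RedactionResult:
    """Replace each detected span with <ENTITY_TYPE>; fail open on any error."""
    try:
        # Work right-to-left so earlier offsets stay valid after each replacement.
        spans = sorted(detect(text), key=lambda s: s[0], reverse=True)
        redacted = text
        for start, end, entity_type in spans:
            redacted = redacted[:start] + f"<{entity_type}>" + redacted[end:]
        types = sorted({s[2] for s in spans})
        return RedactionResult(redacted, types, bool(spans))
    except Exception:
        # Fail-open, as specified: the original text is returned unchanged.
        return RedactionResult(text, [], False)


print(redact("My phone is 9876543210.").redacted_text)
```

Injecting the detector as a callable keeps the fail-open wrapper testable without loading the spaCy model.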
What the DL models receive:
The local DL models (DomainClassifier, EvidenceNER) are called by Claude through
tool calls, so they also receive redacted text. The EvidenceNER model is
still effective because it targets structural patterns (amounts, dates, reference
IDs) that are not redacted by Presidio.
2.1 Deep Learning Layer
2.1.1 DomainClassifier
| Attribute | Value |
|---|---|
| Architecture | distilbert-base-uncased + linear classification head |
| Task | Multi-class text classification |
| Classes (6) | ecommerce, telecom, banking, cibil, insurance, general |
| Training data | CFPB Consumer Complaint Database (3M+ rows); one-time download from Kaggle. Save as data/raw/complaints.csv. |
| Script | python -m src.classifier.train --cfpb_csv data/raw/complaints.csv --output_dir models/domain_classifier |
| Input | Redacted complaint text (string, max 512 tokens) |
| Output | DomainResult(domain: str, confidence: float, all_probs: dict, low_confidence: bool) |
| Confidence threshold | 0.50; results below this set low_confidence=True |
| Low-confidence path | Agent asks user one clarifying domain question; does not proceed until user confirms |
| Keyword fallback | Used when no checkpoint exists; always returns confidence=0.0, low_confidence=True |
| Fine-tune time | ~30 min CPU / ~5 min GPU (T4) |
| Library | HuggingFace transformers + datasets |
Why DistilBERT?
DistilBERT is 40% smaller and 60% faster than BERT-base while retaining about 97%
of its language-understanding performance (figures reported by the DistilBERT
authors). For a project with limited compute, that makes it a practical starting point.
The CFPB dataset maps naturally to our 6 classes after label remapping.
Low-confidence handling:
general is the intentional catch-all class; the model never returns an error, only a
domain + probability. However, a low probability on all classes (e.g., the complaint text
is too short or ambiguous) means the winning domain is unreliable. When confidence < 0.50
the low_confidence flag is set and the CMA agent pauses to ask the user one clarifying
question ("Is this about e-commerce, telecom, banking, credit score, insurance, or other?")
before continuing. The user's answer overrides the model's suggestion and is stored with
domain_source = "user_confirmed" so later tools know the domain is authoritative.
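The threshold and override logic above can be sketched as follows. This is an illustration under assumptions, not the project's `predict.py`: the helper names `to_domain_result` and `resolve_domain` are hypothetical, as are the `domain_source` values `"model"` and `"pending_clarification"`. Only `"user_confirmed"`, the six class names, the 0.50 threshold, and the `DomainResult` fields come from the spec.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]
CONFIDENCE_THRESHOLD = 0.50


@dataclass
class DomainResult:
    domain: str
    confidence: float
    all_probs: Dict[str, float]
    low_confidence: bool


def to_domain_result(logits: List[float]) -> DomainResult:
    """Softmax the six class logits and flag results below the threshold."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = {d: e / total for d, e in zip(DOMAINS, exps)}
    domain = max(probs, key=probs.get)
    confidence = probs[domain]
    return DomainResult(domain, confidence, probs, confidence < CONFIDENCE_THRESHOLD)


def resolve_domain(result: DomainResult, user_answer: Optional[str] = None) -> Dict[str, str]:
    """Pick the domain to store; an explicit user answer always wins."""
    if user_answer is not None:
        return {"domain": user_answer, "domain_source": "user_confirmed"}
    if result.low_confidence:
        # Hypothetical marker: the agent must ask its clarifying question first.
        return {"domain": "", "domain_source": "pending_clarification"}
    return {"domain": result.domain, "domain_source": "model"}
```

A flat logit vector such as `[0.0] * 6` gives every class a probability of 1/6, so the result is flagged low-confidence even though a "winning" domain is still reported.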
2.1.2 EvidenceNER
| Attribute | Value |
|---|---|
| Architecture | distilbert-base-uncased with token classification head |
| Task | Named Entity Recognition (NER) on complaint text |
| Entity types | ORG, AMOUNT, DATE, REF_ID, ACCOUNT, PERSON |
| Training | ~4,000 synthetic complaint sentences generated in-memory by src/ner/train.py (no download needed). Optionally augmented with CoNLL-2003 via HuggingFace if internet is available (maps PER→PERSON, ORG→ORG; discards LOC/MISC). |
| Script | python -m src.ner.train --output_dir models/evidence_ner |
| Input | Redacted text (from user or OCR) |
| Output | List of {text, label, start, end, confidence} spans |
Entities and their use in drafting:
| Entity | Example | Used for |
|---|---|---|
| ORG | "Flipkart", "HDFC Bank" | Complaint addressee |
| AMOUNT | "₹4,299", "Rs. 1,200" | Financial loss quantified |
| DATE | "12 March 2024", "last Tuesday" | Incident timeline |
| REF_ID | "Order #OD-2930291", "TXN123" | Evidence reference |
| ACCOUNT | "XXXX-1234", "loan account" | Dispute target |
| PERSON | "customer care executive" | Named witness/contact |
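One way the NER spans could feed the drafting step is to keep the highest-confidence span per label and map it to a draft field. The sketch below assumes the span schema from the output row above; the field names in `LABEL_TO_FIELD` and the function `best_span_per_field` are hypothetical and chosen only to mirror the "Used for" column.

```python
from typing import Dict, List

# Hypothetical draft-field names, one per entity label in the table above.
LABEL_TO_FIELD = {
    "ORG": "addressee",
    "AMOUNT": "financial_loss",
    "DATE": "incident_date",
    "REF_ID": "evidence_reference",
    "ACCOUNT": "dispute_target",
    "PERSON": "contact",
}


def best_span_per_field(spans: List[dict]) -> Dict[str, str]:
    """Keep the highest-confidence span text for each known label."""
    best: Dict[str, dict] = {}
    for span in spans:
        field_name = LABEL_TO_FIELD.get(span["label"])
        if field_name is None:
            continue  # ignore labels outside the six entity types
        current = best.get(field_name)
        if current is None or span["confidence"] > current["confidence"]:
            best[field_name] = span
    return {name: span["text"] for name, span in best.items()}


spans = [
    {"text": "Flipkart", "label": "ORG", "start": 0, "end": 8, "confidence": 0.98},
    {"text": "Flip", "label": "ORG", "start": 0, "end": 4, "confidence": 0.41},
    {"text": "OD-123", "label": "REF_ID", "start": 40, "end": 46, "confidence": 0.93},
]
print(best_span_per_field(spans))
```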
2.1.3 DocumentViT
| Attribute | Value |
|---|---|
| Architecture | Vision Transformer (google/vit-base-patch16-224 fine-tuned) |
| Task | Structured evidence extraction from document images |
| Input | Scanned receipt / bill / screenshot (PIL Image) |
| Output | List of {text, label, confidence} spans (same schema as NER) |
| When used | After OCR; ViT runs as a complementary pass on image-type docs |
| Library | HuggingFace transformers (ViTForImageClassification + custom head) |
Why ViT alongside OCR?
Tesseract OCR excels at clean printed text but struggles with handwriting, logos,
and table structures. The ViT model, fine-tuned on receipt and bill images, directly
classifies image regions and extracts amount/date/provider fields, which is especially
useful for blurry screenshots and poorly-scanned documents.
2.1.4 NextActionPredictor
| Attribute | Value |
|---|---|
| Architecture | 2-hidden-layer MLP (12-dim input → 64 → 64 → 6) |
| Input | 12-dim feature vector: domain one-hot (6) + entity flags (5) + prior_contact (1) |
| Output | Ranked list of {action, authority, url, confidence} |
| Actions | 6 classes: company_support, nch, trai, rbi_ombudsman, irdai, legal |
| Training data | ~6,000 synthetic (domain, entity_flags, prior_contact → action) examples generated in-memory from DOMAIN_ACTION_PRIORS; no download needed. Trains in < 30 seconds on CPU. |
| Script | python -m src.next_action.train --output_dir models/next_action |
| Fallback | If no checkpoint exists, DOMAIN_ACTION_PRIORS rule-based mapping is used so the pipeline always works. |
Escalation routing logic:
| Domain | Primary Authority | Secondary |
|---|---|---|
| E-commerce | Company support → NCH | Consumer Forum |
| Telecom | Company support → TRAI | NCH |
| Banking | Company support → RBI BO | Banking Ombudsman |
| CIBIL | Bureau direct → RBI BO | SEBI (if investment) |
| Insurance | Company support → IRDAI | Insurance Ombudsman |
| General | Company support → NCH | Consumer Forum |
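The 12-dimensional input and the rule-based fallback can be sketched as below. Several details are assumptions: the spec does not say which five entity types become flags (this sketch drops PERSON), the `build_feature_vector` signature is hypothetical, and mapping "Bureau direct" to the `company_support` class is a guess. The priors are transcribed from the Primary Authority column only.

```python
from typing import Dict, List, Set

DOMAINS = ["ecommerce", "telecom", "banking", "cibil", "insurance", "general"]
# Assumption: the spec gives five flags but does not name them.
ENTITY_FLAGS = ["ORG", "AMOUNT", "DATE", "REF_ID", "ACCOUNT"]

# Fallback ordering per domain, taken from the routing table's primary column.
DOMAIN_ACTION_PRIORS: Dict[str, List[str]] = {
    "ecommerce": ["company_support", "nch"],
    "telecom": ["company_support", "trai"],
    "banking": ["company_support", "rbi_ombudsman"],
    "cibil": ["company_support", "rbi_ombudsman"],  # "Bureau direct" mapped by assumption
    "insurance": ["company_support", "irdai"],
    "general": ["company_support", "nch"],
}


def build_feature_vector(domain: str, entities_present: Set[str], prior_contact: bool) -> List[float]:
    """Domain one-hot (6) + entity flags (5) + prior_contact (1) = 12 dims."""
    one_hot = [1.0 if d == domain else 0.0 for d in DOMAINS]
    flags = [1.0 if e in entities_present else 0.0 for e in ENTITY_FLAGS]
    return one_hot + flags + [1.0 if prior_contact else 0.0]


def fallback_actions(domain: str) -> List[str]:
    """Rule-based ranking used when no MLP checkpoint exists."""
    return DOMAIN_ACTION_PRIORS.get(domain, DOMAIN_ACTION_PRIORS["general"])


vec = build_feature_vector("banking", {"ORG", "AMOUNT"}, prior_contact=True)
print(len(vec), fallback_actions("banking"))
```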
2.2 Claude Managed Agent (CMA)
The CMA is the orchestration layer. It maintains per-user session state (conversation history, extracted entities, uploaded docs, draft versions) and decides at each turn which tool to invoke.
Key constraint: Claude only ever sees redacted text (PII replaced with
<ENTITY_TYPE> placeholders by Presidio before the API call). This is documented
in the system prompt so Claude knows not to try to recover original values.
Agent System Prompt Summary
You are G.U.I.D.E., an expert consumer complaint assistant.
PII has already been redacted locally; work with placeholders as-is.
Rules:
1. Always classify the domain first using classify_domain().
   • If low_confidence=false (≥ 0.50): store domain and proceed.
   • If low_confidence=true (< 0.50 or keyword fallback): ask the user ONE
     clarifying question ("Is this about e-commerce, telecom, banking, credit
     score, insurance, or other?") before continuing. Store domain_source=
     "user_confirmed" when the domain comes from the user.
   • If classify_domain() errors: same clarifying question as above.
2. Ask ONE targeted follow-up question at a time if information is missing.
3. If documents are uploaded, always run process_document before drafting.
4. HITL gate: Before calling draft_complaint, present extracted details
as a numbered summary and ask the user to confirm them. Wait for
[USER CONFIRMED] message before proceeding.
5. Never draft until domain, provider, date, amount, prior contact, and
desired resolution are all known AND user-confirmed.
6. Generate drafts in formal English: Subject / To / Body / From.
7. Always recommend the next escalation step with specific portal URLs.
Tool Specifications
| Tool | Input | Output | Calls |
|---|---|---|---|
| classify_domain | complaint_text: str | DomainResult | DL DomainClassifier |
| extract_entities | text: str | List[Entity] | DL EvidenceNER |
| process_document | file_path: str | {raw_text, entities} | OCR + ViT + EvidenceNER |
| draft_complaint | complaint_context: dict | ComplaintDraft | Claude (internal) |
| recommend_action | domain: str, entities: dict | List[EscalationAction] | DL NextAction |
| store_memory | key: str, value: any | None | CMA Memory Store |
| get_memory | key: str | any | CMA Memory Store |
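As an illustration of how one row of the tool table might be declared and dispatched, the sketch below assumes the tool definitions follow the `name` / `description` / `input_schema` shape used by Anthropic's Messages API. The handler is a stub that mimics the keyword-fallback result, and the body of `execute_tool` is an assumption; only the function name appears in the spec.

```python
import json
from typing import Any, Callable, Dict

# JSON Schema tool definition for the classify_domain row of the table.
CLASSIFY_DOMAIN_TOOL: Dict[str, Any] = {
    "name": "classify_domain",
    "description": "Classify redacted complaint text into one of six domains.",
    "input_schema": {
        "type": "object",
        "properties": {"complaint_text": {"type": "string"}},
        "required": ["complaint_text"],
    },
}


def _classify_domain_stub(complaint_text: str) -> Dict[str, Any]:
    """Placeholder handler; the real tool would call the DL DomainClassifier."""
    return {"domain": "general", "confidence": 0.0, "low_confidence": True}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "classify_domain": _classify_domain_stub,
}


def execute_tool(name: str, tool_input: Dict[str, Any]) -> str:
    """Dispatch a tool call and return a JSON string for the tool result."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report the failure to the agent instead of raising into the server.
        return json.dumps({"error": str(exc)})


print(execute_tool("classify_domain", {"complaint_text": "<PERSON> was not refunded"}))
```

Returning errors as JSON lets the agent follow rule 1 of the system prompt (ask the clarifying question when `classify_domain()` errors) without the server crashing.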
CMA Decision Flow (per user turn)
User message received
│
▼ ── PRESIDIO (API layer, before agent) ──────────────
PII redacted locally → redacted_text forwarded to Claude
│
▼
Is domain known? ──No──► call classify_domain() ──► store in memory
│
Yes
│
▼
Are minimum fields complete? ──No──► ask ONE follow-up question
(provider, date, amount, ref)
│
Yes
│
▼
Was a document uploaded? ──Yes──► call process_document() ──► merge entities
│
No
│
▼ ── HITL GATE ────────────────────────────────────────
Present extracted details summary → ask user to confirm
│
Wait for [USER CONFIRMED] message (from /validate-entities endpoint)
│
▼
Has user confirmed entities? ──Yes──► call draft_complaint() ──► show draft
│
▼
Has user asked next steps? ──Yes──► call recommend_action() ──► show escalation
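The flow above can be read as a pure function from session state to the next step. The sketch below is one possible encoding, not the agent's actual logic (which lives in Claude's tool-use decisions): `SessionState`, its field names, and the returned step labels are all hypothetical. The required fields are the four listed in the flow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

REQUIRED_FIELDS = ["provider", "date", "amount", "ref"]


@dataclass
class SessionState:
    domain: Optional[str] = None
    fields: Dict[str, str] = field(default_factory=dict)
    pending_documents: List[str] = field(default_factory=list)
    summary_presented: bool = False
    user_confirmed: bool = False
    draft_shown: bool = False
    asked_next_steps: bool = False


def next_step(state: SessionState) -> str:
    """Walk the decision flow top to bottom and return the first unmet step."""
    if state.domain is None:
        return "classify_domain"
    missing = [name for name in REQUIRED_FIELDS if name not in state.fields]
    if missing:
        return f"ask_follow_up:{missing[0]}"  # ONE question at a time
    if state.pending_documents:
        return "process_document"
    if not state.user_confirmed:
        # HITL gate: never draft before the user confirms.
        return "wait_for_confirmation" if state.summary_presented else "present_summary"
    if not state.draft_shown:
        return "draft_complaint"
    if state.asked_next_steps:
        return "recommend_action"
    return "idle"


state = SessionState(domain="ecommerce", fields={"provider": "Flipkart", "date": "3 weeks ago"})
print(next_step(state))
```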
2.3 Human-in-the-Loop (HITL) Validation
After all required fields are collected and before draft generation, the system pauses and requires explicit user confirmation of the extracted entities.
| Step | Component | Description |
|---|---|---|
| 1 | CMA | Presents extracted entities as a numbered summary in chat |
| 2 | Frontend | Populates the Verify Entities tab with pre-filled editable fields |
| 3 | User | Reviews, edits any incorrect value, and clicks "Confirm & Generate Draft" |
| 4 | API | POST /api/session/{id}/validate-entities sends verified entities to CMA |
| 5 | CMA | Receives [USER CONFIRMED] message and calls draft_complaint() |
Why HITL?
PII redaction replaces some values with placeholders (e.g., a name becomes
<PERSON>). The HITL step lets the user supply the correct readable label
(e.g., "HDFC Bank" rather than just <ORG>) that will appear in the final draft,
improving both accuracy and trust in the generated complaint.
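A small sketch of steps 3 to 5: user edits override extracted values, and the result is wrapped in a `[USER CONFIRMED]` message for the agent. The function names and the exact message layout are assumptions; the spec only fixes the `[USER CONFIRMED]` marker and the numbered-summary style.

```python
from typing import Dict


def merge_verified(extracted: Dict[str, str], verified: Dict[str, str]) -> Dict[str, str]:
    """User-edited values win; blank edits leave the extracted value in place."""
    merged = dict(extracted)
    for name, value in verified.items():
        if value.strip():
            merged[name] = value.strip()
    return merged


def build_confirmation_message(entities: Dict[str, str]) -> str:
    """Format confirmed entities as a numbered list under the HITL marker."""
    lines = [f"{i}. {name}: {value}" for i, (name, value) in enumerate(entities.items(), start=1)]
    return "[USER CONFIRMED]\n" + "\n".join(lines)


extracted = {"company": "<ORG>", "amount": "₹4,299"}
verified = {"company": "HDFC Bank", "amount": " "}
print(build_confirmation_message(merge_verified(extracted, verified)))
```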
2.4 Document Processor
| Feature | Implementation |
|---|---|
| PDF parsing | pdfplumber (text-native PDFs) |
| Image OCR | pytesseract + Pillow (pre-process) |
| ViT pass | google/vit-base-patch16-224 fine-tuned on receipt/bill images |
| Pre-process | Greyscale → adaptive threshold → deskew |
| Output | Clean extracted text + NER entities |
| Formats | PDF, PNG, JPG, JPEG, WEBP |
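The table implies a routing decision by file type: text-native PDFs go to pdfplumber, images go through pre-processing, OCR, and the ViT pass. The sketch below shows only that routing; the step names and the function `plan_document_steps` are hypothetical labels, and no OCR library is called.

```python
from pathlib import Path
from typing import List

PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def plan_document_steps(file_path: str) -> List[str]:
    """Return the ordered processing steps for a supported upload."""
    suffix = Path(file_path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return ["pdfplumber_text", "evidence_ner"]
    if suffix in IMAGE_SUFFIXES:
        # Greyscale, threshold, and deskew happen before OCR; ViT runs after it.
        return ["preprocess", "tesseract_ocr", "vit_pass", "evidence_ner"]
    raise ValueError(f"unsupported format: {suffix or file_path}")


print(plan_document_steps("receipt.JPG"))
```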
2.5 FastAPI Backend
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Health check (all components) |
| /api/session/create | POST | Create new CMA session |
| /api/session/{id}/message | POST | Send message → Presidio redact → agent |
| /api/session/{id}/upload | POST | Upload a document to session |
| /api/session/{id}/validate-entities | POST | HITL: submit user-confirmed entities |
| /api/session/{id}/history | GET | Retrieve conversation history |
| /api/classify | POST | Direct DL classification (debug) |
| /api/extract | POST | Direct NER extraction (debug) |
2.6 Gradio Frontend
Tabs:
- Chat: Conversational interface; shows a privacy badge when PII is redacted
- Verify Entities: HITL panel with editable entity fields + "Confirm & Generate Draft"
- Documents: Drag-and-drop upload; shows extracted entities
- Complaint Draft: Rendered complaint with copy/download
- Escalation Guide: Recommended authorities with portal links
- About: Architecture diagram, model cards, tech stack
3. Technology Stack
| Layer | Technology | Reason |
|---|---|---|
| Launcher | start.py (stdlib only) | Single script: trains all models then starts servers |
| Privacy | Microsoft Presidio + spaCy | Local PII redaction, no cloud call |
| DL Models | HuggingFace Transformers | Industry standard for NLP + ViT |
| Classifier data | CFPB dataset (Kaggle, one-time) | 3M+ real complaints, public license |
| NER data | Synthetic in-memory | Template-generated; no download required |
| NextAction data | Synthetic in-memory | Generated from domain priors; no download |
| Agent | Anthropic CMA (default claude-sonnet-4-6, set via GUIDE_MODEL) | Stateful, tool-using agent |
| Backend | FastAPI + Uvicorn | Async, fast, OpenAPI auto-docs |
| Frontend | Gradio 4.x | ML-native UI, file upload, chat |
| OCR | pytesseract + pdfplumber | Proven, open-source |
| ViT doc model | HuggingFace ViT | Image-based evidence extraction |
| Env | Python 3.10+ | Required by CMA SDK |
| Config | python-dotenv | Secure API key management |
| Notebooks (optional) | Jupyter | EDA and demo only; not required to run system |
4. Data Flow: End to End
User types: "Rahul Sharma - Flipkart hasn't refunded ₹4,299 for order OD-123
cancelled 3 weeks ago. My phone is 9876543210."
│
▼ ── LOCAL ONLY ─────────────────────────────────────────────
Presidio PIIRedactor detects: PERSON("Rahul Sharma"), PHONE("9876543210")
Redacted: "<PERSON> - Flipkart hasn't refunded ₹4,299 for order OD-123
cancelled 3 weeks ago. My phone is <PHONE_NUMBER>."
│
▼ ── EXTERNAL API ───────────────────────────────────────────
FastAPI → CMA session with redacted text
│
Claude CMA agent processes redacted message
│
├──► classify_domain(...)
│      └──► DomainClassifier → {domain: "ecommerce", conf: 0.97}
│
├──► extract_entities(...)
│      └──► EvidenceNER → [ORG:"Flipkart", AMOUNT:"₹4,299",
│                          REF_ID:"OD-123", DATE:"3 weeks ago"]
│
└──► HITL gate: "I have extracted the following, please confirm:
       - Company: Flipkart
       - Amount: ₹4,299
       - Order ID: OD-123
       - Date: 3 weeks ago
       Is this correct?"
│
User reviews in "Verify Entities" tab → edits if needed → clicks Confirm
│
▼
POST /validate-entities → [USER CONFIRMED] → draft_complaint()
│
ComplaintDraft generated and shown in Draft tab
│
recommend_action(domain="ecommerce") → [NCH, Consumer Forum]
│
Gradio renders: Draft · Evidence table · Escalation panel
5. Project Directory Structure
Project_ResolveAI/
│
├── start.py                      ← SINGLE ENTRY POINT: trains all models then starts servers
│
├── docs/
│   ├── abstract.md               ← project abstract (G.U.I.D.E.)
│   └── architecture.md           ← this file (spec)
│
├── src/
│   ├── __init__.py
│   ├── privacy/                  # Presidio PII redaction (runs before any external call)
│   │   ├── __init__.py
│   │   └── redactor.py           ← PIIRedactor singleton
│   │
│   ├── classifier/               # DL Domain Classifier (DistilBERT)
│   │   ├── model.py              ← DomainClassifier, CFPB_PRODUCT_MAP, LABEL2ID
│   │   ├── dataset.py            ← load_cfpb_csv(), clean_complaint_text(), ComplaintDataset
│   │   ├── train.py              ← CLI: --cfpb_csv, --output_dir, --epochs, --batch_size
│   │   └── predict.py            ← DomainPredictor singleton, classify_domain()
│   │
│   ├── ner/                      # DL NER model (DistilBERT token classifier)
│   │   ├── model.py              ← EvidenceNER, NER_LABELS, NER_LABEL2ID
│   │   ├── train.py              ← Generates synthetic data in-memory; CLI: --output_dir
│   │   └── predict.py            ← NERPredictor singleton, extract_entities()
│   │
│   ├── next_action/              # Next-action MLP predictor
│   │   ├── model.py              ← NextActionMLP, DOMAIN_ACTION_PRIORS, build_feature_vector()
│   │   ├── train.py              ← Generates synthetic features; CLI: --output_dir, --epochs
│   │   └── predict.py            ← NextActionPredictor singleton (MLP or rule-based fallback)
│   │
│   ├── document_processor/       # OCR + PDF parsing + ViT
│   │   ├── ocr.py                ← Tesseract + pdfplumber pipeline
│   │   └── vit_extractor.py      ← ViT-based image evidence extraction
│   │
│   ├── agent/                    # Claude CMA integration
│   │   ├── tools.py              ← Tool definitions (JSON Schema) + execute_tool()
│   │   ├── prompts.py            ← SYSTEM_PROMPT (HITL rule 6, privacy context)
│   │   └── session.py            ← AgentManager singleton, send_message()
│   │
│   └── api/                      # FastAPI application
│       ├── main.py               ← Lifespan: Presidio → DL models → CMA agent
│       ├── routes.py             ← /message (Presidio→agent), /validate-entities (HITL)
│       └── schemas.py            ← Pydantic models incl. HITL + pii_redacted fields
│
├── ui/
│   └── app.py                    ← Gradio: Chat · Verify · Docs · Draft · Escalation · About
│
├── notebooks/                    # OPTIONAL: EDA and interactive demos only
│   ├── 01_data_exploration.ipynb ← Explore CFPB dataset, save processed CSV
│   ├── 02_classifier_training.ipynb
│   ├── 04_cma_agent_demo.ipynb
│   └── 05_end_to_end_demo.ipynb
│
├── data/
│   ├── raw/                      ← Place CFPB complaints.csv here (one-time download)
│   ├── processed/                ← Output of EDA notebook; not required for training
│   └── sample_complaints/        ← Synthetic domain-specific CSVs for augmentation
│
├── models/                       ← Created by training; populated by start.py
│   ├── domain_classifier/        ← best_model.pt + tokenizer files
│   ├── evidence_ner/             ← best_model.pt + tokenizer files
│   ├── document_vit/             ← ViT checkpoint
│   └── next_action/              ← best_model.pt
│
├── CLAUDE.md                     ← Guidance for Claude Code
├── requirements.txt              ← presidio-analyzer, presidio-anonymizer, spacy, torch, etc.
├── .env.example                  ← Template: copy to .env and add ANTHROPIC_API_KEY
├── .gitignore
└── README.md
6. Setup and Running
Step 1: Get an Anthropic API key
- Go to https://console.anthropic.com → Sign up (free tier available)
- Navigate to API Keys → Create Key
- Copy the key (shown only once)
- In the project root, copy the template and fill in your key:
cp .env.example .env   # then edit .env: ANTHROPIC_API_KEY=sk-ant-...
Never commit .env to git (listed in .gitignore).
Step 2: Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_lg # Presidio NLP model (local, ~750 MB)
Step 3: Download CFPB data (first run only)
The DomainClassifier requires the CFPB Consumer Complaint Database:
- Download from Kaggle: consumer-complaint-database dataset
- Save the CSV to data/raw/complaints.csv
- Size: ~600 MB; one-time download; not committed to git
This is only needed to train the classifier. The NER and NextAction models generate their training data in-memory automatically.
Step 4: Run (single command)
# First run: trains all models, then starts servers
python start.py --cfpb_csv data/raw/complaints.csv
# After first run: models already trained, skip training
python start.py --no-train
# Force retrain everything:
python start.py --cfpb_csv data/raw/complaints.csv --train
# Train only (no servers):
python start.py --cfpb_csv data/raw/complaints.csv --train-only
When running, start.py will:
- Validate .env and ANTHROPIC_API_KEY
- Train DomainClassifier (~30 min CPU / ~5 min GPU T4); skipped if checkpoint exists
- Train EvidenceNER (~10 min CPU); skipped if checkpoint exists
- Train NextActionMLP (< 30 sec CPU); skipped if checkpoint exists
- Start FastAPI at http://localhost:8000 (Swagger docs at /docs)
- Start Gradio UI at http://localhost:7860
Both servers print to the same terminal, prefixed with [API] or [UI]. Press Ctrl+C to stop everything cleanly.
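The "skipped if checkpoint exists" behaviour, together with the `--train` and `--no-train` flags, could be implemented along these lines. This is a stdlib-only sketch under assumptions: `plan_training` and its parameters are hypothetical, and the checkpoint paths are inferred from the models/ layout in the directory tree (the ViT checkpoint is left out because its training is listed as in progress).

```python
from pathlib import Path
from typing import Dict, List

# Checkpoint locations relative to models/, following the directory tree.
CHECKPOINTS: Dict[str, str] = {
    "domain_classifier": "domain_classifier/best_model.pt",
    "evidence_ner": "evidence_ner/best_model.pt",
    "next_action": "next_action/best_model.pt",
}


def plan_training(models_dir: Path, force: bool = False, no_train: bool = False) -> List[str]:
    """Return the models that still need training, in launch order."""
    if no_train:
        return []  # --no-train: start servers with whatever exists
    return [
        name
        for name, rel_path in CHECKPOINTS.items()
        if force or not (models_dir / rel_path).exists()  # --train forces a retrain
    ]


print(plan_training(Path("models")))
```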
7. Development Phases
| Phase | Deliverable | Status |
|---|---|---|
| 0 | Abstract submitted (G.U.I.D.E.) | Done |
| 1 | Architecture + spec (docs/architecture.md) | Done |
| 2 | Project scaffold + environment setup | Done |
| 3 | Presidio PII redaction layer (src/privacy/) | Done |
| 4 | DL: DomainClassifier (model, dataset, train) | Done |
| 5 | DL: EvidenceNER + NextActionMLP (model, train) | Done |
| 6 | DL: ViT document extractor (fine-tuning) | In progress |
| 7 | CMA agent + tools integration | Done |
| 8 | Document processor (OCR + ViT pipeline) | Done |
| 9 | FastAPI backend (HITL endpoint, schemas) | Done |
| 10 | Gradio UI (Verify tab, privacy badge, HITL) | Done |
| 11 | Single launcher (start.py) + CLAUDE.md | Done |
| 12 | Integration testing + demo notebooks | Remaining |
| 13 | Final report + presentation | Remaining |