# Phase 2 Architecture Diagram

## System Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         SCAMSHIELD AI SYSTEM                             │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        USER INTERFACES                              │ │
│  │                                                                      │ │
│  │  ┌──────────────────┐              ┌──────────────────┐            │ │
│  │  │   Phase 1 UI     │              │   Phase 2 UI     │            │ │
│  │  │  (index.html)    │              │  (voice.html)    │            │ │
│  │  │                  │              │                  │            │ │
│  │  │  Text Chat       │              │  🎤 Voice Chat   │            │ │
│  │  │  Interface       │              │  Interface       │            │ │
│  │  └────────┬─────────┘              └────────┬─────────┘            │ │
│  └───────────┼────────────────────────────────┼──────────────────────┘ │
│              │                                 │                        │
│              │ HTTP/JSON                       │ HTTP/FormData          │
│              │                                 │                        │
└──────────────┼─────────────────────────────────┼────────────────────────┘
               │                                 │
               ▼                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                            API LAYER                                      │
│                                                                           │
│  ┌────────────────────────┐         ┌────────────────────────┐          │
│  │  Phase 1 Endpoints     │         │  Phase 2 Endpoints     │          │
│  │  (endpoints.py)        │         │  (voice_endpoints.py)  │          │
│  │                        │         │                        │          │
│  │  POST /honeypot/engage │         │  POST /voice/engage    │          │
│  │  GET  /honeypot/session│         │  GET  /voice/audio/:id │          │
│  │  POST /honeypot/batch  │         │  GET  /voice/health    │          │
│  └────────────┬───────────┘         └────────────┬───────────┘          │
└───────────────┼──────────────────────────────────┼───────────────────────┘
                │                                   │
                │                                   │
                │                                   ▼
                │                    ┌──────────────────────────┐
                │                    │   Phase 2 Voice Layer    │
                │                    │                          │
                │                    │  ┌────────────────────┐  │
                │                    │  │  Audio Upload      │  │
                │                    │  │  (multipart/form)  │  │
                │                    │  └─────────┬──────────┘  │
                │                    │            │             │
                │                    │            ▼             │
                │                    │  ┌────────────────────┐  │
                │                    │  │  ASR Engine        │  │
                │                    │  │  (Whisper)         │  │
                │                    │  │                    │  │
                │                    │  │  Audio → Text      │  │
                │                    │  └─────────┬──────────┘  │
                │                    │            │             │
                │                    └────────────┼─────────────┘
                │                                 │
                │                                 │ Transcribed Text
                │                                 │
                ▼◄────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│                     PHASE 1 CORE (UNCHANGED)                              │
│                                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                      DETECTION LAYER                                 │ │
│  │                                                                      │ │
│  │  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────┐ │ │
│  │  │ Language Detector│    │  Scam Detector   │    │  Scam Type   │ │ │
│  │  │  (language.py)   │───▶│  (detector.py)   │───▶│  Classifier  │ │ │
│  │  │                  │    │                  │    │              │ │ │
│  │  │  Auto-detect     │    │  IndicBERT +     │    │  Financial   │ │ │
│  │  │  en/hi/gu/etc    │    │  Rules-based     │    │  Tech Support│ │ │
│  │  └──────────────────┘    └──────────────────┘    │ Impersonation│ │ │
│  │                                                  └──────────────┘ │ │
│  └──────────────────────────────────────┬───────────────────────────────┘ │
│                                         │                                 │
│                                         ▼                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                      HONEYPOT LAYER                                  │ │
│  │                                                                      │ │
│  │  ┌──────────────────────────────────────────────────────────────┐  │ │
│  │  │              LangGraph Workflow                               │  │ │
│  │  │                                                               │  │ │
│  │  │  ┌─────────┐      ┌─────────┐      ┌─────────┐             │  │ │
│  │  │  │  Plan   │─────▶│Generate │─────▶│ Extract │             │  │ │
│  │  │  │  Node   │      │  Node   │      │  Node   │             │  │ │
│  │  │  └─────────┘      └─────────┘      └─────────┘             │  │ │
│  │  │      │                 │                 │                  │  │ │
│  │  │      │                 ▼                 │                  │  │ │
│  │  │      │           ┌──────────┐            │                  │  │ │
│  │  │      │           │ Groq LLM │            │                  │  │ │
│  │  │      │           │ (Llama)  │            │                  │  │ │
│  │  │      │           └──────────┘            │                  │  │ │
│  │  │      │                                   │                  │  │ │
│  │  │      └───────────────┬───────────────────┘                  │  │ │
│  │  │                      │                                      │  │ │
│  │  │                      ▼                                      │  │ │
│  │  │              ┌───────────────┐                             │  │ │
│  │  │              │ State Manager │                             │  │ │
│  │  │              │  (Redis)      │                             │  │ │
│  │  │              └───────────────┘                             │  │ │
│  │  └──────────────────────────────────────────────────────────────┘  │ │
│  │                                                                      │ │
│  │  Output: Text Reply                                                 │ │
│  └──────────────────────────────────────┬───────────────────────────────┘ │
│                                         │                                 │
│                                         ▼                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                    EXTRACTION LAYER                                  │ │
│  │                                                                      │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │ │
│  │  │ UPI Extractor│  │ Bank Account │  │ Phone/URL    │             │ │
│  │  │  (Regex)     │  │  Extractor   │  │  Extractor   │             │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘             │ │
│  │                                                                      │ │
│  │  Output: Extracted Intelligence                                     │ │
│  └──────────────────────────────────────┬───────────────────────────────┘ │
└─────────────────────────────────────────┼───────────────────────────────┘
                                          │
                                          │ Text Reply
                                          │
                                          ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                     PHASE 2 OUTPUT LAYER                                  │
│                                                                           │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                    TTS Engine (gTTS)                                │  │
│  │                                                                     │  │
│  │  Text Reply ───▶ Text-to-Speech ───▶ Audio File (.mp3)            │  │
│  │                                                                     │  │
│  │  Languages: en, hi, gu, ta, te, bn, mr                             │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  Output: Audio URL                                                        │
└───────────────────────────────────────────────────────────────────────────┘
                                          │
                                          │ Audio URL
                                          │
                                          ▼
                                    ┌──────────┐
                                    │  User    │
                                    │  Hears   │
                                    │  AI Voice│
                                    └──────────┘
```

## Data Flow: Voice Conversation

### Step-by-Step Flow

```
1. USER SPEAKS
   │
   │ "Your account is blocked. Send OTP now!"
   │
   ▼
2. BROWSER CAPTURES AUDIO
   │
   │ MediaRecorder API → WebM/WAV blob
   │
   ▼
3. UPLOAD TO API
   │
   │ POST /api/v1/voice/engage
   │ FormData: audio_file, session_id, language
   │
   ▼
4. ASR (WHISPER)
   │
   │ Audio → Text Transcription
   │ Output: "Your account is blocked. Send OTP now!"
   │ Language: "en"
   │ Confidence: 0.95
   │
   ▼
5. PHASE 1 DETECTION (UNCHANGED)
   │
   │ Input: Transcribed text
   │ Scam Detector → is_scam: true, confidence: 0.92
   │ Type: "financial_fraud"
   │
   ▼
6. PHASE 1 HONEYPOT (UNCHANGED)
   │
   │ LangGraph Workflow:
   │   - Plan: Select "confused_elderly" persona
   │   - Generate: Groq LLM creates reply
   │   - Extract: Parse for UPI/bank/phone
   │
   │ Output: "Oh no! What should I do? I'm scared!"
   │
   ▼
7. TTS (gTTS)
   │
   │ Text → Speech Synthesis
   │ Input: "Oh no! What should I do? I'm scared!"
   │ Language: "en"
   │ Output: /tmp/reply_xyz.mp3
   │
   ▼
8. RESPONSE TO USER
   │
   │ JSON Response:
   │ {
   │   "ai_reply_text": "Oh no! What should I do? I'm scared!",
   │   "ai_reply_audio_url": "/api/v1/voice/audio/reply_xyz.mp3",
   │   "transcription": {...},
   │   "scam_detected": true,
   │   ...
   │ }
   │
   ▼
9. BROWSER PLAYS AUDIO
   │
   │ <audio controls src="/api/v1/voice/audio/reply_xyz.mp3">
   │
   ▼
10. USER HEARS AI VOICE
    │
    │ "Oh no! What should I do? I'm scared!"
    │
    └─▶ Loop continues for multi-turn conversation
```
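
The flow above amounts to three calls wrapped around an unchanged text pipeline. The sketch below is illustrative only: `run_voice_turn`, `VoiceTurnResult`, and the injected `transcribe`, `engage_text`, and `synthesize` callables are hypothetical names, not the project's actual API. The result fields mirror the JSON response shown in step 8.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class VoiceTurnResult:
    """Fields mirror the step 8 JSON response; the class itself is hypothetical."""
    ai_reply_text: str
    ai_reply_audio_url: str
    transcription: Dict[str, object]
    scam_detected: bool


def run_voice_turn(
    audio: bytes,
    session_id: str,
    transcribe: Callable[[bytes], Tuple[str, str, float]],
    engage_text: Callable[[str, str, str], Tuple[str, bool]],
    synthesize: Callable[[str, str], str],
) -> VoiceTurnResult:
    """Wrap the text pipeline with ASR on the way in and TTS on the way out."""
    # Step 4: speech to text (assumed wrapper around Whisper).
    text, language, confidence = transcribe(audio)
    # Steps 5-6: the unchanged Phase 1 pipeline only ever sees text.
    reply_text, scam_detected = engage_text(text, session_id, language)
    # Step 7: text to speech (assumed wrapper around gTTS), returning a URL.
    audio_url = synthesize(reply_text, language)
    return VoiceTurnResult(
        ai_reply_text=reply_text,
        ai_reply_audio_url=audio_url,
        transcription={"text": text, "language": language, "confidence": confidence},
        scam_detected=scam_detected,
    )
```

Passing the three stages in as callables keeps the voice layer free of direct imports from Phase 1 modules, which is one way to preserve the isolation described in the next section.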

## Component Isolation

### Phase 1 (Existing - Untouched)

```
┌─────────────────────────────────────┐
│        PHASE 1 COMPONENTS           │
│                                     │
│  ✅ app/models/detector.py          │
│  ✅ app/models/language.py          │
│  ✅ app/models/extractor.py         │
│  ✅ app/agent/honeypot.py           │
│  ✅ app/agent/personas.py           │
│  ✅ app/agent/strategies.py         │
│  ✅ app/api/endpoints.py            │
│  ✅ app/api/schemas.py              │
│  ✅ ui/index.html                   │
│  ✅ ui/app.js                       │
│                                     │
│  NO MODIFICATIONS REQUIRED          │
└─────────────────────────────────────┘
```

### Phase 2 (New - Isolated)

```
┌─────────────────────────────────────┐
│        PHASE 2 COMPONENTS           │
│                                     │
│  🆕 app/voice/asr.py                │
│  🆕 app/voice/tts.py                │
│  🆕 app/voice/fraud_detector.py     │
│  🆕 app/api/voice_endpoints.py      │
│  🆕 app/api/voice_schemas.py        │
│  🆕 ui/voice.html                   │
│  🆕 ui/voice.js                     │
│  🆕 ui/voice.css                    │
│                                     │
│  COMPLETELY SEPARATE                │
└─────────────────────────────────────┘
```

### Integration Points (Minimal)

```
┌─────────────────────────────────────┐
│      INTEGRATION POINTS             │
│                                     │
│  📝 app/main.py                     │
│     + Add voice router (conditional)│
│                                     │
│  📝 app/config.py                   │
│     + Add Phase 2 settings          │
│                                     │
│  📝 .env                            │
│     + Add PHASE_2_ENABLED=true      │
│                                     │
│  MINIMAL CHANGES                    │
└─────────────────────────────────────┘
```
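
A minimal sketch of the conditional router registration, assuming the flag is read from the environment and defaults to disabled when unset (the default is an assumption, as are the helper names `phase_2_enabled` and `routers_to_mount`). In a FastAPI app, each returned module's router would then be passed to `app.include_router(...)`.

```python
import os
from typing import List, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def phase_2_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read PHASE_2_ENABLED; any value outside TRUTHY disables Phase 2."""
    source = os.environ if env is None else env
    return source.get("PHASE_2_ENABLED", "false").strip().lower() in TRUTHY


def routers_to_mount(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the module paths whose routers should be registered at startup."""
    routers = ["app.api.endpoints"]  # Phase 1 is always mounted
    if phase_2_enabled(env):
        # Import voice code only when enabled, so Whisper and gTTS stay optional.
        routers.append("app.api.voice_endpoints")
    return routers
```

Deferring the voice import behind the flag means a Phase 1 only deployment does not need the Phase 2 dependencies installed.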

## Optional: Voice Fraud Detection

```
┌──────────────────────────────────────────────────────────────┐
│              VOICE FRAUD DETECTION (OPTIONAL)                 │
│                                                               │
│  Audio Input                                                  │
│      │                                                        │
│      ├──────────────────────────────────────┐                │
│      │                                      │                │
│      ▼                                      ▼                │
│  ┌─────────┐                          ┌─────────┐           │
│  │   ASR   │                          │ Fraud   │           │
│  │(Whisper)│                          │Detector │           │
│  └────┬────┘                          │(Wav2Vec2│           │
│       │                               │resemb.) │           │
│       │                               └────┬────┘           │
│       │                                    │                │
│       │ Transcribed Text                   │ Fraud Score    │
│       │                                    │                │
│       ▼                                    ▼                │
│  ┌──────────────────────────────────────────────┐          │
│  │         Combined Analysis                     │          │
│  │                                               │          │
│  │  - Text content (scam detection)              │          │
│  │  - Voice authenticity (fraud detection)       │          │
│  │                                               │          │
│  │  Risk Score = f(scam_conf, fraud_conf)        │          │
│  └──────────────────────────────────────────────┘          │
│                                                              │
│  If VOICE_FRAUD_DETECTION=true                              │
└──────────────────────────────────────────────────────────────┘
```
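
The diagram leaves `f(scam_conf, fraud_conf)` unspecified. One plausible choice, shown purely as an assumption, is a noisy-OR: the risk is high when either signal is high, and two moderate signals reinforce each other.

```python
def combined_risk(scam_conf: float, fraud_conf: float) -> float:
    """Noisy-OR of two confidences in [0, 1] (an assumed form of f, not the project's)."""
    for name, value in (("scam_conf", scam_conf), ("fraud_conf", fraud_conf)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Probability that at least one of the two signals is a true positive,
    # treating them as independent.
    return 1.0 - (1.0 - scam_conf) * (1.0 - fraud_conf)
```

When voice fraud detection is disabled, passing `fraud_conf=0.0` leaves the text-based scam confidence unchanged, so the same function covers both configurations. A weighted average or a learned model would be equally valid alternatives.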

## Database & State Management

```
┌─────────────────────────────────────────────────────────────┐
│                   STATE PERSISTENCE                          │
│                                                              │
│  ┌──────────────────────┐         ┌──────────────────────┐ │
│  │      Redis           │         │     PostgreSQL       │ │
│  │  (Session State)     │         │  (Conversation Logs) │ │
│  │                      │         │                      │ │
│  │  - Active sessions   │         │  - Full transcripts  │ │
│  │  - Turn count        │         │  - Extracted intel   │ │
│  │  - Extracted intel   │         │  - Audio metadata    │ │
│  │  - Persona state     │         │  - Timestamps        │ │
│  │  - TTL: 1 hour       │         │  - Permanent storage │ │
│  └──────────────────────┘         └──────────────────────┘ │
│                                                              │
│  SAME AS PHASE 1 - NO CHANGES                               │
└─────────────────────────────────────────────────────────────┘
```
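
To illustrate the one-hour session TTL, here is an in-memory stand-in for the Redis session store. The class name and methods are hypothetical; with a real Redis client the same behaviour would typically come from writing the key with an expiry (for example the `SETEX` command) rather than tracking timestamps by hand.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

SESSION_TTL_SECONDS = 3600  # "TTL: 1 hour" from the diagram


class InMemorySessionStore:
    """Stand-in for the Redis session store; entries expire after a TTL."""

    def __init__(self, ttl: int = SESSION_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def save(self, session_id: str, state: dict) -> None:
        # Store JSON text plus an absolute expiry; each save refreshes the TTL.
        self._data[session_id] = (self._clock() + self._ttl, json.dumps(state))

    def load(self, session_id: str) -> Optional[dict]:
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            del self._data[session_id]  # lazily evict expired sessions
            return None
        return json.loads(payload)
```

Because voice and text turns both key on `session_id`, a session started in one UI could in principle be resumed from the other within the TTL window.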

## Performance Considerations

```
┌─────────────────────────────────────────────────────────────┐
│                    LATENCY BREAKDOWN                         │
│                                                              │
│  User Speaks (5s audio)                                      │
│      │                                                       │
│      ▼                                                       │
│  Upload to API (0.5s)                                        │
│      │                                                       │
│      ▼                                                       │
│  ASR Transcription (1.5s)  ◄── Whisper base model           │
│      │                                                       │
│      ▼                                                       │
│  Scam Detection (0.2s)     ◄── IndicBERT                    │
│      │                                                       │
│      ▼                                                       │
│  Honeypot LLM (1.0s)       ◄── Groq API                     │
│      │                                                       │
│      ▼                                                       │
│  TTS Synthesis (0.8s)      ◄── gTTS                         │
│      │                                                       │
│      ▼                                                       │
│  Download Audio (0.3s)                                       │
│      │                                                       │
│      ▼                                                       │
│  Total: ~4.3s                                                │
│                                                              │
│  Target: <5s ✅                                              │
└─────────────────────────────────────────────────────────────┘
```
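
The total can be checked by summing the per-stage estimates from the diagram. The helper below is a small sketch (the function name and stage keys are illustrative) that also reports the headroom against the 5s target and the slowest stage.

```python
# Per-stage estimates copied from the latency diagram, in seconds.
STAGE_LATENCY_S = {
    "upload": 0.5,
    "asr_whisper_base": 1.5,
    "scam_detection": 0.2,
    "honeypot_llm": 1.0,
    "tts_gtts": 0.8,
    "audio_download": 0.3,
}
TARGET_S = 5.0


def latency_report(stages: dict, target: float) -> dict:
    """Sum per-stage estimates and report headroom against the target."""
    total = round(sum(stages.values()), 3)
    return {
        "total_s": total,
        "headroom_s": round(target - total, 3),
        "within_target": total < target,
        "slowest_stage": max(stages, key=stages.get),
    }


report = latency_report(STAGE_LATENCY_S, TARGET_S)
```

With these numbers the loop has about 0.7s of headroom, and ASR is the largest single contributor, so it is the first place to look if the target is missed.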

## Deployment Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      DEPLOYMENT                              │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │              Docker Container                           │ │
│  │                                                         │ │
│  │  ┌──────────────────────────────────────────────────┐  │ │
│  │  │         FastAPI Application                       │  │ │
│  │  │                                                   │  │ │
│  │  │  - Phase 1 Endpoints (always enabled)            │  │ │
│  │  │  - Phase 2 Endpoints (if PHASE_2_ENABLED=true)   │  │ │
│  │  │                                                   │  │ │
│  │  │  Dependencies:                                    │  │ │
│  │  │    - Base: transformers, langchain, groq         │  │ │
│  │  │    - Phase 2: whisper, gTTS, torchaudio          │  │ │
│  │  └──────────────────────────────────────────────────┘  │ │
│  │                                                         │ │
│  │  ┌──────────────────────────────────────────────────┐  │ │
│  │  │         Model Cache                               │  │ │
│  │  │                                                   │  │ │
│  │  │  - IndicBERT (~500MB)                            │  │ │
│  │  │  - Whisper base (~150MB) [Phase 2]               │  │ │
│  │  └──────────────────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
│  External Services:                                          │
│  - Redis (session state)                                     │
│  - PostgreSQL (conversation logs)                            │
│  - Groq API (LLM)                                            │
└─────────────────────────────────────────────────────────────┘
```

## Security Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    SECURITY LAYERS                           │
│                                                              │
│  1. API Authentication                                       │
│     └─▶ x-api-key header (both Phase 1 & 2)                 │
│                                                              │
│  2. Input Validation                                         │
│     ├─▶ File size limits (<10MB)                            │
│     ├─▶ File type validation (audio/* only)                 │
│     └─▶ Sanitize filenames                                  │
│                                                              │
│  3. Rate Limiting                                            │
│     └─▶ Max requests per session                            │
│                                                              │
│  4. Data Privacy                                             │
│     ├─▶ Temporary audio storage                             │
│     ├─▶ Auto-delete after processing                        │
│     └─▶ No raw audio in database                            │
│                                                              │
│  5. Error Handling                                           │
│     └─▶ No sensitive info in error messages                 │
└─────────────────────────────────────────────────────────────┘
```
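
The input validation layer can be sketched with the standard library alone. The function names and error messages are assumptions; the limits come from the diagram (under 10MB, `audio/*` only, sanitized filenames).

```python
import os
import re
from typing import Optional

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # "<10MB" from the diagram


def sanitize_filename(name: str) -> str:
    """Drop any directory part and replace characters outside a safe set."""
    base = os.path.basename(name.replace("\\", "/"))
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "upload"


def validate_upload(filename: str, content_type: Optional[str], size: int) -> str:
    """Return a safe filename, or raise ValueError with a generic message."""
    if not content_type or not content_type.lower().startswith("audio/"):
        raise ValueError("unsupported file type")
    if size <= 0 or size >= MAX_AUDIO_BYTES:
        raise ValueError("invalid file size")
    return sanitize_filename(filename)
```

The error messages are deliberately generic, in line with layer 5. Note that a client-supplied content type is only a hint; a stricter check would inspect the file's leading bytes as well.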

---

## Key Takeaways

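1. **Phase 2 wraps Phase 1**: Voice is an input/output layer only (ASR in, TTS out)
2. **Zero modifications to the Phase 1 core**: Detection, honeypot, and extraction are unchanged; only `app/main.py`, `app/config.py`, and `.env` gain additions
3. **Conditional loading**: Phase 2 loads only when `PHASE_2_ENABLED=true`
4. **Separate UI**: The voice UI (`voice.html`) is independent of the text UI (`index.html`)
5. **Same state management**: Reuses the Phase 1 Redis and PostgreSQL stores
6. **Performance target**: <5s for the full voice loop (estimated ~4.3s)
7. **Security first**: Uploads are validated and rate-limited; audio files are temporary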

---

**For detailed implementation steps, see:** `PHASE_2_VOICE_IMPLEMENTATION_PLAN.md`

**For quick start guide, see:** `PHASE_2_README.md`

**For progress tracking, see:** `PHASE_2_CHECKLIST.md`