# Phase 2 Architecture Diagram
## System Overview
```
┌───────────────────────────────────────────────────────────────────────┐
│                         SCAMSHIELD AI SYSTEM                          │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                         USER INTERFACES                         │  │
│  │                                                                 │  │
│  │      ┌──────────────────┐             ┌──────────────────┐      │  │
│  │      │    Phase 1 UI    │             │    Phase 2 UI    │      │  │
│  │      │   (index.html)   │             │   (voice.html)   │      │  │
│  │      │                  │             │                  │      │  │
│  │      │    Text Chat     │             │  🎤 Voice Chat   │      │  │
│  │      │    Interface     │             │    Interface     │      │  │
│  │      └────────┬─────────┘             └────────┬─────────┘      │  │
│  └───────────────┼────────────────────────────────┼────────────────┘  │
│                  │                                │                   │
│                  │ HTTP/JSON                      │ HTTP/FormData     │
│                  │                                │                   │
└──────────────────┼────────────────────────────────┼───────────────────┘
                   │                                │
                   ▼                                ▼
┌───────────────────────────────────────────────────────────────────────┐
│                               API LAYER                               │
│                                                                       │
│      ┌────────────────────────┐       ┌────────────────────────┐      │
│      │ Phase 1 Endpoints      │       │ Phase 2 Endpoints      │      │
│      │ (endpoints.py)         │       │ (voice_endpoints.py)   │      │
│      │                        │       │                        │      │
│      │ POST /honeypot/engage  │       │ POST /voice/engage     │      │
│      │ GET  /honeypot/session │       │ GET  /voice/audio/:id  │      │
│      │ POST /honeypot/batch   │       │ GET  /voice/health     │      │
│      └───────────┬────────────┘       └───────────┬────────────┘      │
└──────────────────┼────────────────────────────────┼───────────────────┘
                   │                                │
                   │                                │
                   │                                ▼
                   │                   ┌─────────────────────────┐
                   │                   │   Phase 2 Voice Layer   │
                   │                   │                         │
                   │                   │  ┌───────────────────┐  │
                   │                   │  │   Audio Upload    │  │
                   │                   │  │ (multipart/form)  │  │
                   │                   │  └─────────┬─────────┘  │
                   │                   │            │            │
                   │                   │            ▼            │
                   │                   │  ┌───────────────────┐  │
                   │                   │  │    ASR Engine     │  │
                   │                   │  │     (Whisper)     │  │
                   │                   │  │                   │  │
                   │                   │  │   Audio → Text    │  │
                   │                   │  └─────────┬─────────┘  │
                   │                   │            │            │
                   │                   └────────────┼────────────┘
                   │                                │
                   │                                │ Transcribed Text
                   │                                │
                   ▼────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────┐
│                       PHASE 1 CORE (UNCHANGED)                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                         DETECTION LAYER                         │  │
│  │                                                                 │  │
│  │ ┌──────────────────┐   ┌──────────────────┐   ┌──────────────┐  │  │
│  │ │ Language Detector│   │  Scam Detector   │   │  Scam Type   │  │  │
│  │ │  (language.py)   │──▶│  (detector.py)   │──▶│  Classifier  │  │  │
│  │ │                  │   │                  │   │              │  │  │
│  │ │  Auto-detect     │   │  IndicBERT +     │   │ Financial    │  │  │
│  │ │  en/hi/gu/etc    │   │  Rules-based     │   │ Tech Support │  │  │
│  │ └──────────────────┘   └──────────────────┘   │ Impersonation│  │  │
│  │                                               └──────────────┘  │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
│                                   │                                   │
│                                   ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                         HONEYPOT LAYER                          │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │                    LangGraph Workflow                     │  │  │
│  │  │                                                           │  │  │
│  │  │        ┌─────────┐     ┌─────────┐     ┌─────────┐        │  │  │
│  │  │        │  Plan   │────▶│Generate │────▶│ Extract │        │  │  │
│  │  │        │  Node   │     │  Node   │     │  Node   │        │  │  │
│  │  │        └────┬────┘     └────┬────┘     └────┬────┘        │  │  │
│  │  │             │               │               │             │  │  │
│  │  │             │               ▼               │             │  │  │
│  │  │             │         ┌───────────┐         │             │  │  │
│  │  │             │         │ Groq LLM  │         │             │  │  │
│  │  │             │         │  (Llama)  │         │             │  │  │
│  │  │             │         └───────────┘         │             │  │  │
│  │  │             │                               │             │  │  │
│  │  │             └───────────────┬───────────────┘             │  │  │
│  │  │                             │                             │  │  │
│  │  │                             ▼                             │  │  │
│  │  │                     ┌───────────────┐                     │  │  │
│  │  │                     │ State Manager │                     │  │  │
│  │  │                     │    (Redis)    │                     │  │  │
│  │  │                     └───────────────┘                     │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  │                                                                 │  │
│  │                       Output: Text Reply                        │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
│                                   │                                   │
│                                   ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                        EXTRACTION LAYER                         │  │
│  │                                                                 │  │
│  │    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │  │
│  │    │ UPI Extractor│    │ Bank Account │    │  Phone/URL   │     │  │
│  │    │   (Regex)    │    │  Extractor   │    │  Extractor   │     │  │
│  │    └──────────────┘    └──────────────┘    └──────────────┘     │  │
│  │                                                                 │  │
│  │                 Output: Extracted Intelligence                  │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
└───────────────────────────────────┼───────────────────────────────────┘
                                    │
                                    │ Text Reply
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                         PHASE 2 OUTPUT LAYER                          │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                        TTS Engine (gTTS)                        │  │
│  │                                                                 │  │
│  │      Text Reply ───▶ Text-to-Speech ───▶ Audio File (.mp3)      │  │
│  │                                                                 │  │
│  │              Languages: en, hi, gu, ta, te, bn, mr              │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
│                           Output: Audio URL                           │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Audio URL
                                    │
                                    ▼
                              ┌───────────┐
                              │   User    │
                              │   Hears   │
                              │ AI Voice  │
                              └───────────┘
```
## Data Flow: Voice Conversation
### Step-by-Step Flow
```
1. USER SPEAKS
   │
   │  "Your account is blocked. Send OTP now!"
   │
   ▼
2. BROWSER CAPTURES AUDIO
   │
   │  MediaRecorder API → WebM/WAV blob
   │
   ▼
3. UPLOAD TO API
   │
   │  POST /api/v1/voice/engage
   │  FormData: audio_file, session_id, language
   │
   ▼
4. ASR (WHISPER)
   │
   │  Audio → Text Transcription
   │  Output: "Your account is blocked. Send OTP now!"
   │  Language: "en"
   │  Confidence: 0.95
   │
   ▼
5. PHASE 1 DETECTION (UNCHANGED)
   │
   │  Input: Transcribed text
   │  Scam Detector → is_scam: true, confidence: 0.92
   │  Type: "financial_fraud"
   │
   ▼
6. PHASE 1 HONEYPOT (UNCHANGED)
   │
   │  LangGraph Workflow:
   │  - Plan: Select "confused_elderly" persona
   │  - Generate: Groq LLM creates reply
   │  - Extract: Parse for UPI/bank/phone
   │
   │  Output: "Oh no! What should I do? I'm scared!"
   │
   ▼
7. TTS (gTTS)
   │
   │  Text → Speech Synthesis
   │  Input: "Oh no! What should I do? I'm scared!"
   │  Language: "en"
   │  Output: /tmp/reply_xyz.mp3
   │
   ▼
8. RESPONSE TO USER
   │
   │  JSON Response:
   │  {
   │    "ai_reply_text": "Oh no! What should I do? I'm scared!",
   │    "ai_reply_audio_url": "/api/v1/voice/audio/reply_xyz.mp3",
   │    "transcription": {...},
   │    "scam_detected": true,
   │    ...
   │  }
   │
   ▼
9. BROWSER PLAYS AUDIO
   │
   │  <audio controls src="/api/v1/voice/audio/reply_xyz.mp3">
   │
   ▼
10. USER HEARS AI VOICE
   │
   │  "Oh no! What should I do? I'm scared!"
   │
   └─▶ Loop continues for multi-turn conversation
```
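The server-side part of the flow (steps 4 to 8) amounts to a thin wrapper around the Phase 1 text pipeline. Below is a minimal sketch, assuming each stage is injected as a plain callable. `VoiceTurnResult`, `run_voice_turn`, and the stub stages are hypothetical names chosen for illustration; they are not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class VoiceTurnResult:
    """Hypothetical response shape mirroring the JSON in step 8."""
    transcription: str
    language: str
    scam_detected: bool
    ai_reply_text: str
    ai_reply_audio_url: str


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], Tuple[str, str]],  # ASR: audio -> (text, language)
    detect_scam: Callable[[str], bool],              # Phase 1 detector, unchanged
    generate_reply: Callable[[str], str],            # Phase 1 honeypot, unchanged
    synthesize: Callable[[str, str], str],           # TTS: (text, language) -> audio URL
) -> VoiceTurnResult:
    """Run one voice turn: ASR, then the Phase 1 text path, then TTS."""
    text, language = transcribe(audio)
    is_scam = detect_scam(text)
    reply = generate_reply(text)
    audio_url = synthesize(reply, language)
    return VoiceTurnResult(text, language, is_scam, reply, audio_url)


# Stub stages standing in for Whisper, the detector, the honeypot agent, and gTTS.
result = run_voice_turn(
    b"fake-audio-bytes",
    transcribe=lambda _audio: ("Your account is blocked. Send OTP now!", "en"),
    detect_scam=lambda text: "otp" in text.lower(),
    generate_reply=lambda _text: "Oh no! What should I do? I'm scared!",
    synthesize=lambda _text, _lang: "/api/v1/voice/audio/reply_xyz.mp3",
)
print(result.scam_detected, result.ai_reply_audio_url)
```

Because the Phase 1 stages receive only text, they need no knowledge of whether a turn arrived as audio or as typed input.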
## Component Isolation
### Phase 1 (Existing - Untouched)
```
┌─────────────────────────────────────┐
│         PHASE 1 COMPONENTS          │
│                                     │
│  ✓ app/models/detector.py           │
│  ✓ app/models/language.py           │
│  ✓ app/models/extractor.py          │
│  ✓ app/agent/honeypot.py            │
│  ✓ app/agent/personas.py            │
│  ✓ app/agent/strategies.py          │
│  ✓ app/api/endpoints.py             │
│  ✓ app/api/schemas.py               │
│  ✓ ui/index.html                    │
│  ✓ ui/app.js                        │
│                                     │
│      NO MODIFICATIONS REQUIRED      │
└─────────────────────────────────────┘
```
### Phase 2 (New - Isolated)
```
┌─────────────────────────────────────┐
│         PHASE 2 COMPONENTS          │
│                                     │
│  🆕 app/voice/asr.py                │
│  🆕 app/voice/tts.py                │
│  🆕 app/voice/fraud_detector.py     │
│  🆕 app/api/voice_endpoints.py      │
│  🆕 app/api/voice_schemas.py        │
│  🆕 ui/voice.html                   │
│  🆕 ui/voice.js                     │
│  🆕 ui/voice.css                    │
│                                     │
│         COMPLETELY SEPARATE         │
└─────────────────────────────────────┘
```
### Integration Points (Minimal)
```
┌─────────────────────────────────────┐
│         INTEGRATION POINTS          │
│                                     │
│  📝 app/main.py                     │
│     + Add voice router (conditional)│
│                                     │
│  📝 app/config.py                   │
│     + Add Phase 2 settings          │
│                                     │
│  📝 .env                            │
│     + Add PHASE_2_ENABLED=true      │
│                                     │
│           MINIMAL CHANGES           │
└─────────────────────────────────────┘
```
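The conditional router registration could look roughly like the sketch below. It uses only the standard library and models the decision rather than the FastAPI wiring. The helper names `phase_2_enabled` and `routers_to_mount` and the set of accepted truthy strings are assumptions; only the `PHASE_2_ENABLED` variable and the two module paths come from the diagrams above.

```python
import os
from typing import List, Mapping

# Strings treated as "enabled"; this set is an assumption for illustration.
TRUTHY = {"1", "true", "yes", "on"}


def phase_2_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the PHASE_2_ENABLED flag, defaulting to disabled."""
    return env.get("PHASE_2_ENABLED", "false").strip().lower() in TRUTHY


def routers_to_mount(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the router modules the app should load, Phase 1 first."""
    routers = ["app.api.endpoints"]  # Phase 1 is always enabled
    if phase_2_enabled(env):
        routers.append("app.api.voice_endpoints")  # Phase 2 only when enabled
    return routers


print(routers_to_mount({"PHASE_2_ENABLED": "true"}))
print(routers_to_mount({}))
```

In `app/main.py` the voice module would presumably be imported inside the `if` branch, so that Whisper and gTTS are never loaded when Phase 2 is disabled.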
## Optional: Voice Fraud Detection
```
┌────────────────────────────────────────────────────────────┐
│              VOICE FRAUD DETECTION (OPTIONAL)              │
│                                                            │
│  Audio Input                                               │
│       │                                                    │
│       ├──────────────────────────────┐                     │
│       │                              │                     │
│       ▼                              ▼                     │
│  ┌─────────┐                    ┌─────────┐                │
│  │   ASR   │                    │  Fraud  │                │
│  │(Whisper)│                    │Detector │                │
│  └────┬────┘                    │(Wav2Vec2│                │
│       │                         │resemb.) │                │
│       │                         └────┬────┘                │
│       │                              │                     │
│       │ Transcribed Text             │ Fraud Score         │
│       │                              │                     │
│       ▼                              ▼                     │
│  ┌────────────────────────────────────────────┐            │
│  │             Combined Analysis              │            │
│  │                                            │            │
│  │  - Text content (scam detection)           │            │
│  │  - Voice authenticity (fraud detection)    │            │
│  │                                            │            │
│  │  Risk Score = f(scam_conf, fraud_conf)     │            │
│  └────────────────────────────────────────────┘            │
│                                                            │
│  If VOICE_FRAUD_DETECTION=true                             │
└────────────────────────────────────────────────────────────┘
```
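The diagram leaves the combining function `f` unspecified. One possible choice, shown here purely as an assumption, is a noisy-OR: the risk is the probability that at least one of the two signals fires, treating them as independent. When voice fraud detection is disabled, the sketch falls back to the text score alone.

```python
from typing import Optional


def risk_score(scam_conf: float, fraud_conf: Optional[float] = None) -> float:
    """Combine text and voice confidences with a noisy-OR (an assumed choice of f)."""
    for name, value in (("scam_conf", scam_conf), ("fraud_conf", fraud_conf)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1]")
    if fraud_conf is None:
        # VOICE_FRAUD_DETECTION is off: only the text signal is available.
        return scam_conf
    return 1.0 - (1.0 - scam_conf) * (1.0 - fraud_conf)


print(risk_score(0.92))
print(risk_score(0.92, 0.5))
```

A noisy-OR never scores lower than either input, which suits a setting where a strong signal from one channel should not be diluted by a weak signal from the other. A weighted average or a learned model would be equally valid readings of `f`.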
## Database & State Management
```
┌───────────────────────────────────────────────────────────┐
│                     STATE PERSISTENCE                     │
│                                                           │
│   ┌──────────────────────┐     ┌──────────────────────┐   │
│   │        Redis         │     │      PostgreSQL      │   │
│   │   (Session State)    │     │ (Conversation Logs)  │   │
│   │                      │     │                      │   │
│   │ - Active sessions    │     │ - Full transcripts   │   │
│   │ - Turn count         │     │ - Extracted intel    │   │
│   │ - Extracted intel    │     │ - Audio metadata     │   │
│   │ - Persona state      │     │ - Timestamps         │   │
│   │ - TTL: 1 hour        │     │ - Permanent storage  │   │
│   └──────────────────────┘     └──────────────────────┘   │
│                                                           │
│               SAME AS PHASE 1 - NO CHANGES                │
└───────────────────────────────────────────────────────────┘
```
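The one-hour session TTL can be illustrated with an in-memory stand-in for Redis. This is a sketch of the expiry semantics only: `SessionStore` and its methods are hypothetical names, and a real deployment would rely on Redis key expiry instead of this class.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

SESSION_TTL_SECONDS = 3600  # "TTL: 1 hour" from the diagram


class SessionStore:
    """In-memory stand-in for the Redis session state (illustration only)."""

    def __init__(
        self,
        ttl: float = SESSION_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._data: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def save(self, session_id: str, state: Dict[str, Any]) -> None:
        """Store a copy of the state and restart the TTL window."""
        self._data[session_id] = (self._clock() + self._ttl, dict(state))

    def load(self, session_id: str) -> Optional[Dict[str, Any]]:
        """Return the state, or None if it is missing or has expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, state = entry
        if self._clock() >= expires_at:
            del self._data[session_id]
            return None
        return state


store = SessionStore()
store.save("session-1", {"turn_count": 1, "persona": "confused_elderly"})
print(store.load("session-1"))
```

Injecting the clock keeps the expiry logic testable without waiting an hour.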
## Performance Considerations
```
┌───────────────────────────────────────────────────────────┐
│                     LATENCY BREAKDOWN                     │
│                                                           │
│  User Speaks (5s audio)                                   │
│       │                                                   │
│       ▼                                                   │
│  Upload to API (0.5s)                                     │
│       │                                                   │
│       ▼                                                   │
│  ASR Transcription (1.5s)   ◄── Whisper base model        │
│       │                                                   │
│       ▼                                                   │
│  Scam Detection (0.2s)      ◄── IndicBERT                 │
│       │                                                   │
│       ▼                                                   │
│  Honeypot LLM (1.0s)        ◄── Groq API                  │
│       │                                                   │
│       ▼                                                   │
│  TTS Synthesis (0.8s)       ◄── gTTS                      │
│       │                                                   │
│       ▼                                                   │
│  Download Audio (0.3s)                                    │
│       │                                                   │
│       ▼                                                   │
│  Total: ~4.3s                                             │
│                                                           │
│  Target: <5s ✓                                            │
└───────────────────────────────────────────────────────────┘
```
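The ~4.3s total is the sum of the six processing stages; the 5s the user spends speaking is not counted. The short check below reproduces that arithmetic from the figures in the diagram. The dictionary keys are illustrative labels, not names from the codebase.

```python
# Per-stage estimates in seconds, copied from the latency diagram.
STAGE_LATENCY_S = {
    "upload": 0.5,
    "asr": 1.5,
    "scam_detection": 0.2,
    "honeypot_llm": 1.0,
    "tts": 0.8,
    "download": 0.3,
}
TARGET_S = 5.0

total = round(sum(STAGE_LATENCY_S.values()), 1)
headroom = round(TARGET_S - total, 1)
slowest = max(STAGE_LATENCY_S, key=STAGE_LATENCY_S.get)

print(total, headroom, slowest)  # 4.3 0.7 asr
```

With about 0.7s of headroom, ASR is the stage where a regression would most quickly push the loop past the 5s target.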
## Deployment Architecture
```
┌───────────────────────────────────────────────────────────┐
│                        DEPLOYMENT                         │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                  Docker Container                   │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │              FastAPI Application              │  │  │
│  │  │                                               │  │  │
│  │  │ - Phase 1 Endpoints (always enabled)          │  │  │
│  │  │ - Phase 2 Endpoints (if PHASE_2_ENABLED=true) │  │  │
│  │  │                                               │  │  │
│  │  │ Dependencies:                                 │  │  │
│  │  │ - Base: transformers, langchain, groq         │  │  │
│  │  │ - Phase 2: whisper, gTTS, torchaudio          │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │                  Model Cache                  │  │  │
│  │  │                                               │  │  │
│  │  │ - IndicBERT (~500MB)                          │  │  │
│  │  │ - Whisper base (~150MB) [Phase 2]             │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│  External Services:                                       │
│  - Redis (session state)                                  │
│  - PostgreSQL (conversation logs)                         │
│  - Groq API (LLM)                                         │
└───────────────────────────────────────────────────────────┘
```
## Security Architecture
```
┌───────────────────────────────────────────────────────────┐
│                      SECURITY LAYERS                      │
│                                                           │
│  1. API Authentication                                    │
│     └─▶ x-api-key header (both Phase 1 & 2)               │
│                                                           │
│  2. Input Validation                                      │
│     ├─▶ File size limits (<10MB)                          │
│     ├─▶ File type validation (audio/* only)               │
│     └─▶ Sanitize filenames                                │
│                                                           │
│  3. Rate Limiting                                         │
│     └─▶ Max requests per session                          │
│                                                           │
│  4. Data Privacy                                          │
│     ├─▶ Temporary audio storage                           │
│     ├─▶ Auto-delete after processing                      │
│     └─▶ No raw audio in database                          │
│                                                           │
│  5. Error Handling                                        │
│     └─▶ No sensitive info in error messages               │
└───────────────────────────────────────────────────────────┘
```
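The input-validation layer (size limit, `audio/*` type check, filename sanitization) can be sketched with the standard library alone. `validate_upload`, `sanitize_filename`, the allowed character set, and the `"upload"` fallback name are assumptions made for this example; the 10MB limit and the `audio/*` rule come from the diagram.

```python
import re
from pathlib import PurePosixPath

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # uploads must stay below 10MB


def validate_upload(content_type: str, size_bytes: int) -> None:
    """Reject uploads that are not audio or fall outside the size limit.

    Error messages are deliberately generic so they leak no internals.
    """
    if not content_type.lower().startswith("audio/"):
        raise ValueError("unsupported content type")
    if size_bytes <= 0 or size_bytes >= MAX_AUDIO_BYTES:
        raise ValueError("file size out of range")


def sanitize_filename(filename: str) -> str:
    """Drop directory parts and replace anything outside a safe character set."""
    # Normalize Windows separators, then keep only the final path component.
    name = PurePosixPath(filename.replace("\\", "/")).name
    # Replace unsafe characters and strip leading dots (no hidden or ".." names).
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).lstrip(".")
    return name or "upload"


validate_upload("audio/webm", 2_000_000)
print(sanitize_filename("../../etc/passwd"))
print(sanitize_filename("my voice?.webm"))
```

Checking the declared content type is only a first filter; a stricter variant would also inspect the file's leading bytes before passing it to the ASR engine.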
---

## Key Takeaways
1. **Phase 2 wraps Phase 1**: Voice is only an input/output layer around the existing text pipeline.
2. **Zero modifications to Phase 1**: The core honeypot is unchanged.
3. **Conditional loading**: Phase 2 loads only if `PHASE_2_ENABLED=true`.
4. **Separate UI**: The voice UI is independent of the text UI.
5. **Same state management**: Phase 2 reuses Redis and PostgreSQL.
6. **Performance target**: <5s for the full voice loop.
7. **Security first**: Audio files are temporary and validated; requests are rate-limited.

---

**For detailed implementation steps, see:** `PHASE_2_VOICE_IMPLEMENTATION_PLAN.md`

**For quick start guide, see:** `PHASE_2_README.md`

**For progress tracking, see:** `PHASE_2_CHECKLIST.md`