# EMOTIA Product Requirements Document ## 1. Product Overview ### Problem Video calls remove many human signals. Recruiters, educators, sales teams, and therapists lack objective insights into: - Emotional state - Engagement - Confidence - Intent (confusion, agreement, hesitation) Manual observation is subjective, inconsistent, and non-scalable. ### Solution A real-time multi-modal AI system that analyzes: - Facial expressions (video) - Vocal tone (audio) - Spoken language (text) - Temporal behavior (over time) …and produces interpretable, ethical, probabilistic insights. ### Target Users - Recruiters & hiring platforms - EdTech platforms - Sales & customer success teams - Remote therapy & coaching platforms - Product teams analyzing user calls ## 2. Core Features ### 2.1 Live Video Call Analysis - Real-time emotion detection - Engagement tracking - Confidence & stress indicators - Timeline-based emotion shifts ### 2.2 Post-Call Analytics Dashboard - Emotion timeline - Intent heatmap - Modality influence breakdown - Key moments (confusion spikes, stress peaks) ### 2.3 Multi-Modal Explainability Why a prediction was made: - Face vs voice vs text contribution - Visual overlays (heatmaps) - Confidence intervals (not hard labels) ### 2.4 Ethics & Bias Controls - Bias evaluation toggle - Per-modality opt-out - Clear disclaimers (non-diagnostic, assistive AI) ## 3. UI / UX Vision ### 3.1 UI Style - Dark mode only - Glassmorphism cards - Neon accent colors (cyan / violet / lime) - Smooth micro-animations - Real-time waveform + emotion graphs ### 3.2 Main Dashboard #### Left Panel - Live video feed - Face bounding box - Micro-expression indicators #### Center - Emotion timeline (animated) - Engagement meter (0–100) - Confidence score #### Right Panel - Intent probabilities - Stress indicators - Modality contribution bars ### 3.3 Post-Call Report UI - Scrollable emotion timeline - Clickable "critical moments" - Modality dominance chart - Exportable report (PDF) ### 3.4 UI Components (Must-Have) - Animated confidence rings - Temporal scrubber - Heatmap overlays - Tooltips explaining AI decisions ## 4. Technical Architecture ### 4.1 Input Pipeline - Webcam video (25–30 FPS) - Microphone audio - Real-time ASR - Sliding temporal windows (5–10 sec) ### 4.2 Model Architecture (Production-Grade) #### πŸ”Ή Visual Branch - Vision Transformer (ViT) fine-tuned for facial expressions - Face detection + alignment - Temporal pooling #### πŸ”Ή Audio Branch - Audio β†’ Mel-spectrogram - CNN + Transformer - Prosody, pitch, rhythm modeling #### πŸ”Ή Text Branch - Transformer-based language model - Fine-tuned for intent & sentiment - Confidence / hesitation phrase detection #### πŸ”Ή Fusion Network (KEY DIFFERENTIATOR) - Cross-modal attention - Dynamic modality weighting - Temporal transformer for sequence learning #### πŸ”Ή Output Heads - Emotion classification - Intent classification - Engagement regression - Confidence regression ## 5. Models to Use (Strong + Realistic) ### Visual - ViT-Base / EfficientNet - Pretrained on face emotion datasets ### Audio - Wav2Vec-style embeddings - CNN-Transformer hybrid ### Text - Transformer encoder (fine-tuned) - Focus on conversational intent ### Fusion - Custom attention-based multi-head network - (this is your original contribution) ## 6. Datasets (CV-Worthy) ### Facial Emotion - FER-2013 - AffectNet - RAF-DB ### Audio Emotion - RAVDESS - CREMA-D ### Speech + Intent - IEMOCAP - MELD (multi-party dialogue) ### Strategy - Pretrain each modality separately - Fine-tune jointly - Align timestamps across modalities ## 7. Training & Evaluation ### Training - Multi-task learning - Weighted losses per output - Curriculum learning (single β†’ multi-modal) ### Metrics - F1-score per emotion - Concordance correlation (regression) - Confusion matrices - Per-modality ablation ## 8. Deployment ### Backend - FastAPI - GPU inference support - Streaming inference pipeline ### Frontend - Next.js / React - WebRTC video - Web Audio API - WebGL visualizations ### Infrastructure - Dockerized services - Modular microservices - Model versioning ## 9. Non-Functional Requirements - Real-time latency < 200ms - Modular model replacement - Privacy-first design - No biometric storage by default