# EMOTIA Product Requirements Document
## 1. Product Overview
### Problem
Video calls strip away many of the human signals available in person. Recruiters, educators, sales teams, and therapists lack objective insight into:
- Emotional state
- Engagement
- Confidence
- Intent (confusion, agreement, hesitation)
Manual observation is subjective, inconsistent, and does not scale.
### Solution
A real-time multi-modal AI system that analyzes:
- Facial expressions (video)
- Vocal tone (audio)
- Spoken language (text)
- Temporal behavior (change over time)
…and produces interpretable, probabilistic insights with built-in ethics controls.
### Target Users
- Recruiters & hiring platforms
- EdTech platforms
- Sales & customer success teams
- Remote therapy & coaching platforms
- Product teams analyzing user calls
## 2. Core Features
### 2.1 Live Video Call Analysis
- Real-time emotion detection
- Engagement tracking
- Confidence & stress indicators
- Timeline-based emotion shifts
### 2.2 Post-Call Analytics Dashboard
- Emotion timeline
- Intent heatmap
- Modality influence breakdown
- Key moments (confusion spikes, stress peaks)
### 2.3 Multi-Modal Explainability
Explains why each prediction was made:
- Face vs. voice vs. text contribution
- Visual overlays (heatmaps)
- Confidence intervals (not hard labels)
### 2.4 Ethics & Bias Controls
- Bias evaluation toggle
- Per-modality opt-out
- Clear disclaimers (non-diagnostic, assistive AI)
## 3. UI / UX Vision
### 3.1 UI Style
- Dark mode only
- Glassmorphism cards
- Neon accent colors (cyan / violet / lime)
- Smooth micro-animations
- Real-time waveform + emotion graphs
### 3.2 Main Dashboard
#### Left Panel
- Live video feed
- Face bounding box
- Micro-expression indicators
#### Center
- Emotion timeline (animated)
- Engagement meter (0–100)
- Confidence score
#### Right Panel
- Intent probabilities
- Stress indicators
- Modality contribution bars
### 3.3 Post-Call Report UI
- Scrollable emotion timeline
- Clickable "critical moments"
- Modality dominance chart
- Exportable report (PDF)
### 3.4 UI Components (Must-Have)
- Animated confidence rings
- Temporal scrubber
- Heatmap overlays
- Tooltips explaining AI decisions
## 4. Technical Architecture
### 4.1 Input Pipeline
- Webcam video (25–30 FPS)
- Microphone audio
- Real-time ASR
- Sliding temporal windows (5–10 s; buffering sketched below)
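To make the windowing behavior concrete, here is a minimal sketch of the buffering step, assuming 30 FPS video; the `SlidingWindow` class name and the 10 s / 2 s window and stride values are illustrative choices, not part of the spec.

```python
from collections import deque

class SlidingWindow:
    """Buffers per-frame features and emits overlapping temporal windows."""

    def __init__(self, window_sec=10.0, stride_sec=2.0, fps=30):
        self.window = int(window_sec * fps)   # frames per emitted window
        self.stride = int(stride_sec * fps)   # frames between window starts
        self.buf = deque(maxlen=self.window)
        self._since_emit = 0

    def push(self, frame_features):
        """Add one frame's features; return a full window when the stride elapses."""
        self.buf.append(frame_features)
        self._since_emit += 1
        if len(self.buf) == self.window and self._since_emit >= self.stride:
            self._since_emit = 0
            return list(self.buf)             # snapshot handed to the models
        return None
```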
### 4.2 Model Architecture (Production-Grade)
#### 🔹 Visual Branch
- Vision Transformer (ViT) fine-tuned for facial expressions
- Face detection + alignment
- Temporal pooling (branch sketched below)
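A minimal sketch of the visual branch, assuming a torchvision ViT-B/16 backbone; face detection and alignment are assumed to happen upstream and are not shown, and all dimensions are illustrative.

```python
import torch.nn as nn
from torchvision.models import vit_b_16

class VisualBranch(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        vit = vit_b_16(weights="IMAGENET1K_V1")  # to be fine-tuned on face emotion data
        vit.heads = nn.Identity()                # keep the 768-d CLS embedding
        self.backbone = vit
        self.proj = nn.Linear(768, embed_dim)

    def forward(self, faces):                    # faces: (B, T, 3, 224, 224) aligned crops
        b, t = faces.shape[:2]
        feats = self.backbone(faces.flatten(0, 1))  # (B*T, 768) per-frame features
        feats = self.proj(feats).view(b, t, -1)     # (B, T, D)
        return feats.mean(dim=1)                    # temporal pooling -> (B, D)
```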
#### 🔹 Audio Branch
- Audio → Mel-spectrogram
- CNN + Transformer
- Prosody, pitch, and rhythm modeling (branch sketched below)
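A minimal sketch of the audio branch, assuming 16 kHz mono input and torchaudio for the mel-spectrogram; layer sizes are illustrative. The CNN captures local spectro-temporal patterns (pitch, energy), and the transformer models prosody and rhythm over longer spans.

```python
import torch.nn as nn
import torchaudio

class AudioBranch(nn.Module):
    def __init__(self, embed_dim=256, n_mels=64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_mels=n_mels)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, wav):                      # wav: (B, samples) at 16 kHz
        m = self.melspec(wav).unsqueeze(1)       # (B, 1, n_mels, T)
        z = self.cnn(m)                          # (B, 64, n_mels/4, T/4)
        z = z.permute(0, 3, 1, 2).flatten(2)     # (B, T/4, 64 * n_mels/4)
        z = self.proj(z)                         # (B, T', D)
        return self.transformer(z)               # prosody/rhythm over time
```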
#### 🔹 Text Branch
- Transformer-based language model
- Fine-tuned for intent & sentiment
- Confidence / hesitation phrase detection (branch sketched below)
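A minimal sketch of the text branch, assuming a Hugging Face encoder; `distilbert-base-uncased` is an illustrative choice, not a requirement, and hesitation cues (fillers, self-corrections) would be learned via fine-tuning labels rather than hard-coded rules.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextBranch(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased", embed_dim=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)  # fine-tune for intent/sentiment
        self.proj = nn.Linear(self.encoder.config.hidden_size, embed_dim)

    def forward(self, utterances):               # list of ASR transcript strings
        toks = self.tokenizer(utterances, padding=True, truncation=True,
                              return_tensors="pt")
        hidden = self.encoder(**toks).last_hidden_state  # (B, L, H) token features
        return self.proj(hidden)                 # (B, L, D)
```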
#### 🔹 Fusion Network (KEY DIFFERENTIATOR)
- Cross-modal attention
- Dynamic modality weighting
- Temporal transformer for sequence learning (sketched below)
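A minimal sketch of the fusion stage: each modality attends to the other two, a learned softmax gate produces the dynamic per-modality weights, and a temporal transformer then models the sequence of fused window embeddings. Sharing one attention module across all three directions and the exact gating scheme are simplifying assumptions.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim * 3, 3), nn.Softmax(dim=-1))
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def fuse_window(self, vis, aud, txt):        # each: (B, T_m, D) for one window
        streams = []
        for q, kv in ((vis, (aud, txt)), (aud, (vis, txt)), (txt, (vis, aud))):
            ctx = torch.cat(kv, dim=1)           # the other two modalities
            attended, _ = self.cross_attn(q, ctx, ctx)
            streams.append(attended.mean(dim=1)) # (B, D) per modality
        w = self.gate(torch.cat(streams, dim=-1))      # (B, 3) dynamic weights
        stacked = torch.stack(streams, dim=1)          # (B, 3, D)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)  # (B, D) fused embedding

    def forward(self, window_embeddings):        # (B, W, D) fused windows over time
        return self.temporal(window_embeddings)  # sequence learning across windows
```

The softmax gate is what lets the system down-weight an occluded face or noisy audio on a per-window basis; the same weights can feed the modality-contribution bars in the UI.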
#### 🔹 Output Heads
- Emotion classification
- Intent classification
- Engagement regression
- Confidence regression (all four heads sketched below)
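A minimal sketch of the multi-task heads over the fused embedding; the class counts (7 emotions, 5 intents) are illustrative assumptions.

```python
import torch.nn as nn

class OutputHeads(nn.Module):
    def __init__(self, dim=256, n_emotions=7, n_intents=5):
        super().__init__()
        self.emotion = nn.Linear(dim, n_emotions)    # classification logits
        self.intent = nn.Linear(dim, n_intents)      # classification logits
        self.engagement = nn.Linear(dim, 1)          # regression (0-100 after scaling)
        self.confidence = nn.Linear(dim, 1)          # regression

    def forward(self, fused):                        # fused: (B, D)
        return {
            "emotion_logits": self.emotion(fused),
            "intent_logits": self.intent(fused),
            "engagement": self.engagement(fused).squeeze(-1),
            "confidence": self.confidence(fused).squeeze(-1),
        }
```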
## 5. Models to Use (Strong + Realistic)
### Visual
- ViT-Base / EfficientNet
- Pretrained on face emotion datasets
### Audio
- Wav2Vec-style embeddings
- CNN-Transformer hybrid
### Text
- Transformer encoder (fine-tuned)
- Focus on conversational intent
### Fusion
- Custom attention-based multi-head network
- (the project's original contribution)
## 6. Datasets (CV-Worthy)
### Facial Emotion
- FER-2013
- AffectNet
- RAF-DB
### Audio Emotion
- RAVDESS
- CREMA-D
### Speech + Intent
- IEMOCAP
- MELD (multi-party dialogue)
### Strategy
- Pretrain each modality separately
- Fine-tune jointly
- Align timestamps across modalities (alignment sketched below)
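A minimal sketch of the alignment step: bucket each modality's timestamped features onto a shared window grid, then join on the window index so that face, voice, and transcript features describe the same time span. The function name and 5 s grid are illustrative assumptions.

```python
def align_to_windows(events, window_sec=5.0):
    """events: iterable of (timestamp_sec, feature) -> {window_idx: [features]}."""
    windows = {}
    for t, feat in events:
        windows.setdefault(int(t // window_sec), []).append(feat)
    return windows

# Join the modalities on the shared window index:
# vis_w, aud_w, txt_w = map(align_to_windows, (vis_events, aud_events, txt_events))
# joint = {i: (vis_w[i], aud_w[i], txt_w[i])
#          for i in vis_w.keys() & aud_w.keys() & txt_w.keys()}
```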
## 7. Training & Evaluation
### Training
- Multi-task learning
- Weighted losses per output (loss sketched after this list)
- Curriculum learning (single-modal → multi-modal)
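A minimal sketch of the weighted multi-task loss over the output heads; the per-task weights are illustrative hyperparameters, not values fixed by this document.

```python
import torch.nn as nn

ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
WEIGHTS = {"emotion": 1.0, "intent": 1.0, "engagement": 0.5, "confidence": 0.5}

def multitask_loss(outputs, targets):
    """outputs: dict from OutputHeads; targets: matching ground-truth tensors."""
    return (WEIGHTS["emotion"]    * ce(outputs["emotion_logits"], targets["emotion"])
          + WEIGHTS["intent"]     * ce(outputs["intent_logits"], targets["intent"])
          + WEIGHTS["engagement"] * mse(outputs["engagement"], targets["engagement"])
          + WEIGHTS["confidence"] * mse(outputs["confidence"], targets["confidence"]))
```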
### Metrics
- F1-score per emotion class
- Concordance correlation coefficient (CCC) for the regression outputs (implementation sketched below)
- Confusion matrices
- Per-modality ablation
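The CCC used for the regression outputs follows Lin's (1989) definition; a minimal NumPy sketch:

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between predictions and targets."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)
```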
## 8. Deployment
### Backend
- FastAPI
- GPU inference support
- Streaming inference pipeline (endpoint sketched below)
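A minimal sketch of the streaming endpoint, assuming a FastAPI WebSocket that receives one feature window per message; `run_model` is a hypothetical stub standing in for the GPU inference pipeline.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def run_model(window: dict) -> dict:
    """Hypothetical stub for the real fusion-model inference call."""
    return {"emotion": "neutral", "engagement": 0.0, "confidence": 0.0}

@app.websocket("/ws/analyze")
async def analyze(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            window = await ws.receive_json()       # one 5-10 s feature window
            await ws.send_json(run_model(window))  # stream results back to the UI
    except WebSocketDisconnect:
        pass                                       # client ended the call
```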
### Frontend
- Next.js / React
- WebRTC video
- Web Audio API
- WebGL visualizations
### Infrastructure
- Dockerized services
- Modular microservices
- Model versioning
## 9. Non-Functional Requirements
- End-to-end real-time latency < 200 ms
- Modular model replacement
- Privacy-first design
- No biometric storage by default