EMOTIA Product Requirements Document

1. Product Overview

Problem

Video calls strip out many of the nonverbal signals people rely on in person. Recruiters, educators, sales teams, and therapists lack objective insight into:

Emotional state
Engagement
Confidence
Intent (confusion, agreement, hesitation)

Manual observation is subjective, inconsistent, and does not scale.

Solution

A real-time multi-modal AI system that analyzes:

Facial expressions (video)
Vocal tone (audio)
Spoken language (text)
Temporal behavior (over time)

…and produces interpretable, ethical, probabilistic insights.

Target Users

Recruiters & hiring platforms
EdTech platforms
Sales & customer success teams
Remote therapy & coaching platforms
Product teams analyzing user calls

2. Core Features

2.1 Live Video Call Analysis

Real-time emotion detection
Engagement tracking
Confidence & stress indicators
Timeline-based emotion shifts

2.2 Post-Call Analytics Dashboard

Emotion timeline
Intent heatmap
Modality influence breakdown
Key moments (confusion spikes, stress peaks)

2.3 Multi-Modal Explainability

Why a prediction was made:

Face vs voice vs text contribution
Visual overlays (heatmaps)
Confidence intervals (not hard labels)
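
One way to report an interval instead of a hard label is to summarize the per-frame probabilities of a class over a temporal window. The sketch below is an assumption, not part of the specified design: the function name `probability_interval` and the normal-approximation interval (z = 1.96) are illustrative, and because consecutive frames are correlated the interval is likely to be narrower than a properly calibrated one.

```python
import math
from statistics import mean, stdev


def probability_interval(frame_probs, z=1.96):
    """Return (mean, low, high) for one class over a window of per-frame probabilities.

    Hypothetical helper: uses a normal approximation on the window mean and
    clips the result to the valid probability range [0, 1].
    """
    n = len(frame_probs)
    if n == 0:
        raise ValueError("need at least one frame probability")
    m = mean(frame_probs)
    if n == 1:
        # A single frame gives no spread estimate, so report the widest interval.
        return m, 0.0, 1.0
    half_width = z * stdev(frame_probs) / math.sqrt(n)
    return m, max(0.0, m - half_width), min(1.0, m + half_width)


if __name__ == "__main__":
    # Per-frame probabilities for "confusion" inside one window (made-up values).
    print(probability_interval([0.6, 0.7, 0.8]))
```

The UI could then render the mean as the ring fill and the low/high bounds as its uncertainty band.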

2.4 Ethics & Bias Controls

Bias evaluation toggle
Per-modality opt-out
Clear disclaimers (non-diagnostic, assistive AI)

3. UI / UX Vision

3.1 UI Style

Dark mode only
Glassmorphism cards
Neon accent colors (cyan / violet / lime)
Smooth micro-animations
Real-time waveform + emotion graphs

3.2 Main Dashboard

Left Panel

Live video feed
Face bounding box
Micro-expression indicators

Center

Emotion timeline (animated)
Engagement meter (0–100)
Confidence score

Right Panel

Intent probabilities
Stress indicators
Modality contribution bars

3.3 Post-Call Report UI

Scrollable emotion timeline
Clickable "critical moments"
Modality dominance chart
Exportable report (PDF)

3.4 UI Components (Must-Have)

Animated confidence rings
Temporal scrubber
Heatmap overlays
Tooltips explaining AI decisions

4. Technical Architecture

4.1 Input Pipeline

Webcam video (25–30 FPS)
Microphone audio
Real-time ASR
Sliding temporal windows (5–10 sec)
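
The sliding windows above can be sketched as frame-index ranges. The 30 FPS rate and 5-second window come from the ranges listed; the 1-second hop and the name `window_bounds` are assumptions made for illustration.

```python
def window_bounds(num_frames, fps=30, window_s=5.0, hop_s=1.0):
    """Return [start, end) frame-index pairs for overlapping temporal windows.

    hop_s (the stride between window starts) is an assumed parameter;
    the requirements only specify the window length.
    """
    win = int(round(window_s * fps))
    hop = int(round(hop_s * fps))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one frame")
    # Only full windows are emitted; a trailing partial window is dropped.
    return [(s, s + win) for s in range(0, num_frames - win + 1, hop)]


if __name__ == "__main__":
    # 10 seconds of video at 30 FPS yields six 5-second windows.
    for start, end in window_bounds(300):
        print(start, end)
```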

4.2 Model Architecture (Production-Grade)

🔹 Visual Branch

Vision Transformer (ViT) fine-tuned for facial expressions
Face detection + alignment
Temporal pooling

🔹 Audio Branch

Audio → Mel-spectrogram
CNN + Transformer
Prosody, pitch, rhythm modeling

🔹 Text Branch

Transformer-based language model
Fine-tuned for intent & sentiment
Confidence / hesitation phrase detection

🔹 Fusion Network (KEY DIFFERENTIATOR)

Cross-modal attention
Dynamic modality weighting
Temporal transformer for sequence learning
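
Dynamic modality weighting can be illustrated with a softmax gate over per-modality relevance scores. This is a simplified stand-in, not the cross-modal attention or temporal transformer named above: in a trained model the scores would come from a learned gating layer, whereas here they are passed in by hand. The function name `fuse` and the `enabled` mask (one possible way to honor a per-modality opt-out) are assumptions.

```python
import math


def fuse(embeddings, scores, enabled=None):
    """Combine per-modality embeddings using softmax weights over scores.

    embeddings: dict of modality name -> list of floats (equal lengths)
    scores:     dict of modality name -> relevance logit (assumed input)
    enabled:    optional set of modality names; anything else is masked out
    Returns (fused_vector, weights).
    """
    names = [n for n in embeddings if enabled is None or n in enabled]
    if not names:
        raise ValueError("at least one modality must be enabled")
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(scores[n] for n in names)
    exps = {n: math.exp(scores[n] - top) for n in names}
    total = sum(exps.values())
    weights = {n: exps[n] / total for n in names}
    dim = len(embeddings[names[0]])
    fused = [sum(weights[n] * embeddings[n][i] for n in names) for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    emb = {"face": [1.0, 0.0], "voice": [0.0, 1.0], "text": [1.0, 1.0]}
    logits = {"face": 2.0, "voice": 0.5, "text": 1.0}
    print(fuse(emb, logits))
    print(fuse(emb, logits, enabled={"voice", "text"}))  # face opted out
```

The returned weights sum to 1 over the enabled modalities, so they can be shown directly as modality contribution bars.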

🔹 Output Heads

Emotion classification
Intent classification
Engagement regression
Confidence regression

5. Models to Use (Strong + Realistic)

Visual

ViT-Base / EfficientNet
Pretrained on face emotion datasets

Audio

Wav2Vec-style embeddings
CNN-Transformer hybrid

Text

Transformer encoder (fine-tuned)
Focus on conversational intent

Fusion

Custom attention-based multi-head network
(the project's original contribution)

6. Datasets (CV-Worthy)

Facial Emotion

FER-2013
AffectNet
RAF-DB

Audio Emotion

RAVDESS
CREMA-D

Speech + Intent

IEMOCAP (dyadic dialogue)
MELD (multi-party dialogue)

Strategy

Pretrain each modality separately
Fine-tune jointly
Align timestamps across modalities
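
Timestamp alignment across modalities can be sketched as a nearest-neighbor match between two sorted timestamp streams, for example video frames against audio feature frames. The 50 ms tolerance and the name `align_nearest` are assumptions; the datasets above each have their own annotation formats that would need parsing first.

```python
from bisect import bisect_left


def align_nearest(ref_ts, other_ts, tolerance_s=0.05):
    """Match each reference timestamp to the nearest timestamp in other_ts.

    Both inputs are sorted lists of seconds. Returns a list of indices into
    other_ts, with None where nothing lies within tolerance_s.
    """
    matches = []
    for t in ref_ts:
        i = bisect_left(other_ts, t)
        # The nearest neighbor is either just before or at the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda j: abs(other_ts[j] - t))
        matches.append(best if abs(other_ts[best] - t) <= tolerance_s else None)
    return matches


if __name__ == "__main__":
    video_ts = [0.00, 0.04, 0.08]   # 25 FPS frames
    audio_ts = [0.00, 0.10]         # coarser audio feature frames
    print(align_nearest(video_ts, audio_ts))
```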

7. Training & Evaluation

Training

Multi-task learning
Weighted losses per output
Curriculum learning (single → multi-modal)
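
The weighted multi-task objective can be sketched for a single example as follows. The choice of cross-entropy for the two classification heads and squared error for the two regression heads is an assumption, as are the head names used as dictionary keys and the weight values; a real training loop would compute this per batch in a deep learning framework.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class, clamped to avoid log(0)."""
    return -math.log(max(probs[target_index], 1e-12))


def squared_error(pred, target):
    return (pred - target) ** 2


def multitask_loss(outputs, targets, weights):
    """Return (weighted_total, per_head_losses) for one example."""
    losses = {
        "emotion": cross_entropy(outputs["emotion"], targets["emotion"]),
        "intent": cross_entropy(outputs["intent"], targets["intent"]),
        "engagement": squared_error(outputs["engagement"], targets["engagement"]),
        "confidence": squared_error(outputs["confidence"], targets["confidence"]),
    }
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses


if __name__ == "__main__":
    outputs = {"emotion": [0.7, 0.2, 0.1], "intent": [0.5, 0.5],
               "engagement": 0.6, "confidence": 0.4}
    targets = {"emotion": 0, "intent": 1, "engagement": 0.8, "confidence": 0.5}
    weights = {"emotion": 1.0, "intent": 1.0, "engagement": 0.5, "confidence": 0.5}
    print(multitask_loss(outputs, targets, weights))
```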

Metrics

F1-score per emotion
Concordance correlation coefficient (CCC) for regression outputs
Confusion matrices
Per-modality ablation
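
Two of the metrics above are small enough to sketch directly. The concordance correlation coefficient follows its standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. The function names and the convention of returning 1.0 for two identical constant series are assumptions made for this illustration.

```python
from statistics import fmean


def concordance_cc(pred, true):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mp, mt = fmean(pred), fmean(true)
    var_p = fmean([(p - mp) ** 2 for p in pred])
    var_t = fmean([(t - mt) ** 2 for t in true])
    cov = fmean([(p - mp) * (t - mt) for p, t in zip(pred, true)])
    denom = var_p + var_t + (mp - mt) ** 2
    # denom is zero only when both series are the same constant.
    return 1.0 if denom == 0 else 2 * cov / denom


def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} computed one-vs-rest for each label."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    print(concordance_cc([2, 3, 4], [1, 2, 3]))
    print(f1_per_class(["happy", "happy", "sad"], ["happy", "sad", "sad"], ["happy", "sad"]))
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and targets, which matters for a 0–100 engagement score.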

8. Deployment

Backend

FastAPI
GPU inference support
Streaming inference pipeline

Frontend

Next.js / React
WebRTC video
Web Audio API
WebGL visualizations

Infrastructure

Dockerized services
Modular microservices
Model versioning

9. Non-Functional Requirements

Real-time latency < 200 ms
Modular model replacement
Privacy-first design
No biometric storage by default
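
The latency and privacy requirements can be captured as explicit defaults and a check. The 200 ms budget and the storage default come from the list above. Everything else is an assumption: the requirement does not say whether the budget applies to the mean, a percentile, or the worst case, so the sketch checks the 95th percentile, and the names `RuntimePolicy` and `meets_latency_budget` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimePolicy:
    latency_budget_ms: float = 200.0  # from the latency requirement
    store_biometrics: bool = False    # no biometric storage by default


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms, policy=RuntimePolicy(), q=95):
    """True if the q-th percentile latency is strictly under the budget."""
    return percentile(latencies_ms, q) < policy.latency_budget_ms


if __name__ == "__main__":
    measured = [120, 150, 180, 190, 250]  # made-up per-window latencies in ms
    print(meets_latency_budget(measured))
```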