| # EMOTIA Architecture |
|
|
| ## System Overview |
|
|
| EMOTIA is a multi-modal AI system that analyzes video calls to infer emotional state, conversational intent, engagement, and confidence using facial expressions, vocal tone, spoken language, and temporal context. |
|
|
| ## Architecture Diagram |
|
|
| ``` |
| ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐ |
| │   Video Input   │     │   Audio Input   │     │   Text Input    │ |
| │   (25-30 FPS)   │     │   (16kHz WAV)   │     │  (ASR Trans.)   │ |
| └─────────────────┘     └─────────────────┘     └─────────────────┘ |
|          │                       │                       │ |
|          ▼                       ▼                       ▼ |
| ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐ |
| │  Vision Branch  │     │  Audio Branch   │     │  Text Branch    │ |
| │ • ViT-Base      │     │ • CNN + Trans.  │     │ • BERT Encoder  │ |
| │ • Face Detect   │     │ • Wav2Vec2      │     │ • Intent Detect │ |
| │ • Emotion Class │     │ • Prosody       │     │ • Sentiment     │ |
| └─────────────────┘     └─────────────────┘     └─────────────────┘ |
|          │                       │                       │ |
|          └───────────────────────┼───────────────────────┘ |
|                                  ▼ |
|                   ┌─────────────────────────────┐ |
|                   │     Cross-Modal Fusion      │ |
|                   │  • Attention Mechanism      │ |
|                   │  • Dynamic Weighting        │ |
|                   │  • Temporal Transformer     │ |
|                   │  • Modality Contributions   │ |
|                   └─────────────────────────────┘ |
|                                  │ |
|                                  ▼ |
|                   ┌─────────────────────────────┐ |
|                   │     Multi-Task Outputs      │ |
|                   │  • Emotion Classification   │ |
|                   │  • Intent Classification    │ |
|                   │  • Engagement Regression    │ |
|                   │  • Confidence Estimation    │ |
|                   └─────────────────────────────┘ |
| ``` |
|
|
| ## Component Details |
|
|
| ### Vision Branch |
| - **Input**: RGB video frames (224x224) |
| - **Face Detection**: OpenCV Haar cascades |
| - **Feature Extraction**: Vision Transformer (ViT-Base) |
| - **Fine-tuning**: FER-2013, AffectNet, RAF-DB datasets |
| - **Output**: Emotion logits (7 classes), confidence score |
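
To make the output stage concrete, here is a minimal sketch that turns seven emotion logits into a label and a confidence score. Two things are assumptions, not taken from this document: the label order (the common FER-2013 class list) and the use of the top softmax probability as the confidence score.

```python
import math

# Assumed label order (the common FER-2013 class list); the real model's
# index-to-label mapping may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_emotion(logits):
    """Return (label, confidence); confidence is the top softmax probability.

    Using the top probability as confidence is an assumption for illustration.
    """
    if len(logits) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]
```

A call such as `decode_emotion(model_logits)` would sit after the ViT forward pass; the model itself is out of scope for this sketch.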
|
|
| ### Audio Branch |
| - **Input**: Audio waveforms (16kHz, 3-second windows) |
| - **Preprocessing**: Mel-spectrogram extraction |
| - **Feature Extraction**: Wav2Vec2 + CNN layers |
| - **Prosody Analysis**: Pitch, rhythm, energy features |
| - **Output**: Emotion logits, stress/confidence score |
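
The windowing arithmetic for the audio input can be sketched as follows: at 16 kHz, a 3-second window holds 48,000 samples. The helper names, the default non-overlapping hop, and the RMS energy feature are illustrative assumptions; the actual pipeline may overlap windows or compute energy differently.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the input spec above
WINDOW_SECONDS = 3     # window length, from the input spec above
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 48,000 samples per window


def split_windows(samples, hop_samples=WINDOW_SAMPLES):
    """Yield full-length windows; a trailing partial window is dropped.

    hop_samples is a hypothetical parameter: the default gives
    non-overlapping windows, smaller values give overlapping ones.
    """
    last_start = len(samples) - WINDOW_SAMPLES
    for start in range(0, last_start + 1, hop_samples):
        yield samples[start:start + WINDOW_SAMPLES]


def rms_energy(window):
    """Root-mean-square energy, one simple stand-in for an energy feature."""
    return math.sqrt(sum(x * x for x in window) / len(window))
```

For example, a 7-second clip yields two non-overlapping windows, or five windows with a 1-second hop (`hop_samples=SAMPLE_RATE`).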
|
|
| ### Text Branch |
| - **Input**: Transcribed speech text |
| - **Preprocessing**: Tokenization, cleaning |
| - **Feature Extraction**: BERT-base for intent/sentiment |
| - **Intent Detection**: Hesitation phrases, confidence markers |
| - **Output**: Intent logits (5 classes), sentiment logits |
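
The hesitation and confidence markers could, for illustration, be counted with a simple rule-based pass. This is a stand-in sketch, not the BERT-based classifier described above, and both phrase lists are hypothetical examples.

```python
import re

# Hypothetical marker lists, for illustration only.
HESITATION = ("um", "uh", "i guess", "maybe", "not sure", "kind of")
CONFIDENCE = ("definitely", "absolutely", "i'm sure", "certainly")


def count_markers(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase."""
    lowered = text.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(phrase)}\b", lowered))
        for phrase in phrases
    )


def marker_features(text):
    """Return marker counts that could feed a downstream intent model."""
    return {
        "hesitation": count_markers(text, HESITATION),
        "confidence": count_markers(text, CONFIDENCE),
    }
```

Counts like these could be concatenated with the encoder features, though whether EMOTIA does so is not stated here.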
|
|
| ### Fusion Network |
| - **Modality Projection**: Linear layers to common embedding space (256D) |
| - **Cross-Attention**: Multi-head attention between modalities |
| - **Temporal Modeling**: Transformer encoder for sequence processing |
| - **Dynamic Weighting**: Learned modality importance scores |
| - **Outputs**: Fused predictions with contribution weights |
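
The dynamic weighting step can be sketched in plain Python: per-modality importance scores are normalized with a softmax, and the fused embedding is the weighted sum of the 256-D modality embeddings. This deliberately omits cross-attention and temporal modeling, and the function names and dict-based interface are assumptions. In a trained network the scores would come from a learned layer rather than being passed in.

```python
import math

EMBED_DIM = 256  # common embedding size stated above


def softmax(scores):
    """Softmax over a dict of modality -> raw score."""
    peak = max(scores.values())
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def fuse(embeddings, importance):
    """Return (fused_vector, contribution_weights).

    embeddings: dict modality -> list of EMBED_DIM floats
    importance: dict modality -> raw importance score (assumed to be
                produced elsewhere, e.g. by a learned gating layer)
    Only modalities present in `embeddings` take part, so a missing
    modality simply shifts weight onto the remaining ones.
    """
    weights = softmax({name: importance[name] for name in embeddings})
    fused = [0.0] * EMBED_DIM
    for name, vector in embeddings.items():
        if len(vector) != EMBED_DIM:
            raise ValueError(f"{name}: expected {EMBED_DIM} dims")
        weight = weights[name]
        for i, value in enumerate(vector):
            fused[i] += weight * value
    return fused, weights
```

The returned weights sum to 1 and can be reported directly as modality contribution scores.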
|
|
| ## Data Flow |
|
|
| 1. **Input Processing**: Video frames, audio chunks, ASR text |
| 2. **Sliding Windows**: 5-10 second temporal windows |
| 3. **Feature Extraction**: Parallel processing per modality |
| 4. **Fusion**: Cross-modal attention and temporal aggregation |
| 5. **Prediction**: Multi-task classification/regression |
| 6. **Explainability**: Modality contribution scores |
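
Step 2 can be illustrated with a small timestamp-based buffer that keeps only the most recent few seconds of input. The class name and its interface are hypothetical; the 5-second default is the lower end of the window range given above.

```python
from collections import deque


class SlidingWindow:
    """Keep only items whose timestamp lies within the last `span` seconds."""

    def __init__(self, span_seconds=5.0):
        self.span = span_seconds
        self.items = deque()  # (timestamp, item) pairs, oldest first

    def push(self, timestamp, item):
        """Add an item and evict everything older than the window span."""
        self.items.append((timestamp, item))
        while self.items and timestamp - self.items[0][0] > self.span:
            self.items.popleft()

    def snapshot(self):
        """Return the buffered items, oldest first, for feature extraction."""
        return [item for _, item in self.items]
```

One buffer per modality, snapshotted at a fixed interval, would be one way to feed steps 3 and 4.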
|
|
| ## Deployment Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────┐ |
| │                     Client Application                      │ |
| │   ┌─────────────────────────────────────────────────────┐   │ |
| │   │                 WebRTC Video Stream                 │   │ |
| │   │  • Camera Access                                    │   │ |
| │   │  • Audio Capture                                    │   │ |
| │   │  • Real-time Streaming                              │   │ |
| │   └─────────────────────────────────────────────────────┘   │ |
| └─────────────────────────────────────────────────────────────┘ |
|                                │ |
|                                ▼ |
| ┌─────────────────────────────────────────────────────────────┐ |
| │                       FastAPI Backend                       │ |
| │   ┌─────────────────────────────────────────────────────┐   │ |
| │   │                 Inference Pipeline                  │   │ |
| │   │  • Model Loading                                    │   │ |
| │   │  • Preprocessing                                    │   │ |
| │   │  • GPU Inference                                    │   │ |
| │   │  • Post-processing                                  │   │ |
| │   └─────────────────────────────────────────────────────┘   │ |
| │   ┌─────────────────────────────────────────────────────┐   │ |
| │   │                Real-time Processing                 │   │ |
| │   │  • Sliding Window Buffering                         │   │ |
| │   │  • Asynchronous Processing                          │   │ |
| │   │  • Streaming Responses                              │   │ |
| │   └─────────────────────────────────────────────────────┘   │ |
| └─────────────────────────────────────────────────────────────┘ |
|                                │ |
|                                ▼ |
| ┌─────────────────────────────────────────────────────────────┐ |
| │                     Response Formatting                     │ |
| │  • JSON API Responses                                       │ |
| │  • Real-time WebSocket Updates                              │ |
| │  • Batch Processing for Post-call Analysis                  │ |
| └─────────────────────────────────────────────────────────────┘ |
| ``` |
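
The asynchronous processing and streaming-response pattern in the backend can be sketched with the standard library alone. Everything here is a stand-in: `fake_inference` replaces the GPU inference step, the `send` callback replaces a WebSocket send, and the result fields are invented for illustration. A real deployment would wire the same loop into a FastAPI WebSocket handler.

```python
import asyncio
import json


async def fake_inference(window):
    """Placeholder for the GPU inference step; returns a dummy result."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return {"window_id": window["id"], "emotion": "neutral"}


async def worker(queue, send):
    """Consume windows until a None sentinel arrives, streaming JSON out."""
    while True:
        window = await queue.get()
        if window is None:
            break
        result = await fake_inference(window)
        await send(json.dumps(result))


async def run_demo(n_windows=3):
    """Push a few windows through the worker and collect what was sent."""
    queue = asyncio.Queue()
    sent = []

    async def send(message):  # stands in for a WebSocket send
        sent.append(message)

    task = asyncio.create_task(worker(queue, send))
    for i in range(n_windows):
        await queue.put({"id": i})
    await queue.put(None)  # sentinel: no more windows
    await task
    return sent
```

Running `asyncio.run(run_demo())` returns one JSON string per window, in arrival order.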
|
|
| ## Performance Requirements |
|
|
| - **Latency**: <200ms end-to-end |
| - **Throughput**: 25-30 FPS video processing |
| - **Accuracy**: F1 > 0.80 for emotion classification |
| - **Scalability**: Horizontal scaling with load balancer |
| - **Reliability**: 99.9% uptime, graceful degradation |
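
To relate the targets above: at 25-30 FPS a new frame arrives every 33-40 ms, so a 200 ms budget spans roughly five to six frame intervals. The helper below is a hypothetical sketch for checking a single call against that budget; it measures wall-clock time for one function call only, not true end-to-end latency including network transfer.

```python
import time

LATENCY_BUDGET_MS = 200.0  # end-to-end target from the list above


def frame_interval_ms(fps):
    """Time between consecutive frames at the given frame rate."""
    return 1000.0 / fps


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_ms, within_budget)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms <= LATENCY_BUDGET_MS
```

Wrapping each pipeline stage with `timed` is one simple way to see where the budget is being spent.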
|
|
| ## Security Considerations |
|
|
| - **Data Privacy**: No biometric storage by default |
| - **Encryption**: TLS 1.3 for all communications |
| - **Access Control**: API key authentication |
| - **Audit Logging**: All inference requests logged |
| - **Compliance**: GDPR, CCPA compliance features |
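
API key authentication could be checked along these lines. This is a minimal sketch under stated assumptions: the in-memory key store, client ID, and example key are hypothetical, and storing SHA-256 hashes instead of raw keys is a design choice made for this example, not something this document specifies.

```python
import hashlib
import hmac


def hash_key(api_key):
    """Hash a key so the raw value never needs to be stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical key store: client id -> SHA-256 hex digest of the issued key.
KEY_HASHES = {"demo-client": hash_key("example-key-123")}


def is_authorized(client_id, presented_key):
    """Return True only if the presented key matches the stored hash."""
    expected = KEY_HASHES.get(client_id)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking match position via timing.
    return hmac.compare_digest(expected, hash_key(presented_key))
```

In a FastAPI service this check would typically run in a dependency or middleware before the inference endpoint is reached.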