EMOTIA Architecture
System Overview
EMOTIA is a multi-modal AI system that analyzes video calls to infer emotional state, conversational intent, engagement, and confidence using facial expressions, vocal tone, spoken language, and temporal context.
Architecture Diagram
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Video Input   │    │   Audio Input   │    │   Text Input    │
│   (25-30 FPS)   │    │   (16kHz WAV)   │    │  (ASR Trans.)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Vision Branch  │    │  Audio Branch   │    │  Text Branch    │
│ • ViT-Base      │    │ • CNN + Trans.  │    │ • BERT Encoder  │
│ • Face Detect   │    │ • Wav2Vec2      │    │ • Intent Detect │
│ • Emotion Class │    │ • Prosody       │    │ • Sentiment     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                ▼
                 ┌─────────────────────────────┐
                 │     Cross-Modal Fusion      │
                 │  • Attention Mechanism      │
                 │  • Dynamic Weighting        │
                 │  • Temporal Transformer     │
                 │  • Modality Contributions   │
                 └─────────────────────────────┘
                                │
                                ▼
                 ┌─────────────────────────────┐
                 │     Multi-Task Outputs      │
                 │  • Emotion Classification   │
                 │  • Intent Classification    │
                 │  • Engagement Regression    │
                 │  • Confidence Estimation    │
                 └─────────────────────────────┘
Component Details
Vision Branch
- Input: RGB video frames (224x224)
- Face Detection: OpenCV Haar cascades
- Feature Extraction: Vision Transformer (ViT-Base)
- Fine-tuning: FER-2013, AffectNet, RAF-DB datasets
- Output: Emotion logits (7 classes), confidence score
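As a rough illustration of the branch output, the sketch below turns seven emotion logits into a label and a confidence score. The label order (the seven FER-2013 categories) and the choice of the top softmax probability as the confidence score are assumptions made for this example; the document does not specify either.

```python
import math

# Assumed label order: the seven FER-2013 categories. The class order
# actually used by EMOTIA is not stated in this document.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_emotion(logits):
    """Return (label, confidence), using the top probability as confidence."""
    if len(logits) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


# Made-up logits for one face crop.
label, confidence = decode_emotion([0.1, -1.2, 0.3, 2.5, 0.0, 0.4, 1.1])
print(label, round(confidence, 3))
```

A calibrated confidence score would likely need more than the raw softmax maximum (temperature scaling, for instance), so treat this only as a starting point.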
Audio Branch
- Input: Audio waveforms (16 kHz, 3-second windows)
- Preprocessing: Mel-spectrogram extraction (Wav2Vec2 itself takes the raw waveform)
- Feature Extraction: Wav2Vec2 + CNN layers
- Prosody Analysis: Pitch, rhythm, energy features
- Output: Emotion logits, stress/confidence score
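The windowing step can be sketched in a few lines. The sample rate and window length come from the list above; the 50% overlap (`HOP_SECONDS`) and the use of RMS amplitude as the energy feature are assumptions for illustration, and pitch and rhythm extraction are left out.

```python
import math

SAMPLE_RATE = 16_000    # Hz, from the branch description
WINDOW_SECONDS = 3.0    # from the branch description
HOP_SECONDS = 1.5       # assumption: 50% overlap between windows


def split_windows(samples, sample_rate=SAMPLE_RATE,
                  window_s=WINDOW_SECONDS, hop_s=HOP_SECONDS):
    """Yield fixed-length windows; a trailing partial window is dropped."""
    size = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def rms_energy(window):
    """Root-mean-square amplitude, one simple energy feature."""
    return math.sqrt(sum(x * x for x in window) / len(window))


# Six seconds of a 220 Hz sine tone as stand-in audio.
audio = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
         for n in range(6 * SAMPLE_RATE)]
windows = list(split_windows(audio))
energies = [rms_energy(w) for w in windows]
print(len(windows), [round(e, 3) for e in energies])
```

Each window holds 48,000 samples (3 s at 16 kHz), which is the unit the feature extractors would receive.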
Text Branch
- Input: Transcribed speech text
- Preprocessing: Tokenization, cleaning
- Feature Extraction: BERT-base for intent/sentiment
- Intent Detection: Hesitation phrases, confidence markers
- Output: Intent logits (5 classes), sentiment logits
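To make the hesitation and confidence cues concrete, here is a small lexical sketch that counts marker phrases in a transcript. It is not the BERT-based classifier described above; the phrase lists and the `marker_features` helper are hypothetical and exist only to show the kind of signal involved.

```python
import re

# Hypothetical marker lists, for illustration only.
HESITATION = ("um", "uh", "i guess", "maybe", "not sure", "kind of")
CONFIDENT = ("definitely", "certainly", "i am sure", "absolutely")


def count_markers(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
        for p in phrases
    )


def marker_features(text):
    """Return simple counts that could accompany the encoder features."""
    return {
        "hesitation": count_markers(text, HESITATION),
        "confidence": count_markers(text, CONFIDENT),
    }


features = marker_features("Um, I guess we could, maybe. I'm not sure.")
print(features)
```

Counts like these could serve as auxiliary features or as a sanity check on the learned intent head, but that use is a suggestion rather than something the document states.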
Fusion Network
- Modality Projection: Linear layers to common embedding space (256D)
- Cross-Attention: Multi-head attention between modalities
- Temporal Modeling: Transformer encoder for sequence processing
- Dynamic Weighting: Learned modality importance scores
- Outputs: Fused predictions with contribution weights
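The dynamic weighting step can be pictured as a softmax over per-modality importance scores followed by a weighted sum of the 256-D embeddings. The sketch below shows only that step in plain Python, under the assumption that weights are normalized with a softmax; cross-attention and the temporal transformer are omitted, and the embeddings and scores are random or made-up stand-ins.

```python
import math
import random

EMBED_DIM = 256  # common embedding size from the fusion description


def softmax(scores):
    """Normalize importance scores so they sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, importance):
    """Weighted sum of per-modality embeddings.

    embeddings: dict of modality name -> list of EMBED_DIM floats
    importance: dict of modality name -> unnormalized importance score
    Returns (fused_vector, contribution_weights).
    """
    names = sorted(embeddings)
    weights = dict(zip(names, softmax([importance[n] for n in names])))
    fused = [
        sum(weights[n] * embeddings[n][i] for n in names)
        for i in range(EMBED_DIM)
    ]
    return fused, weights


rng = random.Random(0)
embeddings = {m: [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
              for m in ("vision", "audio", "text")}
# Stand-in scores; a trained model would predict these per window.
importance = {"vision": 1.2, "audio": 0.3, "text": -0.5}
fused, weights = fuse(embeddings, importance)
print({name: round(w, 3) for name, w in weights.items()})
```

Because the weights sum to 1, they can be reported directly as the per-modality contribution scores mentioned in the outputs.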
Data Flow
- Input Processing: Video frames, audio chunks, ASR text
- Sliding Windows: 5-10 second temporal windows
- Feature Extraction: Parallel processing per modality
- Fusion: Cross-modal attention and temporal aggregation
- Prediction: Multi-task classification/regression
- Explainability: Modality contribution scores
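One way to picture the sliding-window step is a buffer that keeps only the most recent few seconds of timestamped items. The `SlidingWindowBuffer` class below is a hypothetical helper, not part of the documented code base; it uses integer millisecond timestamps and a 5-second window, the lower end of the range given above.

```python
from collections import deque


class SlidingWindowBuffer:
    """Keep timestamped items from the most recent `window_ms` milliseconds."""

    def __init__(self, window_ms=5000):
        self.window_ms = window_ms
        self._items = deque()  # (timestamp_ms, payload) pairs, oldest first

    def push(self, timestamp_ms, payload):
        self._items.append((timestamp_ms, payload))
        # Evict items older than the window, measured from the newest item.
        while timestamp_ms - self._items[0][0] > self.window_ms:
            self._items.popleft()

    def is_full(self):
        """True once the buffered span covers the whole window."""
        if len(self._items) < 2:
            return False
        return self._items[-1][0] - self._items[0][0] >= self.window_ms

    def snapshot(self):
        """Payloads currently in the window, oldest first."""
        return [payload for _, payload in self._items]


buf = SlidingWindowBuffer(window_ms=5000)
for i in range(8 * 25):            # 8 seconds of frames at 25 FPS
    buf.push(i * 40, f"frame-{i}")  # 40 ms between frames
print(len(buf.snapshot()), buf.snapshot()[0], buf.is_full())
```

In a real pipeline each modality would probably have its own buffer, with windows aligned by timestamp before fusion.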
Deployment Architecture
┌───────────────────────────────────────────────────────────┐
│                    Client Application                     │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                 WebRTC Video Stream                 │  │
│  │  • Camera Access                                    │  │
│  │  • Audio Capture                                    │  │
│  │  • Real-time Streaming                              │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                      FastAPI Backend                      │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                 Inference Pipeline                  │  │
│  │  • Model Loading                                    │  │
│  │  • Preprocessing                                    │  │
│  │  • GPU Inference                                    │  │
│  │  • Post-processing                                  │  │
│  └─────────────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                Real-time Processing                 │  │
│  │  • Sliding Window Buffering                         │  │
│  │  • Asynchronous Processing                          │  │
│  │  • Streaming Responses                              │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                    Response Formatting                    │
│  • JSON API Responses                                     │
│  • Real-time WebSocket Updates                            │
│  • Batch Processing for Post-call Analysis                │
└───────────────────────────────────────────────────────────┘
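For the response-formatting stage, a per-window result might be serialized along the following lines. Every field name in `WindowResult` is an assumption made for this sketch; the document does not define the actual response schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class WindowResult:
    # All field names are illustrative, not the documented schema.
    window_start_s: float
    window_end_s: float
    emotion: str
    intent: str
    engagement: float             # regression output
    confidence: float
    modality_contributions: dict  # modality name -> weight


def to_message(result: WindowResult) -> str:
    """Serialize one result as a compact JSON string."""
    return json.dumps(asdict(result), separators=(",", ":"))


message = to_message(WindowResult(
    window_start_s=10.0, window_end_s=15.0,
    emotion="neutral", intent="question",
    engagement=0.72, confidence=0.81,
    modality_contributions={"vision": 0.5, "audio": 0.3, "text": 0.2},
))
print(message)
```

The same string could be returned from a JSON endpoint or sent as a WebSocket text frame, so one serializer can serve both the real-time and the post-call paths.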
Performance Requirements
- Latency: < 200 ms end-to-end
- Throughput: 25-30 FPS video processing
- Accuracy: F1 > 0.80 for emotion classification
- Scalability: Horizontal scaling with load balancer
- Reliability: 99.9% uptime, graceful degradation
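The latency, throughput, and uptime targets translate into a few concrete numbers. The sketch below only does that arithmetic; the helper names are made up for this example, and the 30-day period used for the downtime figure is an assumption.

```python
import math

LATENCY_BUDGET_MS = 200    # from the requirements list
UPTIME_TARGET_PCT = 99.9   # from the requirements list


def frame_interval_ms(fps):
    """Time between consecutive frames at a given frame rate."""
    return 1000.0 / fps


def frames_per_latency_budget(fps, budget_ms=LATENCY_BUDGET_MS):
    """Frames that arrive during one latency budget."""
    return math.ceil(budget_ms * fps / 1000)


def allowed_downtime_minutes(uptime_pct=UPTIME_TARGET_PCT, days=30):
    """Downtime permitted over `days` days at the given uptime target."""
    return days * 24 * 60 * (100 - uptime_pct) / 100


for fps in (25, 30):
    print(fps, "FPS:", round(frame_interval_ms(fps), 1), "ms per frame,",
          frames_per_latency_budget(fps), "frames per latency budget")
print(round(allowed_downtime_minutes(), 1), "minutes of downtime per 30 days")
```

At 30 FPS roughly six frames arrive within one 200 ms budget, which suggests that inference has to be pipelined or batched rather than run strictly one frame at a time.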
Security Considerations
- Data Privacy: No biometric storage by default
- Encryption: TLS 1.3 for all communications
- Access Control: API key authentication
- Audit Logging: All inference requests logged
- Compliance: Features supporting GDPR and CCPA requirements
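API key authentication can be sketched with the standard library alone. The in-memory `KEY_DIGESTS` store, the client id, and the example key are all hypothetical; the point is to store only a digest of each key and to compare digests in constant time with `hmac.compare_digest`.

```python
import hashlib
import hmac


def hash_key(api_key: str) -> str:
    """SHA-256 digest of a key, so raw keys need not be stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical store: client id -> digest of the issued key.
KEY_DIGESTS = {"demo-client": hash_key("example-key-not-for-production")}


def is_authorized(client_id: str, presented_key: str) -> bool:
    """Check a presented key without leaking timing information."""
    expected = KEY_DIGESTS.get(client_id)
    if expected is None:
        return False
    return hmac.compare_digest(expected, hash_key(presented_key))


print(is_authorized("demo-client", "example-key-not-for-production"))
print(is_authorized("demo-client", "wrong-key"))
```

In the FastAPI backend a check like this would typically sit in a dependency or middleware that runs before any inference request is accepted and logged.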