EMOTIA Architecture

System Overview

EMOTIA is a multi-modal AI system that analyzes video calls to infer emotional state, conversational intent, engagement, and confidence using facial expressions, vocal tone, spoken language, and temporal context.

Architecture Diagram

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Video Input   │    │  Audio Input    │    │   Text Input    │
│   (25-30 FPS)   │    │  (16kHz WAV)    │    │  (ASR Trans.)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Vision Branch   │    │ Audio Branch    │    │ Text Branch     │
│ • ViT-Base      │    │ • CNN + Trans.  │    │ • BERT Encoder  │
│ • Face Detect   │    │ • Wav2Vec2      │    │ • Intent Detect │
│ • Emotion Class │    │ • Prosody       │    │ • Sentiment     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  ▼
                   ┌─────────────────────────────┐
                   │   Cross-Modal Fusion        │
                   │ • Attention Mechanism       │
                   │ • Dynamic Weighting         │
                   │ • Temporal Transformer      │
                   │ • Modality Contributions    │
                   └─────────────────────────────┘
                                  │
                                  ▼
                   ┌─────────────────────────────┐
                   │   Multi-Task Outputs        │
                   │ • Emotion Classification    │
                   │ • Intent Classification     │
                   │ • Engagement Regression     │
                   │ • Confidence Estimation     │
                   └─────────────────────────────┘

Component Details

Vision Branch

Input: RGB video frames (224x224)
Face Detection: OpenCV Haar cascades
Feature Extraction: Vision Transformer (ViT-Base)
Fine-tuning: FER-2013, AffectNet, RAF-DB datasets
Output: Emotion logits (7 classes), confidence score

Audio Branch

Input: Audio waveforms (16kHz, 3-second windows)
Preprocessing: Mel-spectrogram extraction
Feature Extraction: Wav2Vec2 + CNN layers
Prosody Analysis: Pitch, rhythm, energy features
Output: Emotion logits, stress/confidence score

Text Branch

Input: Transcribed speech text
Preprocessing: Tokenization, cleaning
Feature Extraction: BERT-base for intent/sentiment
Intent Detection: Hesitation phrases, confidence markers
Output: Intent logits (5 classes), sentiment logits

Fusion Network

Modality Projection: Linear layers to common embedding space (256D)
Cross-Attention: Multi-head attention between modalities
Temporal Modeling: Transformer encoder for sequence processing
Dynamic Weighting: Learned modality importance scores
Outputs: Fused predictions with contribution weights

Data Flow

Input Processing: Video frames, audio chunks, ASR text
Sliding Windows: 5-10 second temporal windows
Feature Extraction: Parallel processing per modality
Fusion: Cross-modal attention and temporal aggregation
Prediction: Multi-task classification/regression
Explainability: Modality contribution scores

Deployment Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Client Application                       │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                WebRTC Video Stream                 │    │
│  │  • Camera Access                                  │    │
│  │  • Audio Capture                                  │    │
│  │  • Real-time Streaming                             │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Backend                         │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Inference Pipeline                    │    │
│  │  • Model Loading                                  │    │
│  │  • Preprocessing                                  │    │
│  │  • GPU Inference                                  │    │
│  │  • Post-processing                                │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Real-time Processing                  │    │
│  │  • Sliding Window Buffering                       │    │
│  │  • Asynchronous Processing                        │    │
│  │  • Streaming Responses                            │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                    Response Formatting                      │
│  • JSON API Responses                                     │
│  • Real-time WebSocket Updates                            │
│  • Batch Processing for Post-call Analysis                │
└─────────────────────────────────────────────────────────────┘

Performance Requirements

Latency: <200ms end-to-end
Throughput: 25-30 FPS video processing
Accuracy: F1 > 0.80 for emotion classification
Scalability: Horizontal scaling with load balancer
Reliability: 99.9% uptime, graceful degradation

Security Considerations

Data Privacy: No biometric storage by default
Encryption: TLS 1.3 for all communications
Access Control: API key authentication
Audit Logging: All inference requests logged
Compliance: GDPR, CCPA compliance features