File size: 8,827 Bytes

25d0747

# EMOTIA Architecture

## System Overview

EMOTIA is a multi-modal AI system that analyzes video calls to infer emotional state, conversational intent, engagement, and confidence using facial expressions, vocal tone, spoken language, and temporal context.

## Architecture Diagram

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Video Input   │    │  Audio Input    │    │   Text Input    │
│   (25-30 FPS)   │    │  (16kHz WAV)    │    │  (ASR Trans.)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Vision Branch   │    │ Audio Branch    │    │ Text Branch     │
│ • ViT-Base      │    │ • CNN + Trans.  │    │ • BERT Encoder  │
│ • Face Detect   │    │ • Wav2Vec2      │    │ • Intent Detect │
│ • Emotion Class │    │ • Prosody       │    │ • Sentiment     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  ▼
                   ┌─────────────────────────────┐
                   │   Cross-Modal Fusion        │
                   │ • Attention Mechanism       │
                   │ • Dynamic Weighting         │
                   │ • Temporal Transformer      │
                   │ • Modality Contributions    │
                   └─────────────────────────────┘
                                  │
                                  ▼
                   ┌─────────────────────────────┐
                   │   Multi-Task Outputs        │
                   │ • Emotion Classification    │
                   │ • Intent Classification     │
                   │ • Engagement Regression     │
                   │ • Confidence Estimation     │
                   └─────────────────────────────┘
```

## Component Details

### Vision Branch
- **Input**: RGB video frames (224x224)
- **Face Detection**: OpenCV Haar cascades
- **Feature Extraction**: Vision Transformer (ViT-Base)
- **Fine-tuning**: FER-2013, AffectNet, RAF-DB datasets
- **Output**: Emotion logits (7 classes), confidence score

### Audio Branch
- **Input**: Audio waveforms (16kHz, 3-second windows)
- **Preprocessing**: Mel-spectrogram extraction
- **Feature Extraction**: Wav2Vec2 + CNN layers
- **Prosody Analysis**: Pitch, rhythm, energy features
- **Output**: Emotion logits, stress/confidence score

### Text Branch
- **Input**: Transcribed speech text
- **Preprocessing**: Tokenization, cleaning
- **Feature Extraction**: BERT-base for intent/sentiment
- **Intent Detection**: Hesitation phrases, confidence markers
- **Output**: Intent logits (5 classes), sentiment logits

### Fusion Network
- **Modality Projection**: Linear layers to common embedding space (256D)
- **Cross-Attention**: Multi-head attention between modalities
- **Temporal Modeling**: Transformer encoder for sequence processing
- **Dynamic Weighting**: Learned modality importance scores
- **Outputs**: Fused predictions with contribution weights

## Data Flow

1. **Input Processing**: Video frames, audio chunks, ASR text
2. **Sliding Windows**: 5-10 second temporal windows
3. **Feature Extraction**: Parallel processing per modality
4. **Fusion**: Cross-modal attention and temporal aggregation
5. **Prediction**: Multi-task classification/regression
6. **Explainability**: Modality contribution scores

## Deployment Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Client Application                       │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                WebRTC Video Stream                 │    │
│  │  • Camera Access                                  │    │
│  │  • Audio Capture                                  │    │
│  │  • Real-time Streaming                             │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Backend                         │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Inference Pipeline                    │    │
│  │  • Model Loading                                  │    │
│  │  • Preprocessing                                  │    │
│  │  • GPU Inference                                  │    │
│  │  • Post-processing                                │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Real-time Processing                  │    │
│  │  • Sliding Window Buffering                       │    │
│  │  • Asynchronous Processing                        │    │
│  │  • Streaming Responses                            │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                    Response Formatting                      │
│  • JSON API Responses                                     │
│  • Real-time WebSocket Updates                            │
│  • Batch Processing for Post-call Analysis                │
└─────────────────────────────────────────────────────────────┘
```

## Performance Requirements

- **Latency**: <200ms end-to-end
- **Throughput**: 25-30 FPS video processing
- **Accuracy**: F1 > 0.80 for emotion classification
- **Scalability**: Horizontal scaling with load balancer
- **Reliability**: 99.9% uptime, graceful degradation

## Security Considerations

- **Data Privacy**: No biometric storage by default
- **Encryption**: TLS 1.3 for all communications
- **Access Control**: API key authentication
- **Audit Logging**: All inference requests logged
- **Compliance**: GDPR, CCPA compliance features