	# EMOTIA Architecture

	## System Overview

	EMOTIA is a multi-modal AI system that analyzes video calls to infer emotional state, conversational intent, engagement, and confidence using facial expressions, vocal tone, spoken language, and temporal context.

	## Architecture Diagram

	```
	┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
	│   Video Input   │      │   Audio Input   │      │   Text Input    │
	│   (25-30 FPS)   │      │   (16kHz WAV)   │      │  (ASR Trans.)   │
	└─────────────────┘      └─────────────────┘      └─────────────────┘
	         │                        │                        │
	         ▼                        ▼                        ▼
	┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
	│  Vision Branch  │      │  Audio Branch   │      │   Text Branch   │
	│ • ViT-Base      │      │ • CNN + Trans.  │      │ • BERT Encoder  │
	│ • Face Detect   │      │ • Wav2Vec2      │      │ • Intent Detect │
	│ • Emotion Class │      │ • Prosody       │      │ • Sentiment     │
	└─────────────────┘      └─────────────────┘      └─────────────────┘
	         │                        │                        │
	         └────────────────────────┼────────────────────────┘
	                                  ▼
	                   ┌─────────────────────────────┐
	                   │     Cross-Modal Fusion      │
	                   │  • Attention Mechanism      │
	                   │  • Dynamic Weighting        │
	                   │  • Temporal Transformer     │
	                   │  • Modality Contributions   │
	                   └─────────────────────────────┘
	                                  │
	                                  ▼
	                   ┌─────────────────────────────┐
	                   │     Multi-Task Outputs      │
	                   │  • Emotion Classification   │
	                   │  • Intent Classification    │
	                   │  • Engagement Regression    │
	                   │  • Confidence Estimation    │
	                   └─────────────────────────────┘
	```

	## Component Details

	### Vision Branch
	- Input: RGB video frames (224x224)
	- Face Detection: OpenCV Haar cascades
	- Feature Extraction: Vision Transformer (ViT-Base)
	- Fine-tuning: FER-2013, AffectNet, RAF-DB datasets
	- Output: Emotion logits (7 classes), confidence score
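
The vision output above (seven emotion logits plus a confidence score) can be illustrated with a small sketch. It assumes that confidence is the highest softmax probability and that the classes follow the seven FER-2013 categories in the order shown; this document confirms neither, so treat both as placeholders.

```python
import math

# Assumed label order (the seven FER-2013 categories); the real model may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_emotion(logits):
    """Return (label, confidence); confidence is the top class probability."""
    if len(logits) != len(EMOTIONS):
        raise ValueError("expected one logit per emotion class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


# Example logits are made up for illustration.
label, confidence = decode_emotion([0.1, -1.2, 0.3, 2.5, 0.0, 0.4, 1.1])
```
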

	### Audio Branch
	- Input: Audio waveforms (16kHz, 3-second windows)
	- Preprocessing: Mel-spectrogram extraction
	- Feature Extraction: Wav2Vec2 on the raw waveform, plus CNN layers
	- Prosody Analysis: Pitch, rhythm, energy features
	- Output: Emotion logits, stress/confidence score
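
To make the audio input spec concrete: a 3-second window at 16 kHz holds 48,000 samples. The sketch below splits a sample sequence into such windows. The 1.5-second hop (50% overlap) and the choice to drop a trailing partial window are assumptions for illustration, not documented behaviour.

```python
SAMPLE_RATE = 16_000   # Hz, from the input spec above
WINDOW_SECONDS = 3.0   # from the input spec above
HOP_SECONDS = 1.5      # assumption: 50% overlap between windows


def split_into_windows(samples, sample_rate=SAMPLE_RATE,
                       window_s=WINDOW_SECONDS, hop_s=HOP_SECONDS):
    """Return full-length windows only; a trailing partial window is dropped."""
    window = int(sample_rate * window_s)  # 48,000 samples at 16 kHz
    hop = int(sample_rate * hop_s)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, hop)
    ]


# Five seconds of silence as stand-in audio.
windows = split_into_windows([0.0] * (5 * SAMPLE_RATE))
```
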

	### Text Branch
	- Input: Transcribed speech text
	- Preprocessing: Tokenization, cleaning
	- Feature Extraction: BERT-base for intent/sentiment
	- Intent Detection: Hesitation phrases, confidence markers
	- Output: Intent logits (5 classes), sentiment logits
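
The hesitation phrases and confidence markers mentioned above could be surfaced as simple counts alongside the BERT features. The following is a rule-based stand-in, not the BERT-based model; the phrase lists and function names are hypothetical.

```python
import re

# Hypothetical marker lists, for illustration only.
HESITATION = ("um", "uh", "i guess", "maybe", "not sure", "kind of")
CONFIDENCE = ("definitely", "certainly", "absolutely", "i am sure")


def count_markers(text, phrases):
    """Count case-insensitive, whole-phrase occurrences of the given phrases."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )


def marker_features(text):
    """Return hesitation and confidence marker counts for one utterance."""
    return {
        "hesitation": count_markers(text, HESITATION),
        "confidence": count_markers(text, CONFIDENCE),
    }


features = marker_features(
    "Um, I guess we could, maybe. I'm not sure, but definitely soon."
)
```
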

	### Fusion Network
	- Modality Projection: Linear layers to common embedding space (256D)
	- Cross-Attention: Multi-head attention between modalities
	- Temporal Modeling: Transformer encoder for sequence processing
	- Dynamic Weighting: Learned modality importance scores
	- Outputs: Fused predictions with contribution weights
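
As a rough illustration of dynamic weighting, the sketch below turns per-modality importance scores into softmax weights, forms a weighted sum of the projected embeddings, and returns the weights as contribution scores. It deliberately omits cross-attention and the temporal transformer, uses 4-D toy vectors instead of 256-D, and takes the scores as given rather than learning them; all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, importance_scores):
    """Weighted sum of same-sized embeddings.

    embeddings: dict of modality name -> list of floats (common space)
    importance_scores: dict of modality name -> float (learned in practice)
    Returns (fused_vector, contribution_weights).
    """
    names = sorted(embeddings)
    weights = softmax([importance_scores[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        vector = embeddings[name]
        if len(vector) != dim:
            raise ValueError("all embeddings must share one dimension")
        for i, value in enumerate(vector):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


fused, contributions = fuse(
    {"vision": [1.0, 0.0, 0.0, 0.0],
     "audio": [0.0, 1.0, 0.0, 0.0],
     "text": [0.0, 0.0, 1.0, 0.0]},
    {"vision": 0.0, "audio": 0.0, "text": 0.0},
)
```

With equal scores each modality contributes one third; raising one score shifts weight toward that modality, which is what makes the contribution weights usable as an explanation signal.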

	## Data Flow

	1. Input Processing: Video frames, audio chunks, ASR text
	2. Sliding Windows: 5-10 second temporal windows
	3. Feature Extraction: Parallel processing per modality
	4. Fusion: Cross-modal attention and temporal aggregation
	5. Prediction: Multi-task classification/regression
	6. Explainability: Modality contribution scores
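
Step 2 can be sketched as a buffer that keeps only the most recent few seconds of timestamped items. The class name, the 5-second default (the low end of the range above), and the eviction rule are assumptions for illustration.

```python
from collections import deque


class SlidingWindowBuffer:
    """Keeps timestamped items from the most recent `window_s` seconds."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self._items = deque()

    def push(self, timestamp, item):
        """Append an item, then evict anything older than the window."""
        self._items.append((timestamp, item))
        while timestamp - self._items[0][0] > self.window_s:
            self._items.popleft()

    def snapshot(self):
        """Return the buffered items, oldest first."""
        return [item for _, item in self._items]

    def span(self):
        """Seconds covered by the buffered items."""
        if not self._items:
            return 0.0
        return self._items[-1][0] - self._items[0][0]


# One item per second for ten seconds; only the last five seconds remain.
buffer = SlidingWindowBuffer(window_s=5.0)
for t in range(10):
    buffer.push(float(t), f"frame-{t}")
```
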

	## Deployment Architecture

	```
	┌─────────────────────────────────────────────────────────────┐
	│                     Client Application                      │
	│   ┌─────────────────────────────────────────────────────┐   │
	│   │                 WebRTC Video Stream                 │   │
	│   │  • Camera Access                                    │   │
	│   │  • Audio Capture                                    │   │
	│   │  • Real-time Streaming                              │   │
	│   └─────────────────────────────────────────────────────┘   │
	└─────────────────────────────────────────────────────────────┘
	                               │
	                               ▼
	┌─────────────────────────────────────────────────────────────┐
	│                       FastAPI Backend                       │
	│   ┌─────────────────────────────────────────────────────┐   │
	│   │                 Inference Pipeline                  │   │
	│   │  • Model Loading                                    │   │
	│   │  • Preprocessing                                    │   │
	│   │  • GPU Inference                                    │   │
	│   │  • Post-processing                                  │   │
	│   └─────────────────────────────────────────────────────┘   │
	│   ┌─────────────────────────────────────────────────────┐   │
	│   │                Real-time Processing                 │   │
	│   │  • Sliding Window Buffering                         │   │
	│   │  • Asynchronous Processing                          │   │
	│   │  • Streaming Responses                              │   │
	│   └─────────────────────────────────────────────────────┘   │
	└─────────────────────────────────────────────────────────────┘
	                               │
	                               ▼
	┌─────────────────────────────────────────────────────────────┐
	│                     Response Formatting                     │
	│  • JSON API Responses                                       │
	│  • Real-time WebSocket Updates                              │
	│  • Batch Processing for Post-call Analysis                  │
	└─────────────────────────────────────────────────────────────┘
	```
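
For the response-formatting stage, one per-window result might be serialized as below. This is a sketch only: every field name is hypothetical, and the actual JSON schema is not specified in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WindowResult:
    # All field names are hypothetical placeholders, not the real schema.
    window_start_s: float
    window_end_s: float
    emotion: str
    intent: str
    engagement: float
    confidence: float
    modality_contributions: dict


def to_message(result):
    """Serialize one window's predictions as a compact JSON string."""
    return json.dumps(asdict(result), separators=(",", ":"), sort_keys=True)


# Example values are made up for illustration.
message = to_message(WindowResult(
    window_start_s=0.0,
    window_end_s=5.0,
    emotion="happy",
    intent="agreement",
    engagement=0.72,
    confidence=0.81,
    modality_contributions={"vision": 0.5, "audio": 0.3, "text": 0.2},
))
```

The same string could serve as a REST response body or as a WebSocket text frame, which keeps the real-time and batch paths on one payload format.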

	## Performance Requirements

	- Latency: <200ms end-to-end
	- Throughput: 25-30 FPS video processing
	- Accuracy: F1 > 0.80 for emotion classification
	- Scalability: Horizontal scaling with load balancer
	- Reliability: 99.9% uptime, graceful degradation
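
The accuracy target can be checked with a small scoring helper. The document does not say which F1 variant is meant, so the sketch below assumes macro-averaged F1 (the unweighted mean of per-class F1); the labels and predictions are made-up examples.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Each label occurs at least once, so the denominator is never zero.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


score = macro_f1(
    ["happy", "sad", "happy", "neutral"],
    ["happy", "sad", "sad", "neutral"],
)
meets_target = score > 0.80  # threshold from the requirement above
```
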

	## Security Considerations

	- Data Privacy: Biometric data is not stored by default
	- Encryption: TLS 1.3 for all communications
	- Access Control: API key authentication
	- Audit Logging: All inference requests logged
	- Compliance: Features that support GDPR and CCPA compliance
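
API key authentication could be checked roughly as follows. This is a minimal sketch, assuming keys are stored only as SHA-256 digests and compared in constant time; the key store, the demo key, and the function names are hypothetical.

```python
import hashlib
import hmac


def hash_key(api_key):
    """Return the SHA-256 hex digest of an API key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical key store: only digests are kept, never the raw keys.
VALID_KEY_HASHES = {hash_key("demo-key-123")}


def is_authorized(presented_key):
    """Return True if the presented key matches a stored digest."""
    if not presented_key:
        return False
    digest = hash_key(presented_key)
    # compare_digest avoids leaking where two strings first differ.
    return any(
        hmac.compare_digest(digest, known) for known in VALID_KEY_HASHES
    )
```
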