---
license: mit
tags:
- multimodal
- representation-learning
- contrastive-learning
- simclr
- unsupervised
- pytorch
- tabular
- explainability
metrics:
- adjusted_rand_score
- silhouette_score
---

# Multi-Modal Representation Learning Framework

An unsupervised multi-modal representation learning framework that fuses heterogeneous tabular signals into unified embeddings using cross-modal attention and SimCLR contrastive training.

**Trained without any labels. Achieves ARI = 0.9989 on cluster recovery.**

---

## Model Architecture

```
Academic [5] + Behavioral [5] + Activity [5]
     ↓               ↓               ↓
 Encoder A       Encoder B       Encoder C
(5→128→64)      (5→128→64)      (5→128→64)
     └───────────────┴───────────────┘
                     ↓
         CrossModalAttentionFusion
         - Concat [64,64,64] → 192
         - Per-modality attention scores
         - Softmax → weights sum to 1.0
         - Project 192 → 128
                     ↓
         unified_embedding  [128]
         attention_weights  [3]  ← explainability
```

- **Parameters:** 66,243 (encoder only)
- **Training:** SimCLR contrastive learning, 184 epochs, RTX 3050
- **Loss:** NT-Xent (temperature=0.07)
- **Batch size:** 128 (256 augmented views per step, which serve as in-batch negatives)

---

## Results

| Metric | Score |
|--------|-------|
| NT-Xent Loss | 0.5869 |
| Silhouette Score | 0.3310 |
| **Adjusted Rand Index** | **0.9989** |

Near-perfect unsupervised recovery of 4 student profiles from 5000 samples, with zero labels used during training.
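For readers unfamiliar with the NT-Xent loss reported above, the following is a minimal sketch of the standard SimCLR formulation, not the repository's own training code. The function name `nt_xent_loss` and the random stand-in tensors are assumptions for illustration; only the temperature (0.07) and batch size (128) come from this card. In this formulation a batch of 128 pairs yields 256 views, so each anchor is scored against 1 positive and 254 negatives.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard NT-Xent over N positive pairs (z1[i], z2[i]); hypothetical helper."""
    n = z1.size(0)
    # L2-normalize so dot products become cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D]
    sim = (z @ z.t()) / temperature                      # [2N, 2N]
    # Exclude self-similarity: a view must never be its own candidate.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Row i's positive is its other view: i+N in the first half, i-N in the second.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Cross-entropy over each row = -log softmax at the positive's index.
    return F.cross_entropy(sim, targets)


# Toy call with random stand-ins for the embeddings of two augmented views.
z_a, z_b = torch.randn(128, 128), torch.randn(128, 128)
loss = nt_xent_loss(z_a, z_b)
```

A low temperature such as 0.07 sharpens the softmax, so the loss is dominated by the hardest negatives. Whether the original training applied a separate projection head before the loss is not stated here.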

---

## Quick Start

```python
import torch
from huggingface_hub import hf_hub_download
from modeling_multimodal import MultiModalFramework

# Load model
model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
model.eval()

# Example: single student (one row per modality, 5 features each)
academic   = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])    # gpa, attendance%, assignment%, exam_avg, late
behavioral = torch.tensor([[5.0, 90.0, 6.0, 8.0, 2.0]])      # library, session_min, peer, forum, login_var
activity   = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]])  # steps, sleep, active_min, sedentary, hr

with torch.no_grad():
    embedding, attn = model(academic, behavioral, activity)

print(f"Embedding shape : {embedding.shape}")         # [1, 128]
print(f"Attn weights    : {attn.numpy().round(3)}")   # [academic, behavioral, activity]
```

---

## Modality Attention Weights

The model produces per-sample attention weights that indicate which modality contributed most to the unified embedding.

**Overall contribution across 5000 students:**
- Activity: 49.1%
- Behavioral: 29.1%
- Academic: 21.8%

**Per-profile insights:**
- Social Learner relies heavily on Activity (0.60)
- Quiet Worker relies on Behavioral (0.36)
- High Achiever shows balanced attention across all modalities

---

## Application to Wearable Sensor Fusion

The same fusion design applies to the multi-modal fusion problem in wearable health tech. Replace the tabular encoders with 1D-CNN/LSTM encoders to handle:

| This Model | Wearable Application |
|-----------|---------------------|
| Academic modality | EEG signals |
| Behavioral modality | EMG signals |
| Activity modality | IMU + PPG |
| Student profiles | Human activity states |

---

## Training Details

- **Dataset:** Synthetic, 5000 samples, 4 hidden profiles
- **Augmentation:** Gaussian noise (σ=0.15) + 5% feature dropout
- **Optimizer:** Adam (lr=1e-3, weight_decay=1e-4)
- **LR Schedule:** 10-epoch warmup + cosine decay
- **Early stopping:** Patience=30
- **Hardware:** NVIDIA RTX 3050 4GB
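
The augmentation and learning-rate schedule listed above could be wired up roughly as follows. This is a sketch under several assumptions, not the original training script: the helper names `augment` and `lr_factor` are hypothetical, the noise is assumed to be applied to standardized features, dropped features are assumed to be set to zero, the warmup is assumed to be linear, and `total_epochs=200` is a placeholder because the configured maximum is not stated (training stopped at epoch 184).

```python
import math
import torch


def augment(x: torch.Tensor, sigma: float = 0.15, drop_prob: float = 0.05) -> torch.Tensor:
    """One stochastic view of x: additive Gaussian noise, then random feature dropout."""
    noisy = x + sigma * torch.randn_like(x)
    # Each feature is independently zeroed with probability drop_prob (assumed behavior).
    keep = (torch.rand_like(x) >= drop_prob).to(x.dtype)
    return noisy * keep


def lr_factor(epoch: int, warmup_epochs: int = 10, total_epochs: int = 200) -> float:
    """Multiplier on the base LR: linear warmup, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# A placeholder parameter stands in for the real encoder weights.
params = [torch.nn.Parameter(torch.zeros(5, 5))]
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# Two independent views of the same batch form the positive pairs for SimCLR.
batch = torch.randn(128, 5)
view_1, view_2 = augment(batch), augment(batch)
```

In a training loop, `scheduler.step()` would be called once per epoch after the optimizer updates, so the learning rate ramps to 1e-3 over the first 10 epochs and then decays.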