| ---
|
| license: mit
|
| tags:
|
| - multimodal
|
| - representation-learning
|
| - contrastive-learning
|
| - simclr
|
| - unsupervised
|
| - pytorch
|
| - tabular
|
| - explainability
|
| metrics:
|
| - adjusted_rand_score
|
| - silhouette_score
|
| ---
|
|
|
| # Multi-Modal Representation Learning Framework
|
|
|
| Unsupervised multi-modal representation learning framework that fuses
|
| heterogeneous tabular signals into unified embeddings using cross-modal
|
| attention and SimCLR contrastive training.
|
|
|
| **Trained without any labels. Achieves ARI = 0.9989 on cluster recovery.**
|
|
|
| ---
|
|
|
| ## Model Architecture
|
| ```
|
| Academic [5]   +   Behavioral [5]   +   Activity [5]
|      │                   │                   │
|  Encoder A           Encoder B           Encoder C
| (5→128→64)          (5→128→64)          (5→128→64)
|      └───────────────────┴───────────────────┘
|                          │
|              CrossModalAttentionFusion
|              - Concat [64,64,64] → 192
|              - Per-modality attention scores
|              - Softmax → weights sum to 1.0
|              - Project 192 → 128
|                          │
|              unified_embedding [128]
|              attention_weights [3]  ← explainability
|
| ```
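The data flow in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class names `ModalityEncoder` and `AttentionFusionSketch`, the ReLU activation, and the way the softmax weights scale each 64-d block before projection are all assumptions, since the card only gives the layer sizes and the order of operations.

```python
from typing import Sequence

import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """MLP encoder for one modality: 5 -> 128 -> 64 (sizes from the diagram)."""

    def __init__(self, in_dim: int = 5, hidden_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),  # assumption: the activation is not stated in this card
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class AttentionFusionSketch(nn.Module):
    """Hypothetical stand-in for CrossModalAttentionFusion."""

    def __init__(self, n_modalities: int = 3, enc_dim: int = 64, out_dim: int = 128):
        super().__init__()
        concat_dim = n_modalities * enc_dim               # 3 * 64 = 192
        self.score = nn.Linear(concat_dim, n_modalities)  # one score per modality
        self.project = nn.Linear(concat_dim, out_dim)     # 192 -> 128

    def forward(self, encoded: Sequence[torch.Tensor]):
        concat = torch.cat(list(encoded), dim=-1)            # [B, 192]
        weights = torch.softmax(self.score(concat), dim=-1)  # [B, 3], rows sum to 1
        # Assumption: scale each modality's 64-d block by its weight, then project.
        weighted = torch.cat(
            [w.unsqueeze(-1) * e for w, e in zip(weights.unbind(dim=-1), encoded)],
            dim=-1,
        )
        return self.project(weighted), weights


encoders = nn.ModuleList([ModalityEncoder() for _ in range(3)])
fusion = AttentionFusionSketch()

# Random stand-ins for the academic, behavioral, and activity inputs.
batch = [torch.randn(8, 5) for _ in range(3)]
embedding, attn = fusion([enc(x) for enc, x in zip(encoders, batch)])
print(embedding.shape, attn.shape)
```

Because the weights come from a softmax over three scores, each row of `attn` is a distribution over modalities, which is what makes it usable as a per-sample explanation.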
|
|
|
| - **Parameters:** 66,243 (encoder only)
|
| - **Training:** SimCLR contrastive learning, 184 epochs, RTX 3050
|
| - **Loss:** NT-Xent (temperature=0.07)
|
| - **Batch size:** 128 samples (256 augmented views per step)
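For reference, the NT-Xent loss named above can be written in a few lines. This is a sketch of the standard SimCLR formulation, assumed rather than copied from this model's training code; the function name `nt_xent_loss` is illustrative. With 128 samples and two views each, the similarity matrix has 256 rows, and every row has one positive and 254 negatives.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """NT-Xent over two augmented views of the same batch, each of shape [N, D]."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2N, D], unit length
    sim = z @ z.T / temperature                          # temperature-scaled cosine similarity
    # A view must never be scored against itself.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive for row i is the other view of the same sample.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


torch.manual_seed(0)
z1 = torch.randn(128, 128)
z2 = z1 + 0.05 * torch.randn_like(z1)  # a slightly perturbed second view
loss = nt_xent_loss(z1, z2)
print(f"loss = {loss.item():.4f}")
```

A low temperature such as 0.07 sharpens the softmax, so hard negatives dominate the gradient.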
|
|
|
| ---
|
|
|
| ## Results
|
|
|
| | Metric | Score |
|
| |--------|-------|
|
| | NT-Xent Loss | 0.5869 |
|
| | Silhouette Score | 0.3310 |
|
| | **Adjusted Rand Index** | **0.9989** |
|
|
|
| Near-perfect unsupervised cluster recovery across 4 student
|
| profiles from 5000 samples, with zero labels used during training.
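The two clustering metrics in the table can be computed with scikit-learn. The sketch below assumes the embeddings are clustered with K-Means (k=4) and then compared against the held-out profile labels. The card does not state which clustering algorithm was used, so treat `evaluate_embeddings` and its defaults as hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score


def evaluate_embeddings(embeddings, true_profiles, n_clusters=4, seed=0):
    """Cluster the embeddings, then score them against held-out profile labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    return {
        "ari": adjusted_rand_score(true_profiles, pred),   # agreement with true profiles
        "silhouette": silhouette_score(embeddings, pred),  # cluster separation, no labels needed
    }


# Toy stand-in for real embeddings: four well-separated Gaussian blobs.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 128))
labels = rng.integers(0, 4, size=400)
toy_embeddings = centers[labels] + rng.normal(size=(400, 128))

scores = evaluate_embeddings(toy_embeddings, labels)
print(scores)
```

Labels enter only at evaluation time, through `adjusted_rand_score`. The silhouette score is computed from the embeddings and predicted clusters alone.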
|
|
|
| ---
|
|
|
| ## Quick Start
|
| ```python
|
| import torch
|
| from huggingface_hub import hf_hub_download
|
| from modeling_multimodal import MultiModalFramework
|
|
|
| # Load model
|
| model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
|
| model.eval()
|
|
|
| # Example: single student
|
| academic = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]]) # gpa, attendance%, assignment%, exam_avg, late
|
| behavioral = torch.tensor([[5.0, 90.0, 6.0, 8.0, 2.0]]) # library, session_min, peer, forum, login_var
|
| activity = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]]) # steps, sleep, active_min, sedentary, hr
|
|
|
| with torch.no_grad():
|
| embedding, attn = model(academic, behavioral, activity)
|
|
|
| print(f"Embedding shape : {embedding.shape}") # [1, 128]
|
| print(f"Attn weights : {attn.numpy().round(3)}") # [academic, behavioral, activity]
|
| ```
|
|
|
| ---
|
|
|
| ## Modality Attention Weights
|
|
|
| The model produces per-sample attention weights explaining which
|
| modality contributed most to the unified embedding.
|
|
|
| **Overall contribution across 5000 students:**
|
| - Activity: 49.1%
|
| - Behavioral: 29.1%
|
| - Academic: 21.8%
|
|
|
| **Per-profile insights:**
|
| - Social Learner relies heavily on Activity (0.60)
|
| - Quiet Worker relies on Behavioral (0.36)
|
| - High Achiever shows balanced attention across all modalities
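Summaries like the ones above can be produced by averaging the per-sample attention weights, overall and within each profile. The helper below is a sketch: `summarize_attention` is a hypothetical name, integer profile ids are an assumption, and random weights stand in for the real model output.

```python
import torch


def summarize_attention(attn, profiles, names=("academic", "behavioral", "activity")):
    """Mean attention per modality, overall and per profile id."""
    overall = dict(zip(names, attn.mean(dim=0).tolist()))
    per_profile = {
        int(p): dict(zip(names, attn[profiles == p].mean(dim=0).tolist()))
        for p in profiles.unique()
    }
    return overall, per_profile


# Random stand-ins; real weights would be the second output of
# model(academic, behavioral, activity) collected over the dataset.
torch.manual_seed(0)
attn = torch.softmax(torch.randn(1000, 3), dim=-1)  # [N, 3], rows sum to 1
profiles = torch.randint(0, 4, (1000,))             # hypothetical profile ids 0..3

overall, per_profile = summarize_attention(attn, profiles)
print(overall)
print(per_profile[0])
```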
|
|
|
| ---
|
|
|
| ## Application to Wearable Sensor Fusion
|
|
|
| The same design carries over to multi-modal fusion in wearable health
|
| tech: replace the tabular encoders with 1D-CNN/LSTM encoders and map
|
| the modalities as follows:
|
|
|
| | This Model | Wearable Application |
|
| |-----------|---------------------|
|
| | Academic modality | EEG signals |
|
| | Behavioral modality | EMG signals |
|
| | Activity modality | IMU + PPG |
|
| | Student profiles | Human activity states |
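As an illustration of the encoder swap, a 1D-CNN that maps a window of raw sensor samples to the same 64-d output as the tabular encoders might look like this. The class name, layer widths, kernel sizes, and the 8-channel EEG example are all assumptions made for the sketch; none of them come from this repository.

```python
import torch
import torch.nn as nn


class Conv1dEncoder(nn.Module):
    """Maps a sensor window [B, channels, time] to a 64-d modality vector."""

    def __init__(self, in_channels: int, out_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over time, so any window length works
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).squeeze(-1))


eeg_encoder = Conv1dEncoder(in_channels=8)  # hypothetical 8-channel EEG
window = torch.randn(4, 8, 256)             # 4 windows of 256 time steps
encoded = eeg_encoder(window)
print(encoded.shape)
```

Since the output width matches the tabular encoders, the fusion stage would not need to change.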
|
|
|
| ---
|
|
|
| ## Training Details
|
|
|
| - **Dataset:** Synthetic, 5000 samples, 4 hidden profiles
|
| - **Augmentation:** Gaussian noise (σ=0.15) + 5% feature dropout
|
| - **Optimizer:** Adam (lr=1e-3, weight_decay=1e-4)
|
| - **LR Schedule:** 10-epoch warmup + cosine decay
|
| - **Early stopping:** Patience=30
|
| - **Hardware:** NVIDIA RTX 3050 4GB
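The augmentation and learning-rate schedule listed above can be sketched as below. Several details are assumptions: that features are standardized before noise is added, that "feature dropout" means zeroing each feature independently, that warmup is linear, and that the schedule spans 200 epochs (the card reports 184 epochs run but not the configured maximum). The `Linear` layer is only a placeholder for the real model.

```python
import math

import torch


def augment(x: torch.Tensor, noise_std: float = 0.15, drop_prob: float = 0.05) -> torch.Tensor:
    """Add Gaussian noise, then zero each feature independently with prob drop_prob."""
    noisy = x + noise_std * torch.randn_like(x)
    keep = (torch.rand_like(x) >= drop_prob).to(x.dtype)
    return noisy * keep


def warmup_cosine(epoch: int, warmup_epochs: int = 10, total_epochs: int = 200) -> float:
    """LR multiplier: linear warmup, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = torch.nn.Linear(5, 64)  # placeholder for the real framework
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

x = torch.randn(128, 5)
view_a, view_b = augment(x), augment(x)  # two independent views for the contrastive loss
print(view_a.shape, optimizer.param_groups[0]["lr"])
```

Calling `scheduler.step()` once per epoch, after the optimizer steps, advances the multiplier.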