# Multi-Modal Representation Learning Framework
An unsupervised multi-modal representation learning framework that fuses heterogeneous tabular signals into unified embeddings using cross-modal attention and SimCLR-style contrastive training.
Trained without any labels; achieves ARI = 0.9989 on cluster recovery.
## Model Architecture
```
Academic [5]      Behavioral [5]      Activity [5]
      │                 │                  │
  Encoder A         Encoder B          Encoder C
 (5→128→64)        (5→128→64)         (5→128→64)
      └─────────────────┼──────────────────┘
                        ↓
            CrossModalAttentionFusion
            - Concat [64, 64, 64] → 192
            - Per-modality attention scores
            - Softmax → weights sum to 1.0
            - Project 192 → 128
                        ↓
            unified_embedding [128]
            attention_weights [3] → explainability
```
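The fusion head above can be sketched in PyTorch. The class name matches the diagram, but the internals (a scalar attention score per modality computed from the concatenated embeddings, re-weighting before projection) are assumptions about one reasonable implementation, not the repository's exact code:

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Sketch of the fusion head in the diagram (internals are assumptions)."""

    def __init__(self, dim_per_modality=64, n_modalities=3, out_dim=128):
        super().__init__()
        concat_dim = dim_per_modality * n_modalities  # 3 x 64 = 192
        self.score = nn.Linear(concat_dim, n_modalities)  # one score per modality
        self.project = nn.Linear(concat_dim, out_dim)     # 192 -> 128

    def forward(self, *modality_embeddings):
        # modality_embeddings: three [B, 64] tensors (academic, behavioral, activity)
        concat = torch.cat(modality_embeddings, dim=-1)       # [B, 192]
        weights = torch.softmax(self.score(concat), dim=-1)   # [B, 3], rows sum to 1
        stacked = torch.stack(modality_embeddings, dim=1)     # [B, 3, 64]
        weighted = stacked * weights.unsqueeze(-1)            # re-weight each modality
        fused = self.project(weighted.flatten(1))             # [B, 128]
        return fused, weights

fusion = CrossModalAttentionFusion()
a, b, c = (torch.randn(4, 64) for _ in range(3))
emb, w = fusion(a, b, c)
print(emb.shape, w.shape)  # torch.Size([4, 128]) torch.Size([4, 3])
```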
- Parameters: 66,243 (encoder only)
- Training: SimCLR contrastive learning, 184 epochs, RTX 3050
- Loss: NT-Xent (temperature=0.07)
- Batch size: 128 (2 × 128 = 256 augmented views per contrastive step)
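The NT-Xent objective listed above can be written compactly. This is a standard SimCLR-style implementation (positives are the two augmented views of the same sample), not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.07):
    """NT-Xent over 2N views: the positive for view i is view i+N (and vice versa)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2N, D], unit-norm
    sim = z @ z.t() / temperature                       # cosine similarity / tau
    sim.fill_diagonal_(float("-inf"))                   # mask self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(128, 128), torch.randn(128, 128)  # two views of a batch
loss = nt_xent(z1, z2)
```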
## Results
| Metric | Score |
|---|---|
| NT-Xent Loss | 0.5869 |
| Silhouette Score | 0.3310 |
| Adjusted Rand Index | 0.9989 |
Near-perfect unsupervised cluster recovery of 4 student profiles from 5,000 samples, with zero labels used during training.
## Quick Start
```python
import torch
from huggingface_hub import hf_hub_download
from modeling_multimodal import MultiModalFramework

# Load model
model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
model.eval()

# Example: single student
academic   = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])    # gpa, attendance%, assignment%, exam_avg, late
behavioral = torch.tensor([[5.0, 90.0, 6.0, 8.0, 2.0]])      # library, session_min, peer, forum, login_var
activity   = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]])  # steps, sleep, active_min, sedentary, hr

with torch.no_grad():
    embedding, attn = model(academic, behavioral, activity)

print(f"Embedding shape : {embedding.shape}")        # [1, 128]
print(f"Attn weights    : {attn.numpy().round(3)}")  # [academic, behavioral, activity]
```
## Modality Attention Weights
The model produces per-sample attention weights explaining which modality contributed most to the unified embedding.
Overall contribution across 5000 students:
- Activity: 49.1%
- Behavioral: 29.1%
- Academic: 21.8%
Per-profile insights:
- Social Learner relies heavily on Activity (0.60)
- Quiet Worker relies on Behavioral (0.36)
- High Achiever shows balanced attention across all modalities
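Dataset- and profile-level numbers like those above come from aggregating the per-sample weights. A couple of tensor ops suffice; the toy `attn` tensor here is illustrative, not real model output:

```python
import torch

# Toy per-sample attention weights [N, 3]; columns = [academic, behavioral, activity]
attn = torch.tensor([[0.20, 0.30, 0.50],
                     [0.25, 0.25, 0.50],
                     [0.20, 0.35, 0.45]])

overall = attn.mean(dim=0)     # dataset-level contribution per modality
dominant = attn.argmax(dim=1)  # per-sample dominant modality index
print(overall)                 # tensor([0.2167, 0.3000, 0.4833])
print(dominant)                # tensor([2, 2, 2]) -> activity dominates each sample
```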
## Application to Wearable Sensor Fusion
This framework directly addresses the multi-modal fusion problem in wearable health tech. Replace tabular encoders with 1D-CNN/LSTM encoders to handle:
| This Model | Wearable Application |
|---|---|
| Academic modality | EEG signals |
| Behavioral modality | EMG signals |
| Activity modality | IMU + PPG |
| Student profiles | Human activity states |
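A minimal sketch of such a swap, assuming a 1D-CNN encoder that maps a windowed signal (e.g. a 6-axis IMU window) to the same 64-d embedding the tabular encoders produce. The class name and layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

class Conv1DEncoder(nn.Module):
    """Hypothetical drop-in encoder for windowed sensor signals (IMU/PPG/EEG/EMG),
    emitting the same 64-d embedding as the tabular encoders."""

    def __init__(self, in_channels, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, x):  # x: [B, channels, T]
        return self.head(self.net(x).squeeze(-1))  # [B, 64]

enc = Conv1DEncoder(in_channels=6)       # e.g. 6-axis IMU
emb = enc(torch.randn(8, 6, 200))        # 8 windows of 200 timesteps
print(emb.shape)  # torch.Size([8, 64])
```

Because the fusion head only sees fixed-size 64-d embeddings, the rest of the framework (attention fusion, NT-Xent training) is unchanged by the swap.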
## Training Details
- Dataset: Synthetic β 5000 samples, 4 hidden profiles
- Augmentation: Gaussian noise (σ=0.15) + 5% feature dropout
- Optimizer: Adam (lr=1e-3, weight_decay=1e-4)
- LR Schedule: 10-epoch warmup + cosine decay
- Early stopping: Patience=30
- Hardware: NVIDIA RTX 3050 4GB
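The augmentation recipe above (Gaussian noise with σ=0.15 plus 5% random feature dropout) can be sketched as a single function; `augment` is a hypothetical name, not the repository's API:

```python
import torch

def augment(x, noise_sigma=0.15, dropout_p=0.05):
    """Tabular SimCLR augmentation: additive Gaussian noise + random feature dropout."""
    noisy = x + noise_sigma * torch.randn_like(x)
    keep = (torch.rand_like(x) >= dropout_p).float()  # zero out ~5% of features
    return noisy * keep

x = torch.randn(128, 5)
view1, view2 = augment(x), augment(x)  # two independent views for the contrastive pair
```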