---
license: mit
tags:
  - multimodal
  - representation-learning
  - contrastive-learning
  - simclr
  - unsupervised
  - pytorch
  - tabular
  - explainability
metrics:
  - adjusted_rand_score
  - silhouette_score
---


# Multi-Modal Representation Learning Framework

An unsupervised multi-modal representation learning framework that fuses
heterogeneous tabular signals into unified embeddings using cross-modal
attention and SimCLR contrastive training.

**Trained without any labels. Achieves ARI = 0.9989 on cluster recovery.**

---

## Model Architecture
```
Academic [5]  +  Behavioral [5]  +  Activity [5]
      ↓               ↓                ↓
  Encoder A       Encoder B        Encoder C
 (5→128→64)      (5→128→64)       (5→128→64)
      └───────────────┴────────────────┘
                      ↓
        CrossModalAttentionFusion
        - Concat [64, 64, 64] → 192
        - Per-modality attention scores
        - Softmax → weights sum to 1.0
        - Project 192 → 128
                      ↓
          unified_embedding [128]
          attention_weights [3]   ← explainability
```

- **Parameters:** 66,243 (encoder only)
- **Training:** SimCLR contrastive learning, 184 epochs, RTX 3050
- **Loss:** NT-Xent (temperature=0.07)
- **Batch size:** 128 with 256 negatives per step
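The fusion step in the diagram above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the stated dimensions (64-dim per modality, concat to 192, softmax attention, project to 128), not the released implementation; the class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Sketch of the fusion block: concatenate per-modality embeddings,
    score each modality, softmax-normalize the scores so they sum to 1,
    re-weight each modality, then project to the unified embedding."""
    def __init__(self, modal_dim=64, n_modalities=3, out_dim=128):
        super().__init__()
        concat_dim = modal_dim * n_modalities             # 64 * 3 = 192
        self.score = nn.Linear(concat_dim, n_modalities)  # one score per modality
        self.project = nn.Linear(concat_dim, out_dim)     # 192 -> 128

    def forward(self, *modal_embeddings):
        z = torch.cat(modal_embeddings, dim=-1)           # [B, 192]
        weights = torch.softmax(self.score(z), dim=-1)    # [B, 3], rows sum to 1
        # scale each modality's embedding by its attention weight
        parts = [w.unsqueeze(-1) * e
                 for w, e in zip(weights.unbind(dim=-1), modal_embeddings)]
        fused = self.project(torch.cat(parts, dim=-1))    # [B, 128]
        return fused, weights

# Usage with random stand-in embeddings for a batch of 4 students
a, b, c = (torch.randn(4, 64) for _ in range(3))
fused, weights = CrossModalAttentionFusion()(a, b, c)
# fused: [4, 128]; weights: [4, 3], one row per sample
```

Returning the softmax weights alongside the embedding is what gives the model its per-sample explainability: the weights are already normalized contributions.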

---

## Results

| Metric | Score |
|--------|-------|
| NT-Xent Loss | 0.5869 |
| Silhouette Score | 0.3310 |
| **Adjusted Rand Index** | **0.9989** |

Near-perfect unsupervised cluster recovery across 4 student
profiles from 5000 samples, with zero labels used during training.

---

## Quick Start
```python
import torch
from modeling_multimodal import MultiModalFramework

# Load model
model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
model.eval()

# Example: single student
academic   = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])   # gpa, attendance%, assignment%, exam_avg, late
behavioral = torch.tensor([[5.0, 90.0, 6.0,  8.0,  2.0]])   # library, session_min, peer, forum, login_var
activity   = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]]) # steps, sleep, active_min, sedentary, hr

with torch.no_grad():
    embedding, attn = model(academic, behavioral, activity)

print(f"Embedding shape : {embedding.shape}")        # [1, 128]
print(f"Attn weights    : {attn.numpy().round(3)}")  # [academic, behavioral, activity]
```

---

## Modality Attention Weights

The model produces per-sample attention weights explaining which
modality contributed most to the unified embedding.

**Overall contribution across 5000 students:**
- Activity: 49.1%
- Behavioral: 29.1%
- Academic: 21.8%

**Per-profile insights:**
- Social Learner relies heavily on Activity (0.60)
- Quiet Worker relies on Behavioral (0.36)
- High Achiever shows balanced attention across all modalities
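The overall percentages above come from averaging the per-sample attention vectors over the dataset. A minimal sketch of that aggregation, using random stand-in weights (the real values require running the model over all 5000 students):

```python
import torch

# Stand-in for the [N, 3] matrix of per-sample attention weights
# returned by the model; columns = [academic, behavioral, activity].
attn_weights = torch.softmax(torch.randn(5000, 3), dim=-1)

# Mean over samples gives each modality's overall contribution;
# since every row sums to 1, the means also sum to 1.
overall = attn_weights.mean(dim=0)
for name, w in zip(["academic", "behavioral", "activity"], overall):
    print(f"{name:10s}: {w.item():.1%}")
```

Grouping the same mean by cluster assignment yields the per-profile breakdown (e.g. the 0.60 Activity weight for the Social Learner profile).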

---

## Application to Wearable Sensor Fusion

This framework directly addresses the multi-modal fusion problem in
wearable health tech. Replace tabular encoders with 1D-CNN/LSTM
encoders to handle:

| This Model | Wearable Application |
|-----------|---------------------|
| Academic modality | EEG signals |
| Behavioral modality | EMG signals |
| Activity modality | IMU + PPG |
| Student profiles | Human activity states |

---

## Training Details

- **Dataset:** Synthetic β€” 5000 samples, 4 hidden profiles
- **Augmentation:** Gaussian noise (Οƒ=0.15) + 5% feature dropout
- **Optimizer:** Adam (lr=1e-3, weight_decay=1e-4)

- **LR Schedule:** 10-epoch warmup + cosine decay

- **Early stopping:** Patience=30

- **Hardware:** NVIDIA RTX 3050 4GB
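The augmentation and loss described above can be sketched as follows. This is a generic SimCLR-style reconstruction using the stated hyperparameters (σ=0.15, 5% dropout, temperature=0.07), not the released training code:

```python
import torch
import torch.nn.functional as F

def augment(x, noise_sigma=0.15, drop_p=0.05):
    """Tabular augmentation: additive Gaussian noise plus random
    feature dropout, producing one stochastic view of the batch."""
    x = x + noise_sigma * torch.randn_like(x)
    mask = (torch.rand_like(x) > drop_p).float()  # keep ~95% of features
    return x * mask

def nt_xent(z1, z2, temperature=0.07):
    """NT-Xent over two batches of paired views (sketch).
    Row i of z1 and row i of z2 are the positive pair."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # [2B, D]
    sim = z @ z.t() / temperature                        # cosine similarities
    n = sim.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))            # exclude self-similarity
    # positive for row i is row (i + B) mod 2B
    targets = torch.arange(n, device=z.device).roll(n // 2)
    return F.cross_entropy(sim, targets)

# One training step on a random stand-in batch of 5-feature samples
x = torch.randn(128, 5)
v1, v2 = augment(x), augment(x)   # two views -> encode -> nt_xent(e1, e2)
```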