raj5517/multimodal-representation-framework

	---
	license: mit
	tags:
	- multimodal
	- representation-learning
	- contrastive-learning
	- simclr
	- unsupervised
	- pytorch
	- tabular
	- explainability
	metrics:
	- adjusted_rand_score
	- silhouette_score
	---

	# Multi-Modal Representation Learning Framework

	Unsupervised multi-modal representation learning framework that fuses
	heterogeneous tabular signals into unified embeddings using cross-modal
	attention and SimCLR contrastive training.

	Trained without labels. Achieves an Adjusted Rand Index (ARI) of 0.9989 when recovering the 4 hidden profiles in the synthetic dataset.

	---

	## Model Architecture
	```
	Academic [5]  +  Behavioral [5]  +  Activity [5]
	     ↓                 ↓                 ↓
	 Encoder A         Encoder B         Encoder C
	(5→128→64)        (5→128→64)        (5→128→64)
	     └─────────────────┴─────────────────┘
	                       ↓
	           CrossModalAttentionFusion
	           - Concat [64,64,64] → 192
	           - Per-modality attention scores
	           - Softmax → weights sum to 1.0
	           - Project 192 → 128
	                       ↓
	           unified_embedding [128]
	           attention_weights [3]   ← explainability
	```
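The diagram can be read as the following PyTorch sketch. It is not the released implementation: the class names `ModalityEncoder`, `FusionSketch`, and `MultiModalSketch` are hypothetical, and the ReLU activation, the single linear scoring layer, and the choice to scale each modality by its weight before projection are assumptions. Only the layer sizes are taken from the diagram.

```python
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """MLP encoder for one modality: 5 -> 128 -> 64 (sizes from the diagram)."""

    def __init__(self, in_dim: int = 5, hidden_dim: int = 128, out_dim: int = 64):
        super().__init__()
        # Assumption: ReLU between the two linear layers; the card does not say.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class FusionSketch(nn.Module):
    """Hypothetical stand-in for CrossModalAttentionFusion."""

    def __init__(self, n_modalities: int = 3, feat_dim: int = 64, out_dim: int = 128):
        super().__init__()
        concat_dim = n_modalities * feat_dim              # 3 * 64 = 192
        self.score = nn.Linear(concat_dim, n_modalities)  # one logit per modality
        self.project = nn.Linear(concat_dim, out_dim)     # 192 -> 128

    def forward(self, feats):
        stacked = torch.stack(feats, dim=1)               # [B, 3, 64]
        concat = stacked.flatten(start_dim=1)             # [B, 192]
        weights = torch.softmax(self.score(concat), dim=-1)  # [B, 3], rows sum to 1
        # Assumption: each modality is scaled by its weight before projection.
        weighted = (stacked * weights.unsqueeze(-1)).flatten(start_dim=1)
        return self.project(weighted), weights


class MultiModalSketch(nn.Module):
    """Three encoders feeding one fusion block, as in the diagram."""

    def __init__(self):
        super().__init__()
        self.encoders = nn.ModuleList([ModalityEncoder() for _ in range(3)])
        self.fusion = FusionSketch()

    def forward(self, academic, behavioral, activity):
        inputs = (academic, behavioral, activity)
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        return self.fusion(feats)


torch.manual_seed(0)
sketch = MultiModalSketch()
embedding, attention_weights = sketch(
    torch.randn(4, 5), torch.randn(4, 5), torch.randn(4, 5)
)
print(embedding.shape, attention_weights.shape)  # torch.Size([4, 128]) torch.Size([4, 3])
```

Because the weights come out of a softmax, each row is a distribution over the three modalities, which is what makes them usable as a per-sample explanation.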

	- Parameters: 66,243 (encoder only)
	- Training: SimCLR contrastive learning, 184 epochs, RTX 3050
	- Loss: NT-Xent (temperature=0.07)
	- Batch size: 128, giving 256 augmented views per step (254 negatives per anchor under standard NT-Xent)
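For reference, a minimal NT-Xent sketch in the standard SimCLR formulation is shown below. The function name `nt_xent_loss` and the toy inputs are illustrative, and the training code for this model may differ in detail. With N = 128 pairs there are 2N = 256 views per step, so each anchor is contrasted against 2N − 2 = 254 negatives in this formulation.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """NT-Xent over a batch of N positive pairs (z1[i], z2[i])."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2N, D], unit length
    sim = z @ z.T / temperature                          # scaled cosine similarities
    # A view must never be contrasted with itself.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive for row i is its other view: i + N in the first half, i - N in the second.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


torch.manual_seed(0)
batch_size = 128
z1 = torch.randn(batch_size, 128)
z2_aligned = z1 + 0.01 * torch.randn_like(z1)  # nearly identical views
z2_random = torch.randn(batch_size, 128)       # unrelated views

loss_aligned = nt_xent_loss(z1, z2_aligned)
loss_random = nt_xent_loss(z1, z2_random)
negatives_per_anchor = 2 * batch_size - 2      # 254 for N = 128
print(loss_aligned.item(), loss_random.item(), negatives_per_anchor)
```

The low temperature sharpens the softmax, so well-aligned pairs drive the loss close to zero while unrelated pairs leave it near the log of the number of candidates.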

	---

	## Results

	| Metric | Score |
	|--------|-------|
	| NT-Xent Loss | 0.5869 |
	| Silhouette Score | 0.3310 |
	| Adjusted Rand Index | 0.9989 |

	The ARI indicates near-perfect unsupervised recovery of the 4 student
	profiles across 5000 samples, with no labels used during training.
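A sketch of how the two clustering metrics can be computed with scikit-learn follows. The card does not name the clustering step, so k-means with k = 4 is an assumption, and the blobs from `make_blobs` only stand in for the learned embeddings and hidden profile ids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Stand-in data: 4 synthetic blobs in 128 dimensions play the role of the
# embeddings and the hidden profile ids (not the card's real data).
embeddings, true_profiles = make_blobs(
    n_samples=500, n_features=128, centers=4, cluster_std=1.0, random_state=0
)

# Assumption: k-means with k = 4 produces the cluster assignments.
predicted = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

ari = adjusted_rand_score(true_profiles, predicted)  # profile ids used for evaluation only
sil = silhouette_score(embeddings, predicted)        # label-free geometric separation
print(f"ARI={ari:.4f}  silhouette={sil:.4f}")
```

The two metrics answer different questions: ARI measures agreement with a reference partition, while the silhouette score measures how compact and separated the clusters are, so a high ARI can coexist with a moderate silhouette.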

	---

	## Quick Start
	```python
	import torch
	from modeling_multimodal import MultiModalFramework  # modeling_multimodal.py must be on the import path

	# Load the pretrained model and switch to inference mode
	model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
	model.eval()

	# Example: a single student, one row of 5 features per modality
	academic = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])    # gpa, attendance%, assignment%, exam_avg, late
	behavioral = torch.tensor([[5.0, 90.0, 6.0, 8.0, 2.0]])    # library, session_min, peer, forum, login_var
	activity = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]])  # steps, sleep, active_min, sedentary, hr

	# Inference only, so skip gradient tracking
	with torch.no_grad():
	    embedding, attn = model(academic, behavioral, activity)

	print(f"Embedding shape : {embedding.shape}")      # [1, 128]
	print(f"Attn weights : {attn.numpy().round(3)}")   # [academic, behavioral, activity]
	```

	---

	## Modality Attention Weights

	The model produces per-sample attention weights explaining which
	modality contributed most to the unified embedding.

	Overall contribution across 5000 students:
	- Activity: 49.1%
	- Behavioral: 29.1%
	- Academic: 21.8%

	Per-profile insights:
	- Social Learner relies heavily on Activity (attention weight 0.60)
	- Quiet Worker leans on Behavioral (attention weight 0.36)
	- High Achiever shows balanced attention across all modalities
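Summaries like the ones above can be produced by averaging the per-sample weights, overall and per cluster. The sketch below uses random stand-in tensors. `attn` and `profile_ids` are hypothetical names for the stacked model outputs and the cluster assignments, and the printed numbers are not the card's results.

```python
import torch

torch.manual_seed(0)
modalities = ["academic", "behavioral", "activity"]

# Hypothetical inputs: per-sample attention weights [N, 3] and one cluster id per sample.
attn = torch.softmax(torch.randn(1000, 3), dim=-1)
profile_ids = torch.randint(0, 4, (1000,))

# Mean weight per modality over all samples (reported as percentages).
overall = attn.mean(dim=0)

# Mean weight per modality within each profile, shape [4, 3].
per_profile = torch.stack([attn[profile_ids == k].mean(dim=0) for k in range(4)])

for name, w in zip(modalities, overall.tolist()):
    print(f"{name:<10} {100 * w:.1f}%")
for k, row in enumerate(per_profile.tolist()):
    print(f"profile {k}: " + ", ".join(f"{m}={w:.2f}" for m, w in zip(modalities, row)))
```

Since every row of `attn` sums to 1, the overall and per-profile means do too, which is why the reported percentages add up to 100%.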

	---

	## Application to Wearable Sensor Fusion

	The same fusion design could carry over to multi-modal sensor fusion in
	wearable health tech. Replacing the tabular encoders with 1D-CNN or LSTM
	encoders would let each modality slot take a sensor stream:

	| This Model | Wearable Application |
	|-----------|---------------------|
	| Academic modality | EEG signals |
	| Behavioral modality | EMG signals |
	| Activity modality | IMU + PPG |
	| Student profiles | Human activity states |
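As a rough illustration of that swap, here is a hypothetical 1D-CNN encoder that maps a windowed sensor stream to the same 64-dimensional feature the tabular encoders produce. The class name `SensorEncoder1D`, the kernel sizes, and the channel counts and window lengths are placeholders, not part of this repository.

```python
import torch
import torch.nn as nn


class SensorEncoder1D(nn.Module):
    """Hypothetical 1D-CNN encoder: [batch, channels, time] -> [batch, 64]."""

    def __init__(self, in_channels: int, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis for any window length
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x).squeeze(-1))


# Channel counts and window lengths are placeholders, not recommendations.
eeg_encoder = SensorEncoder1D(in_channels=8)
imu_encoder = SensorEncoder1D(in_channels=6)

eeg_feat = eeg_encoder(torch.randn(2, 8, 256))  # 2 windows, 8 channels, 256 samples
imu_feat = imu_encoder(torch.randn(2, 6, 100))  # a different length is fine
print(eeg_feat.shape, imu_feat.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```

Keeping the encoder output at 64 dimensions means the fusion block downstream would not need to change.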

	---

	## Training Details

	- Dataset: Synthetic — 5000 samples, 4 hidden profiles
	- Augmentation: Gaussian noise (σ=0.15) + 5% feature dropout
	- Optimizer: Adam (lr=1e-3, weight_decay=1e-4)
	- LR Schedule: 10-epoch warmup + cosine decay
	- Early stopping: Patience=30
	- Hardware: NVIDIA RTX 3050 4GB
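
The augmentation and schedule listed above might look like the sketch below. Several details are assumptions: inputs are taken to be standardized, feature dropout is implemented by zeroing entries without rescaling, warmup is linear, and `total=200` is a placeholder for the maximum epoch count, which the card does not state. The helper names `augment` and `warmup_cosine` are illustrative.

```python
import math
import torch


def augment(x: torch.Tensor, sigma: float = 0.15, drop_p: float = 0.05) -> torch.Tensor:
    """Gaussian noise plus feature dropout; assumes x is already standardized."""
    noisy = x + sigma * torch.randn_like(x)
    keep = (torch.rand_like(x) >= drop_p).to(x.dtype)  # zero roughly 5% of entries
    return noisy * keep


def warmup_cosine(epoch: int, warmup: int = 10, total: int = 200) -> float:
    """LR multiplier: linear warmup over `warmup` epochs, then cosine decay."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(5, 64)  # placeholder for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

torch.manual_seed(0)
x = torch.randn(128, 5)
view_a, view_b = augment(x), augment(x)  # two independent views of the same batch

lrs = []
for epoch in range(200):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # a real loop would backpropagate the NT-Xent loss first
    scheduler.step()

print(f"lr at epoch 0: {lrs[0]:.1e}, epoch 9: {lrs[9]:.1e}, epoch 199: {lrs[-1]:.1e}")
```

Early stopping with patience 30 would sit around this loop, ending training once the monitored loss has not improved for 30 consecutive epochs.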