# Multi-Modal Representation Learning Framework
An unsupervised multi-modal representation learning framework that fuses heterogeneous tabular signals into unified embeddings using cross-modal attention and SimCLR-style contrastive training.
Trained without any labels; achieves ARI = 0.9989 on cluster recovery.
## Model Architecture
```
Academic [5]      Behavioral [5]      Activity [5]
      │                 │                  │
  Encoder A         Encoder B          Encoder C
 (5→128→64)        (5→128→64)         (5→128→64)
      └─────────────────┼──────────────────┘
                        ↓
            CrossModalAttentionFusion
            - Concat [64, 64, 64] → 192
            - Per-modality attention scores
            - Softmax → weights sum to 1.0
            - Project 192 → 128
                        ↓
            unified_embedding [128]
            attention_weights [3] → explainability
```
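The fusion head above can be sketched in PyTorch. The class name matches the diagram, but the internals (a scalar attention score per modality computed from the concatenated embeddings, re-weighting before projection) are assumptions about one reasonable implementation, not the repository's exact code:

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Sketch of the fusion head in the diagram (internals are assumptions)."""

    def __init__(self, dim_per_modality=64, n_modalities=3, out_dim=128):
        super().__init__()
        concat_dim = dim_per_modality * n_modalities  # 3 x 64 = 192
        self.score = nn.Linear(concat_dim, n_modalities)  # one score per modality
        self.project = nn.Linear(concat_dim, out_dim)     # 192 -> 128

    def forward(self, *modality_embeddings):
        # modality_embeddings: three [B, 64] tensors (academic, behavioral, activity)
        concat = torch.cat(modality_embeddings, dim=-1)       # [B, 192]
        weights = torch.softmax(self.score(concat), dim=-1)   # [B, 3], rows sum to 1
        stacked = torch.stack(modality_embeddings, dim=1)     # [B, 3, 64]
        weighted = stacked * weights.unsqueeze(-1)            # re-weight each modality
        fused = self.project(weighted.flatten(1))             # [B, 128]
        return fused, weights

fusion = CrossModalAttentionFusion()
a, b, c = (torch.randn(4, 64) for _ in range(3))
emb, w = fusion(a, b, c)
print(emb.shape, w.shape)  # torch.Size([4, 128]) torch.Size([4, 3])
```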
- Parameters: 66,243 (encoder only)
- Training: SimCLR contrastive learning, 184 epochs, RTX 3050
- Loss: NT-Xent (temperature=0.07)
- Batch size: 128 (2 × 128 = 256 augmented views per contrastive step)
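The NT-Xent objective listed above can be written compactly. This is a standard SimCLR-style implementation (positives are the two augmented views of the same sample), not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.07):
    """NT-Xent over 2N views: the positive for view i is view i+N (and vice versa)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2N, D], unit-norm
    sim = z @ z.t() / temperature                       # cosine similarity / tau
    sim.fill_diagonal_(float("-inf"))                   # mask self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(128, 128), torch.randn(128, 128)  # two views of a batch
loss = nt_xent(z1, z2)
```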
## Results
| Metric | Score |
|---|---|
| NT-Xent Loss | 0.5869 |
| Silhouette Score | 0.3310 |
| Adjusted Rand Index | 0.9989 |
Near-perfect unsupervised cluster recovery of 4 student profiles from 5,000 samples, with zero labels used during training.
## Quick Start
```python
import torch
from huggingface_hub import hf_hub_download
from modeling_multimodal import MultiModalFramework

# Load model
model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
model.eval()

# Example: single student
academic   = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])    # gpa, attendance%, assignment%, exam_avg, late
behavioral = torch.tensor([[5.0, 90.0, 6.0, 8.0, 2.0]])      # library, session_min, peer, forum, login_var
activity   = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]])  # steps, sleep, active_min, sedentary, hr

with torch.no_grad():
    embedding, attn = model(academic, behavioral, activity)

print(f"Embedding shape : {embedding.shape}")        # [1, 128]
print(f"Attn weights    : {attn.numpy().round(3)}")  # [academic, behavioral, activity]
```
## Modality Attention Weights
The model produces per-sample attention weights explaining which modality contributed most to the unified embedding.
Overall contribution across 5000 students:
- Activity: 49.1%
- Behavioral: 29.1%
- Academic: 21.8%
Per-profile insights:
- Social Learner relies heavily on Activity (0.60)
- Quiet Worker relies on Behavioral (0.36)
- High Achiever shows balanced attention across all modalities
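Dataset- and profile-level numbers like those above come from aggregating the per-sample weights. A couple of tensor ops suffice; the toy `attn` tensor here is illustrative, not real model output:

```python
import torch

# Toy per-sample attention weights [N, 3]; columns = [academic, behavioral, activity]
attn = torch.tensor([[0.20, 0.30, 0.50],
                     [0.25, 0.25, 0.50],
                     [0.20, 0.35, 0.45]])

overall = attn.mean(dim=0)     # dataset-level contribution per modality
dominant = attn.argmax(dim=1)  # per-sample dominant modality index
print(overall)                 # tensor([0.2167, 0.3000, 0.4833])
print(dominant)                # tensor([2, 2, 2]) -> activity dominates each sample
```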
## Application to Wearable Sensor Fusion
This framework directly addresses the multi-modal fusion problem in wearable health tech. Replace tabular encoders with 1D-CNN/LSTM encoders to handle:
| This Model | Wearable Application |
|---|---|
| Academic modality | EEG signals |
| Behavioral modality | EMG signals |
| Activity modality | IMU + PPG |
| Student profiles | Human activity states |
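A minimal sketch of such a swap, assuming a 1D-CNN encoder that maps a windowed signal (e.g. a 6-axis IMU window) to the same 64-d embedding the tabular encoders produce. The class name and layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

class Conv1DEncoder(nn.Module):
    """Hypothetical drop-in encoder for windowed sensor signals (IMU/PPG/EEG/EMG),
    emitting the same 64-d embedding as the tabular encoders."""

    def __init__(self, in_channels, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, x):  # x: [B, channels, T]
        return self.head(self.net(x).squeeze(-1))  # [B, 64]

enc = Conv1DEncoder(in_channels=6)       # e.g. 6-axis IMU
emb = enc(torch.randn(8, 6, 200))        # 8 windows of 200 timesteps
print(emb.shape)  # torch.Size([8, 64])
```

Because the fusion head only sees fixed-size 64-d embeddings, the rest of the framework (attention fusion, NT-Xent training) is unchanged by the swap.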
## Training Details
- Dataset: Synthetic β 5000 samples, 4 hidden profiles
- Augmentation: Gaussian noise (σ=0.15) + 5% feature dropout
- Optimizer: Adam (lr=1e-3, weight_decay=1e-4)
- LR Schedule: 10-epoch warmup + cosine decay
- Early stopping: Patience=30
- Hardware: NVIDIA RTX 3050 4GB
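The augmentation recipe above (Gaussian noise with σ=0.15 plus 5% random feature dropout) can be sketched as a single function; `augment` is a hypothetical name, not the repository's API:

```python
import torch

def augment(x, noise_sigma=0.15, dropout_p=0.05):
    """Tabular SimCLR augmentation: additive Gaussian noise + random feature dropout."""
    noisy = x + noise_sigma * torch.randn_like(x)
    keep = (torch.rand_like(x) >= dropout_p).float()  # zero out ~5% of features
    return noisy * keep

x = torch.randn(128, 5)
view1, view2 = augment(x), augment(x)  # two independent views for the contrastive pair
```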