Multi-Modal Representation Learning Framework

Unsupervised multi-modal representation learning framework that fuses heterogeneous tabular signals into unified embeddings using cross-modal attention and SimCLR contrastive training.

Trained entirely without labels, the model achieves an Adjusted Rand Index (ARI) of 0.9989 on cluster recovery.


Model Architecture

Academic [5] + Behavioral [5] + Activity [5]
       ↓              ↓               ↓
  Encoder A      Encoder B       Encoder C
  (5→128→64)    (5→128→64)      (5→128→64)
       └──────────────┴────────────────┘
                       ↓
         CrossModalAttentionFusion
         - Concat [64,64,64] → 192
         - Per-modality attention scores
         - Softmax → weights sum to 1.0
         - Project 192 → 128
                       ↓
           unified_embedding [128]
           attention_weights [3]   ← explainability

  • Parameters: 66,243 (encoder only)
  • Training: SimCLR contrastive learning, 184 epochs, RTX 3050
  • Loss: NT-Xent (temperature=0.07)
  • Batch size: 128 with 256 negatives per step
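The fusion step in the diagram can be sketched as follows. This is a hypothetical reconstruction from the diagram alone: the module name `CrossModalAttentionFusion` and the 192→128 projection come from the card, but the exact scoring network is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Fuse three 64-d modality embeddings into one 128-d embedding.

    Sketch of the card's diagram: concat -> per-modality attention
    scores -> softmax (weights sum to 1) -> project 192 -> 128.
    The linear scoring head is an assumption, not the released code.
    """
    def __init__(self, dim=64, n_modalities=3, out_dim=128):
        super().__init__()
        self.score = nn.Linear(dim * n_modalities, n_modalities)  # per-modality scores
        self.proj = nn.Linear(dim * n_modalities, out_dim)        # 192 -> 128

    def forward(self, za, zb, zc):
        concat = torch.cat([za, zb, zc], dim=-1)             # [B, 192]
        weights = torch.softmax(self.score(concat), dim=-1)  # [B, 3], sums to 1.0
        weighted = torch.cat(
            [w.unsqueeze(-1) * z
             for w, z in zip(weights.unbind(-1), (za, zb, zc))],
            dim=-1,
        )                                                    # [B, 192], re-weighted
        return self.proj(weighted), weights                  # [B, 128], [B, 3]
```

The returned `weights` tensor is what the Quick Start example prints as the per-sample explainability signal.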

Results

Metric                  Score
NT-Xent Loss            0.5869
Silhouette Score        0.3310
Adjusted Rand Index     0.9989

Near-perfect unsupervised recovery of the 4 hidden student profiles from 5000 samples, with zero labels used during training.
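The ARI above can in principle be reproduced by clustering the frozen embeddings and comparing against the hidden profile labels. The card does not state which clustering algorithm was used; KMeans with k=4 is an assumption in this sketch.

```python
# Hypothetical evaluation sketch: cluster frozen embeddings, then score
# against the hidden profile labels. KMeans (k=4) is assumed here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def cluster_recovery(embeddings: np.ndarray, true_profiles: np.ndarray, k: int = 4):
    """Return (ARI, silhouette) for KMeans cluster assignments."""
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return adjusted_rand_score(true_profiles, pred), silhouette_score(embeddings, pred)
```

ARI is label-permutation invariant, so it does not matter which cluster index KMeans assigns to which profile.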


Quick Start

import torch
from huggingface_hub import hf_hub_download
from modeling_multimodal import MultiModalFramework

# Load model
model = MultiModalFramework.from_pretrained("YOUR_HF_USERNAME/multimodal-representation-framework")
model.eval()

# Example: single student
academic   = torch.tensor([[3.7, 92.0, 90.0, 85.0, 1.0]])   # gpa, attendance%, assignment%, exam_avg, late
behavioral = torch.tensor([[5.0, 90.0, 6.0,  8.0,  2.0]])   # library, session_min, peer, forum, login_var
activity   = torch.tensor([[9000.0, 7.5, 60.0, 5.0, 62.0]]) # steps, sleep, active_min, sedentary, hr

with torch.no_grad():
    embedding, attn = model(academic, behavioral, activity)

print(f"Embedding shape : {embedding.shape}")        # [1, 128]
print(f"Attn weights    : {attn.numpy().round(3)}")  # [academic, behavioral, activity]

Modality Attention Weights

The model produces per-sample attention weights explaining which modality contributed most to the unified embedding.

Overall contribution across 5000 students:

  • Activity: 49.1%
  • Behavioral: 29.1%
  • Academic: 21.8%

Per-profile insights:

  • Social Learner relies heavily on Activity (0.60)
  • Quiet Worker relies on Behavioral (0.36)
  • High Achiever shows balanced attention across all modalities
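Per-profile figures like those above can be obtained by averaging the [N, 3] attention matrix over each profile. `mean_attention_by_profile` is a hypothetical helper, not part of the released code:

```python
import numpy as np

def mean_attention_by_profile(attn: np.ndarray, profiles: np.ndarray) -> dict:
    """Average [N, 3] attention weights (academic, behavioral, activity)
    over each profile label. Hypothetical analysis helper."""
    return {p: attn[profiles == p].mean(axis=0) for p in np.unique(profiles)}
```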

Application to Wearable Sensor Fusion

This framework directly addresses the multi-modal fusion problem in wearable health tech. Replace tabular encoders with 1D-CNN/LSTM encoders to handle:

This Model            Wearable Application
Academic modality     EEG signals
Behavioral modality   EMG signals
Activity modality     IMU + PPG
Student profiles      Human activity states
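As a sketch of the encoder substitution described above, a 1D-CNN could map a windowed sensor stream (e.g. a 6-channel IMU) into the same 64-d space the fusion layer expects. `Conv1DEncoder` and its layer sizes are illustrative assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class Conv1DEncoder(nn.Module):
    """Hypothetical drop-in replacement for a tabular encoder: maps a
    [B, channels, T] sensor window (e.g. IMU) to the 64-d embedding
    the fusion layer expects, so the rest of the model is unchanged."""
    def __init__(self, in_channels=6, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # global average pool over time
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc(self.net(x).squeeze(-1))  # [B, 64]
```

Because the output dimensionality matches the tabular encoders, the fusion module and contrastive loss carry over without modification.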

Training Details

  • Dataset: Synthetic β€” 5000 samples, 4 hidden profiles
  • Augmentation: Gaussian noise (σ=0.15) + 5% feature dropout
  • Optimizer: Adam (lr=1e-3, weight_decay=1e-4)
  • LR Schedule: 10-epoch warmup + cosine decay
  • Early stopping: Patience=30
  • Hardware: NVIDIA RTX 3050 4GB
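The augmentation and loss settings above can be sketched as follows. This is a minimal reconstruction, not the released training code; the exact view pairing and reduction details are assumptions.

```python
import torch
import torch.nn.functional as F

def augment(x, sigma=0.15, drop_p=0.05):
    """One SimCLR view of a tabular batch: additive Gaussian noise
    (sigma=0.15) plus 5% random feature dropout, per the card."""
    noisy = x + sigma * torch.randn_like(x)
    mask = (torch.rand_like(x) > drop_p).float()
    return noisy * mask

def nt_xent(z1, z2, temperature=0.07):
    """NT-Xent over a batch of positive pairs (z1[i], z2[i])."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2N, D], unit norm
    sim = z @ z.t() / temperature                       # [2N, 2N] cosine / tau
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)                # positive = matching view
```

In training, each batch would be augmented twice, encoded, fused, and scored with `nt_xent(view1, view2)`.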