---
license: mit
datasets:
- wisdm
language:
- en
tags:
- human-activity-recognition
- imu
- self-supervised-learning
- transformer
- time-series
- accelerometer
- gyroscope
- pytorch
pipeline_tag: feature-extraction
---

# IMU-SelfSupEncoder-v1

A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. It was trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) with a masked-prediction objective, a supervised contrastive (SupCon) loss, and an LMM frequency-domain loss.

## Model Overview

- **Architecture**: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
- **Input**: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
- **Output**: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
- **Params**: ~1.4M
- **Training**: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss

## Usage

```python
import torch

from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)

with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out:       (8, 192)      -> use for classification
# patch_out:     (8, 20, 192)  -> per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
```

### Linear Probe Example

```python
import torch
import torch.nn as nn

# Freeze the encoder so that only the classifier is trained
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on the CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract features, then train the classifier on them.
# imu_windows: float tensor of shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)
```
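
Once the CLS features are extracted, the classifier can be trained like any small PyTorch model. The loop below is a minimal sketch, not the recipe used to evaluate this model: the random `cls_features` and `labels` tensors are placeholders for real encoder outputs and WISDM labels, and the optimizer, learning rate, batch size, and epoch count are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder data standing in for encoder outputs and activity labels.
cls_features = torch.randn(512, 192)        # (N, 192), e.g. from model.encode(...)
labels = torch.randint(0, 18, (512,))       # (N,), integer class ids in [0, 18)

classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),
)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)  # assumed lr
criterion = nn.CrossEntropyLoss()

classifier.train()
losses = []
for epoch in range(5):                      # assumed epoch count
    perm = torch.randperm(cls_features.size(0))
    epoch_loss = 0.0
    for i in range(0, perm.numel(), 128):   # assumed batch size
        idx = perm[i:i + 128]
        optimizer.zero_grad()
        loss = criterion(classifier(cls_features[idx]), labels[idx])
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * idx.numel()
    losses.append(epoch_loss / perm.numel())  # mean loss over the epoch

# Switch off dropout before predicting.
classifier.eval()
with torch.no_grad():
    preds = classifier(cls_features).argmax(dim=1)  # (N,), predicted class ids
```

Because the encoder is frozen, features only need to be computed once and can be cached, which keeps probe training cheap.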

## Training Details

### Pretraining Objective

The model was trained with a self-supervised masked-prediction approach:

1. **x-encoder (student)**: Sees the masked input, with [MASK] tokens at masked positions
2. **y-encoder (teacher)**: Sees the full original signal; its weights are an EMA of the x-encoder's
3. **Predictor**: Takes the x-encoder output and predicts the y-encoder representations at masked positions
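
The three roles above can be sketched in a few lines of PyTorch. This is an illustration under stated assumptions, not the released training code: the `nn.Linear` layers are hypothetical stand-ins for the Transformer encoders and the predictor, and `ema_update`, `masked_prediction_loss`, and the zero-valued `mask_token` are names and choices invented for the example.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: the real x-/y-encoders are Transformers.
student = nn.Linear(192, 192)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False          # the teacher is never trained by gradients
predictor = nn.Linear(192, 192)


@torch.no_grad()
def ema_update(teacher, student, tau):
    """teacher <- tau * teacher + (1 - tau) * student, parameter by parameter."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(tau).add_(s, alpha=1.0 - tau)


def masked_prediction_loss(tokens, mask, mask_token):
    """MSE between predicted and teacher representations at masked positions."""
    # The student sees the input with masked positions replaced by [MASK].
    student_in = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
    pred = predictor(student(student_in))
    with torch.no_grad():
        target = teacher(tokens)     # the teacher sees the full signal
    return F.mse_loss(pred[mask], target[mask])


tokens = torch.randn(8, 20, 192)               # (batch, patches, dim)
mask = torch.zeros(8, 20, dtype=torch.bool)
mask[:, 5:12] = True                           # mask one block of patches
mask_token = torch.zeros(192)

loss = masked_prediction_loss(tokens, mask, mask_token)
loss.backward()                                # reaches student and predictor only
ema_update(teacher, student, tau=0.999)
```

In a real loop the optimizer step on the student and predictor would come before `ema_update`, so the teacher trails the student by a slowly moving average.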

### Masking Strategy

Multi-mask with 4 views per sample:

| Mask Type | Probability | Description |
|-----------|-------------|-------------|
| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
| Channel | 25% | Mask 1-2 of 6 sensor channels |
| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |
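
One way to read the table is that each of the 4 views draws a mask type with the listed probabilities and then draws that type's parameters. The sketch below follows that reading using only the standard library. Several details are assumptions: the function names are hypothetical, block-size ranges are chosen uniformly, a block longer than the sequence is clipped to fit, and the selection of frequency bins (including the mid-high bias) is left out.

```python
import random

MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.50, 0.25, 0.25]                  # probabilities from the table
TIME_BLOCK_RANGES = [(3, 8), (10, 18), (20, 30)]
NUM_CHANNELS = 6
NUM_VIEWS = 4


def sample_view(rng, num_positions):
    """Draw one mask specification for a single view."""
    kind = rng.choices(MASK_TYPES, weights=MASK_PROBS, k=1)[0]
    if kind == "time_block":
        lo, hi = rng.choice(TIME_BLOCK_RANGES)
        # Assumption: clip the block so that it fits inside the sequence.
        length = min(rng.randint(lo, hi), num_positions)
        start = rng.randint(0, num_positions - length)
        return {"type": kind, "start": start, "length": length}
    if kind == "channel":
        k = rng.randint(1, 2)                    # mask 1 or 2 channels
        return {"type": kind, "channels": sorted(rng.sample(range(NUM_CHANNELS), k))}
    # Frequency masking: only the ratio is recorded in this sketch.
    return {"type": kind, "ratio": 0.30}


def sample_views(seed, num_positions):
    """Draw NUM_VIEWS independent mask specifications for one sample."""
    rng = random.Random(seed)
    return [sample_view(rng, num_positions) for _ in range(NUM_VIEWS)]


views = sample_views(seed=0, num_positions=20)
```

Seeding a local `random.Random` makes the sampled views reproducible without touching global random state.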

### Loss Functions

| Component | Weight | Purpose |
|-----------|--------|---------|
| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
| L_sigreg | adaptive | Prevent representation collapse |
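
Assuming the components are combined as a plain weighted sum, the total loss can be written as below. The fixed weights come from the table; `total_loss` and its arguments are hypothetical names, and the L_sigreg weight is passed in by the caller because the table only describes it as adaptive.

```python
LOSS_WEIGHTS = {"pred": 1.0, "lmm": 0.1, "supcon": 0.15}


def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, sigreg_weight):
    """Weighted sum of the four loss components.

    sigreg_weight is supplied by the caller; how it adapts during
    training is not specified in this document.
    """
    return (
        LOSS_WEIGHTS["pred"] * l_pred
        + LOSS_WEIGHTS["lmm"] * l_lmm
        + LOSS_WEIGHTS["supcon"] * l_supcon
        + sigreg_weight * l_sigreg
    )


# Illustrative values only, not measured losses.
example = total_loss(l_pred=0.5, l_lmm=2.0, l_supcon=1.0, l_sigreg=0.4, sigreg_weight=0.05)
```

The function uses only arithmetic operators, so it works unchanged on Python floats or on scalar tensors.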

### Dataset

- **Source**: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
- **Subjects**: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
- **Activities**: 18 classes, including walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, and folding clothes
- **Sampling**: 20 Hz, 10-second windows with a 2.5 s stride (75% overlap)
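
The sampling figures translate directly into window arithmetic: 10 s at 20 Hz is 200 samples, and a 2.5 s stride is 50 samples. The helper names below are hypothetical, and dropping a trailing partial window is an assumption.

```python
SAMPLE_RATE_HZ = 20
WINDOW_SECONDS = 10.0
STRIDE_SECONDS = 2.5

WINDOW = int(SAMPLE_RATE_HZ * WINDOW_SECONDS)   # 200 samples
STRIDE = int(SAMPLE_RATE_HZ * STRIDE_SECONDS)   # 50 samples


def window_starts(num_samples, window=WINDOW, stride=STRIDE):
    """Start indices of every full window in a recording of num_samples."""
    if num_samples < window:
        return []
    return list(range(0, num_samples - window + 1, stride))


def slice_windows(recording, window=WINDOW, stride=STRIDE):
    """Split a recording (one 6-value reading per timestep) into full windows."""
    return [recording[s:s + window] for s in window_starts(len(recording), window, stride)]


overlap = 1 - STRIDE / WINDOW                   # fraction shared by adjacent windows
starts = window_starts(60 * SAMPLE_RATE_HZ)     # one minute of data, 1200 samples
```

With 75% overlap, neighbouring windows share most of their samples, which is one reason to split train, validation, and test sets by subject rather than by window.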

### Training Config

| Parameter | Value |
|-----------|-------|
| Epochs | 12 |
| Batch size | 128 |
| Learning rate | 3e-4 (cosine decay to 1e-5) |
| Warmup epochs | 2 |
| Optimizer | AdamW (weight_decay=0.05) |
| EMA tau | 0.999 → 0.9999 (cosine) |
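
The two schedules in the table can be sketched as functions of training progress. The endpoints and epoch counts come from the table; the linear shape of the warmup, the decision to start the cosine decay right after warmup, and the function names are assumptions.

```python
import math

EPOCHS = 12
WARMUP_EPOCHS = 2
BASE_LR, MIN_LR = 3e-4, 1e-5
TAU_START, TAU_END = 0.999, 0.9999


def learning_rate(epoch_progress):
    """LR at a fractional epoch in [0, EPOCHS]; linear warmup is an assumption."""
    if epoch_progress < WARMUP_EPOCHS:
        return BASE_LR * epoch_progress / WARMUP_EPOCHS
    t = (epoch_progress - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * t))


def ema_tau(epoch_progress):
    """Cosine ramp of the teacher momentum from TAU_START up to TAU_END."""
    t = epoch_progress / EPOCHS
    return TAU_END - 0.5 * (TAU_END - TAU_START) * (1 + math.cos(math.pi * t))


# Print both schedules at the start of each epoch and at the end of training.
for epoch in range(EPOCHS + 1):
    print(f"epoch {epoch:2d}  lr={learning_rate(epoch):.2e}  tau={ema_tau(epoch):.5f}")
```

Raising tau toward 0.9999 late in training makes the teacher change more slowly as the student converges.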

## Model Architecture

```
Input: (B, 6, 200)
│
├── Conv1d Stem (6→96, kernel=10, stride=10)
│   └── Time tokens: (B, 20, 96)
│
├── Per-patch FFT → Linear
│   └── Freq tokens: (B, 20, 96)
│
├── Concat + Fusion → (B, 20, 192)
│
├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
│
├── Position Embedding (learned, 21 positions)
│
└── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
    ├── Layer 2 → intermediate output
    ├── Layer 4 → intermediate output
    └── CLS token + 20 patch tokens + global_freq token
```
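
To make the token shapes in the diagram concrete, here is a hedged sketch of the time and frequency patch branches up to the fused `(B, 20, 192)` tokens. Only the Conv1d hyperparameters and the tensor shapes come from the diagram. The class name `TimeFreqStem`, the use of rFFT magnitudes as frequency features, and the single `nn.Linear` fusion layer are assumptions, and the global FFT token, CLS token, and position embedding are omitted.

```python
import torch
import torch.nn as nn

PATCH = 10  # samples per patch: 200 timesteps -> 20 patches


class TimeFreqStem(nn.Module):
    """Sketch of the time/frequency patch tokenizer (details are assumed)."""

    def __init__(self, channels=6, half_dim=96):
        super().__init__()
        self.conv = nn.Conv1d(channels, half_dim, kernel_size=PATCH, stride=PATCH)
        n_bins = PATCH // 2 + 1                        # rFFT bins of a 10-sample patch
        self.freq_proj = nn.Linear(channels * n_bins, half_dim)
        self.fusion = nn.Linear(2 * half_dim, 2 * half_dim)  # hypothetical fusion

    def forward(self, x):                              # x: (B, 6, 200)
        time_tokens = self.conv(x).transpose(1, 2)     # (B, 20, 96)
        patches = x.unfold(2, PATCH, PATCH)            # (B, 6, 20, 10)
        spectrum = torch.fft.rfft(patches, dim=-1).abs()      # (B, 6, 20, 6)
        spectrum = spectrum.permute(0, 2, 1, 3).flatten(2)    # (B, 20, 36)
        freq_tokens = self.freq_proj(spectrum)         # (B, 20, 96)
        fused = torch.cat([time_tokens, freq_tokens], dim=-1)
        return self.fusion(fused)                      # (B, 20, 192)


tokens = TimeFreqStem()(torch.randn(8, 6, 200))
```

Because the kernel size equals the stride, the Conv1d patches do not overlap, so each of the 20 time tokens covers exactly the same 0.5 s of signal as its frequency token.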

## Citation

```bibtex
@misc{imu-selfsup-encoder,
  author = {Li, Yu},
  title  = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
  year   = {2026},
  url    = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}

@article{weiss2019wisdm,
  title     = {Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
  author    = {Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
  journal   = {IEEE Access},
  volume    = {7},
  pages     = {133190--133202},
  year      = {2019},
  publisher = {IEEE}
}
```