---
license: mit
datasets:
- wisdm
language:
- en
tags:
- human-activity-recognition
- imu
- self-supervised-learning
- transformer
- time-series
- accelerometer
- gyroscope
- pytorch
pipeline_tag: feature-extraction
---

# IMU-SelfSupEncoder-v1

A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. Trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) with a masked-prediction objective, a SupCon contrastive loss, and an LMM frequency-domain loss.

## Model Overview

- **Architecture**: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
- **Input**: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
- **Output**: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
- **Params**: ~1.4M
- **Training**: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss

## Usage

```python
import torch

from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)

with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out: (8, 192), use for classification
# patch_out: (8, 20, 192), per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
```

### Linear Probe Example

```python
import torch
import torch.nn as nn

# Freeze encoder
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract features and train classifier
# imu_windows: your own batch of IMU windows, shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)
```

## Training Details

### Pretraining Objective

The model was trained with a self-supervised masked prediction approach:

1. **x-encoder (student)**: Sees the masked input, with [MASK] tokens at masked positions
2. **y-encoder (teacher)**: Sees the full original signal; its weights are an EMA of the x-encoder
3. **Predictor**: Takes the x-encoder output and predicts the y-encoder representations at masked positions

### Masking Strategy

Multi-mask with 4 views per sample:

| Mask Type | Probability | Description |
|-----------|-------------|-------------|
| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
| Channel | 25% | Mask 1-2 of 6 sensor channels |
| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |

### Loss Functions

| Component | Weight | Purpose |
|-----------|--------|---------|
| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
| L_sigreg | adaptive | Prevent representation collapse |

### Dataset

- **Source**: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
- **Subjects**: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
- **Activities**: 18 classes (walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, folding clothes)
- **Sampling**: 20 Hz, 10-second windows with 2.5 s stride (75% overlap)

### Training Config

| Parameter | Value |
|-----------|-------|
| Epochs | 12 |
| Batch size | 128 |
| Learning rate | 3e-4 (cosine to 1e-5) |
| Warmup epochs | 2 |
| Optimizer | AdamW (weight_decay=0.05) |
| EMA tau | 0.999 → 0.9999 (cosine) |

## Model Architecture

```
Input: (B, 6, 200)
    │
    ├── Conv1d Stem (6→96, kernel=10, stride=10)
    │   └── Time tokens: (B, 20, 96)
    │
    ├── Per-patch FFT → Linear
    │   └── Freq tokens: (B, 20, 96)
    │
    ├── Concat + Fusion → (B, 20, 192)
    │
    ├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
    │
    ├── Position Embedding (learned, 21 positions)
    │
    └── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
        ├── Layer 2 → intermediate output
        ├── Layer 4 → intermediate output
        └── CLS token + 20 patch tokens + global_freq token
```

## Citation

```bibtex
@misc{imu-selfsup-encoder,
  author = {Li, Yu},
  title  = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
  year   = {2026},
  url    = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}

@article{weiss2019wisdm,
  title     = {Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
  author    = {Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
  journal   = {IEEE Access},
  volume    = {7},
  pages     = {133190--133202},
  year      = {2019},
  publisher = {IEEE}
}
```
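
## Sketch: Window Segmentation

The Dataset section gives 20 Hz sampling, 10-second windows, and a 2.5 s stride. The snippet below is a minimal sketch of the index arithmetic those numbers imply, not the preprocessing code used for this model. The helper `window_starts` and the 60-second recording are hypothetical, and the patch-count line assumes the Conv1d stem uses no padding.

```python
def window_starts(num_samples: int, window: int = 200, stride: int = 50) -> list[int]:
    """Return the start index of every full window that fits in a recording."""
    if num_samples < window:
        return []
    return list(range(0, num_samples - window + 1, stride))


SAMPLE_RATE_HZ = 20
WINDOW = 10 * SAMPLE_RATE_HZ        # 10 s  -> 200 samples
STRIDE = int(2.5 * SAMPLE_RATE_HZ)  # 2.5 s -> 50 samples

# Hypothetical 60-second recording (1200 samples per channel).
starts = window_starts(60 * SAMPLE_RATE_HZ, WINDOW, STRIDE)

# Fraction of each window shared with the next one.
overlap = 1 - STRIDE / WINDOW

# Tokens produced by a Conv1d with kernel=10, stride=10 (assuming no padding).
num_patches = (WINDOW - 10) // 10 + 1

print(len(starts), overlap, num_patches)
```

Each start index `s` selects `signal[:, s:s + 200]`, which matches the `(6, 200)` per-sample input shape shown under Usage.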
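
## Sketch: Learning-Rate Schedule

The Training Config table lists a peak learning rate of 3e-4, cosine decay to 1e-5, 12 epochs, and 2 warmup epochs. One plausible reading of that schedule is sketched below. The linear warmup starting from zero, the per-epoch granularity, and the function name `lr_at_epoch` are assumptions; the actual training script may differ.

```python
import math


def lr_at_epoch(
    epoch: float,
    total_epochs: int = 12,
    warmup_epochs: int = 2,
    peak_lr: float = 3e-4,
    min_lr: float = 1e-5,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if epoch < warmup_epochs:
        # Assumed: warmup ramps linearly from 0 to peak_lr.
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for e in range(0, 13, 2):
    print(e, f"{lr_at_epoch(e):.2e}")
```

Passing fractional epochs (for example `step / steps_per_epoch`) gives a per-step schedule with the same shape.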
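
## Sketch: EMA Teacher Update

The y-encoder (teacher) is described as an EMA of the x-encoder (student), with tau moving from 0.999 to 0.9999 on a cosine schedule. The sketch below illustrates that idea under stated assumptions: the exact cosine form is a guess, `ema_tau` and `ema_update` are hypothetical names, and plain Python lists stand in for parameter tensors.

```python
import math


def ema_tau(step: int, total_steps: int,
            tau_start: float = 0.999, tau_end: float = 0.9999) -> float:
    """Cosine ramp from tau_start to tau_end over training (assumed form)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return tau_end - (tau_end - tau_start) * 0.5 * (1.0 + math.cos(math.pi * progress))


def ema_update(teacher: list[float], student: list[float], tau: float) -> list[float]:
    """teacher <- tau * teacher + (1 - tau) * student, element-wise."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy "weights": the teacher moves a small step toward the student.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, ema_tau(0, 1000))
print(teacher)
```

With real modules, the same update would be applied to each teacher parameter under `torch.no_grad()`, with no gradient flowing into the teacher.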