---
license: mit
datasets:
- wisdm
language:
- en
tags:
- human-activity-recognition
- imu
- self-supervised-learning
- transformer
- time-series
- accelerometer
- gyroscope
- pytorch
pipeline_tag: feature-extraction
---
# IMU-SelfSupEncoder-v1
A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. Trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) with a masked-prediction objective, SupCon contrastive learning, and LMM frequency-domain loss.
## Model Overview
- **Architecture**: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
- **Input**: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
- **Output**: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
- **Params**: ~1.4M
- **Training**: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss
## Usage
```python
import torch
from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)
with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out: (8, 192), use for classification
# patch_out: (8, 20, 192), per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
```
### Linear Probe Example
```python
import torch
import torch.nn as nn

# Freeze encoder
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract features and train classifier
# imu_windows: float tensor of shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)
```
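The snippet above stops once the features are extracted. A minimal sketch of the remaining training step could look like the following. The random `cls_features` and `labels` tensors are stand-ins for real encoder outputs and WISDM labels, and the optimizer, learning rate, batch size, and epoch count are illustrative assumptions, not the settings used for this model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data (assumption): in practice cls_features comes from model.encode(...)
cls_features = torch.randn(256, 192)     # (N, 192) frozen CLS features
labels = torch.randint(0, 18, (256,))    # (N,) activity class ids

classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),
)
# Hypothetical probe hyperparameters, chosen only for illustration
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

classifier.train()
for epoch in range(5):
    for start in range(0, cls_features.size(0), 64):
        xb = cls_features[start:start + 64]
        yb = labels[start:start + 64]
        optimizer.zero_grad()
        loss = criterion(classifier(xb), yb)
        loss.backward()
        optimizer.step()

classifier.eval()
with torch.no_grad():
    logits = classifier(cls_features)    # (N, 18)
```

Because the encoder stays frozen, features can be computed once and cached, which keeps probe training cheap.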
## Training Details
### Pretraining Objective
The model was trained with a self-supervised masked prediction approach:
1. **x-encoder (student)**: Sees masked input with [MASK] tokens at masked positions
2. **y-encoder (teacher)**: Sees full original signal, EMA-updated from x-encoder
3. **Predictor**: x-encoder output → predicts y-encoder representations at masked positions
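
The three components above can be sketched roughly as follows. This is a simplified illustration, not the released training code: `student`, `teacher`, and `predictor` are tiny linear stand-ins for the real Transformer modules, the mask token is a fixed zero vector applied to the input rather than a learned embedding, and all function names are hypothetical.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-ins (hypothetical); the real x-/y-encoders are Transformers.
student = nn.Linear(16, 8)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False          # teacher is updated by EMA only
predictor = nn.Linear(8, 8)

@torch.no_grad()
def ema_update(teacher, student, tau):
    """teacher <- tau * teacher + (1 - tau) * student"""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(tau).add_(s.detach(), alpha=1.0 - tau)

def masked_prediction_loss(x, mask, mask_token):
    # x: (B, P, D_in) patch inputs; mask: (B, P) bool, True = masked
    x_masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
    pred = predictor(student(x_masked))   # student sees the masked input
    with torch.no_grad():
        target = teacher(x)               # teacher sees the full input
    # MSE only at masked positions
    return F.mse_loss(pred[mask], target[mask])

x = torch.randn(4, 20, 16)
mask = torch.rand(4, 20) < 0.5
mask_token = torch.zeros(16)

loss = masked_prediction_loss(x, mask, mask_token)
loss.backward()                           # gradients reach student and predictor only
ema_update(teacher, student, tau=0.999)
```

In a real loop, the optimizer step on the student and predictor would run before `ema_update`.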
### Masking Strategy
Multi-mask with 4 views per sample:
| Mask Type | Probability | Description |
|-----------|-------------|-------------|
| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
| Channel | 25% | Mask 1-2 of 6 sensor channels |
| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |
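
As a rough illustration of the table above, the mask type for each of the 4 views could be drawn like this. The function and dictionary keys are hypothetical, and the sketch only covers choosing the mask type and, for channel masks, which channels to hide; block placement and the frequency-bin bias are left out.

```python
import random

MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.50, 0.25, 0.25]   # probabilities from the table
NUM_VIEWS = 4
NUM_CHANNELS = 6

def sample_mask_specs(rng):
    """Draw one mask description per view (hypothetical helper)."""
    specs = []
    for _ in range(NUM_VIEWS):
        kind = rng.choices(MASK_TYPES, weights=MASK_PROBS, k=1)[0]
        spec = {"type": kind}
        if kind == "channel":
            n = rng.randint(1, 2)                      # mask 1-2 of 6 channels
            spec["channels"] = sorted(rng.sample(range(NUM_CHANNELS), n))
        elif kind == "frequency":
            spec["bin_fraction"] = 0.30                # 30% of STFT bins
        specs.append(spec)
    return specs

specs = sample_mask_specs(random.Random(0))
```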
### Loss Functions
| Component | Weight | Purpose |
|-----------|--------|---------|
| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
| L_sigreg | adaptive | Prevent representation collapse |
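
Assuming the components are combined as a plain weighted sum (the card does not state the exact combination or how the adaptive weight is computed), the total loss could be sketched as below. `sigreg_weight` is a hypothetical parameter standing in for the adaptive weight.

```python
def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, sigreg_weight):
    """Weighted sum using the fixed weights from the table.

    sigreg_weight is supplied by the caller because the table only
    describes it as adaptive.
    """
    return (1.0 * l_pred
            + 0.1 * l_lmm
            + 0.15 * l_supcon
            + sigreg_weight * l_sigreg)

example = total_loss(l_pred=1.0, l_lmm=2.0, l_supcon=3.0,
                     l_sigreg=4.0, sigreg_weight=0.5)
```

The same function works unchanged with PyTorch scalar tensors in place of floats.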
### Dataset
- **Source**: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
- **Subjects**: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
- **Activities**: 18 classes, including walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, and folding clothes
- **Sampling**: 20 Hz, 10-second windows with 2.5s stride (75% overlap)
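
At 20 Hz, a 10-second window is 200 samples and a 2.5 s stride is 50 samples. A minimal NumPy sketch of that segmentation follows; the function name and the `(T, 6)` input layout are assumptions, and the random array stands in for a real recording.

```python
import numpy as np

def sliding_windows(recording, window=200, stride=50):
    """Cut a (T, 6) recording into windows of shape (N, 6, window)."""
    starts = range(0, recording.shape[0] - window + 1, stride)
    if len(starts) == 0:
        # Recording shorter than one window: return an empty batch
        return np.empty((0, recording.shape[1], window), dtype=recording.dtype)
    # Transpose each slice to channels-first, matching the model input
    return np.stack([recording[s:s + window].T for s in starts])

rng = np.random.default_rng(0)
recording = rng.standard_normal((1000, 6))   # 50 s of stand-in 6-channel data
windows = sliding_windows(recording)          # shape (17, 6, 200)
```

Consecutive windows share 150 of 200 samples, which is the 75% overlap listed above.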
### Training Config
| Parameter | Value |
|-----------|-------|
| Epochs | 12 |
| Batch size | 128 |
| Learning rate | 3e-4 (cosine to 1e-5) |
| Warmup epochs | 2 |
| Optimizer | AdamW (weight_decay=0.05) |
| EMA tau | 0.999 → 0.9999 (cosine) |
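
The two schedules in the table could be sketched as follows. The linear warmup from zero and the per-epoch granularity are assumptions, since the card only gives the endpoints and the warmup length; the function names are hypothetical.

```python
import math

def learning_rate(epoch, total_epochs=12, warmup_epochs=2,
                  base_lr=3e-4, min_lr=1e-5):
    """Linear warmup (assumed) followed by cosine decay to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_tau(progress, tau_start=0.999, tau_end=0.9999):
    """Cosine ramp of the EMA momentum; progress runs from 0 to 1."""
    return tau_end - (tau_end - tau_start) * 0.5 * (1.0 + math.cos(math.pi * progress))

lrs = [learning_rate(e) for e in range(13)]
taus = [ema_tau(e / 12) for e in range(13)]
```

`epoch` may also be fractional (for example `step / steps_per_epoch`) to get a per-step schedule.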
## Model Architecture
```
Input: (B, 6, 200)
    │
    ├── Conv1d Stem (6→96, kernel=10, stride=10)
    │   └── Time tokens: (B, 20, 96)
    │
    ├── Per-patch FFT → Linear
    │   └── Freq tokens: (B, 20, 96)
    │
    ├── Concat + Fusion → (B, 20, 192)
    │
    ├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
    │
    ├── Position Embedding (learned, 21 positions)
    │
    └── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
        ├── Layer 2 → intermediate output
        ├── Layer 4 → intermediate output
        └── CLS token + 20 patch tokens + global_freq token
```
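To make the token shapes in the diagram concrete, here is a shape-only sketch of the embedding stage. The layer sizes follow the diagram, but the FFT featurization (rFFT magnitudes flattened across channels) and the fusion layer (a single linear map after concatenation) are assumptions; the released model may compute these differently.

```python
import torch
import torch.nn as nn

B, C, T, P = 8, 6, 200, 10   # batch, channels, timesteps, patch length
x = torch.randn(B, C, T)

# Time branch: Conv1d stem with kernel = stride = patch length
stem = nn.Conv1d(C, 96, kernel_size=P, stride=P)
time_tokens = stem(x).transpose(1, 2)               # (B, 20, 96)

# Frequency branch (assumed featurization): rFFT magnitude of each patch
patches = x.unfold(dimension=2, size=P, step=P)     # (B, 6, 20, 10)
spec = torch.fft.rfft(patches, dim=-1).abs()        # (B, 6, 20, 6)
spec = spec.permute(0, 2, 1, 3).flatten(2)          # (B, 20, 36)
freq_proj = nn.Linear(spec.shape[-1], 96)
freq_tokens = freq_proj(spec)                       # (B, 20, 96)

# Fusion (assumed): concatenate both branches, then a linear layer
fusion = nn.Linear(192, 192)
tokens = fusion(torch.cat([time_tokens, freq_tokens], dim=-1))  # (B, 20, 192)

# Global frequency token from the full 200-point signal
global_spec = torch.fft.rfft(x, dim=-1).abs().flatten(1)        # (B, 6 * 101)
global_proj = nn.Linear(global_spec.shape[-1], 192)
global_freq = global_proj(global_spec).unsqueeze(1)             # (B, 1, 192)
```

With `kernel_size == stride`, the Conv1d stem is equivalent to a linear projection applied to each non-overlapping 10-sample patch, which is why 200 timesteps yield exactly 20 tokens.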
## Citation
```bibtex
@misc{imu-selfsup-encoder,
author = {Li, Yu},
title = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
year = {2026},
url = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}
@article{weiss2019wisdm,
title={Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
author={Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
journal={IEEE Access},
volume={7},
pages={133190--133202},
year={2019},
publisher={IEEE}
}
```