---
license: mit
datasets:
- wisdm
language:
- en
tags:
- human-activity-recognition
- imu
- self-supervised-learning
- transformer
- time-series
- accelerometer
- gyroscope
- pytorch
pipeline_tag: feature-extraction
---
# IMU-SelfSupEncoder-v1
A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. Trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) by combining a masked-prediction objective, a supervised contrastive (SupCon) loss, and an LMM frequency-domain loss.
## Model Overview
- **Architecture**: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
- **Input**: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
- **Output**: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
- **Params**: ~1.4M
- **Training**: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss
## Usage
```python
import torch
from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)
with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out: (8, 192), use for classification
# patch_out: (8, 20, 192), per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
```
### Linear Probe Example
```python
import torch
import torch.nn as nn

# Freeze encoder
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract features and train classifier
# imu_windows: float tensor of shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)
```
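The snippet above stops at feature extraction. A minimal sketch of the remaining step, fitting the classifier on cached CLS features, might look like the following. Random tensors stand in for `cls_features` and the labels, and the learning rate, batch size, and epoch count are illustrative assumptions rather than values from this card.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for real data: in practice these would be the frozen encoder's
# CLS features and the integer activity labels of the same windows.
cls_features = torch.randn(256, 192)
labels = torch.randint(0, 18, (256,))

classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),
)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)  # assumed lr
criterion = nn.CrossEntropyLoss()

classifier.train()
losses = []
for epoch in range(5):  # assumed epoch count
    for start in range(0, cls_features.size(0), 64):  # assumed batch size
        xb = cls_features[start:start + 64]
        yb = labels[start:start + 64]
        optimizer.zero_grad()
        loss = criterion(classifier(xb), yb)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

# Switch off dropout before computing predictions
classifier.eval()
with torch.no_grad():
    logits = classifier(cls_features)  # (256, 18)
```

Because the encoder is frozen, features only need to be extracted once and can be reused across probe runs.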
## Training Details
### Pretraining Objective
The model was trained with a self-supervised masked prediction approach:
1. **x-encoder (student)**: Sees masked input with [MASK] tokens at masked positions
2. **y-encoder (teacher)**: Sees full original signal, EMA-updated from x-encoder
3. **Predictor**: x-encoder output → predicts y-encoder representations at masked positions
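The EMA relationship between the two encoders can be sketched as below. This is an illustration of a standard exponential-moving-average update, not the training code of this model; the `ema_update` helper and the small `nn.Linear` stand-in for the x-encoder are hypothetical.

```python
import copy
import torch
import torch.nn as nn

def ema_update(teacher, student, tau):
    """In-place update: teacher <- tau * teacher + (1 - tau) * student."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(tau).add_(s_param, alpha=1.0 - tau)

student = nn.Linear(4, 4)         # hypothetical stand-in for the x-encoder
teacher = copy.deepcopy(student)  # y-encoder starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad = False       # the teacher receives no gradients

# Simulate an optimizer step that moves the student weights
with torch.no_grad():
    for p in student.parameters():
        p.add_(1.0)

before = teacher.weight.clone()
ema_update(teacher, student, tau=0.999)
```

With `tau` close to 1 the teacher changes slowly, which keeps the prediction targets stable while the student trains.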
### Masking Strategy
Multi-mask with 4 views per sample:
| Mask Type | Probability | Description |
|-----------|-------------|-------------|
| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
| Channel | 25% | Mask 1-2 of 6 sensor channels |
| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |
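As a rough illustration of the table, the sketch below samples a mask type for each of the 4 views and builds a time-block mask and a channel mask. The function names are hypothetical, only the shortest block range (3-8 patches) is shown, and the frequency mask is omitted because the card does not specify the STFT layout.

```python
import numpy as np

MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.5, 0.25, 0.25]

def sample_mask_types(rng, num_views=4):
    """Draw one mask type per view using the probabilities in the table."""
    return [str(t) for t in rng.choice(MASK_TYPES, size=num_views, p=MASK_PROBS)]

def time_block_mask(rng, num_patches=20, min_len=3, max_len=8):
    """Mask one contiguous block of patches (True = masked)."""
    length = int(rng.integers(min_len, max_len + 1))
    start = int(rng.integers(0, num_patches - length + 1))
    mask = np.zeros(num_patches, dtype=bool)
    mask[start:start + length] = True
    return mask

def channel_mask(rng, num_channels=6):
    """Mask 1 or 2 of the sensor channels (True = masked)."""
    k = int(rng.integers(1, 3))
    mask = np.zeros(num_channels, dtype=bool)
    mask[rng.choice(num_channels, size=k, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
views = sample_mask_types(rng)  # four mask-type names, one per view
t_mask = time_block_mask(rng)   # shape (20,)
c_mask = channel_mask(rng)      # shape (6,)
```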
### Loss Functions
| Component | Weight | Purpose |
|-----------|--------|---------|
| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
| L_sigreg | adaptive | Prevent representation collapse |
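A minimal sketch of how these terms could be combined, using the weights from the table. Only the masked MSE term is implemented here; the LMM, SupCon, and SigReg terms are passed in as precomputed scalars, and the SigReg weight is an argument because its adaptive schedule is not described in this card. Both function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_mse(pred, target, mask):
    """MSE over masked patches only.

    pred, target: (B, N, D); mask: (B, N) bool, True = masked patch.
    """
    return F.mse_loss(pred[mask], target[mask])

def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, w_sigreg):
    """Weighted sum using the fixed weights listed in the table."""
    return 1.0 * l_pred + 0.1 * l_lmm + 0.15 * l_supcon + w_sigreg * l_sigreg

# Toy tensors: the student predicts zeros, the teacher target is all ones
pred = torch.zeros(2, 20, 192)
target = torch.ones(2, 20, 192)
mask = torch.zeros(2, 20, dtype=torch.bool)
mask[:, 5:10] = True  # patches 5..9 are masked in every sample

l_pred = masked_mse(pred, target, mask)
loss = total_loss(l_pred, torch.tensor(1.0), torch.tensor(2.0),
                  torch.tensor(4.0), w_sigreg=0.5)
```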
### Dataset
- **Source**: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
- **Subjects**: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
- **Activities**: 18 classes: walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking (5 classes), sports (3 classes), writing, clapping, folding clothes
- **Sampling**: 20 Hz, 10-second windows with 2.5s stride (75% overlap)
### Training Config
| Parameter | Value |
|-----------|-------|
| Epochs | 12 |
| Batch size | 128 |
| Learning rate | 3e-4 (cosine to 1e-5) |
| Warmup epochs | 2 |
| Optimizer | AdamW (weight_decay=0.05) |
| EMA tau | 0.999 → 0.9999 (cosine) |
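The two schedules in the table can be written as plain functions of the training step. This is a sketch under assumptions: the warmup is taken to be linear, schedules are stepped per iteration, and `steps_per_epoch = 100` is a placeholder; none of these details are stated in the card.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr=3e-4, min_lr=1e-5):
    """Linear warmup (assumed) followed by cosine decay from base_lr to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def tau_at(step, total_steps, tau_start=0.999, tau_end=0.9999):
    """Cosine ramp of the EMA coefficient from tau_start to tau_end."""
    progress = step / max(1, total_steps - 1)
    return tau_end - 0.5 * (tau_end - tau_start) * (1 + math.cos(math.pi * progress))

steps_per_epoch = 100                # placeholder, depends on dataset size
total_steps = 12 * steps_per_epoch   # 12 epochs
warmup_steps = 2 * steps_per_epoch   # 2 warmup epochs

lr_schedule = [lr_at(s, total_steps, warmup_steps) for s in range(total_steps)]
tau_schedule = [tau_at(s, total_steps) for s in range(total_steps)]
```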
## Model Architecture
```
Input: (B, 6, 200)
│
├── Conv1d Stem (6→96, kernel=10, stride=10)
│   └── Time tokens: (B, 20, 96)
│
├── Per-patch FFT → Linear
│   └── Freq tokens: (B, 20, 96)
│
├── Concat + Fusion → (B, 20, 192)
│
├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
│
├── Position Embedding (learned, 21 positions)
│
└── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
    ├── Layer 2 → intermediate output
    ├── Layer 4 → intermediate output
    └── CLS token + 20 patch tokens + global_freq token
```
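To make the tensor shapes in the diagram concrete, here is a hedged sketch of the time-frequency patch embedding. `TimeFreqPatchEmbed` is a hypothetical module, not the class shipped with this model: it assumes the per-patch FFT uses the magnitude spectrum and that "Fusion" is a single linear layer, neither of which the card specifies.

```python
import torch
import torch.nn as nn

class TimeFreqPatchEmbed(nn.Module):
    """Illustrative Conv-stem + per-patch FFT embedding (assumed design)."""

    def __init__(self, in_ch=6, patch=10, half_dim=96, dim=192):
        super().__init__()
        self.patch = patch
        # Non-overlapping patches: kernel == stride == patch length
        self.stem = nn.Conv1d(in_ch, half_dim, kernel_size=patch, stride=patch)
        n_bins = patch // 2 + 1  # number of rfft bins per patch
        self.freq_proj = nn.Linear(in_ch * n_bins, half_dim)
        self.fuse = nn.Linear(2 * half_dim, dim)  # assumed fusion layer

    def forward(self, x):                                # x: (B, 6, 200)
        time_tok = self.stem(x).transpose(1, 2)          # (B, 20, 96)
        patches = x.unfold(2, self.patch, self.patch)    # (B, 6, 20, 10)
        mag = torch.fft.rfft(patches, dim=-1).abs()      # (B, 6, 20, 6)
        mag = mag.permute(0, 2, 1, 3).flatten(2)         # (B, 20, 36)
        freq_tok = self.freq_proj(mag)                   # (B, 20, 96)
        fused = torch.cat([time_tok, freq_tok], dim=-1)  # (B, 20, 192)
        return self.fuse(fused)

embed = TimeFreqPatchEmbed()
tokens = embed(torch.randn(2, 6, 200))  # (2, 20, 192)
```

With `kernel=10` and `stride=10`, a 200-step window yields exactly 20 patch tokens, matching the 20 patch embeddings listed under Model Overview.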
## Citation
```bibtex
@misc{imu-selfsup-encoder,
author = {Li, Yu},
title = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
year = {2026},
url = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}
@article{weiss2019wisdm,
title={Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
author={Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
journal={IEEE Access},
volume={7},
pages={133190--133202},
year={2019},
publisher={IEEE}
}
```