	---
	license: mit
	datasets:
	- wisdm
	language:
	- en
	tags:
	- human-activity-recognition
	- imu
	- self-supervised-learning
	- transformer
	- time-series
	- accelerometer
	- gyroscope
	- pytorch
	pipeline_tag: feature-extraction
	---

	# IMU-SelfSupEncoder-v1

	A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. It was trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) with a masked-prediction objective, a supervised contrastive (SupCon) loss, and an LMM frequency-domain loss.

	## Model Overview

	- Architecture: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
	- Input: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
	- Output: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
	- Params: ~1.4M
	- Training: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss

	## Usage

	```python
	import torch

	from modeling_imu_encoder import IMUMaskedEncoder

	model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
	model.eval()

	# Input: (batch, 6 channels, 200 timesteps)
	x = torch.randn(8, 6, 200)

	# Inference only, so gradient tracking is disabled
	with torch.no_grad():
	    patch_out, intermediates, cls_out, global_freq = model(x)

	# cls_out: (8, 192), use for classification
	# patch_out: (8, 20, 192), per-patch features
	# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
	```

	### Linear Probe Example

	```python
	import torch
	import torch.nn as nn

	# Freeze encoder (`model` is the encoder loaded in the Usage section)
	for p in model.parameters():
	    p.requires_grad = False
	model.eval()

	# Simple classifier on CLS token
	classifier = nn.Sequential(
	    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
	    nn.Linear(256, 18),  # 18 activity classes
	)

	# Extract features and train classifier
	# `imu_windows` is your own tensor of shape (N, 6, 200)
	with torch.no_grad():
	    cls_features = model.encode(imu_windows)  # (N, 192)
	```
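
The snippet above stops once the features are extracted. As a rough sketch of the remaining step, the classifier head could be fitted as follows. Random tensors stand in for `cls_features` and the labels so the block runs on its own, and the optimizer, learning rate, and epoch count are illustrative assumptions rather than settings taken from this model card.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for real data (assumption): in practice cls_features would come
# from model.encode(imu_windows) and labels from the dataset annotations.
cls_features = torch.randn(64, 192)        # (N, 192)
labels = torch.randint(0, 18, (64,))       # 18 activity classes

classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),
)
# Assumed hyperparameters, not taken from the model card
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

classifier.train()
losses = []
for epoch in range(20):
    optimizer.zero_grad()
    logits = classifier(cls_features)      # (N, 18)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Switch off dropout before predicting
classifier.eval()
with torch.no_grad():
    preds = classifier(cls_features).argmax(dim=1)  # (N,)
```

With real data, the features would normally be split by subject (as in the Dataset section) and iterated in mini-batches rather than in one full batch.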

	## Training Details

	### Pretraining Objective

	The model was trained with a self-supervised masked prediction approach:

	1. x-encoder (student): Sees masked input with [MASK] tokens at masked positions
	2. y-encoder (teacher): Sees full original signal, EMA-updated from x-encoder
	3. Predictor: x-encoder output → predicts y-encoder representations at masked positions
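
The three roles above can be illustrated with a toy training step. This is a minimal sketch, not the released training code: single `nn.Linear` layers stand in for the real encoders and predictor, the mask is one hand-picked time block, and the names `student`, `teacher`, `predictor`, `mask_token`, and `ema_update` are hypothetical.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_patches, batch = 192, 20, 4

# Toy stand-ins for the x-encoder, y-encoder and predictor (assumptions)
student = nn.Linear(dim, dim)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False                 # teacher is updated only via EMA
predictor = nn.Linear(dim, dim)
mask_token = nn.Parameter(torch.zeros(dim))

params = list(student.parameters()) + list(predictor.parameters()) + [mask_token]
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.05)

tokens = torch.randn(batch, num_patches, dim)            # patch tokens of the full signal
mask = torch.zeros(batch, num_patches, dtype=torch.bool)
mask[:, 5:10] = True                                     # one masked time block

# Student sees [MASK] tokens at masked positions; teacher sees the full signal
student_in = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
with torch.no_grad():
    targets = teacher(tokens)
preds = predictor(student(student_in))

# Prediction loss is computed at masked positions only
loss = F.mse_loss(preds[mask], targets[mask])
optimizer.zero_grad()
loss.backward()
optimizer.step()

@torch.no_grad()
def ema_update(teacher, student, tau):
    """teacher <- tau * teacher + (1 - tau) * student"""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

ema_update(teacher, student, tau=0.999)
```

Because the teacher receives no gradients, its weights drift slowly toward the student's, which keeps the prediction targets stable from step to step.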

	### Masking Strategy

	Multi-mask with 4 views per sample:

	| Mask Type | Probability | Description |
	|-----------|-------------|-------------|
	| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
	| Channel | 25% | Mask 1-2 of 6 sensor channels |
	| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |
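
To make the sampling concrete, here is a small sketch of how the mask type for each of the 4 views might be drawn with the probabilities in the table. The function name `sample_mask_views` and the tuple layout are hypothetical, and time-block placement is left unspecified because this card does not describe it.

```python
import random

MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.50, 0.25, 0.25]   # probabilities from the table above
NUM_CHANNELS = 6

def sample_mask_views(num_views=4, rng=random):
    """Draw one mask specification per view (hypothetical helper)."""
    views = []
    for _ in range(num_views):
        kind = rng.choices(MASK_TYPES, weights=MASK_PROBS, k=1)[0]
        if kind == "channel":
            # Mask 1-2 of the 6 sensor channels
            count = rng.randint(1, 2)
            views.append((kind, sorted(rng.sample(range(NUM_CHANNELS), count))))
        elif kind == "frequency":
            # Fraction of STFT bins to mask
            views.append((kind, 0.30))
        else:
            # Block size and placement are not specified in this sketch
            views.append((kind, None))
    return views

views = sample_mask_views(num_views=4, rng=random.Random(0))
```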

	### Loss Functions

	| Component | Weight | Purpose |
	|-----------|--------|---------|
	| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
	| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
	| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
	| L_sigreg | adaptive | Prevent representation collapse |
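
Assuming the components are combined as a plain weighted sum (the card lists weights but does not state the combination rule), the total objective might look like the sketch below. The rule for the adaptive `L_sigreg` weight is not given here, so it is passed in as the hypothetical parameter `sigreg_weight`.

```python
W_PRED, W_LMM, W_SUPCON = 1.0, 0.1, 0.15   # weights from the table above

def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, sigreg_weight):
    """Weighted sum of the loss terms; works with floats or tensors."""
    return (
        W_PRED * l_pred
        + W_LMM * l_lmm
        + W_SUPCON * l_supcon
        + sigreg_weight * l_sigreg   # adaptive weight supplied by the caller
    )

example = total_loss(1.0, 2.0, 2.0, 4.0, sigreg_weight=0.5)
```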

	### Dataset

	- Source: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
	- Subjects: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
	- Activities: 18 classes, including walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, and folding clothes
	- Sampling: 20 Hz, 10-second windows with 2.5s stride (75% overlap)
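
The windowing arithmetic works out to 200 samples per window (10 s × 20 Hz) and a stride of 50 samples (2.5 s × 20 Hz), so consecutive windows share 150 of 200 samples. A minimal NumPy sketch follows; the function name `make_windows` and the `(T, 6)` input layout are assumptions.

```python
import numpy as np

SAMPLE_RATE_HZ = 20
WINDOW = int(10 * SAMPLE_RATE_HZ)    # 200 samples
STRIDE = int(2.5 * SAMPLE_RATE_HZ)   # 50 samples, i.e. 75% overlap

def make_windows(recording):
    """Slice a (T, 6) recording into windows of shape (num_windows, 6, 200)."""
    num_samples, num_channels = recording.shape
    starts = range(0, num_samples - WINDOW + 1, STRIDE)
    if len(starts) == 0:
        # Recording shorter than one window
        return np.empty((0, num_channels, WINDOW), dtype=recording.dtype)
    # Transpose each slice so channels come first, as the encoder expects
    return np.stack([recording[s:s + WINDOW].T for s in starts])

# 30 seconds of dummy data gives window starts at 0, 50, ..., 400
windows = make_windows(np.zeros((600, 6)))
```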

	### Training Config

	| Parameter | Value |
	|-----------|-------|
	| Epochs | 12 |
	| Batch size | 128 |
	| Learning rate | 3e-4 (cosine to 1e-5) |
	| Warmup epochs | 2 |
	| Optimizer | AdamW (weight_decay=0.05) |
	| EMA tau | 0.999 → 0.9999 (cosine) |
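
The two schedules in the table can be written out as below. This is a sketch under assumptions: warmup is taken to be linear from zero, the cosine decay is taken to start when warmup ends, and the EMA schedule is taken to span all 12 epochs. The card does not say whether the schedules are stepped per epoch or per iteration, so both functions accept fractional epochs.

```python
import math

EPOCHS, WARMUP_EPOCHS = 12, 2
LR_MAX, LR_MIN = 3e-4, 1e-5
TAU_START, TAU_END = 0.999, 0.9999

def lr_at(epoch):
    """Learning rate at a (possibly fractional) epoch."""
    if epoch < WARMUP_EPOCHS:
        return LR_MAX * epoch / WARMUP_EPOCHS          # assumed linear warmup
    t = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * t))

def tau_at(epoch):
    """EMA momentum, rising from TAU_START to TAU_END along a cosine curve."""
    t = epoch / EPOCHS
    return TAU_END - 0.5 * (TAU_END - TAU_START) * (1 + math.cos(math.pi * t))
```

Under these assumptions the learning rate peaks at 3e-4 at the end of epoch 2 and reaches 1e-5 at epoch 12, while tau moves from 0.999 to 0.9999 over the same run.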

	## Model Architecture

	```
	Input: (B, 6, 200)
	│
	├── Conv1d Stem (6→96, kernel=10, stride=10)
	│   └── Time tokens: (B, 20, 96)
	│
	├── Per-patch FFT → Linear
	│   └── Freq tokens: (B, 20, 96)
	│
	├── Concat + Fusion → (B, 20, 192)
	│
	├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
	│
	├── Position Embedding (learned, 21 positions)
	│
	└── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
	    ├── Layer 2 → intermediate output
	    ├── Layer 4 → intermediate output
	    └── CLS token + 20 patch tokens + global_freq token
	```
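
The tensor shapes in the first half of the diagram can be checked with a short PyTorch sketch. Only the Conv1d stem parameters come from the diagram. The per-patch FFT features (rFFT magnitudes of each 10-sample patch) and the single linear fusion layer are assumptions made for illustration and may differ from the released implementation.

```python
import torch
import torch.nn as nn

B, C, T, P = 8, 6, 200, 10          # batch, channels, timesteps, patch length
x = torch.randn(B, C, T)

# Conv stem: non-overlapping patches of 10 samples give 20 time tokens
stem = nn.Conv1d(C, 96, kernel_size=P, stride=P)
time_tokens = stem(x).transpose(1, 2)                            # (B, 20, 96)

# Per-patch FFT (assumed: magnitude of the real FFT of each patch)
patches = x.reshape(B, C, T // P, P)                             # (B, 6, 20, 10)
spectrum = torch.fft.rfft(patches, dim=-1).abs()                 # (B, 6, 20, 6)
spectrum = spectrum.permute(0, 2, 1, 3).reshape(B, T // P, -1)   # (B, 20, 36)
freq_proj = nn.Linear(spectrum.shape[-1], 96)
freq_tokens = freq_proj(spectrum)                                # (B, 20, 96)

# Concat + fusion (assumed: a single linear layer)
fusion = nn.Linear(192, 192)
fused = fusion(torch.cat([time_tokens, freq_tokens], dim=-1))    # (B, 20, 192)
```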

	## Citation

	```bibtex
	@misc{imu-selfsup-encoder,
	  author = {Li, Yu},
	  title = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
	  year = {2026},
	  url = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
	}

	@article{weiss2019wisdm,
	  title = {Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
	  author = {Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
	  journal = {IEEE Access},
	  volume = {7},
	  pages = {133190--133202},
	  year = {2019},
	  publisher = {IEEE}
	}
	```