---
license: mit
datasets:
- wisdm
language:
- en
tags:
- human-activity-recognition
- imu
- self-supervised-learning
- transformer
- time-series
- accelerometer
- gyroscope
- pytorch
pipeline_tag: feature-extraction
---

# IMU-SelfSupEncoder-v1

A self-supervised Transformer encoder for Human Activity Recognition (HAR) from IMU sensor data. It was trained on the [WISDM smartphone+smartwatch dataset](https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset) with a masked-prediction objective, a SupCon contrastive loss, and an LMM frequency-domain loss.

## Model Overview

- **Architecture**: Vision Transformer (ViT) style with Conv-stem patch embedding + time-frequency fusion
- **Input**: 6-channel IMU window (accel x/y/z + gyro x/y/z), 200 timesteps @ 20 Hz (10 seconds)
- **Output**: 192-dim CLS token + 20 patch embeddings (192-dim each) + intermediate layer outputs
- **Params**: ~1.4M
- **Training**: Self-supervised masked prediction (teacher-student) + SupCon contrastive + LMM frequency loss

## Usage

```python
import torch

from modeling_imu_encoder import IMUMaskedEncoder

model = IMUMaskedEncoder.from_pretrained("NikoKKK/IMU-SelfSupEncoder-v1")
model.eval()

# Input: (batch, 6 channels, 200 timesteps)
x = torch.randn(8, 6, 200)

with torch.no_grad():
    patch_out, intermediates, cls_out, global_freq = model(x)

# cls_out: (8, 192), use for classification
# patch_out: (8, 20, 192), per-patch features
# intermediates: {2: (8, 20, 192), 4: (8, 20, 192)}
```

### Linear Probe Example

```python
import torch
import torch.nn as nn

# Freeze the encoder (`model` is the encoder loaded in the Usage example)
for p in model.parameters():
    p.requires_grad = False
model.eval()

# Simple classifier on the CLS token
classifier = nn.Sequential(
    nn.Linear(192, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 18),  # 18 activity classes
)

# Extract features, then train the classifier on them.
# imu_windows: your tensor of IMU windows, shape (N, 6, 200)
with torch.no_grad():
    cls_features = model.encode(imu_windows)  # (N, 192)
```

## Training Details

### Pretraining Objective

The model was trained with a self-supervised masked prediction approach:

1. **x-encoder (student)**: Sees masked input with [MASK] tokens at masked positions
2. **y-encoder (teacher)**: Sees full original signal, EMA-updated from x-encoder
3. **Predictor**: Takes the x-encoder output and predicts the y-encoder representations at masked positions
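
The teacher update in step 2 can be sketched as a standard exponential moving average over parameters. This is a minimal illustration, not the training code of this model: `ema_update` is a hypothetical helper, and the `nn.Linear` modules are stand-ins for the real x- and y-encoders.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, tau: float) -> None:
    """In-place EMA: teacher <- tau * teacher + (1 - tau) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(tau).add_(s_param.detach(), alpha=1.0 - tau)


# Hypothetical stand-ins for the x-encoder (student) and y-encoder (teacher).
student = nn.Linear(4, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never updated by gradients

# Pretend an optimizer step has moved the student weights.
with torch.no_grad():
    for p in student.parameters():
        p.add_(1.0)

before = teacher.weight.clone()
ema_update(teacher, student, tau=0.999)
```

With `tau` close to 1 the teacher moves slowly, which keeps the prediction targets stable while the student trains.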

### Masking Strategy

Multi-mask with 4 views per sample:

| Mask Type | Probability | Description |
|-----------|-------------|-------------|
| Time block | 50% | Blocks of 3-8, 10-18, or 20-30 patches |
| Channel | 25% | Mask 1-2 of 6 sensor channels |
| Frequency | 25% | Mask 30% of STFT frequency bins (bias toward mid-high) |
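
As a rough sketch of how the table could translate into sampling code, the snippet below draws a mask type for each of the 4 views. It assumes the views are sampled independently, which the card does not state; `sample_mask_types` and `sample_channel_mask` are hypothetical helpers, not functions from this repository.

```python
import random

MASK_TYPES = ["time_block", "channel", "frequency"]
MASK_PROBS = [0.50, 0.25, 0.25]  # probabilities from the table above
NUM_VIEWS = 4


def sample_mask_types(rng: random.Random, num_views: int = NUM_VIEWS) -> list:
    """Draw one mask type per view (assumed independent across views)."""
    return rng.choices(MASK_TYPES, weights=MASK_PROBS, k=num_views)


def sample_channel_mask(rng: random.Random, num_channels: int = 6) -> list:
    """Pick 1 or 2 of the sensor channels to mask, as in the 'Channel' row."""
    k = rng.randint(1, 2)
    return sorted(rng.sample(range(num_channels), k))


rng = random.Random(0)
view_types = sample_mask_types(rng)
masked_channels = sample_channel_mask(rng)
```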

### Loss Functions

| Component | Weight | Purpose |
|-----------|--------|---------|
| L_pred (MSE) | 1.0 | Predict teacher representations at masked positions |
| L_lmm (frequency) | 0.1 | Reconstruct original signal patches with frequency-domain loss |
| L_supcon | 0.15 | Supervised contrastive loss on CLS tokens |
| L_sigreg | adaptive | Prevent representation collapse |
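
Assuming the components are combined as a plain weighted sum (the card lists weights but not the exact combination), the total loss might look like the sketch below. The `sigreg_weight` argument is a placeholder for the adaptive weight, whose schedule is not described here.

```python
# Fixed weights from the table above; the SIGReg weight is adaptive in the
# original training and is passed in explicitly in this sketch.
LOSS_WEIGHTS = {"pred": 1.0, "lmm": 0.1, "supcon": 0.15}


def total_loss(l_pred, l_lmm, l_supcon, l_sigreg, sigreg_weight):
    """Weighted sum of the four loss terms (works for floats or tensors)."""
    return (
        LOSS_WEIGHTS["pred"] * l_pred
        + LOSS_WEIGHTS["lmm"] * l_lmm
        + LOSS_WEIGHTS["supcon"] * l_supcon
        + sigreg_weight * l_sigreg
    )


# Illustrative values only, not measurements from training.
example = total_loss(1.0, 2.0, 4.0, 10.0, sigreg_weight=0.05)
```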

### Dataset

- **Source**: WISDM smartphone+smartwatch dataset (watch accelerometer + gyroscope)
- **Subjects**: 6 training (1600-1605), 2 validation (1606-1607), 2 test (1608-1609)
- **Activities**: 18 classes, including walking, jogging, stairs, sitting, standing, typing, brushing teeth, eating/drinking, sports, writing, clapping, and folding clothes
- **Sampling**: 20 Hz, 10-second windows with 2.5s stride (75% overlap)
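
The windowing arithmetic works out to 200 samples per window and a stride of 50 samples. The sketch below shows one way to cut a recording into model-ready windows; `sliding_windows` is a hypothetical helper and the random recording is placeholder data, not WISDM.

```python
import numpy as np

SAMPLE_RATE_HZ = 20
WINDOW_SEC = 10.0
STRIDE_SEC = 2.5


def sliding_windows(signal: np.ndarray) -> np.ndarray:
    """Cut a (T, 6) recording into windows of shape (N, 6, 200)."""
    win = int(WINDOW_SEC * SAMPLE_RATE_HZ)     # 200 samples
    stride = int(STRIDE_SEC * SAMPLE_RATE_HZ)  # 50 samples (75% overlap)
    starts = range(0, signal.shape[0] - win + 1, stride)
    windows = [signal[s:s + win].T for s in starts]  # transpose to (6, 200)
    if not windows:
        return np.empty((0, signal.shape[1], win), dtype=signal.dtype)
    return np.stack(windows)


# Placeholder recording: 30 seconds of 6-channel data at 20 Hz.
recording = np.random.randn(600, 6)
windows = sliding_windows(recording)
```

Recordings shorter than one window yield an empty array rather than an error, so per-subject files of uneven length can be processed uniformly.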

### Training Config

| Parameter | Value |
|-----------|-------|
| Epochs | 12 |
| Batch size | 128 |
| Learning rate | 3e-4 (cosine to 1e-5) |
| Warmup epochs | 2 |
| Optimizer | AdamW (weight_decay=0.05) |
| EMA tau | 0.999 → 0.9999 (cosine) |
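
To make the two cosine schedules concrete, here is a small sketch of both as functions of the epoch. The linear warmup shape and the per-epoch (rather than per-step) granularity are assumptions; `lr_at` and `ema_tau_at` are hypothetical names.

```python
import math

TOTAL_EPOCHS = 12
WARMUP_EPOCHS = 2


def lr_at(epoch: float, peak: float = 3e-4, floor: float = 1e-5) -> float:
    """Linear warmup (assumed) to `peak`, then cosine decay to `floor`."""
    if epoch < WARMUP_EPOCHS:
        return peak * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def ema_tau_at(epoch: float, start: float = 0.999, end: float = 0.9999) -> float:
    """Cosine ramp of the EMA coefficient from `start` up to `end`."""
    progress = epoch / TOTAL_EPOCHS
    return end - 0.5 * (end - start) * (1.0 + math.cos(math.pi * progress))


schedule = [(e, lr_at(e), ema_tau_at(e)) for e in range(TOTAL_EPOCHS + 1)]
```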

## Model Architecture

```
Input: (B, 6, 200)
    │
    ├── Conv1d Stem (6→96, kernel=10, stride=10)
    │   └── Time tokens: (B, 20, 96)
    │
    ├── Per-patch FFT → Linear
    │   └── Freq tokens: (B, 20, 96)
    │
    ├── Concat + Fusion → (B, 20, 192)
    │
    ├── Global FFT (full 200-pt) → Linear → (B, 1, 192)
    │
    ├── Position Embedding (learned, 21 positions)
    │
    └── Transformer Encoder (4 layers, 6 heads, 192-dim, MLP ratio 3.0)
        ├── Layer 2 → intermediate output
        ├── Layer 4 → intermediate output
        └── CLS token + 20 patch tokens + global_freq token
```
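
The token shapes in the diagram can be checked with a short shape-only sketch of the front end. Several details are assumptions made for illustration: the per-patch FFT is taken as an `rfft` magnitude over each 10-sample patch, and the fusion step is a single `nn.Linear`; the released model may differ on both.

```python
import torch
import torch.nn as nn

B, C, T, PATCH = 8, 6, 200, 10
x = torch.randn(B, C, T)

# Conv stem: non-overlapping patches of 10 timesteps -> 20 time tokens.
stem = nn.Conv1d(C, 96, kernel_size=PATCH, stride=PATCH)
time_tokens = stem(x).transpose(1, 2)                     # (B, 20, 96)

# Per-patch spectrum (assumed: rfft magnitude, 6 bins per 10-sample patch).
patches = x.unfold(dimension=2, size=PATCH, step=PATCH)   # (B, 6, 20, 10)
spectrum = torch.fft.rfft(patches, dim=-1).abs()          # (B, 6, 20, 6)
spectrum = spectrum.permute(0, 2, 1, 3).reshape(B, T // PATCH, -1)  # (B, 20, 36)
freq_proj = nn.Linear(spectrum.shape[-1], 96)
freq_tokens = freq_proj(spectrum)                         # (B, 20, 96)

# Concat + fusion (assumed: one linear layer) -> 192-dim patch tokens.
fusion = nn.Linear(192, 192)
fused = fusion(torch.cat([time_tokens, freq_tokens], dim=-1))  # (B, 20, 192)
```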

## Citation

```bibtex
@misc{imu-selfsup-encoder,
  author = {Li, Yu},
  title = {IMU-SelfSupEncoder-v1: Self-Supervised Transformer for IMU Activity Recognition},
  year = {2026},
  url = {https://huggingface.co/NikoKKK/IMU-SelfSupEncoder-v1},
}

@article{weiss2019wisdm,
  title={Smartphone and Smartwatch-Based Biometrics Using Activities of Daily Living},
  author={Weiss, Gary M and Yoneda, Kenichi and Hayajneh, Thaier},
  journal={IEEE Access},
  volume={7},
  pages={133190--133202},
  year={2019},
  publisher={IEEE}
}
```