---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0

tags:
- human-activity-recognition
- action-recognition
- pose-estimation
- multimodal
- self-supervised-learning
- transformer
- skeleton
- computer-vision

model-index:
- name: Multi-Scale Multimodal Pose HAR
  results: []
---


# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**

---

## 🧠 Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features

**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**
- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens ↔ image token)

**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens
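The repository's `MMFusionConcatLN` is not reproduced here; the following is a minimal sketch of what concat-style fusion (concatenate, LayerNorm, MLP) typically looks like. The class name, dimensions, and layer choices are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class ConcatFusionSketch(nn.Module):
    """Illustrative concat fusion: concatenate pose and image features
    along the channel axis, normalize, then project with a small MLP."""

    def __init__(self, pose_dim: int, img_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(pose_dim + img_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + img_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pose_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # pose_feat: (B, L, D_pose), img_feat: (B, L, D_img) per-segment features
        fused = torch.cat([pose_feat, img_feat], dim=-1)
        return self.mlp(self.norm(fused))


fusion = ConcatFusionSketch(pose_dim=128, img_dim=512, out_dim=128)
out = fusion(torch.randn(2, 4, 128), torch.randn(2, 4, 512))
```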

---

## 🏗️ Model Components

| Module | Description |
|--------|-------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, and temporal-order heads |

---

## 📥 Input Format

### Pose Input

```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2 for 2D, 3 for 3D poses)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```

* `3`: RGB channels
* `H`: image height
* `W`: image width
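For example, dummy inputs with these shapes can be built as follows; the sizes are illustrative and must match your training configuration:

```python
import torch

B, L, T, J, C = 2, 4, 16, 17, 3  # batch, segments, frames per segment, joints, channels
H = W = 224                      # illustrative image resolution

pose_seq = torch.randn(B, L, T, J, C)    # pose input
img_seq = torch.randn(B, L, T, 3, H, W)  # optional RGB input
```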

---

## 🚀 Usage

### 1️⃣ Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel,
)
```

---

### 2️⃣ Build Model

```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True,
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat",  # or "xattn"
)
```

---

### 3️⃣ Load Weights

```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> ℹ️ The checkpoint is a plain **`state_dict`** (weights only) rather than a pickled model object, for maximum compatibility.

---

### 4️⃣ Inference

```python
# pose_seq: (B, L, T, J, C), img_seq: (B, L, T, 3, H, W) — see Input Format above
with torch.no_grad():
    logits = model(pose_seq, img_seq)
    preds = logits.argmax(dim=-1)
```
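If per-class scores are wanted in addition to the hard predictions, a softmax over the logits yields probabilities; the logit values below are stand-ins, not model output:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # stand-in for model output, (B=1, 3 classes)
probs = F.softmax(logits, dim=-1)          # each row sums to 1
conf, preds = probs.max(dim=-1)            # top-1 confidence and class index
```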

---

## 🧪 Training Strategy

The model is trained in **three stages**:

### Stage 1 — Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE)
* Temporal order prediction
* Optional labeled supervision
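The actual SSL losses live in `SSLHeads`; as a reference, a minimal InfoNCE loss over a batch of positive pairs might look like this. The temperature is an assumed value, not taken from the training config:

```python
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over paired embeddings: (z1[i], z2[i]) are positives;
    all other pairs in the batch act as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```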

### Stage 2 — Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3 — Multimodal Supervised Fine-tuning

* Label-smoothing cross-entropy
* Semantic prototype distillation
* Metric learning (triplet loss)
* Optional knowledge distillation
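The Stage 3 classification term can be reproduced with PyTorch's built-in label smoothing; the smoothing factor here is an assumption, not the repository's setting:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 6)            # (B, num_classes) stand-in for model output
targets = torch.tensor([0, 2, 5, 1])  # ground-truth activity labels
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```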

---

## 📊 Outputs

| Output | Shape |
| --- | --- |
| Logits | `(B, num_classes)` |
| Pose embedding | `(B, D)` |
| Pose tokens (optional) | `(B, T, J, D)` |

---

## 📎 Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires the custom code in this repository to instantiate the architecture
* Input pose format must match the training configuration

---

## 📜 License

Apache License 2.0

---

## 📌 Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```