---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0
tags:
- human-activity-recognition
- action-recognition
- pose-estimation
- multimodal
- self-supervised-learning
- transformer
- skeleton
- computer-vision
model-index:
- name: Multi-Scale Multimodal Pose HAR
  results: []
---

# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**

---

## Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features

**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**
- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens → image token)

**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens
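
The `concat` path can be pictured as a LayerNorm over the concatenated pose and image features followed by a small MLP. The sketch below is a rough illustration only; the feature dimensions, hidden size, and layer choices are assumptions, not the actual `MMFusionConcatLN` implementation.

```python
import torch
import torch.nn as nn

class ConcatFusionSketch(nn.Module):
    """Illustrative concat-style fusion head (assumed dimensions)."""
    def __init__(self, pose_dim=128, img_dim=512, hidden_dim=256, num_classes=6):
        super().__init__()
        self.norm = nn.LayerNorm(pose_dim + img_dim)  # normalize concatenated features
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + img_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pose_feat, img_feat):
        # pose_feat: (B, pose_dim) pooled pose embedding
        # img_feat:  (B, img_dim) pooled RGB embedding
        fused = torch.cat([pose_feat, img_feat], dim=-1)
        return self.mlp(self.norm(fused))
```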

---

## Model Components

| Module | Description |
|--------|-------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, temporal order heads |

---

## Input Format

### Pose Input

```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2 for 2D poses, 3 for 3D poses)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```

* `3`: RGB channels
* `H`: image height
* `W`: image width
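
For a quick shape check, inputs with the documented layout can be created as random tensors. The concrete sizes below (8 segments of 16 frames, 224×224 crops) are placeholder values, not model requirements.

```python
import torch

B, L, T, J, C = 2, 8, 16, 17, 3   # batch, segments, frames per segment, joints, coords
H = W = 224                        # example RGB resolution

pose_seq = torch.randn(B, L, T, J, C)      # pose stream: (B, L, T, J, C)
img_seq = torch.randn(B, L, T, 3, H, W)    # RGB stream:  (B, L, T, 3, H, W)
```

These dummy tensors are reused in the inference example below.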

---

## Usage

### 1. Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel,
)
```

---

### 2. Build Model

```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True,
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat",  # or "xattn"
)
```

---

### 3. Load Weights

```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> **Note:** The checkpoint stores the model's `state_dict` (weights only), not a pickled model object, for maximum compatibility.
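
On recent PyTorch versions you can additionally pass `weights_only=True` to `torch.load`, which restricts unpickling to plain tensor data and is a good fit for weight-only checkpoints like this one:

```python
ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu", weights_only=True)
model.load_state_dict(ckpt)
```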

---

### 4. Inference

```python
with torch.no_grad():
    logits = model(pose_seq, img_seq)
    preds = logits.argmax(dim=-1)   # predicted class index per sample
```

---

## Training Strategy

The model is trained in **three stages**:

### Stage 1 – Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE; see the sketch below)
* Temporal order prediction
* Optional labeled supervision
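
As a reference for the contrastive objective above, a minimal InfoNCE loss over two augmented views of the same pose segments might look like the following. The temperature and projection details are assumptions for illustration, not the exact `SSLHeads` implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same clips."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    # diagonal pairs (i, i) are positives, all other pairs act as negatives
    return F.cross_entropy(logits, targets)
```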

### Stage 2 – Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3 – Multimodal Supervised Fine-tuning

* Label-smoothing cross-entropy
* Semantic prototype distillation
* Metric learning (triplet loss)
* Optional knowledge distillation

---

## Outputs

| Output | Shape |
|--------|-------|
| Logits | `(B, num_classes)` |
| Pose embedding | `(B, D)` |
| Pose tokens (optional) | `(B, T, J, D)` |

---

## Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires custom code to instantiate the architecture
* Input pose format must match the training configuration

---

## License

Apache License 2.0

---

## Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```