---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0

tags:
  - human-activity-recognition
  - action-recognition
  - pose-estimation
  - multimodal
  - self-supervised-learning
  - transformer
  - skeleton
  - computer-vision

model-index:
  - name: Multi-Scale Multimodal Pose HAR
    results: []
---


# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**


---

## 🧠 Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features
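
The factorization in the short-term encoder can be pictured as two reshapes around standard attention. The sketch below is not the code from `model_har_final`; the class name `FactorizedBlockSketch`, the head count, and the pre-norm residual layout are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class FactorizedBlockSketch(nn.Module):
    """Illustrative temporal-then-spatial attention over (B, T, J, D) tokens."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, J, D = x.shape

        # Temporal attention: each joint attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, J, T, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends across the J joints.
        xs = x.reshape(B * T, J, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        return xs.reshape(B, T, J, D)


block = FactorizedBlockSketch(dim=128, heads=4)
tokens = torch.randn(2, 16, 17, 128)  # (B, T, J, D) for a single segment
out = block(tokens)                   # same shape as the input
```

Splitting attention this way means each attention call sees a sequence of length `T` or `J` rather than `T * J`, which is the usual motivation for a factorized design.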

**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**
- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens ↔ image token)
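
As a rough picture of the `concat` mode, the sketch below normalizes each modality, concatenates along the feature axis, and projects with a small MLP. It is an assumption-laden illustration, not the actual `MMFusionConcatLN` code: the class name `ConcatFusionSketch`, the layer sizes, and the GELU activation are hypothetical. The default `img_dim=512` matches the final feature width of a ResNet18.

```python
import torch
import torch.nn as nn


class ConcatFusionSketch(nn.Module):
    """Illustrative LayerNorm -> concat -> MLP fusion of per-segment features."""

    def __init__(self, pose_dim: int = 128, img_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.norm_pose = nn.LayerNorm(pose_dim)
        self.norm_img = nn.LayerNorm(img_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + img_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pose_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # Normalize each modality separately so neither dominates by scale.
        fused = torch.cat(
            [self.norm_pose(pose_feat), self.norm_img(img_feat)], dim=-1
        )
        return self.mlp(fused)


fusion = ConcatFusionSketch()
pose_feat = torch.randn(2, 4, 128)  # (B, L, D_pose)
img_feat = torch.randn(2, 4, 512)   # (B, L, D_img)
fused = fusion(pose_feat, img_feat)  # (B, L, out_dim)
```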

**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens

---

## 🏗️ Model Components

| Module | Description |
|------|------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, temporal order heads |

---

## 📥 Input Format

### Pose Input
```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2D or 3D)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```
* `B`, `L`, `T`: same as in the pose input
* `3`: RGB channels
* `H`: image height
* `W`: image width
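
For a quick shape check before wiring up real data, random tensors with the layouts above are enough. The concrete sizes here (`L=2`, `T=8`, `112x112` frames) are placeholders, not the values used in training.

```python
import torch

# Placeholder sizes for a smoke test; replace with the training configuration.
B, L, T, J, C = 2, 2, 8, 17, 3
H, W = 112, 112

pose_seq = torch.randn(B, L, T, J, C)     # pose stream input
img_seq = torch.randn(B, L, T, 3, H, W)   # image stream input (optional)

# Both streams must agree on batch size, segment count, and frames per segment.
assert pose_seq.shape[:3] == img_seq.shape[:3]
```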

---

## 🚀 Usage

### 1️⃣ Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel
)
```

---

### 2️⃣ Build Model

```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat"  # or "xattn"
)
```

---

### 3️⃣ Load Weights

```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> ℹ️ The checkpoint holds a **`state_dict`** (parameter tensors keyed by name), not a pickled model object, so build the architecture first (step 2) and then load the weights into it.
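
Training scripts sometimes wrap the weights in an outer dictionary or save them from a `DataParallel` model, which prefixes every key with `module.`. Whether this checkpoint does either is not stated here, so the helper below is a defensive sketch; `extract_state_dict` and its `wrapper_keys` default are hypothetical names, not part of `model_har_final`.

```python
from collections import OrderedDict


def extract_state_dict(ckpt, wrapper_keys=("state_dict", "model_state_dict", "model")):
    """Return a flat name -> tensor mapping from a loaded checkpoint object."""
    state = ckpt
    # Unwrap one level if the weights sit under a common wrapper key.
    for key in wrapper_keys:
        if isinstance(state, dict) and isinstance(state.get(key), dict):
            state = state[key]
            break

    # Strip the "module." prefix added by nn.DataParallel / DistributedDataParallel.
    prefix = "module."
    cleaned = OrderedDict()
    for name, value in state.items():
        cleaned[name[len(prefix):] if name.startswith(prefix) else name] = value
    return cleaned
```

With the checkpoint from step 3, this would be used as `model.load_state_dict(extract_state_dict(ckpt))`. On PyTorch 1.13 or later, passing `weights_only=True` to `torch.load` restricts unpickling to tensors and plain containers, which is a reasonable precaution for downloaded files.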

---

### 4️⃣ Inference

```python
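# pose_seq: (B, L, T, J, C) float tensor; img_seq: (B, L, T, 3, H, W) float tensor
with torch.no_grad():
    logits = model(pose_seq, img_seq)  # (B, num_classes)
    preds = logits.argmax(dim=-1)      # (B,) predicted class indices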
```
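
If class probabilities or a ranked shortlist are more useful than a single index, the logits can be post-processed as sketched below. The helper name `top_k_predictions` is hypothetical, and the example logits are made up; mapping indices to activity names requires the label list from the training configuration, which is not given here.

```python
import torch


def top_k_predictions(logits: torch.Tensor, k: int = 3):
    """Return (indices, probabilities) of the k most likely classes per sample."""
    probs = torch.softmax(logits, dim=-1)
    top_p, top_idx = probs.topk(k, dim=-1)
    return top_idx, top_p


# Made-up logits for one sample with 6 classes.
example_logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0, 1.0]])
top_idx, top_p = top_k_predictions(example_logits, k=3)
```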

---

## 🧪 Training Strategy

The model is trained in **three stages**:

### Stage 1 — Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE)
* Temporal order prediction
* Optional labeled supervision
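
Two of the Stage 1 objectives can be sketched compactly. The code below shows a symmetric InfoNCE loss over two views of the same batch and a random joint mask for masked joint modeling. The function names, the temperature of `0.1`, the mask ratio of `0.3`, and zero-filling of masked joints are all assumptions; the training code may differ on each point.

```python
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE: row i of z1 and row i of z2 form the positive pair."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def random_joint_mask(shape, mask_ratio: float = 0.3) -> torch.Tensor:
    """Boolean mask over (B, T, J); True marks joints hidden from the encoder."""
    return torch.rand(shape) < mask_ratio


pose = torch.randn(2, 16, 17, 3)                   # one segment: (B, T, J, C)
mask = random_joint_mask(pose.shape[:3])           # (B, T, J)
masked_pose = pose.masked_fill(mask.unsqueeze(-1), 0.0)

z = torch.eye(4)                                   # 4 perfectly separated embeddings
loss_aligned = info_nce(z, z)                      # positives match: loss near 0
loss_shuffled = info_nce(z, z.roll(1, dims=0))     # positives mismatched: large loss
```

A reconstruction head would then be trained to predict `pose[mask]` from the encoder output on `masked_pose`.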

### Stage 2 — Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3 — Multimodal Supervised Fine-tuning

* Cross-entropy with label smoothing
* Semantic prototype distillation
* Metric learning (triplet)
* Optional knowledge distillation
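
The first and third Stage 3 terms map onto built-in PyTorch losses. The sketch below combines them; the smoothing factor, margin, and weight `w_triplet` are illustrative guesses, `stage3_loss` is a hypothetical name, and the prototype and knowledge distillation terms are left out because their form is not specified here. `label_smoothing` requires PyTorch 1.10 or later.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value is an assumption
triplet = nn.TripletMarginLoss(margin=0.5)     # margin is an assumption


def stage3_loss(logits, labels, anchor, positive, negative, w_triplet: float = 0.5):
    """Classification loss plus a weighted triplet term on embeddings."""
    return ce(logits, labels) + w_triplet * triplet(anchor, positive, negative)


torch.manual_seed(0)
logits = torch.randn(8, 6, requires_grad=True)  # (B, num_classes)
labels = torch.randint(0, 6, (8,))
anchor = torch.randn(8, 128)                    # (B, D) embeddings
positive = torch.randn(8, 128)                  # same-class samples (mining not shown)
negative = torch.randn(8, 128)                  # different-class samples
loss = stage3_loss(logits, labels, anchor, positive, negative)
loss.backward()
```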

---

## 📊 Outputs

| Output                 | Shape              |
| ---------------------- | ------------------ |
| Logits                 | `(B, num_classes)` |
| Pose embedding         | `(B, D)`           |
| Pose tokens (optional) | `(B, T, J, D)`     |

---

## 📎 Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires custom code to instantiate architecture
* Input pose format must match training configuration

---

## 📜 License

Apache License 2.0

---

## 📌 Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```