---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0

tags:
- human-activity-recognition
- action-recognition
- pose-estimation
- multimodal
- self-supervised-learning
- transformer
- skeleton
- computer-vision

model-index:
- name: Multi-Scale Multimodal Pose HAR
  results: []
---


# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**

---

## 🧠 Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features

**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**
- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens ↔ image token)

**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens
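The repository's `MMFusionConcatLN` is not reproduced here; the following is a minimal sketch of what concat-style fusion (concatenate, LayerNorm, MLP) typically looks like. The class name, dimensions, and layer choices are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class ConcatFusionSketch(nn.Module):
    """Illustrative concat fusion: concatenate pose and image features
    along the channel axis, normalize, then project with a small MLP."""

    def __init__(self, pose_dim: int, img_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(pose_dim + img_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + img_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pose_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # pose_feat: (B, L, D_pose), img_feat: (B, L, D_img) per-segment features
        fused = torch.cat([pose_feat, img_feat], dim=-1)
        return self.mlp(self.norm(fused))


fusion = ConcatFusionSketch(pose_dim=128, img_dim=512, out_dim=128)
out = fusion(torch.randn(2, 4, 128), torch.randn(2, 4, 512))
```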

---

## 🏗️ Model Components

| Module | Description |
|--------|-------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, and temporal-order heads |

---

## 📥 Input Format

### Pose Input

```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2 for 2D, 3 for 3D poses)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```

* `3`: RGB channels
* `H`: image height
* `W`: image width
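For example, dummy inputs with these shapes can be built as follows; the sizes are illustrative and must match your training configuration:

```python
import torch

B, L, T, J, C = 2, 4, 16, 17, 3  # batch, segments, frames per segment, joints, channels
H = W = 224                      # illustrative image resolution

pose_seq = torch.randn(B, L, T, J, C)    # pose input
img_seq = torch.randn(B, L, T, 3, H, W)  # optional RGB input
```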

---

## 🚀 Usage

### 1️⃣ Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel,
)
```

---

### 2️⃣ Build Model

```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True,
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat",  # or "xattn"
)
```

---

### 3️⃣ Load Weights

```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> ℹ️ The checkpoint is a plain **`state_dict`** (weights only) rather than a pickled model object, for maximum compatibility.

---

### 4️⃣ Inference

```python
# pose_seq: (B, L, T, J, C), img_seq: (B, L, T, 3, H, W) — see Input Format above
with torch.no_grad():
    logits = model(pose_seq, img_seq)
    preds = logits.argmax(dim=-1)
```
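If per-class scores are wanted in addition to the hard predictions, a softmax over the logits yields probabilities; the logit values below are stand-ins, not model output:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # stand-in for model output, (B=1, 3 classes)
probs = F.softmax(logits, dim=-1)          # each row sums to 1
conf, preds = probs.max(dim=-1)            # top-1 confidence and class index
```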

---

## 🧪 Training Strategy

The model is trained in **three stages**:

### Stage 1 — Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE)
* Temporal order prediction
* Optional labeled supervision
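The actual SSL losses live in `SSLHeads`; as a reference, a minimal InfoNCE loss over a batch of positive pairs might look like this. The temperature is an assumed value, not taken from the training config:

```python
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over paired embeddings: (z1[i], z2[i]) are positives;
    all other pairs in the batch act as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```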

### Stage 2 — Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3 — Multimodal Supervised Fine-tuning

* Label-smoothing cross-entropy
* Semantic prototype distillation
* Metric learning (triplet loss)
* Optional knowledge distillation
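The Stage 3 classification term can be reproduced with PyTorch's built-in label smoothing; the smoothing factor here is an assumption, not the repository's setting:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 6)            # (B, num_classes) stand-in for model output
targets = torch.tensor([0, 2, 5, 1])  # ground-truth activity labels
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```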

---

## 📊 Outputs

| Output | Shape |
| --- | --- |
| Logits | `(B, num_classes)` |
| Pose embedding | `(B, D)` |
| Pose tokens (optional) | `(B, T, J, D)` |

---

## 📎 Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires the custom code in this repository to instantiate the architecture
* Input pose format must match the training configuration

---

## 📜 License

Apache License 2.0

---

## 📌 Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```