---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0
tags:
- human-activity-recognition
- action-recognition
- pose-estimation
- multimodal
- self-supervised-learning
- transformer
- skeleton
- computer-vision
model-index:
- name: Multi-Scale Multimodal Pose HAR
  results: []
---

# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**

---

## Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features

**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**
- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens → image token)

**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens
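
The `concat` path can be pictured as a LayerNorm over the concatenated pose and image features followed by a small MLP. The sketch below is a rough illustration only; the feature dimensions, hidden size, and layer choices are assumptions, not the actual `MMFusionConcatLN` implementation.

```python
import torch
import torch.nn as nn

class ConcatFusionSketch(nn.Module):
    """Illustrative concat-style fusion head (assumed dimensions)."""
    def __init__(self, pose_dim=128, img_dim=512, hidden_dim=256, num_classes=6):
        super().__init__()
        self.norm = nn.LayerNorm(pose_dim + img_dim)  # normalize concatenated features
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + img_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pose_feat, img_feat):
        # pose_feat: (B, pose_dim) pooled pose embedding
        # img_feat:  (B, img_dim) pooled RGB embedding
        fused = torch.cat([pose_feat, img_feat], dim=-1)
        return self.mlp(self.norm(fused))
```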

---

## Model Components

| Module | Description |
|--------|-------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, temporal order heads |

---

## Input Format

### Pose Input

```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2 for 2D poses, 3 for 3D poses)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```

* `3`: RGB channels
* `H`: image height
* `W`: image width
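
For a quick shape check, inputs with the documented layout can be created as random tensors. The concrete sizes below (8 segments of 16 frames, 224×224 crops) are placeholder values, not model requirements.

```python
import torch

B, L, T, J, C = 2, 8, 16, 17, 3   # batch, segments, frames per segment, joints, coords
H = W = 224                        # example RGB resolution

pose_seq = torch.randn(B, L, T, J, C)      # pose stream: (B, L, T, J, C)
img_seq = torch.randn(B, L, T, 3, H, W)    # RGB stream:  (B, L, T, 3, H, W)
```

These dummy tensors are reused in the inference example below.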

---

## Usage

### 1. Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel,
)
```

---

### 2. Build Model

```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True,
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat",  # or "xattn"
)
```

---

### 3. Load Weights

```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> **Note:** The checkpoint stores the model's `state_dict` (weights only), not a pickled model object, for maximum compatibility.
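
On recent PyTorch versions you can additionally pass `weights_only=True` to `torch.load`, which restricts unpickling to plain tensor data and is a good fit for weight-only checkpoints like this one:

```python
ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu", weights_only=True)
model.load_state_dict(ckpt)
```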

---

### 4. Inference

```python
with torch.no_grad():
    logits = model(pose_seq, img_seq)
    preds = logits.argmax(dim=-1)   # predicted class index per sample
```

---

## Training Strategy

The model is trained in **three stages**:

### Stage 1 – Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE; see the sketch below)
* Temporal order prediction
* Optional labeled supervision
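
As a reference for the contrastive objective above, a minimal InfoNCE loss over two augmented views of the same pose segments might look like the following. The temperature and projection details are assumptions for illustration, not the exact `SSLHeads` implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same clips."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    # diagonal pairs (i, i) are positives, all other pairs act as negatives
    return F.cross_entropy(logits, targets)
```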

### Stage 2 – Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3 – Multimodal Supervised Fine-tuning

* Label-smoothing cross-entropy
* Semantic prototype distillation
* Metric learning (triplet loss)
* Optional knowledge distillation

---

## Outputs

| Output | Shape |
|--------|-------|
| Logits | `(B, num_classes)` |
| Pose embedding | `(B, D)` |
| Pose tokens (optional) | `(B, T, J, D)` |

---

## Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires custom code to instantiate the architecture
* Input pose format must match the training configuration

---

## License

Apache License 2.0

---

## Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```