---
language: en
library_name: pytorch
pipeline_tag: video-classification
license: apache-2.0
tags:
- human-activity-recognition
- action-recognition
- pose-estimation
- multimodal
- self-supervised-learning
- transformer
- skeleton
- computer-vision
model-index:
- name: Multi-Scale Multimodal Pose HAR
  results: []
---

# Multi-Scale Multimodal Pose HAR

A **custom PyTorch model** for **Human Activity Recognition (HAR)** that integrates:

- **Short-term pose transformers** (factorized temporal + spatial attention)
- **Long-term temporal aggregation**
- **Optional multimodal fusion with RGB images**
- **Multi-stage self-supervised + supervised training pipeline**

---

## 🧠 Model Overview

### Architecture Summary

**Pose stream**

- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features

**Image stream (optional)**

- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment

**Fusion**

- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens ↔ image token)

**Output**

- Activity classification logits
- Optional intermediate embeddings / tokens

---

## 🏗️ Model Components

| Module | Description |
|------|------------|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, and temporal order heads |

---

## 📥 Input Format

### Pose Input

```text
(B, L, T, J, C)
```

* `B`: batch size
* `L`: number of temporal segments
* `T`: frames per segment
* `J`: number of joints (e.g. 17)
* `C`: joint channels (2D or 3D)

### Image Input (optional)

```text
(B, L, T, 3, H, W)
```

* `3`: RGB channels
* `H`: image height
* `W`: image width

---

## 🚀 Usage

### 1️⃣ Load Model Code

```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel
)
```

---

### 2️⃣ Build Model

```python
# Short-term pose encoder: 17 joints with 3 channels each
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True
)

# Full model: long-term aggregation plus optional RGB fusion
model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat"  # or "xattn"
)
```

---

### 3️⃣ Load Weights

```python
import torch

# The checkpoint holds a state_dict, so the model must be built first (step 2)
ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```

> ℹ️ The checkpoint stores the model's **`state_dict`** rather than a pickled model object, for maximum compatibility.

---

### 4️⃣ Inference

```python
# pose_seq: (B, L, T, J, C); img_seq: (B, L, T, 3, H, W)
with torch.no_grad():
    logits = model(pose_seq, img_seq)  # (B, num_classes)
    preds = logits.argmax(dim=-1)      # predicted class index per sample
```

---

## 🧪 Training Strategy

The model is trained in **three stages**:

### Stage 1: Pose SSL + Weak Supervision

* Masked joint modeling (MJM)
* Contrastive learning (InfoNCE)
* Temporal order prediction
* Optional labeled supervision

### Stage 2: Pose-only SSL Refinement

* Contrastive + reconstruction losses
* Temporal attention disabled for stability

### Stage 3: Multimodal Supervised Fine-tuning

* Label smoothing CE
* Semantic prototype distillation
* Metric learning (triplet)
* Optional knowledge distillation

---

## 📊 Outputs

| Output                 | Shape              |
| ---------------------- | ------------------ |
| Logits                 | `(B, num_classes)` |
| Pose embedding         | `(B, D)`           |
| Pose tokens (optional) | `(B, T, J, D)`     |

---

## 📎 Limitations

* Not compatible with `AutoModel.from_pretrained`
* Requires custom code to instantiate the architecture
* Input pose format must match the training configuration

---

## 📜 License

Apache License 2.0

---

## 📌 Citation

If you use this work, please cite:

```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```
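
---

## 🧩 Appendix: Input Shape Check

The Limitations section notes that the input pose format must match the training configuration. The sketch below shows one way to check the layouts from the Input Format section before calling the model. The helper `check_har_input_shapes` is hypothetical and is not part of `model_har_final`. It also assumes that the pose and image streams share the same `B`, `L`, and `T`, which the card implies through the shared axis names but does not state outright.

```python
from typing import Optional, Sequence


def check_har_input_shapes(
    pose_shape: Sequence[int],
    img_shape: Optional[Sequence[int]] = None,
    expected_joints: Optional[int] = None,
) -> None:
    """Raise ValueError if shapes do not follow the documented layouts.

    pose_shape: (B, L, T, J, C)
    img_shape:  (B, L, T, 3, H, W), or None for pose-only input
    """
    if len(pose_shape) != 5:
        raise ValueError(f"pose must be (B, L, T, J, C), got {tuple(pose_shape)}")
    b, l, t, j, _c = pose_shape

    # Optional check against the joint count the backbone was built with
    if expected_joints is not None and j != expected_joints:
        raise ValueError(f"expected {expected_joints} joints, got {j}")

    if img_shape is None:
        return  # pose-only input, nothing more to check

    if len(img_shape) != 6:
        raise ValueError(f"image must be (B, L, T, 3, H, W), got {tuple(img_shape)}")
    if img_shape[3] != 3:
        raise ValueError(f"image must have 3 RGB channels, got {img_shape[3]}")

    # Assumption: both streams are aligned on batch, segment, and frame axes
    if tuple(img_shape[:3]) != (b, l, t):
        raise ValueError(
            f"pose (B, L, T)={(b, l, t)} does not match image {tuple(img_shape[:3])}"
        )


# Illustrative sizes only: 2 samples, 4 segments, 16 frames, 17 joints, 224x224 RGB
check_har_input_shapes((2, 4, 16, 17, 3), (2, 4, 16, 3, 224, 224), expected_joints=17)
```

Because `torch.Size` is a tuple subclass, the same call works with `pose_seq.shape` and `img_seq.shape` directly.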