+---
+license: gpl-3.0
+tags:
+  - human-pose-estimation
+  - pose-estimation
+  - instance-segmentation
+  - detection
+  - person-detection
+  - computer-vision
+datasets:
+  - COCO
+  - AIC
+  - MPII
+  - OCHuman
+metrics:
+  - mAP
+---
+<div id="toc">
+  <ul align="center" style="list-style: none; padding: 0; margin: 0;">
+    <summary>
+      <h1 style="margin-bottom: 0.0em;">
+        Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
+      </h1>
+    </summary>
+  </ul>
+</div>
+<div id="toc">
+  <ul align="center" style="list-style: none; padding: 0; margin: 0;">
+    <summary>
+      <h2 style="margin-bottom: 0.2em;">
+        ICCV 2025
+      </h2>
+    </summary>
+  </ul>
+</div>
+<div style="text-align: justify;">
+The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other.
+This approach enhances all three tasks simultaneously.
+Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.
+Key contributions:
+1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
+    - Download pre-trained weights below
+2. **BBox-MaskPose (BMP)**: method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
+    - Try the demo!
+3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
+    - Download pre-trained weights below
+4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent from MMPose.
+</div>
+<div align="left">
+[![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
+[![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
+[![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
+</div>
+For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).
+## 📝 Models List
+1. **ViTPose-b multi-dataset**
+2. **MaskPose-b**
+3. Fine-tuned **RTMDet-l**
+See details of each model below.
+-----------------------------------------
+## 1. ViTPose-B [multi-dataset]
+- **Model type**: ViT-b backbone with multi-layer decoder
+- **Input**: RGB images (192x256)
+- **Output**: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
+- **Language(s)**: Not language-dependent (vision model)
+- **License**: GPL-3.0
+- **Framework**: MMPose
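The input and output sizes listed above (a 192x256 crop and one 48x64 heatmap per keypoint) imply a stride of 4 between heatmap cells and input pixels. The following is a minimal NumPy sketch of plain argmax decoding under that assumption. The function name `decode_heatmaps` and the `(K, H, W)` array layout are illustrative choices, not part of MMPose, and the decoder actually used with the released weights may apply sub-pixel refinement that this sketch omits.

```python
import numpy as np


def decode_heatmaps(heatmaps, input_size=(192, 256)):
    """Argmax-decode heatmaps of shape (K, H, W).

    Returns (K, 2) xy coordinates in input-crop pixels and (K,) peak scores.
    `input_size` is (width, height); this layout is an assumption of the sketch.
    """
    num_kpts, hm_h, hm_w = heatmaps.shape
    flat = heatmaps.reshape(num_kpts, -1)
    peak_idx = flat.argmax(axis=1)
    scores = flat.max(axis=1)
    # Convert flat indices back to (row, column) positions in the heatmap.
    ys, xs = np.unravel_index(peak_idx, (hm_h, hm_w))
    scale_x = input_size[0] / hm_w  # 192 / 48 = 4
    scale_y = input_size[1] / hm_h  # 256 / 64 = 4
    # Map the centre of each peak cell back to input-crop pixels.
    coords = np.stack([(xs + 0.5) * scale_x, (ys + 0.5) * scale_y], axis=1)
    return coords, scores
```

The coordinates are relative to the cropped input; mapping them back to the full image requires the crop's offset and scale, which depend on the preprocessing pipeline.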
+#### Training Details
+- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
+- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
+- **Epochs**: 210
+- **Batch size**: 64
+- **Learning rate**: 5e-5
+- **Hardware**: 4x NVIDIA A100
+**What's new?**
+ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose.
+The ViTPose authors previously trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
+-----------------------------------------
+## 2. MaskPose-B
+- **Model type**: ViT-b backbone with multi-layer decoder
+- **Input**: RGB images (192x256) + estimated instance segmentation
+- **Output**: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
+- **Language(s)**: Not language-dependent (vision model)
+- **License**: GPL-3.0
+- **Framework**: MMPose
+#### Training Details
+- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
+- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
+- **Epochs**: 210
+- **Batch size**: 64
+- **Learning rate**: 5e-5
+- **Hardware**: 4x NVIDIA A100
+**What's new?**
+Compared to ViTPose, MaskPose takes instance segmentation as an additional input and is even better at distinguishing instances in multi-body scenes.
+It adds no computational overhead compared to ViTPose.
+-----------------------------------------
+## 3. Fine-tuned RTMDet-L
+- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
+- **Input**: RGB images
+- **Output**: Detected instances -- bbox, instance mask and class for each
+- **Language(s)**: Not language-dependent (vision model)
+- **License**: GPL-3.0
+- **Framework**: MMDetection
+#### Training Details
+- **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
+- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
+- **Epochs**: 10
+- **Batch size**: 16
+- **Learning rate**: 2e-2
+- **Hardware**: 4x NVIDIA A100
+**What's new?**
+This RTMDet is fine-tuned to ignore masked-out instances, which makes it suitable for iterative detection.
+It is especially effective in multi-body scenes, where background instances would otherwise go undetected.
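The iterative detection idea described above can be sketched as a simple loop: detect, mask out what was found, and detect again on the remaining image. This is an illustrative outline only, not the repository's implementation. The `detect` callable is hypothetical (any function mapping an image to a list of boolean instance masks), and filling masked pixels with zeros is an assumption, since the fill value is not stated here.

```python
import numpy as np


def iterative_detection(image, detect, max_iters=3):
    """Repeatedly detect instances and mask them out before the next pass.

    `detect` is a hypothetical callable: (H, W, 3) image -> list of (H, W) boolean masks.
    Returns all masks found across passes.
    """
    work = image.copy()  # leave the caller's image untouched
    found = []
    for _ in range(max_iters):
        masks = detect(work)
        if not masks:
            break  # nothing new was detected, so stop early
        found.extend(masks)
        for mask in masks:
            # Assumed fill value: black out the instance so the next pass skips it.
            work[mask] = 0
    return found
```

A detector that was not trained on masked-out inputs may respond unpredictably to the blacked-out regions, which is the motivation for the fine-tuning described in this section.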
+## 📄 Citation
+If you use our work, please cite:
+```bibtex
+@InProceedings{Purkrabek2025ICCV,
+  author={Purkrabek, Miroslav and Matas, Jiri},
+  title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
+  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
+  year={2025},
+  month={October},
+}
+```
+## 🧑‍💻 Authors
+- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek))
+- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))