|
|
--- |
|
|
license: gpl-3.0 |
|
|
tags: |
|
|
- human-pose-estimation |
|
|
- pose-estimation |
|
|
- instance-segmentation |
|
|
- detection |
|
|
- person-detection |
|
|
- computer-vision |
|
|
datasets: |
|
|
- COCO |
|
|
- AIC |
|
|
- MPII |
|
|
- OCHuman |
|
|
metrics: |
|
|
- mAP |
|
|
pipeline_tag: keypoint-detection |
|
|
--- |
|
|
<div id="toc">
|
|
<ul align="center" style="list-style: none; padding: 0; margin: 0;"> |
|
|
<summary> |
|
|
<h1 style="margin-bottom: 0.0em;"> |
|
|
Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle |
|
|
</h1> |
|
|
</summary> |
|
|
</ul> |
|
|
</div> |
|
|
<div id="toc">
|
|
<ul align="center" style="list-style: none; padding: 0; margin: 0;"> |
|
|
<summary> |
|
|
<h2 style="margin-bottom: 0.2em;"> |
|
|
ICCV 2025 |
|
|
</h2> |
|
|
</summary> |
|
|
</ul> |
|
|
</div> |
|
|
|
|
|
<div style="text-align: justify;"> |
|
|
The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other. |
|
|
This approach enhances all three tasks simultaneously. |
|
|
Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches. |
|
|
|
|
|
Key contributions: |
|
|
1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
|
|
- Download pre-trained weights below |
|
|
2. **BBox-MaskPose (BMP)**: method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation |
|
|
- Try the demo! |
|
|
3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
|
|
- Download pre-trained weights below |
|
|
4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent from MMPose.
|
|
</div> |
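
The loop described above can be pictured with a minimal sketch. It assumes three hypothetical callables (`detect`, `estimate_pose`, `refine_mask`) standing in for the detector, MaskPose, and a pose-guided mask refinement step; their names and signatures, and the choice to blacken found instances with zeros, are illustrative assumptions rather than the repository's API.

```python
from typing import Callable, List, Tuple

import numpy as np

# Hypothetical stage signatures (assumptions, not the BBoxMaskPose API).
Detector = Callable[[np.ndarray], List[np.ndarray]]  # image -> boolean masks (H, W)
PoseModel = Callable[[np.ndarray, np.ndarray], np.ndarray]  # image, mask -> keypoints
Refiner = Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]  # -> refined mask


def bmp_loop(
    image: np.ndarray,
    detect: Detector,
    estimate_pose: PoseModel,
    refine_mask: Refiner,
    max_iters: int = 3,
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Alternate detection, pose estimation and mask refinement.

    After each pass, the instances found so far are masked out of a working
    copy, so the next detection pass can focus on people that were missed.
    """
    working = image.copy()  # the caller's image is never modified
    instances: List[Tuple[np.ndarray, np.ndarray]] = []
    for _ in range(max_iters):
        masks = detect(working)
        if not masks:
            break  # nothing new was found, stop early
        for mask in masks:
            # Pose and refinement see the original pixels, not the holes.
            keypoints = estimate_pose(image, mask)
            refined = refine_mask(image, mask, keypoints)
            instances.append((refined, keypoints))
            working[refined] = 0  # leave a 'hole' for the next pass
    return instances
```

Passing the stages in as callables keeps the sketch independent of any particular framework; in practice each one would wrap a model from this card.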
|
|
|
|
|
<div align="left"> |
|
|
|
|
|
[arXiv paper](https://arxiv.org/abs/2412.01562)

[GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose)

[Project page](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
|
|
</div> |
|
|
|
|
|
For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose). |
|
|
|
|
|
|
|
|
## Models List
|
|
|
|
|
1. **ViTPose-b multi-dataset** |
|
|
2. **MaskPose-b** |
|
|
3. fine-tuned **RTMDet-l** |
|
|
|
|
|
See details of each model below. |
|
|
|
|
|
----------------------------------------- |
|
|
## 1. ViTPose-B [multi-dataset] |
|
|
|
|
|
- **Model type**: ViT-b backbone with multi-layer decoder |
|
|
- **Input**: RGB images (192x256) |
|
|
- **Output**: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
|
|
- **Language(s)**: Not language-dependent (vision model) |
|
|
- **License**: GPL-3.0 |
|
|
- **Framework**: MMPose |
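
To make the output format concrete, here is a minimal sketch of turning the heatmaps into keypoint coordinates. It assumes that "192x256" and "48x64" are width x height, so the heatmaps arrive as an array of shape `(21, 64, 48)` with a stride of 4 relative to the input crop. It uses a plain argmax; the function name `decode_heatmaps` is hypothetical, and MMPose's own decoding may apply additional sub-pixel refinement.

```python
import numpy as np


def decode_heatmaps(heatmaps: np.ndarray, stride: int = 4):
    """Decode (K, H, W) heatmaps into (K, 2) xy coordinates and (K,) scores.

    Coordinates are expressed in pixels of the input crop, assuming the
    heatmap is `stride` times smaller than the crop in both dimensions.
    """
    num_kpts, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_kpts, -1)
    idx = flat.argmax(axis=1)  # index of the peak in each heatmap
    scores = flat[np.arange(num_kpts), idx]  # peak value used as confidence
    ys, xs = np.unravel_index(idx, (height, width))
    coords = np.stack([xs, ys], axis=1).astype(np.float32) * stride
    return coords, scores
```

For a crop of 192x256 pixels, a peak at heatmap column 5 and row 10 maps back to roughly pixel (20, 40) in the crop.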
|
|
|
|
|
#### Training Details |
|
|
|
|
|
- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
|
|
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose) |
|
|
- **Epochs**: 210 |
|
|
- **Batch size**: 64 |
|
|
- **Learning rate**: 5e-5 |
|
|
- **Hardware**: 4x NVIDIA A100
|
|
|
|
|
**What's new?** |
|
|
ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose.

The original ViTPose authors already trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
|
|
|
|
|
----------------------------------------- |
|
|
## 2. MaskPose-B |
|
|
|
|
|
- **Model type**: ViT-b backbone with multi-layer decoder |
|
|
- **Input**: RGB images (192x256) + estimated instance segmentation |
|
|
- **Output**: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
|
|
- **Language(s)**: Not language-dependent (vision model) |
|
|
- **License**: GPL-3.0 |
|
|
- **Framework**: MMPose |
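
This card does not spell out how the mask enters the network. One parameter-free option consistent with "without adding parameters" is to attenuate the pixels outside the instance mask before the crop reaches the backbone. The sketch below illustrates that idea only; the function name `condition_on_mask` and the `background_scale` parameter are hypothetical, and the actual MaskPose preprocessing may differ.

```python
import numpy as np


def condition_on_mask(
    crop: np.ndarray, mask: np.ndarray, background_scale: float = 0.0
) -> np.ndarray:
    """Attenuate pixels outside the instance mask.

    crop: (H, W, 3) image crop; mask: (H, W) boolean instance mask.
    background_scale=0.0 blackens the background, 1.0 leaves it untouched.
    """
    if crop.shape[:2] != mask.shape:
        raise ValueError("crop and mask must share height and width")
    # Weight 1.0 inside the mask, background_scale outside, broadcast over RGB.
    weights = np.where(mask, 1.0, background_scale)[..., None]
    return (crop.astype(np.float32) * weights).astype(crop.dtype)
```

Because the conditioning happens on the input image, the network architecture and its parameter count stay the same as for a bounding-box-conditioned model.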
|
|
|
|
|
#### Training Details |
|
|
|
|
|
- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
|
|
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose) |
|
|
- **Epochs**: 210 |
|
|
- **Batch size**: 64 |
|
|
- **Learning rate**: 5e-5 |
|
|
- **Hardware**: 4x NVIDIA A100
|
|
|
|
|
**What's new?** |
|
|
Compared to ViTPose, MaskPose takes instance segmentation as an additional input and is even better at distinguishing instances in multi-body scenes.

It adds no computational overhead compared to ViTPose.
|
|
|
|
|
----------------------------------------- |
|
|
## 3. fine-tuned RTMDet-L |
|
|
|
|
|
- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head |
|
|
- **Input**: RGB images |
|
|
- **Output**: Detected instances -- bbox, instance mask and class for each |
|
|
- **Language(s)**: Not language-dependent (vision model) |
|
|
- **License**: GPL-3.0 |
|
|
- **Framework**: MMDetection |
|
|
|
|
|
#### Training Details |
|
|
|
|
|
- **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances |
|
|
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose) |
|
|
- **Epochs**: 10 |
|
|
- **Batch size**: 16 |
|
|
- **Learning rate**: 2e-2 |
|
|
- **Hardware**: 4x NVIDIA A100
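
The "randomly masked-out instances" in the training data can be pictured with a short augmentation sketch. It assumes instance masks are boolean arrays, that masked-out pixels are filled with zeros, and that each instance is dropped independently with probability `drop_prob`; the function name and all three choices are illustrative assumptions, not the settings used for fine-tuning.

```python
import numpy as np


def mask_out_random_instances(image, instance_masks, drop_prob=0.5, rng=None):
    """Blacken a random subset of instances and report which ones remain.

    image: (H, W, 3) array; instance_masks: list of (H, W) boolean arrays.
    Returns the augmented image and the indices of instances that were kept,
    so that annotations of masked-out instances can be dropped from training.
    Note: where a dropped instance overlaps a kept one, the shared pixels
    are blackened as well.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    kept = []
    for i, mask in enumerate(instance_masks):
        if rng.random() < drop_prob:
            out[mask] = 0  # the detector should learn to ignore this 'hole'
        else:
            kept.append(i)
    return out, kept
```

Training on such images teaches the detector not to fire on blackened regions, which is what makes repeated detection passes over a partially masked image useful.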
|
|
|
|
|
**What's new?** |
|
|
RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection.

It is especially effective in multi-body scenes, where background instances would otherwise go undetected.
|
|
|
|
|
|
|
|
## Citation
|
|
|
|
|
If you use our work, please cite: |
|
|
|
|
|
```bibtex |
|
|
@InProceedings{Purkrabek2025ICCV, |
|
|
author={Purkrabek, Miroslav and Matas, Jiri}, |
|
|
title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle}, |
|
|
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, |
|
|
year={2025}, |
|
|
month={October}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Authors
|
|
|
|
|
- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek)) |
|
|
- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/)) |