---
license: gpl-3.0
tags:
- human-pose-estimation
- pose-estimation
- instance-segmentation
- detection
- person-detection
- computer-vision
datasets:
- COCO
- AIC
- MPII
- OCHuman
metrics:
- mAP
pipeline_tag: keypoint-detection
---
<div id="toc" align="center">
  <h1 style="margin-bottom: 0.0em;">
    Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
  </h1>
  <h2 style="margin-bottom: 0.2em;">
    ICCV 2025
  </h2>
</div>

<div style="text-align: justify;">
The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other.
This approach enhances all three tasks simultaneously.
Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.

Key contributions:
1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
    - Download pre-trained weights below
2. **BBox-MaskPose (BMP)**: method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
    - Try the demo!
3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes' left by masked-out instances)
    - Download pre-trained weights below
4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
</div>
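The self-improving loop can be sketched structurally as follows. This is a minimal illustration of the control flow only; `detect`, `segment`, and `estimate_pose` are hypothetical stand-ins for the real RTMDet, SAM, and MaskPose models from the repository.

```python
import numpy as np

def mask_out(image, masks, fill=0):
    """Hide already-processed instances so the next detection pass
    can focus on the remaining ones."""
    out = image.copy()
    for m in masks:
        out[m] = fill
    return out

def bmp_loop(image, detect, segment, estimate_pose, num_iters=2):
    """Sketch of the BBox-Mask-Pose cycle: each stage conditions the
    next (boxes -> masks -> poses), and detection re-runs on an image
    with the found instances masked out."""
    results = []
    work = image
    for _ in range(num_iters):
        boxes = detect(work)          # detection on the masked image
        if not boxes:
            break
        masks = segment(image, boxes)        # masks conditioned on boxes
        poses = estimate_pose(image, masks)  # poses conditioned on masks
        results.extend(zip(boxes, masks, poses))
        work = mask_out(work, masks)
    return results
```

Each iteration can surface people that were occluded by instances found in earlier iterations, which is what makes the circle "virtuous".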

<div align="left">

[![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
[![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
[![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
</div>

For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).


## ๐Ÿ“ Models List

1. **ViTPose-b multi-dataset**
2. **MaskPose-b**
3. fine-tuned **RTMDet-l**

See details of each model below.

-----------------------------------------
## 1. ViTPose-B [multi-dataset]

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256)
- **Output**: Keypoint coordinates (a 48x64 heatmap per keypoint, 21 keypoints)
- **Language(s)**: Not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A-100

**What's new?**
ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-only ViTPose.
The original authors had already trained the model in a multi-dataset setup; this release is a reproduction compatible with MMPose 2.0.
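The 48x64 heatmap output listed above is typically decoded to crop-space pixel coordinates by taking a per-keypoint argmax and rescaling. The sketch below shows only that core step; MMPose additionally applies sub-pixel refinement (e.g. UDP decoding), which is omitted here.

```python
import numpy as np

def decode_heatmaps(heatmaps, crop_w=192, crop_h=256):
    """Decode (K, 64, 48) keypoint heatmaps to (K, 2) xy coordinates
    in the input crop, plus a per-keypoint confidence score taken
    from the heatmap peak value."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)
    ys, xs = np.unravel_index(idx, (h, w))
    # rescale from heatmap resolution to crop resolution
    coords = np.stack([xs * (crop_w / w), ys * (crop_h / h)], axis=1)
    return coords, flat.max(axis=1)
```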

-----------------------------------------
## 2. MaskPose-B

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: Keypoint coordinates (a 48x64 heatmap per keypoint, 21 keypoints)
- **Language(s)**: Not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A-100

**What's new?**
Compared to ViTPose, MaskPose takes an instance segmentation mask as an additional input and is better at distinguishing instances in multi-body scenes.
It adds no computational overhead compared to ViTPose.
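One parameter-free way to condition a pose model on an instance mask is to attenuate pixels outside the target instance before the crop is fed to the backbone. The weighting scheme below is an illustrative assumption, not necessarily the exact MaskPose recipe; see the repository for the actual implementation.

```python
import numpy as np

def condition_on_mask(crop, mask, background_weight=0.3):
    """Attenuate pixels outside the target instance's mask so the
    backbone focuses on a single body. No extra parameters are
    introduced, so the ViTPose architecture is unchanged.
    `background_weight` is a hypothetical knob: 0.0 removes context
    entirely, 1.0 recovers plain ViTPose behavior."""
    weight = np.where(mask[..., None], 1.0, background_weight)
    return crop * weight
```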

-----------------------------------------
## 3. fine-tuned RTMDet-L

- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- **Input**: RGB images
- **Output**: Detected instances -- bbox, instance mask and class for each
- **Language(s)**: Not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMDetection

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 10
- **Batch size**: 16
- **Learning rate**: 2e-2
- **Hardware**: 4x NVIDIA A-100

**What's new?**
RTMDet is fine-tuned to ignore masked-out instances and is designed for iterative detection.
It is especially effective in multi-body scenes, where background instances would otherwise go undetected.
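The training data above ("COCO with randomly masked-out instances") could be generated along these lines. This is a minimal sketch under stated assumptions; the exact masking strategy and fill value used in the paper may differ.

```python
import numpy as np

def randomly_mask_instances(image, instance_masks, p=0.5, rng=None):
    """Hide a random subset of annotated instances (filling their
    pixels with zeros) and keep annotations only for instances that
    stay visible. The fine-tuned detector learns to treat such
    'holes' as background rather than re-detecting them."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    kept = []
    for m in instance_masks:
        if rng.random() < p:
            out[m] = 0       # masked out -> annotation dropped
        else:
            kept.append(m)   # still visible -> annotation kept
    return out, kept
```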


## 📄 Citation

If you use our work, please cite:

```bibtex
@InProceedings{Purkrabek2025ICCV,
  author={Purkrabek, Miroslav and Matas, Jiri},
  title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle}, 
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025},
  month={October},
}
```

## 🧑‍💻 Authors

- Miroslav Purkrabek ([GitHub](https://github.com/MiraPurkrabek))
- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))