---
library_name: bboxmaskpose
---
<ul align="center" style="list-style: none; padding: 0; margin: 0;">
<summary>
<h2 style="margin-bottom: 0.2em;">
ICCV 2025 + CVPR 2025
</h2>
</summary>
</ul>
</div>



<div style="text-align: justify;">
The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other.
This approach enhances all three tasks simultaneously.
Using segmentation masks instead of bounding boxes improves performance in crowded scenes.

Key contributions:
1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
    - Download pre-trained weights below
2. **PMPose**: a pose estimation model conditioned on segmentation masks AND predicting a full description of each keypoint. A combination of MaskPose and ProbPose (CVPR'25).
3. **BBox-MaskPose (BMP)**: a method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation, and pose estimation
    - Try the demo!
4. A fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
    - Download pre-trained weights below
5. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
</div>

For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).

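The BMP loop described above can be sketched on toy data. Everything below is a hypothetical stand-in for illustration, not the project's API: instances are plain pixel sets, the "detector" is a flood fill over remaining foreground pixels, and the "pose" is just the mask centroid.

```python
# Toy sketch of the BMP cycle: detect -> estimate pose from the mask ->
# remove ("mask out") the instance -> detect again.
# All components are hypothetical stand-ins, not the real models.

# toy "image": foreground pixels of two people standing apart
person_a = {(x, y) for x in range(0, 3) for y in range(0, 3)}
person_b = {(x, y) for x in range(5, 8) for y in range(0, 3)}
remaining = person_a | person_b

def detect_one(pixels):
    """Hypothetical detector stand-in: flood-fill one connected blob, or None."""
    if not pixels:
        return None
    seed = min(pixels)
    blob, frontier = {seed}, [seed]
    while frontier:
        x, y = frontier.pop()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in pixels and nxt not in blob:
                blob.add(nxt)
                frontier.append(nxt)
    return blob

def estimate_pose(mask):
    """Hypothetical pose stand-in: one 'keypoint' at the mask centroid."""
    n = len(mask)
    return (sum(x for x, _ in mask) / n, sum(y for _, y in mask) / n)

instances = []
while (mask := detect_one(remaining)) is not None:
    pose = estimate_pose(mask)   # pose conditioned on the mask, not a box
    instances.append((mask, pose))
    remaining -= mask            # punch a "hole" before the next detection pass

print(len(instances))  # prints 2
```

The point of the sketch is the control flow: each detected instance is segmented, its pose is read off the mask, and its pixels are removed before the next detection pass, which is how the loop can recover overlapping people one by one.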
## 📝 Models List

1. **ViTPose-b multi-dataset**
2. **MaskPose**
3. **PMPose**
4. fine-tuned **RTMDet-l**

See details of each model below.


## 1. ViTPose-b multi-dataset

ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenes.
The model was trained in a multi-dataset setup by the authors before; this is a reproduction compatible with MMPose 2.0.

-----------------------------------------
## 2. MaskPose-1.1.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: Keypoint coordinates (48x64 heatmap for each keypoint, 23 keypoints)
- **Language(s)**: Not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H
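Since each keypoint is emitted as a 48x64 heatmap for a 192x256 input crop, decoding can be sketched as an argmax plus a stride-4 upscale. This assumes the common top-down heatmap convention; the repository's actual decoder may differ (e.g. with sub-pixel refinement).

```python
# Decode one 48x64 keypoint heatmap into 192x256 input-crop coordinates.
# Assumes the usual top-down convention (heatmap = input resolution / 4,
# peak = keypoint); the real decoder may refine the peak sub-pixel.

def decode_heatmap(heatmap):
    """heatmap: 64 rows x 48 cols of scores -> (x, y, score) in input pixels."""
    best = (0, 0)
    for row in range(len(heatmap)):
        for col in range(len(heatmap[row])):
            if heatmap[row][col] > heatmap[best[0]][best[1]]:
                best = (row, col)
    row, col = best
    scale = 192 / 48  # == 256 / 64 == 4, the heatmap-to-input stride
    return (col * scale, row * scale, heatmap[row][col])

# synthetic heatmap with a single peak at column 10, row 20
hm = [[0.0] * 48 for _ in range(64)]
hm[20][10] = 0.9
x, y, score = decode_heatmap(hm)
print(x, y, score)  # prints 40.0 80.0 0.9
```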

#### Training Details

Compared to ViTPose, MaskPose takes instance segmentation as an input and is even better at distinguishing instances in multi-body scenes.
It adds no computational overhead compared to ViTPose.

**V1.0.0 vs. V1.1.0**
The previous version (v1.0.0) predicted 21 keypoints and was trained with a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved training recipe with dataset balancing, which improves the numbers.

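The dataset balancing mentioned above can be illustrated with a weighted multi-dataset sampler: each batch slot draws a dataset by a fixed weight rather than in proportion to dataset size. The dataset sizes and weights below are invented for the sketch and are not the ratios used in training.

```python
# Hypothetical sketch of multi-dataset balancing: pick a dataset per batch
# slot with fixed weights, then a uniform sample inside it. Sizes and
# weights are made up for illustration.
import random

datasets = {
    "COCO": range(100_000),  # stand-in sample indices
    "MPII": range(25_000),
    "AIC":  range(200_000),
}
weights = {"COCO": 0.5, "MPII": 0.2, "AIC": 0.3}  # per-slot sampling ratio

rng = random.Random(0)
names = list(datasets)

def sample_batch(batch_size=64):
    """Weighted dataset choice per slot, then uniform within the dataset."""
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(n, rng.choice(datasets[n])) for n in chosen]

batch = sample_batch()
print(len(batch))  # prints 64
```

Note the design choice this sketch encodes: without such balancing, the largest dataset (here AIC) would dominate every batch.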
-----------------------------------------
## 3. PMPose-1.0.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: Keypoint coordinates (48x64 probmap for each keypoint, 23 keypoints), presence probabilities, visibilities, and expected OKS for each keypoint
- **Language(s)**: Not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 20
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Frozen backbone**
- **Hardware**: 4x NVIDIA A100

**What's new?**
PMPose combines MaskPose-1.1.0 and [ProbPose (CVPR'25)](https://mirapurkrabek.github.io/ProbPose/). It is conditioned on masks and has the same superior in-crowd performance as MaskPose, and it also predicts probabilities and visibilities like ProbPose.

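The expected-OKS output refers to the Object Keypoint Similarity used in COCO keypoint evaluation. A single keypoint's contribution can be sketched as follows; the coordinates and area are made up for the example, while k = 0.026 is the published COCO constant for the nose.

```python
# One keypoint's OKS term from the COCO keypoint metric:
#   oks_i = exp(-d_i^2 / (2 * s^2 * k_i^2))
# d_i: pixel distance prediction vs. ground truth,
# s^2: instance area, k_i: per-keypoint falloff constant.
import math

def keypoint_oks(pred, gt, area, k):
    d2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    return math.exp(-d2 / (2 * area * k ** 2))

# k = 0.026 is the COCO constant for the nose; area/coords are illustrative
perfect = keypoint_oks((50.0, 80.0), (50.0, 80.0), area=10000.0, k=0.026)
off = keypoint_oks((55.0, 80.0), (50.0, 80.0), area=10000.0, k=0.026)
print(perfect)  # prints 1.0
```

A 5-pixel miss on a 100x100-pixel instance already drops this term well below 1, which is why a per-keypoint expected OKS is an informative confidence signal.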
-----------------------------------------
## 4. fine-tuned RTMDet-L

- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- **Input**: RGB images

Especially effective in multi-body scenes where background would not be detected.

If you use our work, please cite:

```bibtex
@InProceedings{BMPv2,
  author    = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
  title     = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
  booktitle = {arXiv preprint arXiv:to be added},
  year      = {2026}
}
```
```bibtex
@InProceedings{Purkrabek2025ICCV,
  author    = {Purkrabek, Miroslav and Matas, Jiri},
  title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025}
}
```
```bibtex
@InProceedings{Kolomiiets2026CVWW,
  author    = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
  title     = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
  booktitle = {Computer Vision Winter Workshop (CVWW)},
  year      = {2026}
}
```

## 🧑‍💻 Authors

- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek))
- Constantin Kolomiiets
- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))