purkrmir committed on
Commit
f9bd472
·
verified ·
1 Parent(s): 2773f75

Update README.md

  1. README.md +159 -3
README.md CHANGED
@@ -1,3 +1,159 @@
1
- ---
2
- license: gpl-3.0
3
- ---
 
1
+ ---
2
+ license: gpl-3.0
3
+ tags:
4
+ - human-pose-estimation
5
+ - pose-estimation
6
+ - instance-segmentation
7
+ - detection
8
+ - person-detection
9
+ - computer-vision
10
+ datasets:
11
+ - COCO
12
+ - AIC
13
+ - MPII
14
+ - OCHuman
15
+ metrics:
16
+ - mAP
17
+ ---
18
+ <div id="toc">
19
+ <ul align="center" style="list-style: none; padding: 0; margin: 0;">
20
+ <summary>
21
+ <h1 style="margin-bottom: 0.0em;">
22
+ Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
23
+ </h1>
24
+ </summary>
25
+ </ul>
26
+ </div>
27
+ <div id="toc">
28
+ <ul align="center" style="list-style: none; padding: 0; margin: 0;">
29
+ <summary>
30
+ <h2 style="margin-bottom: 0.2em;">
31
+ ICCV 2025
32
+ </h2>
33
+ </summary>
34
+ </ul>
35
+ </div>
36
+
37
+ <div style="text-align: justify;">
38
+ The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other.
39
+ This approach enhances all three tasks simultaneously.
40
+ Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.
41
+
42
+ Key contributions:
43
+ 1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
44
+ - Download pre-trained weights below
45
+ 2. **BBox-MaskPose (BMP)**: method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
46
+ - Try the demo!
47
+ 3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
48
+ - Download pre-trained weights below
49
+ 4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
50
+ </div>
51
+
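The loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `Instance` class, the `bmp_loop` function, and the three callables (`detect`, `estimate_pose`, `refine_mask`) are hypothetical names, and the exact ordering of the steps is an assumption based on the description in this card.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    bbox: tuple                      # (x0, y0, x1, y1) in pixels
    mask: set                        # set of (x, y) pixels owned by the instance
    keypoints: list = field(default_factory=list)


def bmp_loop(image, detect, estimate_pose, refine_mask, max_iters=3):
    """Hypothetical detect -> pose -> mask loop.

    detect(image, ignore)            -> list of new Instance objects,
                                        skipping pixels listed in `ignore`
    estimate_pose(image, mask)       -> keypoints conditioned on the mask
    refine_mask(image, bbox, kpts)   -> improved mask for the instance
    """
    instances = []
    for _ in range(max_iters):
        # Pixels already explained by earlier instances are hidden
        # from the detector so it can find the remaining bodies.
        occupied = set().union(*(inst.mask for inst in instances))
        new_instances = detect(image, ignore=occupied)
        if not new_instances:
            break  # nothing new was found, so the loop has converged
        for inst in new_instances:
            inst.keypoints = estimate_pose(image, inst.mask)
            inst.mask = refine_mask(image, inst.bbox, inst.keypoints)
        instances.extend(new_instances)
    return instances
```

Passing the three models in as callables keeps the sketch independent of any particular detector, pose estimator, or segmentation model.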
52
+ <div align="left">
53
+
54
+ [![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
55
+ [![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
56
+ [![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
57
+ </div>
58
+
59
+ For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).
60
+
61
+
62
+ ## 📝 Models List
63
+
64
+ 1. **ViTPose-b multi-dataset**
65
+ 2. **MaskPose-b**
66
+ 3. fine-tuned **RTMDet-l**
67
+
68
+ See details of each model below.
69
+
70
+ -----------------------------------------
71
+ ## 1. ViTPose-B [multi-dataset]
72
+
73
+ - **Model type**: ViT-b backbone with multi-layer decoder
74
+ - **Input**: RGB images (192x256)
75
+ - **Output**: Keypoint coordinates (48x64 heatmap for each keypoint, 21 keypoints)
76
+ - **Language(s)**: Not language-dependent (vision model)
77
+ - **License**: GPL-3.0
78
+ - **Framework**: MMPose
79
+
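To make the input and output sizes above concrete: a 48x64 heatmap for a 192x256 input implies a stride of 4 in both directions. The sketch below decodes each heatmap with a plain argmax and scales the peak back to input pixels. It is a simplified illustration, not the MMPose decoder (which may apply sub-pixel refinement), and `decode_heatmaps` is a hypothetical name.

```python
def decode_heatmaps(heatmaps, stride=4):
    """Convert per-keypoint heatmaps to (x, y, score) in input-image pixels.

    heatmaps: list of K heatmaps, each a list of 64 rows with 48 values.
    stride:   input size / heatmap size (192 / 48 == 256 / 64 == 4).
    """
    keypoints = []
    for heatmap in heatmaps:
        best_score, best_x, best_y = float("-inf"), 0, 0
        for y, row in enumerate(heatmap):
            for x, value in enumerate(row):
                if value > best_score:
                    best_score, best_x, best_y = value, x, y
        # Scale the peak location from heatmap cells to input pixels.
        keypoints.append((best_x * stride, best_y * stride, best_score))
    return keypoints
```

The returned score is the raw heatmap peak and can serve as a rough per-keypoint confidence.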
80
+ #### Training Details
81
+
82
+ - **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
83
+ - **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
84
+ - **Epochs**: 210
85
+ - **Batch size**: 64
86
+ - **Learning rate**: 5e-5
87
+ - **Hardware**: 4x NVIDIA A100
88
+
89
+ **What's new?**
90
+ ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose.
91
+ The original ViTPose authors already trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
92
+
93
+ -----------------------------------------
94
+ ## 2. MaskPose-B
95
+
96
+ - **Model type**: ViT-b backbone with multi-layer decoder
97
+ - **Input**: RGB images (192x256) + estimated instance segmentation
98
+ - **Output**: Keypoint coordinates (48x64 heatmap for each keypoint, 21 keypoints)
99
+ - **Language(s)**: Not language-dependent (vision model)
100
+ - **License**: GPL-3.0
101
+ - **Framework**: MMPose
102
+
103
+ #### Training Details
104
+
105
+ - **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
106
+ - **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
107
+ - **Epochs**: 210
108
+ - **Batch size**: 64
109
+ - **Learning rate**: 5e-5
110
+ - **Hardware**: 4x NVIDIA A100
111
+
112
+ **What's new?**
113
+ Compared to ViTPose, MaskPose takes instance segmentation as an input and is even better at distinguishing instances in multi-body scenes.
114
+ No computational overhead compared to ViTPose.
115
+
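This card does not state how the mask is fed to the model. One parameter-free option, shown here purely as an assumption and not as the authors' method, is to attenuate pixels outside the instance mask before the crop reaches the backbone. The function name `condition_on_mask` and the `background_gain` parameter are hypothetical.

```python
def condition_on_mask(image, mask, background_gain=0.0):
    """Attenuate pixels that fall outside the instance mask.

    image: H x W x 3 nested lists of floats.
    mask:  H x W nested lists of 0/1 (1 = pixel belongs to the instance).
    background_gain: multiplier applied to background pixels
                     (0.0 blacks them out, 1.0 leaves the image unchanged).
    """
    return [
        [
            [c if m else c * background_gain for c in pixel]
            for pixel, m in zip(image_row, mask_row)
        ]
        for image_row, mask_row in zip(image, mask)
    ]
```

Because the conditioning happens on the input image, the network architecture and its parameter count stay the same as for a box-conditioned model, which is consistent with the "no computational overhead" claim above.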
116
+ -----------------------------------------
117
+ ## 3. fine-tuned RTMDet-L
118
+
119
+ - **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
120
+ - **Input**: RGB images
121
+ - **Output**: Detected instances -- bbox, instance mask and class for each
122
+ - **Language(s)**: Not language-dependent (vision model)
123
+ - **License**: GPL-3.0
124
+ - **Framework**: MMDetection
125
+
126
+ #### Training Details
127
+
128
+ - **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
129
+ - **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
130
+ - **Epochs**: 10
131
+ - **Batch size**: 16
132
+ - **Learning rate**: 2e-2
133
+ - **Hardware**: 4x NVIDIA A100
134
+
135
+ **What's new?**
136
+ RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection.
136
+ It is especially effective in multi-body scenes, where background instances would otherwise go undetected.
138
+
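The "randomly masked-out instances" training data can be pictured with the sketch below. It is an assumption about the augmentation, not the training script from the repository: each annotated instance is erased from the image with some probability, and its annotation is dropped so the detector learns to ignore the resulting hole. The names `mask_out_instances`, `drop_prob`, and `fill` are hypothetical.

```python
import random


def mask_out_instances(image, instances, drop_prob=0.5, rng=None, fill=0):
    """Randomly erase instances from an image and from its annotations.

    image:     H x W grid (list of rows); a modified copy is returned.
    instances: list of dicts, each with a 'mask' iterable of (x, y) pixels.
    drop_prob: probability that an instance is masked out.
    Returns (augmented_image, kept_instances).
    """
    rng = rng or random.Random()
    augmented = [row[:] for row in image]  # copy so the input stays intact
    kept = []
    for inst in instances:
        if rng.random() < drop_prob:
            # Erase the instance; it is no longer a detection target.
            for x, y in inst["mask"]:
                augmented[y][x] = fill
        else:
            kept.append(inst)
    return augmented, kept
```

At inference time the same masking would be applied to instances found in earlier iterations, so each detector pass only sees the bodies that are still unexplained.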
139
+
140
+ ## 📄 Citation
141
+
142
+ If you use our work, please cite:
143
+
144
+ ```bibtex
145
+ @InProceedings{Purkrabek2025ICCV,
146
+ author={Purkrabek, Miroslav and Matas, Jiri},
147
+ title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
148
+ booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
149
+ year={2025},
150
+ month={October},
151
+ }
152
+ ```
153
+
154
+ ## 🧑‍💻 Authors
155
+
156
+ - Miroslav Purkrabek ([GitHub profile](https://github.com/MiraPurkrabek))
157
+ - Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))
158
+
159
+