	---
	license: gpl-3.0
	tags:
	- human-pose-estimation
	- pose-estimation
	- instance-segmentation
	- detection
	- person-detection
	- computer-vision
	datasets:
	- COCO
	- AIC
	- MPII
	- OCHuman
	metrics:
	- mAP
	pipeline_tag: keypoint-detection
	---
	<div id="toc">
	<ul align="center" style="list-style: none; padding: 0; margin: 0;">
	<summary>
	<h1 style="margin-bottom: 0.0em;">
	Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
	</h1>
	</summary>
	</ul>
	</div>
	<div id="toc">
	<ul align="center" style="list-style: none; padding: 0; margin: 0;">
	<summary>
	<h2 style="margin-bottom: 0.2em;">
	ICCV 2025
	</h2>
	</summary>
	</ul>
	</div>

	<div style="text-align: justify;">
	The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other.
	This approach enhances all three tasks simultaneously.
	Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.

	Key contributions:
	1. MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
	- Download pre-trained weights below
	2. BBox-MaskPose (BMP): method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
	- Try the demo!
	3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
	- Download pre-trained weights below
	4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
	</div>

	<div align="left">

	[![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562)
	[![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose)
	[![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
	</div>

	For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).
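
The self-improving loop described above can be pictured with a short sketch. This is not the repository's implementation: `bmp_loop`, its arguments, and the toy stand-in functions are hypothetical names chosen for illustration, and the real pipeline plugs in a detector, MaskPose, and a segmentation model in their place. The only idea it aims to show is that each stage consumes the output of the previous one, and that already-found people are masked out before the detector runs again.

```python
import numpy as np


def bmp_loop(image, detect, estimate_pose, refine_mask, num_iters=2):
    """Illustrative detect -> pose -> mask -> mask-out loop (hypothetical API)."""
    remaining = image.copy()  # working copy; found people get blanked out here
    instances = []
    for _ in range(num_iters):
        detections = detect(remaining)
        if not detections:
            break  # nothing new to find
        for bbox, mask in detections:
            keypoints = estimate_pose(image, bbox, mask)  # pose conditioned on the mask
            mask = refine_mask(image, mask, keypoints)    # mask conditioned on the pose
            instances.append({"bbox": bbox, "mask": mask, "keypoints": keypoints})
            remaining[mask] = 0  # leave a "hole" so the next pass looks elsewhere
    return instances


# Toy stand-ins so the sketch runs without any model weights.
PEOPLE = [(slice(0, 2), slice(0, 2)), (slice(2, 4), slice(2, 4))]


def toy_detect(img):
    # Report only the first person still visible, mimicking a detector
    # that misses some people until others have been masked out.
    for rows, cols in PEOPLE:
        if img[rows, cols].any():
            mask = np.zeros(img.shape[:2], dtype=bool)
            mask[rows, cols] = True
            bbox = (cols.start, rows.start, cols.stop, rows.stop)  # x1, y1, x2, y2
            return [(bbox, mask)]
    return []


def toy_pose(img, bbox, mask):
    ys, xs = np.nonzero(mask)
    return np.array([[xs.mean(), ys.mean()]])  # one "keypoint" at the mask centroid


def toy_refine(img, mask, keypoints):
    return mask  # a real refiner would return an improved mask


image = np.ones((4, 4, 3), dtype=np.uint8)
found = bmp_loop(image, toy_detect, toy_pose, toy_refine, num_iters=3)
print([inst["bbox"] for inst in found])
```

In the toy run, the second person is only reported once the first has been masked out, which is the behaviour the iterative detector is fine-tuned for.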


	## 📝 Models List

	1. ViTPose-B [multi-dataset]
	2. MaskPose-B
	3. Fine-tuned RTMDet-L

	See details of each model below.

	-----------------------------------------
	## 1. ViTPose-B [multi-dataset]

	- Model type: ViT-b backbone with multi-layer decoder
	- Input: RGB images (192x256)
	- Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
	- Language(s): Not language-dependent (vision model)
	- License: GPL-3.0
	- Framework: MMPose
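
To make the input and output shapes above concrete: a 192x256 input and a 48x64 heatmap imply a stride of 4 pixels per heatmap cell. The sketch below turns heatmaps into keypoint coordinates with a plain argmax. It is an assumption-laden simplification (the function name `decode_heatmaps` is hypothetical, and MMPose's own decoding may apply sub-pixel refinement), intended only to show how the shapes relate.

```python
import numpy as np


def decode_heatmaps(heatmaps, stride=4):
    """Argmax decoding of heatmaps (hypothetical helper, not the MMPose codec).

    heatmaps: array of shape (K, H, W), e.g. (21, 64, 48).
    Returns (coords, scores): coords is (K, 2) as (x, y) in input pixels.
    """
    num_kpts, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_kpts, -1)
    idx = flat.argmax(axis=1)                   # flat index of each peak
    scores = flat[np.arange(num_kpts), idx]     # peak value as confidence
    ys, xs = np.unravel_index(idx, (height, width))
    # Map the centre of each heatmap cell back to input-image pixels.
    coords = (np.stack([xs, ys], axis=1) + 0.5) * stride
    return coords, scores


# 21 keypoints, heatmaps 64 rows (height) by 48 columns (width).
heatmaps = np.zeros((21, 64, 48), dtype=np.float32)
heatmaps[0, 10, 5] = 1.0  # synthetic peak for keypoint 0 at row 10, column 5
coords, scores = decode_heatmaps(heatmaps)
print(coords[0], scores[0])
```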

	#### Training Details

	- Training data: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
	- Training script: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
	- Epochs: 210
	- Batch size: 64
	- Learning rate: 5e-5
	- Hardware: 4x NVIDIA A100

	What's new?
	ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose.
	The original authors already trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.

	-----------------------------------------
	## 2. MaskPose-B

	- Model type: ViT-b backbone with multi-layer decoder
	- Input: RGB images (192x256) + estimated instance segmentation
	- Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
	- Language(s): Not language-dependent (vision model)
	- License: GPL-3.0
	- Framework: MMPose
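
This card does not spell out how the instance mask enters the model. One parameter-free way to condition a pose network on a mask is to attenuate the pixels outside it, so the input keeps its three RGB channels and the network architecture is unchanged. The sketch below illustrates that idea only; `condition_on_mask` and `background_scale` are hypothetical names, and the actual MaskPose preprocessing may differ.

```python
import numpy as np


def condition_on_mask(crop, mask, background_scale=0.5):
    """Dim pixels outside the instance mask (assumed scheme, for illustration).

    crop: (H, W, 3) uint8 image crop; mask: (H, W) boolean instance mask.
    Returns a uint8 image of the same shape as `crop`.
    """
    # Weight 1.0 inside the mask, `background_scale` outside; broadcast over channels.
    weights = np.where(mask, 1.0, background_scale)[..., None]
    dimmed = np.rint(crop.astype(np.float32) * weights)
    return dimmed.astype(np.uint8)


# A 192x256 (width x height) crop with a rectangular stand-in for a person mask.
crop = np.full((256, 192, 3), 200, dtype=np.uint8)
mask = np.zeros((256, 192), dtype=bool)
mask[64:192, 48:144] = True
conditioned = condition_on_mask(crop, mask)
print(conditioned.shape, conditioned[100, 100, 0], conditioned[0, 0, 0])
```

Because the output has the same shape as a plain RGB crop, a scheme like this is consistent with the statement that MaskPose adds no parameters over ViTPose.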

	#### Training Details

	- Training data: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
	- Training script: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
	- Epochs: 210
	- Batch size: 64
	- Learning rate: 5e-5
	- Hardware: 4x NVIDIA A100

	What's new?
	Compared to ViTPose, MaskPose takes instance segmentation as an input and is even better at distinguishing instances in multi-body scenes.
	It adds no computational overhead compared to ViTPose.

	-----------------------------------------
	## 3. Fine-tuned RTMDet-L

	- Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
	- Input: RGB images
	- Output: Detected instances -- bbox, instance mask and class for each
	- Language(s): Not language-dependent (vision model)
	- License: GPL-3.0
	- Framework: MMDetection

	#### Training Details

	- Training data: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
	- Training script: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
	- Epochs: 10
	- Batch size: 16
	- Learning rate: 2e-2
	- Hardware: 4x NVIDIA A100

	What's new?
	RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection.
	It is especially effective in multi-body scenes, where people in the background would otherwise not be detected.
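
The "randomly masked-out instances" training data can be pictured as a simple augmentation: some annotated people are painted over, and their annotations are dropped so the detector learns to ignore the resulting holes. The sketch below is a guess at the general shape of such a step, not the training code. The name `mask_out_instances`, the `drop_prob` parameter, and the zero fill value are all assumptions.

```python
import numpy as np


def mask_out_instances(image, masks, drop_prob=0.5, rng=None):
    """Paint over a random subset of instances (illustrative augmentation).

    image: (H, W, 3) array; masks: list of (H, W) boolean instance masks.
    Returns (augmented_image, kept_indices), where kept_indices lists the
    instances that remain as detection targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    kept = []
    for i, mask in enumerate(masks):
        if rng.random() < drop_prob:
            out[mask] = 0      # leave a hole where the person was
        else:
            kept.append(i)     # this instance stays a training target
    return out, kept


image = np.full((8, 8, 3), 255, dtype=np.uint8)
left = np.zeros((8, 8), dtype=bool)
left[:, :4] = True
right = ~left

# drop_prob=1.0 removes every instance; drop_prob=0.0 keeps them all.
all_dropped, kept_none = mask_out_instances(image, [left, right], drop_prob=1.0)
untouched, kept_all = mask_out_instances(image, [left, right], drop_prob=0.0)
print(kept_none, kept_all)
```

Note that this simple version also blanks any pixels a dropped instance shares with a kept one; a more careful variant would handle overlapping masks explicitly.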


	## 📄 Citation

	If you use our work, please cite:

	```bibtex
	@InProceedings{Purkrabek2025ICCV,
	author={Purkrabek, Miroslav and Matas, Jiri},
	title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
	booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
	year={2025},
	month={October},
	}
	```

	## 🧑‍💻 Authors

	- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek))
	- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))