Sapiens

Foundation for Human Vision Models

Rawal Khirodkar · Timur Bagautdinov · Julieta Martinez · Su Zhaoen · Austin James
Peter Selednik · Stuart Anderson · Shunsuke Saito

ECCV 2024 - Best Paper Candidate

Sapiens offers a comprehensive suite for human-centric vision tasks (e.g., 2D pose, part segmentation, depth, and surface normal estimation). The model family is pretrained on 300 million in-the-wild human images and generalizes well to unconstrained conditions. The models are also designed for extracting high-resolution features: they are natively trained at a 1024 × 1024 image resolution with a 16-pixel patch size.
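To give a feel for what the native resolution implies, here is a small back-of-the-envelope sketch. It assumes the image is split into non-overlapping square patches, as in a standard vision transformer; that assumption and the helper name `patch_grid` are illustrative and not taken from the Sapiens codebase.

```python
def patch_grid(image_size: int = 1024, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Assumes non-overlapping square patches (an illustrative assumption).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
print(per_side, total)  # 64 4096
```

Under this assumption, a 1024 × 1024 input yields a 64 × 64 grid, i.e. 4096 patches per image.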

## 🚀 Getting Started

### Clone the Repository

```bash
git clone https://github.com/facebookresearch/sapiens.git
export SAPIENS_ROOT=/path/to/sapiens
```

### Recommended: Lite Installation (Inference-only)

For users setting up their own environment primarily to run existing models in inference mode, we recommend the [Sapiens-Lite installation](lite/README.md).\
This setup offers optimized inference (4x faster) with minimal dependencies (only PyTorch + numpy + cv2).

### Full Installation

To replicate our complete training setup, run the provided installation script.\
This will create a new conda environment named `sapiens` and install all necessary dependencies.

```bash
cd $SAPIENS_ROOT/_install
./conda.sh
```

Please download the **original** checkpoints from [Hugging Face](https://huggingface.co/facebook/sapiens).\
You can download only the checkpoints you are interested in.\
Set `$SAPIENS_CHECKPOINT_ROOT` to the path of the `sapiens_host` folder. Place the checkpoints following this directory structure:

```plaintext
sapiens_host/
├── detector/
│   └── checkpoints/
│       └── rtmpose/
├── pretrain/
│   └── checkpoints/
│       ├── sapiens_0.3b/
│       │   └── sapiens_0.3b_epoch_1600_clean.pth
│       ├── sapiens_0.6b/
│       │   └── sapiens_0.6b_epoch_1600_clean.pth
│       ├── sapiens_1b/
│       └── sapiens_2b/
├── pose/
│   └── checkpoints/
│       └── sapiens_0.3b/
├── seg/
├── depth/
└── normal/
```

## 🌟 Human-Centric Vision Tasks

We finetune Sapiens for multiple human-centric vision tasks. Please check out the list below.

- ### [Image Encoder](docs/PRETRAIN_README.md) ^[lite]
- ### [Pose Estimation](docs/POSE_README.md) ^[lite]
- ### [Body Part Segmentation](docs/SEG_README.md) ^[lite]
- ### [Depth Estimation](docs/DEPTH_README.md) ^[lite]
- ### [Surface Normal Estimation](docs/NORMAL_README.md) ^[lite]

## 🎯 Easy Steps to Finetuning Sapiens

Finetuning our models is super-easy! Here is a detailed training guide for the following tasks.

- ### [Pose Estimation](docs/finetune/POSE_README.md)
- ### [Body-Part Segmentation](docs/finetune/SEG_README.md)
- ### [Depth Estimation](docs/finetune/DEPTH_README.md)
- ### [Surface Normal Estimation](docs/finetune/NORMAL_README.md)

## 📈 Quantitative Evaluations

- ### [Pose Estimation](docs/evaluate/POSE_README.md)

## 🤝 Acknowledgements & Support & Contributing

We would like to acknowledge the work by [OpenMMLab](https://github.com/open-mmlab) which this project benefits from.\
For any questions or issues, please open an issue in the repository.\
See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).

## License

This project is licensed under [LICENSE](LICENSE).\
Portions derived from open-source projects are licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## 📚 Citation

If you use Sapiens in your research, please consider citing us.

```bibtex
@article{khirodkar2024sapiens,
  title={Sapiens: Foundation for Human Vision Models},
  author={Khirodkar, Rawal and Bagautdinov, Timur and Martinez, Julieta and Zhaoen, Su and James, Austin and Selednik, Peter and Anderson, Stuart and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2408.12569},
  year={2024}
}
```