Add comprehensive model card for Hulk
This PR significantly enhances the model card for the `OpenGVLab/Hulk` model.
It includes:
- Adding `pipeline_tag: any-to-any` to reflect the model's generalist, multimodal capabilities.
- Adding `library_name: transformers` to indicate compatibility with the Transformers library and enable the "Use in Transformers" widget.
- Adding comprehensive tags like `human-centric`, `multimodal`, `2d-vision`, `3d-vision`, `vision-language`, and various task-specific tags for better discoverability.
- Incorporating the full paper abstract, project page, and GitHub repository links for detailed information.
- Including key visuals (teaser and framework images) from the GitHub repository.
- Providing clear guidance on where to find usage examples, training, and evaluation details (the official GitHub repository).
- Adding the academic citation.
These changes will greatly improve the model's visibility and utility on the Hugging Face Hub.
The previous `README.md` contained only the `license: mit` front matter; the updated model card reads as follows:
---
license: mit
pipeline_tag: any-to-any
library_name: transformers
tags:
- human-centric
- multimodal
- 2d-vision
- 3d-vision
- skeleton-based
- vision-language
- pose-estimation
- object-detection
- image-segmentation
- action-recognition
- image-captioning
- attribute-recognition
---

# Hulk: A Universal Knowledge Translator for Human-Centric Tasks

This model was presented in the paper [Hulk: A Universal Knowledge Translator for Human-Centric Tasks](https://huggingface.co/papers/2312.01697).

* **Project Page**: [https://humancentricmodels.github.io/Hulk/](https://humancentricmodels.github.io/Hulk/)
* **GitHub Repository**: [https://github.com/OpenGVLab/Hulk](https://github.com/OpenGVLab/Hulk)
* **ArXiv Paper**: [https://arxiv.org/abs/2312.01697](https://arxiv.org/abs/2312.01697)

<p align="center">
  <img src="https://huggingface.co/OpenGVLab/Hulk/resolve/main/assets/teaser.png" width="1000" />
</p>
## Abstract

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks.
## Model Framework

<p align="center">
  <img src="https://huggingface.co/OpenGVLab/Hulk/resolve/main/assets/framework.png" width="1000" />
</p>
## Usage

For detailed installation instructions, dataset preparation, training procedures, evaluation scripts, and comprehensive inference examples across various human-centric tasks, please refer to the official [Hulk GitHub repository](https://github.com/OpenGVLab/Hulk).
The codebase is built on top of the 🤗 [Diffusers](https://github.com/huggingface/diffusers) and 🤗 [Transformers](https://github.com/huggingface/transformers) libraries, and users should consult the repository for specific usage patterns.
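As a minimal sketch (not an official loading API), the files hosted in this Hub repository can be fetched locally with `huggingface_hub` before running the scripts from the GitHub codebase; the exact checkpoint file names and loading entry points are defined by that codebase and are not assumed here:

```python
from huggingface_hub import snapshot_download

# Fetch all files from the OpenGVLab/Hulk Hub repository (weights, configs, assets)
# into a local directory; loading and inference are handled by the scripts in the
# official GitHub repository.
local_dir = snapshot_download(repo_id="OpenGVLab/Hulk")
print(f"Hulk files downloaded to: {local_dir}")
```

From there, the installation, training, and evaluation instructions in the GitHub README apply unchanged.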

## Model Performance

Hulk has achieved state-of-the-art results on various human-centric benchmarks, demonstrating its superiority in both direct evaluation and fine-tuning scenarios. For detailed performance metrics across different tasks and datasets, please consult the tables in the [GitHub README](https://github.com/OpenGVLab/Hulk#model-performance) and the [original paper](https://huggingface.co/papers/2312.01697).

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{wang2023hulk,
  title={Hulk: A Universal Knowledge Translator for Human-Centric Tasks},
  author={Wang, Yizhou and Wu, Yixuan and Tang, Shixiang and He, Weizhen and Guo, Xun and Zhu, Feng and Bai, Lei and Zhao, Rui and Wu, Jian and He, Tong and others},
  journal={arXiv preprint arXiv:2312.01697},
  year={2023}
}
```