Update README.md
README.md
CHANGED
|
@@ -24,16 +24,17 @@ pipeline_tag: visual-question-answering
|
|
| 24 |
|
| 25 |
Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of various downstream tasks in visual-language understanding. However, models trained for the general domain often do not perform well in specialized fields. In this study, we train a domain-specific Large Language-Vision model, Human-LLaVA, which aims to provide a unified multimodal Language-Vision Model for human-related tasks.
|
| 26 |
|
| 27 |
-
Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
|
| 28 |
|
| 29 |
|
| 30 |
## Result
|
| 31 |
|
| 32 |
## News and Update 🔥🔥🔥
|
| 33 |
-
* 2024.
|
| 34 |
|
| 35 |
|
| 36 |
-
##
|
|
|
|
| 37 |
``` python
|
| 38 |
import requests
|
| 39 |
from PIL import Image
|
|
@@ -67,6 +68,7 @@ print(predict)
|
|
| 67 |
HumanCaption-10M(self construct): Coming Soon!
|
| 68 |
|
| 69 |
#### Instruction Tuning Stage
|
|
|
|
| 70 |
|
| 71 |
HumanCaptionHQ-300K(self construct): Coming Soon!
|
| 72 |
|
|
@@ -76,13 +78,12 @@ humanvg_high_reg(self construct):Coming Soon!
|
|
| 76 |
|
| 77 |
humanvg_high_rec(self construct): Coming Soon!
|
| 78 |
|
| 79 |
-
celeba_attribute(self construct):
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
LLaVA-Instruct_zh :
|
| 84 |
|
| 85 |
-
ShareGPT4V_vqa:
|
| 86 |
|
| 87 |
verified_ref3rec:
|
| 88 |
|
|
|
|
| 24 |
|
| 25 |
Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of various downstream tasks in visual-language understanding. However, models trained for the general domain often do not perform well in specialized fields. In this study, we train a domain-specific Large Language-Vision model, Human-LLaVA, which aims to provide a unified multimodal Language-Vision Model for human-related tasks.
|
| 26 |
|
| 27 |
+
Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct **multi-granularity captions for human-related images** (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
|
| 28 |
|
| 29 |
|
| 30 |
## Result
|
| 31 |
|
| 32 |
## News and Update 🔥🔥🔥
|
| 33 |
+
* Sep. 8, 2024. **🤗 [Human-LLaVA-8B](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!**
|
| 34 |
|
| 35 |
|
| 36 |
+
## 🤗 Transformers
|
| 37 |
+
To run inference with Human-LLaVA, you only need a few lines of code, as demonstrated below. Please make sure that you are using the latest code.
|
| 38 |
``` python
|
| 39 |
import requests
|
| 40 |
from PIL import Image
|
|
|
|
| 68 |
HumanCaption-10M(self construct): Coming Soon!
|
| 69 |
|
| 70 |
#### Instruction Tuning Stage
|
| 71 |
+
All public datasets have been filtered, and we will consider publishing all of the processed text in the future.
|
| 72 |
|
| 73 |
HumanCaptionHQ-300K(self construct): Coming Soon!
|
| 74 |
|
|
|
|
| 78 |
|
| 79 |
humanvg_high_rec(self construct): Coming Soon!
|
| 80 |
|
| 81 |
+
celeba_attribute(self construct): [CelebA](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
|
| 82 |
|
| 83 |
+
ShareGPT4V: [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md)
|
| 84 |
|
| 85 |
LLaVA-Instruct_zh :
|
| 86 |
|
|
|
|
| 87 |
|
| 88 |
verified_ref3rec:
|
| 89 |
|