---

# Human-LLaVA (HumanCaption-10M dataset)

## DEMO

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>

## Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance performance on a variety of downstream visual-language understanding tasks. However, general-domain models often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to be a unified multimodal vision-language model for human-related tasks.

Specifically, (1) we first construct a large-scale, high-quality human-related image-caption dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); and (2) we construct multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which we use to fine-tune the large language model. Finally, we evaluate our model on a series of downstream tasks: Human-LLaVA achieves the best overall performance among multimodal models of similar scale, and in particular it performs best on human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.

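To make the multi-granularity captions concrete, here is a minimal sketch of what one record could look like; the layout and field names (`image_url`, `captions`, and the `face`/`body`/`image` keys) are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical multi-granularity caption record; the schema below is an
# illustrative assumption, not the actual HumanCaption-10M format.
record = {
    "image_url": "https://example.com/person.jpg",  # placeholder
    "captions": {
        "face": "A middle-aged man with short gray hair and glasses.",
        "body": "A man in a navy suit standing with his arms crossed.",
        "image": "A man in a navy suit posing in front of an office building.",
    },
}

# Each granularity pairs the same image with a caption at a different scope,
# yielding several image-text training samples per image.
for level, caption in record["captions"].items():
    print(f"[{level}] {caption}")
```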
## Result
## How to Use
```python
import requests