upload README.md
README.md
CHANGED
|
@@ -1,3 +1,154 @@
|
|
| 1 |
-
---
|
| 2 |
-
license:
|
| 3 |
-
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- FreedomIntelligence/ALLaVA-4V
|
| 5 |
+
language:
|
| 6 |
+
- en
|
| 7 |
+
pipeline_tag: text-generation
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
# ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
<p align="center">
|
| 16 |
+
⚡ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training LVLMs.⚡
|
| 17 |
+
</p>
|
| 18 |
+
|
| 19 |
+
<!-- <p align="center">
|
| 20 |
+
|
| 21 |
+
  
|
| 22 |
+
</p> -->
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
<p align="center">
|
| 27 |
+
<a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 👨🏻‍💻 <a href="https://github.com/FreedomIntelligence/ALLaVA" target="_blank">GitHub</a>
|
| 28 |
+
</p>
|
| 29 |
+
|
| 30 |
+
<p align="center">
|
| 31 |
+
🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a>
|
| 32 |
+
</p>
|
| 33 |
+
|
| 34 |
+
<p align="center">
|
| 35 |
+
🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k" target="_blank">ALLaVA-Phi3-mini-128k</a>
|
| 36 |
+
• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B" target="_blank">ALLaVA-StableLM2-1_6B</a>
|
| 37 |
+
• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B" target="_blank">ALLaVA-Phi2-2_7B</a>
|
| 38 |
+
</p>
|
| 39 |
+
|
| 40 |
+
<!-- <p align="center">
|
| 41 |
+
<a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer" target="_blank">ALLaVA-3B-Longer</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B" target="_blank">ALLaVA-3B</a>
|
| 42 |
+
<br> <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README_zh.md"> 中文</a> | <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README.md"> English</a>
|
| 43 |
+
</p> -->
|
| 44 |
+
|
| 45 |
+
## Benchmark Result
|
| 46 |
+
|
| 47 |
+
Our models [**ALLaVA-Phi3-mini-128k**](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k),
|
| 48 |
+
[**ALLaVA-StableLM2-1_6B**](https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B)
|
| 49 |
+
and [**ALLaVA-Phi2-2_7B**](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B)
|
| 50 |
+
achieve competitive results on 17 benchmarks.
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
| Models | Vicuna-80 | GQA | HallusionBench | MME-P | MMVP | TouchStone | TextVQA | MME-C | MathVista | MM-Vet | MMMU-val | SQA (img) | LLaVA (In-the-Wild) | MLLM-Bench | MMB-en | MMB-cn | SEEDBench (img, v1) |
|
| 54 |
+
|---------------------------|-----------|-----|-------|-------|------|----|---------|-------|----|--------|-----------------|---------|---------------|----|--------|--------|--------------------|
|
| 55 |
+
| **Large VLMs** | | | | | | | | | | | | | | | | | |
|
| 56 |
+
| BLIP-2 | - | - | - | - | - | - | - | - | - | 22.4 | 34.4 | - | - | 3.0*| - | - | 49.7 |
|
| 57 |
+
| InstructBLIP | - | 49.5| - | - | - | - | - | - | - | 25.6 | - | - | 58.2 | - | 44.0 | - | - |
|
| 58 |
+
| Qwen-VL-Chat | - | 57.5| - | 1487.6| - | - | 61.5 | 360.7 | - | 31.1 | - | 68.2 | - | - | 60.6 | 56.7 | 65.4 |
|
| 59 |
+
| LLaVA-1.5-7B | 13.8* | 62.0| 36.6* | 1504.4*| 24.7*| 594.9*| 58.2| 324.6*| 25.0*| 31.1| 35.1*| 66.8| 65.4| 23.0*| 64.3| 58.3| 66.1|
|
| 60 |
+
| LLaVA-1.5-13B | 22.5 | 63.3| 36.5* | 1531.3 | 38.0*| 617.7*| 61.3| 295.4| 28.3*| 35.4| 34.4*| 71.6| 72.5| -| 67.7| 63.6| 68.2|
|
| 61 |
+
| LVIS-7B | - | 62.6| - | - | - | - | 58.7 | - | - | 31.5 | - | - | 67.0 | 29.0*| 66.2 | - | - |
|
| 62 |
+
| LVIS-13B | - | 63.6*| - | - | - | - | 62.5* | - | - | 37.4* | - | - | 71.3* | - | 68.0* | - | - |
|
| 63 |
+
| ShareGPT4V-7B | 13.8* | 63.3| 36.0* | 1540.1*| 34.0*| 637.2*| 60.4| 346.1*| 24.7*| 37.6| 35.4*| 68.4*| 72.6| 30.2*| 68.8| 61.0*| 69.7|
|
| 64 |
+
| ShareGPT4V-13B | 17.5* | 64.8| 39.0* | 1576.1*| 35.3*| 648.7*| 62.2| 309.3*| 28.8*| 43.1| 35.6*| 70.0*| 79.9| 35.5*| 71.2| 61.7*| 70.8|
|
| 65 |
+
| **4B-scale Lite VLMs** | | | | | | | | | | | | | | | | | |
|
| 66 |
+
| MobileVLM-v2 | 5.0* | 61.1| 30.8* | 1440.5 | 18.7*| 541.0*| 57.5| 261.8*| 28.3*| 26.1*| 30.8*| 70.0| 53.2*| 15.7*| 63.2| 43.2*| 64.5*|
|
| 67 |
+
| Mipha-3B | 16.2* | **63.9**| 34.3*| **1488.9**| 32.0*| 619.0*| 56.6| 285.0*| 27.8*| 33.5*| 35.8*| 70.9| 64.7*| 23.1*| **69.7**| 42.9*| **71.2***|
|
| 68 |
+
| TinyLLaVA | 15.6* | 62.1| 37.2* | 1465.5*| 33.3*| 663.5*| **60.3**| 281.1*| 30.3*| 37.5| 38.4| **73.0**| 70.8*| 29.8*| **69.7***| 42.8*| 70.4*|
|
| 69 |
+
| **Ours** | | | | | | | | | | | | | | | | | |
|
| 70 |
+
| **ALLaVA-Phi2** | 49.4 | 48.8| 24.8 | 1316.2| **36.0**| 632.0| 49.5| 301.8| 27.4| 32.2| 35.3| 67.6| 69.4| 43.6| 64.0| 40.8| 65.2|
|
| 71 |
+
| **ALLaVA-StableLM2** | 38.8 | 49.8| 25.3 | 1311.7| 34.0 | 655.2| 51.7| 257.9| 27.7| 31.7| 33.3| 64.7| **72.0**| 39.3| 64.6| 49.8| 65.7|
|
| 72 |
+
| **ALLaVA-Phi3** | **56.9**| 52.2| **48.1**| 1382.3| 32.7| **667.8**| 53.0| **347.1**| **32.9**| **37.8**| **41.1**| 64.0| 68.5| **54.8**| 68.1| **55.3**| 69.0|
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
> \* denotes results from our own evaluation. **Bold numbers** are the best results among all 4B-scale LVLMs. Details of each benchmark are given in Table 4 of our [technical report](https://arxiv.org/pdf/2402.11684.pdf).
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
## Inference
|
| 80 |
+
|
| 81 |
+
All models can be loaded from 🤗 with `.from_pretrained()`.
|
| 82 |
+
Check out the [example scripts](https://github.com/FreedomIntelligence/ALLaVA/tree/main/allava/serve) and verify that your outputs match those shown in the scripts.
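
As a rough illustration of the loading step, here is a minimal Python sketch. It assumes the `transformers` library is installed and that the checkpoints load through `AutoModelForCausalLM` with `trust_remote_code=True`; both are assumptions rather than something this README states, and `load_allava` is a hypothetical helper name. The linked example scripts remain the authoritative reference for prompts and generation settings.

```python
# Repository IDs taken from the model links in this README.
MODEL_IDS = [
    "FreedomIntelligence/ALLaVA-Phi3-mini-128k",
    "FreedomIntelligence/ALLaVA-StableLM2-1_6B",
    "FreedomIntelligence/ALLaVA-Phi2-2_7B",
]


def load_allava(model_id: str = MODEL_IDS[0]):
    """Hypothetical helper: load one ALLaVA checkpoint and its tokenizer.

    Assumption: the checkpoint ships custom modeling code, so
    trust_remote_code=True is needed. Only enable it for repositories you trust.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer
```

Calling `load_allava()` downloads the weights on first use, so it needs network access and enough disk space for the chosen checkpoint.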
|
| 83 |
+
<!-- ### Load from 🤗 (Recommended)
|
| 84 |
+
See the [example script](https://github.com/FreedomIntelligence/ALLaVA/blob/main/allava/serve/huggingface_inference.py). -->
|
| 85 |
+
|
| 86 |
+
<!-- ### CLI
|
| 87 |
+
See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#cli) for CLI code snippet. -->
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
## 🏋️‍♂️ Training
|
| 92 |
+
|
| 93 |
+
### Data
|
| 94 |
+
<div align=center>
|
| 95 |
+
<img src="training_datasets_by_stage.jpg" width = "640" alt="training_datasets" align=center />
|
| 96 |
+
</div>
|
| 97 |
+
|
| 98 |
+
ALLaVA uses 1.0M samples for pretraining (PT) and 1.5M samples for fine-tuning (FT).
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
### Code
|
| 102 |
+
The training code is largely based on [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA).
|
| 103 |
+
We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.
|
| 104 |
+
|
| 105 |
+
<!-- ### Cost
|
| 106 |
+
We train our models on 8*A800 GPUs.
|
| 107 |
+
[ALLaVA-3B-Longer](https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer) takes 8.3h for PT and 21.3h for FT.
|
| 108 |
+
[ALLaVA-3B](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) takes 8.3h for PT and 10.6h for FT.
|
| 109 |
+
These two models share the same PT procedure. -->
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
### Hyperparameters
|
| 113 |
+
|
| 114 |
+
| Global Batch Size| ZeRO Stage| Optimizer | Max LR| Min LR | Scheduler | Weight decay |
|
| 115 |
+
| ---: | ---: |--:| ---: | ---: | ---: | ---: |
|
| 116 |
+
| 256 (PT) / 128 (FT) | 1| AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 0 |
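
To make the learning-rate columns concrete, the sketch below evaluates the cosine-annealing-with-warm-restarts formula in plain Python using the Max LR and Min LR from the table. The restart period is not given in this README, so the `period` argument is a placeholder assumption, as is a fixed period length between restarts; `cosine_warm_restart_lr` is a hypothetical name. In actual training this role would be played by a scheduler such as PyTorch's `CosineAnnealingWarmRestarts`.

```python
import math

MAX_LR = 2e-5  # "Max LR" from the table above
MIN_LR = 2e-6  # "Min LR" from the table above


def cosine_warm_restart_lr(step: int, period: int,
                           max_lr: float = MAX_LR,
                           min_lr: float = MIN_LR) -> float:
    """Learning rate at `step` under cosine annealing with warm restarts.

    Assumption: every cycle lasts `period` steps (no cycle-length growth).
    The rate decays from max_lr towards min_lr within a cycle and jumps
    back to max_lr at the start of the next one.
    """
    t_cur = step % period  # position inside the current cycle
    cosine = math.cos(math.pi * t_cur / period)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + cosine)
```

With a placeholder period of 100 steps, the rate starts at 2e-5, passes through the midpoint 1.1e-5 at step 50, approaches 2e-6 near step 99, and restarts at 2e-5 at step 100.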
|
| 117 |
+
|
| 118 |
+
The LM backbone and the projector are trainable, while the vision encoder is kept frozen.
|
| 119 |
+
**Each module's trainability is the same in both stages.**
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
## ALLaVA-4V Data
|
| 123 |
+
|
| 124 |
+
The majority of the training data comes from [ALLaVA-4V](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#data-preparation) for how to prepare it for training.
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
## Contributors
|
| 128 |
+
|
| 129 |
+
- Project Leader: [Guiming Hardy Chen](https://g-h-chen.github.io/)
|
| 130 |
+
|
| 131 |
+
- Data: Shunian Chen, [Junying Chen](https://jymchen.github.io/), Xiangbo Wu
|
| 132 |
+
|
| 133 |
+
- Evaluation: [Ruifei Zhang](https://scholar.google.com/citations?user=W4zOhmEAAAAJ&hl=zh-CN)
|
| 134 |
+
|
| 135 |
+
- Deployment: Xiangbo Wu, Zhiyi Zhang
|
| 136 |
+
|
| 137 |
+
- Advising: [Zhihong Chen](https://zhjohnchan.github.io/), [Benyou Wang](https://wabyking.github.io/old.html)
|
| 138 |
+
|
| 139 |
+
- Others: Jianquan Li, [Xiang Wan](https://scholar.google.com/citations?user=e3_kWigAAAAJ&hl=zh-CN)
|
| 140 |
+
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
## Citation
|
| 146 |
+
If you find our data useful, please consider citing our work! We are FreedomIntelligence from the [Shenzhen Research Institute of Big Data](http://sribd.cn/en) and [The Chinese University of Hong Kong, Shenzhen](https://sds.cuhk.edu.cn/en).
|
| 147 |
+
```
|
| 148 |
+
@article{chen2024allava,
|
| 149 |
+
title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model},
|
| 150 |
+
author={Chen, Guiming Hardy and Chen, Shunian and Zhang, Ruifei and Chen, Junying and Wu, Xiangbo and Zhang, Zhiyi and Chen, Zhihong and Li, Jianquan and Wan, Xiang and Wang, Benyou},
|
| 151 |
+
journal={arXiv preprint arXiv:2402.11684},
|
| 152 |
+
year={2024}
|
| 153 |
+
}
|
| 154 |
+
```
|