| --- |
| license: apache-2.0 |
| datasets: |
| - laion/laion400m |
| - kakaobrain/coyo-700m |
| pipeline_tag: feature-extraction |
| tags: |
| - Vision |
| - LLaVA |
| --- |
| |
|
|
|
|
|
|
| [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom) |
| ## Model |
| We use the same Vision Transformer architecture as CLIP, [ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336). |
|
|
|  |
|
|
|
|
| ## Data |
| Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets. |
|
|
| ## Performance and Limitations |
|
|
| ### A. MLLMs Evaluation Results |
| To evaluate MLCD within Multimodal Large Language Models (MLLMs), we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with MLCD and used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. MLCD scores higher than CLIP on 15 of the 17 benchmarks in the following table; CLIP is ahead on POPE and TextVQA_val. The better score in each row is shown in red. |
|
|
| | Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) | |
| |:----------------|:----------------------|:----------------------| |
| | LLM | Qwen2.5-7B | Qwen2.5-7B | |
| | AI2D | <span style="color:red">76.98</span> | 73.15 | |
| | ScienceQA_img | <span style="color:red">78.09</span> | 76.35 | |
| | GQA | <span style="color:red">64.17</span> | 63.31 | |
| | InfoVQA_val | <span style="color:red">43.48</span> | 38.88 | |
| | MMBench_cn_dev | <span style="color:red">74.83</span> | 72.51 | |
| | MMBench_en_dev | <span style="color:red">76.37</span> | 74.57 | |
| | MME(cognition) | <span style="color:red">432</span> | 384 | |
| | MME(perception) | <span style="color:red">1598</span> | 1512 | |
| | SeedBench | <span style="color:red">68.20</span> | 66.80 | |
| | SeedBench_img | <span style="color:red">73.75</span> | 72.72 | |
| | MMStar | <span style="color:red">50.98</span> | 48.98 | |
| | MMMU | <span style="color:red">44.30</span> | 44.20 | |
| | OCRBench | <span style="color:red">531.00</span> | 525.00 | |
| | ChartQA | <span style="color:red">67.84</span> | 66.52 | |
| | DocVQA_val | <span style="color:red">76.46</span> | 75.21 | |
| | POPE | 88.69 | <span style="color:red">88.83</span> | |
| | TextVQA_val | 61.69 | <span style="color:red">62.47</span> | |
| |
| |
| |
| |
| ### B. Linear Probe Evaluation Results |
| The following table reports linear probe results for MLCD and CLIP, both using the ViT_L_14_336px architecture, across 25 datasets. A linear probe freezes the pre-trained model's weights and trains a linear classifier on top, which measures how well the model's representations generalize to different tasks. MLCD scores higher on 19 of the 25 datasets and on the average (87.15 vs. 85.35). The better score in each row is shown in red. |
|
|
| | Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) | |
| |:---------------|:----------------------|:----------------------| |
| | AVG | <span style="color:red">87.15</span> | 85.35 | |
| | Food101 | <span style="color:red">96.21</span> | 95.90 | |
| | CIFAR-10 | <span style="color:red">99.36</span> | 97.90 | |
| | CIFAR-100 | <span style="color:red">93.69</span> | 87.40 | |
| | Birdsnap | <span style="color:red">88.18</span> | 79.90 | |
| | SUN397 | <span style="color:red">87.96</span> | 82.20 | |
| | Stanford Cars | <span style="color:red">95.16</span> | 91.50 | |
| | FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 | |
| | Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 | |
| | Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 | |
| | Caltech-101 | <span style="color:red">97.92</span> | 96.00 | |
| | Flowers102 | <span style="color:red">99.58</span> | 99.20 | |
| | MNIST | 98.67 | <span style="color:red">99.20</span> | |
| | STL-10 | 99.28 | <span style="color:red">99.70</span> | |
| | EuroSAT | <span style="color:red">99.06</span> | 98.10 | |
| | RESISC45 | <span style="color:red">95.48</span> | 94.90 | |
| | GTSRB | 92.32 | <span style="color:red">92.40</span> | |
| | KITTI | <span style="color:red">75.39</span> | 69.20 | |
| | Country211 | 38.12 | <span style="color:red">46.40</span> | |
| | PatchCamelyon | <span style="color:red">88.00</span> | 85.60 | |
| | UCF101 | <span style="color:red">92.86</span> | 92.00 | |
| | Kinetics-700 | <span style="color:red">73.35</span> | 73.00 | |
| | CLEVR | <span style="color:red">64.40</span> | 60.30 | |
| | Hateful Memes | 72.00 | <span style="color:red">77.30</span> | |
| | SST-2 | 76.33 | <span style="color:red">80.50</span> | |
| | ImageNet | <span style="color:red">86.30</span> | 85.40 | |
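
The linear probe protocol described in this section can be sketched in a few lines. This is a minimal, self-contained illustration and not the evaluation code behind the table: `frozen_features` is a hypothetical stand-in for the frozen ViT encoder, the two-class data is synthetic, and logistic regression trained with batch gradient descent is assumed as the linear classifier.

```python
import math
import random

# Hypothetical stand-in for a frozen pre-trained encoder: a fixed projection
# followed by tanh. These weights are never updated during probing.
FROZEN_WEIGHTS = (
    (1.0, -0.5),
    (0.5, 1.0),
    (-1.0, 0.3),
)


def frozen_features(x):
    """Map a 2-D input to a 3-D feature vector using the frozen weights."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def make_data(n_per_class, rng):
    """Synthetic two-class data: Gaussian blobs centred at (2, 2) and (-2, -2)."""
    data = []
    for label, centre in ((1, 2.0), (0, -2.0)):
        for _ in range(n_per_class):
            x = (rng.gauss(centre, 0.5), rng.gauss(centre, 0.5))
            data.append((x, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(data, epochs=100, lr=0.5):
    """Fit logistic regression on frozen features with batch gradient descent."""
    # Features are computed once because the encoder does not change.
    feats = [(frozen_features(x), y) for x, y in data]
    dim = len(feats[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in feats:
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        # Only the classifier parameters (w, b) are updated.
        w = [wi - lr * g / len(feats) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(feats)
    return w, b


def accuracy(data, w, b):
    """Fraction of samples whose predicted class matches the label."""
    correct = 0
    for x, y in data:
        score = sum(wi * fi for wi, fi in zip(w, frozen_features(x))) + b
        correct += int((score > 0) == (y == 1))
    return correct / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    train, test = make_data(100, rng), make_data(50, rng)
    w, b = train_linear_probe(train)
    print(f"linear probe accuracy: {accuracy(test, w, b):.2%}")
```

In a real evaluation the features would come from the pre-trained vision encoder, and the classifier would be fit per dataset; the key property shown here is that only the classifier parameters are trained.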
|
|
|
|
| ### C. Limitations |
|
|
| Higher-resolution models generally perform better on OCR tasks, so the current 336px model is limited in this respect. We are currently training higher-resolution models and will release them soon. |
|
|
|
|
| ## Acknowledgments |
|
|
| We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs. |