Zhang Hui committed
Commit ff8d687 · Parent(s): 83bb89e
update readme

README.md CHANGED
@@ -4,12 +4,15 @@ license: apache-2.0
 
 # CatVision
 
-
 ## Introduction
 
 CatVision is an open-source multimodal large model that closely emulates the functionality of the GPT4V/Qwen-VL-Plus models. Built on Qwen-72b-Chat, it handles inputs that interleave images and text, and it benefits from the strengths of Qwen-72b to follow output-format instructions effectively.
+
+一个开源多模态大模型,紧密模拟了GPT4V/Qwen-VL-PLUS系列模型的功能。该模型建立在Qwen-72b-Chat的基础上,可以处理包含交错的图文输入。该模型从Qwen72b的优势中受益,旨在有效地遵循输出格式指令。
 
+Our model performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the open-source Qwen-VL-7B-Chat.
+
+我们的模型在很多数据集上,接近闭源的Qwen-VL-PLUS的效果,并大幅超过开源模型Qwen-VL-7B-Chat的效果。
 
 Our training approach consisted of two stages, inspired by LLaVA-1.5. In the first stage we trained the visual encoder and the perceptual resampler; in the second stage we trained the large language model and the perceptual resampler on instruction data. To work within limited computational resources (32xA100-80G), we used LoRA in both stages.
 
@@ -68,7 +71,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-| **CatVision** |
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |
@@ -79,11 +82,11 @@ Our model achieved favorable results on many leaderboards.
 
 - **[CMMMU](https://github.com/CMMMU-Benchmark/CMMMU/blob/main/README.md)**
 
-| Model | Val (900) | Test (11K)
+| Model | Val (900) | Test (11K) |
 |--------------------------------|:---------:|:------------:|
-| GPT-4V(ision) (Playground) | 42.5 | 43.7
+| GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-| **CatVision** |
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |
@@ -104,7 +107,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-| **CatVision** |
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**
@@ -113,7 +116,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-| **CatVision** |
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
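
The two-stage recipe in the README trains low-rank adapters instead of the full weight matrices. A minimal sketch of the LoRA idea it relies on (a generic NumPy illustration, not CatVision's actual training code; all names here are hypothetical):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass of a LoRA-adapted linear layer.

    The frozen base weight W is augmented with a trainable
    low-rank update (alpha / r) * B @ A.
    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 8
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted layer equals the frozen layer, so training
# starts exactly from the pretrained model's behaviour.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Only A and B (r·(d_in + d_out) values per layer) receive gradients, which is what makes fitting a 72B-parameter base into 32xA100-80G plausible.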
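
The interleaved image-and-text input the README describes is commonly fed to Qwen-VL-style models by splicing fixed-length runs of image placeholder tokens into the text stream, which the resampler's outputs are later aligned to. A hedged sketch of that assembly step (the tag names, token count, and helper are illustrative assumptions, not CatVision's real API):

```python
# Hypothetical placeholder tags; Qwen-VL-style models reserve
# special tokens like these for image positions in the prompt.
IMG_START, IMG_END, IMG_PAD = "<img>", "</img>", "<imgpad>"

def build_interleaved_prompt(segments, tokens_per_image=256):
    """Flatten ("text", str) / ("image", path) pairs into one prompt.

    Each image is replaced by a fixed-length run of placeholder
    tokens, so text and image features share a single sequence.
    """
    parts = []
    for kind, value in segments:
        if kind == "text":
            parts.append(value)
        elif kind == "image":
            parts.append(IMG_START + IMG_PAD * tokens_per_image + IMG_END)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return "".join(parts)

prompt = build_interleaved_prompt([
    ("text", "Compare these two pictures: "),
    ("image", "cat1.png"),
    ("text", " and "),
    ("image", "cat2.png"),
])
```

The fixed `tokens_per_image` budget is what lets images of any resolution occupy a constant, predictable span of the language model's context.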