Zhang Hui committed
Commit ff8d687 · Parent(s): 83bb89e
update readme

README.md CHANGED
@@ -4,12 +4,15 @@ license: apache-2.0
 
 # CatVision
 
-
 ## Introduction
 
 CatVision is an open-source multimodal large model that closely emulates the functionality of the GPT4V/Qwen-VL-Plus models. Built on Qwen-72b-Chat, it handles inputs that interleave images and text, and it benefits from the strengths of Qwen-72b to follow output-format instructions effectively.
+
+一个开源多模态大模型,紧密模拟了GPT4V/Qwen-VL-PLUS系列模型的功能。该模型建立在Qwen-72b-Chat的基础上,可以处理包含交错的图文输入。该模型从Qwen72b的优势中受益,旨在有效地遵循输出格式指令。
 
+Our model performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the open-source Qwen-VL-7B-Chat.
+
+我们的模型在很多数据集上,接近闭源的Qwen-VL-PLUS的效果,并大幅超过开源模型Qwen-VL-7B-Chat的效果。
 
 Our training approach consisted of two stages, inspired by LLaVA-1.5. In the first stage we trained the visual encoder and the perceptual resampler; in the second stage we trained the large language model and the perceptual resampler on instruction data. To work within limited computational resources (32xA100-80G), we used LoRA in both stages.
 
@@ -68,7 +71,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-| **CatVision** |
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |
@@ -79,11 +82,11 @@ Our model achieved favorable results on many leaderboards.
 
 - **[CMMMU](https://github.com/CMMMU-Benchmark/CMMMU/blob/main/README.md)**
 
-| Model | Val (900) | Test (11K)
+| Model | Val (900) | Test (11K) |
 |--------------------------------|:---------:|:------------:|
-| GPT-4V(ision) (Playground) | 42.5 | 43.7
+| GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-| **CatVision** |
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |
@@ -104,7 +107,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-| **CatVision** |
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**
@@ -113,7 +116,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-| **CatVision** |
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
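
The two-stage recipe in the README trains low-rank adapters instead of the full weight matrices. A minimal sketch of the LoRA idea it relies on (a generic NumPy illustration, not CatVision's actual training code; all names here are hypothetical):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass of a LoRA-adapted linear layer.

    The frozen base weight W is augmented with a trainable
    low-rank update (alpha / r) * B @ A.
    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 8
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted layer equals the frozen layer, so training
# starts exactly from the pretrained model's behaviour.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Only A and B (r·(d_in + d_out) values per layer) receive gradients, which is what makes fitting a 72B-parameter base into 32xA100-80G plausible.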
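
The interleaved image-and-text input the README describes is commonly fed to Qwen-VL-style models by splicing fixed-length runs of image placeholder tokens into the text stream, which the resampler's outputs are later aligned to. A hedged sketch of that assembly step (the tag names, token count, and helper are illustrative assumptions, not CatVision's real API):

```python
# Hypothetical placeholder tags; Qwen-VL-style models reserve
# special tokens like these for image positions in the prompt.
IMG_START, IMG_END, IMG_PAD = "<img>", "</img>", "<imgpad>"

def build_interleaved_prompt(segments, tokens_per_image=256):
    """Flatten ("text", str) / ("image", path) pairs into one prompt.

    Each image is replaced by a fixed-length run of placeholder
    tokens, so text and image features share a single sequence.
    """
    parts = []
    for kind, value in segments:
        if kind == "text":
            parts.append(value)
        elif kind == "image":
            parts.append(IMG_START + IMG_PAD * tokens_per_image + IMG_END)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return "".join(parts)

prompt = build_interleaved_prompt([
    ("text", "Compare these two pictures: "),
    ("image", "cat1.png"),
    ("text", " and "),
    ("image", "cat2.png"),
])
```

The fixed `tokens_per_image` budget is what lets images of any resolution occupy a constant, predictable span of the language model's context.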