Zhang Hui

update readme

ff8d687 about 2 years ago

7.18 kB

	---
	license: apache-2.0
	---

	# CatVision

	## Introduction

	A multimodal large-scale model, characterized by its open-source nature, closely emulates the functionalities of the GPT4V/Qwen-VL-Plus model. Built upon the foundation of Qwen-72b-Chat, CatVision in handling inputs that combine both images and text. This model is designed to effectively follow instructions for output formats, benefiting from the strengths of Qwen72b.

	一个开源多模态大模型，紧密模拟了GPT4V/Qwen-VL-PLUS系列模型的功能。该模型建立在Qwen-72b-Chat的基础上，可以处理包含交错的图文输入。该模型从Qwen72b的优势中受益，旨在有效地遵循输出格式指令。

	Our model performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the performance of the open-source model Qwen-VL-7B-Chat.

	我们的模型在很多数据集上，接近闭源的Qwen-VL-PLUS的效果，并大幅超过开源模型Qwen-VL-7B-Chat的效果。

	- Our training approach consisted of two stages, inspired by LLava1.5. In the initial stage, we trained the visual encoder + perceptual resampler, and in the second stage, we focused on training the large language model + perceptual resampler with instructional data. To overcome limited computational resources (32xA100-80G), we used Lora for training in both stages.

	受LLava1.5启发，我们的训练分为两个阶段：在初始阶段，我们训练了视觉编码器+感知重采样器；在第二阶段，我们专注于使用视觉指令数据训练大型语言模型+感知重采样器。为了克服有限的计算资源(32xA100-80G)，我们在两个阶段都使用了Lora进行培训。

	- During the first stage, our training data included samples from ShareGPT4V and CC12M. As we progressed to the second stage, our training dataset encompassed ShareGPT4V fine-tune data, LVIS Instruct4V, OCR data, InforGraphics/Chart QA data, and data sourced from region descriptions in VG.

	在第一阶段，我们的训练数据包括来自ShareGPT4V和CC12M的样本。第二阶段，我们的训练数据集包括ShareGPT4V微调数据、LVIS Instruct4V、OCR数据、信息图表问答数据以及从VG区域描述中获取的数据。

	- The visual encoding part is inherited from Qwen-VL-Chat, i.e., Openclip ViT-bigG.

	视觉编码部分继承自Qwen-VL-Chat，即Openclip ViT-bigG。

	- We are continuously collecting instruction data, optimizing the model, and looking forward to supporting more tasks.

	我们正在持续收集指令数据，优化模型，期待能支持更多的功能。

	## Quick Start

	```
	from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
	tokenizer = AutoTokenizer.from_pretrained(
	pretrained_model_name_or_path="huizhang0110/CatVision",
	model_max_length=8192,
	padding_side="left",
	trust_remote_code=True
	)
	config = AutoConfig.from_pretrained(
	pretrained_model_name_or_path="huizhang0110/CatVision",
	trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	pretrained_model_name_or_path="huizhang0110/CatVision",
	config=config,
	device_map="auto",
	trust_remote_code=True
	).eval()
	query = "<img>demo.jpg</img>\n介绍一下这张图像！"
	response, history = model.chat(
	tokenizer,
	query=query,
	history=None,
	)
	```

	## Benchmark

	Our model achieved favorable results on the many leaderboards.

	- [MMMU](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377)

	\| Model \| Val (900) \| Test (11K) \|
	\|--------------------------------\|:---------:\|:------------:\|
	\| Gemini Ultra \| 59.4 \| ---- \|
	\| GPT4V \| 56.8 \| 55.7 \|
	\| Gemini Pro \| 47.9 \| ---- \|
	\| Yi-VL-34B \| 45.9 \| 41.6 \|
	\| Qwen-VL-PLUS \| 45.2 \| 40.8 \|
	\| CatVision \| 45.9 \| 40.1 \|
	\| Macro-VL \| 41.2 \| 40.4 \|
	\| InfiMM-Zephyr-7B \| 39.4 \| 35.5 \|
	\| Yi-VL-6B \| 39.1 \| 37.8 \|
	\| SVIT \| 38.0 \| 34.1 \|
	\| LLaVA-1.5-13B \| 36.4 \| 33.6 \|
	\| Emu2-Chat \| 36.3 \| 34.1 \|
	\| Qwen-VL-7B-Chat \| 35.9 \| 32.9 \|

	- [CMMMU](https://github.com/CMMMU-Benchmark/CMMMU/blob/main/README.md)

	\| Model \| Val (900) \| Test (11K) \|
	\|--------------------------------\|:---------:\|:------------:\|
	\| GPT-4V(ision) (Playground) \| 42.5 \| 43.7 \|
	\| Qwen-VL-PLUS* \| 39.5 \| 36.8 \|
	\| CatVision \| 39.6 \| ---- \|
	\| Yi-VL-34B \| 36.2 \| 36.5 \|
	\| Yi-VL-6B \| 35.8 \| 35.0 \|
	\| Qwen-VL-7B-Chat \| 30.7 \| 31.3 \|
	\| InternVL-Chat-ViT-6B-Vicuna-7B \| 26.4 \| 26.7 \|
	\| InternVL-Chat-ViT-6B-Vicuna-13B\| 27.4 \| 26.1 \|
	\| CogAgent-Chat \| 24.6 \| 23.6 \|
	\| Emu2-Chat \| 23.8 \| 24.5 \|
	\| Chinese-LLaVA \| 25.5 \| 23.4 \|
	\| VisCPM \| 25.2 \| 22.7 \|
	\| mPLUG-OWL2 \| 20.8 \| 22.2 \|
	\| Frequent Choice \| 24.1 \| 26.0 \|
	\| Random Choice \| 21.6 \| 21.6 \|

	- [MMBench](https://mmbench.opencompass.org.cn/leaderboard)

	\| Model \| mmbench_cn (test) \| mmbench_cn (dev) \| mmbench_en (test) \| mmbench_zh (dev) \| ccbench \|
	\|---------------------\|:-----------------:\|:----------------:\|:-----------------:\|:----------------:\|:-------:\|
	\| Qwen-VL-PLUS(BASE) \| 83.3 \| 83.2 \| 82.7 \| 81.5 \| 77.6 \|
	\| GPT4v \| 77.0 \| 75.1 \| 74.4 \| 75.0 \| 46.5 \|
	\| Qwen-VL-PLUS \| 67.0 \| 66.2 \| 70.7 \| 69.6 \| 55.1 \|
	\| CatVision \| 70.9 \| 71.8 \| 70.2 \| 71.6 \| 49.8 \|
	\| Qwen-VL-Chat \| 61.8 \| 60.6 \| 56.3 \| 56.7 \| 41.2 \|

	- [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)

	\| Model \| Perception \| Cognition \|
	\|---------------\|:----------:\|:---------:\|
	\| GPT4v \| 1409.43 \| 517.14 \|
	\| Qwen-VL-PLUS \| 1681.25 \| 502.14 \|
	\| CatVision \| 1560.90 \| 366.43 \|
	\| Qwen-VL-Chat \| 1487.57 \| 360.71 \|

	- Open Compress

	wait

	- Show Case

	图像描述

	![图像描述](image.png)

	信息图表

	![图像问答](image-1.png)

	区域理解

	![区域理解](image-2.png)

	## Citation

	```
	@misc{CatVision,
	author = {zhanghui@4paradigm.com},
	title = {CatVision},
	year = {2024},
	publisher = {huggingface},
	howpublished = {\url{https://huggingface.co/huizhang0110/CatVision}}
	}
	```

	![logo](./catvision.png)