Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +128 -117
README.md CHANGED
@@ -1,117 +1,128 @@
- ---
- license: apache-2.0
- language:
- - en
- - zh
- tags:
- - multimodal
- library_name: transformers
- datasets:
- - BAAI/Infinity-MM
- - BAAI/Infinity-Instruct
- - BAAI/Infinity-Preference
- base_model:
- - Qwen/Qwen2.5-1.5B-Instruct
- - google/siglip-so400m-patch14-384
- pipeline_tag: image-text-to-text
- ---
-
- # Introduction
-
- The [**Aquila-VL-2B**](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model is a vision-language model (VLM) trained with open-sourced dataset [**Infinity-MM**](https://huggingface.co/datasets/BAAI/Infinity-MM).
-
- This repository is used to release intermediate checkpoints obtained during different stages of training. Please feel free to use these models for analysis and experimentation.
-
- # Evaluation
-
- We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) tool. Whenever possible, we prioritized using the OpenAI API for test sets that support API-based evaluation.
-
-
- | benchmark | 2-a | 2-b | 2-c | 3 | [4 (final_model)](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) |
- | :--------------------------: | :---: | ----- | :---: | :---: | :---: |
- | MMMU<sub>val</sub> | 42.89 | 42.44 | 44.78 | 46.22 | 47.4 |
- | MMStar | 45.80 | 49.33 | 51.73 | 53.73 | 54.9 |
- | MMBench_V1.1<sub>test</sub> | 65.41 | 67.53 | 68.03 | 73.40 | 75.2 |
- | MathVista<sub>testmini</sub> | 48.60 | 52.40 | 54.30 | 60.10 | 59.0 |
- | HallusionBench | 37.53 | 39.65 | 38.23 | 40.21 | 43.0 |
- | OCRBench | 57.50 | 58.90 | 62.50 | 76.70 | 77.2 |
- | AI2D<sub>test</sub> | 64.31 | 66.74 | 68.13 | 75.55 | 75.0 |
- | MMVet | 36.24 | 36.97 | 39.68 | 38.35 | 44.3 |
- | Average | 49.78 | 51.75 | 53.42 | 58.03 | 59.51 |
-
-
- # How to use
-
- ```python
- # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
- from llava.model.builder import load_pretrained_model
- from llava.mm_utils import process_images, tokenizer_image_token
- from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
- from llava.conversation import conv_templates
- from PIL import Image
- import requests
- import copy
- import torch
- import warnings
-
- warnings.filterwarnings("ignore")
-
- pretrained = "BAAI/Aquila-VL-2B-llava-qwen"
-
- model_name = "llava_qwen"
- device = "cuda"
- device_map = "auto"
- tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
-
- model.eval()
-
- # load image from url
- url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
- image = Image.open(requests.get(url, stream=True).raw)
-
- # load image from local environment
- # url = "./local_image.jpg"
- # image = Image.open(url)
-
- image_tensor = process_images([image], image_processor, model.config)
- image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
-
- conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
- question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
- conv = copy.deepcopy(conv_templates[conv_template])
- conv.append_message(conv.roles[0], question)
- conv.append_message(conv.roles[1], None)
- prompt_question = conv.get_prompt()
-
- input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
- image_sizes = [image.size]
-
- cont = model.generate(
-     input_ids,
-     images=image_tensor,
-     image_sizes=image_sizes,
-     do_sample=False,
-     temperature=0,
-     max_new_tokens=4096,
- )
-
- text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
-
- print(text_outputs)
- ```
-
-
- ## **Citation**
- If you find this useful, please cite the following work
- ```
- @misc{gu2024infinitymmscalingmultimodalperformance,
-       title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
-       author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
-       year={2024},
-       eprint={2410.18558},
-       archivePrefix={arXiv},
-       primaryClass={cs.CL},
-       url={https://arxiv.org/abs/2410.18558},
- }
- ```
-
+ ---
+ license: apache-2.0
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ tags:
+ - multimodal
+ library_name: transformers
+ datasets:
+ - BAAI/Infinity-MM
+ - BAAI/Infinity-Instruct
+ - BAAI/Infinity-Preference
+ base_model:
+ - Qwen/Qwen2.5-1.5B-Instruct
+ - google/siglip-so400m-patch14-384
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Introduction
+
+ The [**Aquila-VL-2B**](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model is a vision-language model (VLM) trained with open-sourced dataset [**Infinity-MM**](https://huggingface.co/datasets/BAAI/Infinity-MM).
+
+ This repository is used to release intermediate checkpoints obtained during different stages of training. Please feel free to use these models for analysis and experimentation.
+
+ # Evaluation
+
+ We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) tool. Whenever possible, we prioritized using the OpenAI API for test sets that support API-based evaluation.
+
+
+ | benchmark | 2-a | 2-b | 2-c | 3 | [4 (final_model)](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) |
+ | :--------------------------: | :---: | ----- | :---: | :---: | :---: |
+ | MMMU<sub>val</sub> | 42.89 | 42.44 | 44.78 | 46.22 | 47.4 |
+ | MMStar | 45.80 | 49.33 | 51.73 | 53.73 | 54.9 |
+ | MMBench_V1.1<sub>test</sub> | 65.41 | 67.53 | 68.03 | 73.40 | 75.2 |
+ | MathVista<sub>testmini</sub> | 48.60 | 52.40 | 54.30 | 60.10 | 59.0 |
+ | HallusionBench | 37.53 | 39.65 | 38.23 | 40.21 | 43.0 |
+ | OCRBench | 57.50 | 58.90 | 62.50 | 76.70 | 77.2 |
+ | AI2D<sub>test</sub> | 64.31 | 66.74 | 68.13 | 75.55 | 75.0 |
+ | MMVet | 36.24 | 36.97 | 39.68 | 38.35 | 44.3 |
+ | Average | 49.78 | 51.75 | 53.42 | 58.03 | 59.51 |
+
+
+ # How to use
+
+ ```python
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
+ from llava.model.builder import load_pretrained_model
+ from llava.mm_utils import process_images, tokenizer_image_token
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+ from llava.conversation import conv_templates
+ from PIL import Image
+ import requests
+ import copy
+ import torch
+ import warnings
+
+ warnings.filterwarnings("ignore")
+
+ pretrained = "BAAI/Aquila-VL-2B-llava-qwen"
+
+ model_name = "llava_qwen"
+ device = "cuda"
+ device_map = "auto"
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
+
+ model.eval()
+
+ # load image from url
+ url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # load image from local environment
+ # url = "./local_image.jpg"
+ # image = Image.open(url)
+
+ image_tensor = process_images([image], image_processor, model.config)
+ image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
+
+ conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
+ question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
+ conv = copy.deepcopy(conv_templates[conv_template])
+ conv.append_message(conv.roles[0], question)
+ conv.append_message(conv.roles[1], None)
+ prompt_question = conv.get_prompt()
+
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
+ image_sizes = [image.size]
+
+ cont = model.generate(
+     input_ids,
+     images=image_tensor,
+     image_sizes=image_sizes,
+     do_sample=False,
+     temperature=0,
+     max_new_tokens=4096,
+ )
+
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
+
+ print(text_outputs)
+ ```
+
+
+ ## **Citation**
+ If you find this useful, please cite the following work
+ ```
+ @misc{gu2024infinitymmscalingmultimodalperformance,
+       title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
+       author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
+       year={2024},
+       eprint={2410.18558},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2410.18558},
+ }
+ ```
+
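
The only substantive change in this diff is the `language:` list, which moves from two-letter codes (`en`, `zh`) to three-letter codes (`zho`, `eng`, and eleven more). As a rough illustration of how the old tags relate to the new ones, here is a minimal sketch that maps ISO 639-1 codes to their ISO 639-3 equivalents. The lookup table and the helper name `to_iso639_3` are hypothetical and not part of this PR; the table covers only the thirteen languages listed in the new metadata.

```python
# Hypothetical lookup table (not part of the PR): ISO 639-1 two-letter codes
# mapped to the ISO 639-3 three-letter codes that appear in the new metadata.
ISO_639_1_TO_3 = {
    "zh": "zho", "en": "eng", "fr": "fra", "es": "spa", "pt": "por",
    "de": "deu", "it": "ita", "ru": "rus", "ja": "jpn", "ko": "kor",
    "vi": "vie", "th": "tha", "ar": "ara",
}


def to_iso639_3(codes):
    """Return three-letter codes; codes missing from the table pass through unchanged."""
    return [ISO_639_1_TO_3.get(code, code) for code in codes]


# The two tags from the old README front matter.
old_tags = ["en", "zh"]
print(to_iso639_3(old_tags))
```

Passing unknown codes through unchanged keeps the sketch from silently dropping a tag it does not recognize; a stricter variant could raise an error instead.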