Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +187 -176
README.md CHANGED
@@ -1,177 +1,188 @@
- ---
- license: apache-2.0
- datasets:
- - lmms-lab/LLaVA-OneVision-Data
- - BAAI/Infinity-MM
- language:
- - en
- - zh
- base_model:
- - google/siglip2-so400m-patch14-384
- - Qwen/Qwen2.5-3B-Instruct
- pipeline_tag: image-text-to-text
- library_name: transformers
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - lmms-lab/LLaVA-OneVision-Data
+ - BAAI/Infinity-MM
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - google/siglip2-so400m-patch14-384
+ - Qwen/Qwen2.5-3B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ ## Introduction
+
+ We are excited to introduce **Ristretto**, our newest vision-language model (VLM), which represents a significant step forward in the field. Ristretto supports dynamic image tokens: the number of image tokens can be adjusted flexibly to match task requirements, and the projector architecture has been enhanced to handle these dynamic token configurations. Through its refined architecture and advanced training approach, the new model delivers improved performance and versatility compared to its predecessors.
+
+ **Key Innovations**
+
+ Coming soon...
+
+ ### Environment Setup
+
+ ```bash
+ pip install "torch>=2.3.0"
+ pip install transformers==4.37.0
+ ```
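+
+ If you want to confirm the pinned versions before running the example below, a minimal sanity check (adapt it to your setup) is:
+
+ ```python
+ # Verify the environment assumed by the usage example: the pinned packages and a CUDA device.
+ import torch
+ import transformers
+
+ print(torch.__version__)          # expected: 2.3.0 or newer
+ print(transformers.__version__)   # expected: 4.37.0
+ print(torch.cuda.is_available())  # the example below calls .cuda(), so this should be True
+ ```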
+
+
+ ### How to use?
+
+ ```python
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+ import requests
+ from io import BytesIO
+
+ IMAGENET_MEAN = (0.5, 0.5, 0.5)
+ IMAGENET_STD = (0.5, 0.5, 0.5)
+
+ def build_transform(input_size):
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+ def dynamic_preprocess(image, min_num=1, max_num=10, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # calculate the existing image aspect ratio
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the closest aspect ratio to the target
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+ def load_image(image_data, input_size=384, max_num=10):
+     image = Image.open(image_data).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
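+
+ # Note: dynamic_preprocess picks the tile grid whose aspect ratio best matches the input,
+ # e.g. a 1536x384 image maps to a 4x1 grid (plus a thumbnail tile when use_thumbnail=True),
+ # while a 384x384 image stays a single tile.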
+
+ model_path = 'LiAutoAD/Ristretto-3B'
+ model = AutoModel.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
+
+ image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
+ response = requests.get(image_url)
+ image_data = BytesIO(response.content)
+ pixel_values = load_image(image_data, max_num=10).to(torch.bfloat16).cuda()
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
+
+ # The recommended range for `num_image_token` is 64 to 576; adjust it based on task requirements.
+ num_image_token = 256
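+ # Fewer image tokens generally favor speed; more tokens preserve finer visual detail (a general trade-off to tune per task).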
+
+ # pure-text conversation
+ question = 'Hello, who are you?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question} Assistant: {response}')
+
+ # text-image conversation and multi-round conversation
+ question = '<image> Please describe the image.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question} Assistant: {response}')
+
+ question = 'What is the best title for the image?'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question} Assistant: {response}')
+ ```
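+
+ The same helpers also work with a local file, since `load_image` passes its argument straight to `Image.open`. A minimal sketch (the file name is only a placeholder):
+
+ ```python
+ # Query a local image instead of a downloaded one; replace 'example.jpg' with your own file.
+ pixel_values = load_image('example.jpg', max_num=10).to(torch.bfloat16).cuda()
+ question = '<image> Please describe the image.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question} Assistant: {response}')
+ ```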
+
+ ### Evaluation
+
+ | Benchmark | Qwen2.5-VL-3B | InternVL2.5-4B | Ristretto-3B |
+ | :-------: | :----------: | :-------------: | :----: |
+ | MMBench-TEST-avg | 76.8 | 78.2 | 80.1 |
+ | MMStar | 56.3 | 58.7 | 62.6 |
+ | MMMU-VAL | 51.2 | 51.8 | 49.1 |
+ | MathVista-MINI-test | 61.2 | 60.8 | 67.9 |
+ | HallusionBench | 46.6 | 46.6 | 50.2 |
+ | AI2D | 81.4 | 81.4 | 84.3 |
+ | OCRBench | 82.8 | 82.0 | 84.0 |
+ | MMVet | 60.0 | 61.5 | 61.8 |
+ | Average | 64.5 | 65.1 | 67.6 |
+
+ We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate Ristretto-3B. The other results are taken from the [OpenCompass leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal).
+
+
+ ## License Agreement
+
+ All of our open-source models are licensed under the Apache-2.0 license.
+
+
+
+ ## Citation
+
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->