Update README.md
README.md
---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- BAAI/Infinity-MM
base_model:
- google/siglip2-so400m-patch14-384
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

## Introduction

We are excited to introduce **Ristretto**, our newest vision-language model (VLM). Ristretto supports dynamic image tokens: the number of image tokens can be adjusted flexibly to match task requirements, and the projector architecture has been enhanced to handle these dynamic token configurations. Through this refined architecture and an improved training approach, the model delivers better performance and versatility than its predecessors.

**Key Innovations**

Coming soon...

### Environment Setup

```bash
pip install "torch>=2.3.0"
pip install transformers==4.37.0
```

### How to use?

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
import requests
from io import BytesIO

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=10, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids allowed by min_num and max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the grid whose aspect ratio is closest to the input image
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image and split it into tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_data, input_size=384, max_num=10):
    image = Image.open(image_data).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model_path = 'LiAutoAD/Ristretto-3B'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
image_data = BytesIO(response.content)
pixel_values = load_image(image_data, max_num=10).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
# the number of image tokens is adjustable; the supported range is 64 to 576
num_image_token = 256

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question} Assistant: {response}')

# single-image single-round conversation
question = '<image> Please describe the image briefly.'
response = model.chat(tokenizer, pixel_values, question, generation_config, num_image_token=num_image_token)
print(f'User: {question} Assistant: {response}')

# single-image multi-round conversation
question = '<image> Please describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, num_image_token=num_image_token, history=None, return_history=True)
print(f'User: {question} Assistant: {response}')
```
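
Because the projector accepts a variable number of image tokens, `num_image_token` can be chosen per request. The sketch below continues from the script above (it reuses `model`, `tokenizer`, `pixel_values`, `generation_config`, and the `history` returned by the last call); the specific token counts and the pattern of feeding `history` back into `model.chat` are illustrative assumptions based on the interface shown above, not an official recommendation.

```python
# Trade visual detail for speed by changing num_image_token per request.
# 64 and 576 are illustrative values within the range noted above.
question = '<image> Give a one-sentence caption for the image.'
quick_response = model.chat(tokenizer, pixel_values, question, generation_config, num_image_token=64)
print(f'num_image_token=64: {quick_response}')

question = '<image> Please read out any text that appears in the image.'
detailed_response = model.chat(tokenizer, pixel_values, question, generation_config, num_image_token=576)
print(f'num_image_token=576: {detailed_response}')

# Continue the multi-round conversation by passing the returned `history` back in
# (assumed to mirror the history=None / return_history=True pattern shown above).
follow_up = 'What stands out the most in the image?'
response, history = model.chat(tokenizer, pixel_values, follow_up, generation_config,
                               num_image_token=num_image_token, history=history, return_history=True)
print(f'User: {follow_up} Assistant: {response}')
```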

### Evaluation

| Benchmark | Qwen2.5-VL-3B | InternVL2.5-4B | Ristretto-3B |
| :-----------------: | :-----------: | :------------: | :----------: |
| MMBench-TEST-avg | 76.8 | 78.2 | 82.7 |
| MMStar | 56.3 | 58.7 | 62.6 |
| MMMU-VAL | 51.2 | 51.8 | 49.1 |
| MathVista-mini-test | 61.2 | 60.8 | 67.9 |
| HallusionBench | 46.6 | 46.6 | 50.2 |
| AI2D | 81.4 | 81.4 | 84.3 |
| OCRBench | 82.8 | 82.0 | 84.0 |
| MMVet | 60.0 | 61.5 | 61.8 |
| Average | 64.5 | 65.1 | 67.8 |

We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate Ristretto-3B; the other results are taken from the [OpenCompass leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal).

## License Agreement

All of our open-source models are licensed under the Apache-2.0 license.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->