Add comprehensive model card for MENTOR

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +196 -0
README.md ADDED
@@ -0,0 +1,196 @@
---
pipeline_tag: text-to-image
library_name: transformers
license: mit
tags:
- multimodal
- autoregressive-models
---

# MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation

**MENTOR** is an autoregressive (AR) framework for efficient multimodal-conditioned image generation. It addresses key limitations of existing text-to-image models: imprecise visual control, imbalance between multimodal inputs, and the extensive training required for complex multimodal image generation.

MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two stages are: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability.

Despite a modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. It also delivers superior image reconstruction fidelity, broad task adaptability, and better training efficiency than diffusion-based methods.

- **Paper:** [MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models](https://huggingface.co/papers/2507.09574)
- **Project Page:** [https://haozhezhao.github.io/MENTOR.page](https://haozhezhao.github.io/MENTOR.page)
- **Code:** [https://github.com/HaozheZhao/MENTOR](https://github.com/HaozheZhao/MENTOR)

<p align="center">
  <img src="https://raw.githubusercontent.com/HaozheZhao/MENTOR/main/figures/teasarv3.png" width="100%" alt="MENTOR Overview" />
</p>

### 🏆 Efficient Autoregressive Multimodal Image Generation with 10× Less Data

MENTOR demonstrates competitive multimodal image generation, achieving strong results with dramatically reduced resources thanks to its efficient tuning paradigm. While competitors like Emu2 require 37 billion parameters and far larger datasets, MENTOR surpasses their performance with only 2.3 billion parameters and significantly less training data in an autoregressive vision generation framework.

---

## ✨ Key Features

| Feature | MENTOR | Diffusion-Based Models |
|:--------|:-------|:-----------------------|
| **Training Efficiency** | ✅ 1.5 days on 8 GPUs | ❌ 3+ days on 256 GPUs |
| **Deterministic Control** | ✅ Precise AR generation | ❌ Stochastic sampling |
| **Modality Balance** | ✅ Lowest CP/PF ratio (0.65) | ❌ High imbalance (>1.0) |
| **Architecture** | ✅ Simple unified transformer | ❌ Complex auxiliary modules |

---

## 📊 Main Results

### 🏅 DreamBench++ Benchmark Leadership

<p align="center">
  <img src="https://raw.githubusercontent.com/HaozheZhao/MENTOR/main/figures/Figure.png" width="60%" alt="Performance Comparison">
</p>

| Method | Model Size | Training Data | CP↑ | PF↑ | **CP·PF↑** | **CP/PF↓** |
|:-------|:----------:|:-------------:|:---:|:---:|:----------:|:----------:|
| DreamEngine | 10.5B | 21M | 0.68 | 0.37 | 0.26 | 1.84 |
| Kosmos-G | 3B | 200M | 0.54 | 0.51 | 0.28 | 1.06 |
| Emu2 | 37B | 16M | 0.53 | 0.69 | 0.36 | 0.77 |
| IP-Adapter ViT-G | 2.5B | 10M | 0.59 | 0.64 | 0.38 | 0.92 |
| **MENTOR** | **2.3B** | **3M** | 0.55 | 0.84 | **0.47** | **0.65** |

> **CP**: Concept Preservation | **PF**: Prompt Following | **Lower CP/PF = Better Balance**
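
The composite columns are simple derived metrics: CP·PF is the product of the two scores and CP/PF their ratio. A quick sanity check of the MENTOR row using the rounded table values (the product comes out 0.46 rather than the table's 0.47, which suggests the paper computes it from unrounded scores):

```python
# Derived metrics for the MENTOR row of the DreamBench++ table above.
cp, pf = 0.55, 0.84  # Concept Preservation, Prompt Following

cp_pf_product = round(cp * pf, 2)  # overall quality: higher is better
cp_pf_ratio = round(cp / pf, 2)    # modality balance: lower means PF is not sacrificed for CP

print(cp_pf_product)  # 0.46 (table reports 0.47 from unrounded scores)
print(cp_pf_ratio)    # 0.65
```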

### 🎨 Superior Image Reconstruction

| Method | COCO L2↓ | JourneyDB L2↓ | Improvement |
|:--------------|:--------:|:-------------:|:---------------:|
| SeedTokenizer | 0.5102 | 0.5291 | – |
| SEED-X | 0.4317 | 0.4352 | – |
| EMU2-Gen | 0.3828 | 0.2869 | – |
| DreamEngine | 0.2065 | 0.2052 | Baseline |
| **MENTOR** | **0.1008** | **0.0867** | **~50% Better** |

---
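
The L2 columns measure pixel-space reconstruction error. As a minimal sketch, one common definition is the mean squared error over pixel values normalized to [0, 1]; the paper's exact normalization and resolution may differ:

```python
import numpy as np

def l2_reconstruction_error(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared error between two images with pixel values in [0, 1]."""
    assert original.shape == reconstructed.shape
    return float(np.mean((original - reconstructed) ** 2))

# Toy example: a constant-offset "reconstruction" of a random image.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
recon = np.clip(img + 0.1, 0.0, 1.0)
print(l2_reconstruction_error(img, recon))  # small positive error, at most 0.01
```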

## 🎯 Usage Examples

This model can be loaded and used with the Hugging Face `transformers` library by setting `trust_remote_code=True`.

### Basic Generation

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Pick the tiling grid whose aspect ratio best matches the input image."""
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On ties, prefer the grid with more tiles if the image is large enough.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """Split an image into up to `max_num` square tiles matching its aspect ratio."""
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if min_num <= i * j <= max_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        processed_images.append(resized_img.crop(box))
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        processed_images.append(image.resize((image_size, image_size)))
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = torch.stack([transform(img) for img in images])
    return pixel_values

# Load model and tokenizer.
# Note: you may need to download the checkpoint locally first, e.g.
#   huggingface-cli download BleachNick/Mentor --local-dir Mentor
# and point `model_name` at the downloaded folder.
model_name = "BleachNick/Mentor"  # or your local path to the downloaded model
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example usage (text-to-image with optional image conditioning).
# Ensure 'cat.jpg' is available locally (e.g., from the GitHub repo's figures folder).
image_path = "./figures/cat.jpg"  # example image from the repo
pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda() if image_path else None

question = "A cat in <image>.\nA cat in a 16-bit fantasy pixel-art scene"
generation_config = dict(max_new_tokens=1024, do_sample=True)

response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
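
The tiling logic in `dynamic_preprocess` can be sanity-checked without downloading the model. A standalone sketch reproducing the grid-selection step (same logic as `find_closest_aspect_ratio` above):

```python
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Same grid-selection logic as in the usage example above.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if diff < best_ratio_diff:
            best_ratio_diff = diff
            best_ratio = ratio
        elif diff == best_ratio_diff and area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
            best_ratio = ratio
    return best_ratio

# Candidate grids with at most 6 tiles, as in the max_num=6 call above.
max_num = 6
ratios = sorted(
    {(i, j) for n in range(1, max_num + 1)
     for i in range(1, n + 1) for j in range(1, n + 1) if i * j <= max_num},
    key=lambda x: x[0] * x[1])

# A 2:1 landscape image maps to a 2x1 grid of square tiles.
print(find_closest_aspect_ratio(2.0, ratios, 1600, 800, 448))  # (2, 1)
```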

---

## 📚 Citation

If you find MENTOR useful, please cite our paper:

```bibtex
@inproceedings{zhao2024mentor,
  title={MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models},
  author={Zhao, Haozhe* and Cai, Zefan* and Si, Shuzheng and Chen, Liang and
          Gu, Jiuxiang and Xiao, Wen and Hu, Junjie},
  year={2024}
}
```