Add comprehensive model card for MENTOR (#1), opened by nielsr (HF Staff)

README.md (added):
---
pipeline_tag: text-to-image
library_name: transformers
license: mit
tags:
- multimodal
- autoregressive-models
---

# MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation

**MENTOR** is an autoregressive (AR) framework for efficient multimodal-conditioned tuning in image generation. It addresses key limitations of existing text-to-image models: imprecise visual control, imbalanced handling of multimodal inputs, and the extensive training that complex multimodal image generation normally requires.

MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without auxiliary adapters or cross-attention modules. The two stages are: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability.

Despite a modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. It also delivers superior image reconstruction fidelity, broad task adaptability, and better training efficiency than diffusion-based methods.

- **Paper:** [MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models](https://huggingface.co/papers/2507.09574)
- **Project Page:** [https://haozhezhao.github.io/MENTOR.page](https://haozhezhao.github.io/MENTOR.page)
- **Code:** [https://github.com/HaozheZhao/MENTOR](https://github.com/HaozheZhao/MENTOR)

<p align="center">
  <img src="https://github.com/HaozheZhao/MENTOR/blob/main/figures/teasarv3.png" width="100%" alt="MENTOR Overview" />
</p>

### 🚀 Efficient Autoregressive Multimodal Image Generation with 10× Less Data

MENTOR achieves competitive multimodal image generation with dramatically fewer resources thanks to its efficient tuning paradigm. While competitors such as Emu2 rely on 37 billion parameters and vast datasets, MENTOR surpasses their performance with only 2.3 billion parameters and significantly less training data in an autoregressive vision generation framework.

---

## ✨ Key Features

| Feature | MENTOR | Diffusion-Based Models |
|:--------|:-------|:-----------------------|
| **Training Efficiency** | ✅ 1.5 days on 8 GPUs | ❌ 3+ days on 256 GPUs |
| **Deterministic Control** | ✅ Precise AR generation | ❌ Stochastic sampling |
| **Modality Balance** | ✅ Lowest CP/PF ratio (0.65) | ❌ High imbalance (>1.0) |
| **Architecture** | ✅ Simple unified transformer | ❌ Complex auxiliary modules |

---

## 📊 Main Results

### 🏅 DreamBench++ Benchmark Leadership

<p align="center">
  <img src="https://github.com/HaozheZhao/MENTOR/blob/main/figures/Figure.png" width="60%" alt="Performance Comparison">
</p>

| Method | Model Size | Training Data | CP↑ | PF↑ | **CP·PF↑** | **CP/PF↓** |
|:-------|:----------:|:-------------:|:---:|:---:|:----------:|:----------:|
| DreamEngine | 10.5B | 21M | 0.68 | 0.37 | 0.26 | 1.84 |
| Kosmos-G | 3B | 200M | 0.54 | 0.51 | 0.28 | 1.06 |
| Emu2 | 37B | 16M | 0.53 | 0.69 | 0.36 | 0.77 |
| IP-Adapter ViT-G | 2.5B | 10M | 0.59 | 0.64 | 0.38 | 0.92 |
| **MENTOR** | **2.3B** | **3M** | 0.55 | 0.84 | **0.47** | **0.65** |

> **CP**: Concept Preservation | **PF**: Prompt Following | **Lower CP/PF = better balance**
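As a quick sanity check, the balance ratio in the last column can be recomputed directly from the rounded CP and PF scores (the CP·PF column appears to be computed from unrounded scores, so products may differ in the last digit):

```python
# Recompute DreamBench++ composite metrics from the rounded CP/PF
# columns of the table above.
scores = {
    "DreamEngine": (0.68, 0.37),
    "Kosmos-G": (0.54, 0.51),
    "Emu2": (0.53, 0.69),
    "IP-Adapter ViT-G": (0.59, 0.64),
    "MENTOR": (0.55, 0.84),
}
for name, (cp, pf) in scores.items():
    # Lower CP/PF means a better balance between concept
    # preservation and prompt following.
    print(f"{name}: CP*PF={cp * pf:.2f}, CP/PF={cp / pf:.2f}")
```

MENTOR's ratio of 0.55 / 0.84 ≈ 0.65 is the lowest among the listed methods.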

### 🎨 Superior Image Reconstruction

| Method | COCO L2↓ | JourneyDB L2↓ | Improvement |
|:--------------|:--------:|:-------------:|:---------------:|
| SeedTokenizer | 0.5102 | 0.5291 | - |
| SEED-X | 0.4317 | 0.4352 | - |
| EMU2-Gen | 0.3828 | 0.2869 | - |
| DreamEngine | 0.2065 | 0.2052 | Baseline |
| **MENTOR** | **0.1008** | **0.0867** | **~50% Better** |
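The L2 columns measure pixel-space distance between the source image and its reconstruction (lower is better). A minimal sketch of such a metric, assuming images normalized to [0, 1]; the paper's exact evaluation protocol (resizing, normalization, aggregation) may differ:

```python
import numpy as np

def l2_reconstruction_error(original, reconstructed):
    """Mean per-pixel L2 distance between two HxWxC images in [0, 1].

    Illustrative only: mirrors the kind of pixel-space metric reported
    in the table above, not necessarily the paper's exact protocol.
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    assert original.shape == reconstructed.shape
    # L2 norm over the channel axis, then average over all pixels.
    return float(np.sqrt(((original - reconstructed) ** 2).sum(axis=-1)).mean())

a = np.zeros((4, 4, 3))
b = np.full((4, 4, 3), 0.1)
print(l2_reconstruction_error(a, a))  # 0.0
print(l2_reconstruction_error(a, b))  # constant shift -> constant error
```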

---

## 🎯 Usage Examples

This model can be loaded with the Hugging Face `transformers` library by setting `trust_remote_code=True`.

### Basic Generation

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Standard ImageNet preprocessing at a fixed square resolution.
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tiling grid whose aspect ratio best matches the input image.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On a tie, prefer the larger grid for sufficiently large images.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # Split the image into a grid of square tiles (plus an optional thumbnail).
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if min_num <= i * j <= max_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        processed_images.append(resized_img.crop(box))
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        processed_images.append(image.resize((image_size, image_size)))
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(img) for img in images])

# Load model and tokenizer.
# Note: you may need to download the checkpoint locally first with
#   huggingface-cli download BleachNick/Mentor --local-dir Mentor
# and point `model_name` at the downloaded folder.
model_name = "BleachNick/Mentor"  # or your local path to the downloaded model
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example usage (text-to-image with optional image conditioning).
# Ensure 'cat.jpg' is available locally (e.g., from the GitHub repo's figures folder).
image_path = "./figures/cat.jpg"  # example image from the repo
pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda() if image_path else None

question = "A cat in <image>.\nA cat in a 16-bit fantasy pixel-art scene"
generation_config = dict(max_new_tokens=1024, do_sample=True)

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
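The dynamic tiling in the script above picks the grid of square tiles whose aspect ratio best matches the input image. A simplified standalone version of that selection logic (it omits the area-based tie-break of `find_closest_aspect_ratio`, so it is an illustration rather than an exact replica):

```python
# Standalone illustration of the tile-grid selection used by
# `dynamic_preprocess`: choose the (cols, rows) grid whose aspect
# ratio is closest to the input image's, smallest grid winning ties.
def closest_grid(width, height, min_num=1, max_num=12):
    aspect = width / height
    candidates = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda r: r[0] * r[1],
    )
    return min(candidates, key=lambda r: abs(aspect - r[0] / r[1]))

# A 896x448 (2:1) image maps to a 2x1 grid of 448x448 tiles.
print(closest_grid(896, 448))   # (2, 1)
print(closest_grid(448, 1344))  # (1, 3)
```

With `use_thumbnail=True`, `load_image` then appends one extra 448×448 thumbnail whenever more than one tile is produced, which is why `pixel_values` can hold `blocks + 1` frames.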

---

## 📝 Citation

If you find MENTOR useful, please cite the paper:

```bibtex
@inproceedings{zhao2024mentor,
  title={MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models},
  author={Zhao, Haozhe and Cai, Zefan and Si, Shuzheng and Chen, Liang and Gu, Jiuxiang and Xiao, Wen and Hu, Junjie},
  year={2024}
}
```