liuchanglab committed · Commit 6fd7c15 · verified · 1 Parent(s): 41378cc

Create README.md

Files changed (1): README.md (+238 −3)
<div align="center">

# MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
<img src='./doc/logo.png' alt="MammothModa Logo" width="100" style="max-width: 100px; height: auto;">

[![GitHub](https://img.shields.io/badge/MammothModa2-GitHub-blue)](https://github.com/bytedance/mammothmoda)
[![Project Page](https://img.shields.io/badge/MammothModa2-Project_Page-green)](https://mammothmoda2.github.io/)
[![HuggingFace](https://img.shields.io/badge/MammothModa2-HuggingFace_Model-yellow)](https://huggingface.co/bytedance-research/MammothModa2-Preview)
[![Report](https://img.shields.io/badge/MammothModa2-arXiv-red)](https://arxiv.org/abs/2511.18262)

</div>

## Introduction

MammothModa2 (Mammoth2) is a unified autoregressive-diffusion (AR-Diffusion) framework that seamlessly integrates multimodal understanding and generation within a single model. Mammoth2 effectively couples autoregressive semantic planning with diffusion-based generation, enabling high-quality text-to-image generation, instruction-based editing, and comprehensive multimodal understanding.

**Key Features:**
- **Serial AR-Diffusion Architecture**: An AR path performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis
- **Unified Joint Training**: End-to-end training with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning (a minimal sketch of the joint objective follows below)
- **Feature Alignment Module**: A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align the AR representations with the diffusion decoder's continuous latents
- **Strong Performance**: Achieves 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones on multimodal understanding tasks

Trained with roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 demonstrates that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.

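To make the joint training objective concrete, here is a minimal, illustrative sketch of how a Next-Token Prediction loss on the AR path can be combined with a Flow Matching loss on the DiT path. The function and argument names (`joint_loss`, `fm_weight`, the pre-computed velocity tensors) are hypothetical and are not part of the MammothModa2 codebase; the actual training recipe and loss weighting are described in the report.

```python
# Minimal sketch (not MammothModa2 code): joint Next-Token Prediction + Flow Matching loss.
import torch
import torch.nn.functional as F


def joint_loss(
    ar_logits: torch.Tensor,        # (B, T, vocab) logits from the AR path over discrete tokens
    token_targets: torch.Tensor,    # (B, T) discrete text/image token sequence
    velocity_pred: torch.Tensor,    # velocity predicted by the DiT decoder
    velocity_target: torch.Tensor,  # flow-matching target velocity (noise -> image latents)
    fm_weight: float = 1.0,         # hypothetical weighting between the two objectives
) -> torch.Tensor:
    # Next-Token Prediction: shift by one position so each step predicts the following token.
    ntp = F.cross_entropy(
        ar_logits[:, :-1].reshape(-1, ar_logits.size(-1)),
        token_targets[:, 1:].reshape(-1),
    )
    # Flow Matching: regress the velocity field, conditioned on the aligned AR features.
    fm = F.mse_loss(velocity_pred, velocity_target)
    return ntp + fm_weight * fm
```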
## Showcases
<!-- <div align="center">
<img src='./mammoth.png' alt="MammothModa Overview" width="80%">
</div> -->

<div align="center">
<img src='./doc/mammoth.png' alt="MammothModa2 Showcases" style="max-width: 80%; height: auto;">
</div>

## 🎉 News
- 2025-12-10: 🔥MammothModa2-Dev models are now available at [HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Dev).
- 2025-10-01: 🔥MammothModa2-Preview models are now available at [HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Preview). **Note: To use the Preview version, please switch to the `qwen25vl` branch.**


## 🪄 Models
| Model | Download Link | License |
|-------|---------------|---------|
| MammothModa2-Dev | [🤗 HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Dev) | [Apache-2.0](https://opensource.org/licenses/Apache-2.0) |
| MammothModa2-Preview | [🤗 HuggingFace](https://huggingface.co/bytedance-research/MammothModa2-Preview) | [Apache-2.0](https://opensource.org/licenses/Apache-2.0) |

## ⚙️ Installation

The codebase has been tested with Python 3.11.9, CUDA 12.4, and PyTorch 2.6.0. You can set up the environment with uv using the following commands:

```bash
# Clone the repository
git clone https://github.com/bytedance/mammothmoda.git
cd mammothmoda

# Install dependencies
uv sync --frozen
```
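After `uv sync` completes, an optional sanity check can confirm that the interpreter, PyTorch build, and GPU are visible. The expected values below simply mirror the tested configuration stated above:

```python
# Optional sanity check for the tested configuration (Python 3.11.9, CUDA 12.4, PyTorch 2.6.0).
import sys
import torch

print(sys.version)                # expect 3.11.x
print(torch.__version__)          # expect 2.6.0
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # expect True on a GPU machine
```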

## 🚀 Usage

### Text-to-Image Generation

```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from mammothmoda2.model import DEFAULT_NEGATIVE_PROMPT, Mammothmoda2Model
from mammothmoda2.utils import decode_diffusion_image

# Mammothmoda2 model and processor loading.
model = Mammothmoda2Model.from_pretrained(
    "bytedance-research/MammothModa2-Preview",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
    t2i_generate=True,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "bytedance-research/MammothModa2-Preview",
    t2i_generate=True,
    ar_height=32,
    ar_width=32,
)

# Mammothmoda2 inputs preprocessing.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "This image shows a beautiful view of a modern city. The most striking element is a skyscraper towering into the clouds, its facade standing out in the glow of the setting sun. It is surrounded by high-rise buildings of varied styles, their windows dotted with lights that reveal the city's vibrancy. On the left stands a distinctive building with a green dome. On the water in front of the buildings, several white sailboats are under way, adding a lively touch to the city. The sky glows a romantic pink, suggesting sunrise or sunset; the whole scene is rendered in soft colors and filled with a calm, beautiful atmosphere.",
            },
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    num_images_per_prompt=4,
    cfg_scale=7.0,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    padding=True,
    padding_side="left",
    return_tensors="pt",
    return_token_type_ids=False,  # Otherwise generate would raise an error.
).to("cuda")

# Mammothmoda2 t2i generation.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    generated_ids, attention_mask = model.generate(**inputs)
    diff_return_info = decode_diffusion_image(
        input_ids=inputs.input_ids,
        generated_ids=generated_ids,
        attention_mask=attention_mask,
        negative_ids=inputs.get("negative_ids", None),
        negative_mask=inputs.get("negative_mask", None),
        model=model,
        tokenizer=processor.tokenizer,
        output_dir="./mammothmoda2_t2i_release",
        num_images_per_prompt=4,
        text_guidance_scale=9.0,
        vae_scale_factor=16,
        cfg_range=(0.0, 1.0),
        num_inference_steps=50,
        height=1024,
        width=1024,
    )
```
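In this two-stage flow, `model.generate` first produces the autoregressive semantic plan, and `decode_diffusion_image` then runs the DiT decoder with classifier-free guidance to synthesize the final images. `num_inference_steps`, `text_guidance_scale`, and `height`/`width` control the diffusion sampling, and the decoded outputs are presumably written under the directory passed as `output_dir` (here `./mammothmoda2_t2i_release`).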

### Multimodal Understanding

```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from mammothmoda2.model import Mammothmoda2Model

# Mammothmoda2 model and processor loading.
model = Mammothmoda2Model.from_pretrained(
    "bytedance-research/MammothModa2-Preview",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
).to("cuda")
print(f"model.device={model.device}")
processor = AutoProcessor.from_pretrained("bytedance-research/MammothModa2-Preview")

# Mammothmoda2 inputs preprocessing.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "doc/example0.png",
            },
            {"type": "text", "text": "In this scene, based on the man's facial expression and body language, can we infer his emotional state?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    padding_side="left",
    return_tensors="pt",
    return_token_type_ids=False,
).to("cuda")

# Mammothmoda2 model generation and decoding.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    generated_ids = model.generate(**inputs)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_texts = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_texts)
```
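For batched understanding, standard Hugging Face multimodal processors accept a list of chat-templated strings. The sketch below assumes the MammothModa2 processor follows the Qwen2.5-VL batching convention (we have not verified this against the released processor) and reuses `model`, `processor`, and `process_vision_info` from the example above; the prompts and image path are placeholders.

```python
# Hedged sketch: batched understanding, assuming Qwen2.5-VL-style batching in the processor.
messages_a = [{"role": "user", "content": [
    {"type": "image", "image": "doc/example0.png"},
    {"type": "text", "text": "Describe this image."},
]}]
messages_b = [{"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}]

texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in (messages_a, messages_b)
]
image_inputs, video_inputs = process_vision_info([messages_a, messages_b])
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    padding_side="left",  # left padding keeps generated tokens aligned at the sequence end
    return_tensors="pt",
    return_token_type_ids=False,
).to("cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    generated_ids = model.generate(**inputs)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True))
```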

## 📊 Benchmark Results

| Model | Model Size | GenEval | DPGBench |
|-------|------------|---------|----------|
| **Generation** | | | |
| SDXL | - | 0.55 | 74.65 |
| DALL-E 3 | - | 0.67 | 83.50 |
| FLUX.1-dev | - | 0.67 | 84.00 |
| SD3.5-Medium* | - | 0.65 | 83.86 |
| **Unified** | | | |
| Emu3 | 8B | 0.66 | 80.60 |
| Janus-Pro | 7B | 0.80 | 84.19 |
| MetaQuery-XL | 7B + 1.6B | 0.80 | 82.05 |
| UniWorld-V1 | 7B + 12B | 0.84 | 81.38 |
| Blip3-o-8B | 7B + 1.4B | 0.84 | 81.60 |
| OmniGen2 | 3B + 4B | 0.86 | 83.57 |
| Ovis-U1 | 2.4B + 1.2B | 0.89 | 83.72 |
| UniPic2 | 7B + 2B | 0.90 | 83.79 |
| BAGEL | 7B + 7B | 0.88 | 85.07 |
| Show-o2 | 7B | 0.76 | 86.14 |
| GPT-4o | - | 0.84 | 86.23 |
| MammothModa2 | 8B + (3B + 2B) | 0.87 | 87.2 |

**Note**: Model sizes in the "A + B" format indicate separate understanding (A) and generation (B) parameters; models without a "+" share parameters for both tasks. MammothModa2 uses an 8B + (3B + 2B) architecture: the 8B parameters serve understanding, while the generation path consists of 3B parameters in the AR (MLLM backbone) and 2B parameters in the DiT component.


## Acknowledgement

We are grateful to the following open-source projects:

- [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)


## Citation

If you find MammothModa2 useful in your research, please cite:

```bibtex
@article{shen2025mammothmoda2,
  title={MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation},
  author={Shen, Tao and Wan, Xin and Chen, Taicai and Zhang, Rui and Pan, Junwen and Lu, Dawei and Lei, Fanding and Lu, Zhilin and Yang, Yunfei and Cheng, Chen and She, Qi and Liu, Chang and Sun, Zhenbang},
  journal={arXiv preprint arXiv:2511.18262},
  year={2025},
  url={https://arxiv.org/abs/2511.18262}
}
```

## 🎯 Join Our Team

**Moderation LLM Team @ ByteDance** - We're hiring talented individuals passionate about multimodal AI, computer vision, and LLM development!

We develop leading LLMs for content moderation, building infrastructure that spans model benchmarking, data pipelines, efficient architectures, and training methodologies.

**Contact:** zhangna.2020@bytedance.com