luckychao committed on
Commit d9db579 · verified · 1 Parent(s): 1880343

Update README.md

Files changed (1)
  1. README.md +63 -81
README.md CHANGED
@@ -1,118 +1,100 @@
1
  ---
2
  license: apache-2.0
3
  base_model:
4
- - Qwen/Qwen2.5-7B-Instruct
5
  pipeline_tag: any-to-any
6
- library_name: bagel-mot
7
  ---
8
 
9
-
10
- <p align="left">
11
- <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
12
  </p>
13
 
14
 
15
- # 🥯 BAGEL • Unified Model for Multimodal Understanding and Generation
16
-
17
 
 
 
18
 
19
- <p align="left">
20
- <a href="https://bagel-ai.org/">
21
  <img
22
- src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
23
- alt="BAGEL Website"
24
  />
25
  </a>
26
- <a href="https://arxiv.org/abs/2505.14683">
27
  <img
28
- src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
29
- alt="BAGEL Paper on arXiv"
30
  />
31
  </a>
32
- <a href="https://github.com/bytedance-seed/BAGEL" target="_blank" style="margin: 2px;">
33
  <img
34
- alt="Github" src="https://img.shields.io/badge/BAGEL-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
35
- alt="BAGEL Codebase"
36
  />
37
  </a>
38
- <a href="https://demo.bagel-ai.org/">
39
- <img
40
- src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" style="display: inline-block; vertical-align: middle;"
41
- alt="BAGEL Demo"
42
  />
43
  </a>
44
- <a href="https://discord.com/invite/Z836xxzy">
45
  <img
46
- src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"
47
- alt="BAGEL Discord"
48
  />
49
- </a>
50
-
51
-
52
  </p>
53
 
 
 
 
54
 
55
- > We present **BAGEL**, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms the current top-tier open-source VLMs like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality that is competitive with strong specialist generators such as SD3.
56
- Moreover, BAGEL demonstrates better qualitative results in classical image-editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.
57
-
58
-
59
- This repository hosts the model weights for **BAGEL**. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/bytedance-seed/BAGEL).
60
-
61
-
62
-
63
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"></p>
64
-
65
 
 
 
 
 
 
 
 
 
 
66
 
 
 
67
 
 
 
 
 
 
68
 
69
 
70
- ## 🧠 Method
71
- BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model's capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
72
-
73
- BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
74
-
75
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"></p>
76
-
77
-
78
- ## 🌱 Emerging Properties
79
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"></p>
80
-
81
- As we scale up BAGEL's pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and further supporting its role in the emergence of advanced capabilities.
82
-
83
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
84
 
- ## 📊 Benchmarks
- ### 1. Visual Understanding
- | Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
- | ------------------- | ----------: | ----------: | -------: | -------: | ----------: |
- | Janus-Pro-7B | – | 79.2 | 41.0 | 50.0 | – |
- | Qwen2.5-VL-7B | 2347 | 83.5 | **58.6** | 67.1 | 68.2 |
- | **BAGEL** | **2388** | **85.0** | 55.3 | **67.2** | **73.1** |
- ### 2. Text-to-Image Generation · GenEval
- | Model | Overall ↑ |
- | ------------ | --------- |
- | FLUX-1-dev | 0.82 |
- | SD3-Medium | 0.74 |
- | Janus-Pro-7B | 0.80 |
- | **BAGEL** | **0.88** |
- ### 3. Image Editing
- | Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
- | ------------- | --------------------- | --------------------- | ------------------- | ------------------ |
- | Step1X-Edit | 7.09 | 6.76 | **6.70** | 14.9 |
- | Gemini-2-exp. | 6.73 | 6.61 | 6.32 | **57.6** |
- | **BAGEL** | **7.36** | **6.83** | 6.52 | 44.0 |
- | **BAGEL+CoT** | – | – | – | 55.3 |
106
-
107
- ## License
108
- BAGEL is licensed under the Apache 2.0 license. It is finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) model, and uses the [FLUX.1-schnell VAE model](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.
109
 
110
  ## ✍️ Citation
 
111
  ```bibtex
112
- @article{deng2025bagel,
113
- title = {Emerging Properties in Unified Multimodal Pretraining},
114
- author = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
115
- journal = {arXiv preprint arXiv:2505.14683},
116
- year = {2025}
117
- }
118
- ```
 
1
  ---
2
  license: apache-2.0
3
  base_model:
4
+ - ByteDance-Seed/BAGEL-7B-MoT
5
  pipeline_tag: any-to-any
6
+ library_name: ThinkMorph-7B
7
  ---
8
 
9
+ <p align="center">
10
+ <img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/logo.png" width="40%"> <br>
 
11
  </p>
12
 
13
 
14
+ ## Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
 
15
 
16
+ 🌟 This is the official repository for the paper "[ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning]()"; it hosts the ThinkMorph model checkpoint.
17
+ For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/ThinkMorph/ThinkMorph).
18
 
19
+ <p align="center">
20
+ <a href="">
21
  <img
22
+ src="https://img.shields.io/badge/ThinkMorph-Website-0A66C2?logo=safari&logoColor=white"
23
+ alt="ThinkMorph Website"
24
  />
25
  </a>
26
+ <a href="">
27
  <img
28
+ src="https://img.shields.io/badge/ThinkMorph-Paper-red?logo=arxiv&logoColor=red"
29
+ alt="ThinkMorph Paper on arXiv"
30
  />
31
  </a>
32
+ <a href="https://github.com/ThinkMorph/ThinkMorph" target="_blank" style="margin: 2px;">
33
  <img
34
+ alt="Github" src="https://img.shields.io/badge/ThinkMorph-Codebase-536af5?color=536af5&logo=github"
35
+ alt="ThinkMorph Codebase"
36
  />
37
  </a>
38
+ <a href="https://huggingface.co/ThinkMorph">
39
+ <img
40
+ src="https://img.shields.io/badge/ThinkMorph-Dataset-yellow?logo=huggingface&logoColor=yellow"
41
+ alt="ThinkMorph Dataset"
42
  />
43
  </a>
44
+ <!-- <a href="https://demo.bagel-ai.org/">
45
  <img
46
+ src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=blue"
47
+ alt="BAGEL Demo"
48
  />
49
+ </a> -->
 
 
50
  </p>
51
 
52
+ ## 💥 News
53
+ - **[2025.10.29]** Our model checkpoint and training data are now available on [Hugging Face](https://huggingface.co/ThinkMorph).
54
+ - **[2025.10.29]** Our paper is now accessible at .
55
 
56
+ ## 👀 About ThinkMorph
 
 
 
 
 
 
 
 
 
57
 
58
+ Multimodal reasoning demands synergistic coordination of language and vision. However, determining what constitutes meaningful interleaved reasoning is non-trivial, and current approaches lack a generalizable recipe.
59
+ We present **ThinkMorph**, a unified model that enables such generalization through a principled approach: treating text and images as complementary modalities that mutually advance reasoning.
60
+ <p align="center">
61
+ <img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/interleaved_design.jpg" width="100%"> <br>
62
+ </p>
63
+ Guided by this principle, we identify tasks requiring concrete, verifiable visual engagement and design a high-quality data pipeline that trains models to generate interleaved images and text as progressive reasoning traces.
64
+ <p align="center">
65
+ <img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/thinkmorph_main.jpg" width="100%"> <br>
66
+ </p>
67
 
68
+ ThinkMorph delivers substantial gains on **vision-centric** tasks, achieving an average improvement of 34.74% over the base model while consistently surpassing text-only and image-only modes.
69
+ By fine-tuning with **merely ~24K** samples, it achieves out-of-domain performance that rivals or even surpasses leading large-scale, proprietary VLMs.
70
 
71
+ Intriguingly, ThinkMorph unlocks emergent properties that represent a *hallmark of multimodal intelligence*: the elicitation of unseen visual manipulation skills, the self-adaptive switching between reasoning modes according to task complexity, and better test-time scaling via diversified thoughts.
72
+ <p align="center">
73
+ <img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/emrging_prop.jpg" width="100%"> <br>
74
+ </p>
75
+ These findings suggest promising directions for future work to characterize the emergent capabilities of unified models for multimodal reasoning.
76
 
77
 
78
+ ## 📊 Benchmarks
 
 
 
 
 
 
 
 
 
 
 
 
79
 
+ | Model | Size | | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
+ | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+ | GPT-4o | – | | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 |
+ | GPT-5 | – | | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 |
+ | Gemini 2.5 Flash | – | | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
+ | InternVL3.5 | 8B | | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 |
+ | | 38B | | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
+ | Qwen2.5-VL | 7B | | 2.16 | 34.75 | 78.12 | 76.44 | 59.33 | 77.33 | 51.33 | 55.92 | 75.20 |
+ | | 72B | | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.00 | 64.67 | 61.91 | 82.54 |
+ | Janus-pro | 7B | | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 |
+ | Chameleon | 7B | | 0.83 | 30.50 | 5.74 | 28.27 | 0.67 | 47.67 | 10.67 | 16.52 | 36.52 |
+ | Bagel | 7B | | 0.83* | 35.00* | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
+ | **ThinkMorph** | **7B** | | **75.83** | **79.00** | **78.10** | **67.02** | **72.00** | **80.33** | **52.67** | **60.07** | **80.82** |
+ | Δ (vs Bagel) | | | +75.00 | +44.00 | +16.28 | +11.53 | +4.67 | +10.00 | +8.00 | +12.41 | +4.79 |
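The last row of the table is the per-benchmark difference between the ThinkMorph and Bagel rows. A minimal Python sketch of that arithmetic follows; the scores are copied by hand from the two rows, and the names `BENCHMARKS`, `BAGEL`, `THINKMORPH`, and `deltas` are illustrative assumptions, not part of the ThinkMorph codebase.

```python
# Scores copied from the Bagel and ThinkMorph rows of the benchmark table.
# All names here are hypothetical and exist only for this illustration.
BENCHMARKS = ["VSP", "VisPuzzle", "ChartQA", "VStar", "BLINK-J",
              "MMVP", "SAT", "BLINK", "CV-Bench"]
BAGEL = [0.83, 35.00, 61.82, 55.49, 67.33, 70.33, 44.67, 47.66, 76.03]
THINKMORPH = [75.83, 79.00, 78.10, 67.02, 72.00, 80.33, 52.67, 60.07, 80.82]


def deltas(new_scores, base_scores):
    """Return per-benchmark differences, rounded to two decimals."""
    return [round(n - b, 2) for n, b in zip(new_scores, base_scores)]


DELTA = deltas(THINKMORPH, BAGEL)

# Print one signed difference per benchmark, in table order.
for name, diff in zip(BENCHMARKS, DELTA):
    print(f"{name:>10}: {diff:+.2f}")
```

Each printed value should match the corresponding cell of the Δ row, which is a quick way to check the table for transcription errors.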
94
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
95
 
96
  ## ✍️ Citation
97
+
98
  ```bibtex
99
+
100
+ ```