<div align='center'>
<h1>OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment</h1>

| [Github](https://github.com/xiao-xt/OmniBridge) | [Paper](https://arxiv.org/abs/2509.19018) | [🤗HF Models](https://huggingface.co/collections/) | [Modelscope](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |

</div>

<div align='center'>
<img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>

We propose **OmniBridge**, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, decoupling visual generation, multimodal retrieval, and latent space alignment from the core LLM.

<div align='center'>
<img src="./assets/stage.png" class="interpolation-image" alt="stage." height="80%" width="70%" />
</div>

### OmniBridge excels in both generation and perception

Extensive experiments on standard vision-language benchmarks validate that **OmniBridge** achieves state-of-the-art or competitive performance on multimodal understanding, generation, and retrieval tasks.

<div align='center'>
<img src="./assets/comparison_understanding.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>

<div align='center'>
<img src="./assets/comparison_generation.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

### Highlights

- **OmniBridge** is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
- **OmniBridge** uses a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
- **OmniBridge** features a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
- **OmniBridge** achieves state-of-the-art or competitive performance on multimodal understanding, generation, and retrieval tasks, as validated by extensive experiments on standard vision-language benchmarks.

## Performance

### Vision-Language Understanding

#### Multimodal Reasoning and Mathematics

<div align='center'>
<img src="./assets/understanding_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

<div align='center'>
<img src="./assets/understanding_2.png" class="interpolation-image" alt="comparison." height="70%" width="70%" />
</div>

#### OCR, Chart, and Document Understanding

<div align='center'>
<img src="./assets/understanding_3.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

#### Multi-Image Understanding

<div align='center'>
<img src="./assets/understanding_4.png" class="interpolation-image" alt="comparison." height="50%" width="50%" />
</div>

#### Real-World Comprehension

<div align='center'>
<img src="./assets/understanding_5.png" class="interpolation-image" alt="comparison." height="55%" width="55%" />
</div>

#### Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation

<div align='center'>
<img src="./assets/understanding_6.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>

#### Multimodal Understanding Cases

<div align='center'>
<img src="./assets/understanding_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

### Image Generation

#### Performance on the GenEval benchmark

<div align='center'>
<img src="./assets/gen_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

#### Performance on DPG-Bench

<div align='center'>
<img src="./assets/gen_2.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>

#### Image Generation Cases

<div align='center'>
<img src="./assets/gen_case_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

<div align='center'>
<img src="./assets/gen_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

### Image Editing

#### Performance on ImgEdit-Bench

<div align='center'>
<img src="./assets/editing_2.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

#### Image Editing Cases

<div align='center'>
<img src="./assets/editing_1.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>

### Multimodal Retrieval

<div align='center'>
<img src="./assets/retrieval.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>

## News
- 2025.09 We release **[OmniBridge](https://huggingface.co/)**, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture.

### TODO

- [X] Release model weights of OmniBridge.

### Setup

Clone this repository and install the required packages:

```shell
git clone https://github.com/xiao-xt/OmniBridge
cd OmniBridge
pip install -r requirements.txt
```

You also need to download the HunyuanDiT decoder weights used for image generation: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
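If you prefer to fetch the decoder weights programmatically, a minimal sketch using `huggingface_hub`'s `snapshot_download` might look like this (the `./ckpts/HunyuanDiT-v1.2` target directory is an arbitrary example, not a path the repository mandates):

```python
def fetch_hunyuan_decoder(local_dir="./ckpts/HunyuanDiT-v1.2"):
    """Download the HunyuanDiT v1.2 weights used as the image decoder.

    local_dir is an example path; point it wherever the OmniBridge
    scripts expect the decoder checkpoint.
    """
    # Lazy import so the helper can be defined even before
    # huggingface_hub is installed via requirements.txt.
    from huggingface_hub import snapshot_download

    # Downloads the full repo snapshot and returns the local path.
    return snapshot_download(
        repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
        local_dir=local_dir,
    )
```

Calling `fetch_hunyuan_decoder()` downloads the full checkpoint, so ensure you have sufficient disk space; `huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.2` is an equivalent command-line alternative.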

### Model Weights

| Model name | HF Weight | Modelscope |
| ---------- | --------- | ---------- |
| **OmniBridge** | [🤗 HF link]() | [Modelscope link]() |
| **OmniBridge-Retrieval-Finetuned** | [🤗 HF link](https://huggingface.co/) | [Modelscope link](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |

### Quickstart

#### Use 🤗Transformers to run OmniBridge for vision-language understanding
```shell
python ./multimodal_understanding.py
```

#### Use 🤗Transformers to run OmniBridge for image generation
```shell
python ./image_generation.py
```

#### Use 🤗Transformers to run OmniBridge for image editing
```shell
python ./image_editing.py
```

#### Use 🤗Transformers to run OmniBridge for multimodal retrieval
```shell
python ./multimodal_retrieval.py
```
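The scripts above wrap standard 🤗Transformers loading code. As a rough, hypothetical sketch of what such a script does at startup (the `model_path` argument is a placeholder for the released checkpoint directory or repo id; consult the scripts themselves for the exact entry points and processor usage):

```python
def load_omnibridge(model_path):
    """Load an OmniBridge checkpoint with custom modeling code enabled.

    model_path is a placeholder: pass the local directory or Hugging Face
    repo id of the released OmniBridge weights.
    """
    # Lazy import so this sketch can be inspected without transformers installed.
    from transformers import AutoModel, AutoProcessor

    # trust_remote_code lets Transformers load the model's custom classes
    # shipped alongside the checkpoint.
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
    return model, processor
```

This is a sketch under stated assumptions, not the project's verified API; the repository scripts remain the authoritative usage examples.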

## Citation

If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:

```bibtex
@article{xiao2025omnibridge,
  title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
  author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
  journal={arXiv preprint arXiv:2509.19018},
  year={2025}
}
```