<div align='center'>
<h1>OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment</h1>
<h3></h3>
<!-- [Emu3 Team, BAAI](https://www.baai.ac.cn/english.html) -->
| [Github](https://github.com/xiao-xt/OmniBridge) | [Paper](https://arxiv.org/abs/2509.19018) | [🤗HF Models](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |
</div>
<div align='center'>
<img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>
We propose **OmniBridge**, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, which decouples visual generation, multimodal retrieval, and latent space alignment from the core LLM.
<div align='center'>
<img src="./assets/stage.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>
### OmniBridge excels in both generation and perception
Extensive experiments on standard vision-language benchmarks show that **OmniBridge** achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
<div align='center'>
<img src="./assets/comparison_understanding.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
<div align='center'>
<img src="./assets/comparison_generation.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Highlights
- **OmniBridge** is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
- **OmniBridge** introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
- **OmniBridge** uses a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
- Extensive experiments on standard vision-language benchmarks show that **OmniBridge** achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
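The idea of gradually replacing text conditioning with query embeddings can be pictured with a toy blending schedule. This is only an illustrative sketch: the linear ramp, the element-wise blend, and the names `query_weight` and `blend_condition` are assumptions made here, not the training procedure used by OmniBridge (see the paper for the actual mechanism).

```python
from typing import List


def query_weight(step: int, total_steps: int) -> float:
    """Share of the conditioning taken from query embeddings.

    Hypothetical linear ramp from 0.0 (text only) to 1.0 (queries only).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def blend_condition(
    text_emb: List[float],
    query_emb: List[float],
    step: int,
    total_steps: int,
) -> List[float]:
    """Mix text and query embeddings element-wise for the given step."""
    if len(text_emb) != len(query_emb):
        raise ValueError("embeddings must have the same length")
    w = query_weight(step, total_steps)
    return [(1.0 - w) * t + w * q for t, q in zip(text_emb, query_emb)]


# Early in training the condition is the text embedding; by the final
# step it has been fully replaced by the query embedding.
start = blend_condition([1.0, 1.0], [0.0, 2.0], step=0, total_steps=10)
middle = blend_condition([1.0, 1.0], [0.0, 2.0], step=5, total_steps=10)
end = blend_condition([1.0, 1.0], [0.0, 2.0], step=10, total_steps=10)
```

In a real setup the embeddings would be tensors and the query embeddings would be trainable parameters. Plain lists are used here only to keep the sketch dependency-free.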
## Performance
### Vision-Language Understanding
#### Multimodal Reasoning and Mathematics
<div align='center'>
<img src="./assets/understanding_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
<div align='center'>
<img src="./assets/understanding_2.png" class="interpolation-image" alt="comparison." height="70%" width="70%" />
</div>
#### OCR, Chart, and Document Understanding
<div align='center'>
<img src="./assets/understanding_3.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Multi-Image Understanding
<div align='center'>
<img src="./assets/understanding_4.png" class="interpolation-image" alt="comparison." height="50%" width="50%" />
</div>
#### Real-World Comprehension
<div align='center'>
<img src="./assets/understanding_5.png" class="interpolation-image" alt="comparison." height="55%" width="55%" />
</div>
#### Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation
<div align='center'>
<img src="./assets/understanding_6.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>
#### Multimodal Understanding Cases
<div align='center'>
<img src="./assets/understanding_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Image Generation
#### Performance on GenEval benchmark
<div align='center'>
<img src="./assets/gen_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Performance on DPG-Bench
<div align='center'>
<img src="./assets/gen_2.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
#### Image Generation Cases
<div align='center'>
<img src="./assets/gen_case_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
<div align='center'>
<img src="./assets/gen_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Image Editing
#### Performance on IMGEDIT-BENCH
<div align='center'>
<img src="./assets/editing_2.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Image Editing Cases
<div align='center'>
<img src="./assets/editing_1.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>
### Multimodal Retrieval
<div align='center'>
<img src="./assets/retrieval.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
## News
- 2025.09 We release **[OmniBridge](https://huggingface.co/)**, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture.
### TODO
- [X] Release model weights of OmniBridge.
### Setup
Clone this repository and install required packages:
```shell
git clone https://github.com/xiao-xt/OmniBridge
# Enter the cloned repository so that requirements.txt can be found
cd OmniBridge
pip install -r requirements.txt
```
You also need to download the HunyuanDiT decoder weights for image generation: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
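One possible way to fetch these weights is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed. The target folder `ckpts/HunyuanDiT-v1.2` and the helper names are hypothetical, since this README does not say where the scripts expect the weights, so adjust the path to match your local configuration.

```python
from pathlib import Path

# Repository named in the link above.
HUNYUAN_REPO_ID = "Tencent-Hunyuan/HunyuanDiT-v1.2"


def default_weights_dir(root: str = "ckpts") -> Path:
    """Return a local folder for the weights (hypothetical layout)."""
    return Path(root) / HUNYUAN_REPO_ID.split("/")[-1]


def download_hunyuan_dit(local_dir: Path) -> str:
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so this module loads even before
    # huggingface_hub has been installed.
    from huggingface_hub import snapshot_download

    local_dir.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=HUNYUAN_REPO_ID, local_dir=str(local_dir))


# Usage (requires network access):
#   download_hunyuan_dit(default_weights_dir())
```

The snapshot contains the whole HunyuanDiT-v1.2 repository. If only the decoder is needed, `snapshot_download` also accepts an `allow_patterns` argument to restrict which files are fetched.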
### Model Weights
| Model name | HF Weight | Modelscope |
| ------------------------ | -------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **OmniBridge** | [🤗 HF link]() | [Modelscope link]() |
| **OmniBridge-Retrieval-Finetuned** | [🤗 HF link](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope link](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |
### Quickstart
#### Use 🤗Transformers to run OmniBridge for vision-language understanding
```shell
python ./multimodal_understanding.py
```
#### Use 🤗Transformers to run OmniBridge for image generation
```shell
python ./image_generation.py
```
#### Use 🤗Transformers to run OmniBridge for image editing
```shell
python ./image_editing.py
```
#### Use 🤗Transformers to run OmniBridge for multimodal retrieval
```shell
python ./multimodal_retrieval.py
```
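The retrieval script itself is not reproduced in this README. As a generic illustration of embedding-based retrieval, and not of OmniBridge's actual implementation, candidates can be ranked by cosine similarity to a query embedding. The function names and toy vectors below are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, highest similarity first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy example: a query embedding and three candidate embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_candidates(query, candidates, top_k=2)
```

Here `ranking` lists the candidate pointing in the same direction as the query first, followed by the one at 45 degrees. In practice the embeddings would come from the model's image and text encoders.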
## Citation
If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:
```
@article{xiao2025omnibridge,
title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
journal={arXiv preprint arXiv:2509.19018},
year={2025}
}
```