<div align='center'>
<h1>OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment</h1>
| [Github](https://github.com/xiao-xt/OmniBridge) | [Paper](https://arxiv.org/abs/2509.19018) | [🤗HF Models](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |
</div>
<div align='center'>
<img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>
We propose **OmniBridge**, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses a pretrained LLM and introduces a lightweight bidirectional latent alignment module, decoupling visual generation, multimodal retrieval, and latent space alignment from the core LLM.
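Conceptually, an alignment module of this kind can be pictured as a small set of learnable query vectors that cross-attend to frozen LLM hidden states and emit aligned latents for the downstream generator or retriever. The numpy sketch below illustrates only that idea; all names, dimensions, and the single-head attention are illustrative assumptions, not the released architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_latents(llm_hidden, queries, Wq, Wk, Wv):
    """Cross-attention: learnable queries read from frozen LLM hidden states."""
    q = queries @ Wq                                 # (n_query, d)
    k = llm_hidden @ Wk                              # (seq_len, d)
    v = llm_hidden @ Wv                              # (seq_len, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_query, seq_len)
    return attn @ v                                  # (n_query, d) aligned latents

rng = np.random.default_rng(0)
d, seq_len, n_query = 64, 16, 8
hidden = rng.standard_normal((seq_len, d))           # stand-in for LLM states
queries = rng.standard_normal((n_query, d))          # stand-in for learned queries
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
latents = align_latents(hidden, queries, Wq, Wk, Wv)
print(latents.shape)  # -> (8, 64)
```

Because the LLM states only feed the keys and values, the core LLM stays untouched while the small query/projection parameters carry the alignment, which is what makes the module "lightweight" and decoupled.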
<div align='center'>
<img src="./assets/stage.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>
### OmniBridge excels in both generation and perception
Extensive experiments on standard vision-language benchmarks demonstrate the effectiveness of our framework: **OmniBridge** achieves state-of-the-art or competitive performance across multimodal understanding, generation, and retrieval tasks.
<div align='center'>
<img src="./assets/comparison_understanding.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
<div align='center'>
<img src="./assets/comparison_generation.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Highlights
- **OmniBridge** is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
- **OmniBridge** introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
- **OmniBridge** features a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
- Extensive experiments on standard vision-language benchmarks validate that **OmniBridge** achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
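The "gradual replacement of text conditioning with learnable query embeddings" can be pictured as a conditioning blend whose weight shifts from the text embedding toward the learned queries as training progresses. A toy numpy sketch under that reading (the linear schedule and all names are illustrative assumptions, not the paper's exact mechanism):

```python
import numpy as np

def blended_condition(text_emb, query_emb, step, total_steps):
    """Shift the diffusion condition from text to learned queries over training."""
    alpha = min(step / total_steps, 1.0)  # 0 -> pure text, 1 -> pure queries
    return (1.0 - alpha) * text_emb + alpha * query_emb

# Stand-in embeddings so the transition is easy to see.
text_emb = np.ones(4)
query_emb = np.zeros(4)
start = blended_condition(text_emb, query_emb, step=0, total_steps=1000)
end = blended_condition(text_emb, query_emb, step=1000, total_steps=1000)
print(start, end)  # -> [1. 1. 1. 1.] [0. 0. 0. 0.]
```

Early in training the diffusion decoder sees familiar text conditions; by the end it is driven entirely by the query embeddings, which is what makes the latent space alignment controllable.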
## Performance
### Vision-Language Understanding
#### Multimodal Reasoning and Mathematics
<div align='center'>
<img src="./assets/understanding_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
<div align='center'>
<img src="./assets/understanding_2.png" class="interpolation-image" alt="comparison." height="70%" width="70%" />
</div>
#### OCR, Chart, and Document Understanding
<div align='center'>
<img src="./assets/understanding_3.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Multi-Image Understanding
<div align='center'>
<img src="./assets/understanding_4.png" class="interpolation-image" alt="comparison." height="50%" width="50%" />
</div>
#### Real-World Comprehension
<div align='center'>
<img src="./assets/understanding_5.png" class="interpolation-image" alt="comparison." height="55%" width="55%" />
</div>
#### Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation
<div align='center'>
<img src="./assets/understanding_6.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>
#### Multimodal Understanding Cases
<div align='center'>
<img src="./assets/understanding_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Image Generation
#### Performance on the GenEval benchmark
<div align='center'>
<img src="./assets/gen_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Performance on DPG-Bench
<div align='center'>
<img src="./assets/gen_2.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
#### Image Generation Cases
<div align='center'>
<img src="./assets/gen_case_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
<div align='center'>
<img src="./assets/gen_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
### Image Editing
#### Performance on ImgEdit-Bench
<div align='center'>
<img src="./assets/editing_2.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>
#### Image Editing Cases
<div align='center'>
<img src="./assets/editing_1.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
</div>
### Multimodal Retrieval
<div align='center'>
<img src="./assets/retrieval.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
</div>
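Multimodal retrieval of this kind is typically scored by cosine similarity between L2-normalized image and text embeddings, with candidates ranked by score. A self-contained sketch with made-up embeddings (the embedding values are fabricated for illustration; only the ranking logic is shown):

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs):
    """Return candidate indices sorted by cosine similarity (best first)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per candidate
    return np.argsort(-sims), sims     # descending order

query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite
])
order, sims = rank_by_cosine(query, candidates)
print(order)  # -> [0 1 2]
```

In practice the query and candidate vectors would come from the model's retrieval head rather than hand-written arrays.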
## News
- 2025.09 We release **[OmniBridge](https://huggingface.co/)**, a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture.
### TODO
- [X] Release model weights of OmniBridge.
### Setup
Clone this repository and install required packages:
```shell
git clone https://github.com/xiao-xt/OmniBridge
cd OmniBridge
pip install -r requirements.txt
```
You also need to download the HunyuanDiT-v1.2 decoder weights, which are used for image generation: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
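One way to fetch those weights is with the `huggingface-cli download` command from `huggingface_hub`; the `--local-dir` path below is just an example:

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.2 --local-dir ./HunyuanDiT-v1.2
```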
### Model Weights
| Model name | HF Weight | Modelscope |
| ------------------------ | -------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **OmniBridge** | [🤗 HF link]() | [Modelscope link]() |
| **OmniBridge-Retrieval-Finetuned** | [🤗 HF link](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope link](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |
### Quickstart
#### Use 🤗Transformers to run OmniBridge for vision-language understanding
```shell
python ./multimodal_understanding.py
```
#### Use 🤗Transformers to run OmniBridge for image generation
```shell
python ./image_generation.py
```
#### Use 🤗Transformers to run OmniBridge for image editing
```shell
python ./image_editing.py
```
#### Use 🤗Transformers to run OmniBridge for multimodal retrieval
```shell
python ./multimodal_retrieval.py
```
## Citation
If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:
```
@article{xiao2025omnibridge,
  title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
  author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
  journal={arXiv preprint arXiv:2509.19018},
  year={2025}
}
```