	<div align='center'>
	<h1>OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment</h1>
	<h3></h3>


	| [Github](https://github.com/xiao-xt/OmniBridge) | [Paper](https://arxiv.org/abs/2509.19018) | [🤗HF Models](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |


	</div>

	<div align='center'>
	<img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
	</div>


	We propose OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module, which decouples visual generation, multimodal retrieval, and latent space alignment from the core LLM.

	<div align='center'>
	<img src="./assets/stage.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
	</div>


	### OmniBridge excels in both generation and perception
	Extensive experiments on standard vision-language benchmarks show that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.

	<div align='center'>
	<img src="./assets/comparison_understanding.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
	</div>

	<div align='center'>
	<img src="./assets/comparison_generation.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	### Highlights

	- OmniBridge is a unified and modular multimodal framework that supports understanding, generation, and retrieval tasks within a single architecture.
	- OmniBridge introduces a two-stage decoupled training strategy that separates behavioral alignment from latent-level alignment, enabling efficient and stable adaptation across diverse multimodal tasks.
	- OmniBridge uses a novel semantic-guided diffusion training mechanism that gradually replaces text conditioning with learnable query embeddings, enabling fine-grained, controllable latent space alignment.
	- Extensive experiments on standard vision-language benchmarks show that OmniBridge achieves state-of-the-art or competitive performance in multimodal understanding, generation, and retrieval tasks.
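
The semantic-guided diffusion highlight describes text conditioning being gradually replaced by learnable query embeddings. This README does not specify the schedule, so the following is only a minimal sketch that assumes a linear ramp between the two conditioning signals. The names `mix_ratio`, `blend_condition`, and `warmup_steps` are hypothetical and are not taken from the OmniBridge code.

```python
def mix_ratio(step: int, warmup_steps: int) -> float:
    """Fraction of the conditioning taken from the query embeddings.

    Assumed linear ramp: 0.0 means text only, 1.0 means queries only.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(max(step / warmup_steps, 0.0), 1.0)


def blend_condition(text_emb, query_emb, step, warmup_steps):
    """Interpolate element-wise between text and query embeddings."""
    if len(text_emb) != len(query_emb):
        raise ValueError("embeddings must have the same length")
    ratio = mix_ratio(step, warmup_steps)
    return [(1.0 - ratio) * t + ratio * q for t, q in zip(text_emb, query_emb)]


# Toy 3-dimensional embeddings, for illustration only.
text_emb = [1.0, 0.0, 2.0]
query_emb = [0.0, 4.0, 2.0]

start = blend_condition(text_emb, query_emb, step=0, warmup_steps=100)
middle = blend_condition(text_emb, query_emb, step=50, warmup_steps=100)
end = blend_condition(text_emb, query_emb, step=100, warmup_steps=100)
```

Under this assumed schedule, the diffusion decoder sees pure text conditioning at the start of training and pure query conditioning once `step` reaches `warmup_steps`. The actual mechanism may use a different curve or a per-sample replacement probability.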


	## Performance

	### Vision-Language Understanding

	#### Multimodal Reasoning and Mathematics

	<div align='center'>
	<img src="./assets/understanding_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>


	<div align='center'>
	<img src="./assets/understanding_2.png" class="interpolation-image" alt="comparison." height="70%" width="70%" />
	</div>


	#### OCR, Chart, and Document Understanding

	<div align='center'>
	<img src="./assets/understanding_3.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	#### Multi-Image Understanding

	<div align='center'>
	<img src="./assets/understanding_4.png" class="interpolation-image" alt="comparison." height="50%" width="50%" />
	</div>


	#### Real-World Comprehension

	<div align='center'>
	<img src="./assets/understanding_5.png" class="interpolation-image" alt="comparison." height="55%" width="55%" />
	</div>


	#### Comprehensive Multimodal Evaluation & Multimodal Hallucination Evaluation

	<div align='center'>
	<img src="./assets/understanding_6.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
	</div>

	#### Multimodal Understanding Cases

	<div align='center'>
	<img src="./assets/understanding_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	### Image Generation

	#### Performance on the GenEval benchmark

	<div align='center'>
	<img src="./assets/gen_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	#### Performance on DPG-Bench

	<div align='center'>
	<img src="./assets/gen_2.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
	</div>


	#### Image Generation Cases

	<div align='center'>
	<img src="./assets/gen_case_1.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	<div align='center'>
	<img src="./assets/gen_case.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>


	### Image Editing

	#### Performance on IMGEDIT-BENCH

	<div align='center'>
	<img src="./assets/editing_2.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
	</div>

	#### Image Editing Cases

	<div align='center'>
	<img src="./assets/editing_1.png" class="interpolation-image" alt="comparison." height="60%" width="60%" />
	</div>

	### Multimodal Retrieval

	<div align='center'>
	<img src="./assets/retrieval.png" class="interpolation-image" alt="comparison." height="65%" width="65%" />
	</div>


	## News
	- 2025.09 We release [OmniBridge](https://huggingface.co/), a unified and modular multimodal framework that combines a language-centric design with efficient cross-modal alignment.
	- 2025.08 We introduce OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture.


	### TODO

	- [X] Release model weights of OmniBridge.





	### Setup

	Clone this repository and install required packages:

	```shell
	# Clone the repository and enter it so that requirements.txt is found
	git clone https://github.com/xiao-xt/OmniBridge
	cd OmniBridge

	pip install -r requirements.txt
	```

	For image generation, you also need to download the HunyuanDiT decoder weights: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2
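
One possible way to fetch those weights is `snapshot_download` from the `huggingface_hub` package, sketched below. The helper name `download_hunyuan_dit` and the target directory `./ckpts/HunyuanDiT-v1.2` are assumptions for illustration. This README does not say where the OmniBridge scripts expect the weights, so check the repository before relying on this path.

```python
from pathlib import Path

# Repository id taken from the link in this README.
HUNYUAN_REPO_ID = "Tencent-Hunyuan/HunyuanDiT-v1.2"


def download_hunyuan_dit(local_dir: str = "./ckpts/HunyuanDiT-v1.2") -> str:
    """Download the HunyuanDiT v1.2 snapshot and return its local path.

    `local_dir` is a hypothetical location, not one defined by OmniBridge.
    """
    # Imported lazily so that defining this helper does not require
    # huggingface_hub to be installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=HUNYUAN_REPO_ID, local_dir=str(target))
```

Calling `download_hunyuan_dit()` downloads the full repository snapshot, which needs network access and enough disk space for the checkpoint files.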

	### Model Weights

	| Model name | HF Weight | Modelscope |
	| ------------------------------ | --------- | ---------- |
	| OmniBridge | [🤗 HF link]() | [Modelscope link]() |
	| OmniBridge-Retrieval-Finetuned | [🤗 HF link](https://huggingface.co/xxt-ssr/Omnibridge-retrieval-finetuned) | [Modelscope link](https://www.modelscope.cn/models/xxtssr/OmniBridge/summary) |



	### Quickstart

	#### Use 🤗Transformers to run OmniBridge for vision-language understanding
	```shell
	python ./multimodal_understanding.py
	```

	#### Use 🤗Transformers to run OmniBridge for image generation
	```shell
	python ./image_generation.py
	```

	#### Use 🤗Transformers to run OmniBridge for image editing
	```shell
	python ./image_editing.py
	```

	#### Use 🤗Transformers to run OmniBridge for multimodal retrieval
	```shell
	python ./multimodal_retrieval.py
	```
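
For readers new to embedding-based retrieval, the sketch below shows the generic ranking step: candidates are ordered by cosine similarity to a query embedding. It is an illustration of the general technique only, not the logic of `multimodal_retrieval.py`. The function names and toy vectors are hypothetical, and real embeddings would come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for model outputs (hypothetical values).
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = rank_candidates(query, candidates)
```

Here `ranking` lists the candidate pointing in nearly the same direction as the query first and the opposite-facing candidate last.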





	## Citation

	If you find OmniBridge useful for your research and applications, please consider starring this repository and citing:

	```
	@article{xiao2025omnibridge,
	title={OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment},
	author={Xiao, Teng and Li, Zuchao and Zhang, Lefei},
	journal={arXiv preprint arXiv:2509.19018},
	year={2025}
	}
	```