Improve model card with metadata and project links
#1
by nielsr HF Staff - opened

README.md CHANGED

@@ -1,6 +1,42 @@

Before:

---
license: apache-2.0
---
This repository contains the model of the paper [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577).

After:

---
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion is the first any-to-any multimodal language model built entirely on a mask-based discrete diffusion model. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.

- **Paper:** [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577)
- **Project Page:** [https://omni-diffusion.github.io](https://omni-diffusion.github.io)
- **Repository:** [https://github.com/VITA-MLLM/Omni-Diffusion](https://github.com/VITA-MLLM/Omni-Diffusion)

## Model Description

Omni-Diffusion employs a unified mask-based discrete diffusion model to capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks (such as text-to-image or speech-to-text) but also more complex scenarios involving multiple modalities simultaneously, such as spoken visual question answering. On a diverse set of benchmarks, the method outperforms or performs on par with existing multimodal systems, highlighting the potential of diffusion models for multimodal foundation models.
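
To make the decoding mechanism concrete, here is a minimal toy sketch of mask-based discrete diffusion sampling as described above. The vocabulary size, reveal schedule, and the random stand-in predictor are all illustrative assumptions, not the model's actual implementation:

```python
# Toy sketch of masked discrete diffusion decoding (illustrative only):
# every position starts as [MASK] and is revealed over a fixed number of
# steps, most confident first. A random predictor stands in for the
# denoising transformer that would run over the multimodal token sequence.
import numpy as np

VOCAB_SIZE, MASK_ID, SEQ_LEN, STEPS = 16, -1, 12, 4
rng = np.random.default_rng(0)

def predict_logits(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the denoiser: per-position logits over the vocabulary."""
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

tokens = np.full(SEQ_LEN, MASK_ID)            # fully masked starting sequence
for step in range(STEPS):
    logits = predict_logits(tokens)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    confidence, guesses = probs.max(axis=-1), probs.argmax(axis=-1)
    confidence[tokens != MASK_ID] = -np.inf   # revealed positions stay fixed
    n_masked = int((tokens == MASK_ID).sum())
    k = max(1, n_masked // (STEPS - step))    # reveal enough to finish on time
    reveal = np.argsort(-confidence)[:k]      # top-k most confident positions
    tokens[reveal] = guesses[reveal]
    print(f"step {step}: {tokens}")
```

Per the description above, the actual model applies this kind of reveal-and-refine loop jointly over text, speech, and image token streams rather than a single toy sequence.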

## Usage

Because the model uses a custom architecture, it must be loaded with the `transformers` library and `trust_remote_code=True`:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
```
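
For GPU inference, the standard `transformers` keyword arguments for dtype and device placement apply. The dtype choice below is an assumption rather than an author recommendation, and `device_map="auto"` additionally requires the `accelerate` package:

```python
import torch
from transformers import AutoModel

# Same checkpoint as above; half precision and automatic device placement
# are generic transformers options, not requirements stated by the authors.
model = AutoModel.from_pretrained(
    "lijiang/Omni-Diffusion",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumed dtype; check the repo's recommendation
    device_map="auto",           # requires `accelerate` to be installed
)
model.eval()
```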

For detailed inference instructions and environment setup (including the required image and audio tokenizers), please refer to the [official GitHub repository](https://github.com/VITA-MLLM/Omni-Diffusion).

## Citation

If you find this work helpful for your research, please consider citing:

```bibtex
@article{li2026omni,
  title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
  author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
  journal={arXiv preprint arXiv:2603.06577},
  year={2026}
}
```