---
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion is the first any-to-any multimodal language model built entirely on mask-based discrete diffusion. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.

## Model Description

Omni-Diffusion employs a unified mask-based discrete diffusion model to capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks (such as text-to-image or speech-to-text) but also more complex scenarios involving multiple modalities simultaneously, such as spoken visual question answering. On a diverse set of benchmarks, the method outperforms or performs on par with existing multimodal systems, highlighting the potential of diffusion models for multimodal foundation models.
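To make the mask-based discrete diffusion idea concrete, here is a minimal, illustrative sketch of the sampling loop such models typically use: generation starts from a fully masked token sequence, and at each step the model predicts all positions while the highest-confidence masked positions are committed. The function name, confidence-based unmasking schedule, and interfaces below are assumptions for illustration, not Omni-Diffusion's actual API.

```python
import torch

def masked_diffusion_sample(model, length, vocab_size, mask_id, steps=16):
    """Illustrative mask-based discrete diffusion sampling (assumed schedule):
    start fully masked, then iteratively predict and unmask tokens."""
    x = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = x == mask_id
        if not still_masked.any():
            break
        logits = model(x)                       # (1, length, vocab_size)
        logits[..., mask_id] = float("-inf")    # never predict the mask token
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax
        # Commit the highest-confidence masked positions this step.
        k = max(1, int(still_masked.sum() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices
        x[0, idx[0]] = pred[0, idx[0]]
    return x
```

Because every position is re-predicted at each step, this parallel refinement is what lets a single diffusion backbone condition on any mix of modalities: observed tokens (e.g. an image or a spoken question) simply stay unmasked while the remaining positions are filled in.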

## Usage

As the model uses a custom architecture, it can be loaded with the `transformers` library by passing `trust_remote_code=True`:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
```

For detailed inference instructions and environment setup (including required image and audio tokenizers), please refer to the official GitHub repository.

## Citation

If you find this work helpful for your research, please consider citing:

```bibtex
@article{li2026omni,
  title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
  author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
  journal={arXiv preprint arXiv:2603.06577},
  year={2026}
}
```