Improve model card with metadata and project links

#1
by nielsr (HF Staff) · opened
Files changed (1)
  1. README.md +38 -2
README.md CHANGED
@@ -1,6 +1,42 @@
  ---
  license: apache-2.0
  ---
- This repository contains the model of the paper [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577).
-
- Code: https://github.com/VITA-MLLM/Omni-Diffusion
  ---
  license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
  ---

+ # Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
+
+ Omni-Diffusion is the first any-to-any multimodal language model built entirely on a mask-based discrete diffusion model. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.
+
+ - **Paper:** [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577)
+ - **Project Page:** [https://omni-diffusion.github.io](https://omni-diffusion.github.io)
+ - **Repository:** [https://github.com/VITA-MLLM/Omni-Diffusion](https://github.com/VITA-MLLM/Omni-Diffusion)
+
+ ## Model Description
+
+ Omni-Diffusion employs a unified mask-based discrete diffusion model to capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks (such as text-to-image or speech-to-text) but also more complex scenarios involving multiple modalities simultaneously, such as spoken visual question answering. Across a diverse set of benchmarks, the method outperforms or matches existing multimodal systems, highlighting the potential of diffusion models as a basis for multimodal foundation models.
+
+ ## Usage
+
+ Because the model uses a custom architecture, it must be loaded with the `transformers` library using `trust_remote_code=True`:
+
+ ```python
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
+ ```
+
+ For detailed inference instructions and environment setup (including the required image and audio tokenizers), please refer to the [official GitHub repository](https://github.com/VITA-MLLM/Omni-Diffusion).
+
+ ## Citation
+
+ If you find this work helpful for your research, please consider citing:
+
+ ```bibtex
+ @article{li2026omni,
+ title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
+ author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
+ journal={arXiv preprint arXiv:2603.06577},
+ year={2026}
+ }
+ ```