Commit f448f27 by nielsr (HF Staff) · verified · 1 Parent(s): 8075439

Improve model card with metadata and project links

Hi! I'm Niels from the Hugging Face community team.

I've opened this PR to improve the discoverability and documentation of Omni-Diffusion. Key changes include:
- Adding the `any-to-any` pipeline tag to reflect its multimodal capabilities.
- Adding `library_name: transformers` metadata since the model uses `auto_map` for compatibility with the Transformers library.
- Including links to the paper, project page, and official code repository.
- Adding the BibTeX citation for researchers.

These updates help users find the model more easily and provide quick access to the research artifacts.
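For background on the `auto_map` point above: it refers to entries in the repository's `config.json` that tell Transformers which custom classes to load when `trust_remote_code=True` is passed. The module and class names below are illustrative only, not taken from this repository:

```json
{
  "model_type": "omni-diffusion",
  "auto_map": {
    "AutoConfig": "configuration_omni_diffusion.OmniDiffusionConfig",
    "AutoModel": "modeling_omni_diffusion.OmniDiffusionModel"
  }
}
```

When a user calls `AutoModel.from_pretrained(..., trust_remote_code=True)`, Transformers fetches those modules from the repository and instantiates the mapped class, which is why the `library_name: transformers` metadata is appropriate here.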

Files changed (1): README.md (+38 −2)

README.md CHANGED
@@ -1,6 +1,42 @@
 ---
 license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
 ---
-This repository contains the model of the paper [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577).
-
-Code: https://github.com/VITA-MLLM/Omni-Diffusion
+
+# Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
+
+Omni-Diffusion is the first any-to-any multimodal language model built entirely on a mask-based discrete diffusion model. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.
+
+- **Paper:** [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577)
+- **Project Page:** [https://omni-diffusion.github.io](https://omni-diffusion.github.io)
+- **Repository:** [https://github.com/VITA-MLLM/Omni-Diffusion](https://github.com/VITA-MLLM/Omni-Diffusion)
+
+## Model Description
+
+Omni-Diffusion employs a unified mask-based discrete diffusion model to capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks (such as text-to-image or speech-to-text) but also more complex scenarios involving multiple modalities simultaneously, such as spoken visual question answering. On a diverse set of benchmarks, the method outperforms or performs on par with existing multimodal systems, highlighting the potential of diffusion models for multimodal foundation models.
+
+## Usage
+
+As the model uses a custom architecture, it can be loaded using the `transformers` library with `trust_remote_code=True`:
+
+```python
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
+```
+
+For detailed inference instructions and environment setup (including the required image and audio tokenizers), please refer to the [official GitHub repository](https://github.com/VITA-MLLM/Omni-Diffusion).
+
+## Citation
+
+If you find this work helpful for your research, please consider citing:
+
+```bibtex
+@article{li2026omni,
+  title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
+  author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
+  journal={arXiv preprint arXiv:2603.06577},
+  year={2026}
+}
+```