Commit f448f27 by nielsr (HF Staff) · verified · 1 Parent(s): 8075439

Improve model card with metadata and project links

Hi! I'm Niels from the Hugging Face community team.

I've opened this PR to improve the discoverability and documentation of Omni-Diffusion. Key changes include:
- Adding the `any-to-any` pipeline tag to reflect its multimodal capabilities.
- Adding `library_name: transformers` metadata since the model uses `auto_map` for compatibility with the Transformers library.
- Including links to the paper, project page, and official code repository.
- Adding the BibTeX citation for researchers.

These updates help users find the model more easily and provide quick access to the research artifacts.
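For background on the `auto_map` point above: it refers to entries in the repository's `config.json` that tell Transformers which custom classes to load when `trust_remote_code=True` is passed. The module and class names below are illustrative only, not taken from this repository:

```json
{
  "model_type": "omni-diffusion",
  "auto_map": {
    "AutoConfig": "configuration_omni_diffusion.OmniDiffusionConfig",
    "AutoModel": "modeling_omni_diffusion.OmniDiffusionModel"
  }
}
```

When a user calls `AutoModel.from_pretrained(..., trust_remote_code=True)`, Transformers fetches those modules from the repository and instantiates the mapped class, which is why the `library_name: transformers` metadata is appropriate here.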

Files changed (1): README.md (+38 −2)

README.md CHANGED
@@ -1,6 +1,42 @@
 ---
 license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
 ---
-This repository contains the model of the paper [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577).
-
-Code: https://github.com/VITA-MLLM/Omni-Diffusion
+
+# Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
+
+Omni-Diffusion is the first any-to-any multimodal language model built entirely on a mask-based discrete diffusion model. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.
+
+- **Paper:** [Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion](https://arxiv.org/abs/2603.06577)
+- **Project Page:** [https://omni-diffusion.github.io](https://omni-diffusion.github.io)
+- **Repository:** [https://github.com/VITA-MLLM/Omni-Diffusion](https://github.com/VITA-MLLM/Omni-Diffusion)
+
+## Model Description
+
+Omni-Diffusion employs a unified mask-based discrete diffusion model to capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks (such as text-to-image or speech-to-text) but also more complex scenarios involving multiple modalities simultaneously, such as spoken visual question answering. On a diverse set of benchmarks, the method outperforms or performs on par with existing multimodal systems, highlighting the potential of diffusion models for multimodal foundation models.
+
+## Usage
+
+As the model uses a custom architecture, it can be loaded using the `transformers` library with `trust_remote_code=True`:
+
+```python
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
+```
+
+For detailed inference instructions and environment setup (including the required image and audio tokenizers), please refer to the [official GitHub repository](https://github.com/VITA-MLLM/Omni-Diffusion).
+
+## Citation
+
+If you find this work helpful for your research, please consider citing:
+
+```bibtex
+@article{li2026omni,
+  title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
+  author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
+  journal={arXiv preprint arXiv:2603.06577},
+  year={2026}
+}
+```