---
base_model:
- Efficient-Large-Model/Sana_1600M_1024px_BF16
- VIPL-GENUN/Jodi
tags:
- Diffusion
- Text-to-Image
- Controllable-Generation
- Image-Perception
---

# Jodi

We introduce Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains.

- **arXiv**: <https://arxiv.org/abs/2505.19084>
- **Project page**: <https://VIPL-GENUN.github.io/Project-Jodi>
- **GitHub**: <https://github.com/VIPL-GENUN/Jodi>
- **Joint-1.6M Dataset**: <https://huggingface.co/datasets/VIPL-GENUN/Joint-1.6M-1024px>
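
To give a feel for the joint-modeling idea, here is a toy sketch, not Jodi's actual API: all shapes, the `toy_score` stand-in network, and the Euler step are hypothetical illustrations. The sketch stacks an image and several label maps into one multi-channel tensor, runs a single denoising step over the joint tensor, and splits the result back into its domains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one RGB image plus two single-channel label maps
# (e.g. an edge map and a depth map), all at the same resolution.
image = rng.standard_normal((3, 64, 64))
labels = [rng.standard_normal((1, 64, 64)) for _ in range(2)]

# Joint modeling: the image and label domains share one diffusion process,
# so they are stacked into a single multi-channel tensor.
joint = np.concatenate([image] + labels, axis=0)  # shape (5, 64, 64)

def toy_score(x, sigma):
    """Stand-in for a learned score network: pulls x toward zero."""
    return -x / (sigma ** 2 + 1.0)

def euler_step(x, sigma, d_sigma):
    """One Euler update of a denoising trajectory over the joint tensor."""
    return x - sigma * toy_score(x, sigma) * d_sigma

denoised = euler_step(joint, sigma=1.0, d_sigma=0.1)

# Afterwards the joint tensor is split back into the separate domains.
image_out, edge_out, depth_out = np.split(denoised, [3, 4], axis=0)
print(image_out.shape, edge_out.shape, depth_out.shape)
```

Because every domain lives in the same tensor, the same denoising pass can be read as generation (predicting the image given labels) or perception (predicting labels given the image), depending on which channels are fixed.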

<br>

# Citation

If you find this project helpful, please consider citing:

```bibtex
@article{xu2025jodi,
  title={Jodi: Unification of Visual Generation and Understanding via Joint Modeling},
  author={Xu, Yifeng and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2505.19084},
  year={2025}
}
```