---
base_model:
- Efficient-Large-Model/Sana_1600M_1024px_BF16
- VIPL-GENUN/Jodi
tags:
- Diffusion
- Text-to-Image
- Controllable-Generation
- Image-Perception
---

# Jodi

We introduce Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains.

- **arXiv**: <https://arxiv.org/abs/2505.19084>
- **Project page**: <https://VIPL-GENUN.github.io/Project-Jodi>
- **GitHub**: <https://github.com/VIPL-GENUN/Jodi>
- **Joint-1.6M Dataset**: <https://huggingface.co/datasets/VIPL-GENUN/Joint-1.6M-1024px>

![Jodi banner](./assets/banner.jpg)

<br>

# Citation

If you find this project helpful, please consider citing:

```bibtex
@article{xu2025jodi,
  title={Jodi: Unification of Visual Generation and Understanding via Joint Modeling},
  author={Xu, Yifeng and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2505.19084},
  year={2025}
}
```