---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusion-transformer
- multimodal
- joint-generation
- depth-estimation
- depth-to-image
- text-to-multimodal
---
# JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers (ICCV 2025)

## Overview
The preprint version is available on arXiv.
📑 Paper | 🤗 Project Page | 🤗 Code
JointDiT is a multimodal diffusion transformer that jointly models RGB images and depth maps. It supports the following tasks:
- Text to joint RGB-Depth generation
- Depth estimation from RGB
- Depth-conditioned image generation
## How to Use

JointDiT is built on top of black-forest-labs/FLUX.1-dev, but requires additional modules and a custom pipeline implementation.
👉 Please visit the GitHub repository
for installation, training, and inference instructions.
## Citation

If you find this work useful, please cite:
```bibtex
@article{byung2025jointdit,
  title={JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers},
  author={Kwon, Byung-Ki and Dai, Qi and Lee, Hyoseok and Luo, Chong and Oh, Tae-Hyun},
  journal={arXiv preprint arXiv:2505.00482},
  year={2025}
}
```
