---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusion-transformer
- multimodal
- joint-generation
- depth-estimation
- depth-to-image
- text-to-multimodal
---
# JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers (ICCV 2025)

## Overview
The preprint version is available on arXiv.
📑 Paper | 🤗 Project Page | 🤗 Code
JointDiT is a multimodal diffusion transformer that jointly models RGB images and depth maps. It supports the following tasks:
- Text to joint RGB-Depth generation
- Depth estimation from RGB
- Depth-conditioned image generation
## How to Use

JointDiT is built on top of black-forest-labs/FLUX.1-dev, but requires additional modules and a custom pipeline implementation.
👉 Please visit the GitHub repository
for installation, training, and inference instructions.
## Citation

If you find this work useful, please cite:
```bibtex
@article{byung2025jointdit,
  title={JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers},
  author={Kwon, Byung-Ki and Dai, Qi and Lee, Hyoseok and Luo, Chong and Oh, Tae-Hyun},
  journal={arXiv preprint arXiv:2505.00482},
  year={2025}
}
```
