JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Paper: arXiv 2505.00482
How to use byungki-kwon/JointDiT with Diffusers:

```shell
pip install -U diffusers transformers accelerate
```

```python
import torch
from diffusers import DiffusionPipeline

# Switch device_map to "mps" for Apple devices.
pipe = DiffusionPipeline.from_pretrained(
    "byungki-kwon/JointDiT", dtype=torch.bfloat16, device_map="cuda"
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```

The preprint version is available on arXiv.
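The snippet's comment suggests switching to `"mps"` on Apple devices. A minimal sketch of picking the device automatically instead of hard-coding it (standard PyTorch availability checks, not part of the JointDiT pipeline itself):

```python
import torch

# Prefer a CUDA GPU, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(device)
```

The resulting string can be passed as `device_map` when loading the pipeline.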
Paper | Project Page | Code
JointDiT is a multimodal diffusion transformer that jointly models RGB images and depth maps.
It supports the following tasks:
JointDiT is built on top of black-forest-labs/FLUX.1-dev,
but requires additional modules and a custom pipeline implementation.
Please visit the GitHub repository
for installation, training, and inference instructions.
If you find this work useful, please cite:
@article{byung2025jointdit,
title={JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers},
author={Kwon, Byung-Ki and Dai, Qi and Lee, Hyoseok and Luo, Chong and Oh, Tae-Hyun},
journal={arXiv preprint arXiv:2505.00482},
year={2025}
}
Base model: black-forest-labs/FLUX.1-dev