license: cc-by-nc-4.0
pipeline_tag: text-to-image
tags:
  - depth-estimation
  - text-to-3d
  - diffusion
  - flux
library_name: flux_rgbd

Modality Forcing for Scalable Spatial Generation

Joint text → RGB + depth generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.
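To make the per-modality noise idea concrete, here is a minimal sketch. It is not the authors' implementation: it assumes noise levels in [0, 1] (1.0 is pure noise, 0.0 is a clean latent) and a linear interpolation between latent and noise, and the names `sample_training_levels`, `initial_levels`, and `add_noise` are hypothetical.

```python
import random

# Which modalities start from noise in each inference mode (assumed naming).
MODES = {
    "joint": (True, True),            # text -> RGB-D: both generated
    "image_to_depth": (False, True),  # RGB given clean, depth generated
    "depth_to_image": (True, False),  # depth given clean, RGB generated
}

def sample_training_levels(rng):
    """Draw an independent noise level in [0, 1] for each modality."""
    return {"rgb": rng.random(), "depth": rng.random()}

def initial_levels(mode):
    """Starting noise level per modality at inference time."""
    rgb_noisy, depth_noisy = MODES[mode]
    return {"rgb": 1.0 if rgb_noisy else 0.0,
            "depth": 1.0 if depth_noisy else 0.0}

def add_noise(clean, noise, t):
    """Blend a clean latent with noise at level t (assumed linear schedule)."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
```

Because training draws the two levels independently, the model sees every combination of clean and noisy modalities, which is what would let a conditioning modality be held at level 0.0 while the other is denoised from 1.0.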

Files

| File | Description |
| --- | --- |
| model.safetensors | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| config.json | Model variant config (flux_rgbd_9b_v2) |
| ae_encoder.safetensors / ae_decoder.safetensors | FLUX.2 autoencoder |

The Qwen3-8B text encoder is pulled separately from Qwen/Qwen3-8B.

Usage

git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"

The scripts download these weights automatically (bartduis/modality_forcing is the default --model).
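If you would rather fetch the weights ahead of time (for example on a machine without network access at run time), something like the following may work. This is a sketch under assumptions: it presumes the `huggingface_hub` package is installed and that the files listed above live in the Hugging Face Hub repository named by the default `--model`. The helper `fetch_weights` is hypothetical and not part of the repository's scripts.

```python
REPO_ID = "bartduis/modality_forcing"  # the default --model named above

# File names taken from the Files section of this card.
WEIGHT_FILES = [
    "model.safetensors",
    "config.json",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
]

def fetch_weights(repo_id=REPO_ID, files=WEIGHT_FILES):
    """Download only the listed files; return the local snapshot directory."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=list(files))
```

Calling `fetch_weights()` returns a local directory path. Whether the scripts accept a local path for `--model` is not stated here, so check the GitHub repository before relying on it.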

License

The model weights are released under CC BY-NC 4.0 (non-commercial). The inference code is Apache-2.0; see the GitHub repository.

Citation

@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}