---
license: cc-by-nc-4.0
pipeline_tag: text-to-image
tags:
- depth-estimation
- text-to-3d
- diffusion
- flux
library_name: flux_rgbd
---

# Modality Forcing for Scalable Spatial Generation

Joint **text → RGB + depth** generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.

- 📄 Paper: [arXiv:2606.13676](https://arxiv.org/abs/2606.13676)
- 💻 Code: [github.com/Duisterhof/modality-forcing](https://github.com/Duisterhof/modality-forcing)
- 🚀 Demo: [Hugging Face Space](https://huggingface.co/spaces/bartduis/modality_forcing)
- 🌐 Project page: [modality-forcing.github.io](https://modality-forcing.github.io/)

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |

The Qwen3-8B text encoder is pulled separately from [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B).

## Usage

```bash
git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
```

The scripts download these weights automatically (`bartduis/modality_forcing` is the default `--model`).

## License

The model weights are released under **CC BY-NC 4.0** (non-commercial). The inference code is Apache-2.0; see the GitHub repository.

## Citation

```bibtex
@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}
```
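
## Appendix: per-modality noise levels (illustrative sketch)

The introduction states that Modality Forcing assigns separate noise levels per modality, which is what lets one model cover joint generation, image-to-depth, and depth-to-image. The sketch below is one way to picture that idea and is not taken from the repository: the function names, the uniform sampling of training noise levels, the convention that `t = 0` means clean and `t = 1` means pure noise, and the linear interpolation in `add_noise` are all assumptions made for illustration. The paper and code may use a different schedule or parameterization.

```python
import random

# Hypothetical mode names, chosen for this sketch only.
MODES = ("joint", "image_to_depth", "depth_to_image")


def training_noise_levels(rng):
    """Draw an independent noise level per modality.

    Assumption: levels are uniform in [0, 1); the actual training
    distribution is not specified in this model card.
    """
    return {"rgb": rng.random(), "depth": rng.random()}


def inference_noise_levels(mode, t):
    """Noise level per modality at sampler time t (1 = noise, 0 = clean).

    Assumption: a conditioning modality is held at noise level 0 (clean)
    while the other modality is denoised from t = 1 down to t = 0.
    """
    if mode == "joint":
        return {"rgb": t, "depth": t}
    if mode == "image_to_depth":
        return {"rgb": 0.0, "depth": t}
    if mode == "depth_to_image":
        return {"rgb": t, "depth": 0.0}
    raise ValueError(f"unknown mode: {mode!r}")


def add_noise(clean, noise, t):
    """Linearly interpolate between a clean latent and noise (assumed form)."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("training:", training_noise_levels(rng))
    for mode in MODES:
        print(mode, inference_noise_levels(mode, 0.5))
```

Under these assumptions, the three inference modes differ only in which modality is pinned to a clean noise level, which is why a single set of weights can serve all of them.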