metadata
license: cc-by-nc-4.0
pipeline_tag: text-to-image
tags:
- depth-estimation
- text-to-3d
- diffusion
- flux
library_name: flux_rgbd
Modality Forcing for Scalable Spatial Generation
Joint text → RGB + depth generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.
- Paper: arXiv:2606.13676
- Code: github.com/Duisterhof/modality-forcing
- Demo: Hugging Face Space
- Project page: modality-forcing.github.io
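To make the per-modality noise idea concrete, here is a minimal sketch of how independent noise levels could be drawn during training and then pinned at inference to select a mode. It is an illustration only, not the released implementation: the function names, the uniform sampling, and the rectified-flow style interpolation `(1 - t) * clean + t * noise` are all assumptions, and the latents are toy lists rather than real FLUX.2 latents.

```python
import random


def noise_latent(clean, t, rng):
    """Assumed rectified-flow style mix: (1 - t) * clean + t * Gaussian noise."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in clean]


def training_noise_levels(rng):
    """Draw an independent noise level in [0, 1) for each modality (assumed uniform)."""
    return {"rgb": rng.random(), "depth": rng.random()}


def inference_start_levels(mode):
    """Starting noise level per modality: 1.0 = pure noise (generated),
    0.0 = clean (used as conditioning). Mode names are hypothetical."""
    levels = {
        "joint": {"rgb": 1.0, "depth": 1.0},
        "image_to_depth": {"rgb": 0.0, "depth": 1.0},
        "depth_to_image": {"rgb": 1.0, "depth": 0.0},
    }
    return levels[mode]


rng = random.Random(0)
rgb_latent = [0.5, -0.2, 0.1]    # toy stand-in for an RGB latent
depth_latent = [1.0, 0.8, 0.3]   # toy stand-in for a depth latent

# Training: each modality gets its own noise level.
t = training_noise_levels(rng)
noisy_rgb = noise_latent(rgb_latent, t["rgb"], rng)
noisy_depth = noise_latent(depth_latent, t["depth"], rng)

# Inference (image-to-depth): RGB stays clean, depth starts from pure noise.
start = inference_start_levels("image_to_depth")
conditioned_rgb = noise_latent(rgb_latent, start["rgb"], rng)
```

Under these assumptions, a modality held at noise level 0 passes through unchanged and acts as conditioning, while a modality started at noise level 1 is generated from scratch, which is how one set of weights could cover all three modes.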
Files
| File | Description |
|---|---|
| model.safetensors | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| config.json | Model variant config (flux_rgbd_9b_v2) |
| ae_encoder.safetensors / ae_decoder.safetensors | FLUX.2 autoencoder |
The Qwen3-8B text encoder is pulled separately from Qwen/Qwen3-8B.
Usage
git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
The scripts download these weights automatically (bartduis/modality_forcing is the default --model).
License
The model weights are released under CC BY-NC 4.0 (non-commercial). The inference code is Apache-2.0; see the GitHub repository.
Citation
@article{duisterhof2026mofo,
title = {Modality Forcing for Scalable Spatial Generation},
author = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
journal = {arXiv preprint arXiv:2606.13676},
year = {2026}
}