---
title: Modality Forcing
emoji: 🏢
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.16.0
python_version: '3.12'
app_file: app.py
hardware: zero-h200
pinned: false
license: apache-2.0
short_description: Text → RGB + depth + 3D point cloud.
---
<div align="center">

<h1>Modality Forcing for Scalable<br>Spatial Generation</h1>

[Project page](https://modality-forcing.github.io/)
[Code](https://github.com/Duisterhof/modality-forcing)
[Model weights](https://huggingface.co/bartduis/modality_forcing)
[arXiv](https://arxiv.org/abs/2606.13676)

**[Bardienus Pieter Duisterhof](https://bart-ai.com)<sup>1,2</sup> · [Deva Ramanan](https://www.cs.cmu.edu/~deva/)<sup>1</sup> · [Jeffrey Ichnowski](https://ichnow.ski)<sup>1</sup> · [Justin Johnson](https://web.eecs.umich.edu/~justincj/)<sup>2</sup> · [Keunhong Park](https://keunhong.com)<sup>2</sup>**

**<sup>1</sup> Carnegie Mellon University <sup>2</sup> World Labs**

*Preprint, 2026*

<em>Modality Forcing turns a pretrained text-to-image model into a joint image–depth generator with a simple post-training recipe.</em>

</div>

## Overview

This Space hosts the interactive demo of **Modality Forcing**: joint text → RGB + depth diffusion built on FLUX.2. A single DiT supports every permutation of conditional and joint generation by assigning a separate noise level to each modality:

| Mode | Input | Output |
|------|-------|--------|
| **Joint** | text prompt | RGB + depth + 3D point cloud |
| **Image → depth** | text + image | depth + 3D point cloud (any aspect ratio, letterbox resize) |
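
The two modes in the table can be pictured as two settings of the per-modality noise levels. The following is a minimal sketch, not the Space's actual code: the `NoiseLevels` class, the `noise_levels` function, the mode strings, and the convention that `0.0` marks a clean conditioning input are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseLevels:
    """Hypothetical per-modality noise levels passed to a single DiT."""

    rgb: float
    depth: float


def noise_levels(mode: str, t: float) -> NoiseLevels:
    """Return the noise level of each modality at diffusion time t.

    Assumed convention: 0.0 means a clean input used as conditioning,
    t means the modality is currently being generated.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if mode == "joint":
        # Text -> RGB + depth: both modalities are noisy and denoised together.
        return NoiseLevels(rgb=t, depth=t)
    if mode == "image_to_depth":
        # Text + image -> depth: the image stays clean, only depth is noisy.
        return NoiseLevels(rgb=0.0, depth=t)
    raise ValueError(f"unknown mode: {mode!r}")
```

Under the same assumed convention, a depth → image mode would be the mirror case (`depth=0.0`, `rgb=t`); it is not one of the modes this Space lists.
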

## Abstract

Text-to-image (T2I) models contain rich spatial priors.
Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale.
Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes.
We propose **Modality Forcing**, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data.
Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality.
Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction.
We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (300M to 3B parameters), we find that larger models trained on more image data produce more accurate depth.
Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by **57%** relative to existing joint image-depth generative models.
These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception.

## Space configuration

### Weights

All weights are pulled from public repos, so no token is required:
[`bartduis/modality_forcing`](https://huggingface.co/bartduis/modality_forcing)
(DiT + FLUX.2 autoencoder, CC BY-NC 4.0) and
[`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) (text encoder).
Override the model repo via the `WEIGHTS_REPO` Space variable.
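
How `app.py` reads the override is not shown here, so the following is only a sketch of one way a `WEIGHTS_REPO` variable could be honored. The helper names `resolve_weights_repo` and `download_weights` are hypothetical; `snapshot_download` is the standard `huggingface_hub` call for fetching a repo snapshot.

```python
import os

DEFAULT_WEIGHTS_REPO = "bartduis/modality_forcing"


def resolve_weights_repo(env=None) -> str:
    """Return the model repo, honoring the WEIGHTS_REPO override if set."""
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only variable as "not set".
    override = env.get("WEIGHTS_REPO", "").strip()
    return override or DEFAULT_WEIGHTS_REPO


def download_weights(env=None) -> str:
    """Download (or reuse the cached) snapshot and return its local path."""
    # Imported lazily so the resolver above works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=resolve_weights_repo(env))
```

With no variable set, `resolve_weights_repo()` falls back to the public repo, which matches the "no token required" default described above.
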

### Hardware

Pinned to `zero-h200`. The runner builds the DiT on the `meta` device and
assigns checkpoint weights straight onto the GPU (`runner.from_pretrained`),
so the BF16 model (~24 GB) loads in ~8 s on the first invocation, skipping
~45 s of throwaway random initialization, and stays resident for subsequent
calls within the same Space instance.

### Avoiding the cold re-download

A Space instance that has gone to sleep starts a fresh container, which
re-downloads the weights (~24 GB DiT + ~16 GB Qwen3-8B + the FLUX.2 VAE)
before the load above can run. To skip that, attach **persistent storage**
(Settings → Storage, or mount an HF Bucket) and point the HF cache at it via
the `HF_HOME` Space variable so the weights survive restarts.
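
To check where the cache will end up for a given configuration, the lookup order can be sketched as below. This is a simplified illustration rather than the library's own code: `hf_cache_root` is a hypothetical helper, it treats an empty variable as unset, and the `/data/.huggingface` path in the usage note assumes persistent storage is mounted at `/data`, which should be confirmed for your Space.

```python
import os
from pathlib import Path


def hf_cache_root(env=None) -> Path:
    """Approximate the directory the Hugging Face cache resolves to."""
    env = os.environ if env is None else env
    # An explicit HF_HOME takes priority.
    configured = env.get("HF_HOME", "").strip()
    if configured:
        return Path(configured).expanduser()
    # Otherwise fall back to $XDG_CACHE_HOME/huggingface, where
    # XDG_CACHE_HOME itself defaults to ~/.cache.
    cache_home = env.get("XDG_CACHE_HOME", "").strip() or "~/.cache"
    return Path(cache_home).expanduser() / "huggingface"
```

For example, `hf_cache_root({"HF_HOME": "/data/.huggingface"})` points at the persistent volume, while an empty environment resolves to a path inside the container's home directory that is lost when the container is replaced.
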

## License

Code: Apache-2.0. Files derived from the FLUX.2 reference implementation
(`flux_rgbd/_flux2/`, `flux_rgbd/dit.py`) are Apache-2.0, Copyright Black
Forest Labs, with World Labs modifications. Model weights: CC BY-NC 4.0.

## Citation

If you find Modality Forcing useful, please consider citing:

```bibtex
@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}
```