	---
	title: Modality Forcing
	emoji: 🏢
	colorFrom: blue
	colorTo: gray
	sdk: gradio
	sdk_version: 6.16.0
	python_version: '3.12'
	app_file: app.py
	hardware: zero-h200
	pinned: false
	license: apache-2.0
	short_description: Text → RGB + depth + 3D point cloud.
	---

	<div align="center">

	<h1>Modality Forcing for Scalable<br>Spatial Generation</h1>

	[![Project Page](https://img.shields.io/badge/Project_Page-Modality_Forcing-blue?logo=googlechrome&logoColor=white)](https://modality-forcing.github.io/)
	[![Code](https://img.shields.io/badge/Code-GitHub-181717?logo=github)](https://github.com/Duisterhof/modality-forcing)
	[![Model](https://img.shields.io/badge/Model-modality__forcing-yellow?logo=huggingface)](https://huggingface.co/bartduis/modality_forcing)
	[![arXiv](https://img.shields.io/badge/arXiv-2606.13676-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2606.13676)

	[Bardienus Pieter Duisterhof](https://bart-ai.com)<sup>1,2</sup> · [Deva Ramanan](https://www.cs.cmu.edu/~deva/)<sup>1</sup> · [Jeffrey Ichnowski](https://ichnow.ski)<sup>1</sup> · [Justin Johnson](https://web.eecs.umich.edu/~justincj/)<sup>2</sup> · [Keunhong Park](https://keunhong.com)<sup>2</sup>

	<sup>1</sup> Carnegie Mellon University    <sup>2</sup> World Labs

	Preprint, 2026

	<em>Modality Forcing turns a pretrained text-to-image model into a joint image–depth generator with a simple post-training recipe.</em>

	</div>

	## Overview

	This Space hosts the interactive demo of Modality Forcing: joint text → RGB + depth diffusion built on FLUX.2. A single DiT supports every permutation of conditional and joint generation by assigning a separate noise level to each modality:

	| Mode | Input | Output |
	|------|-------|--------|
	| Joint | text prompt | RGB + depth + 3D point cloud |
	| Image → depth | text + image | depth + 3D point cloud (any aspect ratio, letterbox resize) |
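
To make the per-modality noise idea concrete, here is a minimal sketch of how a mode could map to starting noise levels. It assumes a conditioning modality is held at noise level 0 (clean) and a generated modality starts at noise level 1 (pure noise). The names `NoiseLevels`, `initial_noise_levels`, and the mode strings are hypothetical and are not taken from the Space's code.

```python
from dataclasses import dataclass

CLEAN = 0.0  # assumption: a given (conditioning) modality stays noise-free
NOISE = 1.0  # assumption: a generated modality starts from pure noise


@dataclass(frozen=True)
class NoiseLevels:
    """Starting noise level for each modality (hypothetical container)."""
    rgb: float
    depth: float


def initial_noise_levels(mode: str) -> NoiseLevels:
    """Map a demo mode to the noise level each modality starts from."""
    if mode == "joint":
        # Text only: both RGB and depth are generated.
        return NoiseLevels(rgb=NOISE, depth=NOISE)
    if mode == "image_to_depth":
        # The input image is given, so only depth is denoised.
        return NoiseLevels(rgb=CLEAN, depth=NOISE)
    raise ValueError(f"unknown mode: {mode!r}")
```

Under these assumptions, the two rows of the table differ only in the RGB noise level, which is why one network can serve both modes.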

	## Abstract

	Text-to-image (T2I) models contain rich spatial priors.
	Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale.
	Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes.
	We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data.
	Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality.
	Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction.
	We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (300M to 3B parameters), we find that larger models trained on more image data produce more accurate depth.
	Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models.
	These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception.
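
For readers unfamiliar with the metric, AbsRel is the mean absolute relative depth error, averaged over pixels that have ground truth. The sketch below uses plain Python lists and assumes that missing ground truth in sparse depth is encoded as 0; both the function name `abs_rel` and that encoding are illustrative choices, not the paper's evaluation code.

```python
def abs_rel(pred: list[float], gt: list[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid ground truth (gt > 0)."""
    # Assumption: sparse depth marks missing measurements with 0, so those
    # pixels are skipped rather than counted as errors.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


# Three pixels, the last one without ground truth.
score = abs_rel([2.0, 5.0, 9.0], [2.0, 4.0, 0.0])  # (0 + 0.25) / 2 = 0.125
```

A 57% relative reduction means the new AbsRel is 0.43 times the baseline value.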

	## Space configuration

	### Weights

	All weights are pulled from public repos, so no access token is required:
	[`bartduis/modality_forcing`](https://huggingface.co/bartduis/modality_forcing)
	(DiT + FLUX.2 autoencoder, CC BY-NC 4.0) and
	[`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) (text encoder).
	Override the model repo via the `WEIGHTS_REPO` Space variable.
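
A minimal sketch of how such an override might be resolved, assuming the app reads `WEIGHTS_REPO` from the environment and falls back to the public repo. The helper names `resolve_weights_repo` and `download_weights` are hypothetical; `snapshot_download` is the standard `huggingface_hub` call for fetching a whole repo, and whether the Space uses it is an assumption.

```python
import os

DEFAULT_WEIGHTS_REPO = "bartduis/modality_forcing"


def resolve_weights_repo(env=None) -> str:
    """Return the repo id from WEIGHTS_REPO, or the public default if unset or empty."""
    env = os.environ if env is None else env
    value = env.get("WEIGHTS_REPO", "").strip()
    return value or DEFAULT_WEIGHTS_REPO


def download_weights(env=None) -> str:
    """Download the resolved repo and return the local snapshot directory."""
    # Imported lazily so the resolver above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=resolve_weights_repo(env))
```

Treating an empty variable the same as an unset one avoids passing a blank repo id to the Hub if the variable is cleared rather than removed.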

	### Hardware

	The Space is pinned to `zero-h200`. The runner builds the DiT on the `meta` device and
	assigns checkpoint weights straight onto the GPU (`runner.from_pretrained`),
	which skips ~45 s of throwaway random initialization: the BF16 model (~24 GB)
	loads in ~8 s on the first invocation and stays resident for subsequent
	calls within the same Space instance.

	### Avoiding the cold re-download

	A Space instance that has gone to sleep starts a fresh container, which
	re-downloads the weights (~24 GB DiT + ~16 GB Qwen3-8B + the FLUX.2 VAE)
	before the load above can run. To skip that, attach persistent storage
	(Settings → Storage, or mount an HF Bucket) and point the HF cache at it via
	the `HF_HOME` Space variable so the weights survive restarts.
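
As a quick check that the cache really lands on persistent storage, the sketch below mirrors how `huggingface_hub` resolves its cache directory in the common case: `HF_HUB_CACHE` if set, otherwise `$HF_HOME/hub`, otherwise `~/.cache/huggingface/hub`. It is simplified (it ignores `XDG_CACHE_HOME`), the helper names are hypothetical, and the `/data` mount point is an assumption that matches the usual Spaces persistent-storage path; a mounted bucket may use a different path.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Approximate the directory where downloaded Hub files are cached."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def survives_restart(env=None, persistent_root: str = "/data") -> bool:
    """True if the cache directory sits under the persistent mount (assumed /data)."""
    return hub_cache_dir(env).is_relative_to(persistent_root)
```

With `HF_HOME=/data/.huggingface` set as a Space variable, `survives_restart()` returns `True`, and the weights are downloaded once rather than on every cold start.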

	## License

	Code: Apache-2.0 (files derived from the FLUX.2 reference implementation —
	`flux_rgbd/_flux2/`, `flux_rgbd/dit.py` — are Apache-2.0, Copyright Black
	Forest Labs, with World Labs modifications). Model weights: CC BY-NC 4.0.

	## Citation

	If you find Modality Forcing useful, please consider citing:

	```bibtex
	@article{duisterhof2026mofo,
	title = {Modality Forcing for Scalable Spatial Generation},
	author = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
	journal = {arXiv preprint arXiv:2606.13676},
	year = {2026}
	}
	```