---
license: cc-by-nc-4.0
pipeline_tag: text-to-image
tags:
- depth-estimation
- text-to-3d
- diffusion
- flux
library_name: flux_rgbd
---

# Modality Forcing for Scalable Spatial Generation

Joint **text → RGB + depth** generation with a single diffusion
transformer, built on FLUX.2. During post-training, Modality Forcing gives
each modality (RGB and depth) its own noise level, so one model supports
joint generation (text → RGB-D), image-to-depth, and depth-to-image at
inference.
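
To make the per-modality noise idea concrete, here is a minimal sketch in plain Python. It is not the repository's implementation. It assumes a rectified-flow style interpolation between a clean latent and Gaussian noise, independent uniform noise levels per modality during training, and that a conditioning modality is simply held at noise level 0 at inference. All names (`noise_latent`, `sample_training_levels`, `INFERENCE_START_LEVELS`, `noise_pair`) are hypothetical.

```python
import random


def noise_latent(clean, t, rng):
    """Blend a clean latent with Gaussian noise.

    t = 0.0 returns the clean latent unchanged, t = 1.0 returns pure noise.
    (Assumed rectified-flow style interpolation, for illustration only.)
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in clean]


def sample_training_levels(rng):
    """Assumed training rule: one independent noise level per modality."""
    return {"rgb": rng.random(), "depth": rng.random()}


# Assumed starting noise levels for the three inference modes.
INFERENCE_START_LEVELS = {
    "joint": {"rgb": 1.0, "depth": 1.0},           # text -> RGB-D: denoise both
    "image_to_depth": {"rgb": 0.0, "depth": 1.0},  # RGB is given, denoise depth
    "depth_to_image": {"rgb": 1.0, "depth": 0.0},  # depth is given, denoise RGB
}


def noise_pair(rgb, depth, levels, rng):
    """Apply a separate noise level to each modality's latent."""
    return {
        "rgb": noise_latent(rgb, levels["rgb"], rng),
        "depth": noise_latent(depth, levels["depth"], rng),
    }
```

Under these assumptions, calling `noise_pair(rgb, depth, INFERENCE_START_LEVELS["image_to_depth"], rng)` leaves the RGB latent untouched and replaces the depth latent with noise, which is the starting state for image-to-depth sampling. The other two modes differ only in which entries of the level dictionary are 0 or 1.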

- 📄 Paper: [arXiv:2606.13676](https://arxiv.org/abs/2606.13676)
- 💻 Code: [github.com/Duisterhof/modality-forcing](https://github.com/Duisterhof/modality-forcing)
- 🚀 Demo: [Hugging Face Space](https://huggingface.co/spaces/bartduis/modality_forcing)
- 🌐 Project page: [modality-forcing.github.io](https://modality-forcing.github.io/)

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | FluxRGBD DiT weights in bf16 (12B total: 9B-class FLUX.2 backbone + depth streams) |
| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |

The Qwen3-8B text encoder is not part of this repository; it is pulled
separately from [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B).

## Usage

```bash
git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
```

The scripts download these weights automatically (`bartduis/modality_forcing`
is the default `--model`).
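
If you prefer to fetch the files ahead of time (for example on a machine without network access at run time), the sketch below uses `snapshot_download` from the `huggingface_hub` package. This is an optional alternative, not a description of how the repository's scripts download weights internally. The helper name `fetch_weights` is hypothetical, and the file list is taken from the table in the Files section.

```python
REPO_ID = "bartduis/modality_forcing"

# File names as listed in the "Files" table of this model card.
WEIGHT_FILES = [
    "config.json",
    "model.safetensors",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
]


def fetch_weights(local_dir=None):
    """Download the listed files and return the local snapshot directory.

    With local_dir=None the files go to the default Hugging Face cache.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=WEIGHT_FILES,
        local_dir=local_dir,
    )
```

Calling `fetch_weights()` returns the path of the downloaded snapshot. Whether the inference scripts accept a local path for `--model` is not stated here, so check the GitHub repository before relying on that.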

## License

The model weights are released under **CC BY-NC 4.0** (non-commercial). The
inference code is Apache-2.0; see the GitHub repository.

## Citation

```bibtex
@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}
```