---
license: cc-by-nc-4.0
pipeline_tag: text-to-image
tags:
- depth-estimation
- text-to-3d
- diffusion
- flux
library_name: flux_rgbd
---

# Modality Forcing for Scalable Spatial Generation

Joint **text → RGB + depth** generation with a single diffusion
transformer, built on FLUX.2. Modality Forcing assigns separate noise levels
per modality during post-training, so one model supports joint generation
(text → RGB-D), image-to-depth, and depth-to-image at inference.
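
To make the per-modality noise idea concrete, here is a minimal sketch, not the repository's implementation. It assumes a rectified-flow style interpolation `x_t = (1 - t) * x_0 + t * eps`, uniform sampling of noise levels, and toy latent shapes; the function names (`noise_latents`, `sample_training_levels`, `inference_levels`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def noise_latents(rgb, depth, t_rgb, t_depth, rng):
    """Noise each modality at its own level (0 = clean, 1 = pure noise).

    Assumption: linear (rectified-flow style) interpolation between the
    clean latent and Gaussian noise. The actual schedule may differ.
    """
    rgb_t = (1.0 - t_rgb) * rgb + t_rgb * rng.standard_normal(rgb.shape)
    depth_t = (1.0 - t_depth) * depth + t_depth * rng.standard_normal(depth.shape)
    return rgb_t, depth_t


def sample_training_levels(rng):
    """Draw independent noise levels for RGB and depth (assumed uniform)."""
    t_rgb, t_depth = rng.uniform(0.0, 1.0, size=2)
    return float(t_rgb), float(t_depth)


def inference_levels(task, t):
    """Map a task to (t_rgb, t_depth) at denoising time t.

    A conditioning modality is held clean (level 0) while the other
    modality is denoised; joint generation denoises both together.
    """
    if task == "joint":
        return t, t
    if task == "image_to_depth":
        return 0.0, t
    if task == "depth_to_image":
        return t, 0.0
    raise ValueError(f"unknown task: {task}")


rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8, 8))    # toy RGB latent (hypothetical shape)
depth = rng.standard_normal((4, 8, 8))  # toy depth latent (hypothetical shape)

# Image-to-depth: RGB stays clean, depth starts from pure noise.
t_rgb, t_depth = inference_levels("image_to_depth", 1.0)
rgb_t, depth_t = noise_latents(rgb, depth, t_rgb, t_depth, rng)
```

Because the two levels are independent during training, fixing one of them at zero at inference turns that modality into a clean conditioning signal, which is how a single model can cover all three tasks.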

- 📄 Paper: [arXiv:2606.13676](https://arxiv.org/abs/2606.13676)
- 💻 Code: [github.com/Duisterhof/modality-forcing](https://github.com/Duisterhof/modality-forcing)
- 🚀 Demo: [Hugging Face Space](https://huggingface.co/spaces/bartduis/modality_forcing)
- 🌐 Project page: [modality-forcing.github.io](https://modality-forcing.github.io/)

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |

The Qwen3-8B text encoder is pulled separately from
[`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B).

## Usage

```bash
git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
```

The scripts download these weights automatically (`bartduis/modality_forcing`
is the default `--model`).
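
If you prefer to fetch the weights ahead of time (for example on a machine that will later run offline), the sketch below uses `hf_hub_download` from the `huggingface_hub` package. It assumes the model repository ID matches the default `--model` value and that the filenames match the table above; `prefetch_weights` is a hypothetical helper, not part of the repository.

```python
REPO_ID = "bartduis/modality_forcing"  # assumed to be the model repo ID
FILES = [
    "model.safetensors",
    "config.json",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
]


def prefetch_weights(repo_id=REPO_ID, files=FILES):
    """Download each file into the local Hugging Face cache.

    Returns a dict mapping filename to its local cached path.
    """
    # Imported lazily so the module loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in files}
```

Calling `prefetch_weights()` requires network access and downloads the full checkpoint, so run it only when you have the disk space and bandwidth for it.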

## License

The model weights are released under **CC BY-NC 4.0** (non-commercial). The
inference code is Apache-2.0; see the GitHub repository.

## Citation

```bibtex
@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}
```