Commit 1da4105
Parent(s):
Initial public release
- .gitattributes +35 -0
- README.md +61 -0
- ae_decoder.safetensors +3 -0
- ae_encoder.safetensors +3 -0
- config.json +5 -0
- model.safetensors +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
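The `.gitattributes` rules above decide which files in this commit are stored as Git LFS pointers rather than regular blobs. The sketch below approximates that decision with Python's `fnmatch` on the file's basename. This is an assumption for illustration only: real gitattributes matching has its own rules (for example for `saved_model/**/*`), and `is_lfs_tracked` is a hypothetical helper, not part of Git or of this repository.

```python
from fnmatch import fnmatch

# A subset of the patterns added in .gitattributes (basename patterns only).
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.pt", "*.pth", "*.safetensors", "*.tar.*", "*tfevents*"]

# The files touched by this commit.
COMMITTED_FILES = [
    ".gitattributes",
    "README.md",
    "ae_decoder.safetensors",
    "ae_encoder.safetensors",
    "config.json",
    "model.safetensors",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate gitattributes matching by testing the basename with fnmatch."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


for path in COMMITTED_FILES:
    kind = "LFS pointer" if is_lfs_tracked(path) else "regular blob"
    print(f"{path}: {kind}")
```

Under this approximation the three `.safetensors` files are LFS pointers, which agrees with their three-line diffs in this commit.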
README.md
ADDED
@@ -0,0 +1,61 @@
+---
+license: cc-by-nc-4.0
+pipeline_tag: text-to-image
+tags:
+- depth-estimation
+- text-to-3d
+- diffusion
+- flux
+library_name: flux_rgbd
+---
+
+# Modality Forcing for Scalable Spatial Generation
+
+Joint **text → RGB + depth** generation with a single diffusion
+transformer, built on FLUX.2. Modality Forcing assigns separate noise levels
+per modality during post-training, so one model supports joint generation
+(text → RGB-D), image-to-depth, and depth-to-image at inference.
+
+- Paper: [arXiv:2606.13676](https://arxiv.org/abs/2606.13676)
+- Code: [github.com/Duisterhof/modality-forcing](https://github.com/Duisterhof/modality-forcing)
+- Demo: [Hugging Face Space](https://huggingface.co/spaces/bartduis/modality_forcing)
+- Project page: [modality-forcing.github.io](https://modality-forcing.github.io/)
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `model.safetensors` | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
+| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
+| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |
+
+The Qwen3-8B text encoder is pulled separately from
+[`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B).
+
+## Usage
+
+```bash
+git clone https://github.com/Duisterhof/modality-forcing.git
+cd modality-forcing
+bash install.sh
+python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
+```
+
+The scripts download these weights automatically (`bartduis/modality_forcing`
+is the default `--model`).
+
+## License
+
+The model weights are released under **CC BY-NC 4.0** (non-commercial). The
+inference code is Apache-2.0; see the GitHub repository.
+
+## Citation
+
+```bibtex
+@article{duisterhof2026mofo,
+  title   = {Modality Forcing for Scalable Spatial Generation},
+  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
+  journal = {arXiv preprint arXiv:2606.13676},
+  year    = {2026}
+}
+```
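The README says that separate per-modality noise levels let one model cover joint generation, image-to-depth, and depth-to-image. One way to read that is sketched below: each inference mode picks a starting noise level for the RGB stream and for the depth stream, with the conditioning modality held clean. Everything here is an assumption for illustration, including the names `NoiseLevels` and `initial_noise_levels` and the convention that 1.0 means pure noise and 0.0 means a clean input. The actual sampler lives in the GitHub repository and may differ.

```python
from dataclasses import dataclass

# Assumed convention: 1.0 = pure noise, 0.0 = clean (given) input.
PURE_NOISE = 1.0
CLEAN = 0.0


@dataclass(frozen=True)
class NoiseLevels:
    """Starting noise level for each modality stream (hypothetical container)."""
    rgb: float
    depth: float


def initial_noise_levels(mode: str) -> NoiseLevels:
    """Map an inference mode to per-modality starting noise levels."""
    if mode == "joint":
        # text -> RGB-D: both streams are generated from noise.
        return NoiseLevels(rgb=PURE_NOISE, depth=PURE_NOISE)
    if mode == "image_to_depth":
        # RGB is provided, so only depth is denoised.
        return NoiseLevels(rgb=CLEAN, depth=PURE_NOISE)
    if mode == "depth_to_image":
        # Depth is provided, so only RGB is denoised.
        return NoiseLevels(rgb=PURE_NOISE, depth=CLEAN)
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("joint", "image_to_depth", "depth_to_image"):
    print(mode, initial_noise_levels(mode))
```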
ae_decoder.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:393477cff9c15055e512525c7b90cda02b06e0378c93dfc75b261d10825842b1
+size 198495428
ae_encoder.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fde73aad493e1343b138660c87e6a376dc5b2d6bf9b061206b580088b78ff6e3
+size 137714880
config.json
ADDED
@@ -0,0 +1,5 @@
+{
+  "variant": "flux_rgbd_9b_v2",
+  "model_type": "flux_rgbd",
+  "torch_dtype": "bfloat16"
+}
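The config above has only three keys, so a consumer can validate it before loading any weights. The following is a minimal sketch using only the standard library. The `load_config` helper and the `BYTES_PER_ELEMENT` table are hypothetical and are not taken from the `flux_rgbd` code. How the real loader interprets these fields is not shown in this commit.

```python
import json

# The config.json content added in this commit.
CONFIG_TEXT = """
{
  "variant": "flux_rgbd_9b_v2",
  "model_type": "flux_rgbd",
  "torch_dtype": "bfloat16"
}
"""

# Storage size per element for common dtype names (assumed lookup table).
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}

REQUIRED_KEYS = {"variant", "model_type", "torch_dtype"}


def load_config(text: str) -> dict:
    """Parse the config and check that the expected keys and dtype are present."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if config["torch_dtype"] not in BYTES_PER_ELEMENT:
        raise ValueError(f"unsupported torch_dtype: {config['torch_dtype']!r}")
    return config


config = load_config(CONFIG_TEXT)
print(config["variant"], BYTES_PER_ELEMENT[config["torch_dtype"]], "bytes per element")
```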
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3da9bbd5e45ba92e22bfb7c338f6bbcce7ba11e9119c409381177396e17b1f0f
+size 23972604108
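The three-line pointer above is all that Git stores for `model.safetensors`. The weights themselves live in LFS storage. As a rough consistency check, the sketch below parses the pointer and divides the byte size by 2 (bf16 uses two bytes per value) to estimate the parameter count. `parse_lfs_pointer` is a hypothetical helper, and the estimate ignores the small safetensors header and assumes every tensor is stored in bf16.

```python
# The LFS pointer content for model.safetensors from this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:3da9bbd5e45ba92e22bfb7c338f6bbcce7ba11e9119c409381177396e17b1f0f
size 23972604108
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)

# Assumption: all weights are bf16, i.e. 2 bytes per parameter.
approx_params = pointer["size"] / 2

print(f"{pointer['size'] / 1e9:.2f} GB, ~{approx_params / 1e9:.1f}B parameters")
# prints: 23.97 GB, ~12.0B parameters
```

The estimate of roughly 12B parameters is consistent with the "12B total" figure in the README's file table.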