Upload folder using huggingface_hub
- README.md +14 -25
- technical_report_mdiffae.md +22 -37
README.md
CHANGED
@@ -13,9 +13,8 @@ library_name: mdiffae
 
 **mDiffAE** – **M**asked **Diff**usion **A**uto**E**ncoder.
 A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
-spatial bottleneck
-
-flat decoder architecture (4 blocks, no skip connections).
+spatial bottleneck and a flat 4-block decoder. Uses decoder token masking
+as an implicit regularizer instead of REPA alignment.
 
 This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
 Bottleneck: **64 channels** at patch size 16
@@ -74,20 +73,14 @@ learned residual gates.
 
 **Decoder**: VP diffusion conditioned on encoder latents and timestep via
 shared-base + per-layer low-rank AdaLN-Zero. 4 flat
-sequential blocks (no skip connections).
-
-
-
-
-
-
-
-stack of 4 blocks with no skip connections or block groups.
-PDG instead works at the token level: 75% of spatial tokens in the fused decoder
-input are replaced with a learned mask feature, providing a much finer-grained
-guidance signal. The bottleneck is also halved from 128 to 64
-channels, giving a 12x
-compression ratio vs iRDiffAE's 6x.
+sequential blocks (no skip connections).
+
+**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
++ 2 end) with skip connections and 128 bottleneck channels (needed partly because
+REPA occupies half the channels). mDiffAE uses 4 flat blocks
+with no skip connections and 64 bottleneck channels
+(12x compression vs
+iRDiffAE's 6x), which gives better channel utilisation.
 
 ### Key Differences from iRDiffAE
 
@@ -97,22 +90,18 @@ compression ratio vs iRDiffAE's 6x.
 | Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
 | PDG mechanism | Block dropping | **Token masking** |
 | Training regularizer | REPA + covariance reg | **Decoder token masking** |
-| PDG sensitivity | Moderate (1.5–3.0) | **Very sensitive (1.05–1.2)** |
 
 ## Recommended Settings
 
-Best quality is achieved with
-
-
-PDG in mDiffAE is **very sensitive** – use tiny strengths (1.05–1.2)
-if enabled. Higher values will cause artifacts.
+Best quality is achieved with **1 DDIM step** and PDG disabled.
+PDG can sharpen images but should be kept very low (1.01–1.05).
 
 | Setting | Default |
 |---|---|
 | Sampler | DDIM |
 | Steps | 1 |
 | PDG | Disabled |
+| PDG strength (if enabled) | 1.05 |
 
 ```python
 from m_diffae import MDiffAEInferenceConfig
@@ -126,7 +115,7 @@ recon = model.decode(latents, height=H, width=W, inference_config=cfg)
 
 ```bibtex
 @misc{m_diffae,
-  title = {mDiffAE: A Masked Diffusion Autoencoder
+  title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
   author = {data-archetype},
   year = {2026},
   month = mar,

**mDiffAE** – **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck and a flat 4-block decoder. Uses decoder token masking
as an implicit regularizer instead of REPA alignment.

This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
Bottleneck: **64 channels** at patch size 16

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).

**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip
connections and 64 bottleneck channels (12× compression vs iRDiffAE's 6×),
which gives better channel utilisation.

### Key Differences from iRDiffAE

| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images but should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig
```

```bibtex
@misc{m_diffae,
  title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year = {2026},
  month = mar,
}
```

technical_report_mdiffae.md
CHANGED
@@ -1,32 +1,36 @@
-# mDiffAE: Masked Diffusion
+# mDiffAE: A Fast Masked Diffusion Autoencoder – Technical Report
 
 **Version 1** – March 2026
 
 ## 1. Introduction
 
-mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family
+mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
 
-
+iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.
 
 ### 1.1 Token Masking as Regularizer
 
-
+With 50% probability per sample, the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups; within each group a single token is randomly kept and the other three are replaced with a learned mask feature. The high masking ratio (75%) forces each spatial token to carry enough information for reconstruction even when most neighbors are absent. Lower masking ratios help downstream models learn sharp details quickly but fail to learn spatial coherence – the task becomes too close to local inpainting. We tested lower ratios and confirmed this tradeoff (see also He et al., 2022).
 
-The 50%
+The 50% application probability controls the tradeoff between reconstruction quality and latent regularity.
 
 ### 1.2 Latent Noise Regularization
 
-
+10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.
 
 ### 1.3 Simplified Decoder
 
-
+The decoder uses only **4 blocks** (down from 8 in iRDiffAE v1) in a flat sequential layout – no start/middle/end groups, no skip connections. This halves the decoder's parameter count and is roughly 2× faster.
 
-### 1.4
+### 1.4 Bottleneck
 
-
+iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies half the channels. Without REPA, 64 channels suffice and give better channel utilisation. This yields a 12× compression ratio at patch size 16 (vs 6× for iRDiffAE).
 
-### 1.5
+### 1.5 Empirical Results
+
+Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
+
+### 1.6 References
 
 - He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
 - Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.
@@ -40,7 +44,7 @@ Compared to the REPA-regularized iRDiffAE v1, mDiffAE achieves slightly higher P
 | Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
 | Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
 | PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
-| PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.
+| PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.01–1.05) |
 | Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
 | Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
 | Depthwise kernel | 7×7 | 7×7 (same) |
@@ -58,54 +62,35 @@ During training, with 50% probability per sample:
 3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
 4. The decoder processes the partially-masked input normally through all blocks
 
-### 3.2
-
-At inference, the trained mask_feature enables Path-Drop Guidance (PDG) through token-level masking rather than block-level dropping:
-
-- **Unconditional pass**: Apply 2×2 groupwise token masking at the trained ratio (75%)
-- **Guided output**: `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`
-
-Because the decoder has only 4 blocks and no skip connections, the guidance signal from token masking is very concentrated. This makes PDG extremely sensitive – even a strength of 1.2 produces noticeable sharpening, and values above 1.5 cause severe artifacts.
+### 3.2 PDG at Inference
+
+At inference, the trained mask_feature can be used for Path-Drop Guidance (PDG): the conditional pass uses the full input, the unconditional pass applies 2×2 groupwise masking at 75%, and the two are interpolated as usual. PDG can sharpen reconstructions but should be kept very low (strength 1.01–1.05); higher values cause artifacts.
 
 ## 4. Flat Decoder Architecture
 
 ### 4.1 iRDiffAE v1 Decoder (for comparison)
 
-The iRDiffAE v1 decoder uses an 8-block layout split into three groups with a skip connection:
-
 ```
 Fused input → Start blocks (2) → [save for skip] →
 Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
 End blocks (2) → Output head
 ```
 
+8 blocks split into three groups with a skip connection. For PDG, the middle blocks are dropped and replaced with a learned mask feature.
 
 ### 4.2 mDiffAE v1 Decoder
 
-The mDiffAE decoder replaces this with a flat sequential architecture – no block groups, no skip connection:
-
 ```
-Input: x_t [B, 3, H, W], t [B], z [B, 64, h, w]
-
 Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
 LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
 FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
 [Optional: token masking for PDG]
 TimeEmbed(t) → cond [B, 896]
-Block_0
-Block_1(..., AdaLN(cond)) → ...
-Block_2(..., AdaLN(cond)) → ...
-Block_3(..., AdaLN(cond)) → out [B, 896, h, w]
+Block_0 → Block_1 → Block_2 → Block_3 → out [B, 896, h, w]
 RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
 ```
 
-
-
-### 4.3 Bottleneck
-
-The bottleneck dimension is halved from 128 channels (iRDiffAE) to 64 channels, giving a 12x compression ratio at patch size 16 (vs 6x for iRDiffAE). Despite the higher compression, the masking regularizer forces the encoder to produce informative per-token representations, maintaining reconstruction quality.
+4 flat sequential blocks, no skip connections. Roughly half the decoder parameters of iRDiffAE.
 
 ## 5. Model Configuration
 
@@ -133,8 +118,8 @@ Training checkpoint: step 708,000 (EMA weights).
 |---------|-------|-------|
 | Sampler | DDIM | Best for 1-step |
 | Steps | 1 | PSNR-optimal |
-| PDG | Disabled | Default
-| PDG strength | 1.
+| PDG | Disabled | Default |
+| PDG strength | 1.01–1.05 | If enabled; can sharpen but artifacts above ~1.1 |
 
 ## 7. Results

# mDiffAE: A Fast Masked Diffusion Autoencoder – Technical Report

**Version 1** – March 2026

## 1. Introduction

mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).

iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.

### 1.1 Token Masking as Regularizer

With 50% probability per sample, the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups; within each group a single token is randomly kept and the other three are replaced with a learned mask feature. The high masking ratio (75%) forces each spatial token to carry enough information for reconstruction even when most neighbors are absent. Lower masking ratios help downstream models learn sharp details quickly but fail to learn spatial coherence – the task becomes too close to local inpainting. We tested lower ratios and confirmed this tradeoff (see also He et al., 2022).

The 50% application probability controls the tradeoff between reconstruction quality and latent regularity.

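The 2×2 groupwise masking described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the model's actual code: `mask_feature` is a stand-in array here, and the real implementation's tensor layout and RNG details are assumptions.

```python
import numpy as np

def mask_tokens_2x2(tokens, mask_feature, rng):
    # tokens: [B, C, H, W] fused decoder input (H, W even); mask_feature: [C].
    # Keep one random token per non-overlapping 2x2 group (25% kept);
    # replace the other three with the learned mask feature (75% masked).
    B, C, H, W = tokens.shape
    out = np.broadcast_to(mask_feature.reshape(1, C, 1, 1), tokens.shape).copy()
    keep = rng.integers(0, 4, size=(B, H // 2, W // 2))  # kept slot per group
    for slot in range(4):
        dy, dx = divmod(slot, 2)
        b, gy, gx = np.nonzero(keep == slot)
        out[b, :, 2 * gy + dy, 2 * gx + dx] = tokens[b, :, 2 * gy + dy, 2 * gx + dx]
    return out
```

Because exactly one token per 2×2 group survives, the kept fraction is always 25% regardless of the random draw.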
### 1.2 Latent Noise Regularization

10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.

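A minimal sketch of this regularizer, assuming a logit-style mapping from the Beta sample to logSNR and the standard VP relation alpha² = sigmoid(logSNR); the exact schedule mDiffAE uses is not specified in this excerpt.

```python
import numpy as np

def noisy_latents(z, rng):
    # Sample a noise level t ~ Beta(2, 2), map it to a logSNR (assumed logit
    # mapping), apply the +1.0 logSNR shift (biasing toward low noise),
    # then mix VP-style so that alpha^2 + sigma^2 = 1.
    t = rng.beta(2.0, 2.0)
    logsnr = np.log((1.0 - t) / t) + 1.0
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))  # alpha^2 = sigmoid(logSNR)
    sigma = np.sqrt(1.0 - alpha ** 2)
    return alpha * z + sigma * rng.standard_normal(z.shape)

# Applied with 10% probability per training sample:
# z = noisy_latents(z, rng) if rng.random() < 0.1 else z
```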
### 1.3 Simplified Decoder

The decoder uses only **4 blocks** (down from 8 in iRDiffAE v1) in a flat sequential layout – no start/middle/end groups, no skip connections. This halves the decoder's parameter count and is roughly 2× faster.

### 1.4 Bottleneck

iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies half the channels. Without REPA, 64 channels suffice and give better channel utilisation. This yields a 12× compression ratio at patch size 16 (vs 6× for iRDiffAE).

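The quoted compression ratios follow directly from the patch geometry: a 16×16 RGB patch holds 16·16·3 = 768 values, mapped to a single 64-channel (or, for iRDiffAE, 128-channel) latent token.

```python
patch, rgb = 16, 3
values_per_token = patch * patch * rgb   # 768 pixel values per spatial token
ratio_mdiffae = values_per_token / 64    # 64-channel bottleneck -> 12x
ratio_irdiffae = values_per_token / 128  # 128-channel bottleneck -> 6x
```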
### 1.5 Empirical Results

Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.

### 1.6 References

- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
- Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.

| Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
| Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
| PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
| PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.01–1.05) |
| Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
| Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
| Depthwise kernel | 7×7 | 7×7 (same) |

3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
4. The decoder processes the partially-masked input normally through all blocks

### 3.2 PDG at Inference

At inference, the trained mask_feature can be used for Path-Drop Guidance (PDG): the conditional pass uses the full input, the unconditional pass applies 2×2 groupwise masking at 75%, and the two are interpolated as usual. PDG can sharpen reconstructions but should be kept very low (strength 1.01–1.05); higher values cause artifacts.

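The interpolation itself is the guidance combination given in this commit's diff, `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`:

```python
import numpy as np

def pdg_combine(x0_cond, x0_uncond, strength=1.05):
    # strength = 1.0 recovers the conditional prediction exactly;
    # mDiffAE only tolerates tiny extrapolations (1.01-1.05).
    return x0_uncond + strength * (x0_cond - x0_uncond)
```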
## 4. Flat Decoder Architecture

### 4.1 iRDiffAE v1 Decoder (for comparison)

```
Fused input → Start blocks (2) → [save for skip] →
Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
End blocks (2) → Output head
```

8 blocks split into three groups with a skip connection. For PDG, the middle blocks are dropped and replaced with a learned mask feature.

### 4.2 mDiffAE v1 Decoder

```
Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
[Optional: token masking for PDG]
TimeEmbed(t) → cond [B, 896]
Block_0 → Block_1 → Block_2 → Block_3 → out [B, 896, h, w]
RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
```

4 flat sequential blocks, no skip connections. Roughly half the decoder parameters of iRDiffAE.

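The shape flow above can be checked with plain reshapes. The reshape-based patchify/PixelShuffle below is only a sketch of the tensor bookkeeping, not the model's actual layers (which include a learned lift from 768 to the 896 model dims).

```python
import numpy as np

B, H, W, p = 2, 64, 64, 16
x_t = np.zeros((B, 3, H, W))
h, w = H // p, W // p  # spatial token grid (4 x 4 here)

# Patchify: fold each 16x16x3 patch into the channel axis.
x_feat = (x_t.reshape(B, 3, h, p, w, p)
              .transpose(0, 1, 3, 5, 2, 4)
              .reshape(B, 3 * p * p, h, w))  # [B, 768, h, w]; a learned
                                             # projection would lift 768 -> 896

# After the 4 blocks and the output head we are back at [B, 768, h, w] ...
rgb = np.zeros((B, 3 * p * p, h, w))
# ... and PixelShuffle unfolds the patches back to pixels.
x0_hat = (rgb.reshape(B, 3, p, p, h, w)
             .transpose(0, 1, 4, 2, 5, 3)
             .reshape(B, 3, H, W))
```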
## 5. Model Configuration

|---------|-------|-------|
| Sampler | DDIM | Best for 1-step |
| Steps | 1 | PSNR-optimal |
| PDG | Disabled | Default |
| PDG strength | 1.01–1.05 | If enabled; can sharpen but artifacts above ~1.1 |

## 7. Results
