data-archetype committed on
Commit 9b877c3 · verified · 1 Parent(s): 5595c25

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +14 -25
  2. technical_report_mdiffae.md +22 -37
README.md CHANGED
@@ -13,9 +13,8 @@ library_name: mdiffae
13
 
14
 **mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
15
  A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
16
- spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of
17
- REPA alignment, achieving Flux.2-level conditioning quality with a simpler
18
- flat decoder architecture (4 blocks, no skip connections).
19
 
20
  This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
21
  Bottleneck: **64 channels** at patch size 16
@@ -74,20 +73,14 @@ learned residual gates.
74
 
75
  **Decoder**: VP diffusion conditioned on encoder latents and timestep via
76
  shared-base + per-layer low-rank AdaLN-Zero. 4 flat
77
- sequential blocks (no skip connections). Supports token-level Path-Drop
78
- Guidance (PDG) at inference — very sensitive, use small strengths only.
79
-
80
- **Compared to iRDiffAE's decoder**: iRDiffAE uses an 8-block decoder split into
81
- start (2), middle (4), and end (2) groups with a skip connection that concatenates
82
- start-block output with middle-block output and fuses them through a Conv1x1 before
83
- the end blocks. PDG works by dropping the entire middle block computation and
84
- replacing it with a learned mask feature. In contrast, mDiffAE uses a simple flat
85
- stack of 4 blocks with no skip connections or block groups.
86
- PDG instead works at the token level: 75% of spatial tokens in the fused decoder
87
- input are replaced with a learned mask feature, providing a much finer-grained
88
- guidance signal. The bottleneck is also halved from 128 to 64
89
- channels, giving a 12x
90
- compression ratio vs iRDiffAE's 6x.
91
 
92
  ### Key Differences from iRDiffAE
93
 
@@ -97,22 +90,18 @@ compression ratio vs iRDiffAE's 6x.
97
  | Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
98
  | PDG mechanism | Block dropping | **Token masking** |
99
  | Training regularizer | REPA + covariance reg | **Decoder token masking** |
100
- | PDG sensitivity | Moderate (1.5–3.0) | **Very sensitive (1.05–1.2)** |
101
 
102
  ## Recommended Settings
103
 
104
- Best quality is achieved with just **1 DDIM step** and PDG disabled,
105
- making inference extremely fast.
106
-
107
- PDG in mDiffAE is **very sensitive** — use tiny strengths (1.05–1.2)
108
- if enabled. Higher values will cause artifacts.
109
 
110
  | Setting | Default |
111
  |---|---|
112
  | Sampler | DDIM |
113
  | Steps | 1 |
114
  | PDG | Disabled |
115
- | PDG strength (if enabled) | 1.1 |
116
 
117
  ```python
118
  from m_diffae import MDiffAEInferenceConfig
@@ -126,7 +115,7 @@ recon = model.decode(latents, height=H, width=W, inference_config=cfg)
126
 
127
  ```bibtex
128
  @misc{m_diffae,
129
- title = {mDiffAE: A Masked Diffusion Autoencoder with Flat Decoder and Token-Level Guidance},
130
  author = {data-archetype},
131
  year = {2026},
132
  month = mar,
 
13
 
14
 **mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
15
  A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
16
+ spatial bottleneck and a flat 4-block decoder. Uses decoder token masking
17
+ as an implicit regularizer instead of REPA alignment.
 
18
 
19
  This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
20
  Bottleneck: **64 channels** at patch size 16
 
73
 
74
  **Decoder**: VP diffusion conditioned on encoder latents and timestep via
75
  shared-base + per-layer low-rank AdaLN-Zero. 4 flat
76
+ sequential blocks (no skip connections).
77
+
78
+ **Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
79
+ + 2 end) with skip connections and 128 bottleneck channels (needed partly because
80
+ REPA occupies half the channels). mDiffAE uses 4 flat blocks
81
+ with no skip connections and 64 bottleneck channels
82
+ (12x compression vs
83
+ iRDiffAE's 6x), which gives better channel utilisation.
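The compression ratios quoted above follow from the patch geometry; a quick sanity check, assuming the figures count raw RGB values per patch against bottleneck channels:

```python
# Values per RGB patch at patch size 16, divided by bottleneck channels.
patch = 16
input_vals = 3 * patch * patch     # 768 pixel values per patch

mdiffae_ratio = input_vals / 64    # mDiffAE: 64-channel bottleneck
irdiffae_ratio = input_vals / 128  # iRDiffAE: 128-channel bottleneck
print(mdiffae_ratio, irdiffae_ratio)  # 12.0 6.0
```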
 
 
 
 
 
 
84
 
85
  ### Key Differences from iRDiffAE
86
 
 
90
  | Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
91
  | PDG mechanism | Block dropping | **Token masking** |
92
  | Training regularizer | REPA + covariance reg | **Decoder token masking** |
 
93
 
94
  ## Recommended Settings
95
 
96
+ Best quality is achieved with **1 DDIM step** and PDG disabled.
97
+ PDG can sharpen images but should be kept very low (1.01–1.05).
 
 
 
98
 
99
  | Setting | Default |
100
  |---|---|
101
  | Sampler | DDIM |
102
  | Steps | 1 |
103
  | PDG | Disabled |
104
+ | PDG strength (if enabled) | 1.05 |
105
 
106
  ```python
107
  from m_diffae import MDiffAEInferenceConfig
 
115
 
116
  ```bibtex
117
  @misc{m_diffae,
118
+ title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
119
  author = {data-archetype},
120
  year = {2026},
121
  month = mar,
technical_report_mdiffae.md CHANGED
@@ -1,32 +1,36 @@
1
- # mDiffAE: Masked Diffusion AutoEncoder — Technical Report
2
 
3
 **Version 1** — March 2026
4
 
5
  ## 1. Introduction
6
 
7
- mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family, which provides a fast, single-GPU-trainable diffusion autoencoder platform capable of high-quality image reconstruction. See that report for full background on the shared components: VP diffusion math (logSNR schedules, alpha/sigma, x-prediction), DiCo block architecture (depthwise conv + compact channel attention + GELU MLP), patchify encoder (PixelUnshuffle + 1×1 conv), shared-base + low-rank AdaLN-Zero conditioning, and the Path-Drop Guidance (PDG) concept.
8
 
9
- The iRDiffAE platform is designed to make it easy to experiment with different ways of regularizing the latent space. iRDiffAE v1 used REPA — aligning encoder features with a frozen DINOv2 teacher — which produces well-structured latents but tends toward overly smooth representations that are hard to reconcile with fine detail. Here we take a different approach entirely: **decoder token masking**.
10
 
11
  ### 1.1 Token Masking as Regularizer
12
 
13
- A fraction of the time (50% of samples per batch), the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups, and within each group a single token is randomly kept while the other three are replaced with a learned mask feature. Hiding such a large fraction (75%) pushes the encoder to learn a form of representation consistency — each spatial token must carry enough information to support reconstruction even when most of its neighbors are absent. A smaller masking fraction helps downstream models learn sharp details quickly, but they fail to learn spatial coherence nearly as well — the task becomes too close to local inpainting, and the encoder is not pressured into globally consistent representations. The importance of a high masking ratio echoes findings in the masked autoencoder literature (He et al., 2022); we tested lower ratios and confirmed this tradeoff empirically.
14
 
15
- The 50% per-sample application probability is the knob that controls the compromise between reconstruction quality and latent space quality: samples that receive masking push the encoder toward consistent representations, while unmasked samples maintain reconstruction fidelity.
16
 
17
  ### 1.2 Latent Noise Regularization
18
 
19
- To further regularize the latent space, we retain the random latent noising mechanism 10% of the time. However, unlike the pixel-space diffusion noise, the latent noise level is sampled independently using a **Beta(2,2)** distribution (stratified), with a **logSNR shift of +1.0** that biases it toward low noise levels (low *t* in our convention). This decouples the latent regularization schedule from the decoder's diffusion schedule, providing a gentle push toward noise-robust representations without disrupting reconstruction training.
20
 
21
  ### 1.3 Simplified Decoder
22
 
23
- To keep the representational pressure on the encoder, we restrict the decoder to only **4 blocks** (down from 8 in iRDiffAE v1) and simplify it to a flat sequential architecture — no start/middle/end block groups, no skip connections. This halves the decoder's parameter count and makes it roughly 2× faster, while forcing the encoder to compensate by producing more informative latents.
24
 
25
- ### 1.4 Empirical Results
26
 
27
- Compared to the REPA-regularized iRDiffAE v1, mDiffAE achieves slightly higher PSNR (to be confirmed with final benchmarks, but initial results were quite decisive) and produces a less oversmoothed but very locally consistent latent space PCA. In downstream diffusion model training, mDiffAE's latent space does not exhibit the very steep initial loss descent seen with iRDiffAE, but it quickly catches up after 50k–100k training steps, producing more spatially coherent images earlier with better high-frequency detail.
28
 
29
- ### 1.5 References
 
 
 
 
30
 
31
 - He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
32
  - Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.
@@ -40,7 +44,7 @@ Compared to the REPA-regularized iRDiffAE v1, mDiffAE achieves slightly higher P
40
  | Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
41
 | Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
42
 | PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
43
- | PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.05–1.2 only) |
44
  | Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
45
  | Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
46
  | Depthwise kernel | 7Γ—7 | 7Γ—7 (same) |
@@ -58,54 +62,35 @@ During training, with 50% probability per sample:
58
  3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
59
  4. The decoder processes the partially-masked input normally through all blocks
60
 
61
- ### 3.2 Inference-Time PDG via Token Masking
62
-
63
- At inference, the trained mask_feature enables Path-Drop Guidance (PDG) through token-level masking rather than block-level dropping:
64
 
65
- - **Conditional pass**: Full decoder input (no masking)
66
- - **Unconditional pass**: Apply 2×2 groupwise token masking at the trained ratio (75%)
67
- - **Guided output**: `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`
68
-
69
- Because the decoder has only 4 blocks and no skip connections, the guidance signal from token masking is very concentrated. This makes PDG extremely sensitive — even a strength of 1.2 produces noticeable sharpening, and values above 1.5 cause severe artifacts.
70
 
71
  ## 4. Flat Decoder Architecture
72
 
73
  ### 4.1 iRDiffAE v1 Decoder (for comparison)
74
 
75
- The iRDiffAE v1 decoder uses an 8-block layout split into three groups with a skip connection:
76
-
77
  ```
78
 Fused input → Start blocks (2) → [save for skip] →
79
 Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
80
 End blocks (2) → Output head
81
  ```
82
 
83
- The skip connection concatenates the start-block output with the middle-block output and fuses them through a learned Conv1×1 before feeding into the end blocks. For PDG, the entire middle block computation is dropped and replaced with a broadcasted learned `mask_feature`, effectively removing all 4 middle blocks from the forward pass. This produces a coarse "unconditional" signal for classifier-free guidance.
84
 
85
  ### 4.2 mDiffAE v1 Decoder
86
 
87
- The mDiffAE decoder replaces this with a flat sequential architecture — no block groups, no skip connection:
88
-
89
  ```
90
- Input: x_t [B, 3, H, W], t [B], z [B, 64, h, w]
91
-
92
 Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
93
 LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
94
 FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
95
 [Optional: token masking for PDG]
96
 TimeEmbed(t) → cond [B, 896]
97
- Block_0(fused, AdaLN(cond)) → ...
98
- Block_1(..., AdaLN(cond)) → ...
99
- Block_2(..., AdaLN(cond)) → ...
100
- Block_3(..., AdaLN(cond)) → out [B, 896, h, w]
101
 RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
102
  ```
103
 
104
- With only 4 blocks and no skip fusion layer, the decoder has roughly half the parameters of iRDiffAE's decoder. The `fuse_skip` Conv1×1 layer is eliminated entirely. For PDG, instead of dropping blocks, 75% of spatial tokens in the fused input are replaced with a learned `mask_feature` before the blocks run. This token-level masking provides a finer-grained guidance signal but is much more sensitive to strength — the decoder sees the full block computation in both the conditional and unconditional paths, so the difference between them is subtle.
105
-
106
- ### 4.3 Bottleneck
107
-
108
- The bottleneck dimension is halved from 128 channels (iRDiffAE) to 64 channels, giving a 12x compression ratio at patch size 16 (vs 6x for iRDiffAE). Despite the higher compression, the masking regularizer forces the encoder to produce informative per-token representations, maintaining reconstruction quality.
109
 
110
  ## 5. Model Configuration
111
 
@@ -133,8 +118,8 @@ Training checkpoint: step 708,000 (EMA weights).
133
  |---------|-------|-------|
134
  | Sampler | DDIM | Best for 1-step |
135
  | Steps | 1 | PSNR-optimal |
136
- | PDG | Disabled | Default, safest |
137
- | PDG strength | 1.05–1.2 | If enabled, very sensitive |
138
 
139
  ## 7. Results
140
 
 
1
+ # mDiffAE: A Fast Masked Diffusion Autoencoder — Technical Report
2
 
3
 **Version 1** — March 2026
4
 
5
  ## 1. Introduction
6
 
7
+ mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
8
 
9
+ iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.
10
 
11
  ### 1.1 Token Masking as Regularizer
12
 
13
+ With 50% probability per sample, the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups; within each group a single token is randomly kept and the other three are replaced with a learned mask feature. The high masking ratio (75%) forces each spatial token to carry enough information for reconstruction even when most neighbors are absent. Lower masking ratios help downstream models learn sharp details quickly but fail to learn spatial coherence — the task becomes too close to local inpainting. We tested lower ratios and confirmed this tradeoff (see also He et al., 2022).
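A minimal numpy sketch of the 2×2 groupwise masking described above. The function name, array shapes, and zero-valued mask feature are illustrative only; the real implementation operates on the fused decoder tokens with a learned `mask_feature`:

```python
import numpy as np

def mask_tokens(tokens, mask_feature, rng):
    """Within each non-overlapping 2x2 group of the [h, w, d] token grid,
    keep one random token and replace the other three with mask_feature
    (75% masking ratio). Shapes are assumptions for this sketch."""
    h, w, d = tokens.shape
    out = np.broadcast_to(mask_feature, tokens.shape).copy()
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            di, dj = rng.integers(0, 2, size=2)  # position of survivor
            out[i + di, j + dj] = tokens[i + di, j + dj]
    return out

rng = np.random.default_rng(0)
toks = rng.standard_normal((4, 4, 8))
masked = mask_tokens(toks, np.zeros(8), rng)
kept = int((masked != 0).any(axis=-1).sum())  # 4 of 16 tokens survive (25%)
```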
14
 
15
+ The 50% application probability controls the tradeoff between reconstruction quality and latent regularity.
16
 
17
  ### 1.2 Latent Noise Regularization
18
 
19
+ 10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.
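The latent noise-level sampling can be sketched as follows. The Beta(2,2) draw and the +1.0 logSNR shift are from this report; the concrete logSNR mapping and the function name are assumptions for illustration, and the stratification is omitted for brevity:

```python
import numpy as np

def sample_latent_logsnr(rng, n, shift=1.0):
    """Sketch: noise-time t ~ Beta(2, 2), converted to a logSNR via an
    assumed cosine-style mapping, then shifted by +1.0 to bias toward
    low noise (higher SNR)."""
    t = rng.beta(2.0, 2.0, size=n)
    logsnr = -2.0 * np.log(np.tan(np.pi * t / 2.0))  # assumed mapping
    return logsnr + shift
```

The shift acts purely additively in logSNR space, so it decouples this regularizer from the decoder's own diffusion schedule.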
20
 
21
  ### 1.3 Simplified Decoder
22
 
23
+ The decoder uses only **4 blocks** (down from 8 in iRDiffAE v1) in a flat sequential layout — no start/middle/end groups, no skip connections. This halves the decoder's parameter count and is roughly 2× faster.
24
 
25
+ ### 1.4 Bottleneck
26
 
27
+ iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies half the channels. Without REPA, 64 channels suffice and give better channel utilisation. This yields a 12× compression ratio at patch size 16 (vs 6× for iRDiffAE).
28
 
29
+ ### 1.5 Empirical Results
30
+
31
+ Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
32
+
33
+ ### 1.6 References
34
 
35
 - He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
36
  - Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.
 
44
  | Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
45
 | Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
46
 | PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
47
+ | PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.01–1.05) |
48
  | Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
49
  | Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
50
  | Depthwise kernel | 7Γ—7 | 7Γ—7 (same) |
 
62
  3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
63
  4. The decoder processes the partially-masked input normally through all blocks
64
 
65
+ ### 3.2 PDG at Inference
 
 
66
 
67
+ At inference, the trained mask_feature can be used for Path-Drop Guidance (PDG): the conditional pass uses the full input, the unconditional pass applies 2×2 groupwise masking at 75%, and the two are interpolated as usual. PDG can sharpen reconstructions but should be kept very low (strength 1.01–1.05); higher values cause artifacts.
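The interpolation step is the standard guided-output formula `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`; a direct numpy transcription (the function name is illustrative):

```python
import numpy as np

def pdg_combine(x0_cond, x0_uncond, strength=1.05):
    """Guided prediction: extrapolate from the unconditional (masked)
    pass toward the conditional pass. strength = 1.0 recovers the
    conditional prediction exactly."""
    return x0_uncond + strength * (x0_cond - x0_uncond)

# strength slightly above 1 nudges just past the conditional output
x0 = pdg_combine(np.array([2.0]), np.array([1.0]), strength=1.05)
```

Strengths far above 1 amplify whatever the masked pass failed to reconstruct, which is why only tiny values are safe here.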
 
 
 
 
68
 
69
  ## 4. Flat Decoder Architecture
70
 
71
  ### 4.1 iRDiffAE v1 Decoder (for comparison)
72
 
 
 
73
  ```
74
 Fused input → Start blocks (2) → [save for skip] →
75
 Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
76
 End blocks (2) → Output head
77
  ```
78
 
79
+ 8 blocks split into three groups with a skip connection. For PDG, the middle blocks are dropped and replaced with a learned mask feature.
80
 
81
  ### 4.2 mDiffAE v1 Decoder
82
 
 
 
83
  ```
 
 
84
 Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
85
 LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
86
 FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
87
 [Optional: token masking for PDG]
88
 TimeEmbed(t) → cond [B, 896]
89
+ Block_0 → Block_1 → Block_2 → Block_3 → out [B, 896, h, w]
90
 RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
91
  ```
92
 
93
+ 4 flat sequential blocks, no skip connections. Roughly half the decoder parameters of iRDiffAE.
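The flat trunk reduces to a plain loop; a sketch with placeholder blocks standing in for the AdaLN-Zero-conditioned DiCo blocks (block internals are not shown and the function name is illustrative):

```python
def flat_decoder_trunk(fused, cond, blocks):
    """mDiffAE-style flat trunk: run 4 blocks sequentially with no
    block groups and no skip connections or skip fusion."""
    h = fused
    for block in blocks:
        h = block(h, cond)
    return h

# toy stand-in blocks: each just adds the conditioning scalar
blocks = [lambda h, c: h + c] * 4
out = flat_decoder_trunk(0.0, 1.0, blocks)  # four additions -> 4.0
```

Contrast with iRDiffAE, where the trunk would also save the start-block output and concatenate it back in before the end blocks.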
 
 
 
 
94
 
95
  ## 5. Model Configuration
96
 
 
118
  |---------|-------|-------|
119
  | Sampler | DDIM | Best for 1-step |
120
  | Steps | 1 | PSNR-optimal |
121
+ | PDG | Disabled | Default |
122
+ | PDG strength | 1.01–1.05 | If enabled; can sharpen but artifacts above ~1.1 |
123
 
124
  ## 7. Results
125