data-archetype committed on
Commit 9b877c3 · verified · 1 Parent(s): 5595c25

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +14 -25
  2. technical_report_mdiffae.md +22 -37
README.md CHANGED
@@ -13,9 +13,8 @@ library_name: mdiffae
13
 
14
 **mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
15
  A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
16
- spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of
17
- REPA alignment, achieving Flux.2-level conditioning quality with a simpler
18
- flat decoder architecture (4 blocks, no skip connections).
19
 
20
  This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
21
  Bottleneck: **64 channels** at patch size 16
@@ -74,20 +73,14 @@ learned residual gates.
74
 
75
  **Decoder**: VP diffusion conditioned on encoder latents and timestep via
76
  shared-base + per-layer low-rank AdaLN-Zero. 4 flat
77
- sequential blocks (no skip connections). Supports token-level Path-Drop
78
- Guidance (PDG) at inference — very sensitive, use small strengths only.
79
-
80
- **Compared to iRDiffAE's decoder**: iRDiffAE uses an 8-block decoder split into
81
- start (2), middle (4), and end (2) groups with a skip connection that concatenates
82
- start-block output with middle-block output and fuses them through a Conv1x1 before
83
- the end blocks. PDG works by dropping the entire middle block computation and
84
- replacing it with a learned mask feature. In contrast, mDiffAE uses a simple flat
85
- stack of 4 blocks with no skip connections or block groups.
86
- PDG instead works at the token level: 75% of spatial tokens in the fused decoder
87
- input are replaced with a learned mask feature, providing a much finer-grained
88
- guidance signal. The bottleneck is also halved from 128 to 64
89
- channels, giving a 12x
90
- compression ratio vs iRDiffAE's 6x.
91
 
92
  ### Key Differences from iRDiffAE
93
 
@@ -97,22 +90,18 @@ compression ratio vs iRDiffAE's 6x.
97
  | Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
98
  | PDG mechanism | Block dropping | **Token masking** |
99
  | Training regularizer | REPA + covariance reg | **Decoder token masking** |
100
- | PDG sensitivity | Moderate (1.5–3.0) | **Very sensitive (1.05–1.2)** |
101
 
102
  ## Recommended Settings
103
 
104
- Best quality is achieved with just **1 DDIM step** and PDG disabled,
105
- making inference extremely fast.
106
-
107
- PDG in mDiffAE is **very sensitive** — use tiny strengths (1.05–1.2)
108
- if enabled. Higher values will cause artifacts.
109
 
110
  | Setting | Default |
111
  |---|---|
112
  | Sampler | DDIM |
113
  | Steps | 1 |
114
  | PDG | Disabled |
115
- | PDG strength (if enabled) | 1.1 |
116
 
117
  ```python
118
  from m_diffae import MDiffAEInferenceConfig
@@ -126,7 +115,7 @@ recon = model.decode(latents, height=H, width=W, inference_config=cfg)
126
 
127
  ```bibtex
128
  @misc{m_diffae,
129
- title = {mDiffAE: A Masked Diffusion Autoencoder with Flat Decoder and Token-Level Guidance},
130
  author = {data-archetype},
131
  year = {2026},
132
  month = mar,
 
13
 
14
 **mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
15
  A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
16
+ spatial bottleneck and a flat 4-block decoder. Uses decoder token masking
17
+ as an implicit regularizer instead of REPA alignment.
 
18
 
19
  This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
20
  Bottleneck: **64 channels** at patch size 16
 
73
 
74
  **Decoder**: VP diffusion conditioned on encoder latents and timestep via
75
  shared-base + per-layer low-rank AdaLN-Zero. 4 flat
76
+ sequential blocks (no skip connections).
77
+
78
+ **Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
79
+ + 2 end) with skip connections and 128 bottleneck channels (needed partly because
80
+ REPA occupies half the channels). mDiffAE uses 4 flat blocks
81
+ with no skip connections and 64 bottleneck channels
82
+ (12x compression vs
83
+ iRDiffAE's 6x), which gives better channel utilisation.
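The compression ratios quoted above follow from the patch geometry; a quick sanity check, assuming the figures count raw RGB values per patch against bottleneck channels:

```python
# Values per RGB patch at patch size 16, divided by bottleneck channels.
patch = 16
input_vals = 3 * patch * patch     # 768 pixel values per patch

mdiffae_ratio = input_vals / 64    # mDiffAE: 64-channel bottleneck
irdiffae_ratio = input_vals / 128  # iRDiffAE: 128-channel bottleneck
print(mdiffae_ratio, irdiffae_ratio)  # 12.0 6.0
```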
 
 
 
 
 
 
84
 
85
  ### Key Differences from iRDiffAE
86
 
 
90
  | Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
91
  | PDG mechanism | Block dropping | **Token masking** |
92
  | Training regularizer | REPA + covariance reg | **Decoder token masking** |
 
93
 
94
  ## Recommended Settings
95
 
96
+ Best quality is achieved with **1 DDIM step** and PDG disabled.
97
+ PDG can sharpen images but should be kept very low (1.01–1.05).
 
 
 
98
 
99
  | Setting | Default |
100
  |---|---|
101
  | Sampler | DDIM |
102
  | Steps | 1 |
103
  | PDG | Disabled |
104
+ | PDG strength (if enabled) | 1.05 |
105
 
106
  ```python
107
  from m_diffae import MDiffAEInferenceConfig
 
115
 
116
  ```bibtex
117
  @misc{m_diffae,
118
+ title = {mDiffAE: A Fast Masked Diffusion Autoencoder},
119
  author = {data-archetype},
120
  year = {2026},
121
  month = mar,
technical_report_mdiffae.md CHANGED
@@ -1,32 +1,36 @@
1
- # mDiffAE: Masked Diffusion AutoEncoder — Technical Report
2
 
3
 **Version 1** — March 2026
4
 
5
  ## 1. Introduction
6
 
7
- mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family, which provides a fast, single-GPU-trainable diffusion autoencoder platform capable of high-quality image reconstruction. See that report for full background on the shared components: VP diffusion math (logSNR schedules, alpha/sigma, x-prediction), DiCo block architecture (depthwise conv + compact channel attention + GELU MLP), patchify encoder (PixelUnshuffle + 1×1 conv), shared-base + low-rank AdaLN-Zero conditioning, and the Path-Drop Guidance (PDG) concept.
8
 
9
- The iRDiffAE platform is designed to make it easy to experiment with different ways of regularizing the latent space. iRDiffAE v1 used REPA — aligning encoder features with a frozen DINOv2 teacher — which produces well-structured latents but tends toward overly smooth representations that are hard to reconcile with fine detail. Here we take a different approach entirely: **decoder token masking**.
10
 
11
  ### 1.1 Token Masking as Regularizer
12
 
13
- A fraction of the time (50% of samples per batch), the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups, and within each group a single token is randomly kept while the other three are replaced with a learned mask feature. Hiding such a large fraction (75%) pushes the encoder to learn a form of representation consistency — each spatial token must carry enough information to support reconstruction even when most of its neighbors are absent. A smaller masking fraction helps downstream models learn sharp details quickly, but they fail to learn spatial coherence nearly as well — the task becomes too close to local inpainting, and the encoder is not pressured into globally consistent representations. The importance of a high masking ratio echoes findings in the masked autoencoder literature (He et al., 2022); we tested lower ratios and confirmed this tradeoff empirically.
14
 
15
- The 50% per-sample application probability is the knob that controls the compromise between reconstruction quality and latent space quality: samples that receive masking push the encoder toward consistent representations, while unmasked samples maintain reconstruction fidelity.
16
 
17
  ### 1.2 Latent Noise Regularization
18
 
19
- To further regularize the latent space, we retain the random latent noising mechanism 10% of the time. However, unlike the pixel-space diffusion noise, the latent noise level is sampled independently using a **Beta(2,2)** distribution (stratified), with a **logSNR shift of +1.0** that biases it toward low noise levels (low *t* in our convention). This decouples the latent regularization schedule from the decoder's diffusion schedule, providing a gentle push toward noise-robust representations without disrupting reconstruction training.
20
 
21
  ### 1.3 Simplified Decoder
22
 
23
- To keep the representational pressure on the encoder, we restrict the decoder to only **4 blocks** (down from 8 in iRDiffAE v1) and simplify it to a flat sequential architecture — no start/middle/end block groups, no skip connections. This halves the decoder's parameter count and makes it roughly 2× faster, while forcing the encoder to compensate by producing more informative latents.
24
 
25
- ### 1.4 Empirical Results
26
 
27
- Compared to the REPA-regularized iRDiffAE v1, mDiffAE achieves slightly higher PSNR (to be confirmed with final benchmarks, but initial results were quite decisive) and produces a less oversmoothed but very locally consistent latent space PCA. In downstream diffusion model training, mDiffAE's latent space does not exhibit the very steep initial loss descent seen with iRDiffAE, but it quickly catches up after 50k–100k training steps, producing more spatially coherent images earlier with better high-frequency detail.
28
 
29
- ### 1.5 References
 
 
 
 
30
 
31
 - He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
32
  - Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.
@@ -40,7 +44,7 @@ Compared to the REPA-regularized iRDiffAE v1, mDiffAE achieves slightly higher P
40
  | Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
41
 | Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
42
 | PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
43
- | PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.05–1.2 only) |
44
  | Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
45
  | Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
46
  | Depthwise kernel | 7Γ—7 | 7Γ—7 (same) |
@@ -58,54 +62,35 @@ During training, with 50% probability per sample:
58
  3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
59
  4. The decoder processes the partially-masked input normally through all blocks
60
 
61
- ### 3.2 Inference-Time PDG via Token Masking
62
-
63
- At inference, the trained mask_feature enables Path-Drop Guidance (PDG) through token-level masking rather than block-level dropping:
64
 
65
- - **Conditional pass**: Full decoder input (no masking)
66
- - **Unconditional pass**: Apply 2×2 groupwise token masking at the trained ratio (75%)
67
- - **Guided output**: `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`
68
-
69
- Because the decoder has only 4 blocks and no skip connections, the guidance signal from token masking is very concentrated. This makes PDG extremely sensitive — even a strength of 1.2 produces noticeable sharpening, and values above 1.5 cause severe artifacts.
70
 
71
  ## 4. Flat Decoder Architecture
72
 
73
  ### 4.1 iRDiffAE v1 Decoder (for comparison)
74
 
75
- The iRDiffAE v1 decoder uses an 8-block layout split into three groups with a skip connection:
76
-
77
  ```
78
 Fused input → Start blocks (2) → [save for skip] →
79
 Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
80
 End blocks (2) → Output head
81
  ```
82
 
83
- The skip connection concatenates the start-block output with the middle-block output and fuses them through a learned Conv1×1 before feeding into the end blocks. For PDG, the entire middle block computation is dropped and replaced with a broadcasted learned `mask_feature`, effectively removing all 4 middle blocks from the forward pass. This produces a coarse "unconditional" signal for classifier-free guidance.
84
 
85
  ### 4.2 mDiffAE v1 Decoder
86
 
87
- The mDiffAE decoder replaces this with a flat sequential architecture — no block groups, no skip connection:
88
-
89
  ```
90
- Input: x_t [B, 3, H, W], t [B], z [B, 64, h, w]
91
-
92
 Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
93
 LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
94
 FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
95
 [Optional: token masking for PDG]
96
 TimeEmbed(t) → cond [B, 896]
97
- Block_0(fused, AdaLN(cond)) → ...
98
- Block_1(..., AdaLN(cond)) → ...
99
- Block_2(..., AdaLN(cond)) → ...
100
- Block_3(..., AdaLN(cond)) → out [B, 896, h, w]
101
 RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
102
  ```
103
 
104
- With only 4 blocks and no skip fusion layer, the decoder has roughly half the parameters of iRDiffAE's decoder. The `fuse_skip` Conv1×1 layer is eliminated entirely. For PDG, instead of dropping blocks, 75% of spatial tokens in the fused input are replaced with a learned `mask_feature` before the blocks run. This token-level masking provides a finer-grained guidance signal but is much more sensitive to strength — the decoder sees the full block computation in both the conditional and unconditional paths, so the difference between them is subtle.
105
-
106
- ### 4.3 Bottleneck
107
-
108
- The bottleneck dimension is halved from 128 channels (iRDiffAE) to 64 channels, giving a 12x compression ratio at patch size 16 (vs 6x for iRDiffAE). Despite the higher compression, the masking regularizer forces the encoder to produce informative per-token representations, maintaining reconstruction quality.
109
 
110
  ## 5. Model Configuration
111
 
@@ -133,8 +118,8 @@ Training checkpoint: step 708,000 (EMA weights).
133
  |---------|-------|-------|
134
  | Sampler | DDIM | Best for 1-step |
135
  | Steps | 1 | PSNR-optimal |
136
- | PDG | Disabled | Default, safest |
137
- | PDG strength | 1.05–1.2 | If enabled, very sensitive |
138
 
139
  ## 7. Results
140
 
 
1
+ # mDiffAE: A Fast Masked Diffusion Autoencoder — Technical Report
2
 
3
 **Version 1** — March 2026
4
 
5
  ## 1. Introduction
6
 
7
+ mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
8
 
9
+ iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.
10
 
11
  ### 1.1 Token Masking as Regularizer
12
 
13
+ With 50% probability per sample, the decoder only sees **25% of tokens** in the fused conditioning input. The spatial token grid is divided into non-overlapping 2×2 groups; within each group a single token is randomly kept and the other three are replaced with a learned mask feature. The high masking ratio (75%) forces each spatial token to carry enough information for reconstruction even when most neighbors are absent. Lower masking ratios help downstream models learn sharp details quickly but fail to learn spatial coherence — the task becomes too close to local inpainting. We tested lower ratios and confirmed this tradeoff (see also He et al., 2022).
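A minimal numpy sketch of the 2×2 groupwise masking described above. The function name, array shapes, and zero-valued mask feature are illustrative only; the real implementation operates on the fused decoder tokens with a learned `mask_feature`:

```python
import numpy as np

def mask_tokens(tokens, mask_feature, rng):
    """Within each non-overlapping 2x2 group of the [h, w, d] token grid,
    keep one random token and replace the other three with mask_feature
    (75% masking ratio). Shapes are assumptions for this sketch."""
    h, w, d = tokens.shape
    out = np.broadcast_to(mask_feature, tokens.shape).copy()
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            di, dj = rng.integers(0, 2, size=2)  # position of survivor
            out[i + di, j + dj] = tokens[i + di, j + dj]
    return out

rng = np.random.default_rng(0)
toks = rng.standard_normal((4, 4, 8))
masked = mask_tokens(toks, np.zeros(8), rng)
kept = int((masked != 0).any(axis=-1).sum())  # 4 of 16 tokens survive (25%)
```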
14
 
15
+ The 50% application probability controls the tradeoff between reconstruction quality and latent regularity.
16
 
17
  ### 1.2 Latent Noise Regularization
18
 
19
+ 10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.
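The latent noise-level sampling can be sketched as follows. The Beta(2,2) draw and the +1.0 logSNR shift are from this report; the concrete logSNR mapping and the function name are assumptions for illustration, and the stratification is omitted for brevity:

```python
import numpy as np

def sample_latent_logsnr(rng, n, shift=1.0):
    """Sketch: noise-time t ~ Beta(2, 2), converted to a logSNR via an
    assumed cosine-style mapping, then shifted by +1.0 to bias toward
    low noise (higher SNR)."""
    t = rng.beta(2.0, 2.0, size=n)
    logsnr = -2.0 * np.log(np.tan(np.pi * t / 2.0))  # assumed mapping
    return logsnr + shift
```

The shift acts purely additively in logSNR space, so it decouples this regularizer from the decoder's own diffusion schedule.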
20
 
21
  ### 1.3 Simplified Decoder
22
 
23
+ The decoder uses only **4 blocks** (down from 8 in iRDiffAE v1) in a flat sequential layout — no start/middle/end groups, no skip connections. This halves the decoder's parameter count and is roughly 2× faster.
24
 
25
+ ### 1.4 Bottleneck
26
 
27
+ iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies half the channels. Without REPA, 64 channels suffice and give better channel utilisation. This yields a 12× compression ratio at patch size 16 (vs 6× for iRDiffAE).
28
 
29
+ ### 1.5 Empirical Results
30
+
31
+ Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
32
+
33
+ ### 1.6 References
34
 
35
 - He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). *Masked Autoencoders Are Scalable Vision Learners*. CVPR 2022.
36
  - Li, T., Chang, H., Mishra, S.K., Zhang, H., Katabi, D., & Krishnan, D. (2023). *MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis*. CVPR 2023.
 
44
  | Decoder topology | START_MIDDLE_END_SKIP_CONCAT | **FLAT (no skip concat)** |
45
 | Skip fusion | Yes (`fuse_skip` Conv1×1) | **No** |
46
 | PDG mechanism | Drop middle blocks → mask_feature | **Token-level masking** (75% spatial tokens → mask_feature) |
47
+ | PDG sensitivity | Moderate (strength 1.5–3.0) | **Very sensitive** (strength 1.01–1.05) |
48
  | Training regularizer | REPA (half-channel DINOv2 alignment) + covreg | **Decoder token masking** (75% ratio, 50% apply prob) |
49
  | Latent noise reg | Same mechanism | **Independent Beta(2,2), logSNR shift +1.0, 10% prob** |
50
  | Depthwise kernel | 7Γ—7 | 7Γ—7 (same) |
 
62
  3. Masked tokens are replaced with a learned `mask_feature` parameter (same dimensionality as model_dim)
63
  4. The decoder processes the partially-masked input normally through all blocks
64
 
65
+ ### 3.2 PDG at Inference
 
 
66
 
67
+ At inference, the trained mask_feature can be used for Path-Drop Guidance (PDG): the conditional pass uses the full input, the unconditional pass applies 2×2 groupwise masking at 75%, and the two are interpolated as usual. PDG can sharpen reconstructions but should be kept very low (strength 1.01–1.05); higher values cause artifacts.
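The interpolation step is the standard guided-output formula `x0 = x0_uncond + strength × (x0_cond − x0_uncond)`; a direct numpy transcription (the function name is illustrative):

```python
import numpy as np

def pdg_combine(x0_cond, x0_uncond, strength=1.05):
    """Guided prediction: extrapolate from the unconditional (masked)
    pass toward the conditional pass. strength = 1.0 recovers the
    conditional prediction exactly."""
    return x0_uncond + strength * (x0_cond - x0_uncond)

# strength slightly above 1 nudges just past the conditional output
x0 = pdg_combine(np.array([2.0]), np.array([1.0]), strength=1.05)
```

Strengths far above 1 amplify whatever the masked pass failed to reconstruct, which is why only tiny values are safe here.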
 
 
 
 
68
 
69
  ## 4. Flat Decoder Architecture
70
 
71
  ### 4.1 iRDiffAE v1 Decoder (for comparison)
72
 
 
 
73
  ```
74
 Fused input → Start blocks (2) → [save for skip] →
75
 Middle blocks (4) → [cat with saved skip] → FuseSkip Conv1×1 →
76
 End blocks (2) → Output head
77
  ```
78
 
79
+ 8 blocks split into three groups with a skip connection. For PDG, the middle blocks are dropped and replaced with a learned mask feature.
80
 
81
  ### 4.2 mDiffAE v1 Decoder
82
 
 
 
83
  ```
 
 
84
 Patchify(x_t) → RMSNorm → x_feat [B, 896, h, w]
85
 LatentUp(z) → RMSNorm → z_up [B, 896, h, w]
86
 FuseIn(cat(x_feat, z_up)) → fused [B, 896, h, w]
87
 [Optional: token masking for PDG]
88
 TimeEmbed(t) → cond [B, 896]
89
+ Block_0 → Block_1 → Block_2 → Block_3 → out [B, 896, h, w]
90
 RMSNorm → Conv1x1 → PixelShuffle → x0_hat [B, 3, H, W]
91
  ```
92
 
93
+ 4 flat sequential blocks, no skip connections. Roughly half the decoder parameters of iRDiffAE.
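The flat trunk reduces to a plain loop; a sketch with placeholder blocks standing in for the AdaLN-Zero-conditioned DiCo blocks (block internals are not shown and the function name is illustrative):

```python
def flat_decoder_trunk(fused, cond, blocks):
    """mDiffAE-style flat trunk: run 4 blocks sequentially with no
    block groups and no skip connections or skip fusion."""
    h = fused
    for block in blocks:
        h = block(h, cond)
    return h

# toy stand-in blocks: each just adds the conditioning scalar
blocks = [lambda h, c: h + c] * 4
out = flat_decoder_trunk(0.0, 1.0, blocks)  # four additions -> 4.0
```

Contrast with iRDiffAE, where the trunk would also save the start-block output and concatenate it back in before the end blocks.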
 
 
 
 
94
 
95
  ## 5. Model Configuration
96
 
 
118
  |---------|-------|-------|
119
  | Sampler | DDIM | Best for 1-step |
120
  | Steps | 1 | PSNR-optimal |
121
+ | PDG | Disabled | Default |
122
+ | PDG strength | 1.01–1.05 | If enabled; can sharpen but artifacts above ~1.1 |
123
 
124
  ## 7. Results
125