data-archetype committed · Commit d9ec2a4 (verified) · 1 parent: 9b877c3

Upload folder using huggingface_hub

Files changed (1):
  1. technical_report_mdiffae.md +3 -3
technical_report_mdiffae.md CHANGED

```diff
@@ -4,7 +4,7 @@
 
 ## 1. Introduction
 
-mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
+mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family, which provides a fast, single-GPU-trainable diffusion autoencoder with good reconstruction quality, making it a good platform for experimenting with latent space regularization. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
 
 iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.
 
@@ -16,7 +16,7 @@ The 50% application probability controls the tradeoff between reconstruction qua
 
 ### 1.2 Latent Noise Regularization
 
-10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.
+10% of the time, random noise is added to the latent representation. Unlike iRDiffAE (and the DiTo paper), which synchronizes the latent noise level with the pixel-space diffusion timestep, here the noise level is sampled independently from a **Beta(2,2)** distribution with a **logSNR shift of +1.0**, biasing it toward low noise. This improves robustness to incomplete convergence of downstream models and encourages local smoothness of the latent space distribution.
 
 ### 1.3 Simplified Decoder
 
@@ -28,7 +28,7 @@ iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies
 
 ### 1.5 Empirical Results
 
-Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
+Compared to iRDiffAE v1, mDiffAE achieves slightly higher PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
 
 ### 1.6 References
 
```
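The decoder token masking described in the new text (applied with 50% probability per the hunk context) can be sketched as follows. This is a minimal NumPy illustration, not the report's implementation: `mask_decoder_tokens`, its `mask_ratio` parameter, and the use of a learned mask token vector are all assumptions; the report excerpt specifies only the 50% application probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_decoder_tokens(tokens, mask_token, p_apply=0.5, mask_ratio=0.5):
    """Decoder token masking (illustrative sketch).

    With probability p_apply (the 50% application probability from the
    report), replace a random subset of decoder input tokens with a mask
    token. mask_ratio is a hypothetical parameter; the excerpt does not
    state the exact per-token masking rate.
    """
    if rng.random() >= p_apply:
        return tokens  # masking not applied for this sample
    n, _ = tokens.shape
    keep = rng.random(n) >= mask_ratio        # True = token survives
    # Broadcast the (d,)-shaped mask token across all masked positions.
    return np.where(keep[:, None], tokens, mask_token)
```

Because masking is applied per sample rather than always, the decoder sees both masked and clean inputs during training, which is how the 50% probability trades off reconstruction quality against regularization strength.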
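The latent noise regularization in the revised Section 1.2 (10% application rate, Beta(2,2) noise level, +1.0 logSNR shift) could look roughly like the sketch below. The mapping from the Beta sample to logSNR via a logit, and the variance-preserving alpha/sigma parameterization, are assumptions on my part; the report excerpt does not give the exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def regularize_latent(z, p_apply=0.1, logsnr_shift=1.0):
    """Latent noise regularization (illustrative sketch, not the exact
    mDiffAE schedule).

    With probability p_apply, perturb the latent z with Gaussian noise.
    The noise level is drawn from Beta(2,2) independently of the
    pixel-space diffusion schedule, then shifted by +1.0 in logSNR space,
    which biases the perturbation toward low noise.
    """
    if rng.random() >= p_apply:
        return z  # no perturbation for this sample
    t = rng.beta(2.0, 2.0)                 # noise level in (0, 1)
    # Hypothetical t -> logSNR mapping via the logit, plus the shift.
    logsnr = np.log((1.0 - t) / t) + logsnr_shift
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))  # signal scale
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))   # noise scale
    # Variance preserving: alpha**2 + sigma**2 == 1.
    return alpha * z + sigma * rng.standard_normal(z.shape)
```

Sampling the level independently (rather than tying it to the pixel-space timestep, as iRDiffAE did) is exactly the change the diff highlights: the latent perturbation becomes its own regularizer rather than a mirror of the diffusion schedule.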