data-archetype committed · Commit d9ec2a4 (verified) · 1 parent: 9b877c3

Upload folder using huggingface_hub

Files changed (1):
  1. technical_report_mdiffae.md +3 -3
technical_report_mdiffae.md CHANGED

```diff
@@ -4,7 +4,7 @@
 
 ## 1. Introduction
 
-mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
+mDiffAE (**M**asked **Diff**usion **A**uto**E**ncoder) builds on the [iRDiffAE](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) model family, which provides a fast, single-GPU-trainable diffusion autoencoder with good reconstruction quality, making it a good platform for experimenting with latent space regularization. See that report for full background on the shared components: VP diffusion, DiCo blocks, patchify encoder, AdaLN-Zero conditioning, and Path-Drop Guidance (PDG).
 
 iRDiffAE v1 used REPA (aligning encoder features with a frozen DINOv2 teacher) to regularize the latent space. REPA produces well-structured latents but tends toward overly smooth representations. Here we replace it with **decoder token masking**.
 
@@ -16,7 +16,7 @@ The 50% application probability controls the tradeoff between reconstruction qua
 
 ### 1.2 Latent Noise Regularization
 
-10% of the time, random noise is added to the latent representation. The noise level is sampled from a **Beta(2,2)** distribution with a **logSNR shift of +1.0** (biasing toward low noise), independently of the pixel-space diffusion schedule.
+10% of the time, random noise is added to the latent representation. Unlike iRDiffAE (and the DiTo paper), which synchronizes the latent noise level with the pixel-space diffusion timestep, here the noise level is sampled independently from a **Beta(2,2)** distribution with a **logSNR shift of +1.0**, biasing it toward low noise. This improves robustness to incomplete convergence of downstream models and encourages local smoothness of the latent space distribution.
 
 ### 1.3 Simplified Decoder
 
@@ -28,7 +28,7 @@ iRDiffAE v1 used 128 bottleneck channels, partly because REPA alignment occupies
 
 ### 1.5 Empirical Results
 
-Compared to iRDiffAE v1, mDiffAE achieves comparable PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
+Compared to iRDiffAE v1, mDiffAE achieves slightly higher PSNR with less oversmoothed latent PCA. In downstream diffusion model training, mDiffAE's latent space does not show the steep initial loss descent of iRDiffAE, but catches up after 50k–100k steps, producing more spatially coherent images with better high-frequency detail.
 
 ### 1.6 References
 
```
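The decoder token masking described in the new text (applied with 50% probability per the hunk context) can be sketched as follows. This is a minimal NumPy illustration, not the report's implementation: `mask_decoder_tokens`, its `mask_ratio` parameter, and the use of a learned mask token vector are all assumptions; the report excerpt specifies only the 50% application probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_decoder_tokens(tokens, mask_token, p_apply=0.5, mask_ratio=0.5):
    """Decoder token masking (illustrative sketch).

    With probability p_apply (the 50% application probability from the
    report), replace a random subset of decoder input tokens with a mask
    token. mask_ratio is a hypothetical parameter; the excerpt does not
    state the exact per-token masking rate.
    """
    if rng.random() >= p_apply:
        return tokens  # masking not applied for this sample
    n, _ = tokens.shape
    keep = rng.random(n) >= mask_ratio        # True = token survives
    # Broadcast the (d,)-shaped mask token across all masked positions.
    return np.where(keep[:, None], tokens, mask_token)
```

Because masking is applied per sample rather than always, the decoder sees both masked and clean inputs during training, which is how the 50% probability trades off reconstruction quality against regularization strength.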
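The latent noise regularization in the revised Section 1.2 (10% application rate, Beta(2,2) noise level, +1.0 logSNR shift) could look roughly like the sketch below. The mapping from the Beta sample to logSNR via a logit, and the variance-preserving alpha/sigma parameterization, are assumptions on my part; the report excerpt does not give the exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def regularize_latent(z, p_apply=0.1, logsnr_shift=1.0):
    """Latent noise regularization (illustrative sketch, not the exact
    mDiffAE schedule).

    With probability p_apply, perturb the latent z with Gaussian noise.
    The noise level is drawn from Beta(2,2) independently of the
    pixel-space diffusion schedule, then shifted by +1.0 in logSNR space,
    which biases the perturbation toward low noise.
    """
    if rng.random() >= p_apply:
        return z  # no perturbation for this sample
    t = rng.beta(2.0, 2.0)                 # noise level in (0, 1)
    # Hypothetical t -> logSNR mapping via the logit, plus the shift.
    logsnr = np.log((1.0 - t) / t) + logsnr_shift
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))  # signal scale
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))   # noise scale
    # Variance preserving: alpha**2 + sigma**2 == 1.
    return alpha * z + sigma * rng.standard_normal(z.shape)
```

Sampling the level independently (rather than tying it to the pixel-space timestep, as iRDiffAE did) is exactly the change the diff highlights: the latent perturbation becomes its own regularizer rather than a mirror of the diffusion schedule.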