---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - pytorch
  - masked-autoencoder
library_name: mdiffae
---

# mdiffae_v1

**mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder. A fast, single-GPU-trainable diffusion autoencoder with a **64-channel** spatial bottleneck. It uses decoder token masking as an implicit regularizer instead of REPA alignment.

This variant (mdiffae_v1): 81.4M parameters, 310.6 MB. Bottleneck: **64 channels** at patch size 16 (compression ratio 12x).

## Documentation

- [Technical Report](technical_report_mdiffae.md) — architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) — full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) — full-resolution side-by-side comparison

## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from the HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default — PSNR-optimal)
_, _, H, W = images.shape
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
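Since `encode` expects `H` and `W` divisible by 16, inputs of arbitrary size need to be padded (or resized) first. A minimal sketch of such a helper — `pad_to_multiple` is a hypothetical name, not part of the `m_diffae` API:

```python
def pad_to_multiple(height: int, width: int, multiple: int = 16) -> tuple[int, int]:
    """Round spatial dims up to the nearest multiple of `multiple`.

    Hypothetical convenience helper (not part of m_diffae); returns the
    target (H, W) to pad the image to before calling model.encode().
    """
    pad_h = (-height) % multiple
    pad_w = (-width) % multiple
    return height + pad_h, width + pad_w
```

The returned deltas can then be applied with e.g. `torch.nn.functional.pad` before encoding, and the reconstruction cropped back to the original size afterwards.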
## Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.

**Decoder**: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks (no skip connections).

**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle + 2 end) with skip connections and 128 bottleneck channels (needed partly because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's 6x), which gives better channel utilisation.

### Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled. PDG can sharpen images but should be kept very low (1.01–1.05).
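The compression ratios quoted above follow directly from the patch size and bottleneck width: each 16x16 RGB patch carries 3 x 16 x 16 = 768 pixel values, which the encoder compresses to one latent vector per patch. A quick sanity check of the arithmetic (illustrative only, not part of the library):

```python
def compression_ratio(patch_size: int, in_channels: int, bottleneck_dim: int) -> float:
    """Pixel values per patch divided by latent channels per patch."""
    return (in_channels * patch_size ** 2) / bottleneck_dim

# mDiffAE v1: 64-channel bottleneck at patch size 16
print(compression_ratio(16, 3, 64))   # -> 12.0

# iRDiffAE v1: 128-channel bottleneck at the same patch size
print(compression_ratio(16, 3, 128))  # -> 6.0
```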
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0