---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---

# mdiffae_v1

**mDiffAE**: **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. Uses decoder token masking as an implicit regularizer
instead of REPA alignment.

This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
Bottleneck: **64 channels** at patch size 16
(compression ratio 12x).

## Documentation

- [Technical Report](technical_report_mdiffae.md): architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md): full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results (interactive viewer)](https://huggingface.co/spaces/data-archetype/mdiffae-results): full-resolution side-by-side comparison

## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default: PSNR-optimal)
H, W = images.shape[-2], images.shape[-1]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.

## Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |

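The decoder token masking listed in the table can be sketched roughly as below. The 0.75 masking ratio and 0.5 apply probability come from the table; the helper name and the exact masking details (zeroing vs. dropping tokens) are assumptions for illustration, not the actual mDiffAE code.

```python
import torch

def mask_decoder_tokens(tokens, ratio=0.75, apply_prob=0.5):
    """Zero out `ratio` of decoder tokens; applied with prob `apply_prob`.

    tokens: [B, N, D]. Hypothetical helper sketching the regularizer;
    the real implementation may differ (e.g. dropping tokens instead
    of zeroing them).
    """
    if torch.rand(()).item() >= apply_prob:
        return tokens  # regularizer skipped for this batch
    B, N, _ = tokens.shape
    num_keep = max(1, int(N * (1.0 - ratio)))
    # Per-sample random subset of tokens to keep.
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    mask = torch.zeros(B, N, 1, device=tokens.device)
    mask.scatter_(1, keep.unsqueeze(-1), 1.0)
    return tokens * mask
```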
**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.

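The patchify stem described above can be sketched in a few lines. The dimensions (patch size 16, model dim 896) follow the Architecture table; the class name is hypothetical.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Sketch of the patchify stem: PixelUnshuffle + 1x1 conv."""

    def __init__(self, patch: int = 16, dim: int = 896, in_ch: int = 3):
        super().__init__()
        # [B, 3, H, W] -> [B, 3*p*p, H/p, W/p]
        self.unshuffle = nn.PixelUnshuffle(patch)
        # Project flattened patches to the model dimension.
        self.proj = nn.Conv2d(in_ch * patch * patch, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.unshuffle(x))

x = torch.randn(1, 3, 64, 64)
tokens = PatchifyStem()(x)  # [1, 896, 4, 4]
```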
**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).

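One way to read "shared-base + per-layer low-rank AdaLN-Zero" is sketched below: a base modulation projection whose weights would be shared across layers, plus a cheap per-layer rank-128 correction whose up-projection is zero-initialized. The rank (128) and model dim (896) come from the Architecture table; the class structure and names are assumptions, not the exact mDiffAE implementation.

```python
import torch
import torch.nn as nn

class LowRankAdaLNZero(nn.Module):
    """Sketch of low-rank AdaLN-Zero modulation for one decoder block."""

    def __init__(self, dim: int = 896, rank: int = 128):
        super().__init__()
        # Base projection (in mDiffAE the base is shared across layers).
        self.shared = nn.Linear(dim, 3 * dim)
        # Per-layer low-rank correction; rank << dim keeps it cheap.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, 3 * dim, bias=False)
        nn.init.zeros_(self.up.weight)  # correction starts as a no-op

    def forward(self, h: torch.Tensor, cond: torch.Tensor):
        # h: [B, N, dim] tokens; cond: [B, dim] latent + timestep embedding
        mod = self.shared(cond) + self.up(self.down(cond))
        shift, scale, gate = mod.chunk(3, dim=-1)
        h_mod = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h_mod, gate  # gate scales the block's residual output
```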
**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks
with no skip connections and 64 bottleneck channels
(12x compression vs iRDiffAE's 6x), which gives better channel utilisation.

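The compression figures above follow directly from the patch geometry: each 16x16 RGB patch carries 16 × 16 × 3 = 768 values, which the bottleneck reduces to 64 channels (mDiffAE) or 128 channels (iRDiffAE) per patch:

```python
# Values per input patch vs. bottleneck channels per patch.
patch, in_ch = 16, 3
values_per_patch = patch * patch * in_ch   # 768

mdiffae_ratio = values_per_patch / 64      # 12x, as quoted above
irdiffae_ratio = values_per_patch / 128    # 6x
```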
### Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images but should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
H, W = images.shape[-2], images.shape[-1]
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0