---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---
# mdiffae_v1
**mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. Uses decoder token masking as an implicit regularizer
instead of REPA alignment.
This variant (mdiffae_v1): 81.4M parameters, 310.6 MB.
Bottleneck: **64 channels** at patch size 16
(compression ratio 12x).
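The 12x figure follows from the patch geometry: each 16x16 RGB patch carries 3 · 16² = 768 input values, while the bottleneck keeps 64 channels per patch position. A quick sanity check:

```python
# Compression ratio of the spatial bottleneck, using the values from the
# architecture table below (RGB input, patch size 16, 64 bottleneck channels).
patch_size = 16
in_channels = 3
bottleneck_dim = 64

values_per_patch = in_channels * patch_size ** 2   # 3 * 256 = 768 input values
ratio = values_per_patch / bottleneck_dim          # 768 / 64 = 12.0
print(ratio)  # 12.0
```

The same arithmetic with iRDiffAE's 128 bottleneck channels gives 768 / 128 = 6x.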
## Documentation
- [Technical Report](technical_report_mdiffae.md) — architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) — full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) — full-resolution side-by-side comparison
## Quick Start
```python
import torch
from m_diffae import MDiffAE
# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")
# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)
# Decode (1 step by default — PSNR-optimal)
H, W = images.shape[-2:]  # output resolution
recon = model.decode(latents, height=H, width=W)
# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```
> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
## Architecture
| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.
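The patchify stem can be sketched in a few lines of PyTorch. This is an illustration, not the model's actual code: the card specifies only the mechanism (PixelUnshuffle + 1x1 conv), patch size 16, and model dim 896.

```python
import torch
import torch.nn as nn

# Patchify stem sketch: PixelUnshuffle folds each 16x16 patch into channels,
# then a 1x1 conv mixes them into the model dim (896).
patchify = nn.Sequential(
    nn.PixelUnshuffle(16),                    # [B, 3, H, W] -> [B, 768, H/16, W/16]
    nn.Conv2d(3 * 16 * 16, 896, kernel_size=1),
)

x = torch.randn(1, 3, 64, 64)
tokens = patchify(x)
print(tokens.shape)  # torch.Size([1, 896, 4, 4])
```

Because PixelUnshuffle is a pure reshape, the 1x1 conv is the only learned component of the stem; each output position corresponds to exactly one 16x16 input patch.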
**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).
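The low-rank AdaLN-Zero conditioning can be sketched as follows. All names here are illustrative, not the model's actual API; the card documents only the mechanism, the rank (128), and that the shared base embedding of (timestep, latent) is computed upstream and reused per layer.

```python
import torch
import torch.nn as nn

class LowRankAdaLNZero(nn.Module):
    """Sketch of per-layer low-rank AdaLN-Zero modulation.

    `cond` is assumed to be the shared base embedding of (timestep,
    encoder latent), produced once and passed to every decoder block.
    """

    def __init__(self, dim=896, rank=128):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Per-layer low-rank projection: dim -> rank -> 3*dim
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, 3 * dim)
        # "-Zero": modulation starts at zero, so each block begins as identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x, cond, block):
        # x: [B, N, D] tokens; cond: [B, D]; block: the wrapped decoder block
        shift, scale, gate = self.up(self.down(cond)).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale[:, None]) + shift[:, None]
        return x + gate[:, None] * block(h)   # gated residual, zero at init
```

Zero-initializing the up-projection makes every block a no-op at the start of training, a standard AdaLN-Zero trick for stable diffusion-transformer training; the low-rank factorization keeps the per-layer conditioning cost at `dim * rank + rank * 3 * dim` parameters instead of `dim * 3 * dim`.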
**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks
with no skip connections and 64 bottleneck channels
(12x compression vs
iRDiffAE's 6x), which gives better channel utilisation.
### Key Differences from iRDiffAE
| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |
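The token-masking regularizer from the tables above can be sketched as a small training-time helper. This is a hypothetical function, not the model's actual API; the card specifies only the 75% mask ratio and 50% apply probability.

```python
import torch

def mask_decoder_tokens(tokens, mask_token, ratio=0.75, apply_prob=0.5):
    """Replace a random `ratio` of decoder tokens with a learned mask token.

    Applied per training batch with probability `apply_prob`. Illustrative
    sketch only -- names and signature are assumptions.
    """
    if torch.rand(()) >= apply_prob:                      # skip ~half the batches
        return tokens
    B, N, D = tokens.shape
    num_mask = int(N * ratio)                             # 75% of tokens
    ids = torch.rand(B, N).argsort(dim=1)[:, :num_mask]   # random indices per sample
    out = tokens.clone()
    out[torch.arange(B)[:, None], ids] = mask_token       # broadcast [D] mask token
    return out
```

Because the decoder must reconstruct the image from the surviving 25% of tokens, masking pressures the encoder to spread information across the latent grid, playing the regularizing role that REPA alignment plays in iRDiffAE.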
## Recommended Settings
Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images, but its strength should be kept very low (1.01–1.05).
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |
```python
from m_diffae import MDiffAEInferenceConfig
# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)  # H, W: image height/width
```
## Citation
```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```
## Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
## License
Apache 2.0