---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---
# mdiffae-v2
**mDiffAE v2** – **M**asked **Diff**usion **A**uto**E**ncoder v2.
A fast, single-GPU-trainable diffusion autoencoder with a **96-channel**
spatial bottleneck and optional PDG sharpening.
**This is the recommended version** – it offers substantially better
reconstruction than [v1](https://huggingface.co/data-archetype/mdiffae-v1)
(+1.7 dB mean PSNR) while maintaining the same or better convergence for
downstream latent diffusion models.
This variant (mdiffae-v2): 120.9M parameters, 461.2 MB.
Bottleneck: **96 channels** at patch size 16
(compression ratio 8x).
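The 8x figure follows directly from the numbers above. A quick check (arithmetic only, derived from the stated patch size and bottleneck width):

```python
# Compression ratio implied by the card's numbers: each 16x16 RGB patch
# holds 3 * 16 * 16 = 768 values and is encoded into one 96-channel
# latent token.
patch_size = 16
in_channels = 3
bottleneck_dim = 96

values_per_patch = in_channels * patch_size * patch_size  # 768
compression_ratio = values_per_patch / bottleneck_dim
print(compression_ratio)  # 8.0
```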
## Documentation
- [Technical Report](technical_report_mdiffae_v2.md) – architecture, training changes from v1, and results
- [Results – interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-v2-results) – full-resolution side-by-side comparison
- [mDiffAE v1](https://huggingface.co/data-archetype/mdiffae-v1) – previous version
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
## Quick Start
```python
import torch
from m_diffae_v2 import MDiffAEV2
# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")
# Encode
images = ... # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)
# Decode (2 steps by default – PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)
# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```
> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
## Architecture
| Property | Value |
|---|---|
| Parameters | 120,893,792 |
| File size | 461.2 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 (2+4+2 skip-concat) |
| Bottleneck dim | 96 |
| Compression ratio | 8x |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG | Conditioning degradation for CFG-style sharpening at inference |
| Training regularizer | Token masking (25-75% ratio, 90% apply prob) + Path drop (10% drop prob) |
**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck
RMSNorm (affine=False) normalizes the latent tokens.
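The patchify stem described above can be sketched in a few lines. This is a minimal illustration, assuming patch size 16 and model dim 896 from the architecture table; everything else (layer names, the exact projection) is an assumption, not the actual mdiffae-v2 implementation:

```python
import torch
import torch.nn as nn

# Hedged sketch of a patchify stem (PixelUnshuffle + 1x1 conv).
patch, dim = 16, 896
stem = nn.Sequential(
    # [B, 3, H, W] -> [B, 3*16*16, H/16, W/16]: fold each patch into channels
    nn.PixelUnshuffle(patch),
    # 1x1 conv projects the 768 patch values to the model dimension
    nn.Conv2d(3 * patch * patch, dim, kernel_size=1),
)

x = torch.randn(1, 3, 64, 64)
tokens = stem(x)
print(tokens.shape)  # torch.Size([1, 896, 4, 4])
```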
**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology
(2 start + 4 middle + 2 end blocks)
with skip connections from start to end blocks. No outer RMSNorms
(input, latent conditioning, and output norms all removed).
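The 2+4+2 skip-concat wiring can be sketched as below. Only the topology is taken from the description above; `block` is a plain MLP stand-in for the actual DiCo decoder block, and the concat-then-project fusion layer is an assumption:

```python
import torch
import torch.nn as nn

dim = 896

def block() -> nn.Module:
    # Stand-in for a DiCo decoder block (not the real block).
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

start = nn.ModuleList(block() for _ in range(2))
middle = nn.ModuleList(block() for _ in range(4))
end = nn.ModuleList(block() for _ in range(2))
# Fuse each end block's input with a start block's output: concat -> project.
fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(2))

def forward(x: torch.Tensor) -> torch.Tensor:
    skips = []
    for b in start:            # 2 start blocks, stash outputs for the end
        x = b(x)
        skips.append(x)
    for b in middle:           # 4 middle blocks, no skips
        x = b(x)
    for b, f in zip(end, fuse):  # 2 end blocks, concat matching start output
        x = b(f(torch.cat([x, skips.pop()], dim=-1)))
    return x

out = forward(torch.randn(1, 16, dim))
print(out.shape)  # torch.Size([1, 16, 896])
```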
### Changes from v1
| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | **96** (8x compression) |
| Decoder topology | 4 flat sequential blocks | **8 blocks (2+4+2 skip-concat)** |
| Token mask apply prob | 50% | **90%** |
| Token mask ratio | Fixed 75% | **Uniform(25%, 75%)** |
| PDG training regularizer | Token masking (50%) | **Token masking (90%) + path drop (10%)** |
| Latent noise prob | 10% | **50%** |
| Encoder input norm | RMSNorm (affine) | **Removed** |
| Decoder input norm | RMSNorm (affine) | **Removed** |
| Decoder latent norm | RMSNorm (affine) | **Removed** |
| Decoder output norm | RMSNorm (affine) | **Removed** |
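The token-masking regularizer in the tables above (uniform 25-75% ratio, applied with 90% probability) can be sketched as follows. Zeroing masked tokens is an assumption; the actual mechanism (e.g. a learned mask token) may differ:

```python
import torch

def mask_tokens(latents: torch.Tensor,
                apply_prob: float = 0.9,
                lo: float = 0.25, hi: float = 0.75) -> torch.Tensor:
    """Randomly zero out latent tokens. latents: [B, N, C]."""
    if torch.rand(()) >= apply_prob:       # skip masking 10% of the time
        return latents
    ratio = torch.empty(()).uniform_(lo, hi)   # mask ratio ~ U(25%, 75%)
    keep = torch.rand(latents.shape[:2]) >= ratio  # per-token keep mask [B, N]
    return latents * keep.unsqueeze(-1)

masked = mask_tokens(torch.randn(2, 64, 96))
print(masked.shape)  # torch.Size([2, 64, 96])
```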
## Recommended Settings
| Mode | Steps | PDG | Strength |
|---|---|---|---|
| **Default** (best PSNR) | 2 | off | – |
| **Sharp** (perceptual) | 10 | on | 2.0 |
```python
from m_diffae_v2 import MDiffAEV2InferenceConfig
# model, latents, H, W as in the Quick Start above
# Default – best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)
# Sharp mode – perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
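The CFG-style extrapolation behind PDG can be written in one line. This is a sketch of generic classifier-free-guidance-style extrapolation under the card's description (a prediction with degraded conditioning is pushed toward the fully conditioned one); the exact degradation and formula used by mdiffae-v2 are assumptions:

```python
def pdg_guide(pred_full: float, pred_degraded: float,
              strength: float = 2.0) -> float:
    # strength = 1.0 recovers the plain conditioned prediction;
    # strength > 1.0 extrapolates past it, sharpening the output.
    return pred_degraded + strength * (pred_full - pred_degraded)

print(pdg_guide(1.0, 0.5, strength=2.0))  # 1.5
```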
## Citation
```bibtex
@misc{mdiffae_v2,
  title  = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```
## Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
## License
Apache 2.0