---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---
# mdiffae-v2
**mDiffAE v2** – **M**asked **Diff**usion **A**uto**E**ncoder v2.
A fast, single-GPU-trainable diffusion autoencoder with a **96-channel**
spatial bottleneck and optional PDG sharpening.
**This is the recommended version** – it offers substantially better
reconstruction than [v1](https://huggingface.co/data-archetype/mdiffae-v1)
(+1.7 dB mean PSNR) while maintaining the same or better convergence for
downstream latent diffusion models.
This variant (mdiffae-v2): 120.9M parameters, 461.2 MB.
Bottleneck: **96 channels** at patch size 16
(compression ratio 8x).
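The 8x figure follows directly from the numbers above. A quick check (arithmetic only, derived from the stated patch size and bottleneck width):

```python
# Compression ratio implied by the card's numbers: each 16x16 RGB patch
# holds 3 * 16 * 16 = 768 values and is encoded into one 96-channel
# latent token.
patch_size = 16
in_channels = 3
bottleneck_dim = 96

values_per_patch = in_channels * patch_size * patch_size  # 768
compression_ratio = values_per_patch / bottleneck_dim
print(compression_ratio)  # 8.0
```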
## Documentation
- [Technical Report](technical_report_mdiffae_v2.md) – architecture, training changes from v1, and results
- [Results – interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-v2-results) – full-resolution side-by-side comparison
- [mDiffAE v1](https://huggingface.co/data-archetype/mdiffae-v1) – previous version
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
## Quick Start
```python
import torch
from m_diffae_v2 import MDiffAEV2
# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")
# Encode
images = ... # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)
# Decode (2 steps by default – PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)
# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```
> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
## Architecture
| Property | Value |
|---|---|
| Parameters | 120,893,792 |
| File size | 461.2 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 (2+4+2 skip-concat) |
| Bottleneck dim | 96 |
| Compression ratio | 8x |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG | Conditioning degradation for CFG-style sharpening at inference |
| Training regularizer | Token masking (25-75% ratio, 90% apply prob) + Path drop (10% drop prob) |
**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck
RMSNorm (affine=False) normalizes the latent tokens.
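The patchify stem described above can be sketched in a few lines. This is a minimal illustration, assuming patch size 16 and model dim 896 from the architecture table; everything else (layer names, the exact projection) is an assumption, not the actual mdiffae-v2 implementation:

```python
import torch
import torch.nn as nn

# Hedged sketch of a patchify stem (PixelUnshuffle + 1x1 conv).
patch, dim = 16, 896
stem = nn.Sequential(
    # [B, 3, H, W] -> [B, 3*16*16, H/16, W/16]: fold each patch into channels
    nn.PixelUnshuffle(patch),
    # 1x1 conv projects the 768 patch values to the model dimension
    nn.Conv2d(3 * patch * patch, dim, kernel_size=1),
)

x = torch.randn(1, 3, 64, 64)
tokens = stem(x)
print(tokens.shape)  # torch.Size([1, 896, 4, 4])
```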
**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology
(2 start + 4 middle + 2 end blocks)
with skip connections from start to end blocks. No outer RMSNorms
(input, latent conditioning, and output norms all removed).
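The 2+4+2 skip-concat wiring can be sketched as below. Only the topology is taken from the description above; `block` is a plain MLP stand-in for the actual DiCo decoder block, and the concat-then-project fusion layer is an assumption:

```python
import torch
import torch.nn as nn

dim = 896

def block() -> nn.Module:
    # Stand-in for a DiCo decoder block (not the real block).
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

start = nn.ModuleList(block() for _ in range(2))
middle = nn.ModuleList(block() for _ in range(4))
end = nn.ModuleList(block() for _ in range(2))
# Fuse each end block's input with a start block's output: concat -> project.
fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(2))

def forward(x: torch.Tensor) -> torch.Tensor:
    skips = []
    for b in start:            # 2 start blocks, stash outputs for the end
        x = b(x)
        skips.append(x)
    for b in middle:           # 4 middle blocks, no skips
        x = b(x)
    for b, f in zip(end, fuse):  # 2 end blocks, concat matching start output
        x = b(f(torch.cat([x, skips.pop()], dim=-1)))
    return x

out = forward(torch.randn(1, 16, dim))
print(out.shape)  # torch.Size([1, 16, 896])
```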
### Changes from v1
| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | **96** (8x compression) |
| Decoder topology | 4 flat sequential blocks | **8 blocks (2+4+2 skip-concat)** |
| Token mask apply prob | 50% | **90%** |
| Token mask ratio | Fixed 75% | **Uniform(25%, 75%)** |
| PDG training regularizer | Token masking (50%) | **Token masking (90%) + path drop (10%)** |
| Latent noise prob | 10% | **50%** |
| Encoder input norm | RMSNorm (affine) | **Removed** |
| Decoder input norm | RMSNorm (affine) | **Removed** |
| Decoder latent norm | RMSNorm (affine) | **Removed** |
| Decoder output norm | RMSNorm (affine) | **Removed** |
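The token-masking regularizer in the tables above (uniform 25-75% ratio, applied with 90% probability) can be sketched as follows. Zeroing masked tokens is an assumption; the actual mechanism (e.g. a learned mask token) may differ:

```python
import torch

def mask_tokens(latents: torch.Tensor,
                apply_prob: float = 0.9,
                lo: float = 0.25, hi: float = 0.75) -> torch.Tensor:
    """Randomly zero out latent tokens. latents: [B, N, C]."""
    if torch.rand(()) >= apply_prob:       # skip masking 10% of the time
        return latents
    ratio = torch.empty(()).uniform_(lo, hi)   # mask ratio ~ U(25%, 75%)
    keep = torch.rand(latents.shape[:2]) >= ratio  # per-token keep mask [B, N]
    return latents * keep.unsqueeze(-1)

masked = mask_tokens(torch.randn(2, 64, 96))
print(masked.shape)  # torch.Size([2, 64, 96])
```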
## Recommended Settings
| Mode | Steps | PDG | Strength |
|---|---|---|---|
| **Default** (best PSNR) | 2 | off | – |
| **Sharp** (perceptual) | 10 | on | 2.0 |
```python
from m_diffae_v2 import MDiffAEV2InferenceConfig
# model, latents, H, W as in the Quick Start above
# Default – best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)
# Sharp mode – perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
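The CFG-style extrapolation behind PDG can be written in one line. This is a sketch of generic classifier-free-guidance-style extrapolation under the card's description (a prediction with degraded conditioning is pushed toward the fully conditioned one); the exact degradation and formula used by mdiffae-v2 are assumptions:

```python
def pdg_guide(pred_full: float, pred_degraded: float,
              strength: float = 2.0) -> float:
    # strength = 1.0 recovers the plain conditioned prediction;
    # strength > 1.0 extrapolates past it, sharpening the output.
    return pred_degraded + strength * (pred_full - pred_degraded)

print(pdg_guide(1.0, 0.5, strength=2.0))  # 1.5
```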
## Citation
```bibtex
@misc{mdiffae_v2,
  title  = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```
## Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
## License
Apache 2.0