---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - pytorch
  - masked-autoencoder
library_name: mdiffae
---

# mdiffae_v1

**mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. Uses decoder token masking as an implicit regularizer
instead of REPA alignment.

This variant (**mdiffae_v1**) has 81.4M parameters (310.6 MB on disk).
Bottleneck: **64 channels** at patch size 16 (compression ratio 12x).
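
The 12x compression figure follows directly from the patch and bottleneck sizes stated above; a quick sanity check:

```python
# Each 16x16 RGB patch holds 16 * 16 * 3 = 768 input values.
patch_values = 16 * 16 * 3

# The bottleneck keeps 64 channels per patch position.
bottleneck_channels = 64

# Compression ratio = input values per patch / latent values per patch.
ratio = patch_values / bottleneck_channels
print(ratio)  # 12.0
```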

## Documentation

- [Technical Report](technical_report_mdiffae.md) — architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) — full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) — full-resolution side-by-side comparison

## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode — images are [B, 3, H, W] in [-1, 1], H and W divisible by 16
B, H, W = 4, 256, 256
images = torch.rand(B, 3, H, W) * 2 - 1  # dummy input; swap in real images
latents = model.encode(images)

# Decode (1 step by default — PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
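
The latent grid size is determined by the patch size. Assuming the latents are laid out spatially, one 64-channel vector per 16x16 patch (an assumption about the layout, not confirmed above), the arithmetic for a 256x256 input is:

```python
# Hypothetical latent-grid arithmetic for a 256x256 input at patch size 16.
H, W = 256, 256
patch = 16
grid_h, grid_w = H // patch, W // patch
print(grid_h, grid_w)  # 16 16
# Under a spatial layout the latent shape would then be [B, 64, 16, 16].
```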

## Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
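
The training regularizer in the last row can be sketched as follows. This is a hypothetical illustration of the described behavior, not the repository's code: with probability 0.5 per sample, a random 75% of decoder tokens is dropped.

```python
import random

def mask_tokens(num_tokens, mask_ratio=0.75, apply_prob=0.5, rng=random):
    """Return indices of decoder tokens to KEEP for one training sample.

    With probability `apply_prob`, a random `mask_ratio` fraction of tokens
    is dropped; otherwise all tokens are kept. (Sketch of the masking
    regularizer described in the table above.)
    """
    if rng.random() >= apply_prob:
        return list(range(num_tokens))  # no masking this step
    num_keep = num_tokens - int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_keep))

kept = mask_tokens(256)  # e.g. a 16x16 token grid
print(len(kept))         # 64 if masked, 256 otherwise
```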

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).
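
To see why the low-rank factorization is cheap, compare parameter counts for a single AdaLN head at the model dim (896) and rank (128) from the table above. The number of modulation vectors (6, i.e. scale/shift/gate for two sub-blocks) is an assumption for illustration:

```python
dim, rank, n_mod = 896, 128, 6  # model dim, AdaLN rank, modulation vectors (assumed)

# Dense conditioning -> modulation projection vs. a rank-128 factorization.
full_adaln = dim * (n_mod * dim)
low_rank = dim * rank + rank * (n_mod * dim)

print(full_adaln, low_rank)  # 4816896 802816
```

Under these assumptions the factorized head uses roughly 6x fewer parameters per layer, which is why a per-layer low-rank delta on top of a shared base stays affordable.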

**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip
connections and 64 bottleneck channels (12x compression vs. iRDiffAE's 6x),
which gives better channel utilization.

### Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images, but its strength should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{m_diffae,
  title   = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)
- huggingface_hub (for downloading from the Hub)

## License

Apache 2.0