---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---
# mdiffae_v1
**mDiffAE**: **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. Uses decoder token masking as an implicit regularizer
instead of REPA alignment.
This variant (mdiffae_v1) has 81.4M parameters (310.6 MB) and a **64-channel**
bottleneck at patch size 16 (12x compression ratio).
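The 12x figure follows directly from the patch and bottleneck sizes: each 16x16 RGB patch is compressed into a single 64-channel latent vector.

```python
# Compression ratio of the spatial bottleneck: a 16x16 RGB patch holds
# 3 * 16 * 16 = 768 values, which the encoder maps to 64 latent channels.
patch_size = 16
in_channels = 3
bottleneck_channels = 64

values_per_patch = in_channels * patch_size ** 2   # 768
compression_ratio = values_per_patch / bottleneck_channels
print(compression_ratio)  # 12.0
```

The same arithmetic gives iRDiffAE's 6x ratio with its 128-channel bottleneck.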
## Documentation
- [Technical Report](technical_report_mdiffae.md) – architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results – interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) – full-resolution side-by-side comparison
## Quick Start
```python
import torch
from m_diffae import MDiffAE

# Load from the Hugging Face Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode: images are [B, 3, H, W] in [-1, 1], with H and W divisible by 16
images = torch.rand(1, 3, 256, 256, device="cuda") * 2 - 1  # example input
latents = model.encode(images)

# Decode (1 step by default – PSNR-optimal)
recon = model.decode(latents, height=256, width=256)

# Reconstruct in one call (encode + 1-step decode)
recon = model.reconstruct(images)
```
> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
## Architecture
| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
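The masking regularizer from the table can be sketched in MAE-style PyTorch. This is an illustrative sketch only; the function name, signature, and batch-level coin flip are assumptions, not the mdiffae implementation.

```python
import torch

def mask_decoder_tokens(tokens, mask_ratio=0.75, apply_prob=0.5):
    """Randomly drop decoder tokens during training (hypothetical sketch).

    With probability `apply_prob` the step applies masking, keeping only a
    random (1 - mask_ratio) fraction of tokens; otherwise tokens pass
    through unchanged.
    """
    B, N, D = tokens.shape
    if torch.rand(()) >= apply_prob:
        return tokens  # masking skipped for this training step
    num_keep = max(1, int(N * (1.0 - mask_ratio)))
    # Keep a random subset of tokens per sample (MAE-style shuffle-and-slice).
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]          # [B, num_keep]
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)    # [B, num_keep, D]
    return torch.gather(tokens, dim=1, index=keep_idx)
```

With the defaults above, a 75% mask leaves the decoder only a quarter of its tokens on masked steps, which acts as an implicit regularizer without an auxiliary loss.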
**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.
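The patchify stem described above can be sketched as follows. Shapes and module names are assumptions for illustration, not the exact mdiffae code.

```python
import torch
from torch import nn

class Patchify(nn.Module):
    """Hypothetical PixelUnshuffle + 1x1 conv patchify stem."""
    def __init__(self, patch_size=16, in_ch=3, dim=896):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(patch_size)  # space-to-depth
        # 1x1 conv projects the 3 * 16 * 16 = 768 stacked values per
        # patch position to the model dimension.
        self.proj = nn.Conv2d(in_ch * patch_size ** 2, dim, kernel_size=1)

    def forward(self, x):  # x: [B, 3, H, W], H and W divisible by 16
        return self.proj(self.unshuffle(x))  # [B, dim, H/16, W/16]
```

This keeps the stem fully convolutional and deterministic, matching the encoder description above.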
**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).
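The per-layer low-rank AdaLN-Zero conditioning can be sketched as below. The class name, the silu activation, and the exact placement of the gate are assumptions (the shared base projection is omitted for brevity); only the low-rank factorization and zero-initialized output are taken from the description above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class LowRankAdaLNZero(nn.Module):
    """Hypothetical per-layer low-rank AdaLN-Zero modulation sketch."""
    def __init__(self, dim=896, rank=128):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Low-rank projection: cond -> rank -> (shift, scale, gate).
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, 3 * dim)
        nn.init.zeros_(self.up.weight)  # AdaLN-Zero: starts as a no-op
        nn.init.zeros_(self.up.bias)

    def forward(self, x, cond):
        # x: [B, N, dim] tokens; cond: [B, dim] latent + timestep embedding
        shift, scale, gate = self.up(F.silu(self.down(cond))).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * h  # gate is zero at initialization
```

The rank-128 factorization keeps the per-layer conditioning cheap relative to a full `dim -> 3*dim` projection.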
**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels, needed partly
because REPA occupies half the channels. mDiffAE uses 4 flat blocks with no
skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's
6x), which gives better channel utilisation.
### Key Differences from iRDiffAE
| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |
## Recommended Settings
Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images, but its strength should be kept very low (1.01–1.05).
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |
```python
from m_diffae import MDiffAEInferenceConfig
# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
## Citation
```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```
## Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
- huggingface_hub (for `from_pretrained` Hub downloads)
## License
Apache 2.0