# mdiffae-v2

mDiffAE v2 (Masked Diffusion AutoEncoder v2): a fast, single-GPU-trainable diffusion autoencoder with a 96-channel spatial bottleneck and optional PDG sharpening.

This is the recommended version: it offers substantially better reconstruction than v1 (+1.7 dB mean PSNR) while matching or improving convergence for downstream latent diffusion models.

This variant (mdiffae-v2): 120.9M parameters, 461.2 MB. Bottleneck: 96 channels at patch size 16 (compression ratio 8x).


## Quick Start

```python
import torch
from m_diffae_v2 import MDiffAEV2

# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (2 steps by default; PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```

Note: Hub downloads require `pip install huggingface_hub safetensors`. You can also pass a local directory path to `from_pretrained()`.
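Since `encode` expects H and W divisible by the patch size (16), inputs of arbitrary size need padding first. A minimal sketch of the size arithmetic; `pad_to_multiple` is an illustrative helper, not part of the package:

```python
def pad_to_multiple(size: int, multiple: int = 16) -> int:
    """Round a spatial size up to the nearest multiple of the patch size."""
    return ((size + multiple - 1) // multiple) * multiple

# e.g. a 500x375 input would be padded to 512x384 before model.encode()
target_h, target_w = pad_to_multiple(500), pad_to_multiple(375)
```

The padding itself can then be done with `torch.nn.functional.pad` (or by resizing) before calling `encode`.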

## Architecture

| Property | Value |
|---|---|
| Parameters | 120,893,792 |
| File size | 461.2 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 (2+4+2 skip-concat) |
| Bottleneck dim | 96 |
| Compression ratio | 8x |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG | Conditioning degradation for CFG-style sharpening at inference |
| Training regularizer | Token masking (25-75% ratio, 90% apply prob) + path drop (10% drop prob) |

**Encoder:** Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck RMSNorm (affine=False) normalizes the latent tokens.

**Decoder:** VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology (2 start + 4 middle + 2 end blocks) with skip connections from start to end blocks. No outer RMSNorms (input, latent conditioning, and output norms all removed).
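The 8x compression ratio follows directly from the patch and bottleneck sizes: each 16x16 RGB patch carries 16 x 16 x 3 = 768 input values, which the encoder maps to 96 latent channels. A quick arithmetic check:

```python
patch_size, in_channels, bottleneck_dim = 16, 3, 96

values_per_patch = patch_size * patch_size * in_channels  # 768 input values per token
compression = values_per_patch / bottleneck_dim           # 768 / 96 = 8.0
```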

## Changes from v1

| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | 96 (8x compression) |
| Decoder topology | 4 flat sequential blocks | 8 blocks (2+4+2 skip-concat) |
| Token mask apply prob | 50% | 90% |
| Token mask ratio | Fixed 75% | Uniform(25%, 75%) |
| PDG training regularizer | Token masking (50%) | Token masking (90%) + path drop (10%) |
| Latent noise prob | 10% | 50% |
| Encoder input norm | RMSNorm (affine) | Removed |
| Decoder input norm | RMSNorm (affine) | Removed |
| Decoder latent norm | RMSNorm (affine) | Removed |
| Decoder output norm | RMSNorm (affine) | Removed |
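The v2 token-masking schedule above (ratio drawn from Uniform(25%, 75%), applied with 90% probability per step) can be sketched as follows; the function name and signature are illustrative, not the package's training API:

```python
import random

def sample_mask_ratio(rng, apply_prob=0.9, lo=0.25, hi=0.75):
    """Fraction of latent tokens to mask for one training step.

    With probability `apply_prob`, draw the ratio uniformly from [lo, hi];
    otherwise apply no masking (ratio 0.0).
    """
    if rng.random() < apply_prob:
        return rng.uniform(lo, hi)
    return 0.0

ratios = [sample_mask_ratio(random.Random(seed)) for seed in range(200)]
```

Compared to v1's fixed 75% ratio at 50% apply probability, this makes masking both more frequent and more varied in strength.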

## Recommended Settings

| Mode | Steps | PDG | Strength |
|---|---|---|---|
| Default (best PSNR) | 2 | off | – |
| Sharp (perceptual) | 10 | on | 2.0 |
```python
from m_diffae_v2 import MDiffAEV2InferenceConfig

# Default: best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)

# Sharp mode: perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
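PDG sharpening extrapolates away from a degraded-conditioning prediction in the style of classifier-free guidance. A hedged sketch of that guidance arithmetic, assuming the usual CFG combination rule (the actual degradation and guidance logic are internal to the package; `cfg_combine` is an illustrative name):

```python
def cfg_combine(pred_full: float, pred_degraded: float, strength: float) -> float:
    """CFG-style extrapolation: push the prediction away from the
    degraded-conditioning branch by `strength` (illustrative sketch)."""
    return pred_degraded + strength * (pred_full - pred_degraded)

# strength 1.0 recovers the full-conditioning prediction;
# strength 2.0 (the recommended sharp setting) overshoots it,
# which is what produces the perceptual sharpening
```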

## Citation

```bibtex
@misc{mdiffae_v2,
  title   = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0
