# mdiffae-v2

mDiffAE v2 (Masked Diffusion AutoEncoder v2): a fast, single-GPU-trainable diffusion autoencoder with a 96-channel spatial bottleneck and optional PDG sharpening.

This is the recommended version: it offers substantially better reconstruction than v1 (+1.7 dB mean PSNR) while matching or improving convergence for downstream latent diffusion models.

This variant (mdiffae-v2): 120.9M parameters, 461.2 MB. Bottleneck: 96 channels at patch size 16 (compression ratio 8x).


## Quick Start

```python
import torch
from m_diffae_v2 import MDiffAEV2

# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (2 steps by default; PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```

Note: Hub downloads require `pip install huggingface_hub safetensors`. You can also pass a local directory path to `from_pretrained()`.
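Since `encode` expects H and W divisible by the patch size (16), inputs of arbitrary size need padding first. A minimal sketch of the size arithmetic; `pad_to_multiple` is an illustrative helper, not part of the package:

```python
def pad_to_multiple(size: int, multiple: int = 16) -> int:
    """Round a spatial size up to the nearest multiple of the patch size."""
    return ((size + multiple - 1) // multiple) * multiple

# e.g. a 500x375 input would be padded to 512x384 before model.encode()
target_h, target_w = pad_to_multiple(500), pad_to_multiple(375)
```

The padding itself can then be done with `torch.nn.functional.pad` (or by resizing) before calling `encode`.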

## Architecture

| Property | Value |
|---|---|
| Parameters | 120,893,792 |
| File size | 461.2 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 (2+4+2 skip-concat) |
| Bottleneck dim | 96 |
| Compression ratio | 8x |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG | Conditioning degradation for CFG-style sharpening at inference |
| Training regularizer | Token masking (25-75% ratio, 90% apply prob) + path drop (10% drop prob) |

**Encoder:** Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck RMSNorm (affine=False) normalizes the latent tokens.

**Decoder:** VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology (2 start + 4 middle + 2 end blocks) with skip connections from start to end blocks. No outer RMSNorms (input, latent conditioning, and output norms all removed).
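The 8x compression ratio follows directly from the patch and bottleneck sizes: each 16x16 RGB patch carries 16 x 16 x 3 = 768 input values, which the encoder maps to 96 latent channels. A quick arithmetic check:

```python
patch_size, in_channels, bottleneck_dim = 16, 3, 96

values_per_patch = patch_size * patch_size * in_channels  # 768 input values per token
compression = values_per_patch / bottleneck_dim           # 768 / 96 = 8.0
```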

## Changes from v1

| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | 96 (8x compression) |
| Decoder topology | 4 flat sequential blocks | 8 blocks (2+4+2 skip-concat) |
| Token mask apply prob | 50% | 90% |
| Token mask ratio | Fixed 75% | Uniform(25%, 75%) |
| PDG training regularizer | Token masking (50%) | Token masking (90%) + path drop (10%) |
| Latent noise prob | 10% | 50% |
| Encoder input norm | RMSNorm (affine) | Removed |
| Decoder input norm | RMSNorm (affine) | Removed |
| Decoder latent norm | RMSNorm (affine) | Removed |
| Decoder output norm | RMSNorm (affine) | Removed |
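The v2 token-masking schedule above (ratio drawn from Uniform(25%, 75%), applied with 90% probability per step) can be sketched as follows; the function name and signature are illustrative, not the package's training API:

```python
import random

def sample_mask_ratio(rng, apply_prob=0.9, lo=0.25, hi=0.75):
    """Fraction of latent tokens to mask for one training step.

    With probability `apply_prob`, draw the ratio uniformly from [lo, hi];
    otherwise apply no masking (ratio 0.0).
    """
    if rng.random() < apply_prob:
        return rng.uniform(lo, hi)
    return 0.0

ratios = [sample_mask_ratio(random.Random(seed)) for seed in range(200)]
```

Compared to v1's fixed 75% ratio at 50% apply probability, this makes masking both more frequent and more varied in strength.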

## Recommended Settings

| Mode | Steps | PDG | Strength |
|---|---|---|---|
| Default (best PSNR) | 2 | off | – |
| Sharp (perceptual) | 10 | on | 2.0 |
```python
from m_diffae_v2 import MDiffAEV2InferenceConfig

# Default: best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)

# Sharp mode: perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
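PDG sharpening extrapolates away from a degraded-conditioning prediction in the style of classifier-free guidance. A hedged sketch of that guidance arithmetic, assuming the usual CFG combination rule (the actual degradation and guidance logic are internal to the package; `cfg_combine` is an illustrative name):

```python
def cfg_combine(pred_full: float, pred_degraded: float, strength: float) -> float:
    """CFG-style extrapolation: push the prediction away from the
    degraded-conditioning branch by `strength` (illustrative sketch)."""
    return pred_degraded + strength * (pred_full - pred_degraded)

# strength 1.0 recovers the full-conditioning prediction;
# strength 2.0 (the recommended sharp setting) overshoots it,
# which is what produces the perceptual sharpening
```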

## Citation

```bibtex
@misc{mdiffae_v2,
  title   = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0
