---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- pytorch
- masked-autoencoder
library_name: mdiffae
---

# mdiffae-v2

**mDiffAE v2**: **M**asked **Diff**usion **A**uto**E**ncoder v2.
A fast, single-GPU-trainable diffusion autoencoder with a **96-channel**
spatial bottleneck and optional PDG sharpening.

**This is the recommended version**: it offers substantially better
reconstruction than [v1](https://huggingface.co/data-archetype/mdiffae-v1)
(+1.7 dB mean PSNR) while maintaining the same or better convergence for
downstream latent diffusion models.

This variant (mdiffae-v2) has 120.9M parameters (461.2 MB).
Bottleneck: **96 channels** at patch size 16
(compression ratio 8x).
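
As a quick sanity check on the numbers above: each 16x16 patch of a 3-channel input carries 3 × 16 × 16 = 768 values, which the encoder maps to 96 latent channels, giving the stated 8x compression ratio. A minimal arithmetic sketch:

```python
# Compression ratio of the mDiffAE v2 bottleneck: each patch_size x patch_size
# RGB patch is mapped to bottleneck_dim latent channels.
in_channels = 3
patch_size = 16
bottleneck_dim = 96

values_per_patch = in_channels * patch_size * patch_size  # 3 * 16 * 16 = 768
compression_ratio = values_per_patch / bottleneck_dim     # 768 / 96

print(values_per_patch)   # 768
print(compression_ratio)  # 8.0

# For a 256x256 input, the latent grid is 16x16 tokens of 96 channels each:
h, w = 256 // patch_size, 256 // patch_size
print((bottleneck_dim, h, w))  # (96, 16, 16)
```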

## Documentation

- [Technical Report](technical_report_mdiffae_v2.md) – architecture, training changes from v1, and results
- [Results – interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-v2-results) – full-resolution side-by-side comparisons
- [mDiffAE v1](https://huggingface.co/data-archetype/mdiffae-v1) – previous version
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) – full background on VP diffusion, DiCo blocks, the patchify encoder, and AdaLN

## Quick Start

```python
import torch
from m_diffae_v2 import MDiffAEV2

# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (2 steps by default – PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.

|
| | ## Architecture |
| |
|
| | | Property | Value | |
| | |---|---| |
| | | Parameters | 120,893,792 | |
| | | File size | 461.2 MB | |
| | | Patch size | 16 | |
| | | Model dim | 896 | |
| | | Encoder depth | 4 | |
| | | Decoder depth | 8 (2+4+2 skip-concat) | |
| | | Bottleneck dim | 96 | |
| | | Compression ratio | 8x | |
| | | MLP ratio | 4.0 | |
| | | Depthwise kernel | 7 | |
| | | AdaLN rank | 128 | |
| | | PDG | Conditioning degradation for CFG-style sharpening at inference | |
| | | Training regularizer | Token masking (25-75% ratio, 90% apply prob) + Path drop (10% drop prob) | |
| |
|
| | **Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by |
| | DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck |
| | RMSNorm (affine=False) normalizes the latent tokens. |
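
The patchify stem described above can be sketched as follows. This is a minimal illustration assuming the table's patch size (16) and model dim (896); the actual module names in `m_diffae_v2` may differ:

```python
import torch
import torch.nn as nn

# Sketch of the patchify stem: PixelUnshuffle folds each 16x16 patch into the
# channel dimension, then a 1x1 conv projects to the model dim (896).
patch_size, model_dim = 16, 896

patchify = nn.Sequential(
    nn.PixelUnshuffle(patch_size),  # [B, 3, H, W] -> [B, 3*16*16, H/16, W/16]
    nn.Conv2d(3 * patch_size**2, model_dim, kernel_size=1),  # 768 -> 896 channels
)

x = torch.randn(1, 3, 256, 256)
tokens = patchify(x)
print(tokens.shape)  # torch.Size([1, 896, 16, 16])
```

Because all spatial mixing happens through the channel fold and a pointwise projection, the stem itself imposes no fixed input resolution beyond divisibility by 16.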

**Decoder**: VP diffusion conditioned on the encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology
(2 start + 4 middle + 2 end blocks)
with skip connections from the start blocks to the end blocks. No outer RMSNorms
(the input, latent-conditioning, and output norms are all removed).

### Changes from v1

| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | **96** (8x compression) |
| Decoder topology | 4 flat sequential blocks | **8 blocks (2+4+2 skip-concat)** |
| Token mask apply prob | 50% | **90%** |
| Token mask ratio | Fixed 75% | **Uniform(25%, 75%)** |
| PDG training regularizer | Token masking (50%) | **Token masking (90%) + path drop (10%)** |
| Latent noise prob | 10% | **50%** |
| Encoder input norm | RMSNorm (affine) | **Removed** |
| Decoder input norm | RMSNorm (affine) | **Removed** |
| Decoder latent norm | RMSNorm (affine) | **Removed** |
| Decoder output norm | RMSNorm (affine) | **Removed** |
|
| | ## Recommended Settings |
| |
|
| | | Mode | Steps | PDG | Strength | |
| | |---|---|---|---| |
| | | **Default** (best PSNR) | 2 | off | β | |
| | | **Sharp** (perceptual) | 10 | on | 2.0 | |

```python
from m_diffae_v2 import MDiffAEV2InferenceConfig

# Default – best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)

# Sharp mode – perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
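
The Architecture table describes PDG as conditioning degradation for CFG-style sharpening, so the guided prediction presumably combines a full-conditioning pass and a degraded-conditioning pass in the usual classifier-free-guidance form. A hedged sketch with scalar stand-ins (the real implementation operates on the decoder's per-step predictions, and `pdg_combine` is a hypothetical name):

```python
# Assumed CFG-style combination for PDG: extrapolate from the
# degraded-conditioning prediction toward the full-conditioning one.
def pdg_combine(pred_full, pred_degraded, strength):
    # strength = 1.0 recovers the full-conditioning prediction;
    # strength > 1.0 over-weights the conditioning signal, sharpening output.
    return pred_degraded + strength * (pred_full - pred_degraded)

# Scalar stand-ins for the decoder's two predictions at one step:
print(pdg_combine(1.0, 0.4, 1.0))  # 1.0
print(pdg_combine(1.0, 0.4, 2.0))  # 1.6
```

This would explain why the sharp mode pairs PDG with a strength of 2.0: values above 1.0 extrapolate past the full-conditioning prediction.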

## Citation

```bibtex
@misc{mdiffae_v2,
  title  = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0
|