	---
	license: apache-2.0
	tags:
	- diffusion
	- autoencoder
	- image-reconstruction
	- latent-space
	- dino
	- pytorch
	---

	# data-archetype/dinac_ae

	DINAC-AE is a DINO-Aligned Class-token AutoEncoder.
	It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
	family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned
	representations.

	Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
	6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
	The latent-to-DINO alignment head is extended to predict the DINO class token
	as well as patch tokens. `predict_class(latents)` exposes that class-token
	feature directly from latents.

	## 2k PSNR Benchmark

	| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
	|---|---:|---:|---:|---:|---:|
	| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
	| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

	Evaluated on `2000` validation images.
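
The summary columns above can be reproduced from per-image PSNR values. The following is a minimal sketch, not the evaluation script behind the table. This README does not state the PSNR convention, so the sketch assumes float images in `[-1, 1]` (peak-to-peak range `2.0`), a population standard deviation, and linearly interpolated percentiles. `psnr_db` and `summarize` are hypothetical helper names.

```python
import numpy as np


def psnr_db(reference: np.ndarray, recon: np.ndarray, data_range: float = 2.0) -> float:
    """PSNR in dB for one image; data_range=2.0 assumes pixels in [-1, 1]."""
    diff = reference.astype(np.float64) - recon.astype(np.float64)
    mse = float(np.mean(diff**2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range**2 / mse))


def summarize(values: list[float]) -> dict[str, float]:
    """Mean, std, median, P5 and P95 of per-image PSNR values."""
    arr = np.asarray(values, dtype=np.float64)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),  # population std (ddof=0), an assumption
        "median": float(np.median(arr)),
        "p5": float(np.percentile(arr, 5)),
        "p95": float(np.percentile(arr, 95)),
    }


# Synthetic stand-in data: random "images" plus small Gaussian noise.
rng = np.random.default_rng(0)
reference = rng.uniform(-1.0, 1.0, size=(4, 3, 32, 32))
recon = np.clip(reference + rng.normal(0.0, 0.02, size=reference.shape), -1.0, 1.0)
stats = summarize([psnr_db(r, x) for r, x in zip(reference, recon)])
print({name: round(value, 2) for name, value in stats.items()})
```

If the reported numbers were computed on 8-bit images instead, the same helper applies with `data_range=255.0`.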

	DINAC-AE targets a compromise between high reconstruction quality, a learnable
	latent space with KL-like variance expansion, DINOv3 alignment, and robustness
	to local token errors.

	The [results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
	shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
	reconstructions, RGB differences, and latent PCA.
	Rechecking the released export on that 39-image set gives `35.15 dB` mean PSNR
	(`25.73 dB` min, `45.99 dB` max).

	[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)

	## Encode Throughput

	Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
	batches per resolution.

	| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
	|---:|---:|---:|---:|---:|---:|---:|---:|
	| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
	| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

	The transformer encoder is slightly slower and larger than the full_capacitor
	FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
	encoder.
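
The derived columns in the throughput table are simple ratios of the measured ones. The sketch below recomputes them from the values as printed. Because the printed timings are rounded to whole milliseconds, the recomputed speedups (about `7.66x` and `6.68x`) differ slightly from the listed `7.62x` and `6.72x`, which were presumably derived from unrounded timings. `EncodeRow` is a hypothetical container used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodeRow:
    resolution: str
    batch_size: int
    dinac_ms: float          # dinac_ae encode time per batch
    flux_ms: float           # FLUX.2 encode time per batch
    dinac_vram_mib: float    # dinac_ae peak VRAM
    flux_vram_mib: float     # FLUX.2 peak VRAM

    @property
    def speedup(self) -> float:
        """How many times faster dinac_ae encodes a batch than FLUX.2."""
        return self.flux_ms / self.dinac_ms

    @property
    def vram_reduction_pct(self) -> float:
        """Peak VRAM saved relative to FLUX.2, in percent."""
        return 100.0 * (1.0 - self.dinac_vram_mib / self.flux_vram_mib)

    @property
    def dinac_images_per_second(self) -> float:
        """Encode throughput implied by batch size and per-batch latency."""
        return self.batch_size / (self.dinac_ms / 1000.0)


# Values copied from the table above.
ROWS = [
    EncodeRow("256x256", 128, 50, 383, 1637, 12511),
    EncodeRow("512x512", 32, 53, 354, 1639, 12511),
]

for row in ROWS:
    print(
        f"{row.resolution}: {row.speedup:.2f}x faster, "
        f"{row.vram_reduction_pct:.1f}% less peak VRAM, "
        f"{row.dinac_images_per_second:.0f} images/s"
    )
```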

	## Latent Interface

	- `encode()` returns DINAC-AE's own whitened latent space.
	- `decode()` expects that same whitened latent space and dewhitens internally.
	- `predict_class()` expects the same whitened latent space, dewhitens
	internally, and predicts a DINOv3 ViT-B/16 class-token feature.
	- `whiten()` and `dewhiten()` are exposed for explicit control.
	- `encode_posterior()` returns the raw exported posterior before whitening.
	- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
	`num_steps=1` means a single function evaluation (NFE = 1).
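
To make the distinction between raw and whitened latents concrete, here is a minimal sketch of a whitening round trip. It assumes whitening is a per-channel affine map built from a stored mean and standard deviation, and that latents are laid out as `[B, C, h, w]`. Both are assumptions: the export's actual transform (for example a full-covariance whitening) and layout may differ. `ChannelWhitener` is a hypothetical class, not part of `dinac_ae`.

```python
import numpy as np


class ChannelWhitener:
    """Hypothetical per-channel whitening: z_w = (z - mean) / std."""

    def __init__(self, mean: np.ndarray, std: np.ndarray, eps: float = 1e-6) -> None:
        # Statistics are held in float32, mirroring the note in this README
        # that whitening/dewhitening math is kept in float32.
        self.mean = mean.astype(np.float32).reshape(1, -1, 1, 1)
        self.std = np.maximum(std.astype(np.float32), eps).reshape(1, -1, 1, 1)

    def whiten(self, z: np.ndarray) -> np.ndarray:
        return (z.astype(np.float32) - self.mean) / self.std

    def dewhiten(self, z_w: np.ndarray) -> np.ndarray:
        return z_w.astype(np.float32) * self.std + self.mean


# Synthetic stand-in for raw latents with channel-dependent scale and offset.
rng = np.random.default_rng(0)
scale = rng.uniform(0.5, 3.0, size=(1, 128, 1, 1))
shift = rng.uniform(-2.0, 2.0, size=(1, 128, 1, 1))
raw = rng.normal(size=(8, 128, 4, 4)) * scale + shift

whitener = ChannelWhitener(raw.mean(axis=(0, 2, 3)), raw.std(axis=(0, 2, 3)))
whitened = whitener.whiten(raw)        # roughly zero mean, unit std per channel
restored = whitener.dewhiten(whitened)  # back in the raw latent space
print(float(np.abs(restored - raw).max()))
```

In the released model, the corresponding round trip would go through the exposed `whiten()` and `dewhiten()` methods rather than a hand-written class like this one.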

	The export ships weights in `float32`. The recommended and default runtime path
	is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
	`float32` retained for sensitive operations such as whitening/dewhitening,
	normalization math, RoPE frequency construction, and VP diffusion schedule
	helpers.

	## Usage

	```python
	import torch

	from dinac_ae import DinacAE, DinacAEInferenceConfig


	device = "cuda"
	model = DinacAE.from_pretrained(
	    "data-archetype/dinac_ae",
	    device=device,
	    dtype=torch.bfloat16,
	)

	image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

	with torch.inference_mode():
	    # Whitened spatial latents.
	    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
	    # DINOv3 ViT-B/16 class-token feature predicted from the latents.
	    class_token = model.predict_class(latents)
	    # Reconstruct with a single decoder evaluation.
	    recon = model.decode(
	        latents,
	        height=int(image.shape[-2]),
	        width=int(image.shape[-1]),
	        inference_config=DinacAEInferenceConfig(num_steps=1),
	    )
	```
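
The `image = ...` placeholder stands for a tensor in `[-1, 1]` whose height and width are multiples of 16. One way to prepare such an input from an 8-bit RGB array is sketched below in NumPy. Center-cropping to the nearest multiple of 16 is a choice made for this sketch (padding or resizing would also satisfy the constraint), and `prepare_image` and `latent_grid` are hypothetical helpers, not part of `dinac_ae`.

```python
import numpy as np

PATCH_SIZE = 16  # patch size stated in this README


def prepare_image(rgb_uint8: np.ndarray, patch: int = PATCH_SIZE) -> np.ndarray:
    """Convert an [H, W, 3] uint8 image to a [1, 3, H', W'] float32 array in [-1, 1]."""
    if rgb_uint8.ndim != 3 or rgb_uint8.shape[-1] != 3:
        raise ValueError("expected an [H, W, 3] RGB array")
    height, width = rgb_uint8.shape[:2]
    new_h, new_w = height - height % patch, width - width % patch
    if new_h == 0 or new_w == 0:
        raise ValueError(f"image must be at least {patch}x{patch}")
    # Center-crop so both sides are divisible by the patch size.
    top, left = (height - new_h) // 2, (width - new_w) // 2
    cropped = rgb_uint8[top : top + new_h, left : left + new_w]
    scaled = cropped.astype(np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    return np.ascontiguousarray(scaled.transpose(2, 0, 1)[None])  # HWC -> 1CHW


def latent_grid(height: int, width: int, patch: int = PATCH_SIZE) -> tuple[int, int]:
    """Number of latent positions along each axis for a patch-aligned image."""
    return height // patch, width // patch


# Random stand-in for a decoded photo.
frame = np.random.default_rng(0).integers(0, 256, size=(300, 500, 3), dtype=np.uint8)
image = prepare_image(frame)
print(image.shape, latent_grid(*image.shape[-2:]))
```

The resulting array can be wrapped with `torch.from_numpy(image)` before the `.to(device=..., dtype=...)` call in the usage example.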

	## Details

	- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
	FCDM decoder.
	- Patch size is `16`, model width is `896`, and latent width is `128`.
	- The DINO alignment head predicts spatial patch tokens and is extended with a
	class-token output in DINOv3 ViT-B/16 feature space.
	- The class-token output is used to improve semantic organization of the latent
	space and to support FD-loss / Representation Frechet Distance objectives
	directly in latent space.
	- `predict_class(latents)` reaches mean cosine similarity `0.757458` against
	the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
	- DINO alignment is applied directly to clean latent tokens. Robustness to
	local token errors is handled by random-token logSNR offset regularization.
	- [Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
	- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
	[full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
	[capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
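
The class-token agreement figure quoted in the list above is a mean cosine similarity. As a rough sketch of that metric (not the evaluation code behind the reported number), assume predicted and teacher class tokens are stacked as `[N, D]` arrays, with `D = 768` for a ViT-B backbone. The random data below only stands in for real `predict_class` outputs and teacher features.

```python
import numpy as np


def mean_cosine_similarity(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Average per-row cosine similarity between two [N, D] feature arrays."""
    if pred.shape != target.shape or pred.ndim != 2:
        raise ValueError("expected two [N, D] arrays of the same shape")
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    dots = np.sum(pred * target, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
    # eps guards against division by zero for all-zero rows.
    return float(np.mean(dots / np.maximum(norms, eps)))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 768))                        # stand-in teacher class tokens
predicted = teacher + 0.5 * rng.normal(size=teacher.shape)  # noisy stand-in predictions
print(round(mean_cosine_similarity(predicted, teacher), 3))
```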

	## Citation

	```bibtex
	@misc{dinac_ae,
	  title = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
	  author = {data-archetype},
	  email = {data-archetype@proton.me},
	  year = {2026},
	  month = may,
	  url = {https://huggingface.co/data-archetype/dinac_ae},
	}
	```