---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- dino
- pytorch
---

# data-archetype/dinac_ae

**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder. It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae) family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned representations.

Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a 6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment. The latent-to-DINO alignment head is extended to predict the DINO class token as well as patch tokens. `predict_class(latents)` exposes that class-token feature directly from latents.

## 2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

Evaluated on `2000` validation images.

DINAC-AE targets a compromise between high reconstruction quality, a learnable latent space with KL-like variance expansion, DINOv3 alignment, and robustness to local token errors.

[Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results) shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE reconstructions, RGB differences, and latent PCA. The released export recheck on that 39-image set gives `35.15 dB` mean PSNR (`25.73` min, `45.99` max).

[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated batches per resolution.


| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

The transformer encoder is slightly slower and larger than the full_capacitor FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE encoder.

## Latent Interface

- `encode()` returns DINAC-AE's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens internally, and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly: `num_steps=1` means one NFE.

The export ships weights in `float32`. The recommended and default runtime path is `bfloat16` AMP for the main encoder, decoder, and class-token path, with `float32` retained for sensitive operations such as whitening/dewhitening, normalization math, RoPE frequency construction, and VP diffusion schedule helpers.

## Usage

```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```

## Details

- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- The DINO alignment head predicts spatial patch tokens and is extended with a class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent space and to support FD-loss / Representation Frechet Distance objectives directly in latent space.
- `predict_class(latents)` reaches mean cosine similarity `0.757458` against the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
- DINO alignment is applied directly to clean latent tokens. Robustness to local token errors is handled by random-token logSNR offset regularization.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), [full_capacitor](https://huggingface.co/data-archetype/full_capacitor), [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)

## Citation

```bibtex
@misc{dinac_ae,
  title = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author = {data-archetype},
  email = {data-archetype@proton.me},
  year = {2026},
  month = may,
  url = {https://huggingface.co/data-archetype/dinac_ae},
}
```
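
## Appendix: Recomputing the Throughput Ratios

The speedup and peak-VRAM-reduction columns in the Encode Throughput table are ratios of the timing and memory columns. The sketch below recomputes them from the rounded figures printed in that table. The helper names `speedup` and `vram_reduction_pct` are illustrative and not part of the `dinac_ae` package, and the formulas are an assumption about how the columns were derived. Because the inputs here are already rounded to whole milliseconds, the recomputed speedups land close to, but not exactly on, the published `7.62x` and `6.72x`, which were presumably computed from unrounded timings.

```python
# Rounded figures copied from the Encode Throughput table:
# (resolution, dinac_ae ms/batch, FLUX.2 ms/batch, dinac_ae MiB, FLUX.2 MiB)
ROWS = [
    ("256x256", 50, 383, 1637, 12511),
    ("512x512", 53, 354, 1639, 12511),
]


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def vram_reduction_pct(baseline_mib: float, candidate_mib: float) -> float:
    """Percentage of the baseline's peak VRAM saved by the candidate."""
    return 100.0 * (1.0 - candidate_mib / baseline_mib)


if __name__ == "__main__":
    for resolution, dinac_ms, flux_ms, dinac_mib, flux_mib in ROWS:
        print(
            f"{resolution}: "
            f"speedup ~{speedup(flux_ms, dinac_ms):.2f}x, "
            f"peak VRAM reduction ~{vram_reduction_pct(flux_mib, dinac_mib):.1f}%"
        )
```

The VRAM reduction reproduces the published `86.9%` for both rows, since MiB values are large enough that rounding has little effect on the ratio.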