---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- dino
- pytorch
---

# data-archetype/dinac_ae_d2

**DINAC-AE-D2** is a close variant of [DINAC-AE](https://huggingface.co/data-archetype/dinac_ae). It keeps the same patch-16 spatial latent interface, VP diffusion decoder, class-token prediction API, and one-step default reconstruction path, but changes the teacher alignment and encoder capacity:

- DINO alignment target: **DINOv2 ViT-B/14** feature space.
- Encoder: **8** ViT/DiT-style transformer blocks instead of DINAC-AE's 6.
- Decoder: unchanged 8-block FCDM decoder.

DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information. In downstream diffusion experiments, this variant has shown faster early convergence than the original DINAC-AE latent space.

## 2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae_d2 | `35.59` | `4.87` | `35.40` | `27.89` | `43.51` |
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

[Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results) shows the 39-image reconstruction set with DINAC-AE-D2 and FLUX.2 VAE reconstructions, RGB differences, and latent PCA. The 39-image set gives `35.46 dB` mean PSNR (`25.61` min, `46.69` max).

[DINAC-AE technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md) describes the training recipe used for this model. DINAC-AE-D2 follows the same autoencoder training setup, with the teacher alignment changed to DINOv2 ViT-B/14 and the encoder depth increased from 6 to 8 blocks.

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated batches per resolution.

| Resolution | Batch Size | Model | Encode (ms/batch) | ms/image | Images/s | Peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | dinac_ae_d2 | `69.56` | `0.543` | `1840.0` | `1606.5` | `4.92x` | `87.2%` |
| `256x256` | `128` | dinac_ae | `50.25` | `0.393` | `2547.4` | `1569.7` | `6.80x` | `87.5%` |
| `256x256` | `128` | FLUX.2 VAE | `341.94` | `2.671` | `374.3` | `12533.8` | `1.00x` | `0.0%` |
| `512x512` | `32` | dinac_ae_d2 | `75.09` | `2.347` | `426.2` | `1606.7` | `4.74x` | `87.2%` |
| `512x512` | `32` | dinac_ae | `53.09` | `1.659` | `602.7` | `1570.0` | `6.70x` | `87.5%` |
| `512x512` | `32` | FLUX.2 VAE | `355.64` | `11.114` | `90.0` | `12533.8` | `1.00x` | `0.0%` |

The encoder is slower than DINAC-AE's encoder because it uses 8 transformer blocks instead of 6, but remains much faster and lighter than the FLUX.2 VAE encoder.

## Latent Interface

- `encode()` returns DINAC-AE-D2's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens internally, and predicts a DINOv2-B class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly: `num_steps=1` means one NFE.

The export ships weights in `float32`. The recommended and default runtime path is `bfloat16` for the main encoder, decoder, and class-token path, with `float32` retained for whitening/dewhitening, normalization math, RoPE frequency construction, and VP diffusion schedule helpers.

## Usage

```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae_d2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```

## Details

- DINAC-AE-D2 uses an `8`-block ViT/DiT-style transformer encoder and an `8`-block FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- Total parameter count is `154.22M`: `78.02M` encoder, `61.93M` decoder, and `14.26M` DINO token/class alignment head.
- The DINO alignment head predicts spatial patch tokens and a class-token output in DINOv2 ViT-B/14 feature space.
- `predict_class(latents)` exposes the DINOv2 ViT-B/14 class-token feature directly from latents.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results
- Related: [DINAC-AE](https://huggingface.co/data-archetype/dinac_ae), [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), [full_capacitor](https://huggingface.co/data-archetype/full_capacitor), [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)

## Citation

```bibtex
@misc{dinac_ae_d2,
  title  = {DINAC-AE-D2: a DINOv2-aligned class-token diffusion autoencoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = jun,
  url    = {https://huggingface.co/data-archetype/dinac_ae_d2},
}
```
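The derived columns in the Encode Throughput table appear to follow directly from the measured batch time and peak VRAM. The sketch below reproduces them for the `256x256` dinac_ae_d2 row. The function name `encode_stats` and the formulas are assumptions inferred from the published numbers, not taken from the benchmark script.

```python
def encode_stats(ms_per_batch, batch_size, peak_vram_mib,
                 baseline_ms_per_batch, baseline_vram_mib):
    """Derive per-image and relative metrics from one benchmark row.

    Hypothetical helper: the relations are inferred from the table,
    not from the author's benchmark code.
    """
    ms_per_image = ms_per_batch / batch_size
    return {
        "ms_per_image": ms_per_image,
        "images_per_s": 1000.0 / ms_per_image,
        # Speedup is the baseline batch time divided by this model's batch time.
        "speedup": baseline_ms_per_batch / ms_per_batch,
        # Reduction is the fraction of baseline peak VRAM that is saved.
        "vram_reduction_pct": 100.0 * (1.0 - peak_vram_mib / baseline_vram_mib),
    }


# dinac_ae_d2 at 256x256, batch 128, against the FLUX.2 VAE row.
stats = encode_stats(69.56, 128, 1606.5, 341.94, 12533.8)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Recomputing from the rounded `ms/batch` value gives an images/s figure that differs from the table in the last digits, which would be consistent with the table having been computed from unrounded timings.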
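For readers who want to compare their own reconstructions with the 2k PSNR Benchmark, the following is a minimal dependency-free sketch of per-image PSNR. It assumes pixel values in `[-1, 1]`, as in the Usage section, and therefore a peak-to-peak range of `2.0`. The exact convention used for the reported numbers is not stated here, so treat `data_range` and the helper name `psnr_db` as assumptions.

```python
import math


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    data_range is the peak-to-peak span of valid pixel values; 2.0 matches
    images scaled to [-1, 1] (an assumption about the evaluation setup).
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy example: a flattened "image" and a reconstruction with a constant 0.02 offset.
reference = [-1.0, -0.5, 0.0, 0.5, 1.0]
reconstruction = [value + 0.02 for value in reference]
print(f"{psnr_db(reference, reconstruction):.2f} dB")
```

With real tensors, the same formula applies after flattening each image; the mean, median, and percentile columns would then be statistics over the per-image values.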