data-archetype/semdisdiffae_p32_v2

@@ -23,6 +23,9 @@ background, see the original
 | Model dim | `896` | `1024` |
 | Encoder blocks | `4` | `8` |
 | Decoder blocks | `8` | `8` |
 | Parameters | `88.8M` | `156.6M` |
 | Semantic teacher | DINOv3 ViT-S/16 LVD1689M, `vit_small_patch16_dinov3.lvd_1689m` | DINOv3 ConvNeXt-B/LVD1689M, `convnext_base.dinov3_lvd1689m` |
 | Semantic loss | negative cosine | 50/50 MSE plus negative cosine |
@@ -33,20 +36,6 @@ encoder depth. The result is a lower-resolution latent grid intended to be
 easier and cheaper for downstream diffusion models, while still preserving the
 SemDisDiffAE-style VP diffusion decoder and stochastic posterior.
-## Architecture
-| Component | Value |
-|---|---:|
-| Parameters | `156.6M` |
-| Encoder blocks | `8` |
-| Decoder blocks | `8` |
-| Patch size | `32` |
-| Model dim | `1024` |
-| Bottleneck dim | `384` |
-| Spatial compression | `32x` |
-| Posterior | `diagonal_gaussian` |
-| Bottleneck norm | `disabled` |
 The decoder uses the start / middle / end skip-concat layout with `2` start
 blocks and `2` end blocks. The encoder and decoder both operate natively at
 patch size `32`; this is not a patch-16 model with an additional latent
@@ -177,8 +166,8 @@ repeated batched `encode()` calls after `5` warmup batches.
 | Resolution | Batch Size | Mean (ms/batch) | Median (ms/batch) | P95 (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `256x256` | `128` | `12.54` | `12.52` | `12.86` | `0.098` | `10206.3` | `567.8 MiB` |
-| `512x512` | `32` | `12.09` | `12.12` | `12.33` | `0.378` | `2647.2` | `563.8 MiB` |
 ## Decode Latency
@@ -191,9 +180,9 @@ time.
 | Resolution | Batch Size | Images | Mean (ms/image) | Median (ms/image) | P95 (ms/image) | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `512x512` | `1` | `20` | `5.11` | `5.10` | `5.27` | `195.6` | `340.8 MiB` |
-| `1024x1024` | `1` | `20` | `10.14` | `10.16` | `10.22` | `98.6` | `409.6 MiB` |
-| `2048x2048` | `1` | `20` | `53.86` | `53.95` | `53.98` | `18.6` | `720.9 MiB` |
 ## VP Stability

 | Model dim | `896` | `1024` |
 | Encoder blocks | `4` | `8` |
 | Decoder blocks | `8` | `8` |
+| Spatial compression | `16x` | `32x` |
+| Posterior | diagonal Gaussian | diagonal Gaussian |
+| Bottleneck norm | disabled | disabled |
 | Parameters | `88.8M` | `156.6M` |
 | Semantic teacher | DINOv3 ViT-S/16 LVD1689M, `vit_small_patch16_dinov3.lvd_1689m` | DINOv3 ConvNeXt-B/LVD1689M, `convnext_base.dinov3_lvd1689m` |
 | Semantic loss | negative cosine | 50/50 MSE plus negative cosine |
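
To make the spatial compression row concrete, the following is a minimal sketch of the latent grid arithmetic. It assumes the latent grid is simply the image height and width divided by the compression factor, and that inputs are multiples of that factor. `latent_grid` is a hypothetical helper for illustration, not part of the model's API.

```python
def latent_grid(height: int, width: int, compression: int) -> tuple[int, int]:
    """Return (rows, cols) of the latent grid for an image of the given size.

    Assumption: the grid is the image size divided by the spatial
    compression factor, with no padding or cropping.
    """
    if height % compression or width % compression:
        raise ValueError("image size must be a multiple of the compression factor")
    return height // compression, width // compression


# Compare the two compression factors from the table at 512x512.
for compression in (16, 32):
    rows, cols = latent_grid(512, 512, compression)
    print(f"{compression}x: {rows}x{cols} grid, {rows * cols} positions")
```

Under that assumption, moving from `16x` to `32x` compression reduces the number of latent positions by a factor of four at any fixed resolution.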
 easier and cheaper for downstream diffusion models, while still preserving the
 SemDisDiffAE-style VP diffusion decoder and stochastic posterior.
 The decoder uses the start / middle / end skip-concat layout with `2` start
 blocks and `2` end blocks. The encoder and decoder both operate natively at
 patch size `32`; this is not a patch-16 model with an additional latent
 | Resolution | Batch Size | Mean (ms/batch) | Median (ms/batch) | P95 (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
+| `256x256` | `128` | `12.38` | `12.29` | `12.75` | `0.097` | `10336.1` | `567.8 MiB` |
+| `512x512` | `128` | `53.49` | `52.98` | `56.19` | `0.418` | `2393.0` | `1353.8 MiB` |
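
The derived columns follow from the mean batch latency and the batch size. A small sketch of that arithmetic is shown here; `per_image_stats` is a hypothetical helper, and the published figures were presumably computed from unrounded timings, so the last digit can differ slightly from what the rounded table values give.

```python
def per_image_stats(mean_ms_per_batch: float, batch_size: int) -> tuple[float, float]:
    """Derive (ms/image, images/s) from mean batch latency in milliseconds."""
    ms_per_image = mean_ms_per_batch / batch_size
    images_per_s = 1000.0 * batch_size / mean_ms_per_batch
    return ms_per_image, images_per_s


# Values taken from the 512x512, batch size 128 row.
ms_img, ips = per_image_stats(53.49, 128)
print(f"{ms_img:.3f} ms/image, {ips:.1f} images/s")
```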
 ## Decode Latency
 | Resolution | Batch Size | Images | Mean (ms/image) | Median (ms/image) | P95 (ms/image) | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
+| `512x512` | `1` | `20` | `3.89` | `3.90` | `3.90` | `256.8` | `340.8 MiB` |
+| `1024x1024` | `1` | `20` | `9.79` | `9.79` | `9.83` | `102.2` | `409.6 MiB` |
+| `2048x2048` | `1` | `20` | `51.90` | `51.73` | `52.20` | `19.3` | `720.9 MiB` |
 ## VP Stability