Update technical_report_fcdm_diffae.md benchmark and docs
technical_report_fcdm_diffae.md
@@ -23,6 +23,9 @@ background, see the original
 | Model dim | `896` | `1024` |
 | Encoder blocks | `4` | `8` |
 | Decoder blocks | `8` | `8` |
+| Spatial compression | `16x` | `32x` |
+| Posterior | diagonal Gaussian | diagonal Gaussian |
+| Bottleneck norm | disabled | disabled |
 | Parameters | `88.8M` | `156.6M` |
 | Semantic teacher | DINOv3 ViT-S/16 LVD1689M, `vit_small_patch16_dinov3.lvd_1689m` | DINOv3 ConvNeXt-B/LVD1689M, `convnext_base.dinov3_lvd1689m` |
 | Semantic loss | negative cosine | 50/50 MSE plus negative cosine |
@@ -33,20 +36,6 @@ encoder depth. The result is a lower-resolution latent grid intended to be
 easier and cheaper for downstream diffusion models, while still preserving the
 SemDisDiffAE-style VP diffusion decoder and stochastic posterior.
 
-## Architecture
-
-| Component | Value |
-|---|---:|
-| Parameters | `156.6M` |
-| Encoder blocks | `8` |
-| Decoder blocks | `8` |
-| Patch size | `32` |
-| Model dim | `1024` |
-| Bottleneck dim | `384` |
-| Spatial compression | `32x` |
-| Posterior | `diagonal_gaussian` |
-| Bottleneck norm | `disabled` |
-
 The decoder uses the start / middle / end skip-concat layout with `2` start
 blocks and `2` end blocks. The encoder and decoder both operate natively at
 patch size `32`; this is not a patch-16 model with an additional latent
@@ -177,8 +166,8 @@ repeated batched `encode()` calls after `5` warmup batches.
 
 | Resolution | Batch Size | Mean (ms/batch) | Median (ms/batch) | P95 (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `256x256` | `128` | `12.
-| `512x512` | `
+| `256x256` | `128` | `12.38` | `12.29` | `12.75` | `0.097` | `10336.1` | `567.8 MiB` |
+| `512x512` | `128` | `53.49` | `52.98` | `56.19` | `0.418` | `2393.0` | `1353.8 MiB` |
 
 ## Decode Latency
 
@@ -191,9 +180,9 @@ time.
 
 | Resolution | Batch Size | Images | Mean (ms/image) | Median (ms/image) | P95 (ms/image) | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `512x512` | `1` | `20` | `
-| `1024x1024` | `1` | `20` | `
-| `2048x2048` | `1` | `20` | `
+| `512x512` | `1` | `20` | `3.89` | `3.90` | `3.90` | `256.8` | `340.8 MiB` |
+| `1024x1024` | `1` | `20` | `9.79` | `9.79` | `9.83` | `102.2` | `409.6 MiB` |
+| `2048x2048` | `1` | `20` | `51.90` | `51.73` | `52.20` | `19.3` | `720.9 MiB` |
 
 ## VP Stability
 
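
Review note: the derived columns in the added encode-latency rows can be cross-checked against the mean batch time. This is a minimal sketch, assuming `ms/image` is the mean batch latency divided by the batch size and `Images/s` is the matching throughput. The report does not say how these columns were computed, and the helper name `derived` is hypothetical.

```python
# Cross-check of the derived columns in the added encode-latency rows.
# Input values are copied from the "+" rows of the diff.
# Assumption: ms/image = mean / batch_size, images/s = 1000 * batch_size / mean.

def derived(mean_ms_per_batch: float, batch_size: int) -> tuple[float, float]:
    """Return (ms/image, images/s) implied by a mean batch latency in ms."""
    ms_per_image = mean_ms_per_batch / batch_size
    images_per_s = 1000.0 * batch_size / mean_ms_per_batch
    return ms_per_image, images_per_s

encode_rows = [
    # (resolution, batch size, mean ms/batch, reported ms/image, reported images/s)
    ("256x256", 128, 12.38, 0.097, 10336.1),
    ("512x512", 128, 53.49, 0.418, 2393.0),
]

for res, bs, mean_ms, reported_ms_img, reported_ips in encode_rows:
    ms_img, ips = derived(mean_ms, bs)
    rel_diff = abs(ips - reported_ips) / reported_ips
    print(f"{res}: {ms_img:.3f} ms/image, {ips:.1f} images/s "
          f"(relative difference from reported: {rel_diff:.2%})")
```

Under this assumption the recomputed values agree with the reported ones to within a small fraction of a percent. The remaining gap is presumably because the report derived its columns from unrounded timings.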