Commit 44f2564 by data-archetype · verified · 1 Parent(s): 60276a8

Update technical_report_fcdm_diffae.md benchmark and docs

Files changed (1)
  1. technical_report_fcdm_diffae.md +8 -19
technical_report_fcdm_diffae.md CHANGED
@@ -23,6 +23,9 @@ background, see the original
 | Model dim | `896` | `1024` |
 | Encoder blocks | `4` | `8` |
 | Decoder blocks | `8` | `8` |
+| Spatial compression | `16x` | `32x` |
+| Posterior | diagonal Gaussian | diagonal Gaussian |
+| Bottleneck norm | disabled | disabled |
 | Parameters | `88.8M` | `156.6M` |
 | Semantic teacher | DINOv3 ViT-S/16 LVD1689M, `vit_small_patch16_dinov3.lvd_1689m` | DINOv3 ConvNeXt-B/LVD1689M, `convnext_base.dinov3_lvd1689m` |
 | Semantic loss | negative cosine | 50/50 MSE plus negative cosine |
@@ -33,20 +36,6 @@ encoder depth. The result is a lower-resolution latent grid intended to be
 easier and cheaper for downstream diffusion models, while still preserving the
 SemDisDiffAE-style VP diffusion decoder and stochastic posterior.
 
-## Architecture
-
-| Component | Value |
-|---|---:|
-| Parameters | `156.6M` |
-| Encoder blocks | `8` |
-| Decoder blocks | `8` |
-| Patch size | `32` |
-| Model dim | `1024` |
-| Bottleneck dim | `384` |
-| Spatial compression | `32x` |
-| Posterior | `diagonal_gaussian` |
-| Bottleneck norm | `disabled` |
-
 The decoder uses the start / middle / end skip-concat layout with `2` start
 blocks and `2` end blocks. The encoder and decoder both operate natively at
 patch size `32`; this is not a patch-16 model with an additional latent
@@ -177,8 +166,8 @@ repeated batched `encode()` calls after `5` warmup batches.
 
 | Resolution | Batch Size | Mean (ms/batch) | Median (ms/batch) | P95 (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `256x256` | `128` | `12.54` | `12.52` | `12.86` | `0.098` | `10206.3` | `567.8 MiB` |
-| `512x512` | `32` | `12.09` | `12.12` | `12.33` | `0.378` | `2647.2` | `563.8 MiB` |
+| `256x256` | `128` | `12.38` | `12.29` | `12.75` | `0.097` | `10336.1` | `567.8 MiB` |
+| `512x512` | `128` | `53.49` | `52.98` | `56.19` | `0.418` | `2393.0` | `1353.8 MiB` |
 
 ## Decode Latency
 
@@ -191,9 +180,9 @@ time.
 
 | Resolution | Batch Size | Images | Mean (ms/image) | Median (ms/image) | P95 (ms/image) | Images/s | Peak Allocated VRAM |
 |---:|---:|---:|---:|---:|---:|---:|---:|
-| `512x512` | `1` | `20` | `5.11` | `5.10` | `5.27` | `195.6` | `340.8 MiB` |
-| `1024x1024` | `1` | `20` | `10.14` | `10.16` | `10.22` | `98.6` | `409.6 MiB` |
-| `2048x2048` | `1` | `20` | `53.86` | `53.95` | `53.98` | `18.6` | `720.9 MiB` |
+| `512x512` | `1` | `20` | `3.89` | `3.90` | `3.90` | `256.8` | `340.8 MiB` |
+| `1024x1024` | `1` | `20` | `9.79` | `9.79` | `9.83` | `102.2` | `409.6 MiB` |
+| `2048x2048` | `1` | `20` | `51.90` | `51.73` | `52.20` | `19.3` | `720.9 MiB` |
 
 ## VP Stability
 
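
Review note: the updated encode rows can be sanity-checked by recomputing the derived columns from the batch size and mean latency. The sketch below assumes, since the report does not state it, that `ms/image` is the mean batch latency divided by the batch size and that `Images/s` is the batch size divided by the mean batch latency in seconds. The helper names `per_image_ms` and `images_per_second` are hypothetical and are not part of the benchmark script.

```python
# Recompute the derived encode-throughput columns from the published values.
# Assumption: ms/image = mean ms/batch / batch size,
#             Images/s = batch size / (mean ms/batch / 1000).


def per_image_ms(mean_ms_per_batch: float, batch_size: int) -> float:
    """Average latency attributed to one image in a batch, in milliseconds."""
    return mean_ms_per_batch / batch_size


def images_per_second(mean_ms_per_batch: float, batch_size: int) -> float:
    """Throughput implied by the mean batch latency."""
    return batch_size * 1000.0 / mean_ms_per_batch


# (resolution, batch size, mean ms/batch) taken from the updated encode table.
rows = [
    ("256x256", 128, 12.38),
    ("512x512", 128, 53.49),
]

for resolution, batch_size, mean_ms in rows:
    ms_img = per_image_ms(mean_ms, batch_size)
    ips = images_per_second(mean_ms, batch_size)
    print(f"{resolution}: {ms_img:.3f} ms/image, {ips:.1f} images/s")
```

Because the published means are rounded to two decimals, the recomputed throughput can differ from the table in the last few digits. Under the stated assumption, the per-image latencies agree with the `0.097` and `0.418` entries.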