WaveCut committed on
Commit
b292728
·
verified ·
1 Parent(s): 687a1fe

Document corrected ERNIE qmm runtime profile

Adds runtime_config.json and corrected explicit quantized-matmul/default allocator metrics. Updates the model card to distinguish serialized config-path measurements from explicit qmm runtime, and documents the PYTORCH_CUDA_ALLOC_CONF allocator pitfall.

README.md CHANGED
@@ -18,8 +18,8 @@ tags:
18
  # ERNIE-Image-Turbo SDNQ UINT4 Static
19
 
20
  This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
21
- The published SDNQ configs now set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
22
- Weights are unchanged from the UINT4 static quantization; this update makes the quantized-matmul runtime path explicit in the artifact metadata.
23
 
24
  ## Recipe
25
 
@@ -29,6 +29,8 @@ Weights are unchanged from the UINT4 static quantization; this update makes the
29
  - Runtime validation: `use_quantized_matmul=true`
30
  - Validation GPU: NVIDIA RTX 6000 Ada Generation
31
  - Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
 
 
32
 
33
  `use_pe=False` is used for the headline validation table to compare the image models directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer was much smaller.
34
 
@@ -37,9 +39,21 @@ Weights are unchanged from the UINT4 static quantization; this update makes the
37
  | Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
38
  |---|---:|---:|---:|---:|---:|---:|---:|---:|
39
  | Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
40
- | SDNQ UINT4 static + quantized matmul | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
41
 
42
- With PE disabled, this quant is about `1.45x` the original BF16 hot latency on the measured RTX 6000 Ada run, while reducing hot peak VRAM from `34932` MiB to `15390` MiB by `nvidia-smi` sampling.
43
 
44
  ## Visual Comparison
45
 
@@ -53,12 +67,18 @@ Individual prompt pairs are stored in `comparison/`, and full metrics are stored
53
  import torch
54
  import sdnq # registers SDNQ support
55
  from diffusers import ErnieImagePipeline
 
56
 
57
  pipe = ErnieImagePipeline.from_pretrained(
58
  "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
59
  torch_dtype=torch.bfloat16,
60
  ).to("cuda")
61
 
62
  image = pipe(
63
  prompt="A clean modern poster with readable Cyrillic typography",
64
  width=1024,
@@ -69,15 +89,14 @@ image = pipe(
69
  ).images[0]
70
  ```
71
 
72
- If your local SDNQ/Diffusers build ignores the serialized `use_quantized_matmul` flag, enable it explicitly after loading:
73
 
74
- ```python
75
- from sdnq.loader import apply_sdnq_options_to_model
76
 
 
77
  for name in ("pe", "text_encoder", "transformer"):
78
- component = getattr(pipe, name, None)
79
- if component is not None:
80
- setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))
81
  ```
82
 
83
  ## Prompt Set
@@ -99,4 +118,5 @@ for name in ("pe", "text_encoder", "transformer"):
99
 
100
  - The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
101
  - `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
 
102
  - This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.
 
18
  # ERNIE-Image-Turbo SDNQ UINT4 Static
19
 
20
  This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
21
+ The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
22
+ For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata, but may not be applied automatically by `from_pretrained()`.
23
 
24
  ## Recipe
25
 
 
29
  - Runtime validation: `use_quantized_matmul=true`
30
  - Validation GPU: NVIDIA RTX 6000 Ada Generation
31
  - Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
32
+ - Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
33
+ - Machine-readable runtime recommendations are stored in `runtime_config.json`.
34
 
35
  `use_pe=False` is used for the headline validation table to compare the image models directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer was much smaller.
36
 
 
39
  | Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
40
  |---|---:|---:|---:|---:|---:|---:|---:|---:|
41
  | Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
42
+ | SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
43
 
44
+ The row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading.
45
+
46
+ ### Explicit Quantized-Matmul Runtime
47
+
48
+ With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:
49
+
50
+ | Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
51
+ |---|---:|---:|---:|---:|---:|---:|---:|
52
+ | SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
53
+
54
+ The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` is about `0.55-0.65s` after warmup, and `vae.decode` is usually about `0.15s`.
55
+
56
+ The allocator pitfall is large: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured `25.88s` hot median with `empty_cache=True`, or `15.86s` without `empty_cache`.
57
 
58
  ## Visual Comparison
59
 
 
67
  import torch
68
  import sdnq # registers SDNQ support
69
  from diffusers import ErnieImagePipeline
70
+ from sdnq.loader import apply_sdnq_options_to_model
71
 
72
  pipe = ErnieImagePipeline.from_pretrained(
73
  "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
74
  torch_dtype=torch.bfloat16,
75
  ).to("cuda")
76
 
77
+ for name in ("pe", "text_encoder", "transformer"):
78
+ component = getattr(pipe, name, None)
79
+ if component is not None:
80
+ setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))
81
+
82
  image = pipe(
83
  prompt="A clean modern poster with readable Cyrillic typography",
84
  width=1024,
 
89
  ).images[0]
90
  ```
91
 
92
+ If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.
93
 
94
+ You can confirm the runtime state after loading:
 
95
 
96
+ ```python
97
  for name in ("pe", "text_encoder", "transformer"):
98
+ qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
99
+ print(name, getattr(qcfg, "use_quantized_matmul", None))
 
100
  ```
101
 
102
  ## Prompt Set
 
118
 
119
  - The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
120
  - `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
121
+ - Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
122
  - This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.
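
Reviewer note: the README change above warns against `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`. A minimal sketch of a startup guard for that setting follows. The helper names `parse_alloc_conf` and `risky_alloc_settings` are hypothetical and not part of SDNQ, Diffusers, or this repository. The sketch flags either key on its own, although the commit only reports measurements for the two keys combined, so treat a single-key hit as a prompt to re-measure rather than as a confirmed problem. It would most plausibly be called before any CUDA work starts, since the allocator configuration is read from the environment.

```python
import os

# Keys from the allocator setting that the README note reports as harmful.
# Assumption: flagging each key separately; the source only tested them together.
RISKY_KEYS = {"expandable_segments", "max_split_size_mb"}


def parse_alloc_conf(value):
    """Parse a 'key:value,key:value' allocator string into a dict."""
    conf = {}
    for item in (value or "").split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition(":")
        conf[key.strip()] = val.strip()
    return conf


def risky_alloc_settings(env=os.environ):
    """Return the subset of allocator options that the commit warns about."""
    conf = parse_alloc_conf(env.get("PYTORCH_CUDA_ALLOC_CONF"))
    return {k: v for k, v in conf.items() if k in RISKY_KEYS}


# Example with an explicit mapping instead of the real environment.
flagged = risky_alloc_settings(
    {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True,max_split_size_mb:32"}
)
if flagged:
    print("Allocator options reported as slow for this pipeline:", flagged)
```

Passing a plain dict keeps the check testable without touching the process environment; calling `risky_alloc_settings()` with no argument inspects `os.environ`.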
metrics/ernie_qmm_peoff_vs_flux2_klein_qmm_summary.json CHANGED
@@ -39,5 +39,5 @@
39
  }
40
  ],
41
  "hot_speed_ratio_ernie_over_flux": 5.11579273478586,
42
- "note": "Hot mean excludes the first generation after loading; cold is prompt 00."
43
  }
 
39
  }
40
  ],
41
  "hot_speed_ratio_ernie_over_flux": 5.11579273478586,
42
+ "note": "Hot mean excludes the first generation after loading; cold is prompt 00. The ERNIE timing in this comparison was later found to be allocator-affected: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32 caused about 48 GiB torch reservation and slow transformer.forward. For corrected ERNIE explicit-qmm/default-allocator timings, see metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json."
43
  }
metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json ADDED
@@ -0,0 +1,158 @@
1
+ {
2
+ "label": "ernie_uint4_qmm_explicit_default_allocator_8step_repeat2",
3
+ "device": "NVIDIA RTX 6000 Ada Generation",
4
+ "torch": "2.8.0+cu128",
5
+ "sdnq": "0.1.9",
6
+ "env": {
7
+ "PYTORCH_CUDA_ALLOC_CONF": null
8
+ },
9
+ "settings": {
10
+ "num_inference_steps": 8,
11
+ "guidance_scale": 1.0,
12
+ "use_pe": false,
13
+ "explicit_apply_qmm": true,
14
+ "qmm_states": {
15
+ "pe": true,
16
+ "text_encoder": true,
17
+ "transformer": true
18
+ }
19
+ },
20
+ "load": {
21
+ "seconds": 63.15932087600231,
22
+ "gpu_start_mib": 436,
23
+ "gpu_end_mib": 10172,
24
+ "torch_peak_allocated_mib": 9724,
25
+ "torch_peak_reserved_mib": 9738
26
+ },
27
+ "generations": [
28
+ {
29
+ "prompt_id": "00-cyrillic-poster",
30
+ "title": "Cyrillic event poster",
31
+ "seed": 41001,
32
+ "width": 1024,
33
+ "height": 1024,
34
+ "seconds": 8.344464145600796,
35
+ "gpu_start_mib": 10172,
36
+ "gpu_end_mib": 11070,
37
+ "torch_peak_allocated_mib": 19150,
38
+ "torch_peak_reserved_mib": 19296
39
+ },
40
+ {
41
+ "prompt_id": "01-long-text-bakery-ad",
42
+ "title": "Long text product ad",
43
+ "seed": 41002,
44
+ "width": 896,
45
+ "height": 1200,
46
+ "seconds": 6.855839736759663,
47
+ "gpu_start_mib": 11070,
48
+ "gpu_end_mib": 11086,
49
+ "torch_peak_allocated_mib": 19391,
50
+ "torch_peak_reserved_mib": 19540
51
+ },
52
+ {
53
+ "prompt_id": "02-technical-diagram",
54
+ "title": "Technical diagram",
55
+ "seed": 41003,
56
+ "width": 1200,
57
+ "height": 896,
58
+ "seconds": 6.9446728229522705,
59
+ "gpu_start_mib": 11086,
60
+ "gpu_end_mib": 11086,
61
+ "torch_peak_allocated_mib": 19391,
62
+ "torch_peak_reserved_mib": 19540
63
+ },
64
+ {
65
+ "prompt_id": "03-four-panel-comic",
66
+ "title": "Four-panel comic",
67
+ "seed": 41004,
68
+ "width": 1024,
69
+ "height": 1024,
70
+ "seconds": 5.54518087208271,
71
+ "gpu_start_mib": 11086,
72
+ "gpu_end_mib": 13474,
73
+ "torch_peak_allocated_mib": 12172,
74
+ "torch_peak_reserved_mib": 12954
75
+ },
76
+ {
77
+ "prompt_id": "04-public-domain-painter-fusion",
78
+ "title": "Painterly style fusion",
79
+ "seed": 41005,
80
+ "width": 1024,
81
+ "height": 1024,
82
+ "seconds": 5.585576198995113,
83
+ "gpu_start_mib": 13474,
84
+ "gpu_end_mib": 13474,
85
+ "torch_peak_allocated_mib": 12171,
86
+ "torch_peak_reserved_mib": 12954
87
+ },
88
+ {
89
+ "prompt_id": "05-dashboard-ui",
90
+ "title": "Dense UI dashboard",
91
+ "seed": 41006,
92
+ "width": 1376,
93
+ "height": 768,
94
+ "seconds": 6.736225217580795,
95
+ "gpu_start_mib": 13474,
96
+ "gpu_end_mib": 11140,
97
+ "torch_peak_allocated_mib": 19226,
98
+ "torch_peak_reserved_mib": 19308
99
+ },
100
+ {
101
+ "prompt_id": "06-glass-still-life",
102
+ "title": "Glass and reflections",
103
+ "seed": 41007,
104
+ "width": 1024,
105
+ "height": 1024,
106
+ "seconds": 5.988675691187382,
107
+ "gpu_start_mib": 11140,
108
+ "gpu_end_mib": 13506,
109
+ "torch_peak_allocated_mib": 12171,
110
+ "torch_peak_reserved_mib": 12986
111
+ },
112
+ {
113
+ "prompt_id": "07-botanical-field-guide",
114
+ "title": "Field guide plate",
115
+ "seed": 41008,
116
+ "width": 896,
117
+ "height": 1200,
118
+ "seconds": 5.80891427397728,
119
+ "gpu_start_mib": 13506,
120
+ "gpu_end_mib": 15086,
121
+ "torch_peak_allocated_mib": 12236,
122
+ "torch_peak_reserved_mib": 14564
123
+ },
124
+ {
125
+ "prompt_id": "08-restaurant-menu-board",
126
+ "title": "Menu board text",
127
+ "seed": 41009,
128
+ "width": 1024,
129
+ "height": 1024,
130
+ "seconds": 5.700445763766766,
131
+ "gpu_start_mib": 15086,
132
+ "gpu_end_mib": 15098,
133
+ "torch_peak_allocated_mib": 12172,
134
+ "torch_peak_reserved_mib": 14576
135
+ },
136
+ {
137
+ "prompt_id": "09-isometric-city-map",
138
+ "title": "Isometric map",
139
+ "seed": 41010,
140
+ "width": 1200,
141
+ "height": 896,
142
+ "seconds": 5.566534325480461,
143
+ "gpu_start_mib": 15098,
144
+ "gpu_end_mib": 15888,
145
+ "torch_peak_allocated_mib": 12236,
146
+ "torch_peak_reserved_mib": 15366
147
+ }
148
+ ],
149
+ "summary": {
150
+ "cold_seconds": 8.344464145600796,
151
+ "hot_mean_seconds": 6.081340544753605,
152
+ "hot_median_seconds": 5.80891427397728,
153
+ "hot_min_seconds": 5.54518087208271,
154
+ "hot_max_seconds": 6.9446728229522705,
155
+ "hot_max_reserved_mib": 19540,
156
+ "hot_max_allocated_mib": 19391
157
+ }
158
+ }
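
Reviewer note: the `summary` block in the file above can be cross-checked against its own `generations` array. The sketch below assumes, following the note in the summary JSON earlier in this commit, that "cold" is the first generation after loading and "hot" is every later one. The `summarize` helper is hypothetical, and the timings are copied by hand from the file rather than loaded from disk.

```python
from statistics import mean, median

# Per-generation "seconds" values copied from the generations array above,
# in prompt order (00 through 09).
seconds = [
    8.344464145600796,
    6.855839736759663,
    6.9446728229522705,
    5.54518087208271,
    5.585576198995113,
    6.736225217580795,
    5.988675691187382,
    5.80891427397728,
    5.700445763766766,
    5.566534325480461,
]


def summarize(samples):
    """Split timings into cold (first) and hot (rest) and aggregate the hot ones."""
    cold, hot = samples[0], samples[1:]
    return {
        "cold_seconds": cold,
        "hot_mean_seconds": mean(hot),
        "hot_median_seconds": median(hot),
        "hot_min_seconds": min(hot),
        "hot_max_seconds": max(hot),
    }


summary = summarize(seconds)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Rounded to two decimals, these values agree with the explicit-qmm row in the README table (cold 8.34, hot mean 6.08, hot median 5.81, hot range 5.55-6.94).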
metrics/runtime_allocator_debug_metrics.json ADDED
@@ -0,0 +1,740 @@
1
+ {
2
+ "device": "NVIDIA RTX 6000 Ada Generation",
3
+ "torch": "2.8.0+cu128",
4
+ "cases": [
5
+ {
6
+ "name": "qmm_all_empty_true",
7
+ "enabled_qmm_components": [
8
+ "pe",
9
+ "text_encoder",
10
+ "transformer"
11
+ ],
12
+ "empty_cache": true,
13
+ "load": {
14
+ "seconds": 59.13287413865328,
15
+ "gpu_start_mib": 434,
16
+ "gpu_end_mib": 10006,
17
+ "gpu_peak_mib": 10006,
18
+ "torch_peak_allocated_mib": 9555,
19
+ "torch_peak_reserved_mib": 9572,
20
+ "qmm_states": {
21
+ "pe": {
22
+ "requested": true,
23
+ "actual": true
24
+ },
25
+ "text_encoder": {
26
+ "requested": true,
27
+ "actual": true
28
+ },
29
+ "transformer": {
30
+ "requested": true,
31
+ "actual": true
32
+ }
33
+ }
34
+ },
35
+ "speed_rows": [
36
+ {
37
+ "prompt_id": "00-cyrillic-poster",
38
+ "width": 1024,
39
+ "height": 1024,
40
+ "seconds": 48.675362937152386,
41
+ "gpu_start_mib": 10006,
42
+ "gpu_end_mib": 10658,
43
+ "torch_peak_allocated_mib": 18980,
44
+ "torch_peak_reserved_mib": 48118,
45
+ "stages": {},
46
+ "stage_seconds_sum": 0,
47
+ "unattributed_seconds": null,
48
+ "empty_cache": true,
49
+ "instrument": false
50
+ },
51
+ {
52
+ "prompt_id": "01-long-text-bakery-ad",
53
+ "width": 896,
54
+ "height": 1200,
55
+ "seconds": 25.80502188205719,
56
+ "gpu_start_mib": 10094,
57
+ "gpu_end_mib": 10676,
58
+ "torch_peak_allocated_mib": 19219,
59
+ "torch_peak_reserved_mib": 48118,
60
+ "stages": {},
61
+ "stage_seconds_sum": 0,
62
+ "unattributed_seconds": null,
63
+ "empty_cache": true,
64
+ "instrument": false
65
+ },
66
+ {
67
+ "prompt_id": "02-technical-diagram",
68
+ "width": 1200,
69
+ "height": 896,
70
+ "seconds": 25.88253455609083,
71
+ "gpu_start_mib": 10092,
72
+ "gpu_end_mib": 10696,
73
+ "torch_peak_allocated_mib": 19219,
74
+ "torch_peak_reserved_mib": 48116,
75
+ "stages": {},
76
+ "stage_seconds_sum": 0,
77
+ "unattributed_seconds": null,
78
+ "empty_cache": true,
79
+ "instrument": false
80
+ },
81
+ {
82
+ "prompt_id": "03-four-panel-comic",
83
+ "width": 1024,
84
+ "height": 1024,
85
+ "seconds": 26.319617360830307,
86
+ "gpu_start_mib": 10092,
87
+ "gpu_end_mib": 32080,
88
+ "torch_peak_allocated_mib": 12002,
89
+ "torch_peak_reserved_mib": 48108,
90
+ "stages": {},
91
+ "stage_seconds_sum": 0,
92
+ "unattributed_seconds": null,
93
+ "empty_cache": true,
94
+ "instrument": false
95
+ }
96
+ ],
97
+ "stage_rows": [
98
+ {
99
+ "prompt_id": "00-cyrillic-poster",
100
+ "width": 1024,
101
+ "height": 1024,
102
+ "seconds": 16.900417506694794,
103
+ "gpu_start_mib": 10092,
104
+ "gpu_end_mib": 29120,
105
+ "torch_peak_allocated_mib": 12002,
106
+ "torch_peak_reserved_mib": 48118,
107
+ "stages": {
108
+ "text_encoder.forward": {
109
+ "seconds": 0.080795519053936,
110
+ "calls": 1
111
+ },
112
+ "transformer.forward": {
113
+ "seconds": 16.636043712496758,
114
+ "calls": 8
115
+ },
116
+ "vae.decode": {
117
+ "seconds": 0.14374303817749023,
118
+ "calls": 1
119
+ }
120
+ },
121
+ "stage_seconds_sum": 16.860582269728184,
122
+ "unattributed_seconds": 0.039835236966609955,
123
+ "empty_cache": true,
124
+ "instrument": true
125
+ },
126
+ {
127
+ "prompt_id": "02-technical-diagram",
128
+ "width": 1200,
129
+ "height": 896,
130
+ "seconds": 17.653532460331917,
131
+ "gpu_start_mib": 10094,
132
+ "gpu_end_mib": 20096,
133
+ "torch_peak_allocated_mib": 12063,
134
+ "torch_peak_reserved_mib": 48116,
135
+ "stages": {
136
+ "text_encoder.forward": {
137
+ "seconds": 0.06983215361833572,
138
+ "calls": 1
139
+ },
140
+ "transformer.forward": {
141
+ "seconds": 17.393484614789486,
142
+ "calls": 8
143
+ },
144
+ "vae.decode": {
145
+ "seconds": 0.14451629668474197,
146
+ "calls": 1
147
+ }
148
+ },
149
+ "stage_seconds_sum": 17.607833065092564,
150
+ "unattributed_seconds": 0.04569939523935318,
151
+ "empty_cache": true,
152
+ "instrument": true
153
+ }
154
+ ],
155
+ "summary": {
156
+ "cold_seconds": 48.675362937152386,
157
+ "hot_mean_seconds": 26.00239126632611,
158
+ "hot_median_seconds": 25.88253455609083,
159
+ "hot_min_seconds": 25.80502188205719,
160
+ "hot_max_seconds": 26.319617360830307,
161
+ "hot_max_reserved_mib": 48118,
162
+ "hot_max_allocated_mib": 19219
163
+ }
164
+ },
165
+ {
166
+ "name": "qmm_all_empty_false",
167
+ "enabled_qmm_components": [
168
+ "pe",
169
+ "text_encoder",
170
+ "transformer"
171
+ ],
172
+ "empty_cache": false,
173
+ "load": {
174
+ "seconds": 54.08376982063055,
175
+ "gpu_start_mib": 540,
176
+ "gpu_end_mib": 10092,
177
+ "gpu_peak_mib": 10092,
178
+ "torch_peak_allocated_mib": 9563,
179
+ "torch_peak_reserved_mib": 9572,
180
+ "qmm_states": {
181
+ "pe": {
182
+ "requested": true,
183
+ "actual": true
184
+ },
185
+ "text_encoder": {
186
+ "requested": true,
187
+ "actual": true
188
+ },
189
+ "transformer": {
190
+ "requested": true,
191
+ "actual": true
192
+ }
193
+ }
194
+ },
195
+ "speed_rows": [
196
+ {
197
+ "prompt_id": "00-cyrillic-poster",
198
+ "width": 1024,
199
+ "height": 1024,
200
+ "seconds": 17.728631243109703,
201
+ "gpu_start_mib": 10092,
202
+ "gpu_end_mib": 20200,
203
+ "torch_peak_allocated_mib": 12002,
204
+ "torch_peak_reserved_mib": 48118,
205
+ "stages": {},
206
+ "stage_seconds_sum": 0,
207
+ "unattributed_seconds": null,
208
+ "empty_cache": false,
209
+ "instrument": false
210
+ },
211
+ {
212
+ "prompt_id": "01-long-text-bakery-ad",
213
+ "width": 896,
214
+ "height": 1200,
215
+ "seconds": 18.051423519849777,
216
+ "gpu_start_mib": 20200,
217
+ "gpu_end_mib": 13716,
218
+ "torch_peak_allocated_mib": 12064,
219
+ "torch_peak_reserved_mib": 48118,
220
+ "stages": {},
221
+ "stage_seconds_sum": 0,
222
+ "unattributed_seconds": null,
223
+ "empty_cache": false,
224
+ "instrument": false
225
+ },
226
+ {
227
+ "prompt_id": "02-technical-diagram",
228
+ "width": 1200,
229
+ "height": 896,
230
+ "seconds": 15.653381533920765,
231
+ "gpu_start_mib": 13716,
232
+ "gpu_end_mib": 36456,
233
+ "torch_peak_allocated_mib": 12063,
234
+ "torch_peak_reserved_mib": 48116,
235
+ "stages": {},
236
+ "stage_seconds_sum": 0,
237
+ "unattributed_seconds": null,
238
+ "empty_cache": false,
239
+ "instrument": false
240
+ },
241
+ {
242
+ "prompt_id": "03-four-panel-comic",
243
+ "width": 1024,
244
+ "height": 1024,
245
+ "seconds": 15.864265829324722,
246
+ "gpu_start_mib": 36456,
247
+ "gpu_end_mib": 39720,
248
+ "torch_peak_allocated_mib": 12002,
249
+ "torch_peak_reserved_mib": 48118,
250
+ "stages": {},
251
+ "stage_seconds_sum": 0,
252
+ "unattributed_seconds": null,
253
+ "empty_cache": false,
254
+ "instrument": false
255
+ }
256
+ ],
257
+ "stage_rows": [
258
+ {
259
+ "prompt_id": "00-cyrillic-poster",
260
+ "width": 1024,
261
+ "height": 1024,
262
+ "seconds": 14.801267191767693,
263
+ "gpu_start_mib": 39720,
264
+ "gpu_end_mib": 37980,
265
+ "torch_peak_allocated_mib": 12002,
266
+ "torch_peak_reserved_mib": 48118,
267
+ "stages": {
268
+ "text_encoder.forward": {
269
+ "seconds": 0.07214060425758362,
270
+ "calls": 1
271
+ },
272
+ "transformer.forward": {
273
+ "seconds": 14.541316010057926,
274
+ "calls": 8
275
+ },
276
+ "vae.decode": {
277
+ "seconds": 0.14533011615276337,
278
+ "calls": 1
279
+ }
280
+ },
281
+ "stage_seconds_sum": 14.758786730468273,
282
+ "unattributed_seconds": 0.0424804612994194,
283
+ "empty_cache": false,
284
+ "instrument": true
285
+ },
286
+ {
287
+ "prompt_id": "02-technical-diagram",
288
+ "width": 1200,
289
+ "height": 896,
290
+ "seconds": 16.745930925011635,
291
+ "gpu_start_mib": 37980,
292
+ "gpu_end_mib": 45636,
293
+ "torch_peak_allocated_mib": 12063,
294
+ "torch_peak_reserved_mib": 48116,
295
+ "stages": {
296
+ "text_encoder.forward": {
297
+ "seconds": 0.07216303050518036,
298
+ "calls": 1
299
+ },
300
+ "transformer.forward": {
301
+ "seconds": 16.474046893417835,
302
+ "calls": 8
303
+ },
304
+ "vae.decode": {
305
+ "seconds": 0.1505463719367981,
306
+ "calls": 1
307
+ }
308
+ },
309
+ "stage_seconds_sum": 16.696756295859814,
310
+ "unattributed_seconds": 0.049174629151821136,
311
+ "empty_cache": false,
312
+ "instrument": true
313
+ }
314
+ ],
315
+ "summary": {
316
+ "cold_seconds": 17.728631243109703,
317
+ "hot_mean_seconds": 16.52302362769842,
318
+ "hot_median_seconds": 15.864265829324722,
319
+ "hot_min_seconds": 15.653381533920765,
320
+ "hot_max_seconds": 18.051423519849777,
321
+ "hot_max_reserved_mib": 48118,
322
+ "hot_max_allocated_mib": 12064
323
+ }
324
+ },
325
+ {
326
+ "name": "qmm_transformer_only_empty_true",
327
+ "enabled_qmm_components": [
328
+ "transformer"
329
+ ],
330
+ "empty_cache": true,
331
+ "load": {
332
+ "seconds": 54.33865138143301,
333
+ "gpu_start_mib": 540,
334
+ "gpu_end_mib": 10092,
335
+ "gpu_peak_mib": 10092,
336
+ "torch_peak_allocated_mib": 9563,
337
+ "torch_peak_reserved_mib": 9572,
338
+ "qmm_states": {
339
+ "pe": {
340
+ "requested": false,
341
+ "actual": false
342
+ },
343
+ "text_encoder": {
344
+ "requested": false,
345
+ "actual": false
346
+ },
347
+ "transformer": {
348
+ "requested": true,
349
+ "actual": true
350
+ }
351
+ }
352
+ },
353
+ "speed_rows": [
354
+ {
355
+ "prompt_id": "00-cyrillic-poster",
356
+ "width": 1024,
357
+ "height": 1024,
358
+ "seconds": 22.075800754129887,
359
+ "gpu_start_mib": 10092,
360
+ "gpu_end_mib": 21082,
361
+ "torch_peak_allocated_mib": 12002,
362
+ "torch_peak_reserved_mib": 48098,
363
+ "stages": {},
364
+ "stage_seconds_sum": 0,
365
+ "unattributed_seconds": null,
366
+ "empty_cache": true,
367
+ "instrument": false
368
+ },
369
+ {
370
+ "prompt_id": "01-long-text-bakery-ad",
371
+ "width": 896,
372
+ "height": 1200,
373
+ "seconds": 20.010109677910805,
374
+ "gpu_start_mib": 10096,
375
+ "gpu_end_mib": 34960,
376
+ "torch_peak_allocated_mib": 12064,
377
+ "torch_peak_reserved_mib": 48110,
378
+ "stages": {},
379
+ "stage_seconds_sum": 0,
380
+ "unattributed_seconds": null,
381
+ "empty_cache": true,
382
+ "instrument": false
383
+ },
384
+ {
385
+ "prompt_id": "02-technical-diagram",
386
+ "width": 1200,
387
+ "height": 896,
388
+ "seconds": 16.71645325422287,
389
+ "gpu_start_mib": 10094,
390
+ "gpu_end_mib": 36458,
391
+ "torch_peak_allocated_mib": 12063,
392
+ "torch_peak_reserved_mib": 48116,
393
+ "stages": {},
394
+ "stage_seconds_sum": 0,
395
+ "unattributed_seconds": null,
396
+ "empty_cache": true,
397
+ "instrument": false
398
+ },
399
+ {
400
+ "prompt_id": "03-four-panel-comic",
401
+ "width": 1024,
402
+ "height": 1024,
403
+ "seconds": 17.778217256069183,
404
+ "gpu_start_mib": 10094,
405
+ "gpu_end_mib": 20264,
406
+ "torch_peak_allocated_mib": 12002,
407
+ "torch_peak_reserved_mib": 48108,
408
+ "stages": {},
409
+ "stage_seconds_sum": 0,
410
+ "unattributed_seconds": null,
411
+ "empty_cache": true,
412
+ "instrument": false
413
+ }
414
+ ],
415
+ "stage_rows": [
416
+ {
417
+ "prompt_id": "00-cyrillic-poster",
418
+ "width": 1024,
419
+ "height": 1024,
420
+ "seconds": 16.47866802662611,
421
+ "gpu_start_mib": 10094,
422
+ "gpu_end_mib": 21082,
423
+ "torch_peak_allocated_mib": 12002,
424
+ "torch_peak_reserved_mib": 48098,
425
+ "stages": {
426
+ "text_encoder.forward": {
427
+ "seconds": 0.05702096223831177,
428
+ "calls": 1
429
+ },
430
+ "transformer.forward": {
431
+ "seconds": 16.22751172631979,
432
+ "calls": 8
433
+ },
434
+ "vae.decode": {
435
+ "seconds": 0.146262064576149,
436
+ "calls": 1
437
+ }
438
+ },
439
+ "stage_seconds_sum": 16.43079475313425,
440
+ "unattributed_seconds": 0.047873273491859436,
441
+ "empty_cache": true,
442
+ "instrument": true
443
+ },
444
+ {
445
+ "prompt_id": "02-technical-diagram",
446
+ "width": 1200,
447
+ "height": 896,
448
+ "seconds": 18.242334879934788,
449
+ "gpu_start_mib": 10096,
450
+ "gpu_end_mib": 36458,
451
+ "torch_peak_allocated_mib": 12063,
452
+ "torch_peak_reserved_mib": 48116,
453
+ "stages": {
454
+ "text_encoder.forward": {
455
+ "seconds": 0.05669167637825012,
456
+ "calls": 1
457
+ },
458
+ "transformer.forward": {
459
+ "seconds": 17.962014980614185,
460
+ "calls": 8
461
+ },
462
+ "vae.decode": {
463
+ "seconds": 0.147530660033226,
464
+ "calls": 1
465
+ }
466
+ },
467
+ "stage_seconds_sum": 18.16623731702566,
468
+ "unattributed_seconds": 0.07609756290912628,
469
+ "empty_cache": true,
470
+ "instrument": true
471
+ }
472
+ ],
473
+ "summary": {
474
+ "cold_seconds": 22.075800754129887,
475
+ "hot_mean_seconds": 18.168260062734287,
476
+ "hot_median_seconds": 17.778217256069183,
477
+ "hot_min_seconds": 16.71645325422287,
478
+ "hot_max_seconds": 20.010109677910805,
479
+ "hot_max_reserved_mib": 48116,
480
+ "hot_max_allocated_mib": 12064
481
+ }
482
+ },
483
+ {
484
+ "name": "qmm_none_empty_true",
485
+ "enabled_qmm_components": [],
486
+ "empty_cache": true,
487
+ "load": {
488
+ "seconds": 54.613617569208145,
489
+ "gpu_start_mib": 542,
490
+ "gpu_end_mib": 10094,
491
+ "gpu_peak_mib": 10094,
492
+ "torch_peak_allocated_mib": 9563,
493
+ "torch_peak_reserved_mib": 9572,
494
+ "qmm_states": {
495
+ "pe": {
496
+ "requested": false,
497
+ "actual": false
498
+ },
499
+ "text_encoder": {
500
+ "requested": false,
501
+ "actual": false
502
+ },
503
+ "transformer": {
504
+ "requested": false,
505
+ "actual": false
506
+ }
507
+ }
508
+ },
509
+ "speed_rows": [
510
+ {
511
+ "prompt_id": "00-cyrillic-poster",
512
+ "width": 1024,
513
+ "height": 1024,
514
+ "seconds": 21.959903195500374,
515
+ "gpu_start_mib": 10094,
516
+ "gpu_end_mib": 27822,
517
+ "torch_peak_allocated_mib": 12002,
518
+ "torch_peak_reserved_mib": 48100,
519
+ "stages": {},
520
+ "stage_seconds_sum": 0,
521
+ "unattributed_seconds": null,
522
+ "empty_cache": true,
523
+ "instrument": false
524
+ },
525
+ {
526
+ "prompt_id": "01-long-text-bakery-ad",
527
+ "width": 896,
528
+ "height": 1200,
529
+ "seconds": 19.555034309625626,
530
+ "gpu_start_mib": 10096,
531
+ "gpu_end_mib": 19358,
532
+ "torch_peak_allocated_mib": 12064,
533
+ "torch_peak_reserved_mib": 48098,
534
+ "stages": {},
535
+ "stage_seconds_sum": 0,
536
+ "unattributed_seconds": null,
537
+ "empty_cache": true,
538
+ "instrument": false
539
+ },
540
+ {
541
+ "prompt_id": "02-technical-diagram",
542
+ "width": 1200,
543
+ "height": 896,
544
+ "seconds": 17.357625499367714,
545
+ "gpu_start_mib": 10094,
546
+ "gpu_end_mib": 23640,
547
+ "torch_peak_allocated_mib": 12063,
548
+ "torch_peak_reserved_mib": 48106,
549
+ "stages": {},
550
+ "stage_seconds_sum": 0,
551
+ "unattributed_seconds": null,
552
+ "empty_cache": true,
553
+ "instrument": false
554
+ },
555
+ {
556
+ "prompt_id": "03-four-panel-comic",
557
+ "width": 1024,
558
+ "height": 1024,
559
+ "seconds": 17.224217273294926,
560
+ "gpu_start_mib": 10094,
561
+ "gpu_end_mib": 25464,
562
+ "torch_peak_allocated_mib": 12002,
563
+ "torch_peak_reserved_mib": 48108,
564
+ "stages": {},
565
+ "stage_seconds_sum": 0,
566
+ "unattributed_seconds": null,
567
+ "empty_cache": true,
568
+ "instrument": false
569
+ }
570
+ ],
+       "stage_rows": [
+         {
+           "prompt_id": "00-cyrillic-poster",
+           "width": 1024,
+           "height": 1024,
+           "seconds": 16.347896233201027,
+           "gpu_start_mib": 10094,
+           "gpu_end_mib": 27822,
+           "torch_peak_allocated_mib": 12002,
+           "torch_peak_reserved_mib": 48100,
+           "stages": {
+             "text_encoder.forward": {
+               "seconds": 0.054269738495349884,
+               "calls": 1
+             },
+             "transformer.forward": {
+               "seconds": 16.09693694859743,
+               "calls": 8
+             },
+             "vae.decode": {
+               "seconds": 0.16162345558404922,
+               "calls": 1
+             }
+           },
+           "stage_seconds_sum": 16.31283014267683,
+           "unattributed_seconds": 0.035066090524196625,
+           "empty_cache": true,
+           "instrument": true
+         },
+         {
+           "prompt_id": "02-technical-diagram",
+           "width": 1200,
+           "height": 896,
+           "seconds": 17.620373338460922,
+           "gpu_start_mib": 10096,
+           "gpu_end_mib": 23640,
+           "torch_peak_allocated_mib": 12063,
+           "torch_peak_reserved_mib": 48106,
+           "stages": {
+             "text_encoder.forward": {
+               "seconds": 0.05568608641624451,
+               "calls": 1
+             },
+             "transformer.forward": {
+               "seconds": 17.351328901946545,
+               "calls": 8
+             },
+             "vae.decode": {
+               "seconds": 0.16200313717126846,
+               "calls": 1
+             }
+           },
+           "stage_seconds_sum": 17.569018125534058,
+           "unattributed_seconds": 0.051355212926864624,
+           "empty_cache": true,
+           "instrument": true
+         }
+       ],
+       "summary": {
+         "cold_seconds": 21.959903195500374,
+         "hot_mean_seconds": 18.04562569409609,
+         "hot_median_seconds": 17.357625499367714,
+         "hot_min_seconds": 17.224217273294926,
+         "hot_max_seconds": 19.555034309625626,
+         "hot_max_reserved_mib": 48108,
+         "hot_max_allocated_mib": 12064
+       }
+     },
+     {
+       "name": "qmm_text_encoder_off_transformer_off_empty_false",
+       "enabled_qmm_components": [],
+       "empty_cache": false,
+       "load": {
+         "seconds": 52.21603240072727,
+         "gpu_start_mib": 542,
+         "gpu_end_mib": 10094,
+         "gpu_peak_mib": 10094,
+         "torch_peak_allocated_mib": 9563,
+         "torch_peak_reserved_mib": 9572,
+         "qmm_states": {
+           "pe": {
+             "requested": false,
+             "actual": false
+           },
+           "text_encoder": {
+             "requested": false,
+             "actual": false
+           },
+           "transformer": {
+             "requested": false,
+             "actual": false
+           }
+         }
+       },
+       "speed_rows": [
+         {
+           "prompt_id": "00-cyrillic-poster",
+           "width": 1024,
+           "height": 1024,
+           "seconds": 16.301372595131397,
+           "gpu_start_mib": 10094,
+           "gpu_end_mib": 27822,
+           "torch_peak_allocated_mib": 12002,
+           "torch_peak_reserved_mib": 48100,
+           "stages": {},
+           "stage_seconds_sum": 0,
+           "unattributed_seconds": null,
+           "empty_cache": false,
+           "instrument": false
+         },
+         {
+           "prompt_id": "01-long-text-bakery-ad",
+           "width": 896,
+           "height": 1200,
+           "seconds": 20.227899312973022,
+           "gpu_start_mib": 27822,
+           "gpu_end_mib": 19358,
+           "torch_peak_allocated_mib": 12064,
+           "torch_peak_reserved_mib": 48116,
+           "stages": {},
+           "stage_seconds_sum": 0,
+           "unattributed_seconds": null,
+           "empty_cache": false,
+           "instrument": false
+         },
+         {
+           "prompt_id": "02-technical-diagram",
+           "width": 1200,
+           "height": 896,
+           "seconds": 17.426542527973652,
+           "gpu_start_mib": 19358,
+           "gpu_end_mib": 32300,
+           "torch_peak_allocated_mib": 12063,
+           "torch_peak_reserved_mib": 48116,
+           "stages": {},
+           "stage_seconds_sum": 0,
+           "unattributed_seconds": null,
+           "empty_cache": false,
+           "instrument": false
+         },
+         {
+           "prompt_id": "03-four-panel-comic",
+           "width": 1024,
+           "height": 1024,
+           "seconds": 13.776540502905846,
+           "gpu_start_mib": 32300,
+           "gpu_end_mib": 32362,
+           "torch_peak_allocated_mib": 12002,
+           "torch_peak_reserved_mib": 48108,
+           "stages": {},
+           "stage_seconds_sum": 0,
+           "unattributed_seconds": null,
+           "empty_cache": false,
+           "instrument": false
+         }
+       ],
+       "stage_rows": [],
+       "summary": {
+         "cold_seconds": 16.301372595131397,
+         "hot_mean_seconds": 17.143660781284172,
+         "hot_median_seconds": 17.426542527973652,
+         "hot_min_seconds": 13.776540502905846,
+         "hot_max_seconds": 20.227899312973022,
+         "hot_max_reserved_mib": 48116,
+         "hot_max_allocated_mib": 12064
+       }
+     }
+   ],
+   "sdnq": "0.1.9"
+ }
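The `summary` blocks above look like they are derived from `speed_rows` by treating the first run as the cold run and the remaining runs as hot, and each instrumented row's `unattributed_seconds` looks like `seconds` minus the sum of its stage timings. That derivation is inferred from the numbers, not stated in the commit. A minimal Python sketch that recomputes both for the values shown, using the hypothetical helper names `summarize` and `unattributed`:

```python
import statistics


def summarize(speed_seconds):
    """Split per-run timings into cold (first run) and hot (the rest).

    Assumption inferred from the data: the first entry of speed_rows
    is the cold run and the remaining entries are the hot runs.
    """
    cold, hot = speed_seconds[0], speed_seconds[1:]
    return {
        "cold_seconds": cold,
        "hot_mean_seconds": statistics.fmean(hot),
        "hot_median_seconds": statistics.median(hot),
        "hot_min_seconds": min(hot),
        "hot_max_seconds": max(hot),
    }


def unattributed(total_seconds, stages):
    """Wall time not covered by any instrumented stage."""
    return total_seconds - sum(s["seconds"] for s in stages.values())


# "seconds" values from the qmm_text_encoder_off_transformer_off_empty_false profile.
seconds = [
    16.301372595131397,
    20.227899312973022,
    17.426542527973652,
    13.776540502905846,
]
summary = summarize(seconds)

# Stage timings from the instrumented 00-cyrillic-poster row.
stages = {
    "text_encoder.forward": {"seconds": 0.054269738495349884, "calls": 1},
    "transformer.forward": {"seconds": 16.09693694859743, "calls": 8},
    "vae.decode": {"seconds": 0.16162345558404922, "calls": 1},
}
gap = unattributed(16.347896233201027, stages)
```

Comparing the recomputed values against the stored `summary` is a cheap consistency check when regenerating these metrics files.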
runtime_config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "runtime": {
+     "recommended_torch_cuda_alloc_conf": null,
+     "avoid_torch_cuda_alloc_conf": [
+       "expandable_segments:True,max_split_size_mb:32"
+     ],
+     "keep_model_resident": true,
+     "avoid_empty_cache_between_generations": true,
+     "use_pe_for_image_benchmarks": false
+   },
+   "sdnq": {
+     "requires_explicit_apply_quantized_matmul": true,
+     "apply_quantized_matmul_components": [
+       "pe",
+       "text_encoder",
+       "transformer"
+     ],
+     "apply_quantized_matmul_function": "sdnq.loader.apply_sdnq_options_to_model(component, use_quantized_matmul=True)"
+   },
+   "validated": {
+     "device": "NVIDIA RTX 6000 Ada Generation",
+     "torch": "2.8.0+cu128",
+     "sdnq": "0.1.9",
+     "num_inference_steps": 8,
+     "guidance_scale": 1.0,
+     "use_pe": false
+   },
+   "metrics": {
+     "explicit_quantized_matmul_default_allocator": "metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json",
+     "allocator_debug": "metrics/runtime_allocator_debug_metrics.json"
+   }
+ }
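One way a launcher script might consume `runtime_config.json` is to compare the current `PYTORCH_CUDA_ALLOC_CONF` value against the `avoid_torch_cuda_alloc_conf` list before loading the model. The sketch below is illustrative only: `allocator_warnings` and `qmm_components` are hypothetical helpers, the order-insensitive comparison of comma-separated options is an assumption, and the config is embedded as a trimmed string so the example runs without the file. It does not call any SDNQ API; it only reports which components the config says need the explicit quantized-matmul step.

```python
import json
import os

# Trimmed copy of runtime_config.json; in practice, read the file instead.
RUNTIME_CONFIG = json.loads("""
{
  "runtime": {
    "recommended_torch_cuda_alloc_conf": null,
    "avoid_torch_cuda_alloc_conf": ["expandable_segments:True,max_split_size_mb:32"]
  },
  "sdnq": {
    "requires_explicit_apply_quantized_matmul": true,
    "apply_quantized_matmul_components": ["pe", "text_encoder", "transformer"]
  }
}
""")


def _options(value):
    """Normalize 'a:1, b:2' into a set so option order does not matter (assumption)."""
    return {part.strip() for part in value.split(",") if part.strip()}


def allocator_warnings(config, env):
    """Return warnings if the allocator setting matches an entry to avoid."""
    current = env.get("PYTORCH_CUDA_ALLOC_CONF")
    if not current:
        return []  # Default allocator: nothing to flag.
    avoid = config["runtime"]["avoid_torch_cuda_alloc_conf"]
    return [
        f"PYTORCH_CUDA_ALLOC_CONF={current!r} matches avoided setting {bad!r}"
        for bad in avoid
        if _options(bad) == _options(current)
    ]


def qmm_components(config):
    """Components that the config says need quantized matmul applied explicitly."""
    sdnq_cfg = config["sdnq"]
    if not sdnq_cfg["requires_explicit_apply_quantized_matmul"]:
        return []
    return list(sdnq_cfg["apply_quantized_matmul_components"])


if __name__ == "__main__":
    for warning in allocator_warnings(RUNTIME_CONFIG, os.environ):
        print("warning:", warning)
    print("apply quantized matmul to:", qmm_components(RUNTIME_CONFIG))
```

Running the check before importing `torch` matters in practice, because the allocator reads `PYTORCH_CUDA_ALLOC_CONF` when CUDA memory management is initialized.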