Document corrected ERNIE qmm runtime profile

Adds runtime_config.json and corrected explicit quantized-matmul/default allocator metrics. Updates the model card to distinguish serialized config-path measurements from explicit qmm runtime, and documents the PYTORCH_CUDA_ALLOC_CONF allocator pitfall.

Files changed (5)

README.md +30 -10
metrics/ernie_qmm_peoff_vs_flux2_klein_qmm_summary.json +1 -1
metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json +158 -0
metrics/runtime_allocator_debug_metrics.json +740 -0
runtime_config.json +32 -0

README.md CHANGED

@@ -18,8 +18,8 @@ tags:
 # ERNIE-Image-Turbo SDNQ UINT4 Static
 This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
-The published SDNQ configs now set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
-Weights are unchanged from the UINT4 static quantization; this update makes the quantized-matmul runtime path explicit in the artifact metadata.
 ## Recipe
@@ -29,6 +29,8 @@ Weights are unchanged from the UINT4 static quantization; this update makes the
 - Runtime validation: `use_quantized_matmul=true`
 - Validation GPU: NVIDIA RTX 6000 Ada Generation
 - Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
 `use_pe=False` is used for the headline validation table to compare the image models directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer was much smaller.
@@ -37,9 +39,21 @@ Weights are unchanged from the UINT4 static quantization; this update makes the
 | Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
 |---|---:|---:|---:|---:|---:|---:|---:|---:|
 | Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
-| SDNQ UINT4 static + quantized matmul | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
-With PE disabled, this quant is about `1.45x` the original BF16 hot latency on the measured RTX 6000 Ada run, while reducing hot peak VRAM from `34932` MiB to `15390` MiB by `nvidia-smi` sampling.
 ## Visual Comparison
@@ -53,12 +67,18 @@ Individual prompt pairs are stored in `comparison/`, and full metrics are stored
 import torch
 import sdnq  # registers SDNQ support
 from diffusers import ErnieImagePipeline
 pipe = ErnieImagePipeline.from_pretrained(
     "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
     torch_dtype=torch.bfloat16,
 ).to("cuda")
 image = pipe(
     prompt="A clean modern poster with readable Cyrillic typography",
     width=1024,
@@ -69,15 +89,14 @@ image = pipe(
 ).images[0]
 ```
-If your local SDNQ/Diffusers build ignores the serialized `use_quantized_matmul` flag, enable it explicitly after loading:
-```python
-from sdnq.loader import apply_sdnq_options_to_model
 for name in ("pe", "text_encoder", "transformer"):
-    component = getattr(pipe, name, None)
-    if component is not None:
-        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))
 ```
 ## Prompt Set
@@ -99,4 +118,5 @@ for name in ("pe", "text_encoder", "transformer"):
 - The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
 - `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
 - This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.

 # ERNIE-Image-Turbo SDNQ UINT4 Static
 This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
+The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
+For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata, but may not be applied automatically by `from_pretrained()`.
 ## Recipe
 - Runtime validation: `use_quantized_matmul=true`
 - Validation GPU: NVIDIA RTX 6000 Ada Generation
 - Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
+- Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
+- Machine-readable runtime recommendations are stored in `runtime_config.json`.
 `use_pe=False` is used for the headline validation table to compare the image models directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer was much smaller.
 | Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
 |---|---:|---:|---:|---:|---:|---:|---:|---:|
 | Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
+| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
+The row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading.
+### Explicit Quantized-Matmul Runtime
+With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:
+| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
+The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` is about `0.55-0.65s` after warmup, and `vae.decode` is usually about `0.15s`.
+The allocator pitfall is large: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured `25.88s` hot median with `empty_cache=True`, or `15.86s` without `empty_cache`.
 ## Visual Comparison
 import torch
 import sdnq  # registers SDNQ support
 from diffusers import ErnieImagePipeline
+from sdnq.loader import apply_sdnq_options_to_model
 pipe = ErnieImagePipeline.from_pretrained(
     "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
     torch_dtype=torch.bfloat16,
 ).to("cuda")
+for name in ("pe", "text_encoder", "transformer"):
+    component = getattr(pipe, name, None)
+    if component is not None:
+        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))
 image = pipe(
     prompt="A clean modern poster with readable Cyrillic typography",
     width=1024,
 ).images[0]
 ```
+If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.
+You can confirm the runtime state after loading:
+```python
 for name in ("pe", "text_encoder", "transformer"):
+    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
+    print(name, getattr(qcfg, "use_quantized_matmul", None))
 ```
 ## Prompt Set
 - The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
 - `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
+- Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
 - This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.
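The allocator pitfall described in the README change above can be checked for at process start. The following is a minimal sketch and is not part of this commit: `parse_alloc_conf` and `check_allocator_conf` are hypothetical helpers, and the sketch assumes `PYTORCH_CUDA_ALLOC_CONF` uses the usual comma-separated `key:value` syntax. It only flags the exact option values named in the README, and it is best run before CUDA is first used, since the allocator configuration is generally read when the allocator is initialized.

```python
import os

# Option values the README reports as harmful for this pipeline.
RISKY_OPTIONS = {"expandable_segments": "true", "max_split_size_mb": "32"}


def parse_alloc_conf(value):
    """Parse a comma-separated key:value string into a dict of lowercase strings."""
    options = {}
    for item in (value or "").split(","):
        key, sep, val = item.partition(":")
        if sep:  # skip empty or malformed items that have no colon
            options[key.strip().lower()] = val.strip().lower()
    return options


def check_allocator_conf(environ=None):
    """Return the flagged options present in PYTORCH_CUDA_ALLOC_CONF (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    options = parse_alloc_conf(environ.get("PYTORCH_CUDA_ALLOC_CONF"))
    return {k: v for k, v in options.items() if RISKY_OPTIONS.get(k) == v}


if __name__ == "__main__":
    found = check_allocator_conf()
    if found:
        print("Warning: allocator options reported as slow for this pipeline:", found)
    else:
        print("PYTORCH_CUDA_ALLOC_CONF does not contain the options flagged in the README.")
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in isolation, for example `check_allocator_conf({"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"})` returns an empty dict because that value is not one of the flagged settings.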

metrics/ernie_qmm_peoff_vs_flux2_klein_qmm_summary.json CHANGED

@@ -39,5 +39,5 @@
     }
   ],
   "hot_speed_ratio_ernie_over_flux": 5.11579273478586,
-  "note": "Hot mean excludes the first generation after loading; cold is prompt 00."
 }

     }
   ],
   "hot_speed_ratio_ernie_over_flux": 5.11579273478586,
+  "note": "Hot mean excludes the first generation after loading; cold is prompt 00. The ERNIE timing in this comparison was later found to be allocator-affected: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32 caused about 48 GiB torch reservation and slow transformer.forward. For corrected ERNIE explicit-qmm/default-allocator timings, see metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json."
 }
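
To put the allocator effect mentioned in the note into numbers, the hot medians recorded in this commit can be compared directly. This is a small sketch with illustrative variable names; the values are copied from the metrics files in this commit. The comparison is indicative only, because the allocator-debug cases cover three hot prompts while the default-allocator run covers nine.

```python
# Hot median seconds per image, copied from the metrics files in this commit.
BASELINE_DEFAULT_ALLOCATOR = 5.80891427397728  # explicit qmm, default allocator
ALLOC_CONF_RUNS = {
    "empty_cache=True": 25.88253455609083,    # case qmm_all_empty_true
    "empty_cache=False": 15.864265829324722,  # case qmm_all_empty_false
}


def slowdown(measured_s, baseline_s):
    """Return how many times slower a measured latency is than the baseline."""
    return measured_s / baseline_s


ratios = {
    label: slowdown(seconds, BASELINE_DEFAULT_ALLOCATOR)
    for label, seconds in ALLOC_CONF_RUNS.items()
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x the default-allocator hot median")
```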

metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json ADDED

@@ -0,0 +1,158 @@

+{
+  "label": "ernie_uint4_qmm_explicit_default_allocator_8step_repeat2",
+  "device": "NVIDIA RTX 6000 Ada Generation",
+  "torch": "2.8.0+cu128",
+  "sdnq": "0.1.9",
+  "env": {
+    "PYTORCH_CUDA_ALLOC_CONF": null
+  },
+  "settings": {
+    "num_inference_steps": 8,
+    "guidance_scale": 1.0,
+    "use_pe": false,
+    "explicit_apply_qmm": true,
+    "qmm_states": {
+      "pe": true,
+      "text_encoder": true,
+      "transformer": true
+    }
+  },
+  "load": {
+    "seconds": 63.15932087600231,
+    "gpu_start_mib": 436,
+    "gpu_end_mib": 10172,
+    "torch_peak_allocated_mib": 9724,
+    "torch_peak_reserved_mib": 9738
+  },
+  "generations": [
+    {
+      "prompt_id": "00-cyrillic-poster",
+      "title": "Cyrillic event poster",
+      "seed": 41001,
+      "width": 1024,
+      "height": 1024,
+      "seconds": 8.344464145600796,
+      "gpu_start_mib": 10172,
+      "gpu_end_mib": 11070,
+      "torch_peak_allocated_mib": 19150,
+      "torch_peak_reserved_mib": 19296
+    },
+    {
+      "prompt_id": "01-long-text-bakery-ad",
+      "title": "Long text product ad",
+      "seed": 41002,
+      "width": 896,
+      "height": 1200,
+      "seconds": 6.855839736759663,
+      "gpu_start_mib": 11070,
+      "gpu_end_mib": 11086,
+      "torch_peak_allocated_mib": 19391,
+      "torch_peak_reserved_mib": 19540
+    },
+    {
+      "prompt_id": "02-technical-diagram",
+      "title": "Technical diagram",
+      "seed": 41003,
+      "width": 1200,
+      "height": 896,
+      "seconds": 6.9446728229522705,
+      "gpu_start_mib": 11086,
+      "gpu_end_mib": 11086,
+      "torch_peak_allocated_mib": 19391,
+      "torch_peak_reserved_mib": 19540
+    },
+    {
+      "prompt_id": "03-four-panel-comic",
+      "title": "Four-panel comic",
+      "seed": 41004,
+      "width": 1024,
+      "height": 1024,
+      "seconds": 5.54518087208271,
+      "gpu_start_mib": 11086,
+      "gpu_end_mib": 13474,
+      "torch_peak_allocated_mib": 12172,
+      "torch_peak_reserved_mib": 12954
+    },
+    {
+      "prompt_id": "04-public-domain-painter-fusion",
+      "title": "Painterly style fusion",
+      "seed": 41005,
+      "width": 1024,
+      "height": 1024,
+      "seconds": 5.585576198995113,
+      "gpu_start_mib": 13474,
+      "gpu_end_mib": 13474,
+      "torch_peak_allocated_mib": 12171,
+      "torch_peak_reserved_mib": 12954
+    },
+    {
+      "prompt_id": "05-dashboard-ui",
+      "title": "Dense UI dashboard",
+      "seed": 41006,
+      "width": 1376,
+      "height": 768,
+      "seconds": 6.736225217580795,
+      "gpu_start_mib": 13474,
+      "gpu_end_mib": 11140,
+      "torch_peak_allocated_mib": 19226,
+      "torch_peak_reserved_mib": 19308
+    },
+    {
+      "prompt_id": "06-glass-still-life",
+      "title": "Glass and reflections",
+      "seed": 41007,
+      "width": 1024,
+      "height": 1024,
+      "seconds": 5.988675691187382,
+      "gpu_start_mib": 11140,
+      "gpu_end_mib": 13506,
+      "torch_peak_allocated_mib": 12171,
+      "torch_peak_reserved_mib": 12986
+    },
+    {
+      "prompt_id": "07-botanical-field-guide",
+      "title": "Field guide plate",
+      "seed": 41008,
+      "width": 896,
+      "height": 1200,
+      "seconds": 5.80891427397728,
+      "gpu_start_mib": 13506,
+      "gpu_end_mib": 15086,
+      "torch_peak_allocated_mib": 12236,
+      "torch_peak_reserved_mib": 14564
+    },
+    {
+      "prompt_id": "08-restaurant-menu-board",
+      "title": "Menu board text",
+      "seed": 41009,
+      "width": 1024,
+      "height": 1024,
+      "seconds": 5.700445763766766,
+      "gpu_start_mib": 15086,
+      "gpu_end_mib": 15098,
+      "torch_peak_allocated_mib": 12172,
+      "torch_peak_reserved_mib": 14576
+    },
+    {
+      "prompt_id": "09-isometric-city-map",
+      "title": "Isometric map",
+      "seed": 41010,
+      "width": 1200,
+      "height": 896,
+      "seconds": 5.566534325480461,
+      "gpu_start_mib": 15098,
+      "gpu_end_mib": 15888,
+      "torch_peak_allocated_mib": 12236,
+      "torch_peak_reserved_mib": 15366
+    }
+  ],
+  "summary": {
+    "cold_seconds": 8.344464145600796,
+    "hot_mean_seconds": 6.081340544753605,
+    "hot_median_seconds": 5.80891427397728,
+    "hot_min_seconds": 5.54518087208271,
+    "hot_max_seconds": 6.9446728229522705,
+    "hot_max_reserved_mib": 19540,
+    "hot_max_allocated_mib": 19391
+  }
+}
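
The `summary` block in the file above can be re-derived from its `generations` list. The sketch below is one way to do that and is not part of the commit; `summarize` is a hypothetical helper. It assumes the convention stated in the comparison note, namely that the first generation after loading is the cold run and every later generation counts as hot.

```python
import json
import math
import statistics
from pathlib import Path


def summarize(metrics):
    """Recompute the cold/hot summary from a metrics dict with a 'generations' list."""
    generations = metrics["generations"]
    cold, hot = generations[0], generations[1:]  # assumption: first run is the cold run
    hot_seconds = [g["seconds"] for g in hot]
    return {
        "cold_seconds": cold["seconds"],
        "hot_mean_seconds": statistics.fmean(hot_seconds),
        "hot_median_seconds": statistics.median(hot_seconds),
        "hot_min_seconds": min(hot_seconds),
        "hot_max_seconds": max(hot_seconds),
        "hot_max_reserved_mib": max(g["torch_peak_reserved_mib"] for g in hot),
        "hot_max_allocated_mib": max(g["torch_peak_allocated_mib"] for g in hot),
    }


if __name__ == "__main__":
    path = Path("metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json")
    if path.is_file():
        metrics = json.loads(path.read_text())
        recomputed = summarize(metrics)
        # Compare each recomputed field with the stored summary, tolerating float rounding.
        for key, stored in metrics["summary"].items():
            matches = math.isclose(recomputed[key], stored, rel_tol=1e-9)
            print(f"{key}: stored={stored} recomputed={recomputed[key]} match={matches}")
```

Run from the repository root, this prints one line per summary field, which makes it easy to spot a summary that has drifted from the per-generation rows.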

metrics/runtime_allocator_debug_metrics.json ADDED

@@ -0,0 +1,740 @@

+{
+  "device": "NVIDIA RTX 6000 Ada Generation",
+  "torch": "2.8.0+cu128",
+  "cases": [
+    {
+      "name": "qmm_all_empty_true",
+      "enabled_qmm_components": [
+        "pe",
+        "text_encoder",
+        "transformer"
+      ],
+      "empty_cache": true,
+      "load": {
+        "seconds": 59.13287413865328,
+        "gpu_start_mib": 434,
+        "gpu_end_mib": 10006,
+        "gpu_peak_mib": 10006,
+        "torch_peak_allocated_mib": 9555,
+        "torch_peak_reserved_mib": 9572,
+        "qmm_states": {
+          "pe": {
+            "requested": true,
+            "actual": true
+          },
+          "text_encoder": {
+            "requested": true,
+            "actual": true
+          },
+          "transformer": {
+            "requested": true,
+            "actual": true
+          }
+        }
+      },
+      "speed_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 48.675362937152386,
+          "gpu_start_mib": 10006,
+          "gpu_end_mib": 10658,
+          "torch_peak_allocated_mib": 18980,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "01-long-text-bakery-ad",
+          "width": 896,
+          "height": 1200,
+          "seconds": 25.80502188205719,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 10676,
+          "torch_peak_allocated_mib": 19219,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 25.88253455609083,
+          "gpu_start_mib": 10092,
+          "gpu_end_mib": 10696,
+          "torch_peak_allocated_mib": 19219,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "03-four-panel-comic",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 26.319617360830307,
+          "gpu_start_mib": 10092,
+          "gpu_end_mib": 32080,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48108,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        }
+      ],
+      "stage_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 16.900417506694794,
+          "gpu_start_mib": 10092,
+          "gpu_end_mib": 29120,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.080795519053936,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 16.636043712496758,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.14374303817749023,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 16.860582269728184,
+          "unattributed_seconds": 0.039835236966609955,
+          "empty_cache": true,
+          "instrument": true
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 17.653532460331917,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 20096,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.06983215361833572,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 17.393484614789486,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.14451629668474197,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 17.607833065092564,
+          "unattributed_seconds": 0.04569939523935318,
+          "empty_cache": true,
+          "instrument": true
+        }
+      ],
+      "summary": {
+        "cold_seconds": 48.675362937152386,
+        "hot_mean_seconds": 26.00239126632611,
+        "hot_median_seconds": 25.88253455609083,
+        "hot_min_seconds": 25.80502188205719,
+        "hot_max_seconds": 26.319617360830307,
+        "hot_max_reserved_mib": 48118,
+        "hot_max_allocated_mib": 19219
+      }
+    },
+    {
+      "name": "qmm_all_empty_false",
+      "enabled_qmm_components": [
+        "pe",
+        "text_encoder",
+        "transformer"
+      ],
+      "empty_cache": false,
+      "load": {
+        "seconds": 54.08376982063055,
+        "gpu_start_mib": 540,
+        "gpu_end_mib": 10092,
+        "gpu_peak_mib": 10092,
+        "torch_peak_allocated_mib": 9563,
+        "torch_peak_reserved_mib": 9572,
+        "qmm_states": {
+          "pe": {
+            "requested": true,
+            "actual": true
+          },
+          "text_encoder": {
+            "requested": true,
+            "actual": true
+          },
+          "transformer": {
+            "requested": true,
+            "actual": true
+          }
+        }
+      },
+      "speed_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 17.728631243109703,
+          "gpu_start_mib": 10092,
+          "gpu_end_mib": 20200,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "01-long-text-bakery-ad",
+          "width": 896,
+          "height": 1200,
+          "seconds": 18.051423519849777,
+          "gpu_start_mib": 20200,
+          "gpu_end_mib": 13716,
+          "torch_peak_allocated_mib": 12064,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 15.653381533920765,
+          "gpu_start_mib": 13716,
+          "gpu_end_mib": 36456,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "03-four-panel-comic",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 15.864265829324722,
+          "gpu_start_mib": 36456,
+          "gpu_end_mib": 39720,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        }
+      ],
+      "stage_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 14.801267191767693,
+          "gpu_start_mib": 39720,
+          "gpu_end_mib": 37980,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48118,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.07214060425758362,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 14.541316010057926,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.14533011615276337,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 14.758786730468273,
+          "unattributed_seconds": 0.0424804612994194,
+          "empty_cache": false,
+          "instrument": true
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 16.745930925011635,
+          "gpu_start_mib": 37980,
+          "gpu_end_mib": 45636,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.07216303050518036,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 16.474046893417835,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.1505463719367981,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 16.696756295859814,
+          "unattributed_seconds": 0.049174629151821136,
+          "empty_cache": false,
+          "instrument": true
+        }
+      ],
+      "summary": {
+        "cold_seconds": 17.728631243109703,
+        "hot_mean_seconds": 16.52302362769842,
+        "hot_median_seconds": 15.864265829324722,
+        "hot_min_seconds": 15.653381533920765,
+        "hot_max_seconds": 18.051423519849777,
+        "hot_max_reserved_mib": 48118,
+        "hot_max_allocated_mib": 12064
+      }
+    },
+    {
+      "name": "qmm_transformer_only_empty_true",
+      "enabled_qmm_components": [
+        "transformer"
+      ],
+      "empty_cache": true,
+      "load": {
+        "seconds": 54.33865138143301,
+        "gpu_start_mib": 540,
+        "gpu_end_mib": 10092,
+        "gpu_peak_mib": 10092,
+        "torch_peak_allocated_mib": 9563,
+        "torch_peak_reserved_mib": 9572,
+        "qmm_states": {
+          "pe": {
+            "requested": false,
+            "actual": false
+          },
+          "text_encoder": {
+            "requested": false,
+            "actual": false
+          },
+          "transformer": {
+            "requested": true,
+            "actual": true
+          }
+        }
+      },
+      "speed_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 22.075800754129887,
+          "gpu_start_mib": 10092,
+          "gpu_end_mib": 21082,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48098,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "01-long-text-bakery-ad",
+          "width": 896,
+          "height": 1200,
+          "seconds": 20.010109677910805,
+          "gpu_start_mib": 10096,
+          "gpu_end_mib": 34960,
+          "torch_peak_allocated_mib": 12064,
+          "torch_peak_reserved_mib": 48110,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 16.71645325422287,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 36458,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "03-four-panel-comic",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 17.778217256069183,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 20264,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48108,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        }
+      ],
+      "stage_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 16.47866802662611,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 21082,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48098,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.05702096223831177,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 16.22751172631979,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.146262064576149,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 16.43079475313425,
+          "unattributed_seconds": 0.047873273491859436,
+          "empty_cache": true,
+          "instrument": true
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 18.242334879934788,
+          "gpu_start_mib": 10096,
+          "gpu_end_mib": 36458,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.05669167637825012,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 17.962014980614185,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.147530660033226,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 18.16623731702566,
+          "unattributed_seconds": 0.07609756290912628,
+          "empty_cache": true,
+          "instrument": true
+        }
+      ],
+      "summary": {
+        "cold_seconds": 22.075800754129887,
+        "hot_mean_seconds": 18.168260062734287,
+        "hot_median_seconds": 17.778217256069183,
+        "hot_min_seconds": 16.71645325422287,
+        "hot_max_seconds": 20.010109677910805,
+        "hot_max_reserved_mib": 48116,
+        "hot_max_allocated_mib": 12064
+      }
+    },
+    {
+      "name": "qmm_none_empty_true",
+      "enabled_qmm_components": [],
+      "empty_cache": true,
+      "load": {
+        "seconds": 54.613617569208145,
+        "gpu_start_mib": 542,
+        "gpu_end_mib": 10094,
+        "gpu_peak_mib": 10094,
+        "torch_peak_allocated_mib": 9563,
+        "torch_peak_reserved_mib": 9572,
+        "qmm_states": {
+          "pe": {
+            "requested": false,
+            "actual": false
+          },
+          "text_encoder": {
+            "requested": false,
+            "actual": false
+          },
+          "transformer": {
+            "requested": false,
+            "actual": false
+          }
+        }
+      },
+      "speed_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 21.959903195500374,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 27822,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48100,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "01-long-text-bakery-ad",
+          "width": 896,
+          "height": 1200,
+          "seconds": 19.555034309625626,
+          "gpu_start_mib": 10096,
+          "gpu_end_mib": 19358,
+          "torch_peak_allocated_mib": 12064,
+          "torch_peak_reserved_mib": 48098,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 17.357625499367714,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 23640,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48106,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        },
+        {
+          "prompt_id": "03-four-panel-comic",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 17.224217273294926,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 25464,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48108,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": true,
+          "instrument": false
+        }
+      ],
+      "stage_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 16.347896233201027,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 27822,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48100,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.054269738495349884,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 16.09693694859743,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.16162345558404922,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 16.31283014267683,
+          "unattributed_seconds": 0.035066090524196625,
+          "empty_cache": true,
+          "instrument": true
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 17.620373338460922,
+          "gpu_start_mib": 10096,
+          "gpu_end_mib": 23640,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48106,
+          "stages": {
+            "text_encoder.forward": {
+              "seconds": 0.05568608641624451,
+              "calls": 1
+            },
+            "transformer.forward": {
+              "seconds": 17.351328901946545,
+              "calls": 8
+            },
+            "vae.decode": {
+              "seconds": 0.16200313717126846,
+              "calls": 1
+            }
+          },
+          "stage_seconds_sum": 17.569018125534058,
+          "unattributed_seconds": 0.051355212926864624,
+          "empty_cache": true,
+          "instrument": true
+        }
+      ],
+      "summary": {
+        "cold_seconds": 21.959903195500374,
+        "hot_mean_seconds": 18.04562569409609,
+        "hot_median_seconds": 17.357625499367714,
+        "hot_min_seconds": 17.224217273294926,
+        "hot_max_seconds": 19.555034309625626,
+        "hot_max_reserved_mib": 48108,
+        "hot_max_allocated_mib": 12064
+      }
+    },
+    {
+      "name": "qmm_text_encoder_off_transformer_off_empty_false",
+      "enabled_qmm_components": [],
+      "empty_cache": false,
+      "load": {
+        "seconds": 52.21603240072727,
+        "gpu_start_mib": 542,
+        "gpu_end_mib": 10094,
+        "gpu_peak_mib": 10094,
+        "torch_peak_allocated_mib": 9563,
+        "torch_peak_reserved_mib": 9572,
+        "qmm_states": {
+          "pe": {
+            "requested": false,
+            "actual": false
+          },
+          "text_encoder": {
+            "requested": false,
+            "actual": false
+          },
+          "transformer": {
+            "requested": false,
+            "actual": false
+          }
+        }
+      },
+      "speed_rows": [
+        {
+          "prompt_id": "00-cyrillic-poster",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 16.301372595131397,
+          "gpu_start_mib": 10094,
+          "gpu_end_mib": 27822,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48100,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "01-long-text-bakery-ad",
+          "width": 896,
+          "height": 1200,
+          "seconds": 20.227899312973022,
+          "gpu_start_mib": 27822,
+          "gpu_end_mib": 19358,
+          "torch_peak_allocated_mib": 12064,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "02-technical-diagram",
+          "width": 1200,
+          "height": 896,
+          "seconds": 17.426542527973652,
+          "gpu_start_mib": 19358,
+          "gpu_end_mib": 32300,
+          "torch_peak_allocated_mib": 12063,
+          "torch_peak_reserved_mib": 48116,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        },
+        {
+          "prompt_id": "03-four-panel-comic",
+          "width": 1024,
+          "height": 1024,
+          "seconds": 13.776540502905846,
+          "gpu_start_mib": 32300,
+          "gpu_end_mib": 32362,
+          "torch_peak_allocated_mib": 12002,
+          "torch_peak_reserved_mib": 48108,
+          "stages": {},
+          "stage_seconds_sum": 0,
+          "unattributed_seconds": null,
+          "empty_cache": false,
+          "instrument": false
+        }
+      ],
+      "stage_rows": [],
+      "summary": {
+        "cold_seconds": 16.301372595131397,
+        "hot_mean_seconds": 17.143660781284172,
+        "hot_median_seconds": 17.426542527973652,
+        "hot_min_seconds": 13.776540502905846,
+        "hot_max_seconds": 20.227899312973022,
+        "hot_max_reserved_mib": 48116,
+        "hot_max_allocated_mib": 12064
+      }
+    }
+  ],
+  "sdnq": "0.1.9"
+}
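
The `summary` blocks in the metrics above appear to treat the first entry of `speed_rows` as the cold run and the remaining entries as hot runs. That reading is inferred from the numbers and is not stated in the file. A minimal sketch that recomputes the summary for the `qmm_text_encoder_off_transformer_off_empty_false` configuration under that assumption (the `summarize` helper is hypothetical and not part of the repository):

```python
import statistics


def summarize(speed_rows):
    """Recompute a summary block, assuming row 0 is the cold run."""
    cold, hot = speed_rows[0], speed_rows[1:]
    hot_seconds = [row["seconds"] for row in hot]
    return {
        "cold_seconds": cold["seconds"],
        "hot_mean_seconds": statistics.fmean(hot_seconds),
        "hot_median_seconds": statistics.median(hot_seconds),
        "hot_min_seconds": min(hot_seconds),
        "hot_max_seconds": max(hot_seconds),
        "hot_max_reserved_mib": max(r["torch_peak_reserved_mib"] for r in hot),
        "hot_max_allocated_mib": max(r["torch_peak_allocated_mib"] for r in hot),
    }


# Values copied from the speed_rows of the empty_cache=false configuration.
rows = [
    {"seconds": 16.301372595131397, "torch_peak_allocated_mib": 12002,
     "torch_peak_reserved_mib": 48100},
    {"seconds": 20.227899312973022, "torch_peak_allocated_mib": 12064,
     "torch_peak_reserved_mib": 48116},
    {"seconds": 17.426542527973652, "torch_peak_allocated_mib": 12063,
     "torch_peak_reserved_mib": 48116},
    {"seconds": 13.776540502905846, "torch_peak_allocated_mib": 12002,
     "torch_peak_reserved_mib": 48108},
]

summary = summarize(rows)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With only three hot samples per configuration, the hot mean is sensitive to a single slow generation, so the median is probably the more stable figure to compare across configurations.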

runtime_config.json ADDED

@@ -0,0 +1,32 @@

+{
+  "runtime": {
+    "recommended_torch_cuda_alloc_conf": null,
+    "avoid_torch_cuda_alloc_conf": [
+      "expandable_segments:True,max_split_size_mb:32"
+    ],
+    "keep_model_resident": true,
+    "avoid_empty_cache_between_generations": true,
+    "use_pe_for_image_benchmarks": false
+  },
+  "sdnq": {
+    "requires_explicit_apply_quantized_matmul": true,
+    "apply_quantized_matmul_components": [
+      "pe",
+      "text_encoder",
+      "transformer"
+    ],
+    "apply_quantized_matmul_function": "sdnq.loader.apply_sdnq_options_to_model(component, use_quantized_matmul=True)"
+  },
+  "validated": {
+    "device": "NVIDIA RTX 6000 Ada Generation",
+    "torch": "2.8.0+cu128",
+    "sdnq": "0.1.9",
+    "num_inference_steps": 8,
+    "guidance_scale": 1.0,
+    "use_pe": false
+  },
+  "metrics": {
+    "explicit_quantized_matmul_default_allocator": "metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json",
+    "allocator_debug": "metrics/runtime_allocator_debug_metrics.json"
+  }
+}
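
A loader could consume `runtime_config.json` roughly as follows. This is a sketch under stated assumptions, not the repository's loading code. `check_allocator_env` and `apply_quantized_matmul` are hypothetical helpers. The `sdnq.loader.apply_sdnq_options_to_model` call is taken from the `apply_quantized_matmul_function` field. Whether it returns the updated component or modifies it in place is an assumption, so the sketch handles both cases.

```python
import os

# Trimmed copy of the fields from runtime_config.json that the sketch uses.
RUNTIME_CONFIG = {
    "runtime": {
        "recommended_torch_cuda_alloc_conf": None,
        "avoid_torch_cuda_alloc_conf": [
            "expandable_segments:True,max_split_size_mb:32"
        ],
    },
    "sdnq": {
        "requires_explicit_apply_quantized_matmul": True,
        "apply_quantized_matmul_components": ["pe", "text_encoder", "transformer"],
    },
}


def _options(conf):
    """Split an allocator string into a set so option order does not matter."""
    return frozenset(part.strip() for part in conf.split(",") if part.strip())


def check_allocator_env(cfg, env=None):
    """Return warnings if PYTORCH_CUDA_ALLOC_CONF matches an avoided setting."""
    env = os.environ if env is None else env
    current = env.get("PYTORCH_CUDA_ALLOC_CONF")
    if not current:
        return []  # default allocator, which the config leaves as recommended
    avoided = [_options(a) for a in cfg["runtime"]["avoid_torch_cuda_alloc_conf"]]
    if _options(current) in avoided:
        return [f"PYTORCH_CUDA_ALLOC_CONF={current!r} is listed as a setting to avoid"]
    return []


def apply_quantized_matmul(pipe, cfg):
    """Enable quantized matmul on each listed component of a loaded pipeline."""
    # Imported lazily so the allocator check works without sdnq installed.
    from sdnq.loader import apply_sdnq_options_to_model

    if not cfg["sdnq"]["requires_explicit_apply_quantized_matmul"]:
        return pipe
    for name in cfg["sdnq"]["apply_quantized_matmul_components"]:
        component = getattr(pipe, name, None)
        if component is None:
            continue
        updated = apply_sdnq_options_to_model(component, use_quantized_matmul=True)
        # Assumption: reassign only if the call returns a component.
        if updated is not None:
            setattr(pipe, name, updated)
    return pipe


for warning in check_allocator_env(RUNTIME_CONFIG):
    print("warning:", warning)
```

The allocator check should run before the first CUDA allocation, because PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` when its caching allocator initializes.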