---
license: apache-2.0
base_model: baidu/ERNIE-Image-Turbo
pipeline_tag: text-to-image
library_name: diffusers
tags:
- text-to-image
- diffusers
- safetensors
- ernie-image
- sdnq
- quantized
- uint4
- static
- quantized-matmul
---
# ERNIE-Image-Turbo SDNQ UINT4 Static
This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata, but may not be applied automatically by `from_pretrained()`.
## Recipe
- Base model: `baidu/ERNIE-Image-Turbo`
- Quantizer: `sdnq` / SDNQ UINT4 static, `dequantize_fp32=false`
- Quantized components: `pe`, `text_encoder`, `transformer`
- Runtime validation: `use_quantized_matmul=true`
- Validation GPU: NVIDIA RTX 6000 Ada Generation
- Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
- Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; in validation it caused allocator over-reservation (about 48 GiB reserved) and raised the hot median from 5.81 s to 15.86-25.88 s per image.
- Machine-readable runtime recommendations are stored in `runtime_config.json`.
`use_pe=False` is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer took a much smaller share.
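A stage-level breakdown of this kind can be approximated by temporarily wrapping the methods of interest. The following is a minimal sketch, not the profiler used for this card: `time_stages` and its `sync` parameter are hypothetical names, and it assumes that each stage is reachable as `pipe.<component>.<method>`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def time_stages(pipe, stages, sync=None):
    """Accumulate wall-clock seconds per stage, e.g. "pe.forward".

    `sync` is an optional zero-argument callable (hypothetical parameter);
    on GPU it would typically be torch.cuda.synchronize.
    """
    totals = defaultdict(float)
    patched = []
    for stage in stages:
        component_name, method_name = stage.split(".")
        component = getattr(pipe, component_name, None)
        if component is None:
            continue  # component not present on this pipeline
        original = getattr(component, method_name)

        def timed(*args, _original=original, _stage=stage, **kwargs):
            if sync is not None:
                sync()
            start = time.perf_counter()
            try:
                return _original(*args, **kwargs)
            finally:
                if sync is not None:
                    sync()
                totals[_stage] += time.perf_counter() - start

        setattr(component, method_name, timed)
        patched.append((component, method_name, original))
    try:
        yield totals
    finally:
        # Restore the original methods even if generation raised.
        for component, method_name, original in patched:
            setattr(component, method_name, original)
```

Used as `with time_stages(pipe, ["pe.forward", "text_encoder.forward", "transformer.forward", "vae.decode"], sync=torch.cuda.synchronize) as totals:` around a single generation, `totals` then holds the accumulated seconds per stage. Without a synchronize call, asynchronous CUDA kernels can make individual stages look faster than they are.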
## Measured Results
| Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
The SDNQ row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading, so that row may not reflect quantized-matmul performance.
### Explicit Quantized-Matmul Runtime
With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:
| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
|---|---:|---:|---:|---:|---:|---:|---:|
| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
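For a quick sense of scale, the hot median latencies from the two tables above can be turned into ratios. This is plain arithmetic on the published numbers; the dictionary keys are illustrative labels, and since the rows come from separate runs the ratios are indicative rather than a controlled comparison.

```python
# Hot median seconds per image, copied from the two tables above.
HOT_MEDIAN_S = {
    "bf16": 7.67,
    "sdnq_serialized_config": 12.26,
    "sdnq_explicit_qmm": 5.81,
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return HOT_MEDIAN_S[baseline] / HOT_MEDIAN_S[candidate]


for baseline in ("bf16", "sdnq_serialized_config"):
    ratio = speedup(baseline, "sdnq_explicit_qmm")
    print(f"{baseline} -> sdnq_explicit_qmm: {ratio:.2f}x")
# prints:
# bf16 -> sdnq_explicit_qmm: 1.32x
# sdnq_serialized_config -> sdnq_explicit_qmm: 2.11x
```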
The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` is about `0.55-0.65s` after warmup, and `vae.decode` is usually about `0.15s`.
The allocator pitfall is large: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured a hot median of `25.88s` when `torch.cuda.empty_cache()` was called between generations, or `15.86s` when it was not, versus `5.81s` with the default allocator.
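A startup guard can catch this setting before it costs any latency. The sketch below uses only the standard library; `parse_alloc_conf` and `has_risky_alloc_conf` are hypothetical helper names, and the check flags only the exact combination measured above, since other allocator settings were not validated for this card.

```python
import os

# The one combination that validation found harmful for this pipeline.
RISKY_OPTIONS = {"expandable_segments": "true", "max_split_size_mb": "32"}


def parse_alloc_conf(value):
    """Parse a comma-separated "key:value" allocator string into a dict."""
    options = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition(":")
        if sep:
            options[key.strip()] = val.strip()
    return options


def has_risky_alloc_conf(env=None):
    """Return True if the environment contains the measured slow combination."""
    env = os.environ if env is None else env
    options = parse_alloc_conf(env.get("PYTORCH_CUDA_ALLOC_CONF", ""))
    return all(
        options.get(key, "").lower() == expected
        for key, expected in RISKY_OPTIONS.items()
    )


if has_risky_alloc_conf():
    print("Warning: PYTORCH_CUDA_ALLOC_CONF matches the slow configuration; unset it.")
```

The check is meant to run before PyTorch initializes CUDA, because changing the variable afterwards is not expected to affect an allocator that is already running.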
## Visual Comparison
[![Original BF16 vs SDNQ UINT4 static + quantized matmul, PE off](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)
Individual prompt pairs are stored in `comparison/`, and full metrics are stored in `metrics/`.
## Usage
```python
import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Enable quantized matmul explicitly; from_pretrained() may not apply the serialized flag.
for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]
```
If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.
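To check throughput on your own hardware, a small harness can report the same kind of statistics as the tables above. This is a sketch under assumptions, not the script used for this card: `benchmark` and its `sync` parameter are hypothetical, and it treats the first call as "cold" and all later calls as "hot".

```python
import statistics
import time


def benchmark(generate, runs=10, sync=None):
    """Time one cold call and `runs` hot calls of a zero-argument callable.

    `sync` is an optional zero-argument callable; on GPU it would
    typically be torch.cuda.synchronize.
    """

    def timed_call():
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    cold = timed_call()  # first call includes warmup costs
    hot = [timed_call() for _ in range(runs)]  # no empty_cache() in between
    return {
        "cold_s": cold,
        "hot_mean_s": statistics.mean(hot),
        "hot_median_s": statistics.median(hot),
        "hot_min_s": min(hot),
        "hot_max_s": max(hot),
    }
```

For example, `benchmark(lambda: pipe(prompt="...", num_inference_steps=8, guidance_scale=1.0, use_pe=False), sync=torch.cuda.synchronize)` keeps the model resident across calls. Results will differ from the tables unless the prompts, sizes, seeds, and GPU match.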
You can confirm the runtime state after loading:
```python
for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    # Should print True for each component once quantized matmul is enabled.
    print(name, getattr(qcfg, "use_quantized_matmul", None))
```
## Prompt Set
| # | Prompt ID | Size | Seed | Focus |
|---:|---|---:|---:|---|
| 00 | `00-cyrillic-poster` | 1024x1024 | 41001 | Cyrillic event poster |
| 01 | `01-long-text-bakery-ad` | 896x1200 | 41002 | Long text product ad |
| 02 | `02-technical-diagram` | 1200x896 | 41003 | Technical diagram |
| 03 | `03-four-panel-comic` | 1024x1024 | 41004 | Four-panel comic |
| 04 | `04-public-domain-painter-fusion` | 1024x1024 | 41005 | Painterly style fusion |
| 05 | `05-dashboard-ui` | 1376x768 | 41006 | Dense UI dashboard |
| 06 | `06-glass-still-life` | 1024x1024 | 41007 | Glass and reflections |
| 07 | `07-botanical-field-guide` | 896x1200 | 41008 | Field guide plate |
| 08 | `08-restaurant-menu-board` | 1024x1024 | 41009 | Menu board text |
| 09 | `09-isometric-city-map` | 1200x896 | 41010 | Isometric map |
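The table above can be expressed as data to drive a reproduction loop. The sketch below assumes that `Size` is written as width x height, which the card does not state explicitly; `call_kwargs` is a hypothetical helper, and the prompt texts are not included because they are not listed in this card.

```python
# (prompt_id, size, seed) rows copied from the table above.
PROMPT_SET = [
    ("00-cyrillic-poster", "1024x1024", 41001),
    ("01-long-text-bakery-ad", "896x1200", 41002),
    ("02-technical-diagram", "1200x896", 41003),
    ("03-four-panel-comic", "1024x1024", 41004),
    ("04-public-domain-painter-fusion", "1024x1024", 41005),
    ("05-dashboard-ui", "1376x768", 41006),
    ("06-glass-still-life", "1024x1024", 41007),
    ("07-botanical-field-guide", "896x1200", 41008),
    ("08-restaurant-menu-board", "1024x1024", 41009),
    ("09-isometric-city-map", "1200x896", 41010),
]

# Validation settings shared by every run.
SHARED = {"num_inference_steps": 8, "guidance_scale": 1.0, "use_pe": False}


def call_kwargs(size):
    """Build pipeline keyword arguments, assuming size is "WIDTHxHEIGHT"."""
    width, height = (int(value) for value in size.split("x"))
    return {"width": width, "height": height, **SHARED}


for prompt_id, size, seed in PROMPT_SET:
    print(prompt_id, seed, call_kwargs(size))
```

Each seed would normally be passed to the pipeline through a seeded `torch.Generator`; whether validation used a CPU or CUDA generator is not stated here, so exact image reproduction is not guaranteed.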
## Notes
- The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
- `use_pe=True` remains supported by the pipeline, but timings then include the prompt enhancer in addition to image generation.
- Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
- This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.