	---
	license: apache-2.0
	base_model: baidu/ERNIE-Image-Turbo
	pipeline_tag: text-to-image
	library_name: diffusers
	tags:
	- text-to-image
	- diffusers
	- safetensors
	- ernie-image
	- sdnq
	- quantized
	- uint4
	- static
	- quantized-matmul
	---

	# ERNIE-Image-Turbo SDNQ UINT4 Static

	This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).
	The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.
	For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata, but may not be applied automatically by `from_pretrained()`.

	## Recipe

	- Base model: `baidu/ERNIE-Image-Turbo`
	- Quantizer: `sdnq` / SDNQ UINT4 static, `dequantize_fp32=false`
	- Quantized components: `pe`, `text_encoder`, `transformer`
	- Runtime validation: `use_quantized_matmul=true`
	- Validation GPU: NVIDIA RTX 6000 Ada Generation
	- Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
	- Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
	- Machine-readable runtime recommendations are stored in `runtime_config.json`.
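
The allocator note in the recipe can be checked before `torch` is imported. The following is a minimal sketch and not part of this repository: the helper name `risky_allocator_options` is hypothetical, and flagging each option on its own is an assumption, since validation only covered the two options set together.

```python
import os

# Options named in the runtime note above; the validation run set both at once.
RISKY_OPTIONS = ("expandable_segments", "max_split_size_mb")


def risky_allocator_options(conf: str) -> list[str]:
    """Return the risky option names present in a PYTORCH_CUDA_ALLOC_CONF string."""
    found = []
    for item in conf.split(","):
        key, _, value = item.partition(":")
        key = key.strip()
        # expandable_segments:False leaves the feature off, so only flag it when enabled.
        if key == "expandable_segments" and value.strip().lower() != "true":
            continue
        if key in RISKY_OPTIONS:
            found.append(key)
    return found


# Run this before importing torch: the CUDA caching allocator reads the
# variable when it initializes, so changing it later has no effect.
flagged = risky_allocator_options(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""))
if flagged:
    print(f"Warning: consider unsetting {', '.join(flagged)} for this pipeline")
```

If the warning fires, unsetting the variable restores the default allocator behavior used for the explicit-qmm measurements.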

	`use_pe=False` is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer's share was much smaller.
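
Stage-level timing of this kind can be approximated by temporarily wrapping a component method. This is a sketch under assumptions, not the profiler used for the published metrics: `timed_method` is a hypothetical helper, and `DummyStage` stands in for a real pipeline component so the example runs without a GPU.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed_method(obj, method_name, totals, label, sync=None):
    """Temporarily wrap obj.<method_name> and add its wall time to totals[label]."""
    original = getattr(obj, method_name)
    had_instance_attr = method_name in vars(obj)

    def wrapper(*args, **kwargs):
        if sync is not None:
            sync()  # flush queued GPU work so it is not charged to this stage
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            if sync is not None:
                sync()
            totals[label] += time.perf_counter() - start

    setattr(obj, method_name, wrapper)
    try:
        yield
    finally:
        # Restore the object to its previous state.
        if had_instance_attr:
            setattr(obj, method_name, original)
        else:
            delattr(obj, method_name)


class DummyStage:
    """Stand-in for a pipeline component (hypothetical, for demonstration only)."""

    def forward(self, x):
        return x * 2


totals = defaultdict(float)
stage = DummyStage()
with timed_method(stage, "forward", totals, "dummy.forward"):
    result = stage.forward(21)
print(result, dict(totals))
```

With the real pipeline, one would presumably nest the context manager around `pipe.pe`, `pipe.text_encoder`, and `pipe.transformer` with `"forward"`, plus the VAE's `decode`, and pass `sync=torch.cuda.synchronize` so asynchronous CUDA work is attributed to the right stage.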

	## Measured Results

	| Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
	|---|---:|---:|---:|---:|---:|---:|---:|---:|
	| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
	| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |

	The SDNQ row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading, so that row may reflect a run without quantized matmul active.

	### Explicit Quantized-Matmul Runtime

	With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:

	| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
	|---|---:|---:|---:|---:|---:|---:|---:|
	| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
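
The cold and hot columns above can be reproduced in spirit with a small timing harness. This is a sketch only: the `benchmark` function and its parameters are hypothetical, the split into one cold call followed by `n_hot` hot calls is an assumption about the protocol, and the stand-in workload replaces the real pipeline call so the example runs anywhere.

```python
import statistics
import time


def benchmark(generate, n_hot=10, sync=None):
    """Time one cold call, then n_hot hot calls, of generate(index)."""

    def timed(index):
        if sync is not None:
            sync()  # e.g. torch.cuda.synchronize on a GPU run
        start = time.perf_counter()
        generate(index)
        if sync is not None:
            sync()
        return time.perf_counter() - start

    cold = timed(0)  # first call includes warmup costs
    hot = [timed(i) for i in range(n_hot)]  # no cache clearing between calls
    return {
        "cold_s": cold,
        "hot_mean_s": statistics.mean(hot),
        "hot_median_s": statistics.median(hot),
        "hot_range_s": (min(hot), max(hot)),
    }


# Stand-in workload; a real run would call the pipeline with prompt/seed pair `i`.
stats = benchmark(lambda i: sum(range(10_000)), n_hot=5)
print(stats)
```

On a GPU run, calling `torch.cuda.reset_peak_memory_stats()` before the hot loop and reading `torch.cuda.max_memory_reserved()` and `torch.cuda.max_memory_allocated()` afterwards (both return bytes) would give figures comparable to the reserved and allocated columns.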

	The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` is about `0.55-0.65s` after warmup, and `vae.decode` is usually about `0.15s`.

	The allocator setting has a large effect: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured a hot median of `25.88s` with `empty_cache=True`, or `15.86s` without `empty_cache`.
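
To put the hot medians reported in this section on one scale, the short calculation below expresses each as a multiple of the explicit-qmm hot median. The numbers are copied from the tables and text above; the dictionary keys are labels made up for this example, and the rows come from separate runs, so the ratios are indicative only.

```python
# Hot median seconds per image, as reported in this model card.
hot_median_s = {
    "bf16": 7.67,
    "sdnq_serialized_config": 12.26,
    "sdnq_explicit_qmm": 5.81,
    "explicit_qmm_alloc_conf_empty_cache": 25.88,
    "explicit_qmm_alloc_conf_no_empty_cache": 15.86,
}

baseline = hot_median_s["sdnq_explicit_qmm"]
ratios = {name: round(value / baseline, 2) for name, value in hot_median_s.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the explicit-qmm hot median")
```

By this measure the BF16 original takes about 1.32x as long per image as explicit qmm, the serialized-config path about 2.11x, and the problematic allocator configuration roughly 2.7x to 4.5x.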

	## Visual Comparison

	[![Original BF16 vs SDNQ UINT4 static + quantized matmul, PE off](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)

	Individual prompt pairs are stored in `comparison/`, and full metrics are stored in `metrics/`.

	## Usage

	```python
	import torch
	import sdnq # registers SDNQ support
	from diffusers import ErnieImagePipeline
	from sdnq.loader import apply_sdnq_options_to_model

	pipe = ErnieImagePipeline.from_pretrained(
	    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
	    torch_dtype=torch.bfloat16,
	).to("cuda")

	for name in ("pe", "text_encoder", "transformer"):
	    component = getattr(pipe, name, None)
	    if component is not None:
	        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

	image = pipe(
	    prompt="A clean modern poster with readable Cyrillic typography",
	    width=1024,
	    height=1024,
	    num_inference_steps=8,
	    guidance_scale=1.0,
	    use_pe=False,
	).images[0]
	```

	If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.

	You can confirm the runtime state after loading:

	```python
	for name in ("pe", "text_encoder", "transformer"):
	    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
	    print(name, getattr(qcfg, "use_quantized_matmul", None))
	```

	## Prompt Set

	| # | Prompt ID | Size | Seed | Focus |
	|---:|---|---:|---:|---|
	| 00 | `00-cyrillic-poster` | 1024x1024 | 41001 | Cyrillic event poster |
	| 01 | `01-long-text-bakery-ad` | 896x1200 | 41002 | Long text product ad |
	| 02 | `02-technical-diagram` | 1200x896 | 41003 | Technical diagram |
	| 03 | `03-four-panel-comic` | 1024x1024 | 41004 | Four-panel comic |
	| 04 | `04-public-domain-painter-fusion` | 1024x1024 | 41005 | Painterly style fusion |
	| 05 | `05-dashboard-ui` | 1376x768 | 41006 | Dense UI dashboard |
	| 06 | `06-glass-still-life` | 1024x1024 | 41007 | Glass and reflections |
	| 07 | `07-botanical-field-guide` | 896x1200 | 41008 | Field guide plate |
	| 08 | `08-restaurant-menu-board` | 1024x1024 | 41009 | Menu board text |
	| 09 | `09-isometric-city-map` | 1200x896 | 41010 | Isometric map |
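
For scripted runs, the table above can be carried as plain data. The sketch below is illustrative: the `PromptCase` class is hypothetical, the size strings are assumed to be `width x height` (matching the `width=1024, height=1024` usage example), and the prompt texts themselves are not reproduced here because this card lists only their IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    prompt_id: str
    width: int
    height: int
    seed: int


# (prompt ID, size, seed) rows copied from the Prompt Set table.
ROWS = [
    ("00-cyrillic-poster", "1024x1024", 41001),
    ("01-long-text-bakery-ad", "896x1200", 41002),
    ("02-technical-diagram", "1200x896", 41003),
    ("03-four-panel-comic", "1024x1024", 41004),
    ("04-public-domain-painter-fusion", "1024x1024", 41005),
    ("05-dashboard-ui", "1376x768", 41006),
    ("06-glass-still-life", "1024x1024", 41007),
    ("07-botanical-field-guide", "896x1200", 41008),
    ("08-restaurant-menu-board", "1024x1024", 41009),
    ("09-isometric-city-map", "1200x896", 41010),
]


def parse_case(prompt_id: str, size: str, seed: int) -> PromptCase:
    """Split a 'WIDTHxHEIGHT' string (assumed order) into integer dimensions."""
    width, height = (int(part) for part in size.split("x"))
    return PromptCase(prompt_id, width, height, seed)


cases = [parse_case(*row) for row in ROWS]
for case in cases:
    print(case)
```

Each case's seed would typically be turned into a `torch.Generator(device="cuda").manual_seed(case.seed)`; whether this pipeline accepts a `generator` argument, as most Diffusers pipelines do, is an assumption to verify against the pipeline's signature.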

	## Notes

	- The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
	- `use_pe=True` remains supported by the pipeline, but timings taken with it include prompt-enhancer latency in addition to image generation.
	- Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
	- This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.