ERNIE-Image-Turbo SDNQ UINT4 Static

This is a 4-bit SDNQ static quantization of baidu/ERNIE-Image-Turbo. The published SDNQ configs set use_quantized_matmul=true for pe, text_encoder, transformer, and the pipeline-level config. With current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading by calling apply_sdnq_options_to_model: the serialized flag is retained in the config metadata, but from_pretrained() may not apply it automatically.

Recipe

Base model: baidu/ERNIE-Image-Turbo
Quantizer: sdnq / SDNQ UINT4 static, dequantize_fp32=false
Quantized components: pe, text_encoder, transformer
Runtime validation: use_quantized_matmul=true
Validation GPU: NVIDIA RTX 6000 Ada Generation
Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, use_pe=False
Runtime note: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32 for this pipeline; in validation it caused allocator over-reservation (about 48 GiB reserved) and much slower generation (15.86-25.88 s hot median, versus 5.81 s with the default allocator).
Machine-readable runtime recommendations are stored in runtime_config.json.
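
The allocator note above can be turned into a small startup guard. The following is a minimal sketch, not part of the published artifact: the helper names (parse_alloc_conf, matches_reported_bad_config, warn_if_bad_allocator) are hypothetical, and the check only looks for the exact option combination reported in this card.

```python
import os
import warnings

# The exact option combination this card reports as harmful (values lowercased).
REPORTED_BAD_OPTIONS = {"expandable_segments": "true", "max_split_size_mb": "32"}


def parse_alloc_conf(value):
    """Parse a PYTORCH_CUDA_ALLOC_CONF string into {option: value}, lowercased."""
    options = {}
    for item in value.split(","):
        if ":" in item:
            key, _, val = item.partition(":")
            options[key.strip().lower()] = val.strip().lower()
    return options


def matches_reported_bad_config(value):
    """True only if every option from the reported combination is set as reported."""
    parsed = parse_alloc_conf(value or "")
    return all(parsed.get(k) == v for k, v in REPORTED_BAD_OPTIONS.items())


def warn_if_bad_allocator(environ=os.environ):
    """Warn when the environment carries the allocator settings this card advises against."""
    value = environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    bad = matches_reported_bad_config(value)
    if bad:
        warnings.warn(
            f"PYTORCH_CUDA_ALLOC_CONF={value!r} matches the configuration reported "
            "to cause over-reservation and slow generation for this pipeline."
        )
    return bad
```

Calling warn_if_bad_allocator() early in the process, before the pipeline is loaded, is the intended use; other allocator settings were not evaluated in this card and are deliberately not flagged.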

use_pe=False is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that use_pe=True can dominate latency: on the 1200x896 technical-diagram prompt, pe.forward accounted for most of the runtime, while the denoising transformer accounted for a much smaller share.

Measured Results

Model	PE	Load s	Load peak VRAM MiB	Cold inference s	Cold peak VRAM MiB	Hot mean s/img	Hot median s/img	Hot peak VRAM MiB
Original BF16	off	91.84	29692	7.67	34840	7.69	7.67	34932
SDNQ UINT4 static, serialized config path	off	71.84	10172	16.10	15254	11.15	12.26	15390

The SDNQ row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading, so that row may not reflect quantized-matmul performance.

Explicit Quantized-Matmul Runtime

With explicit apply_sdnq_options_to_model(..., use_quantized_matmul=True), default PyTorch CUDA allocator settings, and no torch.cuda.empty_cache() between hot generations:

Runtime	PE	Cold s	Hot mean s/img	Hot median s/img	Hot range s/img	Hot peak torch reserved MiB	Hot peak torch allocated MiB
SDNQ UINT4 static + explicit qmm	off	8.34	6.08	5.81	5.55-6.94	19540	19391
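
As a worked comparison, the hot median figures from the two tables in this card can be put side by side. This is only arithmetic on the published numbers; the rows come from separate runs (the original validation run and the follow-up profiling pass), so the ratios are indicative rather than a controlled benchmark. The dictionary keys are labels chosen for this sketch.

```python
# Hot median seconds per image, copied from the tables in this card.
hot_median_s = {
    "bf16": 7.67,
    "uint4_serialized_config": 12.26,
    "uint4_explicit_qmm": 5.81,
}


def speedup(baseline, candidate):
    """Ratio of baseline time to candidate time (>1 means the candidate is faster)."""
    return hot_median_s[baseline] / hot_median_s[candidate]


for baseline in ("bf16", "uint4_serialized_config"):
    ratio = speedup(baseline, "uint4_explicit_qmm")
    print(f"explicit qmm vs {baseline}: {ratio:.2f}x")
```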

With PE disabled, the denoising transformer is the dominant cost. In the corrected qmm profile, transformer.forward accounts for roughly 5.0-5.4 s of a 5.8-7.0 s hot generation on RTX 6000 Ada. text_encoder.forward takes about 0.55-0.65 s after warmup, and vae.decode usually takes about 0.15 s.
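
A per-stage breakdown like the one above can be collected by temporarily wrapping a component method and accumulating wall time. The sketch below is one possible approach, not the harness used for this card; timed_method and its parameters are hypothetical names. On GPU, passing sync=torch.cuda.synchronize is assumed so that asynchronous kernels are included in the measured interval.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed_method(obj, method_name, label, totals, sync=None):
    """Temporarily wrap obj.<method_name> and add its wall time to totals[label]."""
    had_instance_attr = method_name in vars(obj)
    original = getattr(obj, method_name)

    def wrapper(*args, **kwargs):
        if sync is not None:
            sync()  # finish pending work before starting the clock
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            if sync is not None:
                sync()  # wait for this stage's work before stopping the clock
            totals[label] += time.perf_counter() - start

    setattr(obj, method_name, wrapper)
    try:
        yield totals
    finally:
        # Restore the object to its previous state.
        if had_instance_attr:
            setattr(obj, method_name, original)
        else:
            delattr(obj, method_name)


# Usage sketch (requires the loaded `pipe` from the Usage section and torch):
# totals = defaultdict(float)
# with timed_method(pipe.transformer, "forward", "transformer.forward", totals,
#                   sync=torch.cuda.synchronize):
#     pipe(prompt="...", num_inference_steps=8, guidance_scale=1.0, use_pe=False)
# print(dict(totals))
```

The total for transformer.forward is summed over all denoising steps in the call, which matches how the per-generation figures above are expressed.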

The allocator pitfall is large: with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32, the same explicit-qmm runtime reserved about 48 GiB and measured a hot median of 25.88 s with empty_cache=True, or 15.86 s without empty_cache, compared with 5.81 s under the default allocator.

Visual Comparison

Individual prompt pairs are stored in comparison/, and full metrics are stored in metrics/.

Usage

import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

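# from_pretrained() may not apply the serialized use_quantized_matmul flag,
# so enable it explicitly on each quantized component that is present.
for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))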

image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]

If you need maximum throughput, keep the model resident and avoid calling torch.cuda.empty_cache() between requests.
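
To check throughput on your own hardware, a simple timing loop with one warmup call and no cache clearing between calls is usually enough. The following is a minimal sketch under that assumption; measure_hot_runs is a hypothetical helper, not the script that produced the tables in this card. The generate argument is any zero-argument callable, for example a lambda that calls pipe(...) with fixed settings, and sync would typically be torch.cuda.synchronize on GPU.

```python
import statistics
import time


def measure_hot_runs(generate, runs=5, warmup=1, sync=None):
    """Time repeated calls to `generate` and summarize seconds per call."""

    def timed_call():
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    # Warmup (cold) calls are reported separately from the hot statistics.
    cold = [timed_call() for _ in range(warmup)]
    # Hot calls run back to back, with no cache clearing in between.
    hot = [timed_call() for _ in range(runs)]
    return {
        "cold_s": cold,
        "hot_mean_s": statistics.mean(hot),
        "hot_median_s": statistics.median(hot),
        "hot_min_s": min(hot),
        "hot_max_s": max(hot),
    }
```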

You can confirm the runtime state after loading:

for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    print(name, getattr(qcfg, "use_quantized_matmul", None))

Prompt Set

#	Prompt ID	Size	Seed	Focus
00	`00-cyrillic-poster`	1024x1024	41001	Cyrillic event poster
01	`01-long-text-bakery-ad`	896x1200	41002	Long text product ad
02	`02-technical-diagram`	1200x896	41003	Technical diagram
03	`03-four-panel-comic`	1024x1024	41004	Four-panel comic
04	`04-public-domain-painter-fusion`	1024x1024	41005	Painterly style fusion
05	`05-dashboard-ui`	1376x768	41006	Dense UI dashboard
06	`06-glass-still-life`	1024x1024	41007	Glass and reflections
07	`07-botanical-field-guide`	896x1200	41008	Field guide plate
08	`08-restaurant-menu-board`	1024x1024	41009	Menu board text
09	`09-isometric-city-map`	1200x896	41010	Isometric map
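
For reproduction, the table above can be expressed as data and expanded into per-prompt call arguments. This sketch assumes the Size column is width x height, which is consistent with the Usage example but not stated explicitly; PROMPT_SET and call_kwargs are hypothetical names. The prompt texts are not reproduced in this card, so only IDs, sizes, and seeds appear here.

```python
# (prompt_id, size, seed) copied from the Prompt Set table.
PROMPT_SET = [
    ("00-cyrillic-poster", "1024x1024", 41001),
    ("01-long-text-bakery-ad", "896x1200", 41002),
    ("02-technical-diagram", "1200x896", 41003),
    ("03-four-panel-comic", "1024x1024", 41004),
    ("04-public-domain-painter-fusion", "1024x1024", 41005),
    ("05-dashboard-ui", "1376x768", 41006),
    ("06-glass-still-life", "1024x1024", 41007),
    ("07-botanical-field-guide", "896x1200", 41008),
    ("08-restaurant-menu-board", "1024x1024", 41009),
    ("09-isometric-city-map", "1200x896", 41010),
]


def call_kwargs(size, steps=8, guidance_scale=1.0, use_pe=False):
    """Build pipeline keyword arguments for one table row (size assumed WIDTHxHEIGHT)."""
    width, height = (int(v) for v in size.lower().split("x"))
    return {
        "width": width,
        "height": height,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }


for prompt_id, size, seed in PROMPT_SET:
    # The seed is listed alongside; how it is passed to the pipeline
    # (for example through a seeded torch.Generator) is an assumption left to the caller.
    print(prompt_id, seed, call_kwargs(size))
```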

Notes

The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
use_pe=True remains supported by the pipeline, but timings taken with it include the prompt enhancer in addition to image generation.
Corrected qmm runtime metrics are stored in metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json; allocator-debug metrics are stored in metrics/runtime_allocator_debug_metrics.json.
This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.
