ERNIE-Image-Turbo SDNQ UINT4 Static

This is a 4-bit SDNQ static quantization of baidu/ERNIE-Image-Turbo. The published SDNQ configs set use_quantized_matmul=true for pe, text_encoder, transformer, and the pipeline-level config. For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with apply_sdnq_options_to_model; the serialized flag is retained in metadata, but may not be applied automatically by from_pretrained().

Recipe

  • Base model: baidu/ERNIE-Image-Turbo
  • Quantizer: sdnq / SDNQ UINT4 static, dequantize_fp32=false
  • Quantized components: pe, text_encoder, transformer
  • Runtime validation: use_quantized_matmul=true
  • Validation GPU: NVIDIA RTX 6000 Ada Generation
  • Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, use_pe=False
  • Runtime note: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32 for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
  • Machine-readable runtime recommendations are stored in runtime_config.json.
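
The allocator note above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the published artifact: the helper names parse_alloc_conf and risky_alloc_options are hypothetical, and the check assumes it runs before PyTorch initializes CUDA, since the allocator reads PYTORCH_CUDA_ALLOC_CONF at that point. It only inspects the environment variable and does not touch the GPU.

```python
import os

# Allocator options that this card's validation found harmful in combination.
RISKY_KEYS = ("expandable_segments", "max_split_size_mb")


def parse_alloc_conf(value):
    """Parse a 'key:value,key:value' string into a dict, skipping malformed items."""
    options = {}
    for item in value.split(","):
        key, sep, val = item.partition(":")
        if sep:
            options[key.strip()] = val.strip()
    return options


def risky_alloc_options(env=None):
    """Return the risky options set in PYTORCH_CUDA_ALLOC_CONF (empty dict if none)."""
    env = os.environ if env is None else env
    options = parse_alloc_conf(env.get("PYTORCH_CUDA_ALLOC_CONF", ""))
    return {key: val for key, val in options.items() if key in RISKY_KEYS}


found = risky_alloc_options()
if found:
    print(f"warning: PYTORCH_CUDA_ALLOC_CONF sets {found}; "
          "this card measured much slower denoising with these options")
```

Running this at the top of a serving script, before importing the pipeline, gives an early warning instead of a silent slowdown.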

use_pe=False is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that use_pe=True can dominate latency: on the 1200x896 technical-diagram prompt, pe.forward accounted for most of the runtime, while the denoising transformer's share was much smaller.

Measured Results

| Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
|---|---|---|---|---|---|---|---|---|
| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |

The SDNQ row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading, so that row likely reflects a run without quantized matmul.

Explicit Quantized-Matmul Runtime

With explicit apply_sdnq_options_to_model(..., use_quantized_matmul=True), default PyTorch CUDA allocator settings, and no torch.cuda.empty_cache() between hot generations:

| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
|---|---|---|---|---|---|---|---|
| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
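
For readers who want to collect comparable cold and hot numbers on their own hardware, a timing harness along these lines may help. This is a sketch under stated assumptions, not the harness used for the validation run: the function name benchmark and its parameters are hypothetical, the first call is treated as the cold run, and the following calls are treated as hot runs with no cache clearing in between.

```python
import statistics
import time


def benchmark(generate, hot_runs=5, sync=None):
    """Time one cold call and `hot_runs` hot calls of `generate`.

    `sync` is an optional callable (for example torch.cuda.synchronize) that
    is invoked before and after each call so queued GPU work is included.
    """
    def timed_call():
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    cold = timed_call()                       # first call pays warmup costs
    hot = [timed_call() for _ in range(hot_runs)]
    return {
        "cold_s": cold,
        "hot_mean_s": statistics.mean(hot),
        "hot_median_s": statistics.median(hot),
        "hot_range_s": (min(hot), max(hot)),
    }


# Hypothetical usage with a loaded pipeline:
#   stats = benchmark(lambda: pipe(prompt=..., num_inference_steps=8,
#                                  guidance_scale=1.0, use_pe=False),
#                     sync=torch.cuda.synchronize)
#   peak_reserved_mib = torch.cuda.max_memory_reserved() / 2**20
```

The harness deliberately does not call torch.cuda.empty_cache() between hot runs, matching the conditions described for the table above.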

The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, transformer.forward accounts for roughly 5.0-5.4s of a 5.8-7.0s hot generation on RTX 6000 Ada. text_encoder.forward is about 0.55-0.65s after warmup, and vae.decode is usually about 0.15s.

The allocator pitfall is large: with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32, the same explicit-qmm runtime reserved about 48 GiB and measured a hot median of 25.88s when torch.cuda.empty_cache() was called between generations, or 15.86s without it.
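
To put the reported hot medians side by side, the short calculation below divides each one by the explicit-qmm median. The timings are copied from this card; the ratios are derived here and are not separately measured, and the dictionary keys are hypothetical labels.

```python
# Hot median seconds per image, copied from the tables and text above.
hot_median_s = {
    "bf16_original": 7.67,
    "uint4_serialized_config": 12.26,
    "uint4_explicit_qmm": 5.81,
    "uint4_qmm_bad_allocator_with_empty_cache": 25.88,
    "uint4_qmm_bad_allocator_no_empty_cache": 15.86,
}

baseline = hot_median_s["uint4_explicit_qmm"]

# Ratio > 1 means that configuration is slower than explicit qmm.
ratios = {name: round(seconds / baseline, 2)
          for name, seconds in hot_median_s.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the explicit-qmm hot median")
```

On these figures, the original BF16 run takes about 1.32x as long per image as the explicit-qmm run, and the allocator setting with empty_cache takes about 4.45x as long.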

Visual Comparison

Original BF16 vs SDNQ UINT4 static + quantized matmul, PE off

Individual prompt pairs are stored in comparison/, and full metrics are stored in metrics/.

Usage

import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]

If you need maximum throughput, keep the model resident and avoid calling torch.cuda.empty_cache() between requests.

You can confirm the runtime state after loading:

for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    print(name, getattr(qcfg, "use_quantized_matmul", None))
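
The same check can fail fast instead of printing. The sketch below wraps the attribute lookup from the snippet above in two hypothetical helpers, quantized_matmul_state and require_quantized_matmul. It assumes, as the snippet does, that quantization_config.use_quantized_matmul reflects the runtime state; components that are absent from the pipeline are skipped. A SimpleNamespace stand-in replaces the real pipeline so the example runs without a GPU.

```python
from types import SimpleNamespace

QUANTIZED_COMPONENTS = ("pe", "text_encoder", "transformer")


def quantized_matmul_state(pipe, names=QUANTIZED_COMPONENTS):
    """Map each present component to its reported use_quantized_matmul value."""
    state = {}
    for name in names:
        component = getattr(pipe, name, None)
        if component is None:
            continue  # component not loaded in this pipeline
        qcfg = getattr(component, "quantization_config", None)
        state[name] = getattr(qcfg, "use_quantized_matmul", None)
    return state


def require_quantized_matmul(pipe, names=QUANTIZED_COMPONENTS):
    """Raise if any present component does not report quantized matmul as on."""
    state = quantized_matmul_state(pipe, names)
    disabled = sorted(name for name, flag in state.items() if flag is not True)
    if disabled:
        raise RuntimeError(f"quantized matmul not enabled for: {disabled}")
    return state


# Stand-in objects for illustration only; pass the real pipeline in practice.
fake_component = SimpleNamespace(
    quantization_config=SimpleNamespace(use_quantized_matmul=True))
fake_pipe = SimpleNamespace(pe=None, text_encoder=fake_component,
                            transformer=fake_component)
print(require_quantized_matmul(fake_pipe))
```

Calling require_quantized_matmul(pipe) right after the apply_sdnq_options_to_model loop would surface a loader that left the flag off before any timing is taken.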

Prompt Set

| # | Prompt ID | Size | Seed | Focus |
|---|---|---|---|---|
| 00 | 00-cyrillic-poster | 1024x1024 | 41001 | Cyrillic event poster |
| 01 | 01-long-text-bakery-ad | 896x1200 | 41002 | Long text product ad |
| 02 | 02-technical-diagram | 1200x896 | 41003 | Technical diagram |
| 03 | 03-four-panel-comic | 1024x1024 | 41004 | Four-panel comic |
| 04 | 04-public-domain-painter-fusion | 1024x1024 | 41005 | Painterly style fusion |
| 05 | 05-dashboard-ui | 1376x768 | 41006 | Dense UI dashboard |
| 06 | 06-glass-still-life | 1024x1024 | 41007 | Glass and reflections |
| 07 | 07-botanical-field-guide | 896x1200 | 41008 | Field guide plate |
| 08 | 08-restaurant-menu-board | 1024x1024 | 41009 | Menu board text |
| 09 | 09-isometric-city-map | 1200x896 | 41010 | Isometric map |
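
For scripted reruns, the prompt set can be expressed as data. This is a rough sketch: the PromptCase class is hypothetical, the Size column is assumed to be WIDTHxHEIGHT (the card does not state the order), and the prompt texts are not reproduced in this card, so they are left out. The remaining settings mirror the validation settings listed under Recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    prompt_id: str
    size: str   # assumed to be "WIDTHxHEIGHT"
    seed: int

    def pipeline_kwargs(self):
        """Build call arguments matching the card's validation settings."""
        width, height = (int(part) for part in self.size.split("x"))
        return {
            "width": width,
            "height": height,
            "num_inference_steps": 8,
            "guidance_scale": 1.0,
            "use_pe": False,
        }


CASES = [
    PromptCase("00-cyrillic-poster", "1024x1024", 41001),
    PromptCase("01-long-text-bakery-ad", "896x1200", 41002),
    PromptCase("02-technical-diagram", "1200x896", 41003),
    PromptCase("03-four-panel-comic", "1024x1024", 41004),
    PromptCase("04-public-domain-painter-fusion", "1024x1024", 41005),
    PromptCase("05-dashboard-ui", "1376x768", 41006),
    PromptCase("06-glass-still-life", "1024x1024", 41007),
    PromptCase("07-botanical-field-guide", "896x1200", 41008),
    PromptCase("08-restaurant-menu-board", "1024x1024", 41009),
    PromptCase("09-isometric-city-map", "1200x896", 41010),
]

# Hypothetical usage; seeding through a torch.Generator is the usual Diffusers
# convention, but whether the validation run seeded this way is an assumption:
#   generator = torch.Generator("cuda").manual_seed(case.seed)
#   image = pipe(prompt=..., generator=generator, **case.pipeline_kwargs()).images[0]
```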

Notes

  • The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
  • use_pe=True remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
  • Corrected qmm runtime metrics are stored in metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json; allocator-debug metrics are stored in metrics/runtime_allocator_debug_metrics.json.
  • This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.