ERNIE-Image-Turbo SDNQ UINT4 Static
This is a 4-bit SDNQ static quantization of baidu/ERNIE-Image-Turbo.
The published SDNQ configs set use_quantized_matmul=true for pe, text_encoder, transformer, and the pipeline-level config.
For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with apply_sdnq_options_to_model; the serialized flag is retained in metadata, but may not be applied automatically by from_pretrained().
Recipe
- Base model: baidu/ERNIE-Image-Turbo
- Quantizer: sdnq / SDNQ UINT4 static, dequantize_fp32=false
- Quantized components: pe, text_encoder, transformer
- Runtime validation: use_quantized_matmul=true
- Validation GPU: NVIDIA RTX 6000 Ada Generation
- Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, use_pe=False
- Runtime note: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32 for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
- Machine-readable runtime recommendations are stored in runtime_config.json.
use_pe=False is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that use_pe=True can dominate latency: on the 1200x896 technical-diagram prompt, pe.forward accounted for most of the runtime, while the denoising transformer took a much smaller share.
Measured Results
| Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
|---|---|---|---|---|---|---|---|---|
| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |
The row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading.
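As a reading aid, the ratios implied by the table can be computed directly. The sketch below only restates numbers from the two rows above; the dictionary and function names are illustrative and not part of the repository.

```python
# Values copied from the Measured Results table (PE off, RTX 6000 Ada).
bf16 = {"load_peak_mib": 29692, "hot_peak_mib": 34932, "hot_median_s": 7.67}
uint4 = {"load_peak_mib": 10172, "hot_peak_mib": 15390, "hot_median_s": 12.26}


def ratio(key):
    """Quantized value divided by the BF16 value for one table column."""
    return uint4[key] / bf16[key]


for key in bf16:
    print(f"{key}: {ratio(key):.2f}x of BF16")
```

On the serialized config path, the quantized model uses roughly a third of the load-time VRAM and under half of the hot peak VRAM, but its hot median is about 1.6 times the BF16 latency.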
Explicit Quantized-Matmul Runtime
With explicit apply_sdnq_options_to_model(..., use_quantized_matmul=True), default PyTorch CUDA allocator settings, and no torch.cuda.empty_cache() between hot generations:
| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
|---|---|---|---|---|---|---|---|
| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |
The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, transformer.forward accounts for roughly 5.0-5.4s of a 5.8-7.0s hot generation on RTX 6000 Ada. text_encoder.forward is about 0.55-0.65s after warmup, and vae.decode is usually about 0.15s.
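Per-stage timings like these can be collected by temporarily wrapping a component method. The following is a minimal sketch, not the profiling code used for this card: timed_method and FakeStage are hypothetical names, and the sync argument is assumed to be something like torch.cuda.synchronize so that queued GPU work is included in the measurement.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed_method(obj, method_name, label, totals, sync=None):
    """Temporarily wrap obj.<method_name> and add its wall time to totals[label].

    sync is an optional zero-argument callable run before and after the call.
    """
    original = getattr(obj, method_name)

    def wrapper(*args, **kwargs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            if sync is not None:
                sync()
            totals[label] += time.perf_counter() - start

    setattr(obj, method_name, wrapper)
    try:
        yield totals
    finally:
        # Put the original method back so later calls are not timed.
        setattr(obj, method_name, original)


# Stand-in object so the sketch runs without a GPU. With a real pipeline you
# would pass pipe.transformer or pipe.text_encoder ("forward"), or pipe.vae ("decode").
class FakeStage:
    def forward(self, x):
        time.sleep(0.01)
        return x * 2


totals = defaultdict(float)
stage = FakeStage()
with timed_method(stage, "forward", "fake_stage.forward", totals):
    result = stage.forward(21)
print(result, dict(totals))
```

Patching the instance attribute works for plain objects and for typical torch.nn.Module calls, but offloading hooks or compiled modules may bypass it, so treat the numbers as approximate.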
The allocator setting has a large cost: with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32, the same explicit-qmm runtime reserved about 48 GiB and measured a hot median of 25.88s with empty_cache=True, or 15.86s without empty_cache, compared with 5.81s under the default allocator settings.
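A launcher script can guard against inheriting this setting from the environment. The sketch below is an assumption about how one might check it, with hypothetical helper names; it only parses the variable and does not modify PyTorch behavior.

```python
import os


def parse_alloc_conf(value):
    """Split 'key:value,key:value' into a dict; empty input gives an empty dict."""
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition(":")
        options[key.strip()] = val.strip()
    return options


def risky_allocator_options(env=None):
    """Return the options from the combination this card warns about, if set."""
    env = os.environ if env is None else env
    options = parse_alloc_conf(env.get("PYTORCH_CUDA_ALLOC_CONF", ""))
    flagged = {}
    if options.get("expandable_segments", "").lower() == "true":
        flagged["expandable_segments"] = options["expandable_segments"]
    if "max_split_size_mb" in options:
        flagged["max_split_size_mb"] = options["max_split_size_mb"]
    return flagged


# Intended to run before importing torch, so the check happens before the
# CUDA allocator is configured.
flagged = risky_allocator_options()
if flagged:
    print("Warning: allocator options flagged by the model card:", flagged)
```

Whether each option is harmful on its own was not measured here; the validation only covers the two options set together.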
Visual Comparison
Individual prompt pairs are stored in comparison/, and full metrics are stored in metrics/.
Usage
```python
import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]
```
If you need maximum throughput, keep the model resident and avoid calling torch.cuda.empty_cache() between requests.
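To check throughput on your own hardware, one option is a small timing harness that separates the first call from later ones. This is a sketch under assumptions: it treats "cold" as the first call after loading and "hot" as subsequent calls, which may not match the exact methodology behind the tables above, and benchmark is a hypothetical helper. It deliberately does not call torch.cuda.empty_cache() between runs.

```python
import statistics
import time


def benchmark(generate, hot_runs=3, sync=None):
    """Time one cold call, then hot_runs further calls of generate()."""

    def timed_call():
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    cold = timed_call()
    hot = [timed_call() for _ in range(hot_runs)]
    return {
        "cold_s": cold,
        "hot_mean_s": statistics.mean(hot),
        "hot_median_s": statistics.median(hot),
        "hot_range_s": (min(hot), max(hot)),
    }


# Placeholder workload; replace it with a closure that calls pipe(...) as in Usage,
# and pass sync=torch.cuda.synchronize when timing GPU work.
stats = benchmark(lambda: sum(range(10_000)), hot_runs=5)
print(stats)
```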
You can confirm the runtime state after loading:
```python
for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    print(name, getattr(qcfg, "use_quantized_matmul", None))
```
Prompt Set
| # | Prompt ID | Size | Seed | Focus |
|---|---|---|---|---|
| 00 | 00-cyrillic-poster | 1024x1024 | 41001 | Cyrillic event poster |
| 01 | 01-long-text-bakery-ad | 896x1200 | 41002 | Long text product ad |
| 02 | 02-technical-diagram | 1200x896 | 41003 | Technical diagram |
| 03 | 03-four-panel-comic | 1024x1024 | 41004 | Four-panel comic |
| 04 | 04-public-domain-painter-fusion | 1024x1024 | 41005 | Painterly style fusion |
| 05 | 05-dashboard-ui | 1376x768 | 41006 | Dense UI dashboard |
| 06 | 06-glass-still-life | 1024x1024 | 41007 | Glass and reflections |
| 07 | 07-botanical-field-guide | 896x1200 | 41008 | Field guide plate |
| 08 | 08-restaurant-menu-board | 1024x1024 | 41009 | Menu board text |
| 09 | 09-isometric-city-map | 1200x896 | 41010 | Isometric map |
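For scripted reruns, the table rows can be turned into pipeline arguments. The sketch below is illustrative only: PromptCase and pipeline_kwargs are hypothetical names, only three rows are shown, the prompt texts are not reproduced here, and the Size column is assumed to be width by height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    prompt_id: str
    size: str  # as written in the table, e.g. "1200x896"
    seed: int

    def dimensions(self):
        """Return (width, height), ASSUMING the table lists width first."""
        width, height = (int(part) for part in self.size.lower().split("x"))
        return width, height


# A subset of the rows from the Prompt Set table.
CASES = [
    PromptCase("00-cyrillic-poster", "1024x1024", 41001),
    PromptCase("02-technical-diagram", "1200x896", 41003),
    PromptCase("05-dashboard-ui", "1376x768", 41006),
]


def pipeline_kwargs(case):
    """Build keyword arguments matching the validation settings (8 steps, guidance 1.0, PE off)."""
    width, height = case.dimensions()
    return {
        "width": width,
        "height": height,
        "num_inference_steps": 8,
        "guidance_scale": 1.0,
        "use_pe": False,
    }


for case in CASES:
    print(case.prompt_id, case.seed, pipeline_kwargs(case))
```

In Diffusers pipelines a fixed seed is usually supplied through a torch.Generator passed as generator=...; that this pipeline accepts the argument is an assumption to verify against its signature.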
Notes
- The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
- use_pe=True remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
- Corrected qmm runtime metrics are stored in metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json; allocator-debug metrics are stored in metrics/runtime_allocator_debug_metrics.json.
- This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.