---
license: apache-2.0
base_model: baidu/ERNIE-Image-Turbo
pipeline_tag: text-to-image
library_name: diffusers
tags:
- text-to-image
- diffusers
- safetensors
- ernie-image
- sdnq
- quantized
- uint4
- static
- quantized-matmul
---

# ERNIE-Image-Turbo SDNQ UINT4 Static

This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).

The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config.

For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata but may not be applied automatically by `from_pretrained()`.

## Recipe

- Base model: `baidu/ERNIE-Image-Turbo`
- Quantizer: `sdnq` / SDNQ UINT4 static, `dequantize_fp32=false`
- Quantized components: `pe`, `text_encoder`, `transformer`
- Runtime validation: `use_quantized_matmul=true`
- Validation GPU: NVIDIA RTX 6000 Ada Generation
- Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
- Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
- Machine-readable runtime recommendations are stored in `runtime_config.json`.

`use_pe=False` is used for the headline validation table so that the image models are compared directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer's share was much smaller.

## Measured Results

| Model | PE | Load (s) | Load peak VRAM (MiB) | Cold inference (s) | Cold peak VRAM (MiB) | Hot mean (s/img) | Hot median (s/img) | Hot peak VRAM (MiB) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |

The SDNQ row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading.

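The relative differences in the table can be worked out directly from its numbers. The snippet below is a small arithmetic sketch rather than part of the validation tooling; the dictionary keys and the `reduction_pct` helper are illustrative names, and the values are copied from the two table rows (PE off).

```python
# Values copied from the "Measured Results" table (PE off).
bf16 = {"load_vram_mib": 29692, "hot_peak_vram_mib": 34932, "hot_median_s": 7.67}
uint4 = {"load_vram_mib": 10172, "hot_peak_vram_mib": 15390, "hot_median_s": 12.26}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


load_vram_saving = reduction_pct(bf16["load_vram_mib"], uint4["load_vram_mib"])
hot_vram_saving = reduction_pct(bf16["hot_peak_vram_mib"], uint4["hot_peak_vram_mib"])
# Ratio above 1.0 means the serialized-config UINT4 path was slower than BF16.
latency_ratio = uint4["hot_median_s"] / bf16["hot_median_s"]

print(f"Load peak VRAM saving: {load_vram_saving:.1f}%")
print(f"Hot peak VRAM saving:  {hot_vram_saving:.1f}%")
print(f"Hot median latency ratio (UINT4 / BF16): {latency_ratio:.2f}x")
```

On these numbers the serialized-config path saves roughly two thirds of load-time VRAM but is slower per image than BF16, which is consistent with quantized matmul not being active in that run.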
### Explicit Quantized-Matmul Runtime

With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:

| Runtime | PE | Cold (s) | Hot mean (s/img) | Hot median (s/img) | Hot range (s/img) | Hot peak torch reserved (MiB) | Hot peak torch allocated (MiB) |
|---|---:|---:|---:|---:|---:|---:|---:|
| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |

With PE disabled, the slow component is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` takes about `0.55-0.65s` after warmup, and `vae.decode` usually takes about `0.15s`.

The allocator pitfall is large: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured a `25.88s` hot median when `torch.cuda.empty_cache()` was called between generations (`empty_cache=True`), or `15.86s` without it.

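Because the allocator setting comes from the environment, a small pre-flight check can flag it before the pipeline is loaded. The following is a minimal sketch and not part of this repository: `RISKY_KEYS` and `risky_allocator_options` are hypothetical names, and flagging each option on its own is an assumption, since the validation above only measured the two options combined.

```python
import os
from typing import Dict, Optional

# Options that appear in the allocator setting this card advises against.
# Assumption: either one alone is worth a warning; only the combination was measured.
RISKY_KEYS = {"expandable_segments", "max_split_size_mb"}


def risky_allocator_options(conf: Optional[str]) -> Dict[str, str]:
    """Return the flagged key/value pairs found in a PYTORCH_CUDA_ALLOC_CONF string."""
    found: Dict[str, str] = {}
    for item in (conf or "").split(","):
        key, sep, value = item.strip().partition(":")
        if sep and key in RISKY_KEYS:
            found[key] = value
    return found


# Run this before the first CUDA allocation, since the allocator reads the
# variable when it initializes.
flagged = risky_allocator_options(os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
if flagged:
    print(f"Warning: PYTORCH_CUDA_ALLOC_CONF sets {flagged}; "
          "validation showed over-reservation and slower denoising with this setting.")
```

If the check prints a warning, unsetting the variable in the launching shell or service definition is the simplest fix; changing it after CUDA has been initialized in the same process is unlikely to have any effect.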
## Visual Comparison

![Original vs. SDNQ UINT4 static (explicit qmm, PE off) comparison matrix](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)

Individual prompt pairs are stored in `comparison/`, and full metrics are stored in `metrics/`.

## Usage

```python
import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Enable quantized matmul explicitly; the serialized flag may not be applied on load.
for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

# Same settings as the validation runs: 8 steps, guidance scale 1.0, PE off.
image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]
```

If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.

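To check hot-generation latency on your own hardware, a small timing harness is enough. This is a sketch under stated assumptions, not the script used for the published metrics: `time_hot_runs` is a hypothetical helper, and the stand-in workload exists only so the example runs without a GPU.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def time_hot_runs(
    generate: Callable[[], object],
    runs: int = 5,
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Time repeated calls to `generate`, discarding the warmup (cold) calls."""

    def timed_call() -> float:
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the clock
        return time.perf_counter() - start

    for _ in range(warmup):
        timed_call()
    samples = [timed_call() for _ in range(runs)]
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Stand-in workload so the sketch runs anywhere.
stats = time_hot_runs(lambda: sum(range(10_000)), runs=3)
print(stats)
```

With the pipeline loaded, `generate` would be something like `lambda: pipe(prompt=..., num_inference_steps=8, guidance_scale=1.0, use_pe=False)` and `sync` would be `torch.cuda.synchronize`. Peak reserved memory can be read separately with `torch.cuda.max_memory_reserved()` after calling `torch.cuda.reset_peak_memory_stats()`. The harness deliberately does not call `torch.cuda.empty_cache()` between runs.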
You can confirm the runtime state after loading:

```python
for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    print(name, getattr(qcfg, "use_quantized_matmul", None))
```

## Prompt Set

| # | Prompt ID | Size | Seed | Focus |
|---:|---|---:|---:|---|
| 00 | `00-cyrillic-poster` | 1024x1024 | 41001 | Cyrillic event poster |
| 01 | `01-long-text-bakery-ad` | 896x1200 | 41002 | Long text product ad |
| 02 | `02-technical-diagram` | 1200x896 | 41003 | Technical diagram |
| 03 | `03-four-panel-comic` | 1024x1024 | 41004 | Four-panel comic |
| 04 | `04-public-domain-painter-fusion` | 1024x1024 | 41005 | Painterly style fusion |
| 05 | `05-dashboard-ui` | 1376x768 | 41006 | Dense UI dashboard |
| 06 | `06-glass-still-life` | 1024x1024 | 41007 | Glass and reflections |
| 07 | `07-botanical-field-guide` | 896x1200 | 41008 | Field guide plate |
| 08 | `08-restaurant-menu-board` | 1024x1024 | 41009 | Menu board text |
| 09 | `09-isometric-city-map` | 1200x896 | 41010 | Isometric map |

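The table can be turned into per-prompt call settings for a reproduction run. The sketch below is illustrative only: `PROMPT_SET`, `SHARED`, and `build_jobs` are hypothetical names, the `Size` column is assumed to be width x height, and the prompt text itself is not listed in this card, so it is left out.

```python
from typing import Dict, List, Tuple

# (prompt_id, size, seed) copied from the Prompt Set table.
PROMPT_SET: List[Tuple[str, str, int]] = [
    ("00-cyrillic-poster", "1024x1024", 41001),
    ("01-long-text-bakery-ad", "896x1200", 41002),
    ("02-technical-diagram", "1200x896", 41003),
    ("03-four-panel-comic", "1024x1024", 41004),
    ("04-public-domain-painter-fusion", "1024x1024", 41005),
    ("05-dashboard-ui", "1376x768", 41006),
    ("06-glass-still-life", "1024x1024", 41007),
    ("07-botanical-field-guide", "896x1200", 41008),
    ("08-restaurant-menu-board", "1024x1024", 41009),
    ("09-isometric-city-map", "1200x896", 41010),
]

# Shared validation settings from the Recipe section.
SHARED = {"num_inference_steps": 8, "guidance_scale": 1.0, "use_pe": False}


def build_jobs() -> List[Dict[str, object]]:
    """Expand each table row into a seed plus pipeline keyword arguments."""
    jobs = []
    for prompt_id, size, seed in PROMPT_SET:
        # Assumption: Size is written as width x height.
        width, height = (int(v) for v in size.split("x"))
        jobs.append({
            "prompt_id": prompt_id,
            "seed": seed,
            "kwargs": {"width": width, "height": height, **SHARED},
        })
    return jobs


jobs = build_jobs()
for job in jobs:
    print(job["prompt_id"], job["seed"], job["kwargs"])
```

For each job, the seed would typically be applied with `torch.Generator(device="cuda").manual_seed(seed)` and passed as `generator=` alongside the prompt text and `**job["kwargs"]`, assuming the pipeline accepts a `generator` argument as most Diffusers pipelines do.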
## Notes

- The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
- `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
- Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
- This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.