---
pipeline_tag: text-to-image
library_name: diffusers
tags:
- sdxl
- quantization
- svdquant
- nunchaku
- fp4
- int4
base_model: tonera/dvine_v70
base_model_relation: quantized
license: apache-2.0
---

# Model Card (SVDQuant)

> **Language**: English | [中文](README_CN.md)

## Model Name

- **Model repo**: `tonera/dvine_v70`
- **Base (Diffusers weights path)**: `tonera/dvine_v70` (repo root)
- **Quantized UNet weights**: `tonera/dvine_v70/svdq-<precision>_r32-dvine_v70.safetensors`

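The `<precision>` placeholder in the weights filename stands for the quantization format. As a rough illustration, the path could be assembled as below. This sketch assumes the two formats are spelled `int4` and `fp4` (matching this card's tags) and that the rank suffix stays `r32`; `weights_path` is a hypothetical helper, not part of Nunchaku.

```python
def weights_path(model: str, precision: str, rank: int = 32, owner: str = "tonera") -> str:
    """Build '<owner>/<model>/svdq-<precision>_r<rank>-<model>.safetensors'."""
    # Assumption: only these two spellings are used in the repository.
    if precision not in ("int4", "fp4"):
        raise ValueError(f"unsupported precision: {precision!r}")
    return f"{owner}/{model}/svdq-{precision}_r{rank}-{model}.safetensors"


print(weights_path("dvine_v70", "int4"))
print(weights_path("dvine_v70", "fp4"))
```

In the usage example further down, the precision string comes from `nunchaku.utils.get_precision()` rather than being hard-coded.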
## Quantization / Inference Tech

- **Inference engine**: Nunchaku (`https://github.com/nunchaku-ai/nunchaku`)

Nunchaku is a high-performance inference engine for **4-bit (FP4/INT4) neural networks**. Its goal is to reduce VRAM usage and speed up inference while preserving generation quality as much as possible. It implements post-training quantization methods such as **SVDQuant**, and reduces the overhead introduced by the low-rank branches through operator/kernel fusion and other optimizations.

The SDXL quantized weights in this repository (e.g. `svdq-*_r32-*.safetensors`) are intended to be used with Nunchaku for efficient inference on supported GPUs.

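To give a feel for why 4-bit quantization needs extra care, here is a minimal pure-Python sketch of plain symmetric INT4 quantization. It is neither Nunchaku's kernel nor the SVDQuant algorithm; it only illustrates the outlier problem that SVDQuant addresses by moving outliers into a higher-precision low-rank branch. The function names and sample weights are made up for illustration.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: returns (integer codes in [-8, 7], scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.12, -0.07, 0.31, -0.26, 0.05, 0.19]
with_outlier = weights + [6.0]  # one large value stretches the shared scale

errors = {}
for name, w in (("no outlier", weights), ("with outlier", with_outlier)):
    codes, scale = quantize_int4(w)
    restored = dequantize(codes, scale)
    # Compare only the six small weights so both cases measure the same values.
    errors[name] = mean_abs_error(w[:6], restored[:6])
    print(f"{name}: scale={scale:.4f} error on small weights={errors[name]:.4f}")
```

With the outlier present, the shared scale becomes so coarse that the small weights all collapse to zero, which is the kind of loss a separate branch for outliers is meant to avoid.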
## Quantization Quality (fp8)

```text
PSNR: mean=19.3156 p50=18.0907 p90=24.8075 best=28.2158 worst=14.2874 (N=25)
SSIM: mean=0.787972 p50=0.782514 p90=0.896604 best=0.908375 worst=0.652052 (N=25)
LPIPS: mean=0.27435 p50=0.240241 p90=0.453806 best=0.0970999 worst=0.524179 (N=25)
```

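The card does not state how these metrics were computed. Presumably each one compares an image from the quantized UNet with the image the unquantized model produces for the same prompt and seed; that is an assumption. Higher PSNR and SSIM mean closer images, while lower LPIPS means closer images. As a reference for the simplest of the three, PSNR over 8-bit pixel values can be sketched as follows (the pixel data is invented):

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Flattened toy "images"; real inputs would be full-resolution pixel arrays.
reference = [52, 55, 61, 66, 70, 61, 64, 73]
candidate = [54, 53, 60, 69, 70, 58, 66, 72]
print(f"PSNR = {psnr(reference, candidate):.2f} dB")
```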
## Performance

Below is the inference performance comparison (Diffusers vs Nunchaku-UNet).

- **Inference config**: `bf16 / steps=30 / guidance_scale=5.0`
- **Resolutions (5 images each, batch=5)**: `1024x1024`, `1024x768`, `768x1024`, `832x1216`, `1216x832`
- **Software versions**: `torch 2.9` / `cuda 12.8` / `nunchaku 1.1.0+torch2.9` / `diffusers 0.37.0.dev0`
- **Optimization switches**: no `torch.compile`, no explicit `cudnn` tuning flags

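The measurement script is not part of this card. As a rough sketch of how per-image latency under a configuration like the one above could be timed (untimed warmup, then consecutive timed runs), something along these lines would work. `benchmark` and `fake_generate` are hypothetical names, and the workload is a stand-in rather than a real pipeline call.

```python
import time


def benchmark(generate, warmup=1, runs=5):
    """Time `runs` consecutive calls to `generate` after `warmup` untimed calls."""
    for _ in range(warmup):
        generate()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    total = sum(timings)
    return {"total": total, "avg": total / runs}


# Stand-in workload. With a real pipeline, `generate` would run one image and
# synchronize the GPU (e.g. torch.cuda.synchronize()) before returning, so the
# timer does not stop while kernels are still in flight.
def fake_generate():
    time.sleep(0.01)


result = benchmark(fake_generate)
print(f"total={result['total']:.3f}s avg={result['avg']:.3f}s")
```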
### Cold-start performance (end-to-end for the first image)

| GPU | Metric | Diffusers | Nunchaku | Speedup | Gain |
|-----|--------|-----------|----------|---------|------|
| RTX 5090 | load | 3.505s | 3.432s | 1.02x | +2.1% |
| RTX 5090 | cold_infer | 2.944s | 2.447s | 1.20x | +16.9% |
| RTX 5090 | cold_e2e | 6.449s | 5.880s | 1.10x | +8.8% |
| RTX 3090 | load | 3.787s | 3.442s | 1.10x | +9.1% |
| RTX 3090 | cold_infer | 7.503s | 5.231s | 1.43x | +30.3% |
| RTX 3090 | cold_e2e | 11.290s | 8.673s | 1.30x | +23.2% |

### Steady-state performance (5 consecutive images after warmup)

| GPU | Metric | Diffusers | Nunchaku | Speedup | Gain |
|-----|--------|-----------|----------|---------|------|
| RTX 5090 | total (5 images) | 12.937s | 9.813s | 1.32x | +24.2% |
| RTX 5090 | avg (per image) | 2.587s | 1.963s | 1.32x | +24.2% |
| RTX 3090 | total (5 images) | 33.413s | 22.975s | 1.45x | +31.2% |
| RTX 3090 | avg (per image) | 6.683s | 4.595s | 1.45x | +31.2% |

**Notes**:
- Load times are close on both GPUs (Nunchaku is slightly faster in these runs), so most of the cold-start gain comes from inference rather than loading.
- During inference (cold_infer and steady-state), Nunchaku shows clear speedups on both GPUs.

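The Speedup and Gain columns appear to follow the usual definitions: speedup is the Diffusers time divided by the Nunchaku time, and gain is the fraction of the Diffusers time that was saved. This is inferred from the numbers rather than stated by the author. A short check against three rows of the tables:

```python
def speedup(baseline_s, optimized_s):
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def gain_percent(baseline_s, optimized_s):
    """Share of the baseline time that was saved, in percent."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


# (Diffusers seconds, Nunchaku seconds), copied from the tables above.
rows = {
    "RTX 5090 cold_infer": (2.944, 2.447),
    "RTX 3090 cold_infer": (7.503, 5.231),
    "RTX 3090 total (5 images)": (33.413, 22.975),
}
for name, (diffusers_s, nunchaku_s) in rows.items():
    print(
        f"{name}: {speedup(diffusers_s, nunchaku_s):.2f}x, "
        f"+{gain_percent(diffusers_s, nunchaku_s):.1f}%"
    )
```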
## Nunchaku Installation Required

- **Official installation docs** (recommended source of truth): `https://nunchaku.tech/docs/nunchaku/installation/installation.html`

### (Recommended) Install the official prebuilt wheel

- **Prerequisite**: `PyTorch >= 2.5` (follow the wheel requirements)
- **Install Nunchaku wheel**: choose a wheel matching your torch/cuda/python versions from GitHub Releases / HuggingFace / ModelScope (note `cp311` means Python 3.11):
  - `https://github.com/nunchaku-ai/nunchaku/releases`

```bash
# Example (select the correct wheel URL for your torch/cuda/python versions)
pip install https://github.com/nunchaku-ai/nunchaku/releases/download/vX.Y.Z/nunchaku-X.Y.Z+torch2.9-cp311-cp311-linux_x86_64.whl
```

- **Tip (RTX 50 series)**: typically prefer `CUDA >= 12.8`, and prefer FP4 models for compatibility/performance (follow official docs).

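To pick the matching wheel, it helps to print the values that appear in the wheel filename. The snippet below is a small sketch using only standard attributes (`sys.version_info`, `platform`, and, if PyTorch is installed, `torch.__version__` and `torch.version.cuda`). The helper names are hypothetical, and the `cp` tag applies to CPython only.

```python
import platform
import sys


def python_wheel_tag():
    """CPython tag used in wheel filenames, e.g. 'cp311' for Python 3.11."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def torch_versions():
    """Return (torch version, CUDA version) or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__, torch.version.cuda


print("python tag:", python_wheel_tag())
print("platform:  ", platform.system().lower(), platform.machine())
print("torch/cuda:", torch_versions() or "PyTorch not installed")
```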
## Usage Example (Diffusers + Nunchaku UNet)

```python
import torch
from diffusers import StableDiffusionXLPipeline

from nunchaku.models.unets.unet_sdxl import NunchakuSDXLUNet2DConditionModel
from nunchaku.utils import get_precision

MODEL = "dvine_v70"  # model name; also appears in the quantized weights filename
REPO_ID = f"tonera/{MODEL}"

if __name__ == "__main__":
    # get_precision() supplies the precision part of the filename for the current GPU
    unet = NunchakuSDXLUNet2DConditionModel.from_pretrained(
        f"{REPO_ID}/svdq-{get_precision()}_r32-{MODEL}.safetensors"
    )

    # Load the base SDXL pipeline and swap in the quantized UNet
    pipe = StableDiffusionXLPipeline.from_pretrained(
        REPO_ID,
        unet=unet,
        torch_dtype=torch.bfloat16,
        use_safetensors=True,
    ).to("cuda")

    prompt = "Make Pikachu hold a sign that says 'Nunchaku is awesome', yarn art style, detailed, vibrant colors"
    image = pipe(prompt=prompt, guidance_scale=5.0, num_inference_steps=30).images[0]
    image.save("sdxl.png")
```