	---
	pipeline_tag: text-to-image
	library_name: diffusers
	tags:
	- sdxl
	- quantization
	- svdquant
	- nunchaku
	- fp4
	- int4
	base_model: tonera/oneObsession_v18
	base_model_relation: quantized
	license: apache-2.0
	---

	# Model Card (SVDQuant)

	> Language: English \| [中文](README_CN.md)

	## Model Name

	- Model repo: `tonera/oneObsession_v18`
	- Base (Diffusers weights path): `tonera/oneObsession_v18` (repo root)
	- Quantized UNet weights: `tonera/oneObsession_v18/svdq-<precision>_r32-oneObsession_v18.safetensors`
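
The `<precision>` placeholder above can be filled in programmatically. The following is a minimal sketch, assuming the repository ships one weights file per precision listed in this card's tags (`int4`, `fp4`); the helper name `quantized_unet_path` is hypothetical and not part of Nunchaku.

```python
# Hypothetical helper for illustration; not part of Nunchaku or Diffusers.
REPO_ID = "tonera/oneObsession_v18"
MODEL = "oneObsession_v18"
# Assumption: the precisions are the ones named in this card's tags.
SUPPORTED_PRECISIONS = ("int4", "fp4")


def quantized_unet_path(precision: str) -> str:
    """Build the repo-relative path of the rank-32 SVDQuant UNet weights."""
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(
            f"unsupported precision {precision!r}; expected one of {SUPPORTED_PRECISIONS}"
        )
    return f"{REPO_ID}/svdq-{precision}_r32-{MODEL}.safetensors"


print(quantized_unet_path("int4"))
```

The usage example later in this card builds the same string inline, with `get_precision()` supplying the precision.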

	## Quantization / Inference Tech

	- Inference engine: Nunchaku (`https://github.com/nunchaku-ai/nunchaku`)

	Nunchaku is a high-performance inference engine for 4-bit (FP4/INT4) quantized neural networks. It aims to cut VRAM usage and speed up inference while keeping generation quality close to that of the original model. It implements post-training quantization methods such as SVDQuant and uses operator/kernel fusion, among other optimizations, to reduce the overhead introduced by the low-rank branches.

	The SDXL quantized weights in this repository (e.g. `svdq-<precision>_r32-oneObsession_v18.safetensors`) are intended to be used with Nunchaku for efficient inference on supported GPUs.

	## Quantization Quality (fp8)

	```text
	PSNR: mean=17.6924 p50=17.4895 p90=20.9097 best=23.9327 worst=11.4063 (N=25)
	SSIM: mean=0.726276 p50=0.734118 p90=0.834601 best=0.860543 worst=0.550507 (N=25)
	LPIPS: mean=0.323782 p50=0.261115 p90=0.492602 best=0.124099 worst=0.533022 (N=25)
	```
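
For readers unfamiliar with the first metric, PSNR is derived from the mean squared error between a reference image and a candidate image. The evaluation script behind the numbers above is not part of this card, so the following is only a minimal sketch of the standard formula, assuming 8-bit pixel values (peak 255) and images flattened into plain sequences; in practice NumPy arrays would be the usual choice.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(values: Sequence[float], higher_is_better: bool = True) -> dict:
    """Mean, best, and worst of a list of per-image scores."""
    best = max(values) if higher_is_better else min(values)
    worst = min(values) if higher_is_better else max(values)
    return {"mean": sum(values) / len(values), "best": best, "worst": worst}
```

Higher PSNR and SSIM mean closer agreement, while lower LPIPS is better, which is why `best` for LPIPS above is the smallest value; pass `higher_is_better=False` for that metric.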

	## Performance

	Below is the inference performance comparison (Diffusers vs Nunchaku-UNet).

	- Inference config: `bf16 / steps=30 / guidance_scale=5.0`
	- Resolutions (5 images each, batch=5): `1024x1024`, `1024x768`, `768x1024`, `832x1216`, `1216x832`
	- Software versions: `torch 2.9` / `cuda 12.8` / `nunchaku 1.1.0+torch2.9` / `diffusers 0.37.0.dev0`
	- Optimization switches: no `torch.compile`, no explicit `cudnn` tuning flags

	### Cold-start performance (end-to-end for the first image)

	| GPU | Metric | Diffusers | Nunchaku | Speedup | Gain |
	|-----|--------|-----------|----------|---------|------|
	| RTX 5090 | load | 3.505s | 3.432s | 1.02x | +2.1% |
	| RTX 5090 | cold_infer | 2.944s | 2.447s | 1.20x | +16.9% |
	| RTX 5090 | cold_e2e | 6.449s | 5.880s | 1.10x | +8.8% |
	| RTX 3090 | load | 3.787s | 3.442s | 1.10x | +9.1% |
	| RTX 3090 | cold_infer | 7.503s | 5.231s | 1.43x | +30.3% |
	| RTX 3090 | cold_e2e | 11.290s | 8.673s | 1.30x | +23.2% |

	### Steady-state performance (5 consecutive images after warmup)

	| GPU | Metric | Diffusers | Nunchaku | Speedup | Gain |
	|-----|--------|-----------|----------|---------|------|
	| RTX 5090 | total (5 images) | 12.937s | 9.813s | 1.32x | +24.2% |
	| RTX 5090 | avg (per image) | 2.587s | 1.963s | 1.32x | +24.2% |
	| RTX 3090 | total (5 images) | 33.413s | 22.975s | 1.45x | +31.2% |
	| RTX 3090 | avg (per image) | 6.683s | 4.595s | 1.45x | +31.2% |

	Notes:
	- Load time includes extra one-time processing of the quantized weights; in these runs Nunchaku still loaded slightly faster than Diffusers on both GPUs (1.02x on RTX 5090, 1.10x on RTX 3090).
	- During inference (cold_infer and steady-state), Nunchaku is 1.20x to 1.45x faster than Diffusers across both GPUs.
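
The card does not define the Speedup and Gain columns. The values are consistent with speedup being the Diffusers time divided by the Nunchaku time, and gain being the fraction of the Diffusers time saved. The sketch below makes that inferred relationship explicit using the RTX 5090 `cold_infer` row; the function names are illustrative.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def gain_percent(baseline_s: float, optimized_s: float) -> float:
    """Share of the baseline time that was saved, in percent."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


# RTX 5090, cold_infer: Diffusers 2.944s vs Nunchaku 2.447s
print(f"{speedup(2.944, 2.447):.2f}x")        # 1.20x
print(f"{gain_percent(2.944, 2.447):+.1f}%")  # +16.9%
```

Recomputing from the rounded timings can differ from the published figure in the last digit for some rows, presumably because the table was produced from unrounded measurements.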

	## Nunchaku Installation Required

	- Official installation docs (recommended source of truth): `https://nunchaku.tech/docs/nunchaku/installation/installation.html`

	### (Recommended) Install the official prebuilt wheel

	- Prerequisite: `PyTorch >= 2.5` (follow the wheel requirements)
	- Install Nunchaku wheel: choose a wheel matching your torch/cuda/python versions from GitHub Releases / HuggingFace / ModelScope (note `cp311` means Python 3.11):
	- `https://github.com/nunchaku-ai/nunchaku/releases`

	```bash
	# Example (select the correct wheel URL for your torch/cuda/python versions)
	pip install https://github.com/nunchaku-ai/nunchaku/releases/download/vX.Y.Z/nunchaku-X.Y.Z+torch2.9-cp311-cp311-linux_x86_64.whl
	```

	- Tip (RTX 50 series): typically prefer `CUDA >= 12.8`, and prefer FP4 models for compatibility/performance (follow official docs).
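
Choosing a wheel means matching three things: the Python tag (such as `cp311`), the PyTorch version, and the CUDA version. As a convenience, the short script below prints what the current environment reports. It is only a sketch: the helper names are illustrative, and the official installation docs remain the reference for which combinations are supported.

```python
import platform
import sys


def python_wheel_tag() -> str:
    """CPython wheel tag for the running interpreter, e.g. 'cp311' for Python 3.11."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def torch_versions():
    """Return (torch version, CUDA version torch was built with), or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return torch.__version__, torch.version.cuda


print("python tag:", python_wheel_tag())
print("platform:  ", platform.system().lower(), platform.machine())
versions = torch_versions()
if versions is None:
    print("torch:      not installed (install PyTorch first)")
else:
    print("torch:     ", versions[0])
    print("cuda:      ", versions[1])
```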

	## Usage Example (Diffusers + Nunchaku UNet)

	```python
	import torch
	from diffusers import StableDiffusionXLPipeline

	from nunchaku.models.unets.unet_sdxl import NunchakuSDXLUNet2DConditionModel
	from nunchaku.utils import get_precision

	MODEL = "oneObsession_v18"
	REPO_ID = f"tonera/{MODEL}"

	if __name__ == "__main__":
	    # Load the quantized UNet. get_precision() returns the precision
	    # string ("int4" or "fp4") that matches the current GPU.
	    unet = NunchakuSDXLUNet2DConditionModel.from_pretrained(
	        f"{REPO_ID}/svdq-{get_precision()}_r32-{MODEL}.safetensors"
	    )

	    # Build the SDXL pipeline from the repo root, swapping in the quantized UNet.
	    pipe = StableDiffusionXLPipeline.from_pretrained(
	        REPO_ID,
	        unet=unet,
	        torch_dtype=torch.bfloat16,
	        use_safetensors=True,
	    ).to("cuda")

	    prompt = "Make Pikachu hold a sign that says 'Nunchaku is awesome', yarn art style, detailed, vibrant colors"
	    image = pipe(prompt=prompt, guidance_scale=5.0, num_inference_steps=30).images[0]
	    image.save("sdxl.png")
	```
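
To take timings similar in spirit to the Performance section, the first call can be measured separately from the calls that follow it. The benchmark script used for this card is not included, so the harness below is a sketch under stated assumptions: the first (cold) call doubles as the warmup, and `time_cold_and_steady` is an illustrative name. It accepts any zero-argument callable, so it can be tried without a GPU.

```python
import time
from typing import Callable, List, Optional, Tuple


def time_cold_and_steady(
    generate: Callable[[], object],
    runs: int = 5,
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[float, List[float]]:
    """Time one cold call, then `runs` consecutive warmed-up calls (seconds)."""

    def timed_call() -> float:
        if sync is not None:
            sync()  # make sure earlier GPU work has finished before starting the clock
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for asynchronous GPU work before stopping the clock
        return time.perf_counter() - start

    cold = timed_call()  # assumption: this first call also serves as the warmup
    steady = [timed_call() for _ in range(runs)]
    return cold, steady


# Example with a stand-in workload; with the pipeline above one would pass
# generate=lambda: pipe(prompt=prompt, guidance_scale=5.0, num_inference_steps=30)
# and sync=torch.cuda.synchronize.
cold_s, steady_s = time_cold_and_steady(lambda: sum(range(100_000)), runs=5)
print(f"cold: {cold_s:.4f}s, steady avg: {sum(steady_s) / len(steady_s):.4f}s")
```

Passing `torch.cuda.synchronize` as `sync` matters on GPU because CUDA kernels launch asynchronously; without it the measured time can undercount the actual work.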