---
license: openrail++
library_name: diffusers
pipeline_tag: text-to-image
base_model: stabilityai/sd-turbo
tags:
- diffusion
- text-to-image
- sd-turbo
- quantization
- pruning
- distillation
- edge-ai
- mixed-precision
---
# EdgeDiffusion
Edge-deployable SD-Turbo via multi-stage compression: structural pruning → distillation → sensitivity-aware mixed-precision quantization (GPTQ) → QLoRA recovery.
**Code & paper-style writeup**: [github.com/SeanHe727/EdgeDiffusion](https://github.com/SeanHe727/EdgeDiffusion)
---
## What's in this repo
| File / dir | What it is |
|---|---|
| `unet/` | Mixed-precision quantized UNet (GPTQ-applied). 152 of 192 Linear layers quantized to INT4 (45) / INT8 (107); the rest stay fp16. Fake-quantized: values rounded to int grid, stored as bf16. |
| `text_encoder/`, `vae/`, `tokenizer/`, `scheduler/`, `model_index.json` | Standard `stabilityai/sd-turbo` components, unmodified |
| `lora_adapter.pt` | (Optional) QLoRA recovery adapter trained on top of the quantized UNet. Reduces LPIPS vs the fp16 baseline by ~8 % when applied. See "Advanced: QLoRA recovery adapter" below. |
| `mp_quant_metadata.json` | Per-layer bit-width assignment + GPTQ hyper-parameters for full reproducibility |
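
To make "fake-quantized" concrete, here is a minimal sketch of rounding weights onto an integer grid while keeping them as floats. It assumes symmetric, per-tensor, round-to-nearest quantization; the UNet in this repo was produced with GPTQ, which chooses the rounding differently, and its actual grid settings are recorded in `mp_quant_metadata.json`. The function name `fake_quantize` is illustrative only.

```python
def fake_quantize(weights, bits):
    """Snap floats onto a signed integer grid, then map them back to floats.

    Assumption: one symmetric scale for the whole tensor (round-to-nearest).
    This is NOT GPTQ; it only illustrates what "fake-quant" storage means.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for INT4, 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)                    # nearest integer level
        q = max(-qmax - 1, min(qmax, q))        # clamp to the signed range
        dequantized.append(q * scale)           # stored as a float again
    return dequantized, scale


weights = [1.4, -0.6, 0.25, 0.0]
int4_like, scale = fake_quantize(weights, bits=4)
print(int4_like, scale)
```

Every returned value is an integer multiple of `scale`, so the tensor carries at most 16 distinct values for INT4, yet each value still occupies a full float slot on disk.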
---
## Quick start
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "ChenHe727/EdgeDiffusion",
    torch_dtype=torch.bfloat16,  # required: INT4 layers use bf16 dtype
)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a tabby cat sitting on a wooden chair, sharp focus",
    num_inference_steps=2,  # 2-step is the sweet spot for SD-Turbo derivatives
    guidance_scale=0.0,     # SD-Turbo doesn't use CFG
).images[0]
image.save("output.png")
```
### Why 2 inference steps?
SD-Turbo is trained with **adversarial diffusion distillation** for 1-step generation. Empirically, 2 steps give the best quality/speed trade-off for our compressed model: 28 % faster than 4 steps with marginally better LPIPS.
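
To check the step-count trade-off on your own hardware, a small timing harness along these lines may help. It is a sketch, not the benchmark script used for this card: `median_latency` and `percent_faster` are hypothetical helper names, and "X % faster" is taken here to mean relative latency reduction.

```python
import statistics
import time


def median_latency(fn, warmup=2, iters=5):
    """Median wall-clock seconds of fn() after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_faster(t_fast, t_slow):
    """Relative latency reduction of t_fast versus t_slow, in percent."""
    return 100.0 * (t_slow - t_fast) / t_slow


# Usage with the Quick start pipeline (needs a GPU, so it is left commented out):
# prompt = "a photo of a tabby cat sitting on a wooden chair, sharp focus"
# t2 = median_latency(lambda: pipe(prompt, num_inference_steps=2, guidance_scale=0.0))
# t4 = median_latency(lambda: pipe(prompt, num_inference_steps=4, guidance_scale=0.0))
# print(f"2-step is {percent_faster(t2, t4):.0f} % faster than 4-step")
```

Warm-up calls matter on GPUs because the first invocations include kernel compilation and memory allocation that would otherwise inflate the measurement.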
---
## Results
Benchmark on RTX 5070 (Blackwell), 512 × 512, 2-step inference:
| Variant | Params | Latency | VRAM | LPIPS vs original SD-Turbo | LPIPS vs fp16 baseline |
|---|---:|---:|---:|---:|---:|
| stabilityai/sd-turbo (original) | 860 M | 0.146 s | 3.05 GB | 0 | 0.278 |
| fp16 baseline (pruned + distilled) | 642 M | 0.142 s | 2.64 GB | 0.278 | 0 |
| **this repo (mp_quant PTQ)** | 642 M | 0.145 s | 2.64 GB | 0.277 | 0.062 |
| with LoRA adapter loaded | 642 M + 9 MB | 0.171 s | 2.65 GB | 0.278 | **0.057** |
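**Key takeaway**: mixed-precision quantization adds almost **no perceptual cost** on top of the pruned + distilled baseline: LPIPS vs original SD-Turbo is unchanged (0.277 vs 0.278), and the quantized model sits only 0.062 LPIPS from the fp16 baseline. The dominant quality cost in the pipeline is the pruning + distillation stage; quantization is close to "free".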
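
The relative figures quoted in this card can be recomputed from the table. The snippet below only does that arithmetic on the published numbers; `pct_change` is a helper introduced here for illustration.

```python
def pct_change(new, old):
    """Signed relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Values copied from the Results table above.
param_reduction = -pct_change(642, 860)       # pruned UNet vs original, in M params
lpips_recovery = -pct_change(0.057, 0.062)    # with LoRA vs PTQ only, LPIPS vs fp16
lora_latency_cost = pct_change(0.171, 0.145)  # with LoRA vs PTQ only, seconds

print(f"parameter reduction:  {param_reduction:.1f} %")    # 25.3 %
print(f"LPIPS gap recovered:  {lpips_recovery:.1f} %")     # 8.1 %
print(f"LoRA latency overhead: {lora_latency_cost:.1f} %")  # 17.9 %
```

Read together, the adapter buys roughly 8 % of the remaining LPIPS gap at the price of roughly 18 % more latency when it is applied as an unmerged adapter.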
---
## Advanced: QLoRA recovery adapter
The included `lora_adapter.pt` was trained for 500 steps with step-wise teacher-student distillation to recover residual PTQ quality loss. It reduces the LPIPS gap from 0.062 to 0.057 (~8 % improvement).
```python
import json

import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download
from peft import LoraConfig, get_peft_model

# Load base pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "ChenHe727/EdgeDiffusion", torch_dtype=torch.bfloat16,
).to("cuda")

# Discover which layers were quantized (LoRA targets these)
meta_path = hf_hub_download("ChenHe727/EdgeDiffusion", "mp_quant_metadata.json")
with open(meta_path) as f:
    meta = json.load(f)
target_fqns = [
    fqn for fqn, bit in meta["quantization"]["assignment"].items() if bit != "fp16"
]

# Re-attach LoRA structure and load adapter weights.
# weights_only=False unpickles arbitrary objects, so only load files you trust.
lora_state = torch.load(
    hf_hub_download("ChenHe727/EdgeDiffusion", "lora_adapter.pt"),
    weights_only=False, map_location="cuda",
)
sample_key = next(k for k in lora_state if "lora_A" in k)
rank = lora_state[sample_key].shape[0]  # lora_A weight has shape (rank, in_features)
pipe.unet = get_peft_model(pipe.unet, LoraConfig(
    r=rank, lora_alpha=rank * 2, target_modules=target_fqns,
    lora_dropout=0.0, bias="none",
))

# Copy adapter tensors into the matching LoRA parameters
own = pipe.unet.state_dict()
for k, v in lora_state.items():
    if k in own:
        own[k].copy_(v.to(own[k].device, dtype=own[k].dtype))
pipe.unet.eval()

# Generate as usual
image = pipe("a cat", num_inference_steps=2, guidance_scale=0.0).images[0]
```
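
The copy loop above skips any adapter key that is not present in the wrapped UNet's state dict, so a naming mismatch would leave the LoRA weights at their initial values without raising an error. A small check like the following can surface that; `split_adapter_keys` is a hypothetical helper, and the example key names are made up for illustration.

```python
def split_adapter_keys(adapter_keys, model_keys):
    """Partition adapter keys into those the model has and those it lacks."""
    model_keys = set(model_keys)
    matched = [k for k in adapter_keys if k in model_keys]
    skipped = [k for k in adapter_keys if k not in model_keys]
    return matched, skipped


# Stand-in key names; real ones come from lora_state and pipe.unet.state_dict().
adapter = ["blk.to_q.lora_A.weight", "blk.to_q.lora_B.weight", "blk.to_k.lora_A.weight"]
model = ["blk.to_q.weight", "blk.to_q.lora_A.weight", "blk.to_q.lora_B.weight"]

matched, skipped = split_adapter_keys(adapter, model)
print(f"{len(matched)} matched, {len(skipped)} skipped: {skipped}")
```

With the real objects, calling `split_adapter_keys(lora_state, own)` before the copy loop and confirming that `skipped` is empty is a cheap way to verify the adapter was actually applied.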
---
## Pipeline overview
The model in this repo is the output of a three-stage compression pipeline applied to `stabilityai/sd-turbo`:
```
stabilityai/sd-turbo (860 M)
    ↓ structural pruning + step-wise distillation
ChenHe727/EdgeDiffusion_distilled_feat_attn (642 M, fp16)
    ↓ sensitivity-aware mixed-precision GPTQ (this repo's UNet)
    ↓ QLoRA recovery training (this repo's lora_adapter.pt)
ChenHe727/EdgeDiffusion (this repo)
```
Full design rationale, ablations, and reproducibility instructions: see the [GitHub repo](https://github.com/SeanHe727/EdgeDiffusion).
---
## Limitations
- **Conv2d layers are not quantized in v1**: only `nn.Linear` (attention projections, FFN). Conv2d holds ~70 % of UNet parameters; full quantization is planned for v2.
- **Fake-quant storage**: weights are rounded to INT4/INT8 grids but stored as bf16 (2 bytes/value). Real packed INT4/INT8 storage would shrink the file from 1.22 GB to ~900 MB but requires a separate packing step.
- **LPIPS vs original SD-Turbo ≈ 0.28** mostly comes from the upstream pruning + distillation stage. The quantization stage itself adds only 0.005-0.062.
- **2-step inference is the recommended default.** 1-step works (faster) but quality drops noticeably; 4-step is slower and not better.
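
The gap between fake-quant and packed storage follows from simple bit counting. The sketch below estimates both sizes from a list of `(num_params, bits)` pairs; the layer sizes in the example are invented, and the estimate ignores per-group scales and zero-points, which a real packed format would also need to store.

```python
def packed_bytes(layers):
    """Bytes needed if each layer is stored at its assigned bit-width.

    layers: iterable of (num_params, bits) pairs. Quantization metadata
    (scales, zero-points) is deliberately ignored in this estimate.
    """
    total_bits = sum(num_params * bits for num_params, bits in layers)
    return (total_bits + 7) // 8  # round up to whole bytes


def fake_quant_bytes(layers, bytes_per_value=2):
    """Bytes needed when every value is kept as bf16, whatever its grid."""
    return sum(num_params for num_params, _ in layers) * bytes_per_value


# Hypothetical layer mix: one INT4 layer, one INT8 layer, one 16-bit layer.
example = [(1_000_000, 4), (2_000_000, 8), (500_000, 16)]
print(fake_quant_bytes(example), packed_bytes(example))
```

Layers left at 16 bits contribute the same size either way, which is why the projected saving for this repo is modest while Conv2d layers remain unquantized.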
---
## Acknowledgments
- **LD-Pruner** ([Castells et al. 2024](https://arxiv.org/abs/2404.11936)): sensitivity metric
- **GPTQ** ([Frantar et al. 2023](https://arxiv.org/abs/2210.17323)): Hessian-based PTQ (re-implemented from the paper in this repo)
- **QLoRA** ([Dettmers et al. 2023](https://arxiv.org/abs/2305.14314)): parameter-efficient recovery
- **SD-Turbo** ([Sauer et al. 2023](https://stability.ai/research/adversarial-diffusion-distillation)): base model