# SDXL-Lightning 4-step (ONNX, fp16, ORT-Web compatible)
ONNX export of ByteDance/SDXL-Lightning 4-step UNet merged into stabilityai/stable-diffusion-xl-base-1.0, with the VAE decoder replaced by madebyollin/sdxl-vae-fp16-fix for fp16 stability. Layout is a diffusers-style pipeline with per-subfolder ONNX files and bundled tokenizers, intended for in-browser inference via ONNX Runtime Web (WebGPU EP).
## Contents

| Subfolder | Size | Notes |
|---|---|---|
| `text_encoder/` | 236 MB | CLIP-L/14, fp16 |
| `text_encoder_2/` | 1.3 GB | CLIP-G/14 (OpenCLIP-ViT-bigG-14), fp16 |
| `unet/` | 4.8 GB | SDXL-Lightning 4-step UNet, fp16 |
| `vae_decoder/` | 95 MB | sdxl-vae-fp16-fix decoder, fp16 |
| `tokenizer/`, `tokenizer_2/` | small | Fast tokenizers (`tokenizer.json` present) |
| `scheduler/` | small | EulerDiscrete, `timestep_spacing="trailing"` |
| `model_index.json` | small | Diffusers pipeline manifest |
Total: ~6.5 GB.
## Recommended usage

Designed for 4 denoising steps with classifier-free guidance disabled (CFG = 1.0); guidance scales above 1 break Lightning's distilled outputs. The recommended scheduler is the bundled EulerDiscreteScheduler with `timestep_spacing="trailing"`.
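Both constraints can be checked numerically. This is a minimal sketch (pure NumPy, not the actual scheduler code) assuming the standard CFG combination formula and diffusers' documented trailing-spacing behavior:

```python
import numpy as np

# CFG combine: uncond + g * (cond - uncond). With g == 1.0 this collapses to
# exactly `cond`, so disabling guidance means the unconditional pass can be skipped.
cond = np.array([0.3, -0.7])
uncond = np.array([0.1, 0.2])
g = 1.0
combined = uncond + g * (cond - uncond)

# Trailing timestep spacing: step back from the last training timestep in equal
# strides, instead of spacing up from 0 ("leading"). For 4 steps over 1000
# training timesteps this lands on [999, 749, 499, 249].
num_train_timesteps, num_inference_steps = 1000, 4
step = num_train_timesteps / num_inference_steps
timesteps = (np.arange(num_train_timesteps, 0, -step).round() - 1).astype(int)
```

The trailing schedule matters for few-step models because it includes the highest-noise timestep, which leading spacing skips.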
The export targets ORT-Web's WebGPU EP. The CPU EP passes a local sanity check, but the WebGPU EP is the production target and supports a narrower op set. If a runtime op error appears in the browser, downgrade the model's opset or rebuild the export against a different toolchain version.
## Production notes

- Resize ops are kept at fp32, with casts auto-inserted at the boundary. onnxconverter-common's default block list catches most cases, but the `scales` Constant input had to be hand-patched back to fp32 after the converter failed to insert casts for Constant-produced inputs (2 in the UNet, 3 in the VAE).
- I/O dtypes are fp32 throughout (`keep_io_types=True`), so JavaScript callers can feed unconverted fp32 tensors and read fp32 outputs.
- The `vae_encoder/` from the original optimum export was dropped; Lightning is text-to-image only.
## Licenses

This is a derivative work combining three upstream sources, each with its own license. All three are permissive, but you should read them before commercial use.

- SDXL-Lightning UNet: ByteDance/SDXL-Lightning is licensed under CreativeML Open RAIL++-M.
- SDXL base-1.0 (everything except the UNet weights): stabilityai/stable-diffusion-xl-base-1.0 is licensed under CreativeML Open RAIL++-M.
- VAE decoder: madebyollin/sdxl-vae-fp16-fix is MIT-licensed.
The combined work is released under CreativeML Open RAIL++-M (the more restrictive of the upstream licenses).
## How it was built

Reproduction recipe (CPU-only Windows box):

- Construct the SDXL UNet from the `stabilityai/stable-diffusion-xl-base-1.0` config and load `sdxl_lightning_4step_unet.safetensors` from `ByteDance/SDXL-Lightning` into it.
- Save the merged pipeline as a full diffusers pipeline.
- `optimum-cli export onnx --task stable-diffusion-xl --framework pt` → per-subfolder ONNX at fp32 (~13 GB).
- Convert the UNet and both text encoders to fp16 in place via `onnxconverter_common.float16.convert_float_to_float16` with a custom post-pass that reverts Resize-feeding Constants back to fp32.
- Replace the original VAE decoder with a fresh ONNX export of `madebyollin/sdxl-vae-fp16-fix`, fp16-converted with the same post-pass.
- Build fast tokenizers (`tokenizer.json`) from the slow-tokenizer files optimum-cli dropped, since transformers.js v3 has no slow-tokenizer fallback.
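The fast-tokenizer step amounts to serializing a `tokenizers`-library object to `tokenizer.json`. A hypothetical sketch with a toy BPE vocab (not the real CLIP vocabulary, and not the exact conversion path used):

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Illustrative vocab/merges; the real ones come from the slow-tokenizer files.
vocab = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}
merges = [("l", "o"), ("lo", "w")]
tok = Tokenizer(BPE(vocab=vocab, merges=merges))

# tokenizer.json is just this object serialized; transformers.js loads it directly.
spec = json.loads(tok.to_str())
enc = tok.encode("low")
```

In practice `tok.save("tokenizer.json")` writes the file; the point is that the fast format bundles vocab, merges, and pipeline config into one JSON blob instead of the separate slow-tokenizer files.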
Built with `torch==2.4.1+cpu`, `optimum[exporters]==1.23.3`, `transformers==4.45.2`, `diffusers==0.30.3`, `onnx==1.17.0`, `onnxruntime==1.20.1`, `onnxconverter-common==1.14.0`.