# Stable-Fast Compilation
## Overview
Stable-Fast is a JIT compilation framework that optimizes Stable Diffusion UNet models by tracing execution, fusing operators, and optionally capturing CUDA graphs. It can provide a significant speedup for SD1.5/SDXL batch workflows with no quality loss.
Unlike runtime attention optimizations (SageAttention, SpargeAttn), Stable-Fast performs a **one-time, whole-graph compilation** on the first inference pass. The compiled model is cached and reused for subsequent generations with compatible shapes.
## How It Works
Stable-Fast applies three optimization layers:
### 1. TorchScript Tracing
The first forward pass through the UNet is recorded into a static computational graph:
```python
traced_model = torch.jit.trace(unet, example_inputs)
```
This eliminates Python interpreter overhead and enables downstream graph optimizations.
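The tracing step can be illustrated on a tiny stand-in module (not the real UNet; module and tensor shapes here are purely illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for one UNet block: tracing records a single forward pass
# into a static graph, removing per-op Python dispatch overhead.
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyBlock().eval()
example = torch.randn(1, 4, 8, 8)          # the "example_inputs"
traced = torch.jit.trace(model, example)   # records the graph

# The traced module is numerically identical for same-shaped inputs.
with torch.no_grad():
    assert torch.allclose(model(example), traced(example))
```

Note that tracing bakes in the control flow taken on the example input, which is why dynamic shapes and data-dependent branches cause recompilation or trace failures (see Troubleshooting below).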
### 2. Operator Fusion
The traced graph undergoes pattern-based fusion:
- **Conv + BatchNorm fusion**: Merges normalization into convolution weights
- **Activation fusion**: Fuses ReLU/GELU/SiLU directly into linear/conv ops
- **Memory layout optimization**: Converts to channels-last format for faster conv execution
- **Triton kernels**: Replaces PyTorch ops with hand-tuned Triton implementations (if `enable_triton=True`)
Example fusion:
```python
# Before:
x = conv(input)
x = batch_norm(x)
x = relu(x)
# After:
x = fused_conv_bn_relu(input) # Single kernel launch
```
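The Conv + BatchNorm part of this folding can be verified with scalar arithmetic (one output channel, illustrative values — the real fusion does the same thing per channel on the weight tensors):

```python
import math

# BN after a conv: y = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
# Folding rewrites this as a single affine op: y = w_fused*x + b_fused.
w, b = 0.5, 0.1            # conv weight and bias (one channel)
gamma, beta = 1.2, -0.3    # BN scale and shift
mean, var, eps = 0.05, 0.8, 1e-5

scale = gamma / math.sqrt(var + eps)
w_fused = w * scale
b_fused = (b - mean) * scale + beta

x = 2.0
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta
fused = w_fused * x + b_fused
assert abs(unfused - fused) < 1e-9  # identical result, one op instead of two
```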
### 3. CUDA Graph Capture (Optional)
When `enable_cuda_graph=True`, the entire forward pass is captured as a static CUDA graph:
- Kernel launches are recorded once and replayed on subsequent runs
- Eliminates CPU launch overhead (~10-15% speedup)
- Requires fixed input shapes and batch sizes
**Trade-off:** Higher VRAM usage (~500MB for graph buffers) and less flexibility.
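Under the hood this uses PyTorch's CUDA graph API. A minimal hedged sketch of the capture/replay pattern (only runs on a CUDA machine; the module and shapes are illustrative, and the static input buffer must be reused by `copy_` rather than replaced — this is where the fixed-shape requirement comes from):

```python
import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(16, 16).cuda().eval()
    static_in = torch.randn(1, 16, device="cuda")

    # Warm up on a side stream before capture, per the PyTorch docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record the forward pass once into a graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    # Replay: copy new data into the SAME buffer, then replay.
    # No per-kernel launches happen from Python here.
    static_in.copy_(torch.randn(1, 16, device="cuda"))
    g.replay()  # static_out now holds the new result
```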
## Installation
### Windows/Linux (Manual)
Follow the [official guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation):
```bash
# Install from PyPI (recommended)
pip install stable-fast
# Or build from source for latest features
git clone https://github.com/chengzeyi/stable-fast
cd stable-fast
pip install -e .
```
**Prerequisites:**
- PyTorch 2.0+ with CUDA support
- xformers (optional but recommended)
- Triton (optional for Triton kernel fusion)
### Docker
Stable-Fast is included in the Docker image when `INSTALL_STABLE_FAST=1`:
```bash
docker-compose build --build-arg INSTALL_STABLE_FAST=1
```
Default is `0` (disabled) to reduce image size and build time.
## Usage
### Streamlit UI
Enable in the **Performance** section of the sidebar:
1. Check **Stable Fast**
2. Generate images — the first run compiles the model (30-60s delay)
3. Subsequent generations reuse the cached compiled model
**Visual indicator:** The first generation shows "Compiling model..." in the progress bar.
### REST API
Pass `stable_fast: true` in the request payload:
```bash
curl -X POST http://localhost:7861/api/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "a peaceful garden with cherry blossoms",
"width": 768,
"height": 512,
"num_images": 1,
"stable_fast": true
}'
```
### Configuration
Stable-Fast behavior is controlled by `CompilationConfig`:
```python
import torch
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig
config = CompilationConfig.Default()
config.enable_xformers = True # Use xformers attention
config.enable_cuda_graph = False # CUDA graphs (set True for max speed)
config.enable_jit_freeze = True # Freeze traced graph
config.enable_cnn_optimization = True # Conv fusion
config.enable_triton = False # Triton kernels (experimental)
config.memory_format = torch.channels_last # Optimize memory layout
```
LightDiffusion-Next uses sensible defaults (CUDA graphs disabled by default for flexibility). To override:
```python
# In src/StableFast/StableFast.py
def gen_stable_fast_config(enable_cuda_graph=False):
config = CompilationConfig.Default()
config.enable_cuda_graph = enable_cuda_graph # Pass True for max speed
# ... rest of config
```
## Performance
### Speedup Benchmarks
Stable-Fast provides speedup through:
- **JIT compilation**: Eliminates Python overhead
- **Operator fusion**: Reduces kernel launches
- **CUDA graphs** (optional): Further reduces CPU overhead
Speedup varies significantly based on:
- GPU architecture
- Batch size and generation count
- Model size (SD1.5 vs SDXL)
- Whether CUDA graphs are enabled
**Note:** Performance benefits are most noticeable for batch operations (50+ images). For single 20-step generations, compilation overhead may exceed speedup gains.
### Compilation Time
First-run compilation overhead:
- **SD1.5 UNet**: ~30s (traced once per resolution/batch size)
- **SDXL UNet**: ~60s (larger model)
- **Subsequent runs**: <1s (cached)
Cached compiled models persist in `~/.cache/torch_extensions/`. Clear this directory to force recompilation.
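The shape sensitivity can be sketched as a cache keyed by model and latent shape (hypothetical names — this is a simplified illustration of why a new resolution or batch size triggers a fresh compile, not LightDiffusion-Next's actual cache code):

```python
# Compiled artifacts are only valid for the traced (model, batch, latent shape).
compile_cache = {}

def get_compiled(model_id, batch, height, width):
    """Return a cached 'compiled graph' for this shape, compiling on miss."""
    key = (model_id, batch, height // 8, width // 8)  # SD latent shape
    if key not in compile_cache:
        compile_cache[key] = f"compiled:{key}"  # stands in for a traced graph
    return compile_cache[key]

a = get_compiled("sd15", 1, 512, 768)
b = get_compiled("sd15", 1, 512, 768)  # same shape -> cache hit, no recompile
c = get_compiled("sd15", 1, 768, 768)  # new resolution -> recompile
assert a is b and a != c
```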
## Stacking with Other Optimizations
Stable-Fast is **fully compatible** with SageAttention, SpargeAttn and WaveSpeed:
### Stable-Fast + SageAttention
```yaml
stable_fast: true
# SageAttention auto-detected
```
**Result:** speedups compose multiplicatively: ~1.70× (Stable-Fast) × ~1.15× (SageAttention) ≈ **2× total speedup**
### Stable-Fast + SpargeAttn
```yaml
stable_fast: true
# SpargeAttn auto-detected
```
**Result:** ~1.70× (Stable-Fast) × ~1.40× (SpargeAttn) ≈ **2.4× total speedup**
### Stable-Fast + SpargeAttn + DeepCache
```yaml
stable_fast: true
deepcache:
enabled: true
interval: 3
depth: 2
# SpargeAttn auto-detected
```
**Result:** ~1.70× (Stable-Fast) × ~1.40× (SpargeAttn) × 2-3× (DeepCache) ≈ **4-5× total speedup**
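As a sanity check on these figures, the combined speedup is the product of the individual factors, not their sum (the numbers below are the rough estimates quoted above, not measurements):

```python
# Independent speedups compose multiplicatively.
stable_fast = 1.70   # ~70% faster
sparge_attn = 1.40   # ~40% faster
deepcache   = 2.0    # low end of the quoted 2-3x range

total = stable_fast * sparge_attn * deepcache
print(f"~{total:.1f}x")  # ~4.8x, matching the ~4-5x figure above
```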
## Compatibility
### Compatible With
- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Img2Img (with fixed denoise strength)
- ✅ SageAttention/SpargeAttn
- ✅ WaveSpeed caching
### Not Compatible With
- ❌ Flux models (different architecture, no UNet)
- ❌ Dynamic resolution changes after compilation
- ❌ Dynamic batch size changes after compilation (with CUDA graphs)
- ⚠️ Frequent model switching (recompiles each time)
## Troubleshooting
### Slow First Run / Repeated Recompilation
**Symptom:** Every generation triggers compilation, even with identical settings.
**Causes:**
1. Cache directory not writable
2. System clock incorrect (invalidates timestamps)
3. Different model loaded (each model is cached separately)
**Fixes:**
```bash
# Check cache permissions
ls -la ~/.cache/torch_extensions
# Ensure stable timestamps
date # Should be correct
# Mount cache in Docker to persist across container restarts
docker run -v ~/.cache/torch_extensions:/root/.cache/torch_extensions ...
```
### CUDA Out of Memory During Compilation
**Symptom:** OOM error on first run but not subsequent runs.
**Cause:** Compilation allocates temporary buffers for tracing.
**Fixes:**
1. Disable CUDA graphs: `enable_cuda_graph=False` (saves ~500MB)
2. Reduce batch size temporarily for first run
3. Clear other VRAM consumers (close other apps, disable model caching)
### Compilation Hangs or Crashes
**Symptom:** Process freezes during "Compiling model..." step.
**Causes:**
1. Triton compilation error (if `enable_triton=True`)
2. Driver incompatibility
3. Insufficient CPU RAM for graph analysis
**Fixes:**
```bash
# Disable Triton
# In src/StableFast/StableFast.py:
config.enable_triton = False
# Update NVIDIA driver
nvidia-smi # Check version, upgrade if < 525.x
# Increase Docker memory limit
# In docker-compose.yml:
deploy:
resources:
limits:
memory: 16G # Increase from default
```
### Error: `torch.jit.trace` fails
**Symptom:** `RuntimeError: Could not trace model`
**Cause:** Dynamic control flow in model (if/else statements depending on runtime values).
**Fix:** This is rare with standard SD models. If it occurs:
1. Check for custom LoRA/embeddings with dynamic logic
2. Disable Stable-Fast for that specific generation
3. Report issue with model details
### Model Quality Degradation
**Symptom:** Compiled model produces different outputs than baseline.
**Cause:** Numeric precision differences from operator fusion (very rare).
**Fixes:**
```python
# Disable aggressive optimizations
config.enable_cnn_optimization = False
config.memory_format = None # Use default layout
```
If issue persists, disable Stable-Fast and file a bug report.
## Advanced Configuration
### Custom Compilation Config
Override defaults in `src/StableFast/StableFast.py`:
```python
def gen_stable_fast_config(mode="balanced"):
    """Build a CompilationConfig preset.

    `mode` is one of "max_speed", "balanced", or "debug"
    (the selector is illustrative; pick one preset — setting all
    three in sequence would just leave the last one active).
    """
    config = CompilationConfig.Default()
    if mode == "max_speed":
        # Maximum speed (higher VRAM usage)
        config.enable_cuda_graph = True
        config.enable_triton = True
        config.prefer_lowp_gemm = True  # Use FP16 matrix multiplies
    elif mode == "balanced":
        # Balanced (recommended)
        config.enable_cuda_graph = False
        config.enable_triton = False
        config.enable_cnn_optimization = True
    else:
        # Debug (no optimizations)
        config.enable_cuda_graph = False
        config.enable_jit_freeze = False
        config.enable_cnn_optimization = False
    return config
```
### Clear Cached Compilations
```bash
# Linux/Mac
rm -rf ~/.cache/torch_extensions
# Windows (rmdir removes the directory tree; del alone leaves empty folders)
rmdir /s /q %USERPROFILE%\.cache\torch_extensions
# Docker (mount cache as volume)
docker run -v my_cache:/root/.cache/torch_extensions ...
docker volume rm my_cache # Clear cache
```
### Profile Compilation
```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG
# Run generation and check logs
grep "Stable" logs/server.log
```
## Best Practices
### Production Deployments
1. **Pre-compile models** during startup with a warm-up request (only for batch/long-running services)
2. **Mount cache volume** to persist compilations across container restarts
3. **Disable CUDA graphs** if serving multiple batch sizes
4. **Enable CUDA graphs** for fixed-resolution APIs with consistent high-volume traffic
5. **Disable Stable-Fast entirely** for single-shot API endpoints (compilation overhead exceeds benefit)
Example warm-up:
```python
# In startup script (illustrative; adjust the call below to the
# model wrapper's actual forward signature)
import torch

def warmup_stable_fast(model, width=768, height=512):
"""Pre-compile model with dummy input."""
dummy_input = torch.randn(1, 4, height // 8, width // 8, device="cuda")
dummy_timestep = torch.tensor([999], device="cuda")
with torch.no_grad():
model(dummy_input, dummy_timestep, c={})
print("Stable-Fast compilation complete")
```
### Development Workflows
1. **Disable Stable-Fast** when experimenting with new models/LoRAs (avoids repeated recompilation)
2. **Enable for final testing** to verify production performance
3. **Clear cache** after upgrading PyTorch/CUDA drivers
## Citation
If you use Stable-Fast in your work:
```bibtex
@misc{stable-fast,
author = {Cheng Zeyi},
title = {stable-fast: Fast Inference for Stable Diffusion},
year = {2023},
publisher = {GitHub},
url = {https://github.com/chengzeyi/stable-fast}
}
```
## Resources
- [Stable-Fast Repository](https://github.com/chengzeyi/stable-fast)
- [Installation Guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation)
- [TorchScript Documentation](https://pytorch.org/docs/stable/jit.html)
- [CUDA Graphs Guide](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)