# Stable-Fast Compilation

## Overview

Stable-Fast is a JIT compilation framework that optimizes Stable Diffusion UNet models by tracing execution, fusing operators, and optionally capturing CUDA graphs. It can provide significant speedup for SD1.5/SDXL batch workflows with effectively no quality loss.

Unlike runtime attention optimizations (SageAttention, SpargeAttn), Stable-Fast compiles the model **once, during the first inference pass**. The compiled model is cached and reused for subsequent generations with compatible shapes.

## How It Works

Stable-Fast applies three optimization layers:

### 1. TorchScript Tracing

The first forward pass through the UNet is recorded into a static computational graph:

```python
traced_model = torch.jit.trace(unet, example_inputs)
```

This eliminates Python interpreter overhead and enables downstream graph optimizations.

### 2. Operator Fusion

The traced graph undergoes pattern-based fusion:

- **Conv + BatchNorm fusion**: Merges normalization into convolution weights
- **Activation fusion**: Fuses ReLU/GELU/SiLU directly into linear/conv ops
- **Memory layout optimization**: Converts to channels-last format for faster conv execution
- **Triton kernels**: Replaces PyTorch ops with hand-tuned Triton implementations (if `enable_triton=True`)

Example fusion:

```python
# Before:
x = conv(input)
x = batch_norm(x)
x = relu(x)

# After:
x = fused_conv_bn_relu(input)  # Single kernel launch
```

### 3. CUDA Graph Capture (Optional)

When `enable_cuda_graph=True`, the entire forward pass is captured as a static CUDA graph:

- Kernel launches are recorded once and replayed on subsequent runs
- Eliminates CPU launch overhead (~10-15% speedup)
- Requires fixed input shapes and batch sizes

**Trade-off:** Higher VRAM usage (~500MB for graph buffers) and less flexibility.

## Installation

### Windows/Linux (Manual)

Follow the [official guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation):

```bash
# Install from PyPI (recommended)
pip install stable-fast

# Or build from source for latest features
git clone https://github.com/chengzeyi/stable-fast
cd stable-fast
pip install -e .
```

**Prerequisites:**

- PyTorch 2.0+ with CUDA support
- xformers (optional but recommended)
- Triton (optional, for Triton kernel fusion)

### Docker

Stable-Fast is included in the Docker image when `INSTALL_STABLE_FAST=1`:

```bash
docker-compose build --build-arg INSTALL_STABLE_FAST=1
```

The default is `0` (disabled) to reduce image size and build time.

## Usage

### Streamlit UI

Enable in the **Performance** section of the sidebar:

1. Check **Stable Fast**
2. Generate images; the first run compiles the model (30-60s delay)
3. Subsequent generations reuse the cached compiled model

**Visual indicator:** The first generation shows "Compiling model..." in the progress bar.
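For reference, here is roughly what that first-run compilation step does under the hood, sketched with the upstream `sfast` compiler entry point described in its README. The diffusers pipeline and model ID below are illustrative placeholders, not how LightDiffusion-Next loads its models:

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

# Load a pipeline as usual (the model ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Mirror the defaults described below: xformers on, CUDA graphs off for flexibility.
config = CompilationConfig.Default()
config.enable_xformers = True   # requires xformers to be installed
config.enable_cuda_graph = False

# Wrap the pipeline; tracing and operator fusion are applied when the model first runs.
pipe = compile(pipe, config)

# First call is slow (compilation); later calls with the same shapes reuse the compiled UNet.
image = pipe("a peaceful garden with cherry blossoms", width=768, height=512).images[0]
```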
### REST API

Pass `stable_fast: true` in the request payload:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a peaceful garden with cherry blossoms",
    "width": 768,
    "height": 512,
    "num_images": 1,
    "stable_fast": true
  }'
```

### Configuration

Stable-Fast behavior is controlled by `CompilationConfig`:

```python
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig

config = CompilationConfig.Default()
config.enable_xformers = True               # Use xformers attention
config.enable_cuda_graph = False            # CUDA graphs (set True for max speed)
config.enable_jit_freeze = True             # Freeze traced graph
config.enable_cnn_optimization = True       # Conv fusion
config.enable_triton = False                # Triton kernels (experimental)
config.memory_format = torch.channels_last  # Optimize memory layout
```

LightDiffusion-Next uses sensible defaults (CUDA graphs disabled by default for flexibility). To override:

```python
# In src/StableFast/StableFast.py
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()
    config.enable_cuda_graph = enable_cuda_graph  # Pass True for max speed
    # ... rest of config
```

## Performance

### Speedup Benchmarks

Stable-Fast provides speedup through:

- **JIT compilation**: Eliminates Python overhead
- **Operator fusion**: Reduces kernel launches
- **CUDA graphs** (optional): Further reduces CPU overhead

Speedup varies significantly based on:

- GPU architecture
- Batch size and generation count
- Model size (SD1.5 vs SDXL)
- Whether CUDA graphs are enabled

**Note:** Performance benefits are most noticeable for batch operations (50+ images). For single 20-step generations, compilation overhead may exceed the speedup gains.

### Compilation Time

First-run compilation overhead:

- **SD1.5 UNet**: ~30s (traced once per resolution/batch size)
- **SDXL UNet**: ~60s (larger model)
- **Subsequent runs**: <1s (cached)

Cached compiled models persist in `~/.cache/torch_extensions/`. Clear this directory to force recompilation.

## Stacking with Other Optimizations

Stable-Fast is **fully compatible** with SageAttention, SpargeAttn, and WaveSpeed:

### Stable-Fast + SageAttention

```yaml
stable_fast: true
# SageAttention auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.15× (SageAttention) ≈ **2× total speedup**

### Stable-Fast + SpargeAttn

```yaml
stable_fast: true
# SpargeAttn auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.4× (SpargeAttn) ≈ **2.4× total speedup**

### Stable-Fast + SpargeAttn + DeepCache

```yaml
stable_fast: true
deepcache:
  enabled: true
  interval: 3
  depth: 2
# SpargeAttn auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.4× (SpargeAttn) × ~2× (DeepCache, 2-3×) ≈ **4-5× total speedup**

## Compatibility

### Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Img2Img (with fixed denoise strength)
- ✅ SageAttention/SpargeAttn
- ✅ WaveSpeed caching

### Not Compatible With

- ❌ Flux models (different architecture, no UNet)
- ❌ Dynamic resolution changes after compilation
- ❌ Dynamic batch size changes after compilation (with CUDA graphs)
- ⚠️ Frequent model switching (recompiles each time)

## Troubleshooting

### Slow First Run / Repeated Recompilation

**Symptom:** Every generation triggers compilation, even with identical settings.

**Causes:**

1. Cache directory not writable
2. System clock incorrect (invalidates timestamps)
3. Different model loaded (each model is cached separately)

**Fixes:**

```bash
# Check cache permissions
ls -la ~/.cache/torch_extensions

# Ensure stable timestamps
date  # Should be correct

# Mount cache in Docker to persist across container restarts
docker run -v ~/.cache/torch_extensions:/root/.cache/torch_extensions ...
```

### CUDA Out of Memory During Compilation

**Symptom:** OOM error on the first run but not on subsequent runs.

**Cause:** Compilation allocates temporary buffers for tracing.

**Fixes:**

1. Disable CUDA graphs: `enable_cuda_graph=False` (saves ~500MB)
2. Reduce batch size temporarily for the first run
3. Clear other VRAM consumers (close other apps, disable model caching)

### Compilation Hangs or Crashes

**Symptom:** Process freezes during the "Compiling model..." step.

**Causes:**

1. Triton compilation error (if `enable_triton=True`)
2. Driver incompatibility
3. Insufficient CPU RAM for graph analysis

**Fixes:**

```bash
# Disable Triton
# In src/StableFast/StableFast.py:
#   config.enable_triton = False

# Update NVIDIA driver
nvidia-smi  # Check version, upgrade if < 525.x

# Increase Docker memory limit
# In docker-compose.yml:
#   deploy:
#     resources:
#       limits:
#         memory: 16G  # Increase from default
```

### Error: `torch.jit.trace` Fails

**Symptom:** `RuntimeError: Could not trace model`

**Cause:** Dynamic control flow in the model (if/else statements depending on runtime values).

**Fix:** This is rare with standard SD models. If it occurs:

1. Check for custom LoRA/embeddings with dynamic logic
2. Disable Stable-Fast for that specific generation
3. Report the issue with model details

### Model Quality Degradation

**Symptom:** The compiled model produces different outputs than the baseline.

**Cause:** Numeric precision differences from operator fusion (very rare).

**Fixes:**

```python
# Disable aggressive optimizations
config.enable_cnn_optimization = False
config.memory_format = None  # Use default layout
```

If the issue persists, disable Stable-Fast and file a bug report.

## Advanced Configuration

### Custom Compilation Config

Override the defaults in `src/StableFast/StableFast.py`. The presets below are alternatives; keep only one of them active:

```python
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()

    # Maximum speed (higher VRAM usage):
    # config.enable_cuda_graph = True
    # config.enable_triton = True
    # config.prefer_lowp_gemm = True  # Use FP16 matrix multiplies

    # Balanced (recommended, active):
    config.enable_cuda_graph = enable_cuda_graph  # Pass True for max speed
    config.enable_triton = False
    config.enable_cnn_optimization = True

    # Debug (no optimizations):
    # config.enable_cuda_graph = False
    # config.enable_jit_freeze = False
    # config.enable_cnn_optimization = False

    return config
```

### Clear Cached Compilations

```bash
# Linux/Mac
rm -rf ~/.cache/torch_extensions

# Windows
rmdir /s /q %USERPROFILE%\.cache\torch_extensions

# Docker (mount cache as volume)
docker run -v my_cache:/root/.cache/torch_extensions ...
docker volume rm my_cache  # Clear cache
```

### Profile Compilation

```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Run generation and check logs
cat logs/server.log | grep "Stable"
```
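If reading logs is not enough, the compile-versus-cached gap can also be measured directly. A minimal timing sketch; it assumes `pipe` is a Stable-Fast-compiled pipeline as in the earlier compilation example:

```python
import time

import torch


def time_generation(pipe, prompt, **kwargs):
    """Time one pipeline call, synchronizing so GPU work is included in the measurement."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt, **kwargs)
    torch.cuda.synchronize()
    return time.perf_counter() - start


# The first call includes tracing and fusion; later calls with the same shapes reuse the compiled model.
first = time_generation(pipe, "a peaceful garden", width=768, height=512)
cached = time_generation(pipe, "a peaceful garden", width=768, height=512)
print(f"first run: {first:.1f}s (includes compilation), cached run: {cached:.1f}s")
```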
## Best Practices

### Production Deployments

1. **Pre-compile models** during startup with a warm-up request (only for batch/long-running services)
2. **Mount the cache volume** to persist compilations across container restarts
3. **Disable CUDA graphs** if serving multiple batch sizes
4. **Enable CUDA graphs** for fixed-resolution APIs with consistent high-volume traffic
5. **Disable Stable-Fast entirely** for single-shot API endpoints (compilation overhead exceeds the benefit)

Example warm-up:

```python
# In the startup script
def warmup_stable_fast(model, width=768, height=512):
    """Pre-compile the model with a dummy input."""
    dummy_input = torch.randn(1, 4, height // 8, width // 8, device="cuda")
    dummy_timestep = torch.tensor([999], device="cuda")
    with torch.no_grad():
        model(dummy_input, dummy_timestep, c={})
    print("Stable-Fast compilation complete")
```

### Development Workflows

1. **Disable Stable-Fast** when experimenting with new models/LoRAs (avoids repeated recompilation)
2. **Enable it for final testing** to verify production performance
3. **Clear the cache** after upgrading PyTorch or CUDA drivers

## Citation

If you use Stable-Fast in your work:

```bibtex
@misc{stable-fast,
  author    = {Cheng Zeyi},
  title     = {stable-fast: Fast Inference for Stable Diffusion},
  year      = {2023},
  publisher = {GitHub},
  url       = {https://github.com/chengzeyi/stable-fast}
}
```

## Resources

- [Stable-Fast Repository](https://github.com/chengzeyi/stable-fast)
- [Installation Guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation)
- [TorchScript Documentation](https://pytorch.org/docs/stable/jit.html)
- [CUDA Graphs Guide](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)