# Stable-Fast Compilation

## Overview

Stable-Fast is a JIT compilation framework that optimizes Stable Diffusion UNet models by tracing execution, fusing operators, and optionally capturing CUDA graphs. It can provide significant speedup for SD1.5/SDXL batch workflows with effectively no quality loss.

Unlike runtime attention optimizations (SageAttention, SpargeAttn), Stable-Fast compiles the model **once, during the first inference pass**. The compiled model is cached and reused for subsequent generations with compatible shapes.

## How It Works

Stable-Fast applies three optimization layers:

### 1. TorchScript Tracing

The first forward pass through the UNet is recorded into a static computational graph:

```python
traced_model = torch.jit.trace(unet, example_inputs)
```

This eliminates Python interpreter overhead and enables downstream graph optimizations.

### 2. Operator Fusion

The traced graph undergoes pattern-based fusion:

- **Conv + BatchNorm fusion**: Merges normalization into convolution weights
- **Activation fusion**: Fuses ReLU/GELU/SiLU directly into linear/conv ops
- **Memory layout optimization**: Converts to channels-last format for faster conv execution
- **Triton kernels**: Replaces PyTorch ops with hand-tuned Triton implementations (if `enable_triton=True`)

Example fusion:

```python
# Before:
x = conv(input)
x = batch_norm(x)
x = relu(x)

# After:
x = fused_conv_bn_relu(input)  # Single kernel launch
```

### 3. CUDA Graph Capture (Optional)

When `enable_cuda_graph=True`, the entire forward pass is captured as a static CUDA graph:

- Kernel launches are recorded once and replayed on subsequent runs
- Eliminates CPU launch overhead (~10-15% speedup)
- Requires fixed input shapes and batch sizes

**Trade-off:** Higher VRAM usage (~500MB for graph buffers) and less flexibility.

## Installation

### Windows/Linux (Manual)

Follow the [official guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation):

```bash
# Install from PyPI (recommended)
pip install stable-fast

# Or build from source for latest features
git clone https://github.com/chengzeyi/stable-fast
cd stable-fast
pip install -e .
```

**Prerequisites:**

- PyTorch 2.0+ with CUDA support
- xformers (optional but recommended)
- Triton (optional, for Triton kernel fusion)

### Docker

Stable-Fast is included in the Docker image when `INSTALL_STABLE_FAST=1`:

```bash
docker-compose build --build-arg INSTALL_STABLE_FAST=1
```

The default is `0` (disabled) to reduce image size and build time.

## Usage

### Streamlit UI

Enable in the **Performance** section of the sidebar:

1. Check **Stable Fast**
2. Generate images; the first run compiles the model (30-60s delay)
3. Subsequent generations reuse the cached compiled model

**Visual indicator:** The first generation shows "Compiling model..." in the progress bar.
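For reference, here is roughly what that first-run compilation step does under the hood, sketched with the upstream `sfast` compiler entry point described in its README. The diffusers pipeline and model ID below are illustrative placeholders, not how LightDiffusion-Next loads its models:

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

# Load a pipeline as usual (the model ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Mirror the defaults described below: xformers on, CUDA graphs off for flexibility.
config = CompilationConfig.Default()
config.enable_xformers = True   # requires xformers to be installed
config.enable_cuda_graph = False

# Wrap the pipeline; tracing and operator fusion are applied when the model first runs.
pipe = compile(pipe, config)

# First call is slow (compilation); later calls with the same shapes reuse the compiled UNet.
image = pipe("a peaceful garden with cherry blossoms", width=768, height=512).images[0]
```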
### REST API

Pass `stable_fast: true` in the request payload:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a peaceful garden with cherry blossoms",
    "width": 768,
    "height": 512,
    "num_images": 1,
    "stable_fast": true
  }'
```

### Configuration

Stable-Fast behavior is controlled by `CompilationConfig`:

```python
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig

config = CompilationConfig.Default()
config.enable_xformers = True               # Use xformers attention
config.enable_cuda_graph = False            # CUDA graphs (set True for max speed)
config.enable_jit_freeze = True             # Freeze traced graph
config.enable_cnn_optimization = True       # Conv fusion
config.enable_triton = False                # Triton kernels (experimental)
config.memory_format = torch.channels_last  # Optimize memory layout
```

LightDiffusion-Next uses sensible defaults (CUDA graphs disabled by default for flexibility). To override:

```python
# In src/StableFast/StableFast.py
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()
    config.enable_cuda_graph = enable_cuda_graph  # Pass True for max speed
    # ... rest of config
```

## Performance

### Speedup Benchmarks

Stable-Fast provides speedup through:

- **JIT compilation**: Eliminates Python overhead
- **Operator fusion**: Reduces kernel launches
- **CUDA graphs** (optional): Further reduces CPU overhead

Speedup varies significantly based on:

- GPU architecture
- Batch size and generation count
- Model size (SD1.5 vs SDXL)
- Whether CUDA graphs are enabled

**Note:** Performance benefits are most noticeable for batch operations (50+ images). For single 20-step generations, compilation overhead may exceed the speedup gains.

### Compilation Time

First-run compilation overhead:

- **SD1.5 UNet**: ~30s (traced once per resolution/batch size)
- **SDXL UNet**: ~60s (larger model)
- **Subsequent runs**: <1s (cached)

Cached compiled models persist in `~/.cache/torch_extensions/`. Clear this directory to force recompilation.

## Stacking with Other Optimizations

Stable-Fast is **fully compatible** with SageAttention, SpargeAttn, and WaveSpeed:

### Stable-Fast + SageAttention

```yaml
stable_fast: true
# SageAttention auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.15× (SageAttention) ≈ **2× total speedup**

### Stable-Fast + SpargeAttn

```yaml
stable_fast: true
# SpargeAttn auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.4× (SpargeAttn) ≈ **2.4× total speedup**

### Stable-Fast + SpargeAttn + DeepCache

```yaml
stable_fast: true
deepcache:
  enabled: true
  interval: 3
  depth: 2
# SpargeAttn auto-detected
```

**Result:** ~1.7× (Stable-Fast) × ~1.4× (SpargeAttn) × ~2× (DeepCache, 2-3×) ≈ **4-5× total speedup**

## Compatibility

### Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Img2Img (with fixed denoise strength)
- ✅ SageAttention/SpargeAttn
- ✅ WaveSpeed caching

### Not Compatible With

- ❌ Flux models (different architecture, no UNet)
- ❌ Dynamic resolution changes after compilation
- ❌ Dynamic batch size changes after compilation (with CUDA graphs)
- ⚠️ Frequent model switching (recompiles each time)

## Troubleshooting

### Slow First Run / Repeated Recompilation

**Symptom:** Every generation triggers compilation, even with identical settings.

**Causes:**

1. Cache directory not writable
2. System clock incorrect (invalidates timestamps)
3. Different model loaded (each model is cached separately)

**Fixes:**

```bash
# Check cache permissions
ls -la ~/.cache/torch_extensions

# Ensure stable timestamps
date  # Should be correct

# Mount cache in Docker to persist across container restarts
docker run -v ~/.cache/torch_extensions:/root/.cache/torch_extensions ...
```

### CUDA Out of Memory During Compilation

**Symptom:** OOM error on the first run but not on subsequent runs.

**Cause:** Compilation allocates temporary buffers for tracing.

**Fixes:**

1. Disable CUDA graphs: `enable_cuda_graph=False` (saves ~500MB)
2. Reduce batch size temporarily for the first run
3. Clear other VRAM consumers (close other apps, disable model caching)

### Compilation Hangs or Crashes

**Symptom:** Process freezes during the "Compiling model..." step.

**Causes:**

1. Triton compilation error (if `enable_triton=True`)
2. Driver incompatibility
3. Insufficient CPU RAM for graph analysis

**Fixes:**

```bash
# Disable Triton
# In src/StableFast/StableFast.py:
#   config.enable_triton = False

# Update NVIDIA driver
nvidia-smi  # Check version, upgrade if < 525.x

# Increase Docker memory limit
# In docker-compose.yml:
#   deploy:
#     resources:
#       limits:
#         memory: 16G  # Increase from default
```

### Error: `torch.jit.trace` Fails

**Symptom:** `RuntimeError: Could not trace model`

**Cause:** Dynamic control flow in the model (if/else statements depending on runtime values).

**Fix:** This is rare with standard SD models. If it occurs:

1. Check for custom LoRA/embeddings with dynamic logic
2. Disable Stable-Fast for that specific generation
3. Report the issue with model details

### Model Quality Degradation

**Symptom:** The compiled model produces different outputs than the baseline.

**Cause:** Numeric precision differences from operator fusion (very rare).

**Fixes:**

```python
# Disable aggressive optimizations
config.enable_cnn_optimization = False
config.memory_format = None  # Use default layout
```

If the issue persists, disable Stable-Fast and file a bug report.

## Advanced Configuration

### Custom Compilation Config

Override the defaults in `src/StableFast/StableFast.py`. The presets below are alternatives; keep only one of them active:

```python
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()

    # Maximum speed (higher VRAM usage):
    # config.enable_cuda_graph = True
    # config.enable_triton = True
    # config.prefer_lowp_gemm = True  # Use FP16 matrix multiplies

    # Balanced (recommended, active):
    config.enable_cuda_graph = enable_cuda_graph  # Pass True for max speed
    config.enable_triton = False
    config.enable_cnn_optimization = True

    # Debug (no optimizations):
    # config.enable_cuda_graph = False
    # config.enable_jit_freeze = False
    # config.enable_cnn_optimization = False

    return config
```

### Clear Cached Compilations

```bash
# Linux/Mac
rm -rf ~/.cache/torch_extensions

# Windows
rmdir /s /q %USERPROFILE%\.cache\torch_extensions

# Docker (mount cache as volume)
docker run -v my_cache:/root/.cache/torch_extensions ...
docker volume rm my_cache  # Clear cache
```

### Profile Compilation

```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Run generation and check logs
cat logs/server.log | grep "Stable"
```
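If reading logs is not enough, the compile-versus-cached gap can also be measured directly. A minimal timing sketch; it assumes `pipe` is a Stable-Fast-compiled pipeline as in the earlier compilation example:

```python
import time

import torch


def time_generation(pipe, prompt, **kwargs):
    """Time one pipeline call, synchronizing so GPU work is included in the measurement."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt, **kwargs)
    torch.cuda.synchronize()
    return time.perf_counter() - start


# The first call includes tracing and fusion; later calls with the same shapes reuse the compiled model.
first = time_generation(pipe, "a peaceful garden", width=768, height=512)
cached = time_generation(pipe, "a peaceful garden", width=768, height=512)
print(f"first run: {first:.1f}s (includes compilation), cached run: {cached:.1f}s")
```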
## Best Practices

### Production Deployments

1. **Pre-compile models** during startup with a warm-up request (only for batch/long-running services)
2. **Mount the cache volume** to persist compilations across container restarts
3. **Disable CUDA graphs** if serving multiple batch sizes
4. **Enable CUDA graphs** for fixed-resolution APIs with consistent high-volume traffic
5. **Disable Stable-Fast entirely** for single-shot API endpoints (compilation overhead exceeds the benefit)

Example warm-up:

```python
# In the startup script
def warmup_stable_fast(model, width=768, height=512):
    """Pre-compile the model with a dummy input."""
    dummy_input = torch.randn(1, 4, height // 8, width // 8, device="cuda")
    dummy_timestep = torch.tensor([999], device="cuda")
    with torch.no_grad():
        model(dummy_input, dummy_timestep, c={})
    print("Stable-Fast compilation complete")
```

### Development Workflows

1. **Disable Stable-Fast** when experimenting with new models/LoRAs (avoids repeated recompilation)
2. **Enable it for final testing** to verify production performance
3. **Clear the cache** after upgrading PyTorch or CUDA drivers

## Citation

If you use Stable-Fast in your work:

```bibtex
@misc{stable-fast,
  author    = {Cheng Zeyi},
  title     = {stable-fast: Fast Inference for Stable Diffusion},
  year      = {2023},
  publisher = {GitHub},
  url       = {https://github.com/chengzeyi/stable-fast}
}
```

## Resources

- [Stable-Fast Repository](https://github.com/chengzeyi/stable-fast)
- [Installation Guide](https://github.com/chengzeyi/stable-fast?tab=readme-ov-file#installation)
- [TorchScript Documentation](https://pytorch.org/docs/stable/jit.html)
- [CUDA Graphs Guide](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)