# Performance Optimizations
LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.
For a detailed source-based report on what is implemented today, including server-side throughput optimizations and practical implementation notes, see the [Implemented Optimizations Report](implemented-optimizations-report.md).
## Optimization Stack Overview
The pipeline orchestrates six primary acceleration paths:
| Technique | Type | Speedup | Quality Impact | Requirements |
|-----------|------|---------|----------------|---------------|
| [AYS Scheduler](#ays-scheduler) | Sampling schedule | ~2x | None/Better | All models |
| [Prompt Caching](#prompt-caching) | Embedding cache | 5-15% | None | All models |
| [SageAttention](#sageattention--spargeattn) | Attention kernel | Moderate | None | All CUDA GPUs |
| [SpargeAttn](#sageattention--spargeattn) | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| [Stable-Fast](#stable-fast) | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| [WaveSpeed](#wavespeed-caching) | Feature caching | High | Tunable | All models |
*Speedup depends heavily on batch size and generation count
These optimizations **work together**: enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.
## Quick Comparison
### AYS Scheduler
**What it does:** Uses research-backed optimal timestep distributions that allow equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on noise levels that contribute most to image formation.
**When to use:**
- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps
**Trade-offs:** Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.
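The core idea, non-uniform sigma spacing, can be sketched as log-space interpolation of a short reference schedule down (or up) to an arbitrary step count. The reference values below are illustrative placeholders, not the published AYS tables, and `resample_schedule` is not the project's actual function:

```python
import math

def resample_schedule(base_sigmas, num_steps):
    """Interpolate a reference sigma schedule to `num_steps` points in
    log-sigma space, the way AYS-style schedules are resampled to
    arbitrary step counts (illustrative sketch only)."""
    logs = [math.log(s) for s in base_sigmas]
    n = len(logs)
    out = []
    for i in range(num_steps):
        t = i * (n - 1) / (num_steps - 1)
        lo = int(math.floor(t))
        hi = min(lo + 1, n - 1)
        frac = t - lo
        out.append(math.exp(logs[lo] * (1 - frac) + logs[hi] * frac))
    return out

# Illustrative reference schedule (NOT the published AYS values):
base = [14.6, 6.3, 3.8, 2.2, 1.3, 0.9, 0.55, 0.38, 0.23, 0.11]
sigmas = resample_schedule(base, 7)
```

Because the reference schedule is denser at low noise levels, the resampled schedule keeps that concentration regardless of the requested step count.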
[→ Full AYS Scheduler guide](ays-scheduler.md)
---
### Prompt Caching
**What it does:** Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
**When to use:**
- Batch generation with same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts
**Trade-offs:** None – minimal memory overhead (~50-200MB), negligible CPU cost, automatically enabled by default.
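An embedding cache of this kind boils down to a small LRU map keyed by prompt and model. The sketch below is illustrative; the class name, key scheme, and capacity are assumptions, not the project's API:

```python
import hashlib
from collections import OrderedDict

class PromptEmbeddingCache:
    """Minimal LRU cache for text-encoder outputs (illustrative sketch)."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def _key(self, prompt, model_id):
        return hashlib.sha256(f"{model_id}:{prompt}".encode()).hexdigest()

    def get_or_encode(self, prompt, model_id, encode_fn):
        key = self._key(prompt, model_id)
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        emb = encode_fn(prompt)               # expensive text-encoder pass
        self._store[key] = emb
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # evict least recently used
        return emb

calls = []
cache = PromptEmbeddingCache()
fake_encode = lambda p: calls.append(p) or [len(p)]
e1 = cache.get_or_encode("a cat", "sd15", fake_encode)
e2 = cache.get_or_encode("a cat", "sd15", fake_encode)  # served from cache
```

Keying on both prompt and model identifier matters: the same prompt produces different embeddings under different text encoders.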
[→ Full Prompt Caching guide](prompt-caching.md)
---
### SageAttention & SpargeAttn {#sageattention--spargeattn}
**What it does:** Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention uses INT8 quantization for key/value tensors while maintaining FP16 query precision. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
**When to use:**
- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs and post-processing stages
**Trade-offs:** None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).
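The INT8 idea can be illustrated in miniature: quantize the key vector symmetrically to 8 bits, then dequantize inside the dot product. This is a toy sketch of the numerical principle, not SageAttention's fused CUDA kernel:

```python
import random

def quantize_int8(xs):
    """Symmetric per-tensor INT8 quantization (illustrative of the idea
    behind SageAttention's key handling, not its actual kernel)."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
q_vec = [random.gauss(0, 1) for _ in range(64)]  # FP query stays full precision
k_vec = [random.gauss(0, 1) for _ in range(64)]

k_q, k_scale = quantize_int8(k_vec)
approx = dot(q_vec, [v * k_scale for v in k_q])  # dequantize, then dot
exact = dot(q_vec, k_vec)
err = abs(approx - exact)
```

With 127 quantization levels the per-element rounding error is tiny, which is why attention scores computed this way are visually indistinguishable from full precision.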
[→ Full SageAttention/SpargeAttn guide](sageattention.md)
---
### CFG Samplers {#cfg-samplers}
CFG++ Samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling process, providing better quality and stability compared to standard CFG.
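For orientation, the standard CFG combination that CFG++ builds on looks like this. The sketch shows only the guidance blend; per the CFG++ paper, the distinguishing change is in how the sampler re-noises (using the unconditional prediction), which is noted in a comment rather than implemented here:

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance blend of unconditional and
    conditional noise predictions. CFG++ keeps a blend like this for the
    denoising direction but uses the unconditional prediction when
    re-noising, improving stability at high scales. Illustrative only."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

guided = cfg_combine([0.1, -0.2], [0.3, 0.0], scale=7.5)
```

At `scale=1.0` the blend reduces to the conditional prediction; higher scales push the output further along the conditional direction.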
---
### Multi-Scale Diffusion {#multi-scale}
Multi-Scale Diffusion optimizes performance by processing images at multiple resolutions during generation, reducing computation for high-resolution areas.
**When to use:**
- High-resolution generation (>1024px)
- When memory is limited
- For faster previews
**Trade-offs:** May reduce detail in fine areas.
**Note:** In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while also providing a small speedup, an effect attributable to its upsampling process.
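Schematically, the approach amounts to running early denoising steps at reduced resolution and finishing at full resolution. The split below is a sketch; the 60/40 phase fraction and phase names are assumptions, not project parameters:

```python
def multi_scale_plan(total_steps, low_res_fraction=0.6):
    """Split a denoising run into a low-resolution phase followed by a
    full-resolution phase (illustrative scheduling sketch only)."""
    low = int(total_steps * low_res_fraction)
    return ([("half_res", s) for s in range(low)]
            + [("full_res", s) for s in range(low, total_steps)])

plan = multi_scale_plan(10)  # 6 cheap half-res steps, 4 full-res steps
```

Since compute scales roughly with pixel count, steps run at half resolution cost about a quarter of a full-resolution step, which is where the savings come from.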
---
### Stable-Fast
**What it does:** JIT-compiles the UNet diffusion model into optimized TorchScript with optional CUDA graphs. The first forward pass traces execution, caches kernel launches and fuses operators for reduced overhead.
**When to use:**
- **Systems with >8GB VRAM** (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where 30-60s compilation amortizes over time
- Fixed resolutions and batch sizes
**When NOT to use:**
- Normal 20-step single image generation (compilation overhead > speedup gains)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes
**Trade-offs:** Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.
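The batch-size guidance follows from simple break-even arithmetic: compilation pays off once the accumulated per-image savings exceed the one-time compile cost. The numbers below are illustrative assumptions, not measured values:

```python
def break_even_images(compile_s, per_image_baseline_s, speedup):
    """Number of images needed before one-time compilation pays for
    itself (illustrative arithmetic, all inputs are assumptions)."""
    saved_per_image = per_image_baseline_s * (1 - 1 / speedup)
    return compile_s / saved_per_image

# e.g. 45 s compile, 4 s/image baseline, 1.5x compiled speedup:
n = break_even_images(45, 4.0, 1.5)
```

With these example numbers the break-even point lands around 34 images, consistent with the "50+ images" recommendation above once warm-up variance is accounted for.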
[→ Full Stable-Fast guide](stablefast.md)
---
### WaveSpeed Caching
**What it does:** Exploits temporal redundancy in diffusion processes by reusing work across denoising steps. In the current project stack this primarily means DeepCache on supported UNet models, with additional Flux-oriented cache groundwork present in the codebase.
1. **DeepCache** – Reuses prior denoiser outputs on selected steps in UNet models (SD1.5, SDXL)
2. **First Block Cache (FBCache)** – Flux-oriented cache machinery available for specialized integration work
**When to use:**
- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast
**Trade-offs:** Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.
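The interval-based reuse pattern can be sketched as follows. `denoise_fn` stands in for the expensive deep-feature pass; in real DeepCache the shallow UNet layers still run every step, which this schematic omits:

```python
def run_with_cache(num_steps, interval, denoise_fn):
    """Recompute expensive features only every `interval` steps and
    reuse the cached result otherwise (schematic DeepCache-style sketch)."""
    cached = None
    outputs = []
    for step in range(num_steps):
        if cached is None or step % interval == 0:
            cached = denoise_fn(step)   # full (expensive) computation
        outputs.append(cached)          # intermediate steps reuse the cache
    return outputs

full_passes = []
outs = run_with_cache(10, 3, lambda s: full_passes.append(s) or s)
```

With `interval=3` over 10 steps, only 4 full passes run; a higher interval saves more but lets cached features drift further from what a fresh pass would produce, which is the detail-loss trade-off noted above.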
[→ Full WaveSpeed guide](wavespeed.md)
---
## Priority & Fallback System
LightDiffusion-Next automatically selects the best available attention backend at runtime:
```
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```
If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging.
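The selection logic amounts to walking the priority list and taking the first usable backend, with the environment variable short-circuiting to the baseline. This is a schematic sketch; the availability flags are assumptions, only the backend ordering and `LD_DISABLE_SAGE_ATTENTION` come from the text above:

```python
def pick_attention_backend(available, env):
    """Walk the priority list and return the first usable backend
    (schematic sketch of the fallback chain described above)."""
    if env.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "pytorch_sdpa"                 # forced fallback for debugging
    for backend in ("spargeattn", "sageattention", "xformers"):
        if available.get(backend):
            return backend
    return "pytorch_sdpa"                     # always-available baseline

backend = pick_attention_backend({"sageattention": True}, env={})
```

The same walk re-runs per attention call site in a real implementation, so a kernel failure on one head dimension only demotes that call, not the whole pipeline.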
Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.
## Recommended Configurations
### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)
```yaml
stable_fast: true # Only for batch operations
sageattention: auto # or spargeattn if available
deepcache:
enabled: true
interval: 3
depth: 2
```
**Expected:** Maximum speedup for batch operations, some quality loss
**Note:** Disable stable_fast for single 20-step generations
### Balanced - Quick Generation (SD1.5, any VRAM)
```yaml
scheduler: ays # NEW: Use AYS for 2x speedup
steps: 10 # Reduced from 20 (same quality with AYS)
stable_fast: false # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true # Enabled by default
deepcache:
enabled: true
interval: 2
depth: 1
```
**Expected:** ~2-3x speedup with minimal quality loss
**Note:** AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images)
### Quality-First (Flux)
```yaml
scheduler: ays_flux # NEW: Optimized for Flux models
steps: 10 # Reduced from 15 (same quality with AYS)
stable_fast: false # not supported
sageattention: auto
prompt_cache_enabled: true
deepcache:
enabled: true
interval: 2
```
**Expected:** ~2x speedup with minimal quality impact
### Production API - High Volume (>8GB VRAM)
```yaml
stable_fast: true # Only for sustained high-volume APIs
sageattention: auto
deepcache:
enabled: false # avoid variability across batch sizes
keep_models_loaded: true
```
**Expected:** Consistent latency for repeated identical requests
**Note:** For low-volume or single-shot APIs, use `stable_fast: false`
## Hardware-Specific Tips
### RTX 30xx / 40xx (Ampere/Ada)
- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM β Stable-Fast graphs consume ~500MB
### RTX 50xx (Blackwell)
- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for new CUDA arch
- DeepCache is your best additional speedup
### A100 / H100 (Datacenter)
- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in Stable-Fast config)
### Low VRAM (<8GB)
- **Always disable Stable-Fast** (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows
## Debugging & Profiling
Check which optimizations are active:
```bash
# View startup logs
grep -i "using\|enabled" logs/server.log
# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```
Monitor telemetry:
```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```
Disable individual optimizations to isolate issues:
```bash
export LD_DISABLE_SAGE_ATTENTION=1 # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1 # Skips compilation
export LD_DISABLE_WAVESPEED=1 # Disables all caching
```
## Further Reading
- [AYS Scheduler Deep Dive](ays-scheduler.md) – Theory, implementation, quality tuning
- [Prompt Caching Deep Dive](prompt-caching.md) – Implementation details, cache management, performance impact
- [SageAttention & SpargeAttn Deep Dive](sageattention.md) – Installation, technical details, head dimension handling
- [Stable-Fast Compilation Guide](stablefast.md) – Configuration, CUDA graphs, troubleshooting
- [WaveSpeed Caching Strategies](wavespeed.md) – DeepCache vs FBCache, tuning parameters, compatibility matrix
- [Performance Tuning](quirks.md) – VRAM management, slow first runs, recompilation fixes
---
Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.