# Performance Optimizations

LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.

For a detailed source-based report on what is implemented today, including server-side throughput optimizations and practical implementation notes, see the [Implemented Optimizations Report](implemented-optimizations-report.md).

## Optimization Stack Overview

The pipeline orchestrates six primary acceleration paths:

| Technique | Type | Speedup | Quality Impact | Requirements |
|-----------|------|---------|----------------|--------------|
| [AYS Scheduler](#ays-scheduler) | Sampling schedule | ~2x | None/Better | All models |
| [Prompt Caching](#prompt-caching) | Embedding cache | 5-15% | None | All models |
| [SageAttention](#sageattention--spargeattn) | Attention kernel | Moderate | None | All CUDA GPUs |
| [SpargeAttn](#sageattention--spargeattn) | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| [Stable-Fast](#stable-fast) | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| [WaveSpeed](#wavespeed-caching) | Feature caching | High | Tunable | All models |

*Speedup depends heavily on batch size and generation count

These optimizations **work together** — enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.

## Quick Comparison

### AYS Scheduler

**What it does:** Uses research-backed optimal timestep distributions that achieve equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on the noise levels that contribute most to image formation (a schedule-resampling sketch appears below, after the Prompt Caching section).

**When to use:**

- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps

**Trade-offs:** Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.

[→ Full AYS Scheduler guide](ays-scheduler.md)

---

### Prompt Caching

**What it does:** Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from the cache instead of being recomputed.

**When to use:**

- Batch generation with the same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts

**Trade-offs:** Effectively none: memory overhead is small (~50-200MB), CPU cost is negligible, and the cache is enabled by default.

[→ Full Prompt Caching guide](prompt-caching.md)
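To make the caching behavior concrete, here is a minimal sketch of a prompt-embedding cache. It is illustrative only: `encode_fn` is a hypothetical stand-in for the CLIP text-encoder call, and the project's real cache (keying, eviction, memory accounting) may differ.

```python
import hashlib


class PromptEmbeddingCache:
    """Illustrative prompt -> embedding cache (no eviction policy)."""

    def __init__(self):
        self._store = {}

    def _key(self, model_name: str, prompt: str) -> str:
        # Hash model + prompt so different checkpoints never collide.
        return hashlib.sha256(f"{model_name}:{prompt}".encode()).hexdigest()

    def get_or_encode(self, model_name, prompt, encode_fn):
        key = self._key(model_name, prompt)
        if key not in self._store:
            self._store[key] = encode_fn(prompt)  # the expensive CLIP forward pass
        return self._store[key]
```

With something like this in place, re-running the same prompt with new seeds or settings skips the text encoder entirely, which is where the 5-15% saving comes from.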
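Returning to the AYS scheduler described above: published AYS schedules are defined for a fixed step count and are commonly resampled to other counts by interpolating in log-sigma space. The sketch below shows only that resampling idea; the sigma values are illustrative placeholders, not the published AYS numbers.

```python
import numpy as np

# Placeholder reference schedule (descending sigmas) -- NOT the published AYS values.
REF_SIGMAS = [14.6, 6.4, 3.8, 2.3, 1.5, 0.96, 0.61, 0.38, 0.23, 0.03]


def loglinear_resample(ref_sigmas, num_steps):
    """Resample a reference schedule to `num_steps` points, linear in log-sigma."""
    xs = np.linspace(0.0, 1.0, len(ref_sigmas))
    log_ys = np.log(np.asarray(ref_sigmas))
    new_xs = np.linspace(0.0, 1.0, num_steps)
    return np.exp(np.interp(new_xs, xs, log_ys)).tolist()


sigmas = loglinear_resample(REF_SIGMAS, 14)  # stays dense at the noise levels that matter
# Samplers typically append a final sigma of 0.0 before running the loop.
```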
---

### SageAttention & SpargeAttn {#sageattention--spargeattn}

**What it does:** Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention quantizes the query and key tensors to INT8 while keeping the value path in FP16. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations. (A drop-in usage sketch appears after the Priority & Fallback section below.)

**When to use:**

- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs, and post-processing stages

**Trade-offs:** None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (the default is conservative).

[→ Full SageAttention/SpargeAttn guide](sageattention.md)

---

### CFG++ Samplers {#cfg-samplers}

CFG++ samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling process, providing better quality and stability than standard CFG.

---

### Multi-Scale Diffusion {#multi-scale}

Multi-Scale Diffusion improves performance by processing the image at multiple resolutions during generation, reducing the computation spent on full-resolution denoising.

**When to use:**

- High-resolution generation (>1024px)
- When memory is limited
- For faster previews

**Trade-offs:** May reduce detail in fine areas.

**Note:** In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup; the upsampling pass accounts for this.

---

### Stable-Fast

**What it does:** JIT-compiles the UNet diffusion model into optimized TorchScript with optional CUDA graphs. The first forward pass traces execution, caches kernel launches, and fuses operators to reduce overhead. (A compilation sketch appears after the Priority & Fallback section below.)

**When to use:**

- **Systems with >8GB VRAM** (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where the 30-60s compilation cost amortizes over time
- Fixed resolutions and batch sizes

**When NOT to use:**

- Normal 20-step single-image generation (compilation overhead outweighs the speedup)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes

**Trade-offs:** Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.

[→ Full Stable-Fast guide](stablefast.md)

---

### WaveSpeed Caching

**What it does:** Exploits temporal redundancy in diffusion processes by reusing work across denoising steps. In the current project stack this primarily means DeepCache on supported UNet models, with additional Flux-oriented cache groundwork present in the codebase. (A caching sketch appears after the Priority & Fallback section below.)

1. **DeepCache** — Reuses prior denoiser outputs on selected steps in UNet models (SD1.5, SDXL)
2. **First Block Cache (FBCache)** — Flux-oriented cache machinery available for specialized integration work

**When to use:**

- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast

**Trade-offs:** Reduced fine detail if the interval is too high, slight VRAM increase for cached tensors.

[→ Full WaveSpeed guide](wavespeed.md)

---

## Priority & Fallback System

LightDiffusion-Next automatically selects the best available attention backend at runtime:

```
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```

If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging. Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.
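A minimal sketch of how such a priority chain can be resolved once at startup. The SpargeAttn module and function names are hypothetical placeholders; `sageattention.sageattn` and `xformers.ops.memory_efficient_attention` are real upstream entry points, but the backends do not share one call signature, so real code would also wrap each in an adapter.

```python
import os

import torch.nn.functional as F


def _try_import(module_name, attr):
    """Return the named function if its package is installed, else None."""
    try:
        module = __import__(module_name, fromlist=[attr])
        return getattr(module, attr)
    except (ImportError, AttributeError):
        return None


def pick_attention_backend():
    # Debugging escape hatch documented above.
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return F.scaled_dot_product_attention
    # Priority chain: SpargeAttn > SageAttention > xformers > PyTorch SDPA.
    candidates = [
        ("spas_sage_attn", "spas_sage_attn_meansim_cuda"),  # hypothetical SpargeAttn entry point
        ("sageattention", "sageattn"),
        ("xformers.ops", "memory_efficient_attention"),
    ]
    for module_name, attr in candidates:
        backend = _try_import(module_name, attr)
        if backend is not None:
            return backend
    return F.scaled_dot_product_attention


# Resolved once at startup, then used inside the attention layers.
attention_fn = pick_attention_backend()
```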
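As for the SageAttention kernel itself (described earlier), it is designed as a drop-in replacement for PyTorch's SDPA. A minimal sketch, assuming the upstream `sageattention` package; the exact signature can vary between releases:

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn  # pip install sageattention

# Typical diffusion attention shapes: (batch, heads, seq_len, head_dim), fp16 on CUDA.
q = torch.randn(2, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out_ref = F.scaled_dot_product_attention(q, k, v)  # PyTorch baseline
out = sageattn(q, k, v, is_causal=False)           # quantized kernel, same layout
```

Because the kernel stays numerically close to the FP16 baseline, swapping it in requires no retraining or model changes.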
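For Stable-Fast, compilation is typically a one-call wrap of an existing pipeline. This sketch is modeled on the upstream `stable-fast` project's diffusers example; the module path, config fields, and model ID follow that project's conventions (and may differ between versions) rather than LightDiffusion-Next's own API:

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_cuda_graph = True  # amortize kernel-launch overhead for fixed shapes

pipe = compile(pipe, config)  # returns the compiled pipeline

# First call traces and compiles (expect 30-60s); later calls with identical
# resolution and batch size reuse the compiled graph.
image = pipe("a lighthouse at dusk", num_inference_steps=20).images[0]
```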
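And for WaveSpeed's DeepCache path, a conceptual sketch of interval-based feature reuse inside an Euler sampling loop. `full_forward` and `shallow_forward` are hypothetical stand-ins for a UNet split into deep and shallow blocks; the update rule is a plain Euler step in the k-diffusion convention:

```python
def sample_with_deepcache(model, x, sigmas, cache_interval=3):
    """Euler sampling with DeepCache-style feature reuse (conceptual sketch)."""
    deep_feats = None  # cached output of the UNet's deep (inner) blocks
    for i in range(len(sigmas) - 1):
        if deep_feats is None or i % cache_interval == 0:
            # Full pass: recompute and cache the deep features.
            denoised, deep_feats = model.full_forward(x, sigmas[i])
        else:
            # Cheap pass: run only the shallow blocks, reuse cached deep features.
            denoised = model.shallow_forward(x, sigmas[i], deep_feats)
        d = (x - denoised) / sigmas[i]           # derivative estimate
        x = x + d * (sigmas[i + 1] - sigmas[i])  # Euler step toward lower noise
    return x
```

Raising `cache_interval` trades fine detail for speed, which is exactly the `interval` knob tuned in the configurations below.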
## Recommended Configurations

### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)

```yaml
stable_fast: true    # Only for batch operations
sageattention: auto  # or spargeattn if available
deepcache:
  enabled: true
  interval: 3
  depth: 2
```

**Expected:** Maximum speedup for batch operations, some quality loss
**Note:** Disable `stable_fast` for single 20-step generations

### Balanced - Quick Generation (SD1.5, any VRAM)

```yaml
scheduler: ays      # Use AYS for 2x speedup
steps: 10           # Reduced from 20 (same quality with AYS)
stable_fast: false  # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true  # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1
```

**Expected:** ~2-3x speedup with minimal quality loss
**Note:** The AYS scheduler provides the main speedup; enable `stable_fast` only for batch jobs (50+ images)

### Quality-First (Flux)

```yaml
scheduler: ays_flux  # Optimized for Flux models
steps: 10            # Reduced from 15 (same quality with AYS)
stable_fast: false   # Not supported for Flux
sageattention: auto
prompt_cache_enabled: true
deepcache:
  enabled: true
  interval: 2
```

**Expected:** ~2x speedup with minimal quality impact

### Production API - High Volume (>8GB VRAM)

```yaml
stable_fast: true  # Only for sustained high-volume APIs
sageattention: auto
deepcache:
  enabled: false  # Avoid variability across batch sizes
keep_models_loaded: true
```

**Expected:** Consistent latency for repeated identical requests
**Note:** For low-volume or single-shot APIs, use `stable_fast: false`

## Hardware-Specific Tips

### RTX 30xx / 40xx (Ampere/Ada)

- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM — Stable-Fast graphs consume ~500MB

### RTX 50xx (Blackwell)

- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for the new CUDA architecture
- DeepCache is your best additional speedup

### A100 / H100 (Datacenter)

- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in the Stable-Fast config)

### Low VRAM (<8GB)

- **Always disable Stable-Fast** (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows

## Debugging & Profiling

Check which optimizations are active:

```bash
# View startup logs
grep -i "using\|enabled" logs/server.log

# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```

Monitor telemetry:

```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```

Disable individual optimizations to isolate issues:

```bash
export LD_DISABLE_SAGE_ATTENTION=1  # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1     # Skips compilation
export LD_DISABLE_WAVESPEED=1       # Disables all caching
```
## Further Reading

- [AYS Scheduler Deep Dive](ays-scheduler.md) — Theory, implementation, quality tuning
- [Prompt Caching Deep Dive](prompt-caching.md) — Implementation details, cache management, performance impact
- [SageAttention & SpargeAttn Deep Dive](sageattention.md) — Installation, technical details, head dimension handling
- [Stable-Fast Compilation Guide](stablefast.md) — Configuration, CUDA graphs, troubleshooting
- [WaveSpeed Caching Strategies](wavespeed.md) — DeepCache vs FBCache, tuning parameters, compatibility matrix
- [Performance Tuning](quirks.md) — VRAM management, slow first runs, recompilation fixes

---

Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.