# Performance Optimizations

LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.

For a detailed source-based report on what is implemented today, including server-side throughput optimizations and practical implementation notes, see the [Implemented Optimizations Report](implemented-optimizations-report.md).

## Optimization Stack Overview

The pipeline orchestrates six primary acceleration paths:

| Technique | Type | Speedup | Quality Impact | Requirements |
|-----------|------|---------|----------------|---------------|
| [AYS Scheduler](#ays-scheduler) | Sampling schedule | ~2x | None/Better | All models |
| [Prompt Caching](#prompt-caching) | Embedding cache | 5-15% | None | All models |
| [SageAttention](#sageattention--spargeattn) | Attention kernel | Moderate | None | All CUDA GPUs |
| [SpargeAttn](#sageattention--spargeattn) | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| [Stable-Fast](#stable-fast) | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| [WaveSpeed](#wavespeed-caching) | Feature caching | High | Tunable | All models |

*Speedup depends heavily on batch size and generation count

These optimizations **work together**: enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.

## Quick Comparison

### AYS Scheduler

**What it does:** Uses research-backed optimal timestep distributions that allow equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on noise levels that contribute most to image formation.
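
The idea can be sketched, independently of this project's code, as interpolating a short list of research-derived sigma anchors onto the requested step count instead of spacing sigmas uniformly. The anchor values below are placeholders for illustration, not the published AYS tables:

```python
import numpy as np

def ays_like_schedule(anchor_sigmas, num_steps):
    """Interpolate a short list of sigma anchors (high -> low noise) onto
    `num_steps` sampling steps, using log-linear interpolation so the spacing
    follows the anchors' curvature rather than being uniform."""
    anchors = np.asarray(anchor_sigmas, dtype=np.float64)
    anchor_pos = np.linspace(0.0, 1.0, len(anchors))   # where the anchors sit
    step_pos = np.linspace(0.0, 1.0, num_steps)        # where the steps sit
    return np.exp(np.interp(step_pos, anchor_pos, np.log(anchors)))

# Placeholder anchors (illustrative only); real AYS tables are model-specific.
print(ays_like_schedule([14.6, 6.5, 3.0, 1.4, 0.6, 0.25, 0.03], num_steps=10))
```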

**When to use:**
- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps

**Trade-offs:** Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.

[→ Full AYS Scheduler guide](ays-scheduler.md)

---

### Prompt Caching

**What it does:** Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
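
A minimal sketch of the mechanism, assuming a hypothetical `encode_fn` that wraps the CLIP text encoder; the project's actual cache keys, size limits and eviction policy may differ:

```python
import hashlib
from collections import OrderedDict

class PromptEmbeddingCache:
    """Tiny LRU cache for text-encoder outputs, keyed by the prompt string."""

    def __init__(self, encode_fn, max_entries=256):
        self.encode_fn = encode_fn          # wrapper around the CLIP text encoder
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, prompt: str):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)    # cache hit: mark as recently used
            return self._store[key]
        embedding = self.encode_fn(prompt)  # cache miss: run the text encoder
        self._store[key] = embedding
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False) # evict the least recently used entry
        return embedding
```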

**When to use:**
- Batch generation with same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts

**Trade-offs:** None. Memory overhead is minimal (~50-200MB), CPU cost is negligible, and caching is enabled automatically by default.

[→ Full Prompt Caching guide](prompt-caching.md)

---

### SageAttention & SpargeAttn {#sageattention--spargeattn}

**What it does:** Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention quantizes the query/key tensors to INT8 while keeping the value path and accumulation in FP16. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
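
As an illustration of how an attention override with graceful fallback can be wired, using the `sageattn` kernel exported by the SageAttention package (the wrapper itself is a sketch, not the project's integration code):

```python
import torch.nn.functional as F

try:
    from sageattention import sageattn   # INT8 Q/K attention kernel
    HAS_SAGE = True
except ImportError:
    HAS_SAGE = False

def attention(q, k, v):
    """Use SageAttention when available, otherwise fall back to PyTorch SDPA."""
    if HAS_SAGE and q.is_cuda:
        try:
            return sageattn(q, k, v, is_causal=False)
        except Exception:
            pass  # e.g. unsupported head dimension: fall through to SDPA
    return F.scaled_dot_product_attention(q, k, v)
```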

**When to use:**
- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs and post-processing stages

**Trade-offs:** None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).

[→ Full SageAttention/SpargeAttn guide](sageattention.md)

---

### CFG++ Samplers {#cfg-samplers}

**What it does:** CFG++ samplers incorporate Classifier-Free Guidance directly into the sampling process rather than applying it as a post-hoc extrapolation, providing better quality and stability compared to standard CFG.
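
A rough sketch of the distinction, following the published CFG++ formulation rather than this project's exact sampler code: the denoised estimate uses the guided noise prediction (with a guidance weight typically in [0, 1]), while the re-noising term uses the unconditional prediction:

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance combination of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def ddim_step_cfgpp(x_t, eps_uncond, eps_cond, sigma_t, sigma_next, lam):
    """One DDIM-style step in the CFG++ spirit (illustrative only).
    All arguments except the floats are latent/noise tensors."""
    eps_guided = cfg_noise(eps_uncond, eps_cond, lam)   # lam typically in [0, 1]
    denoised = x_t - sigma_t * eps_guided               # guided x0 estimate
    return denoised + sigma_next * eps_uncond           # re-noise with the unconditional eps
```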

---

### Multi-Scale Diffusion {#multi-scale}

Multi-Scale Diffusion improves performance by processing the image at multiple resolutions during generation, reducing the amount of computation spent at the highest resolution.
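
A purely illustrative sketch of the general idea (not the project's implementation): run the early, structure-defining denoising steps on a downscaled latent and switch to full resolution only for the final refinement steps.

```python
import torch.nn.functional as F

def multiscale_denoise(latent, denoise_fn, sigmas, low_res_fraction=0.6, scale=0.5):
    """Illustrative multi-scale loop; `denoise_fn(x, sigma)` stands in for one
    denoising step of the actual sampler."""
    n_low = int(len(sigmas) * low_res_fraction)
    full_size = latent.shape[-2:]

    # Phase 1: coarse structure at reduced latent resolution (cheaper per step).
    x = F.interpolate(latent, scale_factor=scale, mode="bilinear", align_corners=False)
    for sigma in sigmas[:n_low]:
        x = denoise_fn(x, sigma)

    # Phase 2: upsample and refine details at full resolution.
    x = F.interpolate(x, size=full_size, mode="bilinear", align_corners=False)
    for sigma in sigmas[n_low:]:
        x = denoise_fn(x, sigma)
    return x
```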

**When to use:**
- High-resolution generation (>1024px)
- When memory is limited
- For faster previews

**Trade-offs:** May reduce detail in fine areas.

**Note:** In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup, an effect attributable to the upsampling pass acting as an extra refinement.

---

### Stable-Fast

**What it does:** JIT-compiles the UNet diffusion model into optimized TorchScript with optional CUDA graphs. The first forward pass traces execution, caches kernel launches and fuses operators for reduced overhead.
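
LightDiffusion-Next toggles this internally, but applying the upstream stable-fast library by hand typically looks roughly like the following (sketch based on stable-fast's public API; the model path is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/sd15-checkpoint",            # placeholder: local folder or hub id
    torch_dtype=torch.float16,
).to("cuda")

config = CompilationConfig.Default()
config.enable_cuda_graph = True           # capture kernel launches after tracing
pipe = compile(pipe, config)

# The first call pays the 30-60s tracing/compilation cost; later calls with
# identical shapes and batch sizes reuse the optimized graph.
image = pipe("a lighthouse at dusk", num_inference_steps=20).images[0]
```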

**When to use:**
- **Systems with >8GB VRAM** (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where 30-60s compilation amortizes over time
- Fixed resolutions and batch sizes

**When NOT to use:**
- Normal 20-step single image generation (compilation overhead > speedup gains)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes

**Trade-offs:** Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.

[→ Full Stable-Fast guide](stablefast.md)

---

### WaveSpeed Caching

**What it does:** Exploits temporal redundancy in diffusion processes by reusing work across denoising steps. In the current project stack this primarily means DeepCache on supported UNet models, with additional Flux-oriented cache groundwork present in the codebase.

1. **DeepCache** - Reuses prior denoiser outputs on selected steps in UNet models (SD1.5, SDXL)
2. **First Block Cache (FBCache)** - Flux-oriented cache machinery available for specialized integration work
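
In pseudocode, the DeepCache pattern amounts to refreshing the expensive deep-block output only every `interval` steps and reusing it in between; this is a simplified illustration, not the project's actual UNet hooks:

```python
class DeepCacheLikeUNet:
    """Illustrative sketch: shallow blocks run every step, deep blocks are
    recomputed only every `interval` steps and reused otherwise."""

    def __init__(self, shallow_down, deep_blocks, shallow_up, interval=3):
        self.shallow_down = shallow_down   # cheap outer down-blocks (always run)
        self.deep_blocks = deep_blocks     # expensive mid/deep blocks (cached)
        self.shallow_up = shallow_up       # cheap outer up-blocks (always run)
        self.interval = interval
        self._cache = None
        self._step = 0

    def __call__(self, h):
        h, skips = self.shallow_down(h)
        if self._step % self.interval == 0 or self._cache is None:
            self._cache = self.deep_blocks(h)       # refresh the deep features
        self._step += 1
        return self.shallow_up(self._cache, skips)  # reuse them on other steps
```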

**When to use:**
- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast

**Trade-offs:** Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.

[→ Full WaveSpeed guide](wavespeed.md)

---

## Priority & Fallback System

LightDiffusion-Next automatically selects the best available attention backend at runtime:

```
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```

If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging.
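
A sketch of what the selection amounts to (the probing below and the module names are illustrative assumptions, apart from the `LD_DISABLE_SAGE_ATTENTION` variable mentioned above):

```python
import importlib
import os

def pick_attention_backend():
    """Return the name of the best available attention backend, mirroring the
    priority order shown above."""
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "sdpa"                      # explicit override for debugging
    candidates = [
        ("spargeattn", "spas_sage_attn"),  # module name assumed for illustration
        ("sageattention", "sageattention"),
        ("xformers", "xformers"),
    ]
    for backend, module in candidates:
        try:
            importlib.import_module(module)
            return backend
        except ImportError:
            continue
    return "sdpa"                          # PyTorch scaled_dot_product_attention
```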

Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.

## Recommended Configurations

### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)
```yaml
stable_fast: true  # Only for batch operations
sageattention: auto  # or spargeattn if available
deepcache:
  enabled: true
  interval: 3
  depth: 2
```
**Expected:** Maximum speedup for batch operations, some quality loss
**Note:** Disable stable_fast for single 20-step generations

### Balanced - Quick Generation (SD1.5, any VRAM)
```yaml
scheduler: ays  # NEW: Use AYS for 2x speedup
steps: 10  # Reduced from 20 (same quality with AYS)
stable_fast: false  # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true  # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1
```
**Expected:** ~2-3x speedup with minimal quality loss
**Note:** AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images)

### Quality-First (Flux)
```yaml
scheduler: ays_flux  # NEW: Optimized for Flux models
steps: 10  # Reduced from 15 (same quality with AYS)
stable_fast: false  # not supported
sageattention: auto
prompt_cache_enabled: true
deepcache:
  enabled: true
  interval: 2
```
**Expected:** ~2x speedup with minimal quality impact

### Production API - High Volume (>8GB VRAM)
```yaml
stable_fast: true  # Only for sustained high-volume APIs
sageattention: auto
deepcache:
  enabled: false  # avoid variability across batch sizes
keep_models_loaded: true
```
**Expected:** Consistent latency for repeated identical requests
**Note:** For low-volume or single-shot APIs, use `stable_fast: false`

## Hardware-Specific Tips

### RTX 30xx / 40xx (Ampere/Ada)
- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM: Stable-Fast graphs consume ~500MB

### RTX 50xx (Blackwell)
- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for new CUDA arch
- DeepCache is your best additional speedup

### A100 / H100 (Datacenter)
- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in Stable-Fast config)

### Low VRAM (<8GB)
- **Always disable Stable-Fast** (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows

## Debugging & Profiling

Check which optimizations are active:

```bash
# View startup logs
cat logs/server.log | grep -i "using\|enabled"

# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```

Monitor telemetry:

```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```

Disable individual optimizations to isolate issues:

```bash
export LD_DISABLE_SAGE_ATTENTION=1      # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1         # Skips compilation
export LD_DISABLE_WAVESPEED=1           # Disables all caching
```

## Further Reading
- [AYS Scheduler Deep Dive](ays-scheduler.md) - Theory, implementation, quality tuning
- [Prompt Caching Deep Dive](prompt-caching.md) - Implementation details, cache management, performance impact
- [SageAttention & SpargeAttn Deep Dive](sageattention.md) - Installation, technical details, head dimension handling
- [Stable-Fast Compilation Guide](stablefast.md) - Configuration, CUDA graphs, troubleshooting
- [WaveSpeed Caching Strategies](wavespeed.md) - DeepCache vs FBCache, tuning parameters, compatibility matrix
- [Performance Tuning](quirks.md) - VRAM management, slow first runs, recompilation fixes

---

Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.