# WaveSpeed Caching

## Overview

WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models, and the repository also contains groundwork for a Flux-oriented First Block Cache path.

LightDiffusion-Next contains two WaveSpeed-related implementations:

1. **DeepCache**: integrated for UNet-based models (SD1.5, SDXL)
2. **First Block Cache (FBCache)**: Flux-oriented cache machinery present in the codebase

Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.

## How It Works

### Core Insight

Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:

- **High-level features** (semantic structure, composition) change slowly across steps
- **Low-level features** (fine details, textures) require frequent updates

WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.

### DeepCache (UNet Models) {#deepcache}

DeepCache is the integrated WaveSpeed path for UNet models.

**Cache step (every N steps):**
1. Run the full denoiser path
2. Store the output for later reuse

**Reuse step (intermediate steps):**
1. Reuse the cached denoiser output
2. Skip the full model recomputation for that step

**Speedup:** ~50-70% time saved per reuse step → 2-3x total speedup with `interval=3`
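
The cache/reuse pattern above can be sketched as a simple schedule. This is an idealized model, not the project's code: `cache_schedule` and `estimated_speedup` are hypothetical helpers, and the fraction saved per reuse step is an input you would have to measure. Under this model the overall speedup can never exceed the interval itself, and `interval=1` gives no speedup at all.

```python
def cache_schedule(num_steps: int, interval: int) -> list[str]:
    """Label each step 'full' (cache refresh) or 'reuse'."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return ["full" if step % interval == 0 else "reuse" for step in range(num_steps)]


def estimated_speedup(num_steps: int, interval: int, saved_per_reuse: float) -> float:
    """Idealized speedup when each reuse step saves `saved_per_reuse` of a full step."""
    schedule = cache_schedule(num_steps, interval)
    cost = sum(1.0 if kind == "full" else 1.0 - saved_per_reuse for kind in schedule)
    return num_steps / cost


print(cache_schedule(6, 3))
# ['full', 'reuse', 'reuse', 'full', 'reuse', 'reuse']
print(round(estimated_speedup(30, 3, 0.6), 2))  # 1.67
print(round(estimated_speedup(30, 3, 1.0), 2))  # 3.0 (reuse steps cost nothing)
```

Measured speedups also depend on hardware, sampler, and resolution, so treat these numbers as a way to reason about the interval rather than as a prediction.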

### First Block Cache (Flux Models)

Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:

```
┌─────────────────────────────────────────┐
│ First Transformer Block (always run)    │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar)    │ ← FBCache caching zone
└─────────────────────────────────────────┘
```

**Cache decision logic:**
1. Run first Transformer block
2. Compare output to previous step's output
3. If difference < threshold: reuse cached remaining blocks
4. If difference ≥ threshold: run all blocks and update cache

In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.

## DeepCache Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_interval` | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| `cache_depth` | int | 2 | UNet depth for caching (0-12, higher = more aggressive) |
| `start_step` | int | 0 | Timestep to start caching (0-1000) |
| `end_step` | int | 1000 | Timestep to stop caching (0-1000) |
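
The ranges in the table can be checked up front before a generation starts. The following is a minimal sketch: `DeepCacheSettings` is a hypothetical container written for this page, not a class from the codebase, and the validation rules are simply the ranges listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepCacheSettings:
    """Hypothetical container mirroring the parameter table."""

    cache_interval: int = 3
    cache_depth: int = 2
    start_step: int = 0
    end_step: int = 1000

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges
        if self.cache_interval < 1:
            raise ValueError("cache_interval must be at least 1")
        if not 0 <= self.cache_depth <= 12:
            raise ValueError("cache_depth must be in 0-12")
        if not 0 <= self.start_step <= self.end_step <= 1000:
            raise ValueError("expected 0 <= start_step <= end_step <= 1000")


defaults = DeepCacheSettings()
print(defaults)
```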

### Streamlit UI

Enable in the **⚡ DeepCache Acceleration** expander:

1. Check **Enable DeepCache**
2. Adjust sliders:
   - **Cache Interval**: 1-10 (default: 3)
   - **Cache Depth**: 0-12 (default: 2)
   - **Start/End Steps**: 0-1000 (default: 0/1000)
3. Generate images; caching applies transparently

### REST API

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "a misty forest at twilight",
        "width": 768,
        "height": 512,
        "deepcache_enabled": true,
        "deepcache_interval": 3,
        "deepcache_depth": 2
      }'
```
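
The same request can be built from Python with only the standard library. This sketch reuses the endpoint and payload fields from the curl example; `build_generate_request` is a hypothetical helper, and the response format is not documented here, so the code only constructs the request and leaves sending it as a commented step.

```python
import json
import urllib.request

API_URL = "http://localhost:7861/api/generate"  # Same endpoint as the curl example


def build_generate_request(prompt: str, interval: int = 3, depth: int = 2) -> urllib.request.Request:
    """Build (but do not send) a POST request with DeepCache enabled."""
    payload = {
        "prompt": prompt,
        "width": 768,
        "height": 512,
        "deepcache_enabled": True,
        "deepcache_interval": interval,
        "deepcache_depth": depth,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("a misty forest at twilight")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```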

### Recommended Presets

#### Balanced (Default)
```yaml
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
```
- **Speedup:** 2-2.3x
- **Quality loss:** Very slight (1-2%)
- **Use case:** Everyday generation

#### Maximum Speed
```yaml
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
```
- **Speedup:** 2.5-3x
- **Quality loss:** Noticeable (5-7%)
- **Use case:** Rapid prototyping, batch jobs

#### Maximum Quality
```yaml
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
```
- **Speedup:** 1.5-2x
- **Quality loss:** Minimal (<1%)
- **Use case:** Final renders, client work

#### Partial Caching (Critical Steps Only)
```yaml
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
```
- **Speedup:** 1.8-2.2x
- **Quality loss:** Minimal
- **Use case:** Preserve early structure, late details
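
To see what the partial-caching window means in practice, the sketch below marks which timesteps of an illustrative schedule fall inside it. It assumes that caching applies only when the timestep lies within `[start_step, end_step]` (inclusive); the exact boundary handling in the codebase may differ, and `caching_active` is a hypothetical helper.

```python
def caching_active(timestep: int, start_step: int = 200, end_step: int = 800) -> bool:
    """Assumed rule: caching applies only inside [start_step, end_step]."""
    return start_step <= timestep <= end_step


# Illustrative 10-step schedule over the 0-1000 timestep range, high noise first
timesteps = [999, 888, 777, 666, 555, 444, 333, 222, 111, 0]

for t in timesteps:
    print(t, "cache allowed" if caching_active(t) else "always full")
```

With these values, the first two steps (structure formation) and the last two (detail refinement) always run the full model.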

## First Block Cache (FBCache) Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `residual_diff_threshold` | float | 0.05 | Max feature difference to trigger cache reuse (0.0-1.0) |

### Usage

First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:

```python
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
```

### Tuning Threshold

- **Lower threshold (0.01-0.03)**: Stricter caching, recomputes more often, higher quality
- **Higher threshold (0.05-0.1)**: Looser caching, reuses more often, higher speedup
- **Recommended:** 0.05 (balances quality and speed)
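
The effect of the threshold can be illustrated by counting how many steps would qualify for reuse at each setting. The per-step differences below are made up for illustration, and `count_reuses` is a hypothetical helper; a real run would use the relative first-block differences observed during sampling.

```python
def count_reuses(relative_diffs: list[float], threshold: float) -> int:
    """Count steps whose first-block change is below the threshold."""
    return sum(1 for diff in relative_diffs if diff < threshold)


# Made-up per-step relative differences, for illustration only
diffs = [0.20, 0.08, 0.04, 0.03, 0.02, 0.06, 0.015, 0.09]

for threshold in (0.01, 0.03, 0.05, 0.1):
    print(threshold, count_reuses(diffs, threshold))
# 0.01 0
# 0.03 2
# 0.05 4
# 0.1 7
```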

## Performance

### Speedup Guidance

Speedup scales with cache interval and depth:

| Model | Cache Interval | Expected Behavior |
|-------|---------------|-------------------|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |

**Performance varies based on:**
- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps

**Recommendation:** Start with `interval=3` and adjust based on your quality requirements.

### VRAM Impact

Caching increases VRAM usage slightly (roughly 50-200 MB, depending on resolution):

| Model | Baseline VRAM | With caching | Increase |
|-------|--------------|-------------|----------|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |

## Stacking with Other Optimizations

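DeepCache is **fully compatible** with SageAttention, SpargeAttn, and Stable-Fast. FBCache works with SageAttention and SpargeAttn but not with Stable-Fast, which does not support Flux.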

### DeepCache + SageAttention

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
```

**Result:** 2.2x (DeepCache) × 1.15 (SageAttention) = **~2.5x total speedup**
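
The combined figure is the product of the individual factors. The sketch below makes that arithmetic explicit; it assumes the two optimizations compose independently, which is an approximation, and `combined_speedup` is a hypothetical helper.

```python
from math import prod


def combined_speedup(*factors: float) -> float:
    """Multiply individual speedup factors (assumes they compose independently)."""
    return prod(factors)


print(round(combined_speedup(2.2, 1.15), 2))  # 2.53
```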

### DeepCache + SpargeAttn

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Enhanced speedup from caching and sparse attention

### DeepCache + Stable-Fast + SpargeAttn

```yaml
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Maximum combined speedup (all optimizations active, batch operations only)

## Compatibility

### DeepCache Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast

### DeepCache NOT Compatible With

- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)

### FBCache Compatible With

- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features

### FBCache NOT Compatible With

- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux not supported by Stable-Fast)

## Troubleshooting

### No Speedup Observed

**Causes:**
1. DeepCache disabled or not applied to correct model type
2. Cache interval too low (interval=1 provides no caching)
3. Model loaded incorrectly

**Fixes:**
```bash
# Check logs for DeepCache activation (case-insensitive match on "cache")
grep -i "cache" logs/server.log

# Verify the toggle is enabled
# Streamlit: check the "Enable DeepCache" checkbox
# API: ensure "deepcache_enabled": true is in the payload

# Try a higher interval in the UI or request payload, for example:
# deepcache_interval: 3  (instead of 1 or 2)
```

### Quality Degradation

**Symptoms:**
- Blurry details
- Smoothed textures
- Loss of fine patterns

**Causes:**
1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (Flux using DeepCache)

**Fixes:**
```yaml
# Reduce cache interval
deepcache_interval: 2  # Down from 5

# Reduce cache depth
deepcache_depth: 1  # Down from 3

# Disable caching for critical phases
deepcache_start_step: 200  # Skip early structure formation
deepcache_end_step: 800    # Skip late detail refinement
```

### Artifacts in Img2Img

**Symptom:** Visible seams, inconsistent styles when using DeepCache with Img2Img.

**Cause:** Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.

**Fix:** Disable DeepCache for Img2Img:
```yaml
deepcache_enabled: false  # When img2img_enabled: true
```

### VRAM Increase

**Symptom:** OOM errors after enabling DeepCache.

**Cause:** Cached features consume additional VRAM.

**Fixes:**
1. Reduce batch size
2. Lower resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use lower cache depth:
   ```yaml
   deepcache_depth: 1  # Minimal caching
   ```

### Flux FBCache Not Working

**Symptom:** No speedup with Flux generation.

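**Cause:** FBCache only reuses work when the first-block output changes by less than the threshold, so a low cache hit rate yields little or no speedup. Check the logs for the cache hit rate.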

**Debugging:**
```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Check cache statistics
grep -i "cache" logs/server.log
```

If there are no cache hits, try raising the threshold:
```python
# In pipeline.py
residual_diff_threshold=0.1  # Increase from 0.05 for more cache reuse
```

## Quality Comparison

Visual impact of different cache intervals:

| Interval | Speed | Visual Difference |
|----------|-------|-------------------|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |

**Recommendation:** Start with `interval=3` and adjust based on visual results.

## Technical Details

### DeepCache Implementation

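Simplified sketch:

```python
import torch


class DeepCacheWrapper:
    """Simplified illustration: refresh the cache every `interval` steps."""

    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth  # Kept for completeness; unused in this simplified version
        self.cached_output = None
        self.current_step = 0

    def forward(self, x: torch.Tensor, timestep: torch.Tensor) -> torch.Tensor:
        # Step 0 is always a cache step, so the cache is filled before any reuse
        is_cache_step = (self.current_step % self.interval == 0)

        if is_cache_step or self.cached_output is None:
            # Run the full model and cache its output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse the cached output (skip the expensive computation)
            output = self.cached_output

        self.current_step += 1
        return output
```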

Actual implementation in `src/WaveSpeed/deepcache_nodes.py` includes:
- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling and fallback to full forward
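
As an illustration of the cache-invalidation point, the sketch below drops the cached output whenever the input shape changes. It is a hypothetical, framework-free stand-in, not the code in `deepcache_nodes.py`: `InvalidatingCache` and `fake_model` are invented names, and the shape is passed explicitly where a tensor-based version would read `x.shape`.

```python
class InvalidatingCache:
    """Hypothetical sketch: reuse an output only while the input shape is unchanged."""

    def __init__(self, model, interval: int):
        self.model = model
        self.interval = interval
        self.reset()

    def reset(self) -> None:
        """Drop cached state, for example at the start of a new generation."""
        self.cached_output = None
        self.cached_shape = None
        self.current_step = 0

    def __call__(self, x, shape: tuple):
        if shape != self.cached_shape:
            self.reset()  # Batch size or resolution changed: cached output is stale
        if self.current_step % self.interval == 0:
            self.cached_output = self.model(x)
            self.cached_shape = shape
        self.current_step += 1
        return self.cached_output


calls = []


def fake_model(x):
    """Stand-in denoiser that records each full run."""
    calls.append(x)
    return x * 2


cache = InvalidatingCache(fake_model, interval=3)
outputs = [cache(step, shape=(1, 4, 64, 64)) for step in range(5)]
cache(5, shape=(2, 4, 64, 64))  # Batch size changed, so the model runs again
print(calls)    # [0, 3, 5]
print(outputs)  # [0, 0, 0, 6, 6]
```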

### FBCache Residual Comparison

```python
# Compute first block output
first_output = first_transformer_block(hidden_states)

# Compare to previous step
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small: reuse cached blocks
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large: recompute all blocks
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)
```
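
The comparison above can be tried out without a model. The following sketch applies the same normalized mean-absolute-difference test to plain Python lists; `relative_change` and `should_reuse` are hypothetical helpers, and the sample values are invented. It also handles the first step, when no previous output exists yet.

```python
def relative_change(current: list[float], previous: list[float]) -> float:
    """Mean absolute difference, normalized by the mean magnitude of `current`."""
    diff = sum(abs(c - p) for c, p in zip(current, previous)) / len(current)
    scale = sum(abs(c) for c in current) / len(current)
    return diff / scale if scale > 0 else float("inf")


def should_reuse(current, previous, threshold: float = 0.05) -> bool:
    """Reuse cached blocks only when a previous output exists and the change is small."""
    if previous is None:
        return False  # First step: nothing cached yet
    return relative_change(current, previous) < threshold


previous = [1.00, -2.00, 0.50, 4.00]
small_shift = [1.01, -2.02, 0.50, 3.98]
large_shift = [1.50, -1.00, 0.90, 3.00]

print(should_reuse(small_shift, None))      # False
print(should_reuse(small_shift, previous))  # True
print(should_reuse(large_shift, previous))  # False
```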

## Best Practices

### For Everyday Use

1. **Enable DeepCache** with default settings (`interval=3`, `depth=2`)
2. **Stack with SageAttention** for 2.5x+ total speedup
3. **Disable for final client renders** if absolute quality is critical

### For Batch Processing

1. **Use aggressive caching** (`interval=5`, `depth=3`)
2. **Pre-generate previews** at high speed, re-render winners at full quality
3. **Disable TAESD previews** to avoid overhead (set `enable_preview=false`)

### For Low VRAM

1. **Use conservative caching** (`interval=2`, `depth=1`)
2. **Avoid stacking** with Stable-Fast CUDA graphs
3. **Monitor VRAM** via `/api/telemetry` endpoint

## Citation

If you use WaveSpeed/DeepCache in your work, please cite:

```bibtex
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
```

## Resources

- [DeepCache Paper](https://arxiv.org/abs/2312.00858)
- [DeepCache Repository](https://github.com/horseee/DeepCache)
- [ComfyUI DeepCache Implementation](https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86) (reference for LightDiffusion-Next)
- [First Block Cache Discussion](https://github.com/comfyanonymous/ComfyUI/discussions/3491)