# WaveSpeed Caching

## Overview

WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models, and the repository also contains groundwork for a Flux-oriented First Block Cache path.

LightDiffusion-Next contains two WaveSpeed-related implementations:

1. **DeepCache** – Integrated for UNet-based models (SD1.5, SDXL)
2. **First Block Cache (FBCache)** – Flux-oriented cache machinery present in the codebase

Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.
## How It Works

### Core Insight

Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:

- **High-level features** (semantic structure, composition) change slowly across steps
- **Low-level features** (fine details, textures) require frequent updates

WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.
### DeepCache (UNet Models) {#deepcache}

DeepCache is the integrated WaveSpeed path for UNet models.

**Cache step (every N steps):**

1. Run the full denoiser path
2. Store the output for later reuse

**Reuse step (intermediate steps):**

1. Reuse the cached denoiser output
2. Skip the full model recomputation for that step

**Speedup:** ~50-70% time saved per reuse step → 2-3x total speedup with `interval=3`
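The cache/reuse schedule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's actual API; `is_cache_step` is a hypothetical helper:

```python
def is_cache_step(step: int, interval: int) -> bool:
    """A full forward pass runs on step 0 and every `interval`-th step after."""
    return step % interval == 0

# With interval=3, a 9-step run executes the full model on 3 steps
# and reuses the cache on the other 6.
schedule = ["full" if is_cache_step(s, 3) else "reuse" for s in range(9)]
print(schedule)
```

Only the "full" steps pay for a complete denoiser pass, which is where the per-step savings on reuse steps come from.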
### First Block Cache (Flux Models)

Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:

```
┌─────────────────────────────────────────┐
│ First Transformer Block (always run)    │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar)    │ ← FBCache caching zone
└─────────────────────────────────────────┘
```
**Cache decision logic:**

1. Run first Transformer block
2. Compare output to previous step's output
3. If difference < threshold: reuse cached remaining blocks
4. If difference ≥ threshold: run all blocks and update cache

In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.
## DeepCache Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_interval` | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| `cache_depth` | int | 2 | UNet depth for caching (0-12, higher = more aggressive) |
| `start_step` | int | 0 | Timestep to start caching (0-1000) |
| `end_step` | int | 1000 | Timestep to stop caching (0-1000) |
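A small helper can catch out-of-range values before they reach the server. This is a hypothetical convenience wrapper, not part of the codebase; the ranges are the ones documented in the table above:

```python
from dataclasses import dataclass


@dataclass
class DeepCacheConfig:
    cache_interval: int = 3   # steps between cache refreshes (>= 1)
    cache_depth: int = 2      # UNet depth used for caching (0-12)
    start_step: int = 0       # timestep to start caching (0-1000)
    end_step: int = 1000      # timestep to stop caching (0-1000)

    def validate(self) -> None:
        if self.cache_interval < 1:
            raise ValueError("cache_interval must be >= 1")
        if not 0 <= self.cache_depth <= 12:
            raise ValueError("cache_depth must be in 0-12")
        if not 0 <= self.start_step <= self.end_step <= 1000:
            raise ValueError("need 0 <= start_step <= end_step <= 1000")


DeepCacheConfig().validate()  # the documented defaults are valid
```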
### Streamlit UI

Enable in the **⚡ DeepCache Acceleration** expander:

1. Check **Enable DeepCache**
2. Adjust sliders:
   - **Cache Interval**: 1-10 (default: 3)
   - **Cache Depth**: 0-12 (default: 2)
   - **Start/End Steps**: 0-1000 (default: 0/1000)
3. Generate images – caching applies transparently
### REST API

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": true,
    "deepcache_interval": 3,
    "deepcache_depth": 2
  }'
```
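The same request from Python, using only the standard library. This sketch assumes the server is running locally on port 7861 as in the curl example; the helper names are illustrative:

```python
import json
import urllib.request


def build_payload(prompt: str, **options) -> dict:
    """Assemble the JSON body for /api/generate."""
    return {"prompt": prompt, **options}


def generate(payload: dict, url: str = "http://localhost:7861/api/generate") -> bytes:
    """POST the payload and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


payload = build_payload(
    "a misty forest at twilight",
    width=768,
    height=512,
    deepcache_enabled=True,
    deepcache_interval=3,
    deepcache_depth=2,
)
# generate(payload)  # requires a running server
```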
### Recommended Presets

#### Balanced (Default)

```yaml
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
```

- **Speedup:** 2-2.3x
- **Quality loss:** Very slight (1-2%)
- **Use case:** Everyday generation

#### Maximum Speed

```yaml
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
```

- **Speedup:** 2.5-3x
- **Quality loss:** Noticeable (5-7%)
- **Use case:** Rapid prototyping, batch jobs

#### Maximum Quality

```yaml
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
```

- **Speedup:** 1.5-2x
- **Quality loss:** Minimal (<1%)
- **Use case:** Final renders, client work

#### Partial Caching (Critical Steps Only)

```yaml
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
```

- **Speedup:** 1.8-2.2x
- **Quality loss:** Minimal
- **Use case:** Preserve early structure and late details
## First Block Cache (FBCache) Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `residual_diff_threshold` | float | 0.05 | Max feature difference to trigger cache reuse (0.0-1.0) |
### Usage

First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:

```python
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
```
### Tuning Threshold

- **Lower threshold (0.01-0.03)**: Stricter caching, recomputes more often, higher quality
- **Higher threshold (0.05-0.1)**: Looser caching, reuses more often, higher speedup
- **Recommended:** 0.05 (balances quality and speed)
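The threshold compares the mean absolute change in the first block's output against the output's own magnitude. A minimal sketch using plain Python lists (the real implementation operates on tensors, but the criterion is the same):

```python
def should_reuse(current, previous, threshold=0.05):
    """Return True when the relative feature change is below the threshold."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(c) for c in current)
    return (diff / scale) < threshold


prev = [1.0, -2.0, 3.0]
print(should_reuse([1.01, -2.01, 3.01], prev))  # tiny change  -> True  (reuse cache)
print(should_reuse([1.50, -2.50, 3.50], prev))  # large change -> False (recompute)
```

Raising the threshold makes the first case more common, trading quality for speed, which is exactly the tuning trade-off listed above.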
## Performance

### Speedup Guidance

Speedup scales with cache interval and depth:

| Model | Cache Interval | Expected Behavior |
|-------|---------------|-------------------|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |

**Performance varies based on:**

- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps

**Recommendation:** Start with `interval=3` and adjust based on your quality requirements.

### VRAM Impact
Caching increases VRAM usage slightly (50-200 MB depending on resolution):

| Model | Baseline VRAM | + DeepCache | Increase |
|-------|--------------|-------------|----------|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |
## Stacking with Other Optimizations

WaveSpeed is **fully compatible** with SageAttention, SpargeAttn, and Stable-Fast:

### DeepCache + SageAttention

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
```

**Result:** 2.2x (DeepCache) × 1.15 (SageAttention) = **~2.5x total speedup**
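The combined estimate is simple multiplication of the individual factors. A quick check (the factors are the figures quoted in this document, not measurements):

```python
# Multiplicative speedup estimate: independent optimizations compound.
deepcache_speedup = 2.2       # caching factor quoted above
sage_attention_speedup = 1.15  # attention kernel factor quoted above

combined = deepcache_speedup * sage_attention_speedup
print(round(combined, 2))  # ~2.53, i.e. roughly 2.5x
```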
### DeepCache + SpargeAttn

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Enhanced speedup from caching and sparse attention

### DeepCache + Stable-Fast + SpargeAttn

```yaml
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Maximum combined speedup (all optimizations active, batch operations only)
## Compatibility

### DeepCache Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast

### DeepCache NOT Compatible With

- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)

### FBCache Compatible With

- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features

### FBCache NOT Compatible With

- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux not supported by Stable-Fast)
## Troubleshooting

### No Speedup Observed

**Causes:**

1. DeepCache disabled or not applied to the correct model type
2. Cache interval too low (`interval=1` provides no caching)
3. Model loaded incorrectly

**Fixes:**

```bash
# Check logs for DeepCache activation
grep -i "deepcache\|cache" logs/server.log

# Verify UI toggle is enabled
# Streamlit: Check "Enable DeepCache" checkbox
# API: Ensure "deepcache_enabled": true in payload

# Try a higher interval
deepcache_interval: 3  # Instead of 1 or 2
```
### Quality Degradation

**Symptoms:**

- Blurry details
- Smoothed textures
- Loss of fine patterns

**Causes:**

1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (Flux using DeepCache)

**Fixes:**

```yaml
# Reduce cache interval
deepcache_interval: 2  # Down from 5

# Reduce cache depth
deepcache_depth: 1  # Down from 3

# Disable caching for critical phases
deepcache_start_step: 200  # Skip early structure formation
deepcache_end_step: 800    # Skip late detail refinement
```

### Artifacts in Img2Img

**Symptom:** Visible seams, inconsistent styles when using DeepCache with Img2Img.

**Cause:** Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.

**Fix:** Disable DeepCache for Img2Img:

```yaml
deepcache_enabled: false  # When img2img_enabled: true
```
### VRAM Increase

**Symptom:** OOM errors after enabling DeepCache.

**Cause:** Cached features consume additional VRAM.

**Fixes:**

1. Reduce batch size
2. Lower resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use a lower cache depth:

```yaml
deepcache_depth: 1  # Minimal caching
```
### Flux FBCache Not Working

**Symptom:** No speedup with Flux generation.

**Cause:** The cache may never be hit if the residual threshold is too strict for the model and prompt – check the logs for the cache hit rate.

**Debugging:**

```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Check cache statistics
grep "cache" logs/server.log
```

If there are no cache hits, try raising the threshold:

```python
# In pipeline.py
residual_diff_threshold=0.1  # Increase from 0.05 for more cache reuse
```
## Quality Comparison

Visual impact of different cache intervals:

| Interval | Speed | Visual Difference |
|----------|-------|-------------------|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |

**Recommendation:** Start with `interval=3` and adjust based on visual results.
## Technical Details

### DeepCache Implementation

Simplified pseudocode:

```python
class DeepCacheWrapper:
    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth  # UNet depth at which features are cached
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        is_cache_step = (self.current_step % self.interval == 0)
        if is_cache_step or self.cached_output is None:
            # Run full model, cache output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse cached output (skip expensive computation)
            output = self.cached_output
        self.current_step += 1
        return output

    def reset(self):
        # Call between generations so stale features are never reused
        self.cached_output = None
        self.current_step = 0
```
Actual implementation in `src/WaveSpeed/deepcache_nodes.py` includes:

- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling and fallback to a full forward pass
### FBCache Residual Comparison

```python
# Compute first block output
first_output = first_transformer_block(hidden_states)

# Compare to previous step
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small → reuse cached blocks
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large → recompute all blocks
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)
```
## Best Practices

### For Everyday Use

1. **Enable DeepCache** with default settings (`interval=3`, `depth=2`)
2. **Stack with SageAttention** for 2.5x+ total speedup
3. **Disable for final client renders** if absolute quality is critical

### For Batch Processing

1. **Use aggressive caching** (`interval=5`, `depth=3`)
2. **Pre-generate previews** at high speed, re-render winners at full quality
3. **Disable TAESD previews** to avoid overhead (set `enable_preview=false`)

### For Low VRAM

1. **Use conservative caching** (`interval=2`, `depth=1`)
2. **Avoid stacking** with Stable-Fast CUDA graphs
3. **Monitor VRAM** via the `/api/telemetry` endpoint
## Citation

If you use WaveSpeed/DeepCache in your work:

```bibtex
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
```
## Resources

- [DeepCache Paper](https://arxiv.org/abs/2312.00858)
- [DeepCache Repository](https://github.com/horseee/DeepCache)
- [ComfyUI DeepCache Implementation](https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86) (reference for LightDiffusion-Next)
- [First Block Cache Discussion](https://github.com/comfyanonymous/ComfyUI/discussions/3491)