# Token Merging (ToMe)

## Overview

Token Merging (ToMe) is a **performance optimization** that accelerates diffusion models by merging similar tokens in the attention mechanism. By identifying and combining redundant computations, ToMe achieves a **20-60% speedup** with minimal quality impact.

Unlike feature caching (DeepCache, WaveSpeed), ToMe shrinks the computational graph itself: fewer tokens means fewer attention operations, less memory bandwidth, and faster generation.

This is a **training-free**, **drop-in optimization** that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.
## How It Works

### The Token Redundancy Problem

Diffusion models process images as sequences of tokens (patches):

```
Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)
```

At each attention layer, **every token attends to every other token**:

$$
\text{Attention Cost} = O(N^2 \cdot D)
$$

Where:

- $N$ = number of tokens (e.g., 4096 for 512×512)
- $D$ = embedding dimension (e.g., 768 or 1024)

**Key insight:** many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly identical tokens is redundant.
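To make the scaling concrete, here is a quick back-of-the-envelope calculation using the numbers above. The operation count is only a proxy for wall-clock time, but it shows why reducing the token count pays off quadratically:

```python
# Rough per-layer attention cost, using the O(N^2 * D) model from above.
N, D = 4096, 768                 # tokens for 512x512, SD1.5 embedding dim

full_cost = N**2 * D             # ~1.29e10 multiply-accumulates
merged_cost = (N // 2)**2 * D    # after halving the token count: ~3.22e9

print(f"reduction: {full_cost / merged_cost:.1f}x")  # 4.0x
```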
### The ToMe Solution

Token Merging reduces redundancy through **bipartite matching**:

```
Step 1: Split tokens into two sets
┌───────────────────────┬──────────────────────┐
│ Destination Set (dst) │ Source Set (src)     │
│ [Token 1, 3, 5, ...]  │ [Token 2, 4, 6, ...] │
└───────────────────────┴──────────────────────┘

Step 2: Compute similarity (cosine similarity)
dst[0] ↔ src[0]: 0.92  (highly similar!)
dst[0] ↔ src[1]: 0.34
dst[0] ↔ src[2]: 0.18
...

Step 3: Merge the most similar pairs
merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens
4096 tokens → 2048 tokens (50% merge ratio)
Attention cost reduced by ~4x
```

This happens **per attention layer**; which layers are patched is controlled by the `tome_max_downsample` setting (see Configuration below).
## Configuration

### Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `tome_enabled` | bool | `False` | - | Enable Token Merging |
| `tome_ratio` | float | `0.5` | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| `tome_max_downsample` | int | `1` | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |

### Choosing `tome_max_downsample`

Controls which UNet layers apply ToMe:

| Value | Layers Affected | Speed vs. Quality |
|-------|-----------------|-------------------|
| **1** | Only full-resolution layers (4/15) | Conservative, minimal quality impact |
| **2** | Half-resolution layers (8/15) | Balanced (recommended) |
| **4** | Quarter-resolution layers (12/15) | Aggressive |
| **8** | All layers (15/15) | Maximum speedup, noticeable quality loss |

**Recommendation:** Start with `max_downsample=1`. Only increase it if you need more speedup and can tolerate the quality reduction.
## Usage

### Streamlit UI

Enable in the **Token Merging (ToMe)** expander:

1. Check **Enable Token Merging**
2. Select a preset:
   - **Conservative** → 30% merge, max_downsample=2 (minimal impact)
   - **Balanced** → 50% merge, max_downsample=1 (recommended)
   - **Aggressive** → 70% merge, max_downsample=1 (maximum speed)
   - **Custom** → manual slider control
3. Generate images → the console confirms activation

**Visual feedback:**

```
✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1
```
### REST API

Include the ToMe fields in your generation request:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": true,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1
  }'
```
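The same request from Python is handy for scripts that drive the server programmatically. A sketch assuming the endpoint above is reachable and `requests` is installed; the response schema depends on the server:

```python
import requests

payload = {
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": True,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1,
}

# Same endpoint as the curl example above.
resp = requests.post("http://localhost:7861/api/generate", json=payload)
resp.raise_for_status()
print(resp.json())  # response schema depends on the server
```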
### Python API

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4,  # generate multiple images faster
)
```
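To confirm the speedup on your own hardware, here is a minimal A/B timing sketch around the same `pipeline` call (keyword arguments as documented above; run each variant at least twice so model loading and warm-up don't skew the first measurement):

```python
import time

from src.user.pipeline import pipeline

def timed_run(tome_enabled: bool) -> float:
    """Return wall-clock seconds for one generation."""
    start = time.perf_counter()
    pipeline(
        prompt="a detailed fantasy castle on a cliff",
        w=768,
        h=1024,
        steps=30,
        tome_enabled=tome_enabled,
        tome_ratio=0.5,
        tome_max_downsample=1,
    )
    return time.perf_counter() - start

baseline = timed_run(False)
with_tome = timed_run(True)
print(f"baseline {baseline:.1f}s vs ToMe {with_tome:.1f}s "
      f"({baseline / with_tome:.2f}x)")
```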
## Troubleshooting

### "No speedup detected"

**Possible causes:**

1. **tomesd not installed** → install it with `pip install tomesd`
2. **Other bottlenecks** → enable only ToMe for isolated testing
3. **Very low resolution** → ToMe benefits are minimal below 512px

**Solutions:**

```bash
# Check installation
python -c "import tomesd; print('ToMe available')"

# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py
```
| ### "Images look blurry or soft" | |
| **Cause:** `tome_ratio` too high (>0.6) or `max_downsample` too aggressive (>2). | |
| **Solutions:** | |
| - Reduce `tome_ratio` to 0.4-0.5 | |
| - Lower `max_downsample` to 1 | |
| - Increase `steps` to 30-35 for better convergence | |
| - Disable ToMe for final high-quality renders | |
| ### "Minimal speedup despite 70% merge" | |
| **Cause:** Other optimizations (DeepCache, Multi-Scale) already bottlenecked elsewhere (VAE decode, sampling overhead). | |
| **Solutions:** | |
| - Profile with isolated tests (disable all other optimizations) | |
| - Ensure GPU isn't memory-bound (reduce batch size) | |
| - Check system monitoring for CPU/disk bottlenecks | |
| ### "Model fails to load / tomesd errors" | |
| **Cause:** Outdated tomesd version or incompatible model architecture. | |
| **Solutions:** | |
| ```bash | |
| # Update tomesd | |
| pip install --upgrade tomesd | |
| # Check compatibility (ToMe only works with UNet-based models) | |
| # Flux/Transformer models require different ToMe variant (not yet supported) | |
| ``` | |
## Technical Details

### Implementation

ToMe is applied via the `ModelPatcher` class (`src/Model/ModelPatcher.py`):

```python
# src/Model/ModelPatcher.py (excerpt)
import tomesd

def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        pass

    # Apply the ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample,
    )
    self.tome_enabled = True
    return True
```

**Cache handling:** ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.
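The removal path is not shown above; a plausible counterpart is sketched below, assuming `ModelPatcher` keeps the `tome_enabled` flag set in `apply_tome` (`tomesd.remove_patch` is the library's real removal entry point):

```python
import tomesd

def remove_tome(self) -> bool:
    """Remove a previously applied Token Merging patch (sketch)."""
    try:
        # remove_patch restores the original attention forward passes.
        tomesd.remove_patch(self)
    except Exception:
        return False  # nothing was patched, or removal failed
    self.tome_enabled = False
    return True
```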
### Bipartite Matching Algorithm

ToMe uses **bipartite soft matching**:

1. **Partition tokens:**
$$
T_{\text{dst}}, T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2))
$$
2. **Compute the similarity matrix:**
$$
S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{\|T_{\text{dst}}[i]\| \cdot \|T_{\text{src}}[j]\|}
$$
3. **Find the top-$k$ matches:**
$$
k = \lfloor \text{ratio} \times |T_{\text{src}}| \rfloor
$$
4. **Merge tokens:**
$$
T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2}
$$
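These four steps translate almost directly into tensor code. Below is a simplified, self-contained sketch (PyTorch) that mirrors the math above rather than tomesd's actual implementation, which also handles batching, proportional attention, and unmerging:

```python
import torch

def merge_tokens(tokens: torch.Tensor, ratio: float) -> torch.Tensor:
    """Bipartite token merging on an (N, D) token matrix (simplified)."""
    # Step 1: partition into alternating destination / source sets.
    dst, src = tokens[0::2], tokens[1::2]

    # Step 2: cosine similarity between every src and every dst token.
    dst_n = dst / dst.norm(dim=-1, keepdim=True)
    src_n = src / src.norm(dim=-1, keepdim=True)
    sim = src_n @ dst_n.T                        # (N_src, N_dst)

    # Step 3: keep the k most mergeable src tokens (k = ratio * |src|).
    best_sim, best_dst = sim.max(dim=-1)         # best dst match per src token
    k = int(ratio * src.shape[0])
    merge_idx = best_sim.topk(k).indices         # src tokens to merge away

    # Step 4: average each merged src token into its dst match.
    # (If several src tokens pick the same dst, only the last write survives
    # here; tomesd resolves this properly with a scatter-reduce.)
    dst = dst.clone()
    dst[best_dst[merge_idx]] = (dst[best_dst[merge_idx]] + src[merge_idx]) / 2

    keep = torch.ones(src.shape[0], dtype=torch.bool)
    keep[merge_idx] = False
    return torch.cat([dst, src[keep]], dim=0)

tokens = torch.randn(4096, 768)
print(merge_tokens(tokens, ratio=0.5).shape)     # torch.Size([3072, 768])
```

Note that with $k = \lfloor \text{ratio} \times |T_{\text{src}}| \rfloor$, a 0.5 ratio removes half of the *source* set (a quarter of all tokens) in this sketch; tomesd's partitioning and ratio accounting differ slightly.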
## Compatibility

| Feature | Compatible? | Notes |
|---------|-------------|-------|
| **SD1.5 models** | ✅ | Full support, tested extensively |
| **SDXL models** | ✅ | Full support, larger speedup |
| **Flux models** | ❌ | UNet-specific; Transformer variant TBD |
| **All samplers** | ✅ | ToMe patches attention, agnostic to sampler |
| **CFG-Free** | ✅ | No interaction, both apply independently |
| **DeepCache** | ✅ | Excellent combination, speedups multiply |
| **Multi-Scale** | ✅ | Compatible, benefits stack |
| **HiRes Fix** | ✅ | Applied to all upscaling passes |
| **ADetailer** | ✅ | Applied to detail-enhancement passes |
| **Stable-Fast** | ✅ | Can combine for maximum speedup |
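Since DeepCache is the highest-leverage pairing in the table above, here is what a combined call could look like. This is a sketch: the ToMe keywords are the documented ones, but the DeepCache flag name below is illustrative only (see wavespeed.md for the real parameters):

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a cyberpunk cityscape at night, neon lights",
    w=1024,
    h=512,
    steps=25,
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    deepcache_enabled=True,  # hypothetical flag; check wavespeed.md
)
```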
## Limitations

1. **UNet-only:** Transformer architectures (Flux) use different attention patterns; a dedicated Transformer ToMe variant is needed
2. **Detail sensitivity:** High-frequency textures (fabric weave, individual hairs) see the most quality impact
3. **Diminishing returns:** Beyond a 60% merge ratio, quality degrades faster than speed improves
4. **One-time patch:** The merge ratio is not adapted dynamically during generation
## Related Optimizations

- **[DeepCache](wavespeed.md#deepcache):** Feature caching; complements ToMe, speedups multiply (~2.8x combined)
- **[Multi-Scale Diffusion](optimizations.md#multi-scale):** Resolution-based optimization; also reduces token count
- **[Stable-Fast](stablefast.md):** Compilation-based speedup; can combine for maximum performance
## References & Further Reading

- **Original Paper:** [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604) (Bolya & Hoffman, 2023)
- **tomesd Library:** https://github.com/dbolya/tomesd
- **ToMe for Vision Transformers:** https://github.com/facebookresearch/ToMe