# Token Merging (ToMe)

## Overview

Token Merging (ToMe) is a **performance optimization** that accelerates diffusion models by intelligently merging similar tokens in the attention mechanism. By identifying and combining redundant computations, ToMe achieves a **20-60% speedup** with minimal quality impact.

Unlike feature caching (DeepCache, WaveSpeed), ToMe reduces the computational graph itself — fewer tokens means fewer attention operations, less memory bandwidth, and faster generation.

This is a **training-free**, **drop-in optimization** that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.

## How It Works

### The Token Redundancy Problem

Diffusion models process images as sequences of tokens (patches):

```
Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)
```

At each attention layer, **every token attends to every other token**:

$$
\text{Attention Cost} = O(N^2 \cdot D)
$$

Where:

- $N$ = number of tokens (e.g., 4096 for 512×512)
- $D$ = embedding dimension (e.g., 768 or 1024)

**Key insight:** Many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly-identical tokens is redundant.

### The ToMe Solution

Token Merging reduces redundancy through **bipartite matching**:

```
Step 1: Split tokens into two sets
┌───────────────────────┬───────────────────────┐
│ Destination Set (dst) │ Source Set (src)      │
│ [Token 1, 3, 5, ...]  │ [Token 2, 4, 6, ...]  │
└───────────────────────┴───────────────────────┘

Step 2: Compute similarity (cosine similarity)
dst[0] ↔ src[0]: 0.92  (highly similar!)
dst[0] ↔ src[1]: 0.34
dst[0] ↔ src[2]: 0.18
...

Step 3: Merge the most similar pairs
merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens
4096 tokens → 2048 tokens (50% merge ratio)
Attention cost reduced by ~4x
```

This happens **per attention layer**; which layers are patched is controlled by the `tome_max_downsample` setting described below.

## Configuration

### Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `tome_enabled` | bool | `False` | - | Enable Token Merging |
| `tome_ratio` | float | `0.5` | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| `tome_max_downsample` | int | `1` | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |

### Choosing `tome_max_downsample`

Controls which UNet layers apply ToMe:

| Value | Layers Affected | Speed vs Quality |
|-------|-----------------|------------------|
| **1** | Only full-resolution layers (4/15) | Conservative, minimal quality impact |
| **2** | Half-resolution layers (8/15) | Balanced (recommended) |
| **4** | Quarter-resolution layers (12/15) | Aggressive |
| **8** | All layers (15/15) | Maximum speedup, noticeable quality loss |

**Recommendation:** Start with `max_downsample=1`. Only increase it if you need more speedup and can tolerate the quality reduction.
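For orientation, the two parameters above correspond directly to arguments of the underlying `tomesd` library. The snippet below is a minimal sketch and not part of this project's code: it assumes a plain Hugging Face `diffusers` pipeline, uses a placeholder checkpoint path, and leaves every other `tomesd` option at its default.

```python
# Illustrative sketch: how tome_ratio / tome_max_downsample map onto the
# underlying tomesd calls when applied to a plain diffusers pipeline.
# The checkpoint path is a placeholder; use any SD1.5/SDXL checkpoint.
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/sd15-checkpoint", torch_dtype=torch.float16
).to("cuda")

# ratio / max_downsample correspond to tome_ratio / tome_max_downsample above;
# all other tomesd options keep their defaults.
tomesd.apply_patch(pipe, ratio=0.5, max_downsample=1)

image = pipe("a cyberpunk cityscape at night, neon lights").images[0]

tomesd.remove_patch(pipe)  # undo the patch when ToMe is no longer wanted
```

Within this project, the same values are passed as `tome_enabled`, `tome_ratio`, and `tome_max_downsample` through the UI, REST API, or Python API described below.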
## Usage

### Streamlit UI

Enable in the **🔀 Token Merging (ToMe)** expander:

1. Check **Enable Token Merging**
2. Select a preset:
   - **Conservative** — 30% merge, max_downsample=2 (minimal impact)
   - **Balanced** — 50% merge, max_downsample=1 (recommended)
   - **Aggressive** — 70% merge, max_downsample=1 (maximum speed)
   - **Custom** — Manual slider control
3. Generate images — the console confirms activation

**Visual feedback:**

```
✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1
```

### REST API

Include the ToMe fields in your generation request:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": true,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1
  }'
```

### Python API

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4  # Generate multiple images faster
)
```

## Troubleshooting

### "No speedup detected"

**Possible causes:**

1. **tomesd not installed** — install it with `pip install tomesd`
2. **Other bottlenecks** — enable only ToMe for isolated testing
3. **Very low resolution** — ToMe benefits are minimal below 512px

**Solutions:**

```bash
# Check installation
python -c "import tomesd; print('ToMe available')"

# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py
```

### "Images look blurry or soft"

**Cause:** `tome_ratio` is too high (>0.6) or `max_downsample` is too aggressive (>2).

**Solutions:**

- Reduce `tome_ratio` to 0.4-0.5
- Lower `max_downsample` to 1
- Increase `steps` to 30-35 for better convergence
- Disable ToMe for final high-quality renders

### "Minimal speedup despite 70% merge"

**Cause:** The run is already bottlenecked elsewhere (VAE decode, sampling overhead), or other optimizations (DeepCache, Multi-Scale) already account for most of the savings.

**Solutions:**

- Profile with isolated tests (disable all other optimizations)
- Ensure the GPU isn't memory-bound (reduce batch size)
- Check system monitoring for CPU/disk bottlenecks

### "Model fails to load / tomesd errors"

**Cause:** Outdated tomesd version or incompatible model architecture.

**Solutions:**

```bash
# Update tomesd
pip install --upgrade tomesd

# Check compatibility (ToMe only works with UNet-based models)
# Flux/Transformer models require a different ToMe variant (not yet supported)
```

## Technical Details

### Implementation

ToMe is applied via the `ModelPatcher` class (`src/Model/ModelPatcher.py`):

```python
def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        pass

    # Apply the ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample
    )

    self.tome_enabled = True
    return True
```

**Cache handling:** ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.

### Bipartite Matching Algorithm

ToMe selects merge pairs with **bipartite soft matching** (a simplified sketch follows the steps below):

1. **Partition tokens:**

   $$
   T_{\text{dst}},\, T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2))
   $$

2. **Compute the similarity matrix:**

   $$
   S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{||T_{\text{dst}}[i]|| \cdot ||T_{\text{src}}[j]||}
   $$

3. **Find the top-$k$ matches:**

   $$
   k = \lfloor \text{ratio} \times |T_{\text{src}}| \rfloor
   $$

4. **Merge matched tokens:**

   $$
   T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2}
   $$
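A minimal PyTorch sketch of these four steps is shown below. It is illustrative only: tokens are split 1-D by alternating indices rather than tomesd's 2-D (2, 2) stride, `ratio` is interpreted as the fraction of all tokens removed (to match the 4096 → 2048 example above), and a plain Python loop stands in for the library's batched indexing.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Merge similar tokens via bipartite soft matching (illustrative sketch).

    x: (B, N, D) token tensor. Even-indexed tokens act as dst, odd-indexed as
    src -- a 1-D stand-in for the (2, 2) spatial stride used by tomesd.
    """
    B, N, D = x.shape
    dst, src = x[:, 0::2, :], x[:, 1::2, :]           # Step 1: partition

    # Step 2: cosine similarity between every src token and every dst token
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(-1, -2)
    best_sim, best_dst = sim.max(dim=-1)              # best dst match per src token

    # Step 3: keep the top-k most similar src tokens for merging
    # (here k = ratio * N, capped so at most every src token is merged)
    k = min(src.shape[1], int(ratio * N))
    merge_idx = best_sim.topk(k, dim=-1).indices      # (B, k) src indices to merge

    # Step 4: average each selected src token into its matched dst token, then
    # keep the unmerged src tokens alongside the dst set
    out_dst = dst.clone()
    merged = []
    for b in range(B):                                # loop for clarity, not speed
        keep_mask = torch.ones(src.shape[1], dtype=torch.bool, device=x.device)
        for s in merge_idx[b].tolist():
            d = best_dst[b, s]
            # if several src tokens pick the same dst, they fold in one at a time
            out_dst[b, d] = (out_dst[b, d] + src[b, s]) / 2
            keep_mask[s] = False
        merged.append(torch.cat([out_dst[b], src[b][keep_mask]], dim=0))
    return torch.stack(merged)                        # (B, N - k, D)

tokens = torch.randn(1, 4096, 768)
print(merge_tokens(tokens, ratio=0.5).shape)          # torch.Size([1, 2048, 768])
```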
## Compatibility

| Feature | Compatible? | Notes |
|---------|-------------|-------|
| **SD1.5 models** | ✓ | Full support, tested extensively |
| **SDXL models** | ✓ | Full support, larger speedup |
| **Flux models** | ✗ | UNet-specific, Transformer variant TBD |
| **All samplers** | ✓ | ToMe patches attention, agnostic to sampler |
| **CFG-Free** | ✓ | No interaction, both apply independently |
| **DeepCache** | ✓ | Excellent combination, speedups multiply |
| **Multi-Scale** | ✓ | Compatible, benefits stack |
| **HiRes Fix** | ✓ | Applied to all upscaling passes |
| **ADetailer** | ✓ | Applied to detail-enhancement passes |
| **Stable-Fast** | ✓ | Can combine for maximum speedup |

## Limitations

1. **UNet-only:** Transformer architectures (Flux) use different attention patterns — a dedicated Transformer-ToMe is needed
2. **Detail sensitivity:** High-frequency textures (fabric weave, individual hairs) see the most quality impact
3. **Diminishing returns:** Beyond a 60% merge ratio, quality degrades faster than speed improves
4. **One-time patch:** The merge ratio does not adapt dynamically during generation

## Related Optimizations

- **[DeepCache](wavespeed.md#deepcache)**: Feature caching — complements ToMe, speedups multiply (~2.8x combined)
- **[Multi-Scale Diffusion](optimizations.md#multi-scale)**: Resolution-based optimization — also reduces token count
- **[Stable-Fast](stablefast.md)**: Compilation-based speedup — can combine for maximum performance

## References & Further Reading

- **Original Paper:** [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604) (Bolya & Hoffman, 2023)
- **tomesd Library:** https://github.com/dbolya/tomesd
- **ToMe for Vision Transformers:** https://github.com/facebookresearch/ToMe