# Token Merging (ToMe)
## Overview
Token Merging (ToMe) is a **performance optimization** that accelerates diffusion models by intelligently merging similar tokens in the attention mechanism. By identifying and combining redundant computations, ToMe achieves **20-60% speedup** with minimal quality impact.
Unlike feature caching (DeepCache, WaveSpeed), ToMe shrinks the computational graph itself: fewer tokens mean fewer attention operations, less memory bandwidth, and faster generation.
This is a **training-free**, **drop-in optimization** that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.
## How It Works
### The Token Redundancy Problem
Diffusion models process images as sequences of tokens (patches):
```
Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)
```
At each attention layer, **every token attends to every other token**:
$$
\text{Attention Cost} = O(N^2 \cdot D)
$$
Where:
- $N$ = number of tokens (e.g., 4096 for 512×512)
- $D$ = embedding dimension (e.g., 768 or 1024)
**Key insight:** Many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly-identical tokens is redundant.
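To put rough numbers on this, here is an illustrative back-of-the-envelope comparison using the cost estimate above (just the formula evaluated at the example sizes, not a benchmark):

```python
# Rough attention cost for one layer, using the O(N^2 * D) estimate above.
def attention_cost(num_tokens: int, dim: int = 768) -> int:
    return num_tokens ** 2 * dim

full = attention_cost(4096)    # 512x512 image -> 4096 tokens
merged = attention_cost(2048)  # after merging 50% of the tokens

print(f"{full:,} vs {merged:,} -> {full / merged:.0f}x fewer operations")
# 12,884,901,888 vs 3,221,225,472 -> 4x fewer operations
```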
### The ToMe Solution
Token Merging reduces redundancy through **bipartite matching**:
```
Step 1: Split tokens into two sets

  ┌────────────────────────┬────────────────────────┐
  │ Destination Set (dst)  │ Source Set (src)       │
  │ [Token 1, 3, 5, ...]   │ [Token 2, 4, 6, ...]   │
  └────────────────────────┴────────────────────────┘

Step 2: Compute similarity (cosine similarity)

  dst[0] ↔ src[0]: 0.92  (highly similar!)
  dst[0] ↔ src[1]: 0.34
  dst[0] ↔ src[2]: 0.18
  ...

Step 3: Merge most similar pairs

  merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens

  4096 tokens → 2048 tokens (50% merge ratio)
  Attention cost reduced by ~4x
```
This happens **per attention layer**. The ratio is applied to each layer's own token count, so the number of merged tokens scales with the layer's resolution, and `tome_max_downsample` controls which layers participate.
## Configuration
### Parameters
| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `tome_enabled` | bool | `False` | - | Enable Token Merging |
| `tome_ratio` | float | `0.5` | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| `tome_max_downsample` | int | `1` | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |
### Choosing `tome_max_downsample`
Controls which UNet layers apply ToMe:
| Value | Layers Affected | Speed vs Quality |
|-------|----------------|------------------|
| **1** | Only full-resolution layers (4/15) | Conservative, minimal quality impact (recommended) |
| **2** | Half-resolution layers (8/15) | Balanced speed/quality trade-off |
| **4** | Quarter-resolution layers (12/15) | Aggressive |
| **8** | All layers (15/15) | Maximum speedup, noticeable quality loss |
**Recommendation:** Start with `max_downsample=1`. Only increase if you need more speedup and can tolerate quality reduction.
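For intuition, here is roughly how many tokens each downsample level sees for a 512×512 generation (64×64 latent). This is an illustrative calculation, not project code:

```python
# Token counts per UNet resolution level for a 512x512 image (64x64 latent).
latent_side = 512 // 8  # the VAE downscales by 8x

for downsample in (1, 2, 4, 8):
    side = latent_side // downsample
    print(f"downsample {downsample}: {side}x{side} grid -> {side * side} tokens")

# downsample 1: 64x64 grid -> 4096 tokens
# downsample 2: 32x32 grid -> 1024 tokens
# downsample 4: 16x16 grid -> 256 tokens
# downsample 8: 8x8 grid  -> 64 tokens
```

Because attention cost grows quadratically with token count, the full-resolution (4096-token) layers dominate, which is why `max_downsample=1` already captures most of the benefit.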
## Usage
### Streamlit UI
Enable in the **🔀 Token Merging (ToMe)** expander:
1. Check **Enable Token Merging**
2. Select a preset:
   - **Conservative**: 30% merge, max_downsample=2 (minimal impact)
   - **Balanced**: 50% merge, max_downsample=1 (recommended)
   - **Aggressive**: 70% merge, max_downsample=1 (maximum speed)
   - **Custom**: manual slider control
3. Generate images; the console confirms activation
**Visual feedback:**
```
✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1
```
### REST API
Include in your generation request:
```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": true,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1
  }'
```
### Python API
```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4,  # Generate multiple images faster
)
```
## Troubleshooting
### "No speedup detected"
**Possible causes:**
1. **tomesd not installed**: install with `pip install tomesd`
2. **Other bottlenecks**: enable only ToMe for isolated testing
3. **Very low resolution**: ToMe benefits are minimal below 512px
**Solutions:**
```bash
# Check installation
python -c "import tomesd; print('ToMe available')"
# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py
```
### "Images look blurry or soft"
**Cause:** `tome_ratio` too high (>0.6) or `max_downsample` too aggressive (>2).
**Solutions:**
- Reduce `tome_ratio` to 0.4-0.5
- Lower `max_downsample` to 1
- Increase `steps` to 30-35 for better convergence
- Disable ToMe for final high-quality renders
### "Minimal speedup despite 70% merge"
**Cause:** With other optimizations enabled (DeepCache, Multi-Scale), total generation time may already be dominated by stages ToMe cannot touch (VAE decode, sampling overhead).
**Solutions:**
- Profile with isolated tests (disable all other optimizations); see the timing sketch below
- Ensure GPU isn't memory-bound (reduce batch size)
- Check system monitoring for CPU/disk bottlenecks
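A minimal way to do that isolated timing, assuming the `pipeline` entry point from the Python API section above (a sketch; arguments omitted here are assumed to fall back to their defaults):

```python
import time

from src.user.pipeline import pipeline

def timed_run(tome_enabled: bool) -> float:
    """Run a single generation and return the elapsed wall-clock time."""
    start = time.perf_counter()
    pipeline(
        prompt="a cyberpunk cityscape at night, neon lights",
        w=1024,
        h=512,
        steps=25,
        tome_enabled=tome_enabled,
        tome_ratio=0.5,
        tome_max_downsample=1,
    )
    return time.perf_counter() - start

baseline = timed_run(False)
with_tome = timed_run(True)
print(f"baseline {baseline:.1f}s, ToMe {with_tome:.1f}s, "
      f"speedup {baseline / with_tome:.2f}x")
```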
### "Model fails to load / tomesd errors"
**Cause:** Outdated tomesd version or incompatible model architecture.
**Solutions:**
```bash
# Update tomesd
pip install --upgrade tomesd
# Check compatibility (ToMe only works with UNet-based models)
# Flux/Transformer models require different ToMe variant (not yet supported)
```
## Technical Details
### Implementation
ToMe is applied via the `ModelPatcher` class (`src/Model/ModelPatcher.py`):
```python
def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        pass

    # Apply ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample,
    )
    self.tome_enabled = True
    return True
```
**Cache handling:** ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.
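The removal side is symmetric. A minimal sketch of what that cleanup step looks like, assuming a `remove_tome` counterpart on `ModelPatcher` (the actual method name in the project may differ):

```python
def remove_tome(self) -> bool:
    """Undo a previously applied Token Merging patch, if any."""
    try:
        # Restores the original attention blocks patched by tomesd
        tomesd.remove_patch(self)
    except Exception:
        # Model was never patched (or the cache was reset); nothing to do
        pass
    self.tome_enabled = False
    return True
```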
### Bipartite Matching Algorithm
ToMe selects merge pairs with **bipartite soft matching**:

1. **Partition tokens:**
   $$
   T_{\text{dst}},\; T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2))
   $$
2. **Compute similarity matrix:**
   $$
   S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{||T_{\text{dst}}[i]|| \cdot ||T_{\text{src}}[j]||}
   $$
3. **Find top-$k$ matches** (the ratio counts against the full token set and is capped by the source set):
   $$
   k = \min\bigl(\lfloor \text{ratio} \times |T| \rfloor,\; |T_{\text{src}}|\bigr)
   $$
4. **Merge tokens:**
   $$
   T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2}
   $$
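A condensed PyTorch sketch of these four steps on a single token tensor. This is illustrative only; the production logic lives in the `tomesd` library and additionally handles the 2D strided partition, unmerging after attention, and proportional attention:

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D). Merge the most redundant src tokens into dst tokens."""
    n = tokens.shape[0]

    # 1. Partition into alternating destination / source sets (a 1D
    #    simplification of the strided 2D partition used in practice).
    dst, src = tokens[0::2], tokens[1::2]

    # 2. Cosine similarity between every src token and every dst token.
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T  # (|src|, |dst|)
    best_sim, best_dst = sim.max(dim=-1)  # each src token's closest dst token

    # 3. Number of tokens to merge away, capped by the size of the src set.
    k = min(int(ratio * n), src.shape[0])
    merge_idx = best_sim.topk(k).indices  # the k most redundant src tokens
    keep_mask = torch.ones(src.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False

    # 4. Average each merged src token into its matched dst token
    #    (scatter-mean handles several src tokens landing on the same dst).
    out_dst = dst.clone()
    index = best_dst[merge_idx].unsqueeze(-1).expand(-1, tokens.shape[1])
    out_dst.scatter_reduce_(0, index, src[merge_idx], reduce="mean")

    # Unmerged src tokens survive; merged ones disappear from the sequence.
    return torch.cat([out_dst, src[keep_mask]], dim=0)

reduced = merge_tokens(torch.randn(4096, 768), ratio=0.5)
print(reduced.shape)  # torch.Size([2048, 768])
```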
## Compatibility
| Feature | Compatible? | Notes |
|---------|-------------|-------|
| **SD1.5 models** | ✓ | Full support, tested extensively |
| **SDXL models** | ✓ | Full support, larger speedup |
| **Flux models** | ✗ | UNet-specific, Transformer variant TBD |
| **All samplers** | ✓ | ToMe patches attention, agnostic to sampler |
| **CFG-Free** | ✓ | No interaction, both apply independently |
| **DeepCache** | ✓ | Excellent combination, speedups multiply |
| **Multi-Scale** | ✓ | Compatible, benefits stack |
| **HiRes Fix** | ✓ | Applied to all upscaling passes |
| **ADetailer** | ✓ | Applied to detail-enhancement passes |
| **Stable-Fast** | ✓ | Can combine for maximum speedup |
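As an example of stacking optimizations, ToMe and DeepCache can simply be enabled together. The call below is hypothetical in one respect: the `deepcache_enabled` flag is assumed by analogy with `tome_enabled`; check [wavespeed.md](wavespeed.md#deepcache) for the real parameter names.

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a misty mountain lake at sunrise",
    w=1024,
    h=1024,
    steps=30,
    # ToMe flags documented above
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    # Hypothetical flag; see wavespeed.md for the actual DeepCache parameters
    deepcache_enabled=True,
)
```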
## Limitations
1. **UNet-only:** Transformer architectures (Flux) use different attention patterns; a dedicated Transformer ToMe variant would be needed
2. **Detail sensitivity:** High-frequency textures (fabric weave, individual hairs) see the most quality impact
3. **Diminishing returns:** Beyond 60% merge, quality degrades faster than speed improves
4. **One-time patch:** Doesn't adapt merge ratio dynamically during generation
## Related Optimizations
- **[DeepCache](wavespeed.md#deepcache)**: Feature caching that complements ToMe; speedups multiply (~2.8x combined)
- **[Multi-Scale Diffusion](optimizations.md#multi-scale)**: Resolution-based optimization that also reduces token count
- **[Stable-Fast](stablefast.md)**: Compilation-based speedup; can be combined for maximum performance
## References & Further Reading
- **Original Paper:** [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604) (Bolya & Hoffman, 2023)
- **tomesd Library:** https://github.com/dbolya/tomesd
- **ToMe for Vision Transformers:** https://github.com/facebookresearch/ToMe