# Token Merging (ToMe)

## Overview

Token Merging (ToMe) is a **performance optimization** that accelerates diffusion models by merging similar tokens in the attention mechanism. By identifying and combining redundant tokens before attention is computed, ToMe achieves a **20-60% speedup** with minimal quality impact.

Unlike feature caching (DeepCache, WaveSpeed), ToMe reduces the computational graph itself: fewer tokens mean fewer attention operations, less memory bandwidth, and faster generation.

This is a **training-free**, **drop-in optimization** that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.

## How It Works

### The Token Redundancy Problem

Diffusion models process images as sequences of tokens (patches):

```
Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)
```

At each attention layer, **every token attends to every other token**:

$$
\text{Attention Cost} = O(N^2 \cdot D)
$$

Where:
- $N$ = number of tokens (e.g., 4096 for 512×512)
- $D$ = embedding dimension (e.g., 768 or 1024)

**Key insight:** Many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly-identical tokens is redundant.
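
To make the quadratic scaling concrete, here is a back-of-the-envelope comparison (plain Python, illustrative numbers only) of the attention cost before and after a 50% merge:

```python
# Illustrative arithmetic only: relative self-attention cost ~ N^2 * D
N_full, N_merged, D = 4096, 2048, 768   # 512x512 tokens, 50% merged, SD1.5-like dim

cost_full = N_full ** 2 * D
cost_merged = N_merged ** 2 * D

print(f"full:   {cost_full:.2e}")     # ~1.29e+10
print(f"merged: {cost_merged:.2e}")   # ~3.22e+09
print(f"attention-term speedup: {cost_full / cost_merged:.1f}x")  # 4.0x
```

Only the attention term shrinks quadratically; MLP and convolution costs scale roughly linearly with token count, which is why end-to-end gains land in the 20-60% range rather than at 4x.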

### The ToMe Solution

Token Merging reduces redundancy through **bipartite matching**:

```
Step 1: Split tokens into two sets
┌──────────────────────┬──────────────────────┐
│ Destination Set (dst)│ Source Set (src)     │
│ [Token 1, 3, 5, ...] │ [Token 2, 4, 6, ...] │
└──────────────────────┴──────────────────────┘

Step 2: Compute pairwise cosine similarity
   dst[0] ↔ src[0]: 0.92  (highly similar!)
   dst[0] ↔ src[1]: 0.34
   dst[0] ↔ src[2]: 0.18
   ...

Step 3: Merge most similar pairs
   merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens
   4096 tokens → 2048 tokens (50% merge ratio)
   Attention cost reduced by ~4x
```

This happens **per attention layer**; which layers are patched is controlled by their downsampling level (see `tome_max_downsample` below).

## Configuration

### Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `tome_enabled` | bool | `False` | - | Enable Token Merging |
| `tome_ratio` | float | `0.5` | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| `tome_max_downsample` | int | `1` | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |

### Choosing `tome_max_downsample`

Controls which UNet layers apply ToMe:

| Value | Layers Affected | Speed vs Quality |
|-------|----------------|------------------|
| **1** | Only full-resolution layers (4/15) | Conservative, minimal quality impact |
| **2** | Half-resolution layers (8/15) | Balanced (recommended) |
| **4** | Quarter-resolution layers (12/15) | Aggressive |
| **8** | All layers (15/15) | Maximum speedup, noticeable quality loss |

**Recommendation:** Start with `max_downsample=1`. Only increase if you need more speedup and can tolerate quality reduction.
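
As a rough illustration of what each value means in practice, the sketch below (assuming the usual SD latent downscale of 8x and attention blocks at downsample factors 1, 2, 4, and 8, i.e. the values accepted by `tome_max_downsample`) shows which latent resolutions would be patched for a 512×512 generation:

```python
# Illustrative only: latent resolutions whose attention layers ToMe would patch,
# assuming a 512x512 image (64x64 latent) and downsample factors 1/2/4/8.
LATENT_SIZE = 64  # 512 px / 8 (VAE downscale factor)

def patched_resolutions(max_downsample: int) -> list[int]:
    return [LATENT_SIZE // d for d in (1, 2, 4, 8) if d <= max_downsample]

for value in (1, 2, 4, 8):
    print(value, "->", patched_resolutions(value))
# 1 -> [64]
# 2 -> [64, 32]
# 4 -> [64, 32, 16]
# 8 -> [64, 32, 16, 8]
```

The full-resolution (64×64) layers carry the most tokens and therefore dominate attention cost, which is why `max_downsample=1` already captures most of the speedup.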

## Usage

### Streamlit UI

Enable in the **🔀 Token Merging (ToMe)** expander:

1. Check **Enable Token Merging**
2. Select a preset:
   - **Conservative**: 30% merge, max_downsample=2 (minimal impact)
   - **Balanced**: 50% merge, max_downsample=1 (recommended)
   - **Aggressive**: 70% merge, max_downsample=1 (maximum speed)
   - **Custom**: manual slider control
3. Generate images; the console confirms activation

**Visual feedback:**
```
✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1
```
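
For scripting outside the UI, the presets above correspond to these parameter combinations (the dict name and structure here are purely illustrative, not part of the codebase):

```python
# Preset values as listed above; TOME_PRESETS is an illustrative name only.
TOME_PRESETS = {
    "conservative": {"tome_ratio": 0.3, "tome_max_downsample": 2},
    "balanced":     {"tome_ratio": 0.5, "tome_max_downsample": 1},
    "aggressive":   {"tome_ratio": 0.7, "tome_max_downsample": 1},
}
```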

### REST API

Include in your generation request:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "a cyberpunk cityscape at night, neon lights",
        "width": 1024,
        "height": 512,
        "steps": 25,
        "tome_enabled": true,
        "tome_ratio": 0.5,
        "tome_max_downsample": 1
      }'
```
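
The same request can also be sent programmatically; a minimal sketch using the `requests` library (response handling depends on the server's actual response schema):

```python
import requests

# Mirrors the curl example above.
payload = {
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": True,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1,
}

response = requests.post("http://localhost:7861/api/generate", json=payload, timeout=600)
response.raise_for_status()
print(response.json())  # structure depends on the API's response format
```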

### Python API

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4  # Generate multiple images faster
)
```

## Troubleshooting

### "No speedup detected"

**Possible causes:**
1. **tomesd not installed**: install with `pip install tomesd`
2. **Other bottlenecks**: enable only ToMe for isolated testing
3. **Very low resolution**: ToMe benefits are minimal below 512px

**Solutions:**
```bash
# Check installation
python -c "import tomesd; print('ToMe available')"

# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py
```
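
If `quick_tome_test.py` is not available in your checkout, a rough A/B timing sketch using the Python API shown above works too (assuming the default sampler/scheduler; exact numbers depend on hardware, resolution, and other active optimizations):

```python
import time
from src.user.pipeline import pipeline

def timed_run(**tome_kwargs) -> float:
    """Time one generation with the given ToMe settings."""
    start = time.perf_counter()
    pipeline(
        prompt="a cyberpunk cityscape at night, neon lights",
        w=1024, h=512, steps=25,
        **tome_kwargs,
    )
    return time.perf_counter() - start

baseline = timed_run(tome_enabled=False)
with_tome = timed_run(tome_enabled=True, tome_ratio=0.5, tome_max_downsample=1)
print(f"baseline: {baseline:.1f}s  ToMe: {with_tome:.1f}s  "
      f"speedup: {baseline / with_tome:.2f}x")
```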

### "Images look blurry or soft"

**Cause:** `tome_ratio` too high (>0.6) or `max_downsample` too aggressive (>2).

**Solutions:**
- Reduce `tome_ratio` to 0.4-0.5
- Lower `max_downsample` to 1
- Increase `steps` to 30-35 for better convergence
- Disable ToMe for final high-quality renders

### "Minimal speedup despite 70% merge"

**Cause:** With other optimizations active (DeepCache, Multi-Scale), the pipeline may already be bottlenecked elsewhere (VAE decode, sampling overhead).

**Solutions:**
- Profile with isolated tests (disable all other optimizations)
- Ensure GPU isn't memory-bound (reduce batch size)
- Check system monitoring for CPU/disk bottlenecks

### "Model fails to load / tomesd errors"

**Cause:** Outdated tomesd version or incompatible model architecture.

**Solutions:**
```bash
# Update tomesd
pip install --upgrade tomesd

# Check compatibility (ToMe only works with UNet-based models)
# Flux/Transformer models require different ToMe variant (not yet supported)
```

## Technical Details

### Implementation

ToMe is applied via the `ModelPatcher` class (`src/Model/ModelPatcher.py`):

```python
def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        pass
    
    # Apply ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample
    )
    self.tome_enabled = True
    return True
```

**Cache handling:** ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.
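
For symmetry, a minimal sketch of what that cleanup step might look like (`remove_tome` is a hypothetical name here, not necessarily the method used in `ModelPatcher`):

```python
def remove_tome(self) -> None:
    """Remove the ToMe patch so cached models start from a clean state."""
    try:
        tomesd.remove_patch(self)  # tolerate models that were never patched
    except Exception:
        pass
    self.tome_enabled = False
```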

### Bipartite Matching Algorithm

ToMe uses **bipartite soft matching**:

1. **Partition tokens:**
   $$
   T_{\text{dst}}, T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2))
   $$

2. **Compute similarity matrix:**
   $$
   S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{||T_{\text{dst}}[i]|| \cdot ||T_{\text{src}}[j]||}
   $$

3. **Find top-k matches:**
   $$
   k = \lfloor \text{ratio} \times N \rfloor, \quad k \le |T_{\text{src}}|
   $$

4. **Merge tokens:**
   $$
   T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2}
   $$
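
A self-contained PyTorch sketch of these four steps (a simplified, single-image version for illustration; the tomesd implementation additionally handles batching, a strided 2D partition, and averaging when several source tokens map to the same destination):

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Simplified bipartite token merging for one image.

    tokens: (N, D) token embeddings. Returns roughly (N - k, D) merged tokens.
    """
    # Step 1: partition into destination / source sets (alternating split here,
    # instead of the strided 2D partition used by tomesd)
    dst, src = tokens[0::2], tokens[1::2]

    # Step 2: cosine similarity between every source and destination token
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T   # (|src|, |dst|)

    # Step 3: each source token's best destination match; merge the k most similar
    best_sim, best_dst = sim.max(dim=-1)
    k = min(int(ratio * tokens.shape[0]), src.shape[0])
    merge_idx = best_sim.topk(k).indices
    keep_mask = torch.ones(src.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False

    # Step 4: average each merged source token into its matched destination token
    # (if several sources share a destination, this simple version keeps only one)
    merged_dst = dst.clone()
    merged_dst[best_dst[merge_idx]] = (dst[best_dst[merge_idx]] + src[merge_idx]) / 2

    # Remaining tokens: all destination tokens plus the unmerged source tokens
    return torch.cat([merged_dst, src[keep_mask]], dim=0)

x = torch.randn(4096, 768)
print(merge_tokens(x, ratio=0.5).shape)  # torch.Size([2048, 768])
```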

## Compatibility

| Feature | Compatible? | Notes |
|---------|-------------|-------|
| **SD1.5 models** | ✓ | Full support, tested extensively |
| **SDXL models** | ✓ | Full support, larger speedup |
| **Flux models** | ✗ | UNet-specific, Transformer variant TBD |
| **All samplers** | ✓ | ToMe patches attention, agnostic to sampler |
| **CFG-Free** | ✓ | No interaction, both apply independently |
| **DeepCache** | ✓ | Excellent combination, speedups multiply |
| **Multi-Scale** | ✓ | Compatible, benefits stack |
| **HiRes Fix** | ✓ | Applied to all upscaling passes |
| **ADetailer** | ✓ | Applied to detail-enhancement passes |
| **Stable-Fast** | ✓ | Can combine for maximum speedup |

## Limitations

1. **UNet-only:** Transformer architectures (Flux) use different attention patterns; a dedicated Transformer ToMe variant would be needed
2. **Detail sensitivity:** High-frequency textures (fabric weave, individual hairs) see the largest quality impact
3. **Diminishing returns:** Beyond 60% merge, quality degrades faster than speed improves
4. **One-time patch:** Doesn't adapt merge ratio dynamically during generation

## Related Optimizations

- **[DeepCache](wavespeed.md#deepcache)**: feature caching that complements ToMe; speedups multiply (~2.8x combined)
- **[Multi-Scale Diffusion](optimizations.md#multi-scale)**: resolution-based optimization that also reduces token count
- **[Stable-Fast](stablefast.md)**: compilation-based speedup; can be combined for maximum performance

## References & Further Reading

- **Original Paper:** [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604) (Bolya & Hoffman, 2023)
- **tomesd Library:** https://github.com/dbolya/tomesd
- **ToMe for Vision Transformers:** https://github.com/facebookresearch/ToMe