# Token Merging (ToMe)
## Overview
Token Merging (ToMe) is a **performance optimization** that accelerates diffusion models by intelligently merging similar tokens in the attention mechanism. By identifying and combining redundant computations, ToMe achieves **20-60% speedup** with minimal quality impact.
Unlike feature caching (DeepCache, WaveSpeed), ToMe shrinks the computational graph itself: fewer tokens mean fewer attention operations, less memory bandwidth, and faster generation.
This is a **training-free**, **drop-in optimization** that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.
## How It Works
### The Token Redundancy Problem
Diffusion models process images as sequences of tokens (patches):
```
Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)
```
At each attention layer, **every token attends to every other token**:
$$
\text{Attention Cost} = O(N^2 \cdot D)
$$
Where:
- $N$ = number of tokens (e.g., 4096 for 512×512)
- $D$ = embedding dimension (e.g., 768 or 1024)
**Key insight:** Many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly-identical tokens is redundant.
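To put rough numbers on this, here is an illustrative back-of-the-envelope comparison using the cost estimate above (just the formula evaluated at the example sizes, not a benchmark):

```python
# Rough attention cost for one layer, using the O(N^2 * D) estimate above.
def attention_cost(num_tokens: int, dim: int = 768) -> int:
    return num_tokens ** 2 * dim

full = attention_cost(4096)    # 512x512 image -> 4096 tokens
merged = attention_cost(2048)  # after merging 50% of the tokens

print(f"{full:,} vs {merged:,} -> {full / merged:.0f}x fewer operations")
# 12,884,901,888 vs 3,221,225,472 -> 4x fewer operations
```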
### The ToMe Solution
Token Merging reduces redundancy through **bipartite matching**:
```
Step 1: Split tokens into two sets

  ┌────────────────────────┬────────────────────────┐
  │ Destination Set (dst)  │ Source Set (src)       │
  │ [Token 1, 3, 5, ...]   │ [Token 2, 4, 6, ...]   │
  └────────────────────────┴────────────────────────┘

Step 2: Compute similarity (cosine similarity)

  dst[0] ↔ src[0]: 0.92  (highly similar!)
  dst[0] ↔ src[1]: 0.34
  dst[0] ↔ src[2]: 0.18
  ...

Step 3: Merge most similar pairs

  merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens

  4096 tokens → 2048 tokens (50% merge ratio)
  Attention cost reduced by ~4x
```
This happens **per attention layer**. The ratio is applied to each layer's own token count, so the number of merged tokens scales with the layer's resolution, and `tome_max_downsample` controls which layers participate.
## Configuration
### Parameters
| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `tome_enabled` | bool | `False` | - | Enable Token Merging |
| `tome_ratio` | float | `0.5` | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| `tome_max_downsample` | int | `1` | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |
### Choosing `tome_max_downsample`
Controls which UNet layers apply ToMe:
| Value | Layers Affected | Speed vs Quality |
|-------|----------------|------------------|
| **1** | Only full-resolution layers (4/15) | Conservative, minimal quality impact (recommended) |
| **2** | Half-resolution layers (8/15) | Balanced speed/quality trade-off |
| **4** | Quarter-resolution layers (12/15) | Aggressive |
| **8** | All layers (15/15) | Maximum speedup, noticeable quality loss |
**Recommendation:** Start with `max_downsample=1`. Only increase if you need more speedup and can tolerate quality reduction.
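For intuition, here is roughly how many tokens each downsample level sees for a 512×512 generation (64×64 latent). This is an illustrative calculation, not project code:

```python
# Token counts per UNet resolution level for a 512x512 image (64x64 latent).
latent_side = 512 // 8  # the VAE downscales by 8x

for downsample in (1, 2, 4, 8):
    side = latent_side // downsample
    print(f"downsample {downsample}: {side}x{side} grid -> {side * side} tokens")

# downsample 1: 64x64 grid -> 4096 tokens
# downsample 2: 32x32 grid -> 1024 tokens
# downsample 4: 16x16 grid -> 256 tokens
# downsample 8: 8x8 grid  -> 64 tokens
```

Because attention cost grows quadratically with token count, the full-resolution (4096-token) layers dominate, which is why `max_downsample=1` already captures most of the benefit.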
## Usage
### Streamlit UI
Enable in the **🔀 Token Merging (ToMe)** expander:
1. Check **Enable Token Merging**
2. Select a preset:
   - **Conservative**: 30% merge, max_downsample=2 (minimal impact)
   - **Balanced**: 50% merge, max_downsample=1 (recommended)
   - **Aggressive**: 70% merge, max_downsample=1 (maximum speed)
   - **Custom**: manual slider control
3. Generate images; the console confirms activation
**Visual feedback:**
```
✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1
```
### REST API
Include in your generation request:
```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": true,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1
  }'
```
### Python API
```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4,  # Generate multiple images faster
)
```
## Troubleshooting
### "No speedup detected"
**Possible causes:**
1. **tomesd not installed**: install with `pip install tomesd`
2. **Other bottlenecks**: enable only ToMe for isolated testing
3. **Very low resolution**: ToMe benefits are minimal below 512px
**Solutions:**
```bash
# Check installation
python -c "import tomesd; print('ToMe available')"
# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py
```
### "Images look blurry or soft"
**Cause:** `tome_ratio` too high (>0.6) or `max_downsample` too aggressive (>2).
**Solutions:**
- Reduce `tome_ratio` to 0.4-0.5
- Lower `max_downsample` to 1
- Increase `steps` to 30-35 for better convergence
- Disable ToMe for final high-quality renders
### "Minimal speedup despite 70% merge"
**Cause:** With other optimizations enabled (DeepCache, Multi-Scale), total generation time may already be dominated by stages ToMe cannot touch (VAE decode, sampling overhead).
**Solutions:**
- Profile with isolated tests (disable all other optimizations); see the timing sketch below
- Ensure GPU isn't memory-bound (reduce batch size)
- Check system monitoring for CPU/disk bottlenecks
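A minimal way to do that isolated timing, assuming the `pipeline` entry point from the Python API section above (a sketch; arguments omitted here are assumed to fall back to their defaults):

```python
import time

from src.user.pipeline import pipeline

def timed_run(tome_enabled: bool) -> float:
    """Run a single generation and return the elapsed wall-clock time."""
    start = time.perf_counter()
    pipeline(
        prompt="a cyberpunk cityscape at night, neon lights",
        w=1024,
        h=512,
        steps=25,
        tome_enabled=tome_enabled,
        tome_ratio=0.5,
        tome_max_downsample=1,
    )
    return time.perf_counter() - start

baseline = timed_run(False)
with_tome = timed_run(True)
print(f"baseline {baseline:.1f}s, ToMe {with_tome:.1f}s, "
      f"speedup {baseline / with_tome:.2f}x")
```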
### "Model fails to load / tomesd errors"
**Cause:** Outdated tomesd version or incompatible model architecture.
**Solutions:**
```bash
# Update tomesd
pip install --upgrade tomesd
# Check compatibility (ToMe only works with UNet-based models)
# Flux/Transformer models require different ToMe variant (not yet supported)
```
## Technical Details
### Implementation
ToMe is applied via the `ModelPatcher` class (`src/Model/ModelPatcher.py`):
```python
def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        pass

    # Apply ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample,
    )
    self.tome_enabled = True
    return True
```
**Cache handling:** ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.
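The removal side is symmetric. A minimal sketch of what that cleanup step looks like, assuming a `remove_tome` counterpart on `ModelPatcher` (the actual method name in the project may differ):

```python
def remove_tome(self) -> bool:
    """Undo a previously applied Token Merging patch, if any."""
    try:
        # Restores the original attention blocks patched by tomesd
        tomesd.remove_patch(self)
    except Exception:
        # Model was never patched (or the cache was reset); nothing to do
        pass
    self.tome_enabled = False
    return True
```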
### Bipartite Matching Algorithm
ToMe selects merge pairs with **bipartite soft matching**:

1. **Partition tokens:**
   $$
   T_{\text{dst}},\; T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2))
   $$
2. **Compute similarity matrix:**
   $$
   S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{||T_{\text{dst}}[i]|| \cdot ||T_{\text{src}}[j]||}
   $$
3. **Find top-$k$ matches** (the ratio counts against the full token set and is capped by the source set):
   $$
   k = \min\bigl(\lfloor \text{ratio} \times |T| \rfloor,\; |T_{\text{src}}|\bigr)
   $$
4. **Merge tokens:**
   $$
   T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2}
   $$
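A condensed PyTorch sketch of these four steps on a single token tensor. This is illustrative only; the production logic lives in the `tomesd` library and additionally handles the 2D strided partition, unmerging after attention, and proportional attention:

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D). Merge the most redundant src tokens into dst tokens."""
    n = tokens.shape[0]

    # 1. Partition into alternating destination / source sets (a 1D
    #    simplification of the strided 2D partition used in practice).
    dst, src = tokens[0::2], tokens[1::2]

    # 2. Cosine similarity between every src token and every dst token.
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T  # (|src|, |dst|)
    best_sim, best_dst = sim.max(dim=-1)  # each src token's closest dst token

    # 3. Number of tokens to merge away, capped by the size of the src set.
    k = min(int(ratio * n), src.shape[0])
    merge_idx = best_sim.topk(k).indices  # the k most redundant src tokens
    keep_mask = torch.ones(src.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False

    # 4. Average each merged src token into its matched dst token
    #    (scatter-mean handles several src tokens landing on the same dst).
    out_dst = dst.clone()
    index = best_dst[merge_idx].unsqueeze(-1).expand(-1, tokens.shape[1])
    out_dst.scatter_reduce_(0, index, src[merge_idx], reduce="mean")

    # Unmerged src tokens survive; merged ones disappear from the sequence.
    return torch.cat([out_dst, src[keep_mask]], dim=0)

reduced = merge_tokens(torch.randn(4096, 768), ratio=0.5)
print(reduced.shape)  # torch.Size([2048, 768])
```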
## Compatibility
| Feature | Compatible? | Notes |
|---------|-------------|-------|
| **SD1.5 models** | ✓ | Full support, tested extensively |
| **SDXL models** | ✓ | Full support, larger speedup |
| **Flux models** | ✗ | UNet-specific, Transformer variant TBD |
| **All samplers** | ✓ | ToMe patches attention, agnostic to sampler |
| **CFG-Free** | ✓ | No interaction, both apply independently |
| **DeepCache** | ✓ | Excellent combination, speedups multiply |
| **Multi-Scale** | ✓ | Compatible, benefits stack |
| **HiRes Fix** | ✓ | Applied to all upscaling passes |
| **ADetailer** | ✓ | Applied to detail-enhancement passes |
| **Stable-Fast** | ✓ | Can combine for maximum speedup |
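As an example of stacking optimizations, ToMe and DeepCache can simply be enabled together. The call below is hypothetical in one respect: the `deepcache_enabled` flag is assumed by analogy with `tome_enabled`; check [wavespeed.md](wavespeed.md#deepcache) for the real parameter names.

```python
from src.user.pipeline import pipeline

pipeline(
    prompt="a misty mountain lake at sunrise",
    w=1024,
    h=1024,
    steps=30,
    # ToMe flags documented above
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    # Hypothetical flag; see wavespeed.md for the actual DeepCache parameters
    deepcache_enabled=True,
)
```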
## Limitations
1. **UNet-only:** Transformer architectures (Flux) use different attention patterns; a dedicated Transformer ToMe variant would be needed
2. **Detail sensitivity:** High-frequency textures (fabric weave, individual hairs) see the most quality impact
3. **Diminishing returns:** Beyond 60% merge, quality degrades faster than speed improves
4. **One-time patch:** Doesn't adapt merge ratio dynamically during generation
## Related Optimizations
- **[DeepCache](wavespeed.md#deepcache)**: Feature caching that complements ToMe; speedups multiply (~2.8x combined)
- **[Multi-Scale Diffusion](optimizations.md#multi-scale)**: Resolution-based optimization that also reduces token count
- **[Stable-Fast](stablefast.md)**: Compilation-based speedup; can be combined for maximum performance
## References & Further Reading
- **Original Paper:** [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604) (Bolya & Hoffman, 2023)
- **tomesd Library:** https://github.com/dbolya/tomesd
- **ToMe for Vision Transformers:** https://github.com/facebookresearch/ToMe