# SageAttention & SpargeAttn

## Overview

SageAttention and SpargeAttn are drop-in replacements for PyTorch's scaled dot-product attention that can provide significant speedup with zero to minimal quality loss. They work by optimizing the compute-heavy attention mechanism used throughout diffusion models (UNet, VAE, Flux Transformers).

- **SageAttention**: Quantizes the query/key tensors to INT8 for the score matmul while keeping the softmax and value path in higher precision
- **SpargeAttn**: Adds dynamic sparsity pruning on top of SageAttention, skipping redundant attention computations

Both are **training-free**, **hardware-accelerated** CUDA kernels that integrate transparently into LightDiffusion-Next.
## How It Works

### SageAttention

Standard attention computes:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
SageAttention accelerates this by:

1. **Quantizing Q and K** to INT8 before the score matrix multiplication
2. **Keeping the softmax and the PV matmul in FP16** to preserve output precision
3. **Fusing operations** (softmax, scaling, matmul) in hand-tuned CUDA kernels
4. **Dequantizing** the INT8 scores back to floating point before the softmax

This reduces memory bandwidth (the quantized Q/K use half the space of FP16) and maps the score matmul onto the GPU's faster INT8 Tensor Core path.
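
To make the quantization step concrete, here is a minimal Python sketch of symmetric INT8 quantization around the score matmul. This is illustrative only: the real kernels use finer-grained per-block scales and fused CUDA code.

```python
# Minimal sketch of INT8-quantized attention scores (illustrative only;
# the real kernels fuse these steps and use per-block scales).
import torch

def quantize_int8(t: torch.Tensor):
    # Symmetric per-tensor quantization: map max |value| to 127.
    scale = t.abs().amax() / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

q = torch.randn(1, 8, 64, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 64, 64)

q_i8, q_scale = quantize_int8(q)
k_i8, k_scale = quantize_int8(k)

# Integer matmul (emulated in float32 here, which is exact at these
# magnitudes), then dequantize the scores before the softmax.
scores = q_i8.float() @ k_i8.float().transpose(-1, -2)
scores = scores * (q_scale * k_scale) / (q.shape[-1] ** 0.5)
attn = torch.softmax(scores, dim=-1)  # softmax and PV matmul stay in full precision
```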
### SpargeAttn

SpargeAttn extends SageAttention with **sparse attention masking**:

1. Computes a similarity metric between query and key patches
2. Prunes attention connections below a tunable similarity threshold (default: 0.6)
3. Applies cumulative-distribution filtering to keep only the top 97% of attention mass
4. Uses partial-vector thresholding to skip redundant computations

The result: roughly 40-60% end-to-end speedup over baseline PyTorch attention with minimal impact on output quality.
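
The block-level pruning idea can be sketched in a few lines of Python. Hedged heavily: the actual CUDA kernel fuses masking into the attention tiles and its similarity metric differs in detail; this only illustrates how a similarity mask over blocks could be derived.

```python
# Conceptual sketch of block-level similarity pruning (illustrative only).
import torch
import torch.nn.functional as F

def block_means(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    b, h, s, d = x.shape
    return x.reshape(b, h, s // block, block, d).mean(dim=3)

q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)

# Cosine similarity between query-block means and key-block means.
qm = F.normalize(block_means(q), dim=-1)
km = F.normalize(block_means(k), dim=-1)
sim = qm @ km.transpose(-1, -2)   # (batch, heads, q_blocks, k_blocks)

# Blocks whose similarity falls below the threshold become skip candidates.
keep = sim > 0.6
print(f"candidate blocks kept: {keep.float().mean().item():.0%}")
```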
## Installation

### SageAttention (All Platforms)

**Prerequisites:**

- CUDA Toolkit 11.8+ (must match your PyTorch CUDA version)
- Python 3.8+
- PyTorch with CUDA support

**Install:**

```bash
# Clone the repository
git clone https://github.com/thu-ml/SageAttention
cd SageAttention

# Install from source (no build isolation, to respect the existing CUDA setup)
pip install -e . --no-build-isolation

# Verify installation
python -c "import sageattention; print('SageAttention installed successfully')"
```
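
Beyond the import check, a quick functional smoke test confirms the kernel actually runs (assumes a CUDA GPU; the shapes below are arbitrary):

```python
# Functional smoke test: run the kernel once on random FP16 tensors.
import torch
from sageattention import sageattn

q, k, v = (torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # expected: torch.Size([1, 8, 1024, 64])
```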
### SpargeAttn (Linux/WSL2 Only)

**Prerequisites:**

- Same as SageAttention
- Linux or WSL2 environment (Windows native builds fail due to linker path limits)
- GPU with compute capability 8.0-9.0 (RTX 30xx, 40xx, A100, H100)

**Install:**

```bash
# Clone the repository
git clone https://github.com/thu-ml/SpargeAttn
cd SpargeAttn

# Set the GPU architecture (critical for performance)
export TORCH_CUDA_ARCH_LIST="9.0"  # Or your GPU: 8.0, 8.6, 8.9, 9.0

# Install from source
pip install -e . --no-build-isolation

# Verify installation
python -c "import spas_sage_attn; print('SpargeAttn installed successfully')"
```
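
As with SageAttention, a quick functional check using the same entry point the pipeline calls (see Technical Details below) confirms the build works. This assumes a supported CUDA GPU:

```python
# Functional smoke test: run the sparse kernel once on random FP16 tensors.
import torch
from spas_sage_attn import spas_sage2_attn_meansim_cuda

q, k, v = (torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))
out = spas_sage2_attn_meansim_cuda(q, k, v, is_causal=False)
print(out.shape)  # expected: torch.Size([1, 8, 1024, 64])
```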
**GPU Architecture Reference:**

| GPU Model | Compute Capability | TORCH_CUDA_ARCH_LIST |
|-----------|--------------------|----------------------|
| RTX 3060/3070/3080/3090 | 8.6 | `"8.6"` |
| RTX 4060/4070/4080/4090 | 8.9 | `"8.9"` |
| A100 | 8.0 | `"8.0"` |
| H100 | 9.0 | `"9.0"` |
| RTX 5060/5070/5080/5090 | 12.0 | `"12.0"` (SageAttention only; SpargeAttn pending) |
### Docker Installation

Both kernels are automatically built during Docker image creation if the architecture is supported:

```bash
# Build with SpargeAttn (compute 8.0-9.0)
docker-compose build --build-arg TORCH_CUDA_ARCH_LIST="8.9"

# RTX 50xx builds (SageAttention only, no SpargeAttn yet)
docker-compose build --build-arg TORCH_CUDA_ARCH_LIST="12.0"
```
## Usage

### Automatic Detection

LightDiffusion-Next automatically detects and enables the best available attention backend at startup:

```text
# Priority order (highest to lowest):
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```
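
Conceptually, the selection boils down to trying imports in priority order. A sketch of the pattern (names here are illustrative, not the project's actual function):

```python
# Illustrative sketch of priority-based backend detection (names are
# hypothetical; the real selection happens inside LightDiffusion-Next).
def pick_attention_backend() -> str:
    try:
        import spas_sage_attn  # noqa: F401  (SpargeAttn, highest priority)
        return "spargeattn"
    except ImportError:
        pass
    try:
        import sageattention  # noqa: F401
        return "sageattention"
    except ImportError:
        pass
    try:
        import xformers  # noqa: F401
        return "xformers"
    except ImportError:
        return "pytorch_sdpa"  # always-available fallback
```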
Check which backend is active in the server logs:

```bash
grep "attention" logs/server.log

# SpargeAttn enabled:
# Output: Using SpargeAttn (Sparse + SageAttention) cross attention

# SageAttention enabled:
# Output: Using SageAttention cross attention

# Fallback:
# Output: Using pytorch cross attention
```
### Streamlit UI

No configuration is needed: SageAttention/SpargeAttn are always active when installed.

### REST API

Same as the UI: backend selection is transparent.
```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a serene mountain lake at dawn",
    "width": 768,
    "height": 512,
    "num_images": 1
  }'
# Automatically uses SpargeAttn if available
```
### Manual Disable

Force PyTorch SDPA for debugging:

```bash
export LD_DISABLE_SAGE_ATTENTION=1
python streamlit_app.py
```
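
Internally this amounts to a simple environment-variable check before backend detection; a minimal sketch of the pattern (illustrative, not the exact implementation):

```python
# Illustrative sketch of honoring the kill switch at startup.
import os

if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
    backend = "pytorch_sdpa"  # skip SageAttention/SpargeAttn detection entirely
```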
## Performance

Both SageAttention and SpargeAttn provide measurable speedup over the PyTorch SDPA baseline:

- **SageAttention**: Moderate speedup with zero quality loss (reported ~15-20% in the papers)
- **SpargeAttn**: Significant speedup with minimal quality loss (reported ~40-60% in the papers)

Actual performance gains vary based on the factors below (see the timing sketch after this list):

- GPU architecture and VRAM
- Model type (SD1.5, SDXL, Flux)
- Resolution and batch size
- Head dimensions and sequence lengths
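
The sketch below shows one way to time SageAttention against PyTorch SDPA on your own hardware (assumes SageAttention is installed and a CUDA GPU is present; shapes are arbitrary):

```python
# Micro-benchmark sketch: SageAttention vs. PyTorch SDPA (illustrative).
import torch
import torch.nn.functional as F
from sageattention import sageattn

def time_ms(fn, iters: int = 50) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

q, k, v = (torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))
print(f"SDPA: {time_ms(lambda: F.scaled_dot_product_attention(q, k, v)):.3f} ms")
print(f"Sage: {time_ms(lambda: sageattn(q, k, v, tensor_layout='HND')):.3f} ms")
```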
**Note:** Benchmark your specific setup to measure real-world performance.

## Technical Details
### Head Dimension Support

Both kernels natively support head dimensions of `[64, 96, 128]`. For other dimensions:

- **< 64**: Pad to 64, compute, then slice the result
- **64-128**: Pad to 128, compute, then slice the result
- **> 128**: Fall back to xformers or PyTorch SDPA

LightDiffusion-Next handles the padding/slicing automatically.
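
A sketch of the pad-then-slice trick (`pad_qkv` is a hypothetical helper for illustration, not the pipeline's actual function):

```python
# Illustrative pad-then-slice helper for unsupported head dimensions.
import torch
import torch.nn.functional as F

def pad_qkv(q, k, v, supported=(64, 96, 128)):
    d = q.shape[-1]
    target = next((s for s in supported if s >= d), None)
    if target is None:
        raise ValueError("head_dim > 128: fall back to xformers/SDPA")
    pad = target - d
    # Zero-padding Q/K leaves the QK^T dot products unchanged; the padded
    # V columns produce zero output columns, which are sliced off afterwards.
    q, k, v = (F.pad(t, (0, pad)) for t in (q, k, v))
    return q, k, v, d

# usage: q, k, v, d = pad_qkv(q, k, v); out = kernel(q, k, v)[..., :d]
```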
### Tensor Layout

SageAttention expects tensors in `(batch_size, num_heads, seq_len, head_dim)` format. The pipeline reshapes inputs transparently:

```python
# Internal reshaping (handled automatically)
q, k, v = map(
    lambda t: t.reshape(b, -1, heads, dim_head).transpose(1, 2),
    (q, k, v),
)
out = sageattention.sageattn(q, k, v, tensor_layout="HND")
```
### SpargeAttn Thresholds

Default pruning parameters (tuned for a quality/speed balance):

```python
out = spas_sage_attn.spas_sage2_attn_meansim_cuda(
    q, k, v,
    simthreshd1=0.6,   # Similarity threshold (60%)
    cdfthreshd=0.97,   # Keep top 97% of attention scores
    pvthreshd=15,      # Partial vector threshold
    is_causal=False,
)
```

Adjust `simthreshd1` for different trade-offs:

- `0.5`: More aggressive pruning, higher speedup, slight quality loss
- `0.7`: Conservative pruning, lower speedup, minimal quality loss
## Compatibility

### Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ Flux (both cross-attention and self-attention blocks)
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix, ADetailer, Img2Img
- ✅ Stable-Fast (when stacked)
- ✅ WaveSpeed caching (when stacked)

### Known Limitations

- ❌ RTX 50xx (compute 12.0) does not support SpargeAttn yet (SageAttention works)
- ❌ CPU-only inference (CUDA required)
- ❌ AMD GPUs (no ROCm port available)
- ⚠️ Head dimensions > 128 fall back to slower backends
## Troubleshooting

### Import Error: `No module named 'sageattention'`

**Cause:** Not installed, or the installation failed.

**Fix:**

```bash
cd SageAttention
pip install -e . --no-build-isolation
```

Verify the CUDA toolkit is accessible:

```bash
nvcc --version  # Should match the PyTorch CUDA version
```
### Compilation Error: `nvcc fatal error`

**Cause:** CUDA toolkit not found or version mismatch.

**Fix:**

1. Install the CUDA toolkit matching your PyTorch version
2. Add CUDA to your PATH:

```bash
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

3. Reinstall SageAttention
### SpargeAttn Build Fails on Windows

**Cause:** The Windows linker has path length limitations.

**Fix:** Use WSL2 or native Linux:

```bash
# In WSL2
cd SpargeAttn
export TORCH_CUDA_ARCH_LIST="8.9"
pip install -e . --no-build-isolation
```
### Slower Than Expected

**Cause:** Wrong GPU architecture compiled, or the kernel is falling back.

**Fix:**

1. Check the logs for "Using pytorch cross attention" (the fallback indicator)
2. Rebuild with the correct `TORCH_CUDA_ARCH_LIST`
3. Verify the GPU compute capability:

```bash
nvidia-smi --query-gpu=compute_cap --format=csv
```
### Quality Degradation with SpargeAttn

**Cause:** Pruning thresholds are too aggressive.

**Fix:** The thresholds are not yet user-configurable in the UI, but you can modify `src/Attention/AttentionMethods.py`:

```python
# Line ~290
out = spas_sage_attn.spas_sage2_attn_meansim_cuda(
    q, k, v,
    simthreshd1=0.7,  # Increase from 0.6 for better quality
    cdfthreshd=0.98,  # Increase from 0.97
    pvthreshd=15,
    is_causal=False,
)
```
## Citation

If you use SageAttention or SpargeAttn in your work:

```bibtex
@article{sageattention2024,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and others},
  journal={arXiv preprint arXiv:2410.02367},
  year={2024}
}

@article{spargeattn2025,
  title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference},
  author={Zhang, Jintao and others},
  journal={arXiv preprint},
  year={2025}
}
```
## Resources

- [SageAttention Repository](https://github.com/thu-ml/SageAttention)
- [SpargeAttn Repository](https://github.com/thu-ml/SpargeAttn)
- [SageAttention Paper](https://arxiv.org/abs/2410.02367)
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) (related work)