# SageAttention & SpargeAttn

## Overview

SageAttention and SpargeAttn are drop-in replacements for PyTorch's scaled dot-product attention that can provide significant speedup with zero to minimal quality loss. They work by optimizing the compute-heavy attention mechanism used throughout diffusion models (UNet, VAE, Flux Transformers).

- **SageAttention**: Quantizes the query/key tensors to INT8 for the attention-score matmul while keeping the rest of the computation in FP16
- **SpargeAttn**: Adds dynamic sparsity pruning on top of SageAttention, skipping redundant attention computations

Both are **training-free**, **hardware-accelerated** CUDA kernels that integrate transparently into LightDiffusion-Next.

## How It Works

### SageAttention

Standard attention computes:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

SageAttention accelerates this by:

1. **Quantizing Q and K** to INT8 (after smoothing K) before the attention-score matrix multiplication
2. **Keeping the softmax and the subsequent value matmul in FP16** to preserve output precision
3. **Fusing operations** (scaling, softmax, matmuls) in hand-tuned CUDA kernels
4. **Dequantizing** the INT8 score matmul back to floating point before the softmax

This reduces memory bandwidth (the quantized tensors use half the space) and leverages INT8 Tensor Cores more efficiently.

### SpargeAttn

SpargeAttn extends SageAttention with **sparse attention masking**:

1. Computes a mean-similarity metric between blocks of queries and keys
2. Prunes attention blocks whose similarity falls below a configurable threshold (default: 0.6)
3. Applies cumulative-distribution filtering to keep only the top 97% of attention mass
4. Uses partial vector thresholding to skip redundant computations

The result: 40-60% total speedup over baseline PyTorch attention with minimal impact on output quality.

## Installation

### SageAttention (All Platforms)

**Prerequisites:**

- CUDA Toolkit 11.8+ (must match your PyTorch CUDA version)
- Python 3.8+
- PyTorch with CUDA support

**Install:**

```bash
# Clone repository
git clone https://github.com/thu-ml/SageAttention
cd SageAttention

# Install from source (no build isolation to respect existing CUDA setup)
pip install -e . --no-build-isolation

# Verify installation
python -c "import sageattention; print('SageAttention installed successfully')"
```
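Once installed, a quick way to sanity-check the kernel is to run it on random tensors and compare against PyTorch SDPA. This is a minimal sketch, assuming a CUDA GPU, FP16 inputs, and the `sageattn(q, k, v, tensor_layout="HND")` call shown later under Technical Details; small numerical differences against SDPA are expected because of the INT8 quantization.

```python
# Minimal sanity check: SageAttention vs. PyTorch SDPA on random FP16 tensors.
# Layout "HND" = (batch, heads, seq_len, head_dim).
import torch
import torch.nn.functional as F
import sageattention

b, h, n, d = 1, 8, 1024, 64
q, k, v = (torch.randn(b, h, n, d, device="cuda", dtype=torch.float16) for _ in range(3))

out_sage = sageattention.sageattn(q, k, v, tensor_layout="HND")
out_ref = F.scaled_dot_product_attention(q, k, v)

# Expect a small but non-zero difference from the INT8 quantization.
print("max abs diff:", (out_sage - out_ref).abs().max().item())
```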
### SpargeAttn (Linux/WSL2 Only)

**Prerequisites:**

- Same as SageAttention
- Linux or WSL2 environment (Windows native builds fail due to linker path limits)
- GPU with compute capability 8.0-9.0 (RTX 30xx, 40xx, A100, H100)

**Install:**

```bash
# Clone repository
git clone https://github.com/thu-ml/SpargeAttn
cd SpargeAttn

# Set GPU architecture (critical for performance)
export TORCH_CUDA_ARCH_LIST="9.0"  # Or your GPU: 8.0, 8.6, 8.9, 9.0

# Install from source
pip install -e . --no-build-isolation

# Verify installation
python -c "import spas_sage_attn; print('SpargeAttn installed successfully')"
```

**GPU Architecture Reference:**

| GPU Model | Compute Capability | TORCH_CUDA_ARCH_LIST |
|-----------|--------------------|----------------------|
| RTX 3060/3070/3080/3090 | 8.6 | `"8.6"` |
| RTX 4060/4070/4080/4090 | 8.9 | `"8.9"` |
| A100 | 8.0 | `"8.0"` |
| H100 | 9.0 | `"9.0"` |
| RTX 5060/5070/5080/5090 | 12.0 | SageAttention supported, SpargeAttn pending |

### Docker Installation

Both kernels are built automatically during Docker image creation if the architecture is supported:

```bash
# Build with SpargeAttn (compute 8.0-9.0)
docker-compose build --build-arg TORCH_CUDA_ARCH_LIST="8.9"

# RTX 50xx builds (SageAttention only, no SpargeAttn yet)
docker-compose build --build-arg TORCH_CUDA_ARCH_LIST="12.0"
```

## Usage

### Automatic Detection

LightDiffusion-Next automatically detects and enables the best available attention backend at startup:

```python
# Priority order (highest to lowest): SpargeAttn > SageAttention > xformers > PyTorch SDPA
```

Check which backend is active in the server logs:

```bash
# SpargeAttn enabled
cat logs/server.log | grep "attention"
# Output: Using SpargeAttn (Sparse + SageAttention) cross attention

# SageAttention enabled
# Output: Using SageAttention cross attention

# Fallback
# Output: Using pytorch cross attention
```

### Streamlit UI

No configuration needed — SageAttention/SpargeAttn are always active if installed.

### REST API

Same as the UI — backend selection is transparent:

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a serene mountain lake at dawn",
    "width": 768,
    "height": 512,
    "num_images": 1
  }'
# Automatically uses SpargeAttn if available
```

### Manual Disable

Force PyTorch SDPA for debugging:

```bash
export LD_DISABLE_SAGE_ATTENTION=1
python streamlit_app.py
```
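The actual selection happens inside LightDiffusion-Next at startup; the sketch below only illustrates the priority order from Automatic Detection and the `LD_DISABLE_SAGE_ATTENTION` override above. The function name and return labels are hypothetical, not the project's real API.

```python
# Hypothetical sketch of the backend priority described above; the function
# name and return labels are illustrative, not LightDiffusion-Next's code.
import importlib.util
import os

def pick_attention_backend() -> str:
    # Respect the manual override described in "Manual Disable".
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "pytorch sdpa"
    # Highest priority first: SpargeAttn > SageAttention > xformers > SDPA.
    if importlib.util.find_spec("spas_sage_attn") is not None:
        return "spargeattn"
    if importlib.util.find_spec("sageattention") is not None:
        return "sageattention"
    if importlib.util.find_spec("xformers") is not None:
        return "xformers"
    return "pytorch sdpa"

print("selected backend:", pick_attention_backend())
```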
## Performance

Both SageAttention and SpargeAttn provide measurable speedup over the PyTorch SDPA baseline:

- **SageAttention**: Moderate speedup with zero quality loss (reported ~15-20% in papers)
- **SpargeAttn**: Significant speedup with minimal quality loss (reported ~40-60% in papers)

Actual performance gains vary based on:

- GPU architecture and VRAM
- Model type (SD1.5, SDXL, Flux)
- Resolution and batch size
- Head dimensions and sequence lengths

**Note:** Benchmark your specific setup to measure real-world performance.

## Technical Details

### Head Dimension Support

Both kernels natively support head dimensions of `[64, 96, 128]`. For other dimensions:

- **< 64**: Pad to 64, compute, then slice the result
- **64-128**: Pad to 128, compute, then slice the result
- **> 128**: Fall back to xformers or PyTorch SDPA

LightDiffusion-Next handles the padding and slicing automatically.

### Tensor Layout

SageAttention expects tensors in `(batch_size, num_heads, seq_len, head_dim)` format. The pipeline reshapes inputs transparently:

```python
# Internal reshaping (handled automatically)
q, k, v = map(
    lambda t: t.reshape(b, -1, heads, dim_head).transpose(1, 2),
    (q, k, v),
)
out = sageattention.sageattn(q, k, v, tensor_layout="HND")
```

### SpargeAttn Thresholds

Default pruning parameters (tuned for a quality/speed balance):

```python
out = spas_sage_attn.spas_sage2_attn_meansim_cuda(
    q, k, v,
    simthreshd1=0.6,   # Similarity threshold (60%)
    cdfthreshd=0.97,   # Keep top 97% of attention scores
    pvthreshd=15,      # Partial vector threshold
    is_causal=False,
)
```

Adjust `simthreshd1` for different trade-offs:

- `0.5`: More aggressive pruning, higher speedup, slight quality loss
- `0.7`: Conservative pruning, lower speedup, minimal quality loss

## Compatibility

### Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ Flux (both cross-attention and self-attention blocks)
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix, ADetailer, Img2Img
- ✅ Stable-Fast (when stacked)
- ✅ WaveSpeed caching (when stacked)

### Known Limitations

- ❌ RTX 50xx (compute 12.0) does not support SpargeAttn yet (SageAttention works)
- ❌ CPU-only inference (CUDA required)
- ❌ AMD GPUs (ROCm port not available)
- ⚠️ Head dimensions > 128 fall back to slower backends

## Troubleshooting

### Import Error: `No module named 'sageattention'`

**Cause:** Not installed or the installation failed.

**Fix:**

```bash
cd SageAttention
pip install -e . --no-build-isolation
```

Verify the CUDA toolkit is accessible:

```bash
nvcc --version  # Should match PyTorch CUDA version
```

### Compilation Error: `nvcc fatal error`

**Cause:** CUDA toolkit not found or version mismatch.

**Fix:**

1. Install a CUDA toolkit matching your PyTorch version
2. Add CUDA to PATH:
   ```bash
   export PATH=/usr/local/cuda/bin:$PATH
   export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
   ```
3. Reinstall SageAttention

### SpargeAttn Build Fails on Windows

**Cause:** The Windows linker has path length limitations.

**Fix:** Use WSL2 or native Linux:

```bash
# In WSL2
cd SpargeAttn
export TORCH_CUDA_ARCH_LIST="8.9"
pip install -e . --no-build-isolation
```

### Slower Than Expected

**Cause:** Wrong GPU architecture compiled or kernel fallback.

**Fix:**

1. Check the logs for "Using pytorch cross attention" (fallback indicator)
2. Rebuild with the correct `TORCH_CUDA_ARCH_LIST`
3. Verify GPU compute capability:
   ```bash
   nvidia-smi --query-gpu=compute_cap --format=csv
   ```
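If the logs show SageAttention or SpargeAttn is active but generation still feels slow, a quick micro-benchmark against PyTorch SDPA can confirm whether the kernel is actually faster on your shapes. This is a minimal sketch, assuming a CUDA GPU, FP16 tensors, and the `sageattn` call shown under Technical Details; the shape below is arbitrary, so adjust it to match your model.

```python
# Minimal timing sketch: SageAttention vs. PyTorch SDPA on one arbitrary shape.
import torch
import torch.nn.functional as F
import sageattention

def time_ms(fn, iters=50):
    # Warm up, then time with CUDA events for accurate GPU measurements.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

b, h, n, d = 2, 10, 4096, 64
q, k, v = (torch.randn(b, h, n, d, device="cuda", dtype=torch.float16) for _ in range(3))

print("sdpa ms:", time_ms(lambda: F.scaled_dot_product_attention(q, k, v)))
print("sage ms:", time_ms(lambda: sageattention.sageattn(q, k, v, tensor_layout="HND")))
```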
### Quality Degradation with SpargeAttn

**Cause:** Pruning thresholds are too aggressive.

**Fix:** Not currently user-configurable in the UI, but you can modify `src/Attention/AttentionMethods.py`:

```python
# Line ~290
out = spas_sage_attn.spas_sage2_attn_meansim_cuda(
    q, k, v,
    simthreshd1=0.7,   # Increase from 0.6 for better quality
    cdfthreshd=0.98,   # Increase from 0.97
    pvthreshd=15,
    is_causal=False,
)
```

## Citation

If you use SageAttention or SpargeAttn in your work:

```bibtex
@article{sageattention2024,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Zhang, Jia and Zhai, Pengle and others},
  journal={arXiv preprint arXiv:2410.02367},
  year={2024}
}

@article{spargeattn2025,
  title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference},
  author={Zhang, Jintao and others},
  journal={arXiv preprint},
  year={2025}
}
```

## Resources

- [SageAttention Repository](https://github.com/thu-ml/SageAttention)
- [SpargeAttn Repository](https://github.com/thu-ml/SpargeAttn)
- [SageAttention Paper](https://arxiv.org/abs/2410.02367)
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) (related work)