# Cache-DiT Acceleration
SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss.
## Overview
**Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop:
- **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences
- **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions
- **SCM (Step Computation Masking)**: Step-level caching control for additional speedup
## Basic Usage
Enable Cache-DiT by setting the `SGLANG_CACHE_DIT_ENABLED` environment variable and using `sglang generate` or `sglang serve`:
```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains"
```
## Diffusers Backend
Cache-DiT supports loading acceleration configs from a custom YAML file. For
diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This
flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`).
### Single GPU inference
Define a `cache.yaml` file that contains:
```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
```
Then apply the config with:
```bash
sglang generate \
--backend diffusers \
--model-path Qwen/Qwen-Image \
--cache-dit-config cache.yaml \
--prompt "A beautiful sunset over the mountains"
```
### Distributed inference
- 1D Parallelism
Define a parallelism-only config file `parallel.yaml` that contains:
```yaml
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
```
`ulysses_size: auto` tells cache-dit to use the detected `world_size` as the Ulysses size; otherwise, set it to a specific integer, e.g. 4. Then apply the distributed config (note: add `--num-gpus N` to specify the number of GPUs for distributed inference):
```bash
sglang generate \
--backend diffusers \
--num-gpus 4 \
--model-path Qwen/Qwen-Image \
--cache-dit-config parallel.yaml \
--prompt "A futuristic cityscape at sunset"
```
- 2D Parallelism
You can also define a 2D parallelism config file `parallel_2d.yaml` that contains:
```yaml
parallelism_config:
  ulysses_size: auto
  tp_size: 2
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
```
Then apply the 2D parallelism config from YAML. Here `tp_size: 2` enables tensor parallelism of size 2, and `ulysses_size: auto` means cache-dit uses `world_size // tp_size` as the Ulysses size.
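A launch command analogous to the 1D case could look like this (a sketch assuming 4 GPUs, so the auto-detected Ulysses size is `4 // 2 = 2`):

```shell
sglang generate \
--backend diffusers \
--num-gpus 4 \
--model-path Qwen/Qwen-Image \
--cache-dit-config parallel_2d.yaml \
--prompt "A futuristic cityscape at sunset"
```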
- 3D Parallelism
You can also define a 3D parallelism config file `parallel_3d.yaml` that contains:
```yaml
parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
```
Then apply the 3D parallelism config from YAML. Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` combine Ulysses, ring, and tensor parallelism, each of size 2 (requiring 2 × 2 × 2 = 8 GPUs).
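The auto-detection rule described above can be sketched as a small helper (`resolve_ulysses_size` is a hypothetical name for illustration, not a cache-dit API):

```python
def resolve_ulysses_size(ulysses_size, world_size, tp_size=1, ring_size=1):
    """Resolve 'auto' to the remaining degree of parallelism.

    Mirrors the rule described above: with 1D parallelism the Ulysses
    size is the full world size; with tensor/ring parallelism it is
    world_size divided by the other parallel degrees.
    """
    if ulysses_size == "auto":
        return world_size // (tp_size * ring_size)
    return ulysses_size

# 1D: Ulysses spans all 4 GPUs
print(resolve_ulysses_size("auto", world_size=4))             # → 4
# 2D: tp_size=2 leaves 2 GPUs for Ulysses
print(resolve_ulysses_size("auto", world_size=4, tp_size=2))  # → 2
# 3D: all sizes set explicitly
print(resolve_ulysses_size(2, world_size=8, tp_size=2, ring_size=2))  # → 2
```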
### Hybrid Cache and Parallelism
Define a hybrid cache and parallelism config file `hybrid.yaml` that contains:
```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
```
Then apply the hybrid cache and parallelism config:
```bash
sglang generate \
--backend diffusers \
--num-gpus 4 \
--model-path Qwen/Qwen-Image \
--cache-dit-config hybrid.yaml \
--prompt "A beautiful sunset over the mountains"
```
## Advanced Configuration
### DBCache Parameters
DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|-----------|---------------------------|---------|------------------------------------------|
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
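A minimal sketch of the caching decision these parameters control (the helper and its exact formula are illustrative assumptions, not cache-dit internals):

```python
def should_use_cache(prev_residual, curr_residual, step, *,
                     warmup=4, threshold=0.24,
                     continuous_cached=0, max_continuous=3):
    """Decide whether a block's cached output can be reused this step.

    Assumption: the relative L1 difference between consecutive residuals
    is compared against the threshold (R), after the warmup window (W)
    and subject to the continuous-cache cap (MC).
    """
    if step < warmup:                        # W: never cache during warmup
        return False
    if continuous_cached >= max_continuous:  # MC: force a full compute
        return False
    diff = sum(abs(a - b) for a, b in zip(prev_residual, curr_residual))
    norm = sum(abs(a) for a in prev_residual) or 1.0
    return diff / norm < threshold           # R: residual difference threshold

print(should_use_cache([1.0, 1.0], [1.01, 0.99], step=6))  # small change → True
print(should_use_cache([1.0, 1.0], [2.0, 0.0], step=6))    # large change → False
```

Lowering `threshold` makes the rule stricter (fewer cache hits, better quality); raising it trades quality for speed.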
### TaylorSeer Configuration
TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|-----------|-------------------------------|---------|---------------------------------|
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
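The idea behind TaylorSeer can be illustrated with a finite-difference Taylor extrapolation (a simplified sketch, not cache-dit's actual calibrator):

```python
def taylor_predict(history, order=1):
    """Extrapolate the next value from recent history via Taylor expansion.

    order=1 uses the first backward difference (linear extrapolation);
    order=2 adds a second-difference (curvature) term.
    """
    x = history
    pred = x[-1] + (x[-1] - x[-2])                  # first-order term
    if order >= 2 and len(x) >= 3:
        pred += 0.5 * (x[-1] - 2 * x[-2] + x[-3])   # second-order term
    return pred

print(taylor_predict([1.0, 2.0, 3.0], order=1))  # → 4.0 (linear trend)
print(taylor_predict([1.0, 4.0, 9.0], order=2))  # → 15.0 (quadratic trend)
```

Higher order captures curvature in how features evolve across denoising steps, at the cost of needing a longer history.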
### Combined Configuration Example
DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters
simultaneously:
```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A curious raccoon in a forest"
```
### SCM (Step Computation Masking)
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
which reuse cached results.
**SCM Presets**
SCM is configured with presets:
| Preset | Compute Ratio | Speed | Quality |
|----------|---------------|----------|------------|
| `none` | 100% | Baseline | Best |
| `slow` | ~75% | ~1.3x | High |
| `medium` | ~50% | ~2x | Good |
| `fast` | ~35% | ~3x | Acceptable |
| `ultra` | ~25% | ~4x | Lower |
**Usage**
```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
```
**Custom SCM Bins**
For fine-grained control over which steps to compute vs cache:
```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
```
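The bins alternate between runs of computed steps and runs of cached steps. A sketch of how such bins could expand into a per-step schedule (`expand_scm_bins` is an illustrative assumption, not a cache-dit function):

```python
def expand_scm_bins(compute_bins, cache_bins):
    """Interleave compute/cache run lengths into a per-step schedule.

    Returns a list of booleans: True = compute the step fully,
    False = reuse cached results.
    """
    schedule = []
    for compute, cache in zip(compute_bins, cache_bins):
        schedule += [True] * compute + [False] * cache
    return schedule

# Bins from the example above: "8,3,3,2,2" compute vs "1,2,2,2,3" cache
schedule = expand_scm_bins([8, 3, 3, 2, 2], [1, 2, 2, 2, 3])
print(len(schedule), sum(schedule))  # 28 total steps, 18 computed (~64%)
```

The bin totals should match the pipeline's number of inference steps; the compute/total ratio determines the effective speedup.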
**SCM Policy**
| Policy | Env Variable | Description |
|-----------|---------------------------------------|---------------------------------------------|
| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) |
| `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern |
## Environment Variables
All Cache-DiT parameters can be configured via environment variables.
See [Environment Variables](../../environment_variables.md) for the complete list.
## Supported Models
SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
| Model Family | Example Models |
|--------------|-----------------------------|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |
## Performance Tips
1. **Start with defaults**: The default parameters work well for most models
2. **Use TaylorSeer**: It typically improves both speed and quality
3. **Tune R threshold**: Lower values = better quality, higher values = faster
4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance
5. **Warmup matters**: Higher warmup = more stable caching decisions
## Limitations
- **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically
disabled when `world_size > 1`.
- **SCM minimum steps**: SCM requires >= 8 inference steps to be effective
- **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported
## Troubleshooting
### SCM disabled for low step count
For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache
acceleration still works.
## References
- [Cache-DiT](https://github.com/vipshop/cache-dit)
- [SGLang Diffusion](../index.md)