| # Cache-DiT Acceleration |
|
|
| SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss. |
|
|
| ## Overview |
|
|
| **Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop: |
|
|
| - **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences |
| - **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions |
| - **SCM (Step Computation Masking)**: Step-level caching control for additional speedup |
|
|
| ## Basic Usage |
|
|
| Enable Cache-DiT by setting the `SGLANG_CACHE_DIT_ENABLED` environment variable when running `sglang generate` or `sglang serve`: |
|
|
| ```bash |
| SGLANG_CACHE_DIT_ENABLED=true \ |
| sglang generate --model-path Qwen/Qwen-Image \ |
| --prompt "A beautiful sunset over the mountains" |
| ``` |
|
|
| ## Diffusers Backend |
|
|
| Cache-DiT supports loading acceleration configs from a custom YAML file. For |
| diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This |
| flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`). |
|
|
| ### Single GPU inference |
|
|
| Define a `cache.yaml` file that contains: |
|
|
| ```yaml |
| cache_config: |
|   max_warmup_steps: 8 |
|   warmup_interval: 2 |
|   max_cached_steps: -1 |
|   max_continuous_cached_steps: 2 |
|   Fn_compute_blocks: 1 |
|   Bn_compute_blocks: 0 |
|   residual_diff_threshold: 0.12 |
|   enable_taylorseer: true |
|   taylorseer_order: 1 |
| ``` |
|
|
| Then apply the config with: |
|
|
| ```bash |
| sglang generate \ |
| --backend diffusers \ |
| --model-path Qwen/Qwen-Image \ |
| --cache-dit-config cache.yaml \ |
| --prompt "A beautiful sunset over the mountains" |
| ``` |
|
|
| ### Distributed inference |
|
|
| - 1D Parallelism |
|
|
| Define a parallelism-only config file, `parallel.yaml`, that contains: |
|
|
| ```yaml |
| parallelism_config: |
|   ulysses_size: auto |
|   parallel_kwargs: |
|     attention_backend: native |
|     extra_parallel_modules: ["text_encoder", "vae"] |
| ``` |
|
|
| Here `ulysses_size: auto` tells cache-dit to use the detected `world_size` as the Ulysses size. To override it, set an explicit integer, e.g. `4`. |
| |
| Apply the distributed config with the command below, adding `--num-gpus N` to set the number of GPUs used for distributed inference: |
| |
| ```bash |
| sglang generate \ |
| --backend diffusers \ |
| --num-gpus 4 \ |
| --model-path Qwen/Qwen-Image \ |
| --cache-dit-config parallel.yaml \ |
| --prompt "A futuristic cityscape at sunset" |
| ``` |
| |
| - 2D Parallelism |
| |
| You can also define a 2D parallelism config file, `parallel_2d.yaml`, that contains: |
|
|
| ```yaml |
| parallelism_config: |
|   ulysses_size: auto |
|   tp_size: 2 |
|   parallel_kwargs: |
|     attention_backend: native |
|     extra_parallel_modules: ["text_encoder", "vae"] |
| ``` |
| Here `tp_size: 2` enables tensor parallelism of size 2, and `ulysses_size: auto` tells cache-dit to use `world_size // tp_size` as the Ulysses size. Pass this file via `--cache-dit-config`, as in the 1D example. |
| |
| - 3D Parallelism |
| |
| You can also define a 3D parallelism config file, `parallel_3d.yaml`, that contains: |
|
|
| ```yaml |
| parallelism_config: |
|   ulysses_size: 2 |
|   ring_size: 2 |
|   tp_size: 2 |
|   parallel_kwargs: |
|     attention_backend: native |
|     extra_parallel_modules: ["text_encoder", "vae"] |
| ``` |
| Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` combine Ulysses parallelism, ring parallelism, and tensor parallelism, each of size 2. Pass this file via `--cache-dit-config`, as in the 1D example. |
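
The sizing rules above can be checked with a little arithmetic. The sketch below is illustrative only and is not cache-dit code: `resolve_ulysses_size` and `gpus_needed` are hypothetical helper names. The first function mirrors the `auto` rule described above. The second assumes that the product of the three sizes must equal the number of GPUs, which is the usual convention for hybrid parallelism but is not stated in this document.

```python
def resolve_ulysses_size(world_size: int, tp_size: int = 1) -> int:
    """Mirror the documented `ulysses_size: auto` rule: world_size // tp_size."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    return world_size // tp_size


def gpus_needed(ulysses_size: int, ring_size: int = 1, tp_size: int = 1) -> int:
    """ASSUMPTION: the GPU count is the product of all parallel sizes."""
    return ulysses_size * ring_size * tp_size


# 1D: 4 GPUs, no tensor parallelism, so Ulysses spans all 4 GPUs.
ulysses_1d = resolve_ulysses_size(world_size=4)

# 2D: 4 GPUs with tp_size: 2 leaves a Ulysses size of 2.
ulysses_2d = resolve_ulysses_size(world_size=4, tp_size=2)

# 3D: ulysses_size: 2, ring_size: 2, tp_size: 2 (under the assumption above).
gpus_3d = gpus_needed(ulysses_size=2, ring_size=2, tp_size=2)
```

Under that assumption, the 3D example would be launched with `--num-gpus 8`.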
|
|
| ### Hybrid Cache and Parallelism |
|
|
| Define a hybrid config file, `hybrid.yaml`, that combines caching and parallelism: |
|
|
| ```yaml |
| cache_config: |
|   max_warmup_steps: 8 |
|   warmup_interval: 2 |
|   max_cached_steps: -1 |
|   max_continuous_cached_steps: 2 |
|   Fn_compute_blocks: 1 |
|   Bn_compute_blocks: 0 |
|   residual_diff_threshold: 0.12 |
|   enable_taylorseer: true |
|   taylorseer_order: 1 |
| parallelism_config: |
|   ulysses_size: auto |
|   parallel_kwargs: |
|     attention_backend: native |
|     extra_parallel_modules: ["text_encoder", "vae"] |
| ``` |
|
|
| Then apply the hybrid config with: |
|
|
| ```bash |
| sglang generate \ |
| --backend diffusers \ |
| --num-gpus 4 \ |
| --model-path Qwen/Qwen-Image \ |
| --cache-dit-config hybrid.yaml \ |
| --prompt "A beautiful sunset over the mountains" |
| ``` |
|
|
| ## Advanced Configuration |
|
|
| ### DBCache Parameters |
|
|
| DBCache controls block-level caching behavior: |
|
|
| | Parameter | Env Variable | Default | Description | |
| |-----------|---------------------------|---------|------------------------------------------| |
| | Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute | |
| | Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute | |
| | W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts | |
| | R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold | |
| | MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps | |
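
To make the roles of W, R, and MC concrete, here is a deliberately simplified sketch of a residual-difference cache decision. It is not the Cache-DiT implementation: the function names, the use of plain Python lists in place of tensors, and the choice of a relative L1 difference are all assumptions made for illustration. The default values are taken from the table above.

```python
def relative_l1_diff(current, previous):
    """Relative L1 distance between two residuals (illustrative metric)."""
    numerator = sum(abs(c - p) for c, p in zip(current, previous))
    denominator = sum(abs(p) for p in previous)
    return numerator / denominator if denominator else float("inf")


def can_use_cache(current, previous, step, continuous_cached=0,
                  warmup=4, rdt=0.24, max_continuous=3):
    """Decide whether a step may reuse cached block outputs.

    warmup         -> W: the first steps are always computed
    rdt            -> R: the residual must change by less than this
    max_continuous -> MC: cap on back-to-back cached steps
    """
    if step < warmup:
        return False
    if continuous_cached >= max_continuous:
        return False
    return relative_l1_diff(current, previous) < rdt


previous = [1.0, 2.0, 3.0]
small_change = [1.1, 2.0, 3.1]   # relative diff is about 0.033
large_change = [2.0, 4.0, 6.0]   # relative diff is 1.0

during_warmup = can_use_cache(small_change, previous, step=2)
after_warmup = can_use_cache(small_change, previous, step=5)
too_different = can_use_cache(large_change, previous, step=5)
hit_mc_limit = can_use_cache(small_change, previous, step=5, continuous_cached=3)
```

In this sketch only `after_warmup` is `True`: the other three calls are rejected by the warmup window, the threshold, and the continuous-step cap respectively. Raising `rdt` lets more steps pass the threshold test, which is why a higher R trades quality for speed.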
|
|
| ### TaylorSeer Configuration |
|
|
| TaylorSeer improves caching accuracy using Taylor expansion: |
|
|
| | Parameter | Env Variable | Default | Description | |
| |-----------|-------------------------------|---------|---------------------------------| |
| | Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator | |
| | Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) | |
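
The idea behind a first-order Taylor expansion can be sketched as follows: instead of reusing a stale cached feature unchanged, extrapolate it from the two most recently computed steps. This is a minimal illustration of order 1 only, with a hypothetical function name and plain lists standing in for tensors. It should not be read as the TaylorSeer implementation.

```python
def taylor_predict_order1(step_prev, feat_prev, step_last, feat_last, step_target):
    """First-order extrapolation: f(t) ~ f(t1) + f'(t1) * (t - t1).

    The derivative is estimated by a finite difference between the two
    most recent fully computed steps.
    """
    span = step_last - step_prev
    if span == 0:
        raise ValueError("the two reference steps must differ")
    delta = step_target - step_last
    return [
        last + (last - prev) / span * delta
        for prev, last in zip(feat_prev, feat_last)
    ]


# Features computed at steps 4 and 6; step 7 is served from the cache.
predicted = taylor_predict_order1(
    step_prev=4, feat_prev=[1.0, 2.0],
    step_last=6, feat_last=[2.0, 2.0],
    step_target=7,
)
```

Here the first component was rising by 0.5 per step, so the prediction for step 7 continues that trend, while the second component was flat and stays unchanged.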
|
|
| ### Combined Configuration Example |
|
|
| DBCache and TaylorSeer are complementary strategies that work together, so you can configure both sets of parameters |
| simultaneously: |
|
|
| ```bash |
| SGLANG_CACHE_DIT_ENABLED=true \ |
| SGLANG_CACHE_DIT_FN=2 \ |
| SGLANG_CACHE_DIT_BN=1 \ |
| SGLANG_CACHE_DIT_WARMUP=4 \ |
| SGLANG_CACHE_DIT_RDT=0.4 \ |
| SGLANG_CACHE_DIT_MC=4 \ |
| SGLANG_CACHE_DIT_TAYLORSEER=true \ |
| SGLANG_CACHE_DIT_TS_ORDER=2 \ |
| sglang generate --model-path black-forest-labs/FLUX.1-dev \ |
| --prompt "A curious raccoon in a forest" |
| ``` |
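
When launching from a script, the same settings can be assembled programmatically. The sketch below uses only the environment variable names from the tables above. The helper `cache_dit_env` is a hypothetical name, and the launch line is left commented out so the snippet runs without SGLang installed.

```python
import os


def cache_dit_env(fn=1, bn=0, warmup=4, rdt=0.24, mc=3,
                  taylorseer=False, ts_order=1):
    """Build Cache-DiT environment variables; defaults follow the tables above."""
    return {
        "SGLANG_CACHE_DIT_ENABLED": "true",
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


# Same values as the shell example above, merged over the current environment.
env = {**os.environ, **cache_dit_env(fn=2, bn=1, rdt=0.4, mc=4,
                                     taylorseer=True, ts_order=2)}
command = [
    "sglang", "generate",
    "--model-path", "black-forest-labs/FLUX.1-dev",
    "--prompt", "A curious raccoon in a forest",
]
# import subprocess
# subprocess.run(command, env=env, check=True)  # uncomment to launch
```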
|
|
| ### SCM (Step Computation Masking) |
|
|
| SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and |
| which to serve from cached results. |
|
|
| **SCM Presets** |
|
|
| SCM is configured with presets: |
|
|
| | Preset | Compute Ratio | Speed | Quality | |
| |----------|---------------|----------|------------| |
| | `none` | 100% | Baseline | Best | |
| | `slow` | ~75% | ~1.3x | High | |
| | `medium` | ~50% | ~2x | Good | |
| | `fast` | ~35% | ~3x | Acceptable | |
| | `ultra` | ~25% | ~4x | Lower | |
|
|
| **Usage** |
|
|
| ```bash |
| SGLANG_CACHE_DIT_ENABLED=true \ |
| SGLANG_CACHE_DIT_SCM_PRESET=medium \ |
| sglang generate --model-path Qwen/Qwen-Image \ |
| --prompt "A futuristic cityscape at sunset" |
| ``` |
|
|
| **Custom SCM Bins** |
|
|
| For fine-grained control over which steps to compute vs cache: |
|
|
| ```bash |
| SGLANG_CACHE_DIT_ENABLED=true \ |
| SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \ |
| SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \ |
| sglang generate --model-path Qwen/Qwen-Image \ |
| --prompt "A futuristic cityscape at sunset" |
| ``` |
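
One way to read the two bin lists is as an interleaved schedule: compute 8 steps, cache 1, compute 3, cache 2, and so on. The sketch below expands the bins into a per-step mask under that reading. The interleaving order and the helper names `parse_bins` and `expand_bins` are assumptions made for illustration, not a description of Cache-DiT internals.

```python
def parse_bins(value: str) -> list[int]:
    """Parse a comma-separated bin string such as "8,3,3,2,2"."""
    return [int(part) for part in value.split(",")]


def expand_bins(compute_bins, cache_bins):
    """Interleave compute and cache bins into a mask (1 = compute, 0 = cached).

    ASSUMPTION: each compute bin is followed by the cache bin at the same index.
    """
    mask = []
    for index, n_compute in enumerate(compute_bins):
        mask.extend([1] * n_compute)
        if index < len(cache_bins):
            mask.extend([0] * cache_bins[index])
    return mask


compute_bins = parse_bins("8,3,3,2,2")
cache_bins = parse_bins("1,2,2,2,3")
mask = expand_bins(compute_bins, cache_bins)
compute_ratio = sum(mask) / len(mask)
```

Under this reading, the example covers 28 denoising steps, of which 18 are computed and 10 are cached, for a compute ratio of about 64%.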
|
|
| **SCM Policy** |
|
|
| | Policy | Env Variable | Description | |
| |-----------|---------------------------------------|---------------------------------------------| |
| | `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) | |
| | `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern | |
|
|
| ## Environment Variables |
|
|
| All Cache-DiT parameters can be configured via environment variables. |
| See [Environment Variables](../../environment_variables.md) for the complete list. |
|
|
| ## Supported Models |
|
|
| SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion: |
|
|
| | Model Family | Example Models | |
| |--------------|-----------------------------| |
| | Wan | Wan2.1, Wan2.2 | |
| | Flux | FLUX.1-dev, FLUX.2-dev | |
| | Z-Image | Z-Image-Turbo | |
| | Qwen | Qwen-Image, Qwen-Image-Edit | |
| | Hunyuan | HunyuanVideo | |
|
|
| ## Performance Tips |
|
|
| 1. **Start with defaults**: The default parameters work well for most models |
| 2. **Use TaylorSeer**: It typically improves both speed and quality |
| 3. **Tune R threshold**: Lower values = better quality, higher values = faster |
| 4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance |
| 5. **Warmup matters**: Higher warmup = more stable caching decisions |
|
|
| ## Limitations |
|
|
| - **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically |
| disabled when `world_size > 1`. |
| - **SCM minimum steps**: SCM requires >= 8 inference steps to be effective |
| - **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported |
|
|
| ## Troubleshooting |
|
|
| ### SCM disabled for low step count |
|
|
| For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache |
| acceleration still works. |
|
|
| ## References |
|
|
| - [Cache-DiT](https://github.com/vipshop/cache-dit) |
| - [SGLang Diffusion](../index.md) |
|
|