	# Cache-DiT Acceleration

	SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 1.69x inference speedup with minimal quality loss.

	## Overview

	Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:

	- DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
	- TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
	- SCM (Step Computation Masking): Step-level caching control for additional speedup

	## Basic Usage

	Enable Cache-DiT by setting the `SGLANG_CACHE_DIT_ENABLED` environment variable when running `sglang generate` or `sglang serve`:

	```bash
	SGLANG_CACHE_DIT_ENABLED=true \
	sglang generate --model-path Qwen/Qwen-Image \
	--prompt "A beautiful sunset over the mountains"
	```
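
The command above sets the variable for that single invocation only. If several commands in the same shell session should use Cache-DiT, exporting the variable once is an alternative. This is a minimal sketch; the `command -v` guard is not required and is only there so the snippet does nothing on a machine without `sglang` installed:

```bash
# Export once; every sglang command started from this shell inherits the setting.
export SGLANG_CACHE_DIT_ENABLED=true

# Guard: skip the run if the sglang CLI is not on PATH.
if command -v sglang >/dev/null 2>&1; then
  sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"
fi
```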

	## Diffusers Backend

	Cache-DiT can load acceleration configs from a custom YAML or JSON file. With the
	`diffusers` backend, pass the file path via `--cache-dit-config`. This
	flow requires cache-dit >= 1.2.0, which provides `cache_dit.load_configs`.

	### Single GPU inference

	Define a `cache.yaml` file that contains:

	```yaml
	cache_config:
	  max_warmup_steps: 8
	  warmup_interval: 2
	  max_cached_steps: -1
	  max_continuous_cached_steps: 2
	  Fn_compute_blocks: 1
	  Bn_compute_blocks: 0
	  residual_diff_threshold: 0.12
	  enable_taylorseer: true
	  taylorseer_order: 1
	```

	Then apply the config with:

	```bash
	sglang generate \
	--backend diffusers \
	--model-path Qwen/Qwen-Image \
	--cache-dit-config cache.yaml \
	--prompt "A beautiful sunset over the mountains"
	```

	### Distributed inference

	- 1D Parallelism

	Define a parallelism-only config file, `parallel.yaml`, that contains:

	```yaml
	parallelism_config:
	  ulysses_size: auto
	  parallel_kwargs:
	    attention_backend: native
	    extra_parallel_modules: ["text_encoder", "vae"]
	```

	`ulysses_size: auto` tells cache-dit to use the detected `world_size` as the Ulysses size. To fix it manually, set an explicit integer instead, e.g., 4.

	Apply the distributed config with the command below, adding `--num-gpus N` to specify the number of GPUs for distributed inference:

	```bash
	sglang generate \
	--backend diffusers \
	--num-gpus 4 \
	--model-path Qwen/Qwen-Image \
	--cache-dit-config parallel.yaml \
	--prompt "A futuristic cityscape at sunset"
	```

	- 2D Parallelism

	You can also define a 2D parallelism config file, `parallel_2d.yaml`, that contains:

	```yaml
	parallelism_config:
	  ulysses_size: auto
	  tp_size: 2
	  parallel_kwargs:
	    attention_backend: native
	    extra_parallel_modules: ["text_encoder", "vae"]
	```

	Here `tp_size: 2` enables tensor parallelism with size 2, and `ulysses_size: auto` makes cache-dit use `world_size // tp_size` as the Ulysses size. Apply the config by passing `parallel_2d.yaml` to `--cache-dit-config`, as in the 1D example.
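
The 2D case has no launch command of its own, so here is a hedged sketch that reuses the flags from the 1D example. The shell variables `NUM_GPUS`, `TP_SIZE`, and `ULYSSES_SIZE` are illustrative names, not SGLang settings; the arithmetic only mirrors the `world_size // tp_size` rule stated for `ulysses_size: auto`:

```bash
NUM_GPUS=4   # total GPUs passed to --num-gpus (the world size)
TP_SIZE=2    # must match tp_size in parallel_2d.yaml

# With ulysses_size: auto, the Ulysses size is world_size // tp_size.
ULYSSES_SIZE=$(( NUM_GPUS / TP_SIZE ))
echo "tp_size=${TP_SIZE} ulysses_size=${ULYSSES_SIZE}"

# Guard: only launch when the CLI and the config file are both present.
if command -v sglang >/dev/null 2>&1 && [ -f parallel_2d.yaml ]; then
  sglang generate \
    --backend diffusers \
    --num-gpus "${NUM_GPUS}" \
    --model-path Qwen/Qwen-Image \
    --cache-dit-config parallel_2d.yaml \
    --prompt "A futuristic cityscape at sunset"
fi
```

With 4 GPUs and `tp_size: 2`, the automatically chosen Ulysses size works out to 2.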

	- 3D Parallelism

	You can also define a 3D parallelism config file, `parallel_3d.yaml`, that contains:

	```yaml
	parallelism_config:
	  ulysses_size: 2
	  ring_size: 2
	  tp_size: 2
	  parallel_kwargs:
	    attention_backend: native
	    extra_parallel_modules: ["text_encoder", "vae"]
	```

	Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` combine Ulysses parallelism, ring parallelism, and tensor parallelism, each with size 2. Apply the config by passing `parallel_3d.yaml` to `--cache-dit-config`, as in the 1D example.
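
A sketch of a matching launch command follows. It assumes the three sizes multiply to the world size, which is the usual rule for a device mesh but is not stated in this document, so check it against the Cache-DiT documentation before relying on it. The shell variable names are illustrative only:

```bash
ULYSSES_SIZE=2
RING_SIZE=2
TP_SIZE=2

# Assumption: the GPU count is the product of the three parallel sizes.
NUM_GPUS=$(( ULYSSES_SIZE * RING_SIZE * TP_SIZE ))
echo "num_gpus=${NUM_GPUS}"

# Guard: only launch when the CLI and the config file are both present.
if command -v sglang >/dev/null 2>&1 && [ -f parallel_3d.yaml ]; then
  sglang generate \
    --backend diffusers \
    --num-gpus "${NUM_GPUS}" \
    --model-path Qwen/Qwen-Image \
    --cache-dit-config parallel_3d.yaml \
    --prompt "A futuristic cityscape at sunset"
fi
```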

	### Hybrid Cache and Parallelism

	Define a hybrid config file, `hybrid.yaml`, that combines the cache and parallelism sections:

	```yaml
	cache_config:
	  max_warmup_steps: 8
	  warmup_interval: 2
	  max_cached_steps: -1
	  max_continuous_cached_steps: 2
	  Fn_compute_blocks: 1
	  Bn_compute_blocks: 0
	  residual_diff_threshold: 0.12
	  enable_taylorseer: true
	  taylorseer_order: 1
	parallelism_config:
	  ulysses_size: auto
	  parallel_kwargs:
	    attention_backend: native
	    extra_parallel_modules: ["text_encoder", "vae"]
	```

	Then apply the hybrid config:

	```bash
	sglang generate \
	--backend diffusers \
	--num-gpus 4 \
	--model-path Qwen/Qwen-Image \
	--cache-dit-config hybrid.yaml \
	--prompt "A beautiful sunset over the mountains"
	```
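
Because the hybrid file is just the `cache_config` and `parallelism_config` sections in one document, it can be assembled from the two separate files instead of being maintained by hand. The sketch below works in a scratch directory (`workdir` is an illustrative name) and assumes the two-space nesting under each top-level key:

```bash
workdir=$(mktemp -d)   # scratch directory so existing files are not overwritten

cat > "${workdir}/cache.yaml" <<'EOF'
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
EOF

cat > "${workdir}/parallel.yaml" <<'EOF'
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
EOF

# Concatenate the two top-level sections into one hybrid config.
cat "${workdir}/cache.yaml" "${workdir}/parallel.yaml" > "${workdir}/hybrid.yaml"
echo "wrote ${workdir}/hybrid.yaml"
```

The resulting path can then be passed to `--cache-dit-config` in place of `hybrid.yaml`.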

	## Advanced Configuration

	### DBCache Parameters

	DBCache controls block-level caching behavior:

	| Parameter | Env Variable              | Default | Description                              |
	|-----------|---------------------------|---------|------------------------------------------|
	| Fn        | `SGLANG_CACHE_DIT_FN`     | 1       | Number of first blocks to always compute |
	| Bn        | `SGLANG_CACHE_DIT_BN`     | 0       | Number of last blocks to always compute  |
	| W         | `SGLANG_CACHE_DIT_WARMUP` | 4       | Warmup steps before caching starts       |
	| R         | `SGLANG_CACHE_DIT_RDT`    | 0.24    | Residual difference threshold            |
	| MC        | `SGLANG_CACHE_DIT_MC`     | 3       | Maximum continuous cached steps          |
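
To build intuition for how W, R, and MC interact, the following is a deliberately simplified model of the per-step decision. It is not the Cache-DiT implementation: the residual differences are made-up numbers, and the real engine derives them from the outputs of the first Fn blocks. A step is served from cache only if warmup is over, its difference is below R, and fewer than MC steps were cached in a row:

```bash
# Simplified model of the per-step decision; not the Cache-DiT implementation.
WARMUP=4     # W: steps at the start that are always computed
RDT=0.24     # R: residual difference threshold
MAX_CONT=3   # MC: maximum continuous cached steps

# Hypothetical residual differences, one per denoising step.
diffs=(0.90 0.60 0.40 0.30 0.20 0.10 0.10 0.10 0.30 0.05)

plan=""      # 1 = step computed in full, 0 = step served from cache
streak=0     # number of cached steps in a row so far
for step in "${!diffs[@]}"; do
  # Bash has no floating-point comparison, so awk does it.
  below=$(awk -v d="${diffs[$step]}" -v t="$RDT" 'BEGIN { print ((d < t) ? 1 : 0) }')
  if (( step >= WARMUP && below == 1 && streak < MAX_CONT )); then
    plan+="0"
    streak=$(( streak + 1 ))
  else
    plan+="1"
    streak=0
  fi
done
echo "$plan"
```

With these inputs the first four steps are warmup, the next three are cached, and the MC limit then forces a full compute even though that step's difference is still below the threshold.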

	### TaylorSeer Configuration

	TaylorSeer improves caching accuracy using Taylor expansion:

	| Parameter | Env Variable                  | Default | Description                     |
	|-----------|-------------------------------|---------|---------------------------------|
	| Enable    | `SGLANG_CACHE_DIT_TAYLORSEER` | false   | Enable TaylorSeer calibrator    |
	| Order     | `SGLANG_CACHE_DIT_TS_ORDER`   | 1       | Taylor expansion order (1 or 2) |
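
Before launching a run it can help to print the values the current shell would hand to SGLang. The helper below is a hypothetical convenience function, not part of SGLang; it assumes that an unset variable falls back to the default listed in the tables above:

```bash
# Hypothetical helper: show each setting, using the documented default when unset.
cache_dit_settings() {
  echo "Fn=${SGLANG_CACHE_DIT_FN:-1}"
  echo "Bn=${SGLANG_CACHE_DIT_BN:-0}"
  echo "W=${SGLANG_CACHE_DIT_WARMUP:-4}"
  echo "R=${SGLANG_CACHE_DIT_RDT:-0.24}"
  echo "MC=${SGLANG_CACHE_DIT_MC:-3}"
  echo "TaylorSeer=${SGLANG_CACHE_DIT_TAYLORSEER:-false}"
  echo "TS_order=${SGLANG_CACHE_DIT_TS_ORDER:-1}"
}

cache_dit_settings
```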

	### Combined Configuration Example

	DBCache and TaylorSeer are complementary strategies that work together, so you can configure both sets of parameters
	simultaneously:

	```bash
	SGLANG_CACHE_DIT_ENABLED=true \
	SGLANG_CACHE_DIT_FN=2 \
	SGLANG_CACHE_DIT_BN=1 \
	SGLANG_CACHE_DIT_WARMUP=4 \
	SGLANG_CACHE_DIT_RDT=0.4 \
	SGLANG_CACHE_DIT_MC=4 \
	SGLANG_CACHE_DIT_TAYLORSEER=true \
	SGLANG_CACHE_DIT_TS_ORDER=2 \
	sglang generate --model-path black-forest-labs/FLUX.1-dev \
	--prompt "A curious raccoon in a forest"
	```

	### SCM (Step Computation Masking)

	SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
	which to serve from cached results.

	#### SCM Presets

	SCM is configured with presets:

	| Preset   | Compute Ratio | Speed    | Quality    |
	|----------|---------------|----------|------------|
	| `none`   | 100%          | Baseline | Best       |
	| `slow`   | ~75%          | ~1.3x    | High       |
	| `medium` | ~50%          | ~2x      | Good       |
	| `fast`   | ~35%          | ~3x      | Acceptable |
	| `ultra`  | ~25%          | ~4x      | Lower      |

	#### Usage

	```bash
	SGLANG_CACHE_DIT_ENABLED=true \
	SGLANG_CACHE_DIT_SCM_PRESET=medium \
	sglang generate --model-path Qwen/Qwen-Image \
	--prompt "A futuristic cityscape at sunset"
	```

	#### Custom SCM Bins

	For fine-grained control over which steps are computed and which are cached, set the compute and cache bins directly:

	```bash
	SGLANG_CACHE_DIT_ENABLED=true \
	SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
	SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
	sglang generate --model-path Qwen/Qwen-Image \
	--prompt "A futuristic cityscape at sunset"
	```
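
To see what a pair of bin lists amounts to, the sketch below expands them into a per-step mask. It assumes the bins are interleaved starting with a compute bin (compute 8, cache 1, compute 3, cache 2, and so on) and that their sum equals the number of inference steps; this reading is an assumption and should be checked against the Cache-DiT documentation. The `repeat` helper is defined here for illustration only:

```bash
COMPUTE_BINS="8,3,3,2,2"
CACHE_BINS="1,2,2,2,3"

# Split the comma-separated strings into arrays.
IFS=',' read -r -a compute <<< "$COMPUTE_BINS"
IFS=',' read -r -a cache <<< "$CACHE_BINS"

# repeat CHAR N: print CHAR N times.
repeat() {
  local out="" k
  for (( k = 0; k < $2; k++ )); do out+="$1"; done
  printf '%s' "$out"
}

mask=""          # 1 = step computed in full, 0 = step served from cache
compute_steps=0
for i in "${!compute[@]}"; do
  mask+=$(repeat 1 "${compute[$i]}")
  mask+=$(repeat 0 "${cache[$i]}")
  compute_steps=$(( compute_steps + ${compute[$i]} ))
done

total=${#mask}
echo "mask=${mask}"
echo "steps=${total} computed=${compute_steps} ratio=$(( 100 * compute_steps / total ))%"
```

Under that assumption the bins above describe a 28-step schedule in which 18 steps are computed, so roughly 64% of the steps run in full.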

	#### SCM Policy

	| Policy    | Env Variable                          | Description                                 |
	|-----------|---------------------------------------|---------------------------------------------|
	| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) |
	| `static`  | `SGLANG_CACHE_DIT_SCM_POLICY=static`  | Fixed caching pattern                       |

	## Environment Variables

	All Cache-DiT parameters can be configured via environment variables.
	See [Environment Variables](../../environment_variables.md) for the complete list.

	## Supported Models

	The Cache-DiT integration supports almost all models that SGLang Diffusion itself supports:

	| Model Family | Example Models              |
	|--------------|-----------------------------|
	| Wan          | Wan2.1, Wan2.2              |
	| Flux         | FLUX.1-dev, FLUX.2-dev      |
	| Z-Image      | Z-Image-Turbo               |
	| Qwen         | Qwen-Image, Qwen-Image-Edit |
	| Hunyuan      | HunyuanVideo                |

	## Performance Tips

	1. Start with defaults: The default parameters work well for most models
	2. Use TaylorSeer: It typically improves both speed and quality
	3. Tune R threshold: Lower values = better quality, higher values = faster
	4. SCM for extra speed: Use `medium` preset for good speed/quality balance
	5. Warmup matters: Higher warmup = more stable caching decisions
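
Tip 3 can be turned into a small sweep that runs the same prompt at several thresholds so speed and output quality can be compared side by side. This is a sketch: the three threshold values are only examples, and the `command -v` guard makes the loop a dry run on machines without `sglang`:

```bash
export SGLANG_CACHE_DIT_ENABLED=true

# Candidate thresholds: lower favors quality, higher favors speed.
thresholds=(0.12 0.24 0.40)

for rdt in "${thresholds[@]}"; do
  echo "run with SGLANG_CACHE_DIT_RDT=${rdt}"
  if command -v sglang >/dev/null 2>&1; then
    # `time` reports the wall-clock duration of each run for comparison.
    time SGLANG_CACHE_DIT_RDT="${rdt}" \
      sglang generate --model-path Qwen/Qwen-Image \
      --prompt "A beautiful sunset over the mountains"
  fi
done
```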

	## Limitations

	- SGLang-native pipelines: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically
	disabled when `world_size > 1`.
	- SCM minimum steps: SCM requires >= 8 inference steps to be effective
	- Model support: Only models registered in Cache-DiT's BlockAdapterRegister are supported

	## Troubleshooting

	### SCM disabled for low step count

	For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache
	acceleration still works.
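
A launch script can make this behavior explicit instead of relying on the automatic fallback. In the sketch below, `NUM_STEPS` is a hypothetical shell variable holding the step count you intend to run with; it is not read by SGLang:

```bash
NUM_STEPS=4   # hypothetical: the number of inference steps planned for this run

export SGLANG_CACHE_DIT_ENABLED=true
if (( NUM_STEPS >= 8 )); then
  # Enough steps for SCM to be effective.
  export SGLANG_CACHE_DIT_SCM_PRESET=medium
else
  # Fewer than 8 steps: SCM would be disabled anyway, so request no masking.
  export SGLANG_CACHE_DIT_SCM_PRESET=none
fi
echo "SCM preset: ${SGLANG_CACHE_DIT_SCM_PRESET}"
```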

	## References

	- [Cache-DiT](https://github.com/vipshop/cache-dit)
	- [SGLang Diffusion](../index.md)