FasterDFlash
/

Hanrui

Model card Files Files and versions

Hanrui / sglang /docs /diffusion /performance /index.md

Lekr0's picture

Add files using upload-large-folder tool

6268841 verified 28 days ago

|

history blame contribute delete

2.72 kB

	# Performance Optimization

	SGLang-Diffusion provides multiple performance optimization strategies to accelerate inference. This section covers all available performance tuning options.

	## Overview

	\| Optimization \| Type \| Description \|
	\|--------------\|------\|-------------\|
	\| Cache-DiT \| Caching \| Block-level caching with DBCache, TaylorSeer, and SCM \|
	\| TeaCache \| Caching \| Timestep-level caching using L1 similarity \|
	\| Attention Backends \| Kernel \| Optimized attention implementations (FlashAttention, SageAttention, etc.) \|
	\| Profiling \| Diagnostics \| PyTorch Profiler and Nsight Systems guidance \|

	## Caching Strategies

	SGLang supports two complementary caching approaches:

	### Cache-DiT

	[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with advanced strategies. It can achieve up to 1.69x speedup.

	Quick Start:
	```bash
	SGLANG_CACHE_DIT_ENABLED=true \
	sglang generate --model-path Qwen/Qwen-Image \
	--prompt "A beautiful sunset over the mountains"
	```

	Key Features:
	- DBCache: Dynamic block-level caching based on residual differences
	- TaylorSeer: Taylor expansion-based calibration for optimized caching
	- SCM: Step-level computation masking for additional speedup

	See [Cache-DiT Documentation](cache/cache_dit.md) for detailed configuration.

	### TeaCache

	TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.

	Quick Overview:
	- Tracks L1 distance between modulated inputs across timesteps
	- When accumulated distance is below threshold, reuses cached residual
	- Supports CFG with separate positive/negative caches

	Supported Models: Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image

	See [TeaCache Documentation](cache/teacache.md) for detailed configuration.

	## Attention Backends

	Different attention backends offer varying performance characteristics depending on your hardware and model:

	- FlashAttention: Fastest on NVIDIA GPUs with fp16/bf16
	- SageAttention: Alternative optimized implementation
	- xformers: Memory-efficient attention
	- SDPA: PyTorch native scaled dot-product attention

	See [Attention Backends](attention_backends.md) for platform support and configuration options.

	## Profiling

	To diagnose performance bottlenecks, SGLang-Diffusion supports profiling tools:

	- PyTorch Profiler: Built-in Python profiling
	- Nsight Systems: GPU kernel-level analysis

	See [Profiling Guide](profiling.md) for detailed instructions.

	## References

	- [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
	- [TeaCache Paper](https://arxiv.org/abs/2411.14324)