Caching Acceleration for Diffusion Models

SGLang provides multiple caching acceleration strategies for Diffusion Transformer (DiT) models. These strategies can significantly reduce inference time by skipping redundant computation.

Overview

SGLang supports two complementary caching approaches:

| Strategy  | Scope          | Mechanism                                           | Best For                 |
|-----------|----------------|-----------------------------------------------------|--------------------------|
| Cache-DiT | Block-level    | Skips individual transformer blocks dynamically     | Advanced, higher speedup |
| TeaCache  | Timestep-level | Skips entire denoising steps based on L1 similarity | Simple, built-in         |

Cache-DiT

Cache-DiT provides block-level caching with advanced strategies like DBCache and TaylorSeer. It can achieve up to 1.69x speedup.

See cache_dit.md for detailed configuration.

Quick Start

SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Key Features

  • DBCache: Dynamic block-level caching based on residual differences
  • TaylorSeer: Taylor expansion-based calibration for optimized caching
  • SCM: Step-level computation masking for additional speedup
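The core DBCache idea, skipping a transformer block when its input has barely changed since the last fully computed call, can be sketched as below. The class name, threshold value, and method signatures are illustrative only and do not reflect the actual cache-dit API:

```python
import torch


class ResidualBlockCache:
    """Illustrative sketch of dynamic block-level caching (DBCache-style).

    If the block input's relative change since the last full computation
    is below `threshold`, reuse the cached residual instead of running
    the block. All names here are hypothetical, not cache-dit's API.
    """

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.prev_input = None
        self.cached_residual = None

    def forward(self, block, x: torch.Tensor) -> torch.Tensor:
        if self.prev_input is not None and self.cached_residual is not None:
            # Relative mean-absolute difference against the last cached input.
            rel_diff = (x - self.prev_input).abs().mean() / (
                self.prev_input.abs().mean() + 1e-8
            )
            if rel_diff.item() < self.threshold:
                # Input barely changed: skip the block, reuse the residual.
                return x + self.cached_residual
        # Full computation; cache the residual (output minus input) for reuse.
        out = block(x)
        self.cached_residual = out - x
        self.prev_input = x.detach().clone()
        return out
```

This is only a sketch of the skip criterion; the real implementation additionally handles per-block state, warmup steps, and calibration (e.g. TaylorSeer).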

TeaCache

TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough that their computation can be skipped entirely and a cached result reused.

See teacache.md for detailed documentation.

Quick Overview

  • Tracks the relative L1 distance between modulated inputs across timesteps
  • When the accumulated distance stays below a threshold, reuses the cached residual instead of recomputing the step
  • Supports classifier-free guidance (CFG) with separate positive/negative caches
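The skip decision described above can be sketched as follows. The class, attribute, and threshold names are illustrative, not SGLang's internal API:

```python
import torch


class TeaCacheSketch:
    """Illustrative TeaCache-style skip logic.

    Accumulates the relative L1 distance between the modulated inputs of
    consecutive denoising steps; while the accumulated distance stays below
    `rel_l1_thresh`, the cached residual is reused and the step's transformer
    computation is skipped. All names here are hypothetical.
    """

    def __init__(self, rel_l1_thresh: float = 0.1):
        self.rel_l1_thresh = rel_l1_thresh
        self.prev_modulated = None
        self.accumulated = 0.0

    def should_skip(self, modulated: torch.Tensor) -> bool:
        if self.prev_modulated is None:
            # First step has nothing to compare against: always compute.
            self.prev_modulated = modulated
            return False
        rel_l1 = (
            (modulated - self.prev_modulated).abs().mean()
            / (self.prev_modulated.abs().mean() + 1e-8)
        ).item()
        self.accumulated += rel_l1
        self.prev_modulated = modulated
        if self.accumulated < self.rel_l1_thresh:
            return True   # still similar: reuse the cached residual
        self.accumulated = 0.0  # drifted too far: recompute and reset
        return False
```

With CFG enabled, the positive and negative branches would each hold their own instance of this state, matching the separate caches noted above.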

Supported Models

  • Wan (wan2.1, wan2.2)
  • Hunyuan (HunyuanVideo)
  • Z-Image

For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled.

References