# Implemented Optimizations Report
This document presents a source-based engineering report on the optimization stack used across generation, model loading, and serving in LightDiffusion-Next.
Unlike the overview pages:
- The source tree is treated as the primary reference point.
- Each optimization is described in terms of purpose, implementation, integration, and trade-offs.
- Supporting infrastructure and codebase groundwork are included when they materially contribute to the performance profile of the project.
## Report Scope
### Usage Profile Definitions
- `default`: selected in the standard execution path
- `integrated`: part of the current generation or serving flow
- `optional`: integrated, but enabled through request settings, configuration, or model capabilities
- `conditional`: available when hardware, dependencies, or runtime capabilities allow it
- `implementation-specific`: implemented and used, but its effective behavior is shaped by a narrower internal path than the request surface alone suggests
- `infrastructure-level`: supports the fast path indirectly through loading, transfer, caching, or serving behavior
- `codebase groundwork`: implemented in the codebase as part of the optimization stack, but not yet surfaced as a broad standard pipeline option
### What This Report Covers
This report covers both model-level and system-level optimizations:
- inference and sampling speedups
- precision and memory reductions
- request batching and pipeline throughput improvements
- preview and output-path latency reductions
It does not catalog ordinary features unless they clearly reduce compute, memory, or end-to-end latency.
## Quick Inventory
| Optimization | Usage Profile | Main Goal | Primary Evidence |
|---|---|---|---|
| CUDA runtime tuning (TF32, cuDNN benchmark, SDPA enablement) | integrated, conditional | faster kernels and better backend selection | `src/Device/Device.py` |
| Attention backend cascade (SpargeAttn/SageAttention/xformers/SDPA) | integrated, conditional | faster attention kernels with fallback | `src/Attention/Attention.py`, `src/Attention/AttentionMethods.py` |
| Flux2 SDPA backend priority | integrated, conditional | prefer cuDNN/Flash SDPA for Flux2 attention | `src/NeuralNetwork/flux2/layers.py`, `src/Device/Device.py` |
| Cross-attention K/V projection cache | integrated | skip repeated key/value projection work for static context | `src/Attention/Attention.py` |
| Prompt embedding cache | integrated | avoid re-encoding repeated prompts | `src/Utilities/prompt_cache.py`, `src/clip/Clip.py` |
| Conditioning batch packing and memory-aware concatenation | integrated | reduce forward passes and pack compatible condition chunks | `src/cond/cond.py` |
| CFG=1 unconditional-skip fast path | integrated | skip unnecessary unconditional branch at CFG 1.0 | `src/sample/CFG.py`, `src/sample/BaseSampler.py` |
| AYS scheduler | default | reach similar quality in fewer steps | `src/sample/ays_scheduler.py`, `src/sample/ksampler_util.py` |
| CFG++ samplers | integrated | improve denoising behavior with momentum-style correction | `src/sample/BaseSampler.py` |
| CFG-Free sampling | integrated, optional | taper CFG late in sampling for better detail/naturalness | `src/sample/CFG.py` |
| Dynamic CFG rescaling | integrated, optional | reduce overshoot and saturation from strong CFG | `src/sample/CFG.py` |
| Adaptive noise scheduling | integrated, optional | adjust schedule based on observed complexity | `src/sample/CFG.py` |
| `batched_cfg` request surface | implementation-specific | request-facing control around the deeper conditioning batching path | `src/sample/sampling.py`, `src/cond/cond.py` |
| Multi-scale latent switching | integrated, optional | do some denoising at reduced spatial resolution | `src/sample/BaseSampler.py` |
| HiDiffusion MSW-MSA patching | integrated, optional | patch UNet attention for high-resolution multiscale workflows | `src/Core/Pipeline.py`, `src/hidiffusion/msw_msa_attention.py` |
| Stable-Fast | integrated, conditional | trace/compile UNet forward path | `src/StableFast/StableFast.py`, `src/Core/Pipeline.py` |
| `torch.compile` | integrated, optional | compiler-based model speedup without Stable-Fast | `src/Device/Device.py`, `src/Core/AbstractModel.py` |
| VAE compile, tiled path, and transfer tuning | integrated | speed up decode/encode and avoid OOM | `src/AutoEncoders/VariationalAE.py` |
| BF16/FP16 automatic dtype selection | integrated, conditional | reduce memory and improve throughput on supported hardware | `src/Device/Device.py` |
| FP8 weight quantization | integrated, conditional | reduce weight memory and enable Flux2-friendly inference paths | `src/Core/AbstractModel.py`, `src/Model/ModelPatcher.py` |
| NVFP4 weight quantization | integrated, optional | stronger memory reduction than FP8 | `src/Core/AbstractModel.py`, `src/Model/ModelPatcher.py`, `src/Utilities/Quantization.py` |
| Flux2 load-time weight-only quantization | integrated, conditional | keep large Flux2/Klein components workable on smaller VRAM budgets | `src/Core/Models/Flux2KleinModel.py` |
| ToMe | integrated, optional | reduce attention cost by token merging on UNet models | `src/Model/ModelPatcher.py`, `src/Core/Pipeline.py` |
| DeepCache | integrated, optional, implementation-specific | reuse prior denoiser output between update steps | `src/WaveSpeed/deepcache_nodes.py`, `src/Core/Pipeline.py` |
| First Block Cache for Flux | codebase groundwork | cache transformer work for Flux-like models | `src/WaveSpeed/first_block_cache.py` |
| Low-VRAM partial loading and offload policy | integrated | load only what fits and offload the rest | `src/cond/cond_util.py`, `src/Device/Device.py`, `src/Model/ModelPatcher.py` |
| Async transfer helpers and pinned checkpoint tensors | integrated, infrastructure-level | reduce host/device transfer overhead | `src/Device/Device.py`, `src/Utilities/util.py` |
| Request coalescing and queue batching | integrated | increase throughput across compatible API requests | `server.py` |
| Large-group chunking and image-save guardrails | integrated | keep large coalesced runs from overwhelming save/decode paths | `server.py`, `src/FileManaging/ImageSaver.py` |
| Next-model prefetch | integrated | hide future checkpoint load latency | `server.py`, `src/Device/ModelCache.py`, `src/Utilities/util.py` |
| Keep-models-loaded cache | integrated | reuse loaded checkpoints and reduce warm starts | `src/Device/ModelCache.py`, `server.py` |
| In-memory PNG byte buffer | integrated | avoid disk round-trip for API responses | `src/FileManaging/ImageSaver.py`, `server.py` |
| TAESD preview pacing and preview fidelity control | integrated, conditional | reduce preview overhead while keeping live feedback usable | `src/sample/BaseSampler.py`, `src/AutoEncoders/taesd.py`, `server.py` |
## Executive Summary
The optimization strategy in LightDiffusion-Next is layered and cumulative rather than dependent on a single acceleration mechanism.
1. The core generation path combines runtime kernel selection, conditioning batching, lower-precision execution, and schedule optimization.
2. Several optimizations are part of the standard execution path, most notably AYS scheduling, prompt caching, attention backend selection, low-VRAM loading policy, and server-side request grouping.
3. A second layer of optional mechanisms provides workload-specific extensions, including Stable-Fast, `torch.compile`, ToMe, multiscale sampling, quantization, and guidance refinements such as CFG-Free and dynamic rescaling.
4. The serving layer contributes materially to end-to-end throughput and latency through request coalescing, chunking, model prefetching, keep-loaded caching, and in-memory response handling.
5. The codebase also contains foundational work for additional caching paths, particularly around Flux-oriented first-block caching, alongside the currently integrated DeepCache path.
## Runtime And Attention Optimizations
### CUDA runtime tuning
- Status: `integrated, conditional`
- Purpose: use faster math modes and let the backend choose more aggressive convolution and attention kernels.
- Implementation in LightDiffusion-Next: `src/Device/Device.py` enables TF32 (`torch.backends.cuda.matmul.allow_tf32`, `torch.backends.cudnn.allow_tf32`), enables cuDNN benchmarking, and turns on PyTorch math/flash/memory-efficient SDPA when available.
- Project integration: these are process-wide defaults. They do not require per-request toggles, so supported CUDA deployments get them automatically.
- Effect: reduces matmul/convolution cost and opens better SDPA backends with no extra application-layer work.
- Benefits: automatic, broad coverage, low complexity.
- Trade-offs: hardware-conditional; benefits depend on GPU generation and PyTorch build.
- Evidence: `src/Device/Device.py`.
### Attention backend cascade: SpargeAttn, SageAttention, xformers, PyTorch SDPA
- Status: `integrated, conditional`
- Purpose: use the fastest available attention kernel and fall back safely when unsupported.
- Implementation in LightDiffusion-Next: UNet/VAE attention chooses `SpargeAttn > SageAttention > xformers > PyTorch` in `src/Attention/Attention.py`; the concrete kernels and fallback behavior live in `src/Attention/AttentionMethods.py`.
- Project integration: the selection happens once when the attention module is imported/constructed. Sage/Sparge paths reshape inputs to HND layouts and pad unsupported head sizes to supported dimensions where possible; larger unsupported head sizes fall back.
- Effect: faster attention on supported CUDA systems without changing calling code.
- Benefits: automatic fallback chain, works across UNet cross-attention and VAE attention blocks, handles padding for awkward head sizes.
- Trade-offs: dependency- and GPU-dependent; not all head sizes stay on the fast path; behavior differs between generic UNet/VAE attention and Flux2 attention.
- Evidence: `src/Attention/Attention.py`, `src/Attention/AttentionMethods.py`.
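The priority-ordered fallback can be sketched as a simple selection function. This is an illustrative stand-in, not the project's code: the real availability probes in `src/Attention/AttentionMethods.py` involve imports, GPU checks, and head-size constraints, which are abstracted here into an `available` set.

```python
# Priority-ordered attention backend selection (illustrative sketch).
# The names mirror the cascade described above; availability checks are
# stand-ins for the real import/capability probes.
PRIORITY = ["spargeattn", "sageattention", "xformers", "sdpa"]

def select_attention_backend(available: set[str]) -> str:
    """Return the fastest backend present, falling back to PyTorch SDPA."""
    for name in PRIORITY:
        if name in available:
            return name
    return "sdpa"  # PyTorch's scaled_dot_product_attention always exists
```

Because the decision is made once at module construction, calling code never needs to know which backend won.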
### Flux2 SDPA backend priority
- Status: `integrated, conditional`
- Purpose: prefer the best PyTorch SDPA backend for Flux2 transformer attention.
- Implementation in LightDiffusion-Next: `src/Device/Device.py` builds an SDPA priority context preferring cuDNN attention, then Flash, then efficient, then math; `src/NeuralNetwork/flux2/layers.py` uses `Device.get_sdpa_context()` around `scaled_dot_product_attention`.
- Project integration: Flux2 uses a separate attention implementation from the generic UNet attention path. It first tries prioritized SDPA, then xformers, then plain SDPA.
- Effect: prioritized fast attention for Flux2 with robust fallback behavior.
- Benefits: keeps Flux2 on the most optimized native backend available; does not require custom kernels.
- Trade-offs: benefits depend heavily on PyTorch version, backend support, and GPU runtime.
- Evidence: `src/Device/Device.py`, `src/NeuralNetwork/flux2/layers.py`.
### Cross-attention static K/V projection cache
- Status: `integrated`
- Purpose: when the context tensor is unchanged across denoising steps, avoid recomputing K/V projections every step.
- Implementation in LightDiffusion-Next: `CrossAttention` in `src/Attention/Attention.py` keeps a small `_context_cache` keyed by `id(context)` and caches projected `k` and `v`.
- Project integration: this primarily targets prompt-conditioning cases where context is static while the latent evolves. The cache is tiny and self-pruning.
- Effect: shaves repeated linear-projection work from cross-attention-heavy denoising loops.
- Benefits: simple, training-free, no user configuration.
- Trade-offs: keyed by object identity, so it only helps when the exact context object is reused; small cache size limits reuse breadth.
- Evidence: `src/Attention/Attention.py`.
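A minimal sketch of the identity-keyed cache, assuming a projection step that is expensive relative to a dict lookup. The class and its projection stand-in are hypothetical; the real cache lives on `CrossAttention` in `src/Attention/Attention.py`.

```python
# Identity-keyed K/V projection cache (illustrative sketch).
class KVCache:
    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        self._cache = {}           # id(context) -> (k, v)
        self.projection_calls = 0  # instrumentation for this sketch only

    def _project(self, context):
        self.projection_calls += 1
        # Stand-in for the expensive k/v linear projections.
        return ([x * 2 for x in context], [x * 3 for x in context])

    def get_kv(self, context):
        key = id(context)
        if key not in self._cache:
            if len(self._cache) >= self.max_entries:
                self._cache.pop(next(iter(self._cache)))  # drop oldest entry
            self._cache[key] = self._project(context)
        return self._cache[key]
```

Keying by `id()` makes the cache hit only when the exact same context object is passed again, which is precisely the static-prompt denoising case the report describes; an equal-but-distinct tensor misses.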
### Prompt embedding cache
- Status: `integrated`
- Purpose: cache text encoder outputs for repeated prompts instead of re-encoding them each time.
- Implementation in LightDiffusion-Next: `src/Utilities/prompt_cache.py` stores `(cond, pooled)` entries keyed by prompt hash and CLIP identity; `src/clip/Clip.py` checks the cache before tokenization/encoding and writes back after encode.
- Project integration: prompt caching is globally enabled by default, applies to single prompts and prompt lists, and prunes old entries once the cache exceeds its configured maximum.
- Effect: reduces prompt-side overhead in repeated-prompt workflows, especially seed sweeps and incremental prompt refinement.
- Benefits: low complexity, wired into the actual CLIP encode path, no quality trade-off.
- Trade-offs: cache size is estimate-based and global, not per-model-session aware.
- Evidence: `src/Utilities/prompt_cache.py`, `src/clip/Clip.py`, cache clear hook in `src/Core/Pipeline.py`.
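The hash-plus-identity keying can be sketched as follows. Names and the eviction policy are assumptions for illustration; the real module is `src/Utilities/prompt_cache.py`.

```python
import hashlib

# Prompt-embedding cache keyed by prompt hash and encoder identity,
# with size-based pruning (illustrative sketch).
class PromptCache:
    def __init__(self, max_entries: int = 32):
        self.max_entries = max_entries
        self._entries = {}

    @staticmethod
    def _key(prompt: str, clip_id: int):
        return (hashlib.sha256(prompt.encode()).hexdigest(), clip_id)

    def get(self, prompt, clip_id):
        return self._entries.get(self._key(prompt, clip_id))

    def put(self, prompt, clip_id, cond, pooled):
        if len(self._entries) >= self.max_entries:
            self._entries.pop(next(iter(self._entries)))  # evict oldest entry
        self._entries[self._key(prompt, clip_id)] = (cond, pooled)
```

Including the encoder identity in the key prevents a cached SD1.5 embedding from being served to an SDXL encoder for the same prompt text.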
### Conditioning batch packing and CFG=1 fast path
- Status: `integrated`
- Purpose: concatenate compatible conditioning work into fewer forward calls, and skip unconditional work entirely when CFG is effectively disabled.
- Implementation in LightDiffusion-Next: `src/cond/cond.py::calc_cond_batch()` groups compatible condition chunks by shape and memory budget, concatenates them, and falls back per chunk when transformer options mismatch. `src/sample/CFG.py` sets `uncond_ = None` when `cond_scale == 1.0` and the optimization is not disabled.
- Project integration: this path is central to the standard sampling flow. The batching logic also validates Flux-style transformer image sizes and falls back when they do not match token grids.
- Effect: fewer model invocations, better GPU utilization, and a lower-cost path for CFG=1 workloads.
- Benefits: real throughput win, memory-aware, includes safety fallback for positional/shape mismatches.
- Trade-offs: batching heuristics are shape- and memory-sensitive; fallback behavior can reduce speed when conditions diverge.
- Evidence: `src/cond/cond.py`, `src/sample/CFG.py`, `src/sample/BaseSampler.py`, `tests/unit/test_calc_cond_batch_fallback.py`.
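The shape- and budget-aware grouping idea can be shown in miniature. This sketch deliberately omits the transformer-option and Flux image-size checks that `calc_cond_batch()` performs; chunks are reduced to a `(shape, cost)` pair.

```python
# Shape- and memory-budget-aware condition packing (simplified sketch).
def pack_chunks(chunks, budget):
    """chunks: list of (shape, cost); returns list of batches (lists of chunks)."""
    batches = []
    for chunk in chunks:
        shape, cost = chunk
        placed = False
        for batch in batches:
            same_shape = batch[0][0] == shape
            batch_cost = sum(c for _, c in batch)
            if same_shape and batch_cost + cost <= budget:
                batch.append(chunk)  # concatenate into one forward call
                placed = True
                break
        if not placed:
            batches.append([chunk])  # fallback: new (possibly singleton) batch
    return batches
```

Each resulting batch corresponds to one model invocation, which is where the forward-pass savings come from; incompatible or over-budget chunks simply start a new batch rather than failing.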
## Sampling And Guidance Optimizations
### AYS scheduler
- Status: `default`
- Purpose: use precomputed sigma schedules that spend steps where they matter most, so fewer steps can reach comparable quality.
- Implementation in LightDiffusion-Next: schedules are encoded in `src/sample/ays_scheduler.py`; `src/sample/ksampler_util.py` routes `ays`, `ays_sd15`, and `ays_sdxl` to the scheduler and auto-detects model type when possible.
- Project integration: both `server.py` and `src/user/pipeline.py` default the scheduler to `ays`. Exact schedules are used when present; otherwise the code resamples or interpolates schedules.
- Effect: fewer denoising steps for similar output quality, especially on SD1.5 and SDXL.
- Benefits: training-free, defaulted into the request path, compatible with the sampler stack.
- Trade-offs: produces different trajectories than classic schedulers; unsupported step counts use interpolation rather than paper-derived schedules.
- Evidence: `src/sample/ays_scheduler.py`, `src/sample/ksampler_util.py`, defaults in `server.py` and `src/user/pipeline.py`, benchmark usage in `tests/benchmark_performance.py`.
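The interpolation fallback for unsupported step counts can be sketched with log-linear resampling, a common approach for sigma schedules. The base values below are illustrative placeholders, not the actual AYS tables in `src/sample/ays_scheduler.py`, and the function name is hypothetical.

```python
import math

# Resample a precomputed (descending) sigma schedule to an arbitrary
# step count via log-linear interpolation (illustrative sketch).
def loglinear_interp(sigmas, n_steps):
    logs = [math.log(s) for s in sigmas]
    out = []
    for i in range(n_steps):
        t = i * (len(sigmas) - 1) / (n_steps - 1)  # fractional source index
        lo, frac = int(t), t - int(t)
        hi = min(lo + 1, len(sigmas) - 1)
        out.append(math.exp(logs[lo] * (1 - frac) + logs[hi] * frac))
    return out
```

Interpolating in log space keeps the resampled schedule smooth across the orders of magnitude a sigma table spans, while exact table entries are preserved at integer source indices.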
### CFG++ samplers
- Status: `integrated`
- Purpose: apply CFG++-style momentum behavior in sampler variants to improve denoising stability and quality.
- Implementation in LightDiffusion-Next: sampler registry maps `_cfgpp` sampler names to the same sampler classes, and `get_sampler()` enables `use_momentum` whenever the sampler name contains `_cfgpp`.
- Project integration: the sampler loop stores prior denoised state and applies momentum-style correction through `BaseSampler.apply_cfg()`. The server default sampler is `dpmpp_sde_cfgpp`.
- Effect: better denoising behavior than plain sampler variants without a separate post-process stage.
- Benefits: integrated directly into the sampler registry; default sampler already uses it.
- Trade-offs: only applies on `_cfgpp` variants; behavior is coupled to sampler implementation details rather than being a universal guidance layer.
- Evidence: `src/sample/BaseSampler.py`, default sampler in `server.py`.
### CFG-Free sampling
- Status: `integrated, optional`
- Purpose: reduce CFG late in the denoising process so the model can finish with less over-guidance.
- Implementation in LightDiffusion-Next: `CFGGuider` stores `cfg_free_enabled` and `cfg_free_start_percent`, tracks current sigma position, and progressively reduces `self.cfg` once the configured progress threshold is crossed.
- Project integration: the flag is part of the request/context surface and is forwarded by SD1.5, SDXL, Flux2, HiResFix, and Img2Img code paths.
- Effect: potentially better detail recovery and more natural late-stage refinement.
- Benefits: integrated and actually wired through multiple pipelines; easy to combine with the rest of the sampler stack.
- Trade-offs: quality optimization rather than pure speedup; exact effect is prompt- and sampler-dependent.
- Evidence: `src/sample/CFG.py`, `src/Core/Models/SD15Model.py`, `src/Core/Models/SDXLModel.py`, `src/Core/Models/Flux2KleinModel.py`, `src/Processors/HiresFix.py`, `src/Processors/Img2Img.py`.
### Dynamic CFG rescaling
- Status: `integrated, optional`
- Purpose: reduce effective CFG when the guidance delta becomes too strong.
- Implementation in LightDiffusion-Next: `CFGGuider._apply_dynamic_cfg_rescaling()` computes either a variance-based or range-based adjustment and clamps the result.
- Project integration: it runs inside `cfg_function()` before CFG mixing is finalized, so it affects the real denoising path rather than acting as a post-hoc metric.
- Effect: reduces oversaturation and over-guided outputs for high-CFG workloads.
- Benefits: low incremental overhead and direct integration into CFG computation.
- Trade-offs: not a pure speed optimization; the chosen formulas are heuristic and can flatten outputs if pushed too hard.
- Evidence: `src/sample/CFG.py`.
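One possible range-based adjustment, shown only as a stand-in: the actual variance- and range-based formulas in `_apply_dynamic_cfg_rescaling()` are not reproduced here, and the threshold and clamp values below are invented for illustration.

```python
# Range-based CFG rescaling with a clamp (hypothetical stand-in formula).
def rescale_cfg(cfg, delta, threshold=2.0, min_cfg=1.0):
    """delta: list of per-element guidance differences (cond - uncond)."""
    spread = max(delta) - min(delta)
    if spread <= threshold:
        return cfg
    scaled = cfg * threshold / spread  # shrink proportionally to overshoot
    return max(scaled, min_cfg)        # clamp so guidance never inverts
```

Running this inside the CFG mix, rather than after it, is what makes the rescaling part of the actual denoising trajectory rather than a cosmetic correction.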
### Adaptive noise scheduling
- Status: `integrated, optional`
- Purpose: use observed prediction complexity to perturb the sigma schedule during sampling.
- Implementation in LightDiffusion-Next: `CFGGuider` records complexity history during prediction and scales `sigmas` inside `inner_sample()` if adaptive mode is enabled.
- Project integration: complexity can be estimated with a spatial-difference metric or a variance-based measure, depending on the selected method.
- Effect: attempts to spend effort where the current prediction appears more complex.
- Benefits: implemented end-to-end in the guider.
- Trade-offs: heuristic, can alter reproducibility, and its benefit is much less established in this repo than AYS or request coalescing.
- Evidence: `src/sample/CFG.py`.
### `batched_cfg` request surface
- Status: `implementation-specific`
- Purpose: expose control over conditional/unconditional batching.
- Implementation in LightDiffusion-Next: the field exists in the request and context models and is passed into sampling, where it is stored in `model_options["batched_cfg"]`.
- Project integration: the main batching behavior is centered in `calc_cond_batch()`, while `batched_cfg` is carried through `model_options` as part of the request-side control surface around that path.
- Effect: provides a request-facing handle for a batching path whose heavy lifting is performed centrally in conditioning packing.
- Benefits: fits cleanly into the existing request and sampling pipeline.
- Trade-offs: its effect is indirect because the main concatenation behavior is implemented deeper in the conditioning layer.
- Evidence: `src/sample/sampling.py`, `src/Core/Context.py`, `src/cond/cond.py`.
## Multiscale And Architecture-Specific Optimizations
### Multi-scale latent switching
- Status: `integrated, optional`
- Purpose: run some denoising steps at a downscaled latent resolution and return to full resolution for selected steps.
- Implementation in LightDiffusion-Next: `MultiscaleManager` in `src/sample/BaseSampler.py` computes a per-step full-resolution schedule and uses bilinear downscale/upscale around sampler model calls.
- Project integration: the samplers consult `ms.use_fullres(i)` each step. Flux and Flux2 are explicitly excluded because the code treats multiscale as incompatible with DiT-style architectures.
- Effect: lower compute on some denoising steps for compatible samplers and architectures.
- Benefits: actually participates in the sampler loop; configurable by factor and schedule.
- Trade-offs: it necessarily changes the denoising path and can trade detail for speed; not available for Flux/Flux2.
- Evidence: `src/sample/BaseSampler.py`, `src/sample/sampling.py`, `src/Core/Models/Flux2KleinModel.py`.
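The per-step full-resolution schedule can be illustrated with a simple head/tail policy: run full resolution early (where global structure forms) and late (where final detail lands), downscale the middle band. The boundaries below are assumptions for illustration; the real `MultiscaleManager` derives its schedule from configurable parameters.

```python
# Per-step full-resolution schedule (illustrative head/tail policy).
class MultiscaleSchedule:
    def __init__(self, total_steps, fullres_start=0.2, fullres_end=0.2):
        self.total = total_steps
        self.head = int(total_steps * fullres_start)  # full-res warmup steps
        self.tail = int(total_steps * fullres_end)    # full-res finishing steps

    def use_fullres(self, i):
        return i < self.head or i >= self.total - self.tail
```

The sampler would consult `use_fullres(i)` each step, mirroring the `ms.use_fullres(i)` call described above, and downscale/upscale the latent around the model call when it returns `False`.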
### HiDiffusion MSW-MSA patching
- Status: `integrated, optional`
- Purpose: patch UNet attention for high-resolution workflows using HiDiffusion-style MSW-MSA attention changes.
- Implementation in LightDiffusion-Next: the pipeline clones the inner model and applies `ApplyMSWMSAAttentionSimple` when multiscale is enabled on UNet architectures.
- Project integration: the patch is explicitly blocked for Flux/Flux2 and disabled in some sub-pipelines like refiner or certain detail passes where the project wants to avoid artifact risk.
- Effect: makes the multiscale/high-resolution path more efficient or more stable on SD1.5/SDXL-style UNets.
- Benefits: architecture-aware and guarded against obvious misuse.
- Trade-offs: not universal; adds another patching layer and can be brittle if architecture assumptions drift.
- Evidence: `src/Core/Pipeline.py`, `src/hidiffusion/msw_msa_attention.py`, `src/Core/AbstractModel.py`, `src/Core/Models/SD15Model.py`, `src/Core/Models/SDXLModel.py`.
## Model Compilation, Precision, And Memory Optimizations
### Stable-Fast
- Status: `integrated, conditional`
- Purpose: trace and wrap UNet execution to reduce Python overhead and optionally use CUDA graph behavior.
- Implementation in LightDiffusion-Next: `src/StableFast/StableFast.py` builds a lazy trace module around the model function and stores compiled modules in a cache keyed by converted kwargs; `Pipeline._apply_optimizations()` applies it when `stable_fast` is enabled.
- Project integration: only model types that advertise `supports_stable_fast=True` can use it. Flux2 explicitly opts out at the capability layer.
- Effect: faster repeated UNet execution when the optional `sfast` dependency is present and shapes stay compatible enough for compilation reuse.
- Benefits: capability-gated, optional dependency handled defensively, integrated into the core optimization application phase.
- Trade-offs: dependency-sensitive, compilation overhead can dominate short runs, CUDA graph behavior is less flexible.
- Evidence: `src/StableFast/StableFast.py`, `src/Core/Pipeline.py`, `src/Core/Models/SD15Model.py`, `src/Core/Models/SDXLModel.py`, `src/Core/Models/Flux2KleinModel.py`.
### `torch.compile`
- Status: `integrated, optional`
- Purpose: rely on PyTorch compiler paths instead of Stable-Fast.
- Implementation in LightDiffusion-Next: `src/Device/Device.py::compile_model()` defaults to `max-autotune-no-cudagraphs`; `src/Core/AbstractModel.py::apply_torch_compile()` applies it to the top-level module or diffusion submodule when possible.
- Project integration: the optimization is mutually exclusive with Stable-Fast in the main pipeline.
- Effect: compiler-based speedups with a safer default mode than more fragile CUDA-graph-heavy settings.
- Benefits: built on standard PyTorch, tested for safe default mode.
- Trade-offs: compiler behavior is environment-dependent; still vulnerable to dynamic-shape and dynamic-state limitations.
- Evidence: `src/Device/Device.py`, `src/Core/AbstractModel.py`, `src/Core/Pipeline.py`, `tests/unit/test_fp8_compile.py`.
### VAE compile, tiled path, and transfer tuning
- Status: `integrated`
- Purpose: speed up VAE encode/decode, reduce overhead, and avoid OOM by choosing tiled or batched paths.
- Implementation in LightDiffusion-Next: `VariationalAE.VAE` compiles the decoder on first use, runs decode/encode under `torch.inference_mode()`, uses channels-last where useful, chooses tiled fallback when memory is tight, and uses non-blocking transfers.
- Project integration: this is automatic. Callers do not opt in.
- Effect: faster VAE stages, less repeated Python/autograd overhead, and better robustness under constrained memory.
- Benefits: always enabled and directly applied in the decode and encode hot path.
- Trade-offs: decoder compile still depends on `torch.compile` availability; tiling adds complexity and can affect throughput at small sizes.
- Evidence: `src/AutoEncoders/VariationalAE.py`.
### BF16/FP16 automatic dtype selection
- Status: `integrated, conditional`
- Purpose: pick a lower-precision working dtype that matches the hardware and model constraints.
- Implementation in LightDiffusion-Next: `src/Device/Device.py` contains the dtype selection logic for UNet, text encoder, and VAE devices/dtypes, including bf16 support checks and fallback rules.
- Project integration: loaders and patchers consult these helpers when deciding how to instantiate and place components.
- Effect: reduced memory footprint and better arithmetic throughput on modern hardware.
- Benefits: broad, centralized policy.
- Trade-offs: heuristic; wrong hardware assumptions can reduce numerical stability or disable a faster path.
- Evidence: `src/Device/Device.py`, `src/Model/ModelPatcher.py`, `src/FileManaging/Loader.py`.
### FP8 weight quantization
- Status: `integrated, conditional`
- Purpose: store weights in FP8 while casting them back to the input dtype during execution.
- Implementation in LightDiffusion-Next: `AbstractModel.apply_fp8()` hardware-gates support using `Device.is_fp8_supported()`, rewrites eligible weights to FP8, and enables runtime cast behavior on `CastWeightBiasOp` modules. The lower-level `ModelPatcher.weight_only_quantize()` also supports FP8-style quantization.
- Project integration: it is available through generation settings and also used in Flux2 load paths when appropriate.
- Effect: lower model weight memory with an execution path that avoids dtype-mismatch crashes.
- Benefits: tested explicitly, integrates with cast-aware modules, useful for large models.
- Trade-offs: hardware-gated; quality/performance trade-offs depend on model and layer mix.
- Evidence: `src/Core/AbstractModel.py`, `src/Device/Device.py`, `src/Model/ModelPatcher.py`, `tests/unit/test_fp8_compile.py`.
### NVFP4 weight quantization
- Status: `integrated, optional`
- Purpose: use a more aggressive 4-bit weight-only format to reduce memory further than FP8.
- Implementation in LightDiffusion-Next: both `AbstractModel.apply_nvfp4()` and `ModelPatcher.weight_only_quantize("nvfp4")` quantize supported weights, store scale buffers, and enable runtime casting/dequantization.
- Project integration: the quantization path is used most clearly in Flux2/Klein loading, but the abstract model path also exists for supported models.
- Effect: significant memory reduction at the cost of more aggressive approximation.
- Benefits: strongest memory reduction path in the repo.
- Trade-offs: more invasive than FP8, more likely to affect quality, and only applies to some weight shapes.
- Evidence: `src/Core/AbstractModel.py`, `src/Model/ModelPatcher.py`, `src/Utilities/Quantization.py`, `tests/test_nvfp4.py`, `tests/test_nvfp4_integration.py`.
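The quantize-then-dequantize-at-runtime pattern can be illustrated with a toy integer version. NVFP4's actual format (FP4 values with block scales) is more involved than this sketch; the block size, integer codes, and function names below are assumptions used only to show the memory/accuracy trade-off with stored scale buffers.

```python
# Toy 4-bit weight-only quantization with per-block absmax scales
# (illustrative sketch, not the NVFP4 format).
def quantize_4bit(weights, block=4):
    blocks = []
    for i in range(0, len(weights), block):
        chunk = weights[i:i + block]
        scale = max(abs(w) for w in chunk) / 7 or 1.0  # map values into [-7, 7]
        codes = [round(w / scale) for w in chunk]      # 4-bit signed codes
        blocks.append((scale, codes))                  # scale stored alongside codes
    return blocks

def dequantize_4bit(blocks):
    out = []
    for scale, codes in blocks:
        out.extend(c * scale for c in codes)           # runtime cast back
    return out
```

The stored per-block scales are the analogue of the scale buffers the report mentions; dequantization happens at use time, so only the compact codes and scales occupy weight memory.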
### Flux2 load-time weight-only quantization
- Status: `integrated, conditional`
- Purpose: automatically quantize large Flux2 diffusion and Klein text encoder weights during loading when the configuration or hardware path calls for it.
- Implementation in LightDiffusion-Next: `Flux2KleinModel.load()` selects a quantization format and applies weight-only quantization to the diffusion model; `_load_klein_text_encoder()` applies the same idea to the text encoder before offloading it back to CPU.
- Project integration: Flux2 is the clearest example in the codebase where quantization is implemented as a first-class loading strategy rather than as a generic capability alone.
- Effect: keeps a large Flux2/Klein stack usable on lower-VRAM systems than an uncompressed load would allow.
- Benefits: integrated, architecture-specific, and directly aligned with large-model VRAM constraints.
- Trade-offs: tightly coupled to Flux2/Klein assumptions; not equivalent to a universally available quantized-mode toggle.
- Evidence: `src/Core/Models/Flux2KleinModel.py`.
### ToMe
- Status: `integrated, optional`
- Purpose: merge similar tokens to reduce attention workload in UNet-based models.
- Implementation in LightDiffusion-Next: `ModelPatcher.apply_tome()` applies and removes `tomesd` patches; `Pipeline._apply_optimizations()` applies it only when the model capabilities allow it.
- Project integration: SD1.5 and SDXL advertise `supports_tome=True`; Flux2 advertises `False`.
- Effect: lower attention cost on supported UNet models, particularly at higher token counts.
- Benefits: explicitly capability-gated, integrated into the core optimization phase.
- Trade-offs: optional dependency, UNet-only in current practice, and quality can soften if pushed too aggressively.
- Evidence: `src/Model/ModelPatcher.py`, `src/Core/Pipeline.py`, capability declarations in `src/Core/Models/*`, `tests/unit/test_tome_fix.py`.
### DeepCache
- Status: `integrated, optional, implementation-specific`
- Purpose: reuse work across denoising steps rather than running a full forward pass every time.
- Implementation in LightDiffusion-Next: `ApplyDeepCacheOnModel.patch()` clones the model and wraps its UNet function. On cache-update steps it runs the model normally and stores the output; on reuse steps it returns the cached output directly.
- Project integration: the main pipeline applies it from `_apply_optimizations()` when `deepcache_enabled` is true and the model advertises support.
- Effect: fewer full model computations on reuse steps, trading some fidelity for speed.
- Benefits: live integrated path, simple integration model, and capability gating.
- Trade-offs: the implementation works at whole-output reuse granularity rather than a finer-grained internal block reuse strategy, so its speed/fidelity profile is comparatively coarse.
- Evidence: `src/WaveSpeed/deepcache_nodes.py`, `src/Core/Pipeline.py`, `src/Core/AbstractModel.py`, `src/Core/Models/SD15Model.py`, `src/Core/Models/SDXLModel.py`, `tests/test_core_functionalities.py`.
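The whole-output reuse pattern is easy to show in miniature. The class name and fixed-interval policy are illustrative; `ApplyDeepCacheOnModel` wraps the UNet function similarly but with its own update schedule.

```python
# Whole-output reuse at a fixed cache interval (illustrative sketch).
class CachedDenoiser:
    def __init__(self, model_fn, interval=3):
        self.model_fn = model_fn
        self.interval = interval
        self._cached = None
        self.full_calls = 0  # instrumentation for this sketch only

    def __call__(self, step, *args):
        if step % self.interval == 0 or self._cached is None:
            self.full_calls += 1
            self._cached = self.model_fn(*args)  # cache-update step: run fully
        return self._cached                      # reuse steps return stored output
```

Because reuse happens at the granularity of the entire denoiser output, the speedup is large on reuse steps but the fidelity cost is coarser than block-level caching schemes, which is exactly the trade-off noted above.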
### First Block Cache for Flux
- Status: `codebase groundwork`
- Purpose: cache downstream transformer work when the first-block residual indicates the state has not changed much.
- Implementation in LightDiffusion-Next: `src/WaveSpeed/first_block_cache.py` contains cache contexts and patch builders for both UNet-like and Flux-like forward paths.
- Project integration: the module provides the machinery for a Flux-oriented first-block caching path. In the current project flow, the directly surfaced caching path is DeepCache, while this module remains groundwork for a more specialized integration.
- Effect: establishes the components needed for a transformer-oriented cache path in the codebase.
- Benefits: nontrivial implementation foundation already exists.
- Trade-offs: it is not yet surfaced as a broad standard option in the same way as the main integrated optimizations.
- Evidence: `src/WaveSpeed/first_block_cache.py`.
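The residual-gated idea behind first-block caching can be sketched with scalars standing in for tensors. Everything here is illustrative: the function name, the relative-change threshold, and the scalar math are assumptions, not the module's actual cache contexts.

```python
def first_block_cache_forward(first_block, rest, x, state, threshold=0.05):
    """Run the first block; if its residual barely moved since the last
    step, skip the remaining blocks and reuse the cached result."""
    residual = first_block(x)
    prev = state.get("residual")
    if prev is not None and abs(residual - prev) / (abs(prev) + 1e-8) < threshold:
        return state["cached"]          # downstream work skipped
    state["residual"] = residual
    state["cached"] = rest(residual)    # full downstream compute
    return state["cached"]

state = {}
first = lambda x: x + 1.0
rest_calls = []
def rest(r):
    rest_calls.append(r)
    return r * 10

y1 = first_block_cache_forward(first, rest, 1.0, state)    # computes fully
y2 = first_block_cache_forward(first, rest, 1.001, state)  # residual ~unchanged, reuses
```

Unlike the whole-output reuse in DeepCache, the first block still runs every step, so the skip decision is informed by the current input rather than a fixed step schedule.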
## Memory Management And Serving Optimizations
### Low-VRAM partial loading and offload policy
- Status: `integrated`
- Purpose: keep only the amount of model state in VRAM that current free memory allows, offloading the rest.
- Implementation in LightDiffusion-Next: `cond_util.prepare_sampling()` calls `Device.load_models_gpu(..., force_full_load=False)`; `Device.load_models_gpu()` computes low-VRAM budgets and delegates partial loading to `ModelPatcher.patch_model_lowvram()` and `partially_load()`.
- Project integration: this is a core loading behavior, not a side option. Text encoder and VAE also have explicit offload-device helpers.
- Effect: keeps generation viable on limited VRAM systems and reduces full reload pressure.
- Benefits: central to memory behavior in constrained environments, architecture-aware, and tied into checkpoint, text encoder, and VAE device policy.
- Trade-offs: more complex state management; partial loading can increase latency and complicate debugging.
- Evidence: `src/cond/cond_util.py`, `src/Device/Device.py`, `src/Model/ModelPatcher.py`.
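A budget-driven partial load can be sketched as a simple planner. This is a hypothetical simplification of what `load_models_gpu()` does; the `reserve` headroom fraction and greedy fill order are assumptions made for illustration.

```python
def plan_partial_load(module_sizes, free_vram, reserve=0.1):
    """Greedily place modules on the GPU up to a reserved-headroom
    budget; everything that does not fit is marked for offload."""
    budget = free_vram * (1 - reserve)
    on_gpu, offloaded, used = [], [], 0
    for name, size in module_sizes:
        if used + size <= budget:
            on_gpu.append(name)
            used += size
        else:
            offloaded.append(name)
    return on_gpu, offloaded

# Sizes in MB; 900 MB free, so the 810 MB budget fits only two modules.
mods = [("unet.in", 400), ("unet.mid", 300), ("unet.out", 400)]
gpu, cpu = plan_partial_load(mods, free_vram=900)
```

The real policy is architecture-aware rather than purely greedy, but the core trade-off is the same: the less VRAM is free, the more of the model lives on the offload device and pays a transfer cost per use.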
### Async transfer helpers and pinned checkpoint tensors
- Status: `integrated, infrastructure-level`
- Purpose: reduce CPU<->GPU transfer cost with asynchronous copies, streams, and pinned host memory.
- Implementation in LightDiffusion-Next: `Device.cast_to()` can issue transfers on offload streams; checkpoint tensors are pinned on CUDA loads in `util.load_torch_file()`; VAE encode/decode uses non-blocking transfers.
- Project integration: these mechanisms appear most clearly in checkpoint loading, model movement, and VAE data flow. Some parts act as general transfer infrastructure rather than as a single user-facing optimization toggle.
- Effect: faster host/device movement and less transfer-induced stalling in hot paths that actually use the helpers.
- Benefits: useful on CUDA systems, especially during model load and VAE stages.
- Trade-offs: integration is uneven; some helpers expose a broader surface than their current call footprint exercises.
- Evidence: `src/Device/Device.py`, `src/Utilities/util.py`, `src/AutoEncoders/VariationalAE.py`.
### Request coalescing and queue batching
- Status: `integrated`
- Purpose: batch compatible API requests together so the backend does fewer larger pipeline invocations.
- Implementation in LightDiffusion-Next: `server.py::GenerationBuffer` groups pending requests by a signature that includes model, size, scheduler, sampler, steps, multiscale settings, and other batch-level properties.
- Project integration: the worker chooses the oldest eligible group, optionally waits for more arrivals, flattens per-request samples into one pipeline call, and later remaps saved results back to request futures.
- Effect: better throughput and GPU utilization for concurrent API use.
- Benefits: real server-level optimization, clearly implemented, includes observability-oriented logs.
- Trade-offs: requires careful grouping keys; incompatible request options fragment batching opportunities.
- Evidence: `server.py`.
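Signature-based coalescing can be sketched as a grouping pass over the pending queue. The field names below are illustrative; the real key in `server.py` also covers scheduler, sampler, multiscale settings, and other batch-level properties.

```python
from collections import OrderedDict

def coalesce(requests):
    """Group compatible requests by a batch signature, preserving
    arrival order so the oldest group can be served first."""
    groups = OrderedDict()
    for req in requests:
        key = (req["model"], req["width"], req["height"], req["steps"])
        groups.setdefault(key, []).append(req)
    return groups

reqs = [
    {"id": 1, "model": "sd15", "width": 512, "height": 512, "steps": 20},
    {"id": 2, "model": "sdxl", "width": 1024, "height": 1024, "steps": 20},
    {"id": 3, "model": "sd15", "width": 512, "height": 512, "steps": 20},
]
grouped = coalesce(reqs)
oldest_key = next(iter(grouped))  # sd15 group: requests 1 and 3 batch together
```

The fragmentation trade-off falls out directly: any field included in the key splits otherwise batchable requests into separate groups.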
### Singleton policy, large-group chunking, and image-save guardrails
- Status: `integrated`
- Purpose: prevent batching from hurting latency for lone requests, and prevent oversized coalesced batches from overwhelming decode/save paths.
- Implementation in LightDiffusion-Next: `LD_BATCH_WAIT_SINGLETONS` controls whether singletons wait; `LD_MAX_IMAGES_PER_GROUP` and `ImageSaver.MAX_IMAGES_PER_SAVE` drive chunking; large groups are split into smaller sequential pipeline runs.
- Project integration: the server keeps the coalescing optimization from turning into pathological giant save/decode operations, and tests cover the chunking behavior.
- Effect: better tail latency for single requests and more stable handling of large batched workloads.
- Benefits: directly addresses operational failure modes in large batched workloads.
- Trade-offs: chunking reduces some batching benefits; many environment variables affect behavior.
- Evidence: `server.py`, `src/FileManaging/ImageSaver.py`, `tests/unit/test_generation_buffer_chunking.py`, `docs/quirks.md`.
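Chunking a coalesced group can be sketched as a greedy split on total image count. This is an illustrative reconstruction of the `LD_MAX_IMAGES_PER_GROUP` behavior; the `samples` field name and the exact split rule are assumptions.

```python
def chunk_group(requests, max_images_per_group):
    """Split an oversized coalesced group into sequential chunks so no
    single pipeline run decodes/saves more than the configured cap."""
    chunks, current, count = [], [], 0
    for req in requests:
        n = req["samples"]
        if current and count + n > max_images_per_group:
            chunks.append(current)       # close the full chunk
            current, count = [], 0
        current.append(req)
        count += n
    if current:
        chunks.append(current)
    return chunks

reqs = [{"id": i, "samples": 4} for i in range(5)]   # 20 images total
chunks = chunk_group(reqs, max_images_per_group=8)   # three sequential runs
```

Each chunk becomes its own pipeline invocation, which is exactly where the "chunking reduces some batching benefits" trade-off comes from.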
### Next-model prefetch
- Status: `integrated`
- Purpose: while one batch is running, read the next checkpoint into CPU RAM if the queued next batch needs a different model.
- Implementation in LightDiffusion-Next: `GenerationBuffer._look_ahead_and_prefetch()` resolves the next checkpoint, loads it via `util.load_torch_file()` on a background task, and stores it in `ModelCache` as a prefetched state dict.
- Project integration: the next load can reuse the prefetched state dict through `util.load_torch_file()` before the cache entry is cleared.
- Effect: overlaps some future checkpoint load cost with current generation work.
- Benefits: server-side latency hiding with minimal interface impact.
- Trade-offs: only helps when queued work is predictable; increases CPU RAM usage.
- Evidence: `server.py`, `src/Device/ModelCache.py`, `src/Utilities/util.py`.
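The look-ahead pattern can be sketched with a background thread. This is a hypothetical simplification: the project uses async tasks and `util.load_torch_file()`, while the class, loader callable, and pop-once semantics below are illustrative.

```python
import threading

class PrefetchCache:
    """Read a checkpoint into RAM on a background thread while the
    current batch runs; the consumer pops it exactly once."""
    def __init__(self, loader):
        self._loader = loader
        self._entries = {}
        self._lock = threading.Lock()

    def prefetch(self, path):
        def _work():
            data = self._loader(path)   # expensive disk read off the hot path
            with self._lock:
                self._entries[path] = data
        t = threading.Thread(target=_work)
        t.start()
        return t

    def pop(self, path):
        with self._lock:
            return self._entries.pop(path, None)  # cleared after first use

cache = PrefetchCache(loader=lambda p: f"state_dict:{p}")
cache.prefetch("ckpt/next.safetensors").join()
sd = cache.pop("ckpt/next.safetensors")
```

Pop-once semantics matter here: clearing the entry after reuse is what keeps the CPU RAM cost transient rather than cumulative.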
### Keep-models-loaded cache
- Status: `integrated`
- Purpose: keep recently used checkpoints and sampling models resident instead of cleaning them up after every request.
- Implementation in LightDiffusion-Next: `ModelCache` stores checkpoints, TAESD models, sampling models, and the keep-loaded policy; `server.py` temporarily applies the request's `keep_models_loaded` directive for a group.
- Project integration: when enabled, main models are retained and only auxiliary control models are cleaned up aggressively.
- Effect: lower warm-start cost between related generations and less repetitive reload churn.
- Benefits: simple end-user behavior for a meaningful latency/memory trade-off.
- Trade-offs: consumes more VRAM/RAM; can make memory pressure less predictable on multi-user servers.
- Evidence: `src/Device/ModelCache.py`, `server.py`.
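The retention policy can be sketched as a selective cleanup pass. The `kind` categories and dict layout below are illustrative stand-ins for `ModelCache`'s internal structure.

```python
def cleanup(cache, keep_models_loaded):
    """When keep_models_loaded is set, retain main models (checkpoints,
    TAESD, sampling models) and evict only auxiliary control models;
    otherwise evict everything."""
    if keep_models_loaded:
        evict = [k for k, v in cache.items() if v["kind"] == "control"]
    else:
        evict = list(cache)
    for k in evict:
        del cache[k]
    return cache

cache = {
    "sd15.ckpt": {"kind": "checkpoint"},
    "taesd": {"kind": "taesd"},
    "canny": {"kind": "control"},
}
cleanup(cache, keep_models_loaded=True)  # only the control model is evicted
```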
### In-memory PNG byte buffer
- Status: `integrated`
- Purpose: return API images from memory instead of reading them back from disk after save.
- Implementation in LightDiffusion-Next: `ImageSaver` can store encoded PNG bytes in `_image_bytes_buffer`; `server.py` first calls `pop_image_bytes()` when fulfilling request futures.
- Project integration: batched pipeline runs can still save images normally while the API path avoids a disk round-trip for the response payload.
- Effect: lower response latency and less unnecessary disk I/O for served images.
- Benefits: directly reduces response-path disk I/O in API-serving scenarios.
- Trade-offs: consumes temporary RAM; only helps when the buffer path is actually populated.
- Evidence: `src/FileManaging/ImageSaver.py`, `server.py`.
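The buffer-first response path can be sketched as a store/pop pair. This mirrors the `_image_bytes_buffer`/`pop_image_bytes()` naming from the report, but the per-request keying and list layout are assumptions.

```python
class ImageBuffer:
    """Stash encoded PNG bytes at save time so the serving path can
    answer from memory; an empty pop signals a disk-read fallback."""
    def __init__(self):
        self._image_bytes_buffer = {}

    def store(self, request_id, png_bytes):
        self._image_bytes_buffer.setdefault(request_id, []).append(png_bytes)

    def pop_image_bytes(self, request_id):
        # Pop rather than get: the buffer only holds RAM transiently.
        return self._image_bytes_buffer.pop(request_id, None)

buf = ImageBuffer()
buf.store("req-1", b"\x89PNG...")
payload = buf.pop_image_bytes("req-1")   # served from memory
fallback = buf.pop_image_bytes("req-1")  # None -> fall back to disk read
```

Disk saves still happen normally; only the response payload skips the round-trip, which is why the trade-off is transient RAM rather than lost persistence.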
### TAESD preview pacing and preview fidelity control
- Status: `integrated, conditional`
- Purpose: keep live previews useful without letting preview generation dominate sampling time.
- Implementation in LightDiffusion-Next: `SamplerCallback` caches preview settings, only triggers previews at a coarse interval, and runs preview work on a background thread; the server also applies per-request preview fidelity presets (`low`, `balanced`, `high`).
- Project integration: previews are generated only when previewing is enabled, and the preview cadence is adaptive to total step count.
- Effect: live feedback with bounded preview overhead.
- Benefits: explicit pacing, non-blocking thread model, request-level fidelity override.
- Trade-offs: still extra work during sampling; fidelity presets are intentionally coarse.
- Evidence: `src/sample/BaseSampler.py`, `src/AutoEncoders/taesd.py`, `server.py`, preview tests under `tests/e2e` and `tests/integration/api`.
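Step-count-adaptive pacing can be sketched in a few lines. The constants (`target_previews`, `min_interval`) are illustrative, not the project's actual tuning.

```python
def preview_interval(total_steps, target_previews=8, min_interval=2):
    """Aim for roughly `target_previews` previews per run, never
    previewing more often than every `min_interval` steps."""
    return max(min_interval, total_steps // target_previews)

def should_preview(step, total_steps, enabled=True):
    """Gate preview work: disabled runs do zero preview work, enabled
    runs preview only on the paced steps."""
    if not enabled:
        return False
    return step % preview_interval(total_steps) == 0
```

The effect is that short runs still get frequent-enough feedback while long runs do not scale preview cost linearly with step count; the actual decode then runs on a background thread so sampling is not blocked.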
## Integration Notes
These notes highlight how several optimizations are currently integrated and used inside the project.
### 1. Flux-oriented first block caching
- The codebase contains a dedicated `src/WaveSpeed/first_block_cache.py` module with cache contexts and patch builders for Flux-oriented paths.
- In the current optimization stack, the directly surfaced caching path is DeepCache, while First Block Cache remains implementation groundwork for a more specialized integration.
- This establishes the core components for a transformer-oriented cache path even though it is not yet surfaced as a primary standard option.
### 2. DeepCache reuse granularity
- DeepCache is integrated through `src/WaveSpeed/deepcache_nodes.py` and is applied from the main pipeline when enabled.
- In this project, it works by reusing prior denoiser outputs on designated reuse steps.
- This yields a clear speed/fidelity profile based on output reuse rather than on finer-grained internal block caching.
### 3. Conditioning batching control
- Conditioning batching is centered in `src/cond/cond.py::calc_cond_batch()`, where compatible condition chunks are packed and concatenated.
- The `batched_cfg` request field participates as request-side control metadata around this behavior.
- In operation, the batching outcome is therefore shaped mainly by the central conditioning logic rather than by a standalone external switch.
### 4. GPU attention backend selection
- Attention backend selection is hardware- and build-aware, with the runtime choosing among SpargeAttn, SageAttention, xformers, and PyTorch SDPA based on capability checks.
- The exact backend used in practice depends on the active GPU generation, installed dependencies, and runtime configuration.
- Backend acceleration is therefore largely automatic from the user perspective while remaining environment-specific in implementation.
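The capability-ordered fallback described above can be sketched as a simple priority chain. The probing is illustrative: in the project, availability is determined by hardware and build checks, not a plain dict.

```python
def pick_attention_backend(available):
    """Prefer the fastest available backend, falling back down the
    chain to PyTorch SDPA, which is always present. Order mirrors the
    report; the `available` mapping stands in for real capability probes."""
    for backend in ("spargeattn", "sageattention", "xformers", "sdpa"):
        if available.get(backend):
            return backend
    return "sdpa"

# A system with xformers but no Sage/Sparge kernels lands on xformers.
choice = pick_attention_backend({"xformers": True, "sdpa": True})
```

Because the chain always terminates at SDPA, the selection is transparent to users while remaining sensitive to the environment.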
### 5. Prompt cache behavior
- Prompt caching is implemented as a global dict-backed cache keyed by prompt hash and CLIP identity.
- The cache prunes old entries once it exceeds its configured size threshold.
- In operation, it primarily benefits repeated-prompt workflows such as seed sweeps and prompt iteration.
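A dict-backed prompt cache with size-threshold pruning can be sketched as follows. The class, the threshold value, and oldest-first eviction order are assumptions for illustration; only the keying scheme (prompt hash plus CLIP identity) comes from the report.

```python
import hashlib
from collections import OrderedDict

class PromptCache:
    """Cache encoded conditioning keyed by (prompt hash, CLIP identity),
    pruning oldest entries once the size threshold is exceeded."""
    def __init__(self, max_entries=3):
        self._store = OrderedDict()
        self._max = max_entries

    def _key(self, prompt, clip_id):
        return (hashlib.sha256(prompt.encode()).hexdigest(), clip_id)

    def get(self, prompt, clip_id):
        return self._store.get(self._key(prompt, clip_id))

    def put(self, prompt, clip_id, cond):
        self._store[self._key(prompt, clip_id)] = cond
        while len(self._store) > self._max:
            self._store.popitem(last=False)  # prune oldest entry

cache = PromptCache(max_entries=2)
cache.put("a cat", "clip-l", "cond-1")
cache.put("a dog", "clip-l", "cond-2")
cache.put("a fox", "clip-l", "cond-3")  # exceeds threshold, evicts "a cat"
```

A seed sweep reuses the same prompt across many generations, so every run after the first hits the cache and skips the text-encoder forward pass entirely.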
## Conclusion
LightDiffusion-Next uses a layered optimization strategy spanning runtime kernels, scheduling, guidance logic, precision and memory control, model patching, and server-side throughput management.
- The core operational stack is built around AYS scheduling, attention backend selection, conditioning batching, low-VRAM loading policy, prompt caching, VAE tuning, and request coalescing.
- Optional paths such as Stable-Fast, `torch.compile`, ToMe, DeepCache, multiscale sampling, and quantization extend that stack for specific hardware targets, model families, and workload profiles.
- The serving layer is a first-class component of the performance model, with batching, chunking, prefetching, keep-loaded caches, and in-memory responses contributing directly to end-to-end latency and throughput.