Papers
arxiv:2605.06169

Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers

Published on May 7
· Submitted by Pengqi Lu on May 11

Abstract

Deep diffusion transformers face structural instability at extreme depth: a mean-dominated collapse, triggered by Mean Mode Screaming, homogenizes token representations. Mean–Variance Split residuals mitigate this collapse, maintaining stable training while preserving performance.

AI-generated summary

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger of this collapse, Mean Mode Screaming (MMS). MMS can occur even when training appears stable: a mean-coherent backward shock on residual writers opens deep residual branches and drives the network into a mean-dominated state. Using an exact decomposition of these gradients into mean-coherent and centered components, we show how the mean mode comes to dominate the residual update, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.

Community


Mean Mode Screaming (MMS) — the abrupt entry event into a silent, mean-dominated collapse state in ultra-deep Diffusion Transformers. Optimization can remain stable for thousands of steps and then diverge within a few updates, with the loss returning to near its initialization level. We trace this to a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a state where token representations homogenize and centered variation is suppressed.

Mechanistically, the failure exploits a geometric asymmetry between the token-mean and centered subspaces. Row-stochastic attention strictly preserves pure-mean states, while gradients admit an exact decomposition into mean-coherent and centered components — as token alignment increases, the mean-coherent component accumulates in an O(T) coherent regime and dominates the residual update. Once values homogenize, attention-logit gradients are suppressed through the null space of the Softmax Jacobian, locking the network into the collapsed state. Existing depth stabilizers (ReZero, LayerScale) shrink the two modes by the same factor — this stabilizes training but also damps the signal-bearing centered mode, slowing convergence.
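The three geometric claims above (the exact mean/centered gradient decomposition, mean preservation under row-stochastic attention, and the constant-direction null space of the softmax Jacobian) can be checked numerically. Below is a minimal NumPy sketch with toy shapes; it is an illustration of the properties, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4  # toy token count and feature dimension

# Hypothetical per-token gradient on a residual writer (T tokens, d dims).
G = rng.standard_normal((T, d))

# Exact decomposition: the mean-coherent part repeats the token-mean row;
# the centered part is the remainder.  G = G_mean + G_centered, exactly.
G_mean = np.ones((T, 1)) @ G.mean(axis=0, keepdims=True)
G_centered = G - G_mean
assert np.allclose(G, G_mean + G_centered)

# Row-stochastic attention preserves pure-mean states: A @ (1 m^T) = 1 m^T,
# because each attention row sums to one.
logits = rng.standard_normal((T, T))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pure_mean_state = np.ones((T, 1)) @ rng.standard_normal((1, d))
assert np.allclose(A @ pure_mean_state, pure_mean_state)

# When per-token gradients align (become mean-coherent), their sum grows
# like O(T), versus O(sqrt(T)) for incoherent directions.
g = rng.standard_normal(d)
coherent = np.tile(g, (T, 1))  # all tokens share one direction
ratio = np.linalg.norm(coherent.sum(axis=0)) / np.linalg.norm(g)  # equals T

# The softmax Jacobian J = diag(p) - p p^T annihilates constant vectors,
# so uniform (mean-direction) logit perturbations receive no gradient.
p = A[0]
J = np.diag(p) - np.outer(p, p)
assert np.allclose(J @ np.ones(T), 0.0)
```

The last assertion is the "null space" mechanism: once values homogenize and useful logit gradients become constant across keys, they fall into this null space and are suppressed.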

Mean–Variance Split (MV-Split) Residuals combine a separately gained centered residual update with a leaky trunk-mean replacement, regulating the mean path without scaling down the centered path by the same factor. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline and converges faster than LayerScale across the full schedule. A 1000-layer DiT serves as a scale-validation run at boundary scales.
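One way to read "a separately gained centered residual update with a leaky trunk-mean replacement" is sketched below. This is a guess at one concrete form, not the paper's implementation; the gain `gamma_c` and leak rate `alpha` are illustrative names:

```python
import numpy as np

def mv_split_residual(x, branch_out, gamma_c=1.0, alpha=0.1):
    """Hypothetical sketch of a Mean-Variance Split residual update.

    x          : (T, d) trunk state
    branch_out : (T, d) residual-branch output (attention or MLP)
    gamma_c    : separate gain on the centered residual path
    alpha      : leak rate of the trunk-mean replacement
    """
    # Split trunk and branch into token-mean and centered parts.
    x_mean = x.mean(axis=0, keepdims=True)
    b_mean = branch_out.mean(axis=0, keepdims=True)
    x_centered = x - x_mean
    b_centered = branch_out - b_mean

    # Centered path: an ordinary gained residual add, so the
    # signal-bearing mode is not damped along with the mean mode.
    new_centered = x_centered + gamma_c * b_centered

    # Mean path: leaky replacement rather than accumulation, so the
    # mean mode cannot compound across hundreds of layers.
    new_mean = (1.0 - alpha) * x_mean + alpha * b_mean

    return new_centered + new_mean
```

Under this reading, the contrast with ReZero/LayerScale is that the two subspaces get independent dynamics: the centered update keeps full (or separately tuned) gain, while only the mean path is regulated.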

🌐 Interactive gradient-diagnosis viewer replays the actual W&B run that crashed — per-layer 3D token-flow driven by W&B metrics. Scrub the timeline, switch the driving metric, hover for live values: https://erwold.github.io/mv-split/

🤗 1000-layer DiT weights: https://huggingface.co/StableKirito/mvsplit-dit-1000l


Get this paper in your agent:

hf papers read 2605.06169
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
