| | <!DOCTYPE html> |
| | <html lang="en"> |
| | <head> |
| | <meta charset="UTF-8"> |
| | <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| | <title>Sensitivity-Aware Training (SAT): Using Statistical Weight Geometry to Guide LLM Training Dynamics</title> |
| | <style> |
| | :root { |
| | --bg: #fafafa; |
| | --text: #1a1a2e; |
| | --accent: #2d5aa0; |
| | --muted: #555; |
| | --border: #ddd; |
| | --code-bg: #f0f0f0; |
| | --table-header: #e8eef6; |
| | --highlight: #fff3cd; |
| | } |
| | * { box-sizing: border-box; margin: 0; padding: 0; } |
| | body { |
| | font-family: 'Georgia', 'Times New Roman', serif; |
| | line-height: 1.7; |
| | color: var(--text); |
| | background: var(--bg); |
| | max-width: 52em; |
| | margin: 0 auto; |
| | padding: 2em 1.5em 4em; |
| | } |
| | h1 { |
| | font-size: 1.7em; |
| | line-height: 1.3; |
| | text-align: center; |
| | margin: 1.5em 0 0.3em; |
| | color: var(--text); |
| | } |
| | .authors { |
| | text-align: center; |
| | color: var(--muted); |
| | margin-bottom: 2em; |
| | font-style: italic; |
| | } |
| | h2 { |
| | font-size: 1.3em; |
| | margin: 2em 0 0.7em; |
| | color: var(--accent); |
| | border-bottom: 2px solid var(--accent); |
| | padding-bottom: 0.2em; |
| | } |
| | h3 { |
| | font-size: 1.1em; |
| | margin: 1.5em 0 0.5em; |
| | color: #333; |
| | } |
| | h4 { |
| | font-size: 1em; |
| | font-weight: bold; |
| | margin: 1.2em 0 0.3em; |
| | color: var(--text); |
| | } |
| | p { margin: 0.7em 0; text-align: justify; } |
| | .abstract { |
| | background: #f5f5f5; |
| | border-left: 4px solid var(--accent); |
| | padding: 1.2em 1.5em; |
| | margin: 1.5em 0 2em; |
| | font-size: 0.95em; |
| | } |
| | .abstract strong { color: var(--accent); } |
| | ul, ol { margin: 0.5em 0 0.5em 2em; } |
| | li { margin: 0.3em 0; } |
| | table { |
| | border-collapse: collapse; |
| | margin: 1em auto; |
| | font-size: 0.9em; |
| | width: auto; |
| | min-width: 50%; |
| | } |
| | th, td { |
| | border: 1px solid var(--border); |
| | padding: 0.45em 0.8em; |
| | text-align: center; |
| | } |
| | th { |
| | background: var(--table-header); |
| | font-weight: bold; |
| | } |
| | td:first-child, th:first-child { text-align: left; } |
| | tr:nth-child(even) { background: #f9f9f9; } |
| | .best { font-weight: bold; color: #1a7a1a; } |
| | .caption { |
| | text-align: center; |
| | font-size: 0.88em; |
| | color: var(--muted); |
| | margin-top: 0.5em; |
| | margin-bottom: 1.5em; |
| | font-style: italic; |
| | } |
| | .equation { |
| | display: block; |
| | text-align: center; |
| | margin: 1em 0; |
| | padding: 0.8em; |
| | background: var(--code-bg); |
| | border-radius: 4px; |
| | font-family: 'Courier New', monospace; |
| | font-size: 0.92em; |
| | overflow-x: auto; |
| | } |
| | .algorithm { |
| | background: #fafafa; |
| | border: 1px solid var(--border); |
| | padding: 1em 1.5em; |
| | margin: 1em 0; |
| | font-family: 'Courier New', monospace; |
| | font-size: 0.88em; |
| | line-height: 1.5; |
| | border-radius: 4px; |
| | } |
| | .algorithm .keyword { color: var(--accent); font-weight: bold; } |
| | .algorithm .comment { color: #888; font-style: italic; } |
| | code { |
| | background: var(--code-bg); |
| | padding: 0.15em 0.4em; |
| | border-radius: 3px; |
| | font-size: 0.9em; |
| | } |
| | .ref { color: var(--accent); cursor: default; } |
| | .footnote { |
| | font-size: 0.85em; |
| | color: var(--muted); |
| | border-top: 1px solid var(--border); |
| | margin-top: 2em; |
| | padding-top: 1em; |
| | } |
| | .bib { |
| | font-size: 0.85em; |
| | margin: 0.3em 0; |
| | padding-left: 2.5em; |
| | text-indent: -2.5em; |
| | } |
| | .section-divider { |
| | border: none; |
| | border-top: 1px solid var(--border); |
| | margin: 2em 0; |
| | } |
| | .highlight-row { background: var(--highlight) !important; } |
| | @media (max-width: 600px) { |
| | body { padding: 1em 0.8em; font-size: 0.95em; } |
| | table { font-size: 0.8em; } |
| | th, td { padding: 0.3em 0.5em; } |
| | } |
| | </style> |
| | </head> |
| | <body> |
| |
|
| | <h1>Sensitivity-Aware Training (SAT): Using Statistical Weight Geometry to Guide LLM Training Dynamics</h1> |
| | <p class="authors"><a href="https://baa.ai">baa.ai</a></p> |
| |
|
| | |
| | <div class="abstract"> |
| | <strong>Abstract.</strong> |
| | Large language models (LLMs) are routinely trained at full floating-point precision and subsequently compressed via post-training quantization (PTQ). The mismatch between the statistical geometry of weights produced by unconstrained training and the requirements of low-bit arithmetic is well documented: outlier weights, pathologically concentrated singular-value spectra, and noise-amplifying layer topologies all degrade quantized model quality. The SWAN framework (Statistical Weight Analysis for quantizatioN) addresses this mismatch retroactively, diagnosing sensitivity after training concludes. We propose <strong>Sensitivity-Aware Training (SAT)</strong>, a principled extension of the SWAN philosophy into the training loop itself. SAT replaces the static, post-hoc sensitivity report with three online training signals: a kurtosis-driven regularisation term that penalises outlier emergence in real time, a spectral norm constraint that maintains well-conditioned weight matrices throughout optimisation, and a targeted noise-injection schedule that surgically hardens only layers flagged as high-risk. Layered on top of these signals is a <strong>Dynamic Bit-Width Allocation (DBWA)</strong> mechanism that periodically evaluates SWAN metrics and adjusts per-layer training precision accordingly, reducing wasted compute on low-sensitivity layers while protecting high-sensitivity ones. Together these mechanisms produce models that are quantization-ready by construction, eliminating a root cause of PTQ degradation rather than compensating for it after the fact. |
| | </div> |
| |
|
| | |
| | <h2>1. Introduction</h2> |
| |
|
| | <p>The deployment lifecycle of a modern LLM involves a conceptual discontinuity: models are trained in high-precision floating point to maximise gradient quality, then compressed aggressively for inference. This two-phase pipeline assumes that a model’s internal geometry—the statistical distribution of its weights and activations—is largely incidental to the training objective and can be post-processed without penalty. Empirical evidence increasingly contradicts this assumption.</p> |
| |
|
| | <p>Dettmers et al. <span class="ref">[2]</span> demonstrated that outlier features emerge systematically in transformer activations and make naive INT8 quantization catastrophically lossy. Subsequent work—including GPTQ, AWQ, SmoothQuant, and QuaRot—has developed increasingly sophisticated post-training correction strategies: weight rotation, activation smoothing, mixed-precision allocation, and learned rounding. The SWAN framework synthesises the diagnostic half of this body of work, exposing three complementary metrics (excess kurtosis, SVD spectral concentration, and output noise amplification) that together characterise a trained layer’s sensitivity to quantization.</p> |
| |
|
| | <p>The natural next question is: if we can measure sensitivity after training, can we prevent the conditions that produce it <em>during</em> training? This paper answers affirmatively. The SWAN metrics, originally conceived as post-hoc diagnostics, can be repurposed as online training signals. The resulting framework—Sensitivity-Aware Training (SAT)—treats the statistical geometry of weights as a first-class training objective alongside the primary language modelling loss.</p> |
| |
|
| | <p><strong>This paper makes the following contributions:</strong></p> |
| | <ol> |
| | <li>A formal derivation of kurtosis-regularised gradient updates that suppress outlier emergence without interfering with model expressiveness.</li> |
| | <li>A spectral norm conditioning constraint that promotes distributed singular-value spectra and improves the robustness of learned representations to precision reduction.</li> |
| | <li>A targeted quantization-noise injection protocol that concentrates the hardening effect of Quantization-Aware Training (QAT) on statistically identified high-risk layers.</li> |
| | <li>A Dynamic Bit-Width Allocation (DBWA) mechanism that periodically queries SWAN metrics to reassign per-layer training precision, substantially reducing memory and compute overhead during pre-training.</li> |
| | <li>A comparative analysis against standard pre-training, full QAT, and SWAN-guided PTQ, demonstrating that SAT achieves superior quantized model quality at competitive or lower training cost.</li> |
| | </ol> |
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>2. Background and Related Work</h2> |
| |
|
| | <h3>2.1 The Outlier Problem in LLM Quantization</h3> |
| |
|
| | <p>Uniform quantization maps a continuous range of values to a fixed grid of <em>b</em>-bit integers. Its accuracy depends critically on the distribution of the values being quantized: the wider and more irregular the distribution, the larger the rounding error. Transformer models trained with standard optimisers develop persistent outlier activations—values orders of magnitude larger than the median—that force quantization grids to accommodate extreme ranges, wasting representational capacity on the tails and degrading the precision of the bulk of the distribution. Weight matrices exhibit a related pathology: a small number of singular values accumulate disproportionate spectral energy, meaning that information is effectively stored in a low-dimensional subspace that is fragile under precision reduction.</p> |
| |
|
| | <h3>2.2 Post-Training Quantization and Its Limits</h3> |
| |
|
| | <p>PTQ methods accept a trained model as given and attempt to correct its statistical deficiencies through transformation. GPTQ <span class="ref">[3]</span> uses second-order weight perturbation to compensate for quantization error. AWQ <span class="ref">[4]</span> identifies salient weight channels via activation magnitude and protects them. SmoothQuant <span class="ref">[9]</span> migrates outlier difficulty from activations to weights, which are easier to quantize. QuaRot <span class="ref">[10]</span> and SpinQuant <span class="ref">[5]</span> apply learned orthogonal rotations to the weight matrices to redistribute singular-value energy. KurTail <span class="ref">[1]</span> directly minimises activation kurtosis via learnable rotation, reducing tail density and improving quantization robustness significantly.</p> |
| |
|
| | <p>These methods are impressive but share a fundamental limitation: they operate on the outputs of a training process that was oblivious to quantization geometry. Corrective transformations can redistribute existing pathology but cannot eliminate the underlying propensity of the optimiser to create it.</p> |
| |
|
| | <h3>2.3 Quantization-Aware Training</h3> |
| |
|
| | <p>QAT addresses this limitation by inserting fake quantization operations into the forward pass during training, allowing the model to adapt its weights to low-precision constraints via the Straight-Through Estimator (STE) for gradient propagation. QAT consistently produces better quantized models than PTQ, but at significant cost: it requires uniform fake-quantization of all layers, which complicates convergence; it must be performed at a specified target bit-width, making it inflexible; and it demands the full memory budget of floating-point training with an additional computational overhead.</p> |
| |
|
| | <h3>2.4 The SWAN Framework</h3> |
| |
|
| | <p>SWAN (Statistical Weight Analysis for quantizatioN) provides a diagnostic toolkit for assessing a trained model’s readiness for quantization. It characterises each layer along three dimensions:</p> |
| |
|
| | <ul> |
| | <li><strong>Excess Kurtosis:</strong> the normalised fourth central moment of the weight distribution, measuring outlier prevalence. A Gaussian distribution has an excess kurtosis of zero; values substantially above zero indicate heavy-tailed, outlier-prone distributions that degrade quantization accuracy.</li> |
| | <li><strong>SVD Spectral Concentration:</strong> measured via the ratio of the largest singular value to the Frobenius norm (or related metrics such as the effective rank). High concentration implies information is stored in few dimensions, making the weight matrix brittle under precision reduction.</li> |
| | <li><strong>Output Noise Amplification:</strong> the sensitivity of a layer’s output to additive perturbations of its inputs, quantifying how aggressively a layer converts small quantization errors into large output deviations.</li> |
| | </ul> |
| |
|
| | <p>SWAN runs in seconds on a trained model and produces a per-layer sensitivity profile that can guide mixed-precision PTQ. SAT proposes to make this profile dynamic and to use it as a training signal rather than a diagnostic endpoint.</p> |
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>3. The SAT Framework</h2> |
| |
|
<p>SAT augments a standard pre-training loop with three continuously updated sensitivity signals and one periodic resource-allocation mechanism. The overall training objective becomes:</p>
| |
|
| | <div class="equation"> |
| | L<sub>total</sub> = L<sub>LM</sub> + λ<sub>κ</sub> · R<sub>kurtosis</sub> + λ<sub>σ</sub> · R<sub>spectral</sub> + L<sub>noise_injection</sub> |
| |   (1) |
| | </div> |
| |
|
| | <p>where L<sub>LM</sub> is the primary language modelling loss, R<sub>kurtosis</sub> and R<sub>spectral</sub> are differentiable regularisation terms, and L<sub>noise_injection</sub> is an augmented forward-pass loss that trains noise resilience in identified high-risk layers. The λ terms are hyperparameters controlling the strength of each regulariser. We now describe each component.</p> |
| |
|
| | <h3>3.1 Kurtosis-Driven Stability (KDS)</h3> |
| |
|
| | <h4>3.1.1 Motivation</h4> |
| |
|
| | <p>The excess kurtosis of a weight tensor <strong>W</strong> is defined as:</p> |
| |
|
<div class="equation">
κ(<strong>W</strong>) = [E((<strong>W</strong> − μ)<sup>4</sup>) / σ<sup>4</sup>] − 3
&nbsp;&nbsp;(2)
</div>
| |
|
| | <p>where μ is the mean and σ is the standard deviation of the elements of <strong>W</strong>. For a Gaussian distribution κ = 0; positive values indicate heavier tails than Gaussian and the presence of outlier values that will stretch quantization grids.</p> |
| |
|
| | <p>Standard optimisers have no mechanism to prevent kurtosis from growing. Weight decay penalises magnitude uniformly, but a few very large weights can persist alongside many small ones; kurtosis captures this asymmetry where L2 regularisation does not. Empirically, kurtosis increases monotonically through pre-training in transformer models, with the sharpest increases occurring in later attention projection layers and feed-forward up-projection matrices.</p> |
| |
|
| | <h4>3.1.2 Regularisation Term</h4> |
| |
|
| | <p>We define the kurtosis regularisation term as the sum of positive excess kurtosis values across all layers:</p> |
| |
|
| | <div class="equation"> |
| | R<sub>kurtosis</sub> = Σ<sub>l</sub> max(0, κ(<strong>W</strong><sub>l</sub>) − κ<sub>target</sub>) |
| |   (3) |
| | </div> |
| |
|
| | <p>where κ<sub>target</sub> is a target kurtosis ceiling (empirically set between 1.5 and 2.5 for most transformer architectures). The max(0, ·) operation makes the regulariser a one-sided penalty: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved layers.</p> |
| |
|
<p>The gradient of R<sub>kurtosis</sub> with respect to <strong>W</strong><sub>l</sub> is computable via automatic differentiation through a differentiable kurtosis estimator. In practice, we evaluate the estimator directly on the weight values rather than the activations, making the operation cheap relative to the forward pass.</p>
| |
|
| | <h4>3.1.3 Adaptive Layer Weighting</h4> |
| |
|
| | <p>Not all layers contribute equally to the kurtosis budget. We introduce a layer-adaptive coefficient λ<sub>κ,l</sub> that scales the penalty by the SWAN sensitivity score S<sub>l</sub> computed at the most recent diagnostic checkpoint:</p> |
| |
|
| | <div class="equation"> |
| | R<sub>kurtosis</sub> = Σ<sub>l</sub> S<sub>l</sub> · max(0, κ(<strong>W</strong><sub>l</sub>) − κ<sub>target</sub>) |
| |   (4) |
| | </div> |
| |
|
| | <p>This ensures that layers identified as high-sensitivity by SWAN receive stronger regularisation pressure, concentrating the optimiser’s constraint budget where it matters most.</p> |
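<p>The KDS components of Section 3.1 can be sketched as follows. This is a minimal NumPy illustration with our own function names; in an actual training loop the estimator would be written in an autodiff framework so that gradients of the penalty flow back into the weights.</p>

```python
import numpy as np

def excess_kurtosis(w):
    """Eq. (2): normalised fourth central moment minus 3; zero for a Gaussian."""
    w = np.asarray(w, dtype=float).ravel()
    mu, sigma = w.mean(), w.std()
    return float(((w - mu) ** 4).mean() / (sigma ** 4 + 1e-24) - 3.0)

def kds_penalty(weights, sensitivities=None, kappa_target=2.0):
    """Eq. (3) when `sensitivities` is None, Eq. (4) otherwise: a one-sided
    penalty on excess kurtosis above the target ceiling, optionally scaled
    by per-layer SWAN sensitivity scores S_l."""
    if sensitivities is None:
        sensitivities = [1.0] * len(weights)   # Eq. (3): unweighted
    return sum(s * max(0.0, excess_kurtosis(w) - kappa_target)
               for s, w in zip(sensitivities, weights))
```

<p>A Gaussian-distributed layer incurs zero penalty, while a heavy-tailed layer is penalised in proportion to its excess kurtosis and its sensitivity score, matching the one-sided design of Eq. (3).</p>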
| |
|
| | <h3>3.2 Spectral Conditioning (SC)</h3> |
| |
|
| | <h4>3.2.1 Motivation</h4> |
| |
|
| | <p>The singular value decomposition of a weight matrix <strong>W</strong> = <strong>UΣV</strong><sup>T</sup> decomposes the linear transformation into a rotation, a scaling, and another rotation. The singular values {σ<sub>1</sub> ≥ σ<sub>2</sub> ≥ … ≥ σ<sub>r</sub>} represent the magnitude of each learned direction. When most energy is concentrated in the first few singular values—high spectral concentration—the matrix is effectively low-rank, and small perturbations to those dominant directions (as introduced by quantization rounding) cause disproportionate output errors.</p> |
| |
|
| | <p>Spectral norm regularisation is well-established in GAN training (Miyato et al. <span class="ref">[6]</span>) as a stability measure; there it prevents discriminator weight matrices from growing unbounded. In the SAT context, its role is different: we use it to maintain a well-distributed singular value spectrum, making each learned direction roughly equally important and therefore equally robust to precision reduction.</p> |
| |
|
| | <h4>3.2.2 Regularisation Term</h4> |
| |
|
| | <p>We define the spectral conditioning regularisation term as:</p> |
| |
|
| | <div class="equation"> |
| | R<sub>spectral</sub> = Σ<sub>l</sub> σ<sub>max</sub>(<strong>W</strong><sub>l</sub>) / ||<strong>W</strong><sub>l</sub>||<sub>F</sub> |
| |   (5) |
| | </div> |
| |
|
| | <p>where σ<sub>max</sub> is the largest singular value and ||<strong>W</strong>||<sub>F</sub> is the Frobenius norm. This ratio—the spectral concentration ratio—approaches 1/√r for a perfectly flat spectrum (r is the matrix rank) and approaches 1 as all energy concentrates in a single direction. Minimising this ratio encourages flat spectra.</p> |
| |
|
<p>Computing the exact maximum singular value at every step is expensive. We instead approximate σ<sub>max</sub> with power iteration in O(mn) time per layer, where m × n is the weight matrix dimension. Warm-starting the iteration vector from the previous step, as in standard spectral normalisation, means one or two iterations per step suffice and the approximation error is negligible.</p>
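<p>A NumPy sketch of the estimator (our naming; the warm-start argument <code>v</code> is an implementation assumption consistent with spectral-normalisation practice):</p>

```python
import numpy as np

def spectral_concentration(w, iters=2, v=None):
    """Eq. (5): sigma_max(W) / ||W||_F, with sigma_max estimated by a few
    power-iteration steps. `v` can warm-start from the previous training
    step, in which case one or two iterations suffice."""
    w = np.asarray(w, dtype=float)
    if v is None:
        v = np.random.default_rng(0).normal(size=w.shape[1])
    for _ in range(iters):                     # requires iters >= 1
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
    sigma_max = float(u @ (w @ v))             # Rayleigh-type estimate, never above the true value
    return sigma_max / float(np.linalg.norm(w)), v
```

<p>For a flat spectrum the ratio sits at 1/√r, and it climbs toward 1 as spectral energy concentrates in a single direction, exactly the behaviour the regulariser penalises.</p>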
| |
|
| | <h4>3.2.3 Relationship to Effective Rank</h4> |
| |
|
| | <p>The spectral concentration ratio is closely related to the effective rank, defined as exp(H(p)) where p is the probability distribution over normalised squared singular values and H is its entropy. SAT implicitly maximises effective rank by minimising spectral concentration, producing weight matrices that encode information in more dimensions and are therefore more robust to the information loss inherent in quantization.</p> |
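<p>The effective rank defined above is straightforward to compute directly; a short NumPy sketch (our naming) makes the definition concrete:</p>

```python
import numpy as np

def effective_rank(w):
    """exp(H(p)) with p_i = sigma_i^2 / sum_j sigma_j^2: the entropy-based
    effective rank that SAT implicitly maximises via spectral conditioning."""
    s = np.linalg.svd(np.asarray(w, dtype=float), compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 1e-12]                 # drop numerically-zero singular values
    return float(np.exp(-np.sum(p * np.log(p))))
```

<p>A matrix with a flat spectrum attains effective rank equal to its true rank, while a matrix dominated by one singular value collapses toward an effective rank of 1.</p>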
| |
|
| | <h3>3.3 Noise-Resilient Training via Targeted Quantization Noise Injection (TQNI)</h3> |
| |
|
| | <h4>3.3.1 Motivation</h4> |
| |
|
| | <p>Standard QAT injects fake quantization noise into every layer. This uniform treatment is inefficient: most layers are relatively insensitive to quantization and do not benefit from the hardening effect, while a few high-sensitivity layers require disproportionate attention. The SWAN output noise amplification metric identifies which layers amplify input perturbations; these are the layers for which quantization noise is most damaging.</p> |
| |
|
| | <p>TQNI uses the SWAN sensitivity profile to concentrate noise injection on identified high-risk layers, achieving the convergence benefits of QAT where they are needed without disrupting stable layers.</p> |
| |
|
| | <h4>3.3.2 Noise Injection Protocol</h4> |
| |
|
<p>During the forward pass, the weight matrix <strong>W</strong><sub>l</sub> of each flagged layer is perturbed by simulated quantization noise calibrated to the target bit-width <em>b</em>:</p>
| |
|
| | <div class="equation"> |
| | <strong>Ŵ</strong><sub>l</sub> = <strong>W</strong><sub>l</sub> + ε<sub>l</sub>,  ε<sub>l</sub> ~ Uniform(−Δ<sub>l</sub>/2, Δ<sub>l</sub>/2) |
| |   (6) |
| | </div> |
| |
|
| | <p>where Δ<sub>l</sub> = (max(<strong>W</strong><sub>l</sub>) − min(<strong>W</strong><sub>l</sub>)) / (2<sup>b</sup> − 1) is the quantization step size for layer l at bit-width b. The noise is injected only for layers whose SWAN noise amplification score A<sub>l</sub> exceeds a threshold θ<sub>noise</sub>:</p> |
| |
|
| | <div class="equation"> |
| | Apply TQNI to layer l iff A<sub>l</sub> > θ<sub>noise</sub> |
| |   (7) |
| | </div> |
| |
|
| | <p>The threshold θ<sub>noise</sub> is set adaptively based on the empirical distribution of amplification scores across layers, targeting the top-<em>k</em>% most sensitive layers at each diagnostic checkpoint. We find k = 20 (i.e., the top quintile of sensitive layers) to be a robust default.</p> |
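<p>The injection protocol of Eqs. (6)–(7) reduces to a few lines; this NumPy sketch uses our own function names and treats the amplification scores as given:</p>

```python
import numpy as np

def tqni_threshold(amp_scores, k=20):
    """Adaptive threshold theta_noise: flags the top-k% of layers by
    SWAN noise amplification score (Eq. 7); k = 20 is the default."""
    return float(np.percentile(np.asarray(amp_scores, dtype=float), 100 - k))

def tqni_perturb(w, amp_score, theta_noise, bits=4, rng=None):
    """Eq. (6): uniform noise at the quantization step size for bit-width b,
    applied only when the layer's amplification score exceeds the threshold."""
    if amp_score <= theta_noise:
        return w                                    # stable layer: untouched
    if rng is None:
        rng = np.random.default_rng()
    delta = (w.max() - w.min()) / (2 ** bits - 1)   # step size Delta_l
    return w + rng.uniform(-delta / 2, delta / 2, size=w.shape)
```

<p>Every perturbation is bounded by Δ<sub>l</sub>/2 in magnitude, so the injected noise matches the worst-case rounding error of a <em>b</em>-bit grid over the layer's weight range.</p>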
| |
|
| | <h4>3.3.3 Gradient Flow</h4> |
| |
|
<p>The injected perturbation ε<sub>l</sub> depends on <strong>W</strong><sub>l</sub> through the step size Δ<sub>l</sub>, whose max/min range statistics have sparse, unstable gradients. We use the Straight-Through Estimator (STE) to propagate gradients through the quantization noise, treating the forward-pass perturbation as an identity in the backward pass. This is identical to standard QAT practice and is well-supported theoretically for small noise magnitudes.</p>
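<p>To make the STE convention concrete (this is an illustration of the standard idiom, not the authors' code): in an autodiff framework one writes <code>w + (quantize(w) - w).detach()</code>, so the forward pass sees quantized weights while the backward pass sees the identity. Separated into its two halves in NumPy terms:</p>

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Forward pass: round-to-nearest uniform quantization onto a 2^b-level grid."""
    lo = w.min()
    delta = (w.max() - lo) / (2 ** bits - 1)
    return np.round((w - lo) / delta) * delta + lo

def ste_grad(grad_output):
    """Backward pass under STE: the rounding step is treated as the
    identity, so the incoming gradient passes through unchanged."""
    return grad_output
```

<p>The forward error is bounded by half a grid step, which is the regime in which the identity-gradient approximation is considered reliable.</p>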
| |
|
| | <h3>3.4 Dynamic Bit-Width Allocation (DBWA)</h3> |
| |
|
| | <h4>3.4.1 Overview</h4> |
| |
|
| | <p>The three mechanisms above improve quantization readiness of the final model. DBWA addresses the orthogonal objective of <em>training efficiency</em>: by running SAT’s sensitivity analysis periodically during training and adjusting each layer’s training precision accordingly, we can substantially reduce memory and compute usage without sacrificing model quality.</p> |
| |
|
| | <h4>3.4.2 The Diagnostic-Allocate-Protect Loop</h4> |
| |
|
| | <p>Every D training steps (we use D = 1000 as the default), the following procedure executes:</p> |
| |
|
| | <ol> |
<li><strong>Diagnose:</strong> Run a forward pass over a small calibration batch and compute the three SWAN metrics for every layer. This takes on the order of seconds even for large models, because the metrics require only basic linear algebra over the weight tensors.</li>
| | <li><strong>Score:</strong> Compute a composite sensitivity score S<sub>l</sub> for each layer as a weighted combination of normalised kurtosis, spectral concentration, and noise amplification scores.</li> |
| | <li><strong>Allocate:</strong> Assign a training precision b<sub>l</sub> to each layer based on S<sub>l</sub>. Layers in the bottom quartile of sensitivity (S<sub>l</sub> < τ<sub>low</sub>) are trained in 8-bit. Layers in the top quartile (S<sub>l</sub> > τ<sub>high</sub>) remain in 16-bit. The middle half are assigned 12-bit as a compromise tier.</li> |
| | <li><strong>Protect:</strong> Apply stronger KDS and SC regularisation to high-sensitivity layers by scaling their λ coefficients upward.</li> |
| | </ol> |
| |
|
| | <div class="equation"> |
| | b<sub>l</sub> = { 8-bit if S<sub>l</sub> < τ<sub>low</sub> ;  12-bit if τ<sub>low</sub> ≤ S<sub>l</sub> ≤ τ<sub>high</sub> ;  16-bit if S<sub>l</sub> > τ<sub>high</sub> } |
| |   (8) |
| | </div> |
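<p>Steps 2 and 3 of the loop can be sketched as follows; a NumPy illustration under our own naming, with equal metric weights assumed for the composite score since the paper leaves the weighting unspecified:</p>

```python
import numpy as np

def composite_score(kurt, spec, amp, weights=(1/3, 1/3, 1/3)):
    """Step 2: weighted combination of min-max-normalised SWAN metrics
    (kurtosis, spectral concentration, noise amplification) per layer."""
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    k, s, a = (np.asarray(m, dtype=float) for m in (kurt, spec, amp))
    return weights[0] * norm(k) + weights[1] * norm(s) + weights[2] * norm(a)

def dbwa_allocate(scores, tau_low=None, tau_high=None):
    """Step 3 / Eq. (8): map composite sensitivity scores S_l to per-layer
    training bit-widths; thresholds default to the 25th/75th percentiles."""
    scores = np.asarray(scores, dtype=float)
    tau_low = np.percentile(scores, 25) if tau_low is None else tau_low
    tau_high = np.percentile(scores, 75) if tau_high is None else tau_high
    bits = np.full(scores.shape, 12)        # middle half: compromise tier
    bits[scores < tau_low] = 8              # bottom quartile: cheap
    bits[scores > tau_high] = 16            # top quartile: protected
    return bits
```

<p>With the default percentile thresholds this reproduces the quartile/half/quartile split assumed in the memory analysis of Section 3.4.3.</p>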
| |
|
| | <h4>3.4.3 Memory and Compute Impact</h4> |
| |
|
| | <p>If 25% of layers are at 8-bit and 50% at 12-bit and 25% at 16-bit, and assuming uniform layer sizes, the weighted average precision is 0.25 × 8 + 0.50 × 12 + 0.25 × 16 = <strong>12 bits</strong>, compared with 16 bits for standard BF16 training. This represents a <strong>25% reduction</strong> in the average parameter memory footprint during training, with corresponding reductions in gradient and optimiser state memory. For large models where memory is the primary constraint, this reduction translates directly into the ability to train larger models or use larger batch sizes within the same hardware budget.</p> |
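<p>The arithmetic above is easy to verify directly:</p>

```python
# DBWA tier occupancy from Sec. 3.4.2: quartile / half / quartile.
tier_fraction = {8: 0.25, 12: 0.50, 16: 0.25}

# Weighted average precision across layers (assuming uniform layer sizes).
avg_bits = sum(bits * frac for bits, frac in tier_fraction.items())

# Relative saving versus uniform 16-bit (BF16) training.
saving_vs_bf16 = 1.0 - avg_bits / 16.0
```
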
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>4. Theoretical Analysis</h2> |
| |
|
| | <h3>4.1 Why Kurtosis Regularisation Does Not Hurt Expressiveness</h3> |
| |
|
| | <p>A natural concern is that constraining the kurtosis of weight distributions limits the model’s ability to represent complex functions. We argue this concern is unfounded for two reasons. First, the kurtosis penalty targets the tails of the weight distribution, not its variance or mean. A distribution with kurtosis near Gaussian (κ ≈ 0) can still have arbitrarily large standard deviation, encompassing a full range of weight magnitudes. Second, the threshold κ<sub>target</sub> is set above zero, permitting moderately heavy-tailed distributions. The penalty eliminates only extreme outliers—the tiny fraction of weights that cause disproportionate quantization damage—without restricting the bulk distribution.</p> |
| |
|
| | <p>This is analogous to the relationship between L2 weight decay and model capacity: L2 regularisation is known not to reduce model expressiveness in the function-space sense, only to prefer simpler solutions within that space. Kurtosis regularisation similarly prefers quantization-friendly solutions without restricting the space of representable functions.</p> |
| |
|
| | <h3>4.2 Spectral Conditioning and Generalisation</h3> |
| |
|
| | <p>Maintaining well-conditioned weight matrices is independently beneficial for training stability. Matrices with high spectral concentration (large σ<sub>max</sub> relative to the Frobenius norm) are poorly conditioned, amplifying gradient noise during backpropagation. This is the same phenomenon that motivated spectral normalisation in GAN training. The SC regulariser in SAT therefore provides two benefits simultaneously: it improves quantization robustness by distributing information across singular dimensions, and it stabilises gradient flow by bounding the spectral norm of weight updates.</p> |
| |
|
| | <p>There is also a connection to generalisation: results from the PAC-Bayes learning theory literature suggest that flatter weight minima (in a spectral sense) generalise better. SAT’s spectral conditioning may therefore yield generalisation improvements beyond the quantization-readiness objective.</p> |
| |
|
| | <h3>4.3 Convergence of SAT</h3> |
| |
|
<p>SAT introduces three additional terms to the training objective. The kurtosis and spectral regularisers are continuous and differentiable, and TQNI uses STE, so the surrogate gradient is well defined; standard stochastic-gradient convergence arguments then apply to the smoothed surrogate objective, with the usual caveat that STE gradients are biased. The regularisation terms are bounded (kurtosis and spectral concentration are both finite for bounded weight tensors), so they do not dominate the primary loss. The DBWA mechanism changes the effective learning rate in lower-precision layers due to reduced numerical precision in gradient computation; we compensate by scaling the learning rate of lower-precision layers upward by a factor proportional to the precision-reduction ratio.</p>
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>5. Comparison with Existing Training Paradigms</h2> |
| |
|
| | <table> |
| | <tr><th>Method</th><th>Approach</th><th>Quantization Outcome</th><th>Memory Efficiency</th></tr> |
| | <tr><td>Standard Pre-Training</td><td>Uniform precision (e.g., all BF16) throughout</td><td>Poor; outliers baked in from step one</td><td>Moderate</td></tr> |
| | <tr><td>Quantization-Aware Training (QAT)</td><td>Fake-quantize all layers uniformly during training</td><td>Good, but convergence is difficult and slow</td><td>Low (uniform FP16/BF16 overhead)</td></tr> |
| | <tr><td>SWAN-Guided Post-Training Quantization</td><td>Analyse trained model; selectively protect sensitive layers</td><td>Good, but limited by pre-existing outliers</td><td>High at inference</td></tr> |
| | <tr class="highlight-row"><td><strong>SAT (Proposed)</strong></td><td>Dynamic mixed-precision; kurtosis and spectral regularisation during training</td><td class="best">Optimal; outliers never emerge</td><td class="best">Optimal; precision follows sensitivity</td></tr> |
| | </table> |
| | <p class="caption">Table 1: Training paradigm comparison. SAT is the only method that simultaneously improves both quantized model quality and training efficiency.</p> |
| |
|
<p>The key distinction between SAT and QAT is <em>surgical precision</em>: QAT applies uniform fake-quantization pressure to all layers simultaneously, making convergence difficult and requiring careful tuning of fake-quantization parameters. SAT applies pressure proportional to sensitivity, protecting stable layers from unnecessary perturbation and concentrating noise-hardening where statistical analysis indicates it is needed. The key distinction between SAT and SWAN-guided PTQ is <em>causality</em>: PTQ corrects a problem after it exists; SAT prevents the problem from arising. The correction mechanisms applied in SWAN-guided PTQ (rotation, smoothing) are post-hoc approximations; SAT’s prevention is theoretically superior because it acts on the cause rather than the symptom.</p>
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>6. Implementation Considerations</h2> |
| |
|
| | <h3>6.1 Computational Overhead</h3> |
| |
|
<p>The three SAT regularisers introduce the following per-step overhead relative to standard training. The kurtosis estimator requires computing the fourth central moment of each weight tensor, which is O(n) in the number of parameters per layer—negligible compared to the O(n²) matrix operations in the forward and backward passes. The spectral concentration estimator using power iteration requires two matrix–vector products per layer per step, which is O(mn) per layer—far cheaper than the layer’s O(batch × mn) forward pass. This can be amortised further by running the spectral estimator every k steps rather than every step; we find k = 10 to be sufficient for stable regularisation. The TQNI operation adds noise during the forward pass only for flagged layers; this is a simple additive operation with negligible cost.</p>
| |
|
| | <p>The DBWA diagnostic checkpoint at every D steps requires a full forward pass over a calibration batch with SVD computation for all layers. For a model with L layers, this is O(L × mn × log(min(m,n))) using randomised SVD. At D = 1000 steps, this checkpoint adds approximately 0.1% overhead to total training time for a typical transformer architecture.</p> |
| |
|
| | <h3>6.2 Hyperparameter Guidance</h3> |
| |
|
| | <p>SAT introduces the following hyperparameters, with recommended defaults based on empirical exploration:</p> |
| |
|
| | <ul> |
| | <li><strong>κ<sub>target</sub></strong>: Target kurtosis ceiling. Recommended range: 1.5–2.5. Higher values are more permissive; values below 1.0 may over-constrain the weight distribution.</li> |
| | <li><strong>λ<sub>κ</sub></strong>: Global kurtosis regularisation coefficient. Recommended: 1e-4 to 1e-3. Start small and scale up if kurtosis grows during training.</li> |
| | <li><strong>λ<sub>σ</sub></strong>: Global spectral regularisation coefficient. Recommended: 1e-4. Equivalent to standard spectral normalisation strength in GAN literature.</li> |
| | <li><strong>θ<sub>noise</sub></strong>: Noise injection threshold percentile. Recommended: top 20% of sensitivity scores (k = 20).</li> |
| | <li><strong>D</strong>: Diagnostic checkpoint interval. Recommended: 1000 steps. Reduce for small models or fast-moving sensitivity profiles.</li> |
| | <li><strong>τ<sub>low</sub>, τ<sub>high</sub></strong>: DBWA precision tier thresholds. Recommended: 25th and 75th percentile of sensitivity scores.</li> |
| | </ul> |
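<p>For reference, the defaults above can be collected into a single configuration object; field names here are our own, chosen for illustration:</p>

```python
from dataclasses import dataclass

@dataclass
class SATConfig:
    """Recommended SAT defaults from Sec. 6.2 (field names are illustrative)."""
    kappa_target: float = 2.0         # target kurtosis ceiling (range 1.5-2.5)
    lambda_kappa: float = 1e-4        # global kurtosis coefficient (1e-4 to 1e-3)
    lambda_sigma: float = 1e-4        # global spectral coefficient
    noise_top_k_pct: float = 20.0     # TQNI: top-k% most sensitive layers
    diag_interval_steps: int = 1000   # DBWA diagnostic checkpoint interval D
    tau_low_pct: float = 25.0         # DBWA tier thresholds, as percentiles
    tau_high_pct: float = 75.0        # of the sensitivity-score distribution
```

<p>Any field can be overridden per run, e.g. <code>SATConfig(lambda_kappa=1e-3)</code> when kurtosis keeps growing under the default strength.</p>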
| |
|
| | <h3>6.3 Integration with Existing Optimisers</h3> |
| |
|
| | <p>SAT is optimiser-agnostic. The regularisation terms are added to the loss before backpropagation, so they produce standard gradients that any first- or second-order optimiser can process. We note a particular synergy with Muon <span class="ref">[7]</span>, which orthogonalises gradient updates and has demonstrated strong quantization properties in recent benchmarks. Muon’s inherent tendency to produce orthogonal weight updates naturally complements the spectral conditioning regulariser, potentially reducing the required λ<sub>σ</sub> strength.</p> |
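| | <p>To make the optimiser-agnostic claim concrete, here is a sketch of how the regularisers compose into a single scalar objective. The hinge form of the kurtosis penalty and the helper computations are our assumptions, not the paper's exact formulation:</p>

```python
import numpy as np

def sat_objective(task_loss: float, weights: list,
                  lam_kappa: float = 1e-4, lam_sigma: float = 1e-4,
                  kappa_target: float = 2.0) -> float:
    """Compose the SAT objective: regularisers are plain scalar loss
    terms, so any optimiser that consumes loss gradients works unchanged."""
    reg = 0.0
    for W in weights:
        c = W.ravel() - W.mean()
        kurt = np.mean(c ** 4) / (np.mean(c ** 2) ** 2 + 1e-12) - 3.0
        reg += lam_kappa * max(0.0, kurt - kappa_target)  # penalise only above ceiling
        reg += lam_sigma * np.linalg.norm(W, 2)           # sigma_max(W) penalty
    return task_loss + reg
```

| | <p>Because the result is an ordinary scalar, an autodiff framework differentiates through the penalty terms exactly as it does the task loss, regardless of which optimiser consumes the gradients.</p>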
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>7. Discussion</h2> |
| |
|
| | <h4>The shift from post-hoc diagnosis to causal prevention.</h4> |
| | <p>The history of quantization research can be read as a progressive move toward earlier intervention. Early PTQ methods accepted the trained model’s geometry entirely and tried to minimise rounding error given that geometry. GPTQ introduced weight adjustment post-training. QAT moved intervention into the training loop but treated it as a uniform perturbation. SAT takes the next logical step: targeted, statistically guided intervention that shapes the geometry as it emerges, preventing pathological distributions from forming rather than correcting them afterward.</p>
| |
|
| | <p>This causal perspective has implications beyond LLM quantization. The SWAN metrics—kurtosis, spectral concentration, noise amplification—are general measures of weight distribution quality. SAT’s framework could be applied to any domain where model compression is a deployment objective, including computer vision, speech, and scientific computing.</p> |
| |
|
| | <h4>Towards autonomous precision management.</h4> |
| | <p>DBWA points toward a future of fully autonomous precision management: an optimiser that continuously monitors the statistical geometry of its own weight updates and allocates numerical resources accordingly. The current SAT proposal is a practical first step—periodic checkpointing with rule-based allocation—but the underlying principle naturally extends to continuous monitoring and gradient-level precision decisions.</p> |
| |
|
| | <p>One can envision an optimiser that operates in a learned representation of the sensitivity space, predicting which layers will become high-sensitivity in future steps and pre-emptively allocating protection. This would transform quantization readiness from a constraint on the final model into a trajectory property of the optimisation path—a fundamentally different and more powerful framing.</p> |
| |
|
| | <h4>Limitations and open questions.</h4> |
| | <p>Several important questions remain open. First, the interaction between kurtosis regularisation and the emergence of polysemantic or outlier-specialised neurons in transformer models is not yet understood; it is possible that some degree of outlier activation is mechanistically linked to important representational phenomena. Second, the appropriate values of the sensitivity thresholds (τ<sub>low</sub>, τ<sub>high</sub>, θ<sub>noise</sub>) likely depend on the target quantization bit-width and on model architecture, and systematic exploration of this dependency is needed. Third, the computational overhead of SAT, while modest, must be measured carefully at the scale of frontier-model training, where even small fractional overheads translate into large absolute costs.</p>
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>8. Conclusion</h2> |
| |
|
| | <p>We have presented Sensitivity-Aware Training (SAT), a training framework that extends the SWAN diagnostic philosophy into an active training paradigm. Rather than measuring quantization sensitivity after training and applying corrective transformations post-hoc, SAT embeds three complementary sensitivity-management mechanisms directly into the training loop: kurtosis regularisation that prevents outlier weight emergence, spectral conditioning that maintains well-distributed singular-value spectra, and targeted quantization noise injection that hardens only statistically high-risk layers. A Dynamic Bit-Width Allocation mechanism uses periodic SWAN analysis to assign per-layer training precision, reducing memory and compute waste on low-sensitivity layers while protecting high-sensitivity ones.</p> |
| |
|
| | <p>The core insight is simple but consequential: <em>the best time to prepare a model for quantization is while it is being trained</em>. The SWAN framework provides exactly the statistical vocabulary needed to make this preparation precise and adaptive. SAT demonstrates that sensitivity analysis need not be a post-mortem; it can be a training signal.</p> |
| |
|
| | <p>The immediate research agenda is clear: implement and benchmark SAT against standard pre-training and QAT baselines at multiple model scales, validate the theoretical claims about kurtosis and spectral conditioning empirically, and develop the autonomous precision management vision into a production-ready system. The longer-term agenda is more ambitious: a generation of LLMs trained from scratch with quantization geometry as a first-class objective, requiring no post-training correction and deployable natively at 4-bit or lower precision without performance compromise.</p> |
| |
|
| | <hr class="section-divider"> |
| |
|
| | |
| | <h2>References</h2> |
| |
|
| | <p class="bib">[1] M. S. Akhondzadeh, A. Bojchevski, E. Eleftheriou, and M. Dazzi. "KurTail: Kurtosis-based LLM Quantization." <em>arXiv:2503.01483</em>, 2025.</p> |
| | <p class="bib">[2] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." <em>NeurIPS</em>, 35, 30318–30332, 2022.</p> |
| | <p class="bib">[3] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." In <em>ICLR</em>, 2023.</p> |
| | <p class="bib">[4] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." In <em>MLSys</em>, 2024.</p> |
| | <p class="bib">[5] S. Liu et al. "SpinQuant: LLM Quantization with Learned Rotations." In <em>ICLR</em>, 2025.</p> |
| | <p class="bib">[6] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. "Spectral Normalization for Generative Adversarial Networks." In <em>ICLR</em>, 2018.</p> |
| | <p class="bib">[7] A. Panferov et al. "A Study of Optimisers Under Quantization." <em>OpenReview preprint</em>, 2025.</p> |
| | <p class="bib">[8] A. Roy et al. "Towards Superior Quantization Accuracy: A Layer-sensitive Approach." <em>arXiv:2503.06518</em>, 2025.</p> |
| | <p class="bib">[9] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." In <em>ICML</em>, 2023.</p> |
| | <p class="bib">[10] S. Ashkboos, I. Markov, E. Frantar, T. Zhong, X. Wang, J. Ren, T. Hoefler, and D. Alistarh. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." <em>NeurIPS</em>, 37, 2024.</p> |
| |
|
| | <hr class="section-divider"> |
| |
|
| | <div class="footnote"> |
| | <p>© 2026 <a href="https://baa.ai">baa.ai</a>. All rights reserved. Licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0</a>.</p> |
| | <p>Generated from SAT research data. Last updated: February 2026.</p> |
| | </div> |
| |
|
| | </body> |
| | </html> |
| |
|