What Can You See With a Spectral Microscope?
Gradience Series, Post 5
Gradience was built for practical questions: is my adapter over-provisioned? Will this merge work? The telemetry it collects — Hessian eigenvalues, representation geometry, gradient alignment — was designed to answer those questions.
But telemetry has a way of being more interesting than you planned. When you record the top Hessian eigenvalue, its trace, the gradient-Hessian-gradient product, participation ratio, anisotropy, and CKA similarity at regular intervals throughout a training run, you end up with a dataset that invites questions the toolkit wasn't built to ask. Questions from dynamical systems theory, statistical physics, and nonlinear time-series analysis.
This post is about what happens when you point those methods at a single Mistral-7B LoRA fine-tuning run. We're not making claims. We're showing what becomes visible — and what the next experiments would need to be.
The Data
One training run. Mistral-7B, LoRA rank 64, fine-tuned on GSM8K. Two telemetry streams:
Hessian stream (telemetry.jsonl): 601 measurements from step 1 to 60,000. Each record contains the top Hessian eigenvalue (λ₁), its Hutchinson trace estimate (trace_H), and the gradient-Hessian-gradient product (gHg). These are second-order quantities — they describe the curvature of the loss landscape, not just its height.
Representation stream (telemetry10.csv): 1,200 measurements from step 100 to 120,000. Participation ratio (effective dimensionality of activations), anisotropy (directional concentration), and CKA similarity (how much the representation has changed from its initial state).
Two windows on the same training process, measured at different resolutions and over slightly different spans. Between them, enough data for four analyses.
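To make the streams concrete, here is a minimal loader sketch for the Hessian stream. The field names (`step`, `lambda1`, `trace_H`, `gHg`) are illustrative assumptions, not a guaranteed schema; check the keys your telemetry.jsonl actually contains.

```python
import io
import json
import numpy as np

def load_hessian_stream(fh):
    """Turn a JSONL telemetry stream into one numpy array per field."""
    records = [json.loads(line) for line in fh if line.strip()]
    return {key: np.array([r[key] for r in records]) for key in records[0]}

# Two toy records standing in for telemetry.jsonl; field names are
# illustrative and may not match the real schema.
toy = io.StringIO(
    '{"step": 1, "lambda1": 0.8, "trace_H": 12.0, "gHg": 0.7}\n'
    '{"step": 100, "lambda1": 1.1, "trace_H": 11.5, "gHg": 1.0}\n'
)
stream = load_hessian_stream(toy)
print(stream["lambda1"])  # -> [0.8 1.1]
```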
Vignette 1: The Three Acts of Gradient Alignment
The first question is geometric: how aligned is the gradient with the dominant curvature direction?
The Rayleigh quotient Rq = gᵀHg / (gᵀg · λ₁) measures this. When Rq ≈ 1, the gradient points along the top eigenvector of the Hessian — the optimizer is climbing (or descending) the steepest curvature direction. When Rq ≈ 0, the gradient is orthogonal to it.
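A toy sketch of the quantity itself. The `rayleigh_alignment` helper is hypothetical; in practice gHg and λ₁ come straight from the logged telemetry fields, so Rq is one division per record.

```python
import numpy as np

def rayleigh_alignment(g, H):
    """Rq = (g.T H g) / (g.T g * lambda_1): alignment of the gradient
    with the top Hessian eigenvector. Rq ~ 1 means g rides the dominant
    curvature direction; Rq ~ 0 means g is orthogonal to it."""
    lam1 = np.linalg.eigvalsh(H)[-1]       # eigvalsh sorts ascending
    return float(g @ H @ g) / (float(g @ g) * lam1)

H = np.diag([4.0, 1.0, 0.25])              # toy Hessian, lambda_1 = 4
g_aligned = np.array([1.0, 0.0, 0.0])      # along the top eigenvector
g_ortho = np.array([0.0, 0.0, 1.0])        # orthogonal to it
print(rayleigh_alignment(g_aligned, H))    # -> 1.0
print(rayleigh_alignment(g_ortho, H))      # -> 0.0625
```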
Over 60,000 steps, the mean alignment is 0.90. The gradient overwhelmingly tracks the top curvature direction. But the temporal structure is what's interesting:
Act I (steps 1–12,000): Mean alignment 0.79, low variance. The optimizer is exploring — alignment is high but not maximal, and it's stable. The landscape is being surveyed.
Act II (steps 12,000–36,000): Mean alignment jumps to 1.06. The exact Rayleigh quotient is bounded above by 1; values over 1.0 appear because λ₁ and gHg are stochastic estimates, and their noise doesn't cancel in the ratio. The optimizer has locked onto the dominant curvature direction. This is the edge-of-stability regime: SGD self-tunes so that λ₁ ≈ 2/η, and the gradient aligns with that direction to ride the stability boundary.
Act III (steps 36,000–60,000): Mean alignment drops back to 0.78, but variance increases (rolling standard deviation trend ρ = 0.92, p ≈ 0). The optimizer is destabilizing — not catastrophically, but the neat alignment of Act II has broken down. The gradient is exploring multiple curvature directions again, but less coherently than in Act I.
This three-act structure — explore, lock on, destabilize — is consistent with the edge-of-stability literature (Cohen et al., 2021), and seeing it play out in the Rayleigh quotient of a LoRA fine-tuning run suggests these dynamics aren't confined to toy models. The Hessian telemetry makes the acts directly visible.
What you'd need to know if it's real: The same analysis across multiple seeds and regimes. If Act II is shorter in pathological runs (high learning rate, excessive weight decay), that's a diagnostic signal. If the three-act structure is absent in some runs, that's also informative.
Vignette 2: How Representations Compress
Participation ratio (PR) measures the effective dimensionality of a layer's activations. High PR means the representation uses many directions roughly equally; low PR means it's concentrated in a few.
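A sketch of the standard estimator, assuming PR is computed from the eigenvalues of the activation covariance; Gradience's exact estimator may differ in details such as centering or which activations it samples.

```python
import numpy as np

def participation_ratio(acts):
    """PR = (sum eig)^2 / sum(eig^2) over activation-covariance
    eigenvalues: equals d when all d directions carry equal variance,
    near 1 when a single direction dominates."""
    centered = acts - acts.mean(axis=0)
    eig = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eig = np.clip(eig, 0.0, None)          # guard tiny negative eigenvalues
    return float(eig.sum() ** 2 / (eig ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(2000, 8))                       # isotropic: PR close to 8
spiky = rng.normal(size=(2000, 1)) @ np.ones((1, 8))   # rank-1: PR close to 1
print(round(participation_ratio(iso), 1), round(participation_ratio(spiky), 1))
```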
Over 120,000 steps, PR follows a clear trajectory:
Expansion phase (steps 100–24,000): PR climbs from 41 to the mid-60s (mean 66.6). The network is learning to spread information across many representational dimensions. The trend slope is positive.
Compression phase (steps 24,000–72,000): PR reverses, declining with a negative slope four times steeper than the expansion rate. The representation is consolidating — the network has found the directions that matter and is concentrating there.
Plateau (steps 72,000–120,000): PR stabilises around 64, compressing only slowly.
This expand-then-compress pattern is predicted by the information bottleneck theory (Shwartz-Ziv and Tishby, 2017): neural networks first increase mutual information with inputs (expansion), then compress to retain only information relevant to outputs. The PR trajectory is a direct geometric signature of that process.
The coupling is tight: PR and anisotropy are inversely correlated at ρ = −0.74 (p ≈ 10⁻²⁰⁸). As the representation becomes lower-dimensional (PR drops), it becomes more directionally concentrated (anisotropy rises). This isn't surprising — they're measuring related aspects of the same thing — but the strength of the coupling (−0.74, not −0.3 or −0.5) shows that the two metrics are capturing nearly the same underlying process from different angles.
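That coupling is cheap to measure. A numpy-only rank-correlation sketch (no tie correction; `scipy.stats.spearmanr` is the usual tool, and the PR and anisotropy series below are stand-ins, not the real telemetry):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation as Pearson correlation of ranks
    (numpy-only, assumes no ties)."""
    ranks = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Any monotone inverse relation scores rho = -1, however nonlinear:
pr = np.linspace(40.0, 67.0, 200)   # stand-in for a PR series
aniso = 1.0 / pr                    # perfectly inverse anisotropy
print(round(spearman_rho(pr, aniso), 6))  # -> -1.0
```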
What you'd need to know if it's real: Does the compression onset (step ~24,000) coincide with the training loss elbow? If so, the information bottleneck interpretation holds straightforwardly. If the compression starts before or after the loss elbow, something more interesting is happening.
Vignette 3: Power-Law Convergence in Representation Similarity
CKA (Centered Kernel Alignment) tracks representational similarity across training. In this stream it starts near 0.43 and converges toward 1.0: the representation changes rapidly at first, then settles. How it converges turns out to be diagnostic.
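For reference, the linear form of CKA (Kornblith et al., 2019) fits in a few lines. The reference matrix here stands in for whatever snapshot the telemetry compares against; this is a sketch of the standard definition, not Gradience's exact implementation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices (samples x features):
    ||Y.T X||_F^2 / (||X.T X||_F * ||Y.T Y||_F) on centered columns.
    1 = same geometry (up to rotation/scale), 0 = unrelated."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random rotation
print(round(linear_cka(X, X @ Q), 3))           # rotation-invariant: 1.0
```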
Two models for the residual distance from full similarity: exponential convergence (1 − CKA ~ e^(−t/τ)) and power-law convergence (1 − CKA ~ t^α, with α < 0). Exponential convergence would mean a fixed timescale — the representation relaxes to its final state at a constant rate. Power-law convergence would mean the timescale itself changes — early convergence is fast, late convergence is slow, with no characteristic time constant.
The data fit: power-law α = −0.40 (R² = 0.29) vs. exponential τ < 0 (R² = 0.09, negative time constant — the exponential model doesn't just fit poorly, it fits nonsensically).
Neither fit is strong in absolute terms. The R² of 0.29 for the power law means it explains less than a third of the variance. But the contrast matters: the exponential model actively fails, while the power law captures the qualitative shape. CKA convergence decelerates — the early rate is roughly 12x the late rate — and that deceleration is better described by a power law than by any single-timescale process.
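The model comparison can be sketched as follows, assuming both fits are done on the residual 1 − CKA in log space (the repository scripts may set this up differently; `compare_convergence_fits` is a hypothetical helper):

```python
import numpy as np

def compare_convergence_fits(t, cka):
    """Fit the residual 1 - CKA two ways and compare R^2:
    power law:   log(1 - CKA) ~ alpha * log(t) + c   (no fixed timescale)
    exponential: log(1 - CKA) ~ -(1/tau) * t + c     (single timescale tau)"""
    y = np.log(1.0 - cka)

    def linfit(x):
        slope, icpt = np.polyfit(x, y, 1)
        resid = y - (slope * x + icpt)
        return slope, 1.0 - resid.var() / y.var()

    alpha, r2_pow = linfit(np.log(t))
    neg_rate, r2_exp = linfit(t)
    return alpha, r2_pow, -1.0 / neg_rate, r2_exp

# Synthetic check: a pure power-law residual (starting at CKA = 0.43,
# like the observed curve) is recovered exactly by the log-log fit.
t = np.arange(100.0, 120001.0, 100.0)
cka = 1.0 - 0.57 * (t / 100.0) ** -0.4
alpha, r2_pow, tau, r2_exp = compare_convergence_fits(t, cka)
print(round(alpha, 2), round(r2_pow, 3))  # -> -0.4 1.0
```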
Power-law relaxation is the hallmark of systems near a critical point, where the absence of a characteristic scale produces scale-free dynamics. In the context of representation learning, it would mean the network doesn't have a single "learning timescale" — instead, different aspects of the representation converge at different rates, producing the aggregate power-law signature.
What you'd need to know if it's real: The exponent α across runs. If it's universal (same α regardless of learning rate, weight decay, task), that would be strong evidence for criticality. If it varies with hyperparameters, it's a tunable property of the optimiser rather than a phase of matter.
Vignette 4: Long-Range Memory in the Loss Landscape
Detrended Fluctuation Analysis (DFA) asks: does a time series have long-range temporal correlations? The DFA exponent α classifies the dynamics. α = 0.5 is white noise (no memory). α = 1.0 is 1/f noise (scale-free correlations). α = 1.5 is Brownian motion (integrated white noise, a random walk). Values between 0.5 and 1.0 indicate genuine long-range correlations — the state of the system at one time carries information about its state far in the future, beyond what short-range autocorrelation would produce.
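A minimal DFA sketch with linear detrending and log-spaced scales; the repository's criticality_analysis.py may use different scale ranges or detrending order.

```python
import numpy as np

def dfa_exponent(x):
    """DFA with linear detrending: integrate the mean-removed series,
    take the RMS residual around a per-window linear fit at each scale,
    and read alpha off the log-log slope of F(s) vs s."""
    y = np.cumsum(x - np.mean(x))          # integrated profile
    n = len(y)
    scales = np.unique(np.logspace(np.log10(8), np.log10(n // 4), 12).astype(int))
    flucts = []
    for s in scales:
        windows = y[: (n // s) * s].reshape(-1, s)
        t = np.arange(s)
        sq = [np.mean((w - np.polyval(np.polyfit(t, w, 1), t)) ** 2)
              for w in windows]
        flucts.append(np.sqrt(np.mean(sq)))
    alpha, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return float(alpha)

rng = np.random.default_rng(2)
white = rng.normal(size=4000)
print(round(dfa_exponent(white), 2))             # white noise: alpha near 0.5
print(round(dfa_exponent(np.cumsum(white)), 2))  # random walk: alpha near 1.5
```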
We computed DFA for several Hessian quantities. The results stratify cleanly:
| Metric | DFA exponent | Interpretation |
|---|---|---|
| λ₁ (top eigenvalue) | 0.97 | Near-1/f: scale-free curvature fluctuations |
| gHg (gradient-Hessian product) | 0.92 | Near-1/f |
| Train loss | 1.01 | 1/f noise |
| λ₁ / trace_H (order parameter) | 0.68 | Long-range correlated |
| trace_H (total curvature) | 1.24 | Superdiffusive drift |
The raw Hessian quantities (λ₁, gHg) fluctuate with near-1/f dynamics — their power spectra are approximately 1/f, meaning fluctuations at every timescale carry roughly equal energy. This is the signature of systems poised between order and disorder.
But the order parameter — the ratio λ₁/trace_H, which measures how concentrated the curvature is in the top direction — has a lower exponent (0.68). This is still well above white noise (0.5), indicating genuine long-range memory, but it's subcritical relative to the raw metrics. The normalisation by trace_H removes the drift component and reveals the underlying correlation structure: long-range, but not scale-free.
The separation between trace_H (1.24, superdiffusive) and the order parameter (0.68, subcritical) is itself informative. It means the total curvature has a strong trending component — it drifts — while the shape of the curvature (how concentrated it is) fluctuates with long-range memory but without drift. The geometry of the loss landscape has two kinds of dynamics happening simultaneously: a slow secular change in scale and a correlated but stationary fluctuation in structure.
What you'd need to know if it's real: DFA exponents across regimes. If pathological runs (high learning rate) show α closer to 0.5 (memoryless), that would mean healthy training is specifically characterised by long-range correlations in Hessian structure. If all runs show similar exponents regardless of outcome, the correlations are a property of SGD itself rather than of successful learning.
What These Analyses Have in Common
Each vignette borrows a technique from a different field — Rayleigh quotient analysis from matrix perturbation theory, participation ratio from condensed matter physics, power-law fitting from critical phenomena, DFA from fractal time-series analysis. None of them were designed for neural network training dynamics. All of them found structure.
That's the case for a spectral microscope. The telemetry Gradience collects for practical purposes — auditing rank utilisation, predicting compression, assessing merge compatibility — turns out to contain information about training dynamics that standard monitoring misses entirely. Loss curves can't show you three-act gradient alignment, or expand-then-compress representation dynamics, or power-law convergence, or long-range Hessian correlations. These phenomena live in the geometry, and you need geometric instrumentation to see them.
We're not claiming these are discoveries. They're observations from a single run, at a single scale, on a single task. Each vignette ends with the same caveat: here's what you'd need to do to know if the pattern is real. That's the honest state of things.
But we think the observations are interesting enough to report, and the toolkit is general enough that anyone with a LoRA training run and a few minutes can reproduce them. The analyses above required no new data collection — just different questions asked of the same telemetry.
Try It
The exploratory analysis scripts are in reanalysis/exploratory_dynamics.py and reanalysis/criticality_analysis.py in the Gradience II repository. They read from the same telemetry.jsonl and telemetry10.csv files that Gradience's standard audit pipeline produces.
pip install gradience
gradience audit --peft-dir ./your-adapter --suggest-ranks
# then, for the exploratory analyses:
python reanalysis/exploratory_dynamics.py
python reanalysis/criticality_analysis.py
Code: github.com/johntnanney/gradience
Documentation: THEORY.md and METRICS_GUIDE.md in the repo
License: Apache 2.0
This is Post 5 in the Gradience series. Previous posts: [Post 1: A Flight Recorder for LoRA Fine-Tuning], [Post 2: Loss Tells You How Well You're Fitting. Geometry Tells You What Kind of Fitting You're Doing], [Post 3: Before You Merge: Subspace Overlap Predicts Adapter Dominance], [Post 4: What We Got Wrong About Geometry vs. Loss].