What We Got Wrong About Geometry vs. Loss
Gradience Series, Post 4
In earlier posts in this series, we reported that geometric features achieve 100% regime classification accuracy, compared to 65% for loss-based features. That number was the centrepiece of our argument that geometry tells you something loss can't.
We went back and audited the claim. The result: the 100% figure doesn't hold up on the data we have. The actual picture is more interesting — and more honest — than the original framing.
What We Did
We wrote a formal reanalysis protocol and ran it against every dataset in the Gradience repository. Its analysis modules cover permutation testing, feature ablation, bootstrap confidence intervals, information-theoretic analysis, time-series changepoint detection, phase transition signatures, and cross-strand integration between our Hessian telemetry and representation geometry streams.
Everything was scoped to existing data. No new experiments, no new training runs. Just the numbers we already had, examined more carefully.
The full protocol and all scripts are public in the repository under reanalysis/.
What We Found
The headline numbers are wrong
On the five-class regime classification problem (baseline, high learning rate, high weight decay, low learning rate, low weight decay; 15 runs total, 3 seeds per regime), geometry achieves 66.7% accuracy via Leave-One-Seed-Out cross-validation. Loss achieves 40%. Both are significantly above chance (permutation p < 0.001 for each).
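Those p-values come from a label-permutation null: shuffle the regime labels, re-score, and ask how often chance does as well as the real labels. A minimal sketch of the idea — the `score_fn` interface and the toy parity data below are hypothetical illustrations, not the actual Gradience evaluation code:

```python
import random

def permutation_p(score_fn, X, y, n_perm=999, seed=0):
    """One-sided permutation p-value for a classification score.

    Under H0 the labels carry no information about X, so we shuffle
    them and count how often a shuffled score matches or beats the
    observed one. The +1 correction keeps p strictly positive.
    """
    rng = random.Random(seed)
    observed = score_fn(X, y)
    hits = sum(
        score_fn(X, rng.sample(y, len(y))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Toy example: a "classifier" that predicts the parity of x.
X = list(range(10))
y = [x % 2 for x in X]
accuracy = lambda X, y: sum(x % 2 == yi for x, yi in zip(X, y)) / len(y)
```

With the true labels the parity rule scores 1.0 and shuffled labels rarely do, so the p-value comes out small; for an uninformative score it hovers near 1.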
The gap between them — geometry's 27-point advantage — is not statistically significant by McNemar's test (p = 0.289). With only 15 samples, we don't have the power to distinguish them.
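McNemar's test compares the two classifiers only on the discordant runs — those one feature set gets right and the other gets wrong. A minimal sketch of the exact binomial form; the example counts in the comment are hypothetical, chosen only to show how little power eight discordant runs buy:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pair counts.

    b: samples classifier A got right and B got wrong; c: the reverse.
    Under H0 each discordant sample is a fair coin, so the smaller
    count follows Binomial(b + c, 0.5); we double the tail and cap at 1.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical example: 6 discordant runs favouring geometry,
# 2 favouring loss -> p ~ 0.289, far from significance.
```

With eight discordant runs, only an 8–0 split would reach p < 0.05; even 7–1 gives p ≈ 0.07. That is what "no power at n = 15" means in practice.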
The bootstrap 95% confidence interval for geometry's accuracy is [20%, 73%]. That interval is essentially uninformative. It includes chance-level performance. This is what n=15 gets you.
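The interval itself is a percentile bootstrap over the per-run outcomes. A minimal sketch, assuming outcomes are stored as a 0/1 list — the 10-of-15 vector in the test is a hypothetical stand-in matching the reported 66.7%, not the actual fold results:

```python
import random

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over per-sample 0/1 outcomes.

    Resample the outcome vector with replacement, recompute accuracy,
    and read off the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = accs[int(n_boot * alpha / 2)]
    hi = accs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

At n = 15 the resampled accuracies swing wildly from draw to draw, which is exactly why the reported interval stretches from chance level to 73%.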
The 100% figure probably comes from a different problem
We found one dataset where geometry does achieve perfect classification: the binary problem of distinguishing baseline from low-weight-decay runs. The out-of-fold predictions show 100% accuracy with margins of 0.81 to 0.99. That's a legitimate result, but it's a two-class problem with two regimes that differ only in whether weight decay is on or off. It's a much weaker claim than "geometry achieves 100% on regime classification."
But geometry really does carry more information
This is the part that survives scrutiny. Information-theoretic analysis shows geometric features carry 7.4 times more mutual information about training regimes than loss does. All six geometric features are individually significant (p < 0.03). They're slightly synergistic rather than redundant — the whole is genuinely greater than the sum of the parts.
The conditional entropy analysis tells the same story more conservatively: geometry reduces regime uncertainty by 15%, loss by 10%. The ratio is 1.5x, not 7.4x. The truth is somewhere in between, depending on how you measure.
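Both views reduce to the same quantity: I(feature; regime) = H(regime) − H(regime | feature). The comparison can be reproduced in miniature with the plug-in estimator on discretised features (the real analysis presumably bins continuous features first; the arrays in the test are toy data, not Gradience measurements):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples.

    Builds empirical marginals and the joint, then sums
    p(x,y) * log2(p(x,y) / (p(x) p(y))) over observed pairs.
    """
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )
```

Ratios like 7.4x come from comparing this quantity across feature sets; the conditional-entropy framing normalises the same number by H(regime), which is why it gives a more conservative 1.5x.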
Parsimony favours loss
Here's the uncomfortable finding. Geometry uses 6 features to classify 5 regimes with 15 samples. That's 35 parameters in a softmax classifier. Loss uses 1 feature. By BIC, loss wins. By AIC, loss wins. The geometric model achieves lower negative log-likelihood (it fits better), but the complexity penalty is devastating at this sample size.
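The parsimony argument is just information-criterion arithmetic. A sketch with hypothetical negative log-likelihoods — `nll_geom` and `nll_loss` are invented for illustration; only the parameter counts follow the post's 6-feature and 1-feature softmax models:

```python
from math import log

def aic(nll, k):
    """AIC = 2k + 2*NLL (nll is the total negative log-likelihood)."""
    return 2 * k + 2 * nll

def bic(nll, k, n):
    """BIC = k*ln(n) + 2*NLL; the ln(n) penalty bites hard at n = 15."""
    return k * log(n) + 2 * nll

n = 15
k_geom = 6 * 5 + 5   # 6 features x 5 regimes + biases = 35 parameters
k_loss = 1 * 5 + 5   # 1 feature  x 5 regimes + biases = 10 parameters
nll_geom, nll_loss = 2.0, 12.0   # hypothetical: geometry fits better
```

Even with a much better fit, geometry's 25 extra parameters cost 25 × ln(15) ≈ 68 BIC points, which a 10-nat likelihood advantage (20 points) cannot repay. More seeds shrink the per-parameter penalty relative to the accumulated likelihood, which is why the equation flips with more data.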
This doesn't mean geometry is useless. It means we can't demonstrate its superiority with the data we have. More seeds would change this equation entirely.
Geometry detects transitions earlier
In the time-series telemetry data, geometric metrics (Hessian trace, top eigenvalue) detect regime transitions about 300 steps before loss metrics do. The Hessian trace is the single earliest-responding quantity we measured. This supports the qualitative claim that geometry "sees" things before loss does — but in a single run, not across regimes.
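A trailing-window detector is enough to illustrate how a lead time is measured (this is a simplified stand-in for whatever changepoint method the protocol actually uses; the synthetic series and the 4-sigma threshold are assumptions):

```python
from statistics import mean, stdev

def first_shift(series, window=50, z=4.0):
    """Index of the first point more than z trailing standard
    deviations away from the trailing-window mean, or None."""
    for t in range(window, len(series)):
        ref = series[t - window:t]
        m, s = mean(ref), stdev(ref)
        if s > 0 and abs(series[t] - m) > z * s:
            return t
    return None

# Synthetic telemetry: the geometric metric jumps 30 steps
# before the loss metric does.
geom = [0.01 * (-1) ** i for i in range(100)] + [1.0] * 50
loss = [0.01 * (-1) ** i for i in range(130)] + [1.0] * 20
```

The lead time is simply the difference between the two detection indices — the same arithmetic behind the ~300-step figure, applied to the real Hessian-trace and loss streams.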
The two geometries are coupled but not redundant
Our canonical correlation analysis between Hessian-space metrics (λ₁, trace, gHg) and representation-space metrics (participation ratio, anisotropy, CKA) yields a first canonical correlation of 0.661. They share a meaningful common signal — probably something like overall representational complexity — but they're not measuring the same thing. This is good news for the research programme: both measurement systems contribute independent information.
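The first canonical correlation can be computed with a QR-then-SVD routine (the Björck–Golub approach); the toy matrices in the test stand in for the Hessian-space and representation-space feature blocks, which are not reproduced here:

```python
import numpy as np

def first_canonical_corr(X, Y):
    """First canonical correlation between the column spaces of X and Y.

    Center each block, take orthonormal bases via thin QR, and read
    off the largest singular value of Qx.T @ Qy, i.e. the cosine of
    the smallest principal angle between the two subspaces.
    """
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0])
```

A value of 1.0 means one block is a linear function of the other; the reported 0.661 sits well below that, which is the "coupled but not redundant" reading.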
One possible phase transition
We found a candidate phase transition near step 58,450 in the telemetry data, where multiple order-parameter susceptibilities cluster. Trajectory tortuosity spikes in the same region. Whether this is a genuine phase transition or an artifact of a single run is an open question. It's worth investigating with replicated runs.
What This Means for Gradience
The engineering value of the toolkit is unaffected. Spectral audit, compression benchmarking, and merge-compatibility analysis don't depend on the regime classification claim. They work because rank utilization and spectral decay are directly observable quantities with clear practical interpretations.
The research claim — that geometry is more informative than loss for characterizing training regimes — has genuine support, but at a different magnitude than we reported. The revised version:
Geometric features achieve 67% five-class regime classification accuracy (p < 0.001), modestly outperforming loss (40%), with the advantage driven by multivariate feature combinations. Geometry carries 7.4x more mutual information about regimes than loss and detects training-dynamics transitions ~300 steps earlier. The strength of the classification advantage cannot be determined at the current sample size (n = 15).
That's less dramatic than "100% vs 65%." It's also true.
What We're Doing About It
Three things:
First, we're publishing this post and the full reanalysis — protocol, scripts, and raw results — so anyone can check our work.
Second, the missing spectral data matters. The early_spectral_complexity feature — arguably the most important variable for the Gradience thesis — was never computed. The pipeline exists (snapshot_worker_cpu_v2.py) but was never run to completion. Filling this gap is the highest-priority next step.
Third, sample size. Confidence-interval width scales roughly as 1/√n, so going from 3 seeds to 6 per regime would shrink the bootstrap interval by about 30%; halving it would take roughly 12. Going to 10 would make McNemar's test powerful enough to detect the geometry-vs-loss gap if it's real. This is boring but necessary.
The Epistemics
We think publishing corrections like this is important. Not because getting things wrong is embarrassing — it's normal — but because the alternative is worse. A research programme that doesn't audit its own claims isn't doing research. It's doing marketing.
The Gradience project started from an intuition: that the geometry of training dynamics contains information that scalar loss doesn't capture. The reanalysis confirms the intuition is correct. What it corrects is the strength of the evidence we'd cited for it.
The distinction matters. "Geometry is informative" is a research direction. "Geometry achieves 100%" was a milestone we hadn't actually reached. Now we know where we are, and we know what it would take to get where we thought we were.