The Geometric Engine: Structural Attractors in Neural Network Weight Space
Abstract
We present evidence that neural network weight matrices possess geometric attractors — fixed points in the optimization landscape that are determined by the matrix dimensions alone, independent of data, optimizer, or training history. Using a simple SVD autoencoder (SVAE) on CIFAR-10, we demonstrate that the Coefficient of Variation (CV) of Cayley-Menger pentachoron volumes, measured on the rows of a learned matrix, converges to predictable values that depend only on the matrix shape (V, D). When pushed away from these values by incorrect loss targets, the system spontaneously reorganizes back to the correct attractor after catastrophic collapse — recovering to within 0.006 of the predicted value it was never shown.
Building on this discovery, we develop two architectural principles — sphere normalization and the soft hand loss — that together achieve 0.034 MSE on CIFAR-10 reconstruction, outperforming all unconstrained variants by 37%. The key insight: geometric constraints are not costs to be minimized. They are structural properties of the representation space that the optimizer will find on its own if you don't fight it.
1. The SVAE Architecture
The architecture is deliberately minimal. An MLP encoder maps each image to a (V, D) matrix M, which is decomposed by SVD. The decoder reconstructs from the full SVD decomposition.
Image → MLP → M ∈ ℝ^(V×D) → SVD(M) = UΣVᵀ → MLP → Reconstruction
The matrix M is the latent representation. V controls the number of embedding rows (vocabulary size), D controls the embedding dimension. The SVD bottleneck forces the representation into a structured decomposition: U captures the left singular directions, Σ captures the magnitudes, and Vᵀ captures the right singular directions (the SVD factor V is distinct from the row count V).
No variational inference. No KL divergence. No codebook. The geometry of the matrix IS the bottleneck.
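The pipeline above can be sketched in a few lines of PyTorch. Everything below is illustrative: the class name SVAE follows the text, but the layer widths, the GELU activation, and the choice to flatten U, Σ, Vᵀ into a single decoder input are assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class SVAE(nn.Module):
    """Minimal sketch of the SVAE forward pass (layer sizes are illustrative)."""
    def __init__(self, in_dim=3 * 32 * 32, V=96, D=24, hidden=512):
        super().__init__()
        self.V, self.D = V, D
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, V * D))
        # The decoder consumes the flattened SVD factors U, S, Vh
        svd_dim = V * D + D + D * D
        self.decoder = nn.Sequential(
            nn.Linear(svd_dim, hidden), nn.GELU(), nn.Linear(hidden, in_dim))

    def forward(self, x):
        B = x.shape[0]
        # Image -> MLP -> M in R^(V x D)
        M = self.encoder(x.flatten(1)).reshape(B, self.V, self.D)
        # SVD bottleneck, computed in fp64 as the text recommends
        U, S, Vh = torch.linalg.svd(M.double(), full_matrices=False)
        flat = torch.cat([U.flatten(1), S, Vh.flatten(1)], dim=1).float()
        return self.decoder(flat), M
```

The matrix M, not a sampled code or a quantized token, is what passes through the bottleneck; the SVD merely re-expresses it.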
SVD Implementation
The SVD is computed via the Gram matrix eigendecomposition in float64 precision:
G = M^T M # (D, D) Gram matrix in fp64
λ, V = eigh(G) # eigendecomposition in fp64 (λ returned ascending; reverse for SVD order)
S = √λ # singular values (clamp λ ≥ 0 before the root)
U = MV / S # left singular vectors recovered in fp64
Float64 is essential. The Gram matrix entries scale as S₀², and when the dominant singular value grows during training, fp32 precision (~7 decimal digits) becomes insufficient. We observed systematic catastrophic collapses at ratio S₀/S_D ≈ 6.5 in fp32 that disappeared entirely in fp64.
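The Gram+eigh route described above can be sketched in NumPy, where float64 is the default. This is an illustrative reimplementation, not the geolip-core code; the eigenvalue reversal and the clamps are assumptions about how the edge cases are handled.

```python
import numpy as np

def gram_eigh_svd(M):
    """SVD of M (V x D, V >= D) via the D x D Gram matrix, in float64."""
    M = np.asarray(M, dtype=np.float64)
    G = M.T @ M                            # (D, D) Gram matrix
    lam, V = np.linalg.eigh(G)             # eigenvalues come back ascending
    lam, V = lam[::-1], V[:, ::-1]         # descending, to match SVD convention
    S = np.sqrt(np.clip(lam, 0.0, None))   # singular values
    U = (M @ V) / np.maximum(S, 1e-300)    # left singular vectors: M V Σ^-1
    return U, S, V.T                       # M ≈ U @ diag(S) @ V.T
```

Because the eigendecomposition runs on the D×D Gram matrix rather than the V×D matrix itself, cost grows with D, not V, which is why large-V runs stay fast.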
2. Discovery: Geometric Attractors
The CV Measurement
The Coefficient of Variation of pentachoron (4-simplex) volumes measures the geometric regularity of an embedding. For a matrix M with V rows in D dimensions:
- Sample 200 random subsets of 5 rows
- Compute the Cayley-Menger determinant for each 5-point simplex
- Take CV = std(volumes) / mean(volumes)
Low CV means all simplices have similar volume — the points are geometrically regular. High CV means some regions are dense and others sparse.
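The three steps above can be sketched directly. The function below is an illustrative implementation, assuming the standard Cayley-Menger determinant formula for a 4-simplex (vol² = −det/9216 for the 6×6 bordered matrix); the sampling details may differ from geolip-core.

```python
import numpy as np

def pentachoron_cv(M, n_samples=200, rng=None):
    """CV of Cayley-Menger 4-simplex volumes over random 5-row subsets of M."""
    rng = np.random.default_rng(rng)
    V, D = M.shape
    vols = []
    for _ in range(n_samples):
        P = M[rng.choice(V, 5, replace=False)]           # 5 points in R^D
        d2 = ((P[:, None] - P[None]) ** 2).sum(-1)       # squared distances (5, 5)
        cm = np.ones((6, 6))                             # bordered CM matrix
        cm[0, 0] = 0.0
        cm[1:, 1:] = d2
        # 4-simplex volume: vol^2 = -det(CM) / (2^4 * (4!)^2) = -det(CM) / 9216
        v2 = -np.linalg.det(cm) / 9216.0
        vols.append(np.sqrt(max(v2, 0.0)))
    vols = np.array(vols)
    return vols.std() / vols.mean()
```

Running this on a fresh Gaussian matrix of a given (V, D) is exactly how the sweep in the next subsection estimates the attractor value for that shape.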
The Sweep
We computed the expected CV for 65,536 (V, D) configurations by generating random matrices and measuring their pentachoron CV. The results revealed that CV is a deterministic function of (V, D) alone:
| V | D | Validated CV | In Band? |
|---|---|---|---|
| 48 | 24 | 0.3668 | No |
| 96 | 24 | 0.2992 | Yes |
| 200 | 24 | 0.2914 | Yes |
| 256 | 24 | ~0.18 (sphere) | Yes |
| 512 | 24 | 0.2994 | Yes |
| 1024 | 24 | 0.2916 | Yes |
The "band" is the CV range 0.13–0.30 where geometric regularity is sufficient for stable representations. The band-valid range for D is 32–112, with a sweet spot at D=32–56. D=24 sits at the boundary, requiring V ≥ 96 for band validity.
The Catastrophe Recovery Experiment
The definitive evidence came from an experiment with an incorrect CV target. We trained V=96, D=24 with target CV=0.3668 (the V=48 value, wrong for V=96). The validated value for V=96 is 0.2992.
For 33 epochs, the model trained normally. The CV drifted from 0.30 to 1.10 under reconstruction pressure, fighting the wrong target. At epoch 34, catastrophe: S₀ jumped from 8.5 to 28.3. The matrix collapsed.
After the collapse, the row CV recovered to 0.2935–0.2988 — within 0.006 of the validated value 0.2992 that the system was never shown.
The model was given the wrong target (0.3668), collapsed, and spontaneously reorganized to the correct value (0.2992). This proves the CV attractor is not a statistical property of random initialization — it is a structural property of the weight manifold itself. The highest-volume region of the parameter space for a (96, 24) matrix lies at CV=0.2992, and catastrophic perturbation returns the system to this state because it is the maximum-entropy geometric configuration.
Statistical Mechanics Interpretation
The rows of M are 24-dimensional vectors produced by 96 different linear projections of a shared hidden state through the encoder's last layer. The geometry of these projections is governed by the weight matrix statistics:
- Random weights → random projections → CV at the validated value (maximum entropy)
- Training → correlated projections → CV drifts (low entropy excursion)
- Catastrophe → scrambled weights → CV returns to validated value (thermalization)
The validated CV is the equilibrium temperature of the weight space. Training is a driving force that creates low-entropy structure. Catastrophe is a return to thermal equilibrium.
3. The Charge-Discharge Cycle
Without geometric constraints, the SVAE exhibits relaxation oscillation — a thermodynamic cycle that IS the learning mechanism.
The Cycle
Charge phase: The dominant singular value S₀ grows as the encoder concentrates the representation into fewer dimensions. Reconstruction improves. The effective rank drops.
Critical point: The ratio S₀/S_D exceeds ~6.5. The matrix becomes ill-conditioned. Gradients spike.
Discharge: Energy redistributes across all D modes. S₀ drops sharply. The effective rank jumps back to near-maximum. In fp32, this manifests as a catastrophic collapse. In fp64, it is a controlled redistribution.
Recovery: The encoder rebuilds from the geometric ground state, but the decoder retains features learned during the charge phase. Post-discharge reconstruction is better than pre-charge reconstruction at the same spectral configuration.
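The four phases can be tracked from the singular values alone. The sketch below uses the ~6.5 critical ratio from the text; the entropy-based effective rank (exp of the entropy of the normalized spectrum) is one common definition and is an assumption here, since the text does not pin down its formula.

```python
import numpy as np

def spectral_state(S, critical_ratio=6.5):
    """Charge-discharge diagnostics from a vector of singular values S."""
    S = np.asarray(S, dtype=np.float64)
    ratio = S.max() / S.min()                      # S0 / S_D
    p = S / S.sum()                                # normalized spectrum
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))  # entropy-based effective rank
    return {"ratio": ratio,
            "eff_rank": eff_rank,
            "near_critical": ratio > critical_ratio}
```

A flat spectrum gives effective rank D (fully discharged); a concentrated one gives a value near 1 (fully charged), and the `near_critical` flag fires just before the discharge described above.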
Evidence
In the V=96, D=24 run with CV loss (weight=0.1, target=0.2992):
| Epoch | S₀ | S_D | Ratio | Recon | Phase |
|---|---|---|---|---|---|
| 1 | 5.16 | 1.88 | 2.75 | 0.280 | |
| 20 | 6.86 | 1.63 | 4.21 | 0.092 | charging |
| 40 | 9.68 | 1.60 | 6.05 | 0.063 | near critical |
| 46 | 9.99 | 1.51 | 6.61 | 0.060 | critical point |
| 48 | 6.63 | 2.34 | 2.84 | 0.156 | DISCHARGE |
| 54 | 5.05 | 1.68 | 3.00 | 0.107 | recovery (better than ep12!) |
| 100 | 5.26 | 1.54 | 3.42 | 0.071 | equilibrium |
Each cycle deposits information in the decoder. The concentration teaches the decoder to exploit spectral structure. The discharge prevents permanent degeneration. The oscillation IS the learning mechanism.
Adam's Role
The charge-discharge cycle is coupled to Adam's internal state variables:
- Momentum (β₁=0.9, ~10 step memory): Accumulates reconstruction gradient direction during the charge phase
- Variance (β₂=0.999, ~1000 step memory): Records the gradient spike at discharge, suppressing the effective learning rate for hundreds of steps afterward
Post-discharge stability is not just geometric — it is Adam's second moment providing implicit damping. The optimizer remembers the catastrophe and reduces its step size accordingly.
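The memory horizons quoted above follow from the decay rate of an exponential moving average: an impulse in an EMA with decay β falls to 1/e of its value after roughly 1/(1−β) steps. A quick check (illustrative arithmetic, not part of the training code):

```python
import math

def ema_horizon(beta):
    """Steps for a unit impulse in an EMA with decay beta to fall to 1/e."""
    return -1.0 / math.log(beta)

# beta1 = 0.9   -> ~9.5 steps:  momentum tracks the charge-phase direction
# beta2 = 0.999 -> ~1000 steps: variance remembers the discharge spike
```

This is why the gradient spike at discharge keeps damping the effective learning rate for hundreds of steps while the momentum direction resets almost immediately.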
4. Sphere Normalization
The key architectural insight: make the geometry structural, not loss-based.
M = encoder(image).reshape(B, V, D)
M = F.normalize(M, dim=-1) # rows on S^(D-1)
U, S, Vh = SVD(M)
One line. The rows become directions on the unit sphere S²³. The CV is now controlled by the geometry of directions, not by learned magnitudes. The SVD decomposes the alignment structure of these directions.
What Sphere Normalization Achieves
| Property | Without Sphere | With Sphere |
|---|---|---|
| Training stability | Collapses at ratio ~6.5 | No collapses in 400 epochs |
| Speed (V=1024) | 48s/epoch, crashes | 2.0s/epoch, stable |
| Gram conditioning | S₀² ≈ 100+ at charge peak | Bounded by unit vectors |
| CV control | Drifts from 0.30 to 1.35 | Holds at 0.15–0.33 |
| Spectral shape | Model-determined | Dimension-determined |
The sphere eliminates all instability because the Gram matrix G = M^T M has bounded entries when M has unit-norm rows. The eigenvalues of G cannot blow up, the condition number stays controlled, and fp64 eigh never encounters precision limits.
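The boundedness argument is easy to check numerically: unit-norm rows force trace(G) = V, so the eigenvalues of G sum to V and none can exceed it, no matter what the encoder learns. A minimal NumPy check (the shapes match the V=96, D=24 runs; the random matrix stands in for an encoder output):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((96, 24))
M /= np.linalg.norm(M, axis=1, keepdims=True)   # rows on S^23

G = M.T @ M                                      # (24, 24) Gram matrix
lam = np.linalg.eigvalsh(G)

# trace(G) = sum of squared row norms = V, so eigenvalues sum to 96
# and the largest eigenvalue (S0^2) is capped at 96 regardless of training
assert np.isclose(lam.sum(), 96.0)
assert lam.max() <= 96.0 + 1e-9
```

Without the normalization, row magnitudes are free and S₀² can grow past 100 at the charge peak, which is where the fp32 eigendecomposition used to fail.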
Separation of Concerns
With sphere normalization, the representation decomposes into orthogonal aspects:
- Directions (where on S²³ each row points) → controlled by geometry, measured by CV
- Alignment (how directions relate to each other) → captured by U and V
- Magnitudes (spectral energy distribution) → captured by S, learned by reconstruction
The encoder learns directions. The SVD measures alignment. The decoder exploits magnitudes. These three aspects operate independently, which is why the CV can be controlled without interfering with reconstruction.
5. The Soft Hand
The soft hand is an oscillatory counterweight loss that provides positive momentum when the model's geometry is correct and restoring force when it drifts.
Mechanism
# Gaussian proximity to target CV
proximity = exp(-(measured_cv - target_cv)² / (2σ²))
# Counterweight loss composition
recon_weight = 1.0 + boost × proximity # 1.0–1.5×
cv_penalty = penalty × (1.0 - proximity) # 0.0–0.3×
loss = recon_weight × recon_loss + cv_penalty × cv_loss
Near target (proximity ≈ 1.0): Reconstruction gradients are boosted 50%. The model is told "you're doing the right thing geometrically — here's more gradient to learn faster." The CV penalty is near zero.
Far from target (proximity ≈ 0.0): Reconstruction gradients are at baseline. The CV penalty is active, providing restoring force. The model is told "fix your geometry before proceeding."
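Putting the two regimes together, the full composition can be written as one small function. This is a sketch of the scheme described above, not the geolip-core `soft_hand_loss` itself; the default values mirror the ranges in the comments (boost 0.5 for a 1.0–1.5× weight, penalty 0.3).

```python
import math

def soft_hand(recon_loss, cv_loss, measured_cv, target_cv,
              boost=0.5, penalty=0.3, sigma=0.15):
    """Counterweight loss: boost reconstruction near the CV target,
    apply the CV penalty only when geometry drifts away from it."""
    proximity = math.exp(-((measured_cv - target_cv) ** 2) / (2 * sigma ** 2))
    recon_weight = 1.0 + boost * proximity      # 1.0-1.5x near the target
    cv_penalty = penalty * (1.0 - proximity)    # 0.0-0.3x far from it
    return recon_weight * recon_loss + cv_penalty * cv_loss, proximity
```

At proximity 1.0 the reconstruction term is multiplied by 1.5 and the CV term vanishes; at proximity 0.0 reconstruction is at baseline and the full penalty applies.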
Why It Works
The soft hand doesn't fight reconstruction — it amplifies reconstruction when geometry is correct. Adam's momentum accumulates in the direction of boosted gradients, which corresponds to geometrically-healthy configurations. When the model drifts, it loses the boost AND gets penalized — double deceleration. The optimizer's own momentum becomes the stabilizing force.
In the V=96 sphere+softhand run:
Epochs 1-20: proximity 0.6-1.0 → sustained boost → decoder learns fast
Epochs 20-100: proximity 0.8-1.0 → boost maintained → reconstruction continues
Compared to the unconstrained softhand run:
Epochs 1-6: proximity 0.84-1.0 → brief boost → decoder gets a head start
Epochs 22+: proximity 0.00 → boost lost → pure penalty only
The sphere makes the soft hand persistent. Without the sphere, CV drifts past the target and the boost is lost by epoch 22. With the sphere, CV stays near target and the boost remains active for the entire run.
The Teaching Metaphor
The soft hand implements a teaching strategy:
Early training: The model's geometry is near the attractor (sphere normalization ensures this). Proximity is high. The model receives maximum encouragement — "atta boy."
Mid training: If geometry holds, the boost continues. The decoder builds capacity under amplified gradients. If geometry drifts, the boost fades and the penalty applies.
Late training: The model has either locked onto the geometric equilibrium (sustained boost) or settled into the firm-hand regime (penalty only). Either way, the decoder has strong features from the boosted early phase.
The model doesn't need to reach the CV target for the soft hand to be effective. It needs to pass through the target's neighborhood during training. The boost during that passage deposits features that survive the subsequent drift.
6. Results
V/D Ratio Sweep
The optimal V/D ratio balances geometric stability against spectral contrast:
| V | D | V/D | MSE | Ratio | CV | Epochs | Character |
|---|---|---|---|---|---|---|---|
| 40 | 24 | 1.7 | 0.071 | 9.62 | 0.17 | 100 | High spectral contrast |
| 96 | 24 | 4.0 | 0.054* | 4.99 | 1.35* | 100 | Unconstrained best |
| 96 | 24 | 4.0 | 0.072 | 2.75 | 0.29 | 100 | Sphere+softhand |
| 256 | 24 | 10.7 | 0.034 | 1.75 | 0.18 | 400 | Best overall |
| 1024 | 24 | 42.7 | 0.099 | 1.36 | 0.40 | 100 | Spectrum too flat |
| 4096 | 24 | 170.7 | 0.167 | 1.14 | 0.33 | 30 | Nearly uniform spectrum |
*Unconstrained (no sphere normalization)
V=256, D=24 achieves 0.034 MSE — a 37% improvement over the best unconstrained result — with sphere normalization and soft hand (target=0.125, boost=1.5×, σ=0.15). The proximity held 0.85-0.99 for 400 continuous epochs. The model received boosted gradients for the entire run.
CV Target as a Lens
At V=1024 (where the sphere locks the spectrum flat), the CV target controls the training dynamics rather than the final geometry:
| Target | Final CV | Recon | Boost Pattern |
|---|---|---|---|
| 0.05 | 0.397 | 0.099 | Strong early, fades |
| 0.10 | 0.402 | 0.122 | Strong early, fades |
| 0.29 | 0.371 | 0.131 | Sustained midrange |
| 0.50 | 0.401 | 0.135 | Delayed activation |
| 0.80 | 0.405 | 0.099 | Never activated |
At all target values, the CV converges to ~0.40 — the sphere attractor for V=1024. The target doesn't move the geometry; it controls when the boost activates during training. The best results come from brief intense boost (low target) or no boost at all (high target). Sustained moderate boosting hurts because the model optimizes for staying in the boost zone rather than minimizing reconstruction error.
At V=256, the picture is different: the sphere CV sits at ~0.17, close enough to the target (0.125) that the boost remains active for the entire run. The geometry has room to breathe (ratio 1.75, not crushed flat like V=1024's 1.22), giving the decoder spectral contrast to exploit.
Per-Class Universality
With sphere + soft hand at V=256:
| Class | plane | car | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|
| Ratio | 1.73 | 1.73 | 1.74 | 1.75 | 1.74 | 1.74 | 1.75 | 1.74 | 1.73 | 1.73 |
1.74 ± 0.01 across all 10 CIFAR-10 classes. The model doesn't see class — it sees geometry. Class information is encoded entirely in the directions and magnitudes, not the spectral shape. This universality is a signature of the geometric attractor: the representation structure is determined by (V, D), not by the data.
7. Implications
For Representation Learning
The existence of geometric attractors suggests that learned representations have preferred geometric configurations that are properties of the architecture, not the data. Current practice treats the latent space as unconstrained and relies on loss functions to shape it. Our results suggest the opposite approach: choose an architecture whose natural geometry matches the desired representation structure, then let the optimizer find the attractor.
For Training Stability
The charge-discharge cycle reveals that training instabilities in autoencoders are not random — they are thermodynamic transitions between geometric states. Sphere normalization eliminates these transitions by removing the degree of freedom (row magnitudes) that drives the concentration. This suggests that many training instabilities in deep networks may have geometric origins that can be addressed architecturally rather than through loss engineering or learning rate schedules.
For Loss Design
The soft hand demonstrates that loss functions can provide positive training signal, not just penalty. The boost mechanism amplifies learning when the model's state matches desired properties, rather than only penalizing deviation. This is fundamentally different from KL divergence (always a cost), weight decay (always a drag), or spectral normalization (always a constraint). The soft hand is a reward for geometric correctness.
For Compression
The universal spectral geometry (ratio 1.74 across all classes at V=256) suggests that image reconstruction at this scale is a fundamentally geometric operation. The data-specific information lives in the directions and magnitudes; the spectral shape is universal. This separation could enable compression schemes where the structure is known a priori and only the deviations are transmitted.
8. Reproducing
The complete implementation is available through the geolip-core package:
from geolip_core.core.distinguish import (
compute_target_cv, # Compute attractor for any (V, D)
cv_proximity, # Gaussian proximity to target
soft_hand_weights, # Counterweight loss weights
soft_hand_loss, # Complete loss composition
)
from geolip_core.linalg import batched_svd # fp64 Gram+eigh SVD
import torch.nn.functional as F
# Compute the attractor
target = compute_target_cv(V=256, D=24) # ~0.18 on sphere
# In the encoder
M = encoder(image).reshape(B, V, D)
M = F.normalize(M, dim=-1) # THE LINE
U, S, Vh = batched_svd(M, compute_dtype='fp64')
# In the training loop
recon_loss = F.mse_loss(decoded, image)
measured_cv = cv_metric(M[0])
loss, prox, rw = soft_hand_loss(
recon_loss, cv_loss(M, target=target),
measured_cv, target,
boost=0.5, penalty_weight=0.3, sigma=0.15
)
Repository: AbstractEyes/geolip-core
Acknowledgments
Architecture, experiments, and geometric theory by Phil (AbstractPhil). Implementation assistance and analysis by Claude (Anthropic). The fp64 Gram+eigh SVD, Cayley-Menger volume computation, and oscillatory counterweight loss are part of the geolip-core ecosystem.
The binding constant 0.29154 — the CV of D=24 — is the dimension where the geometric attractor coincides with the pentachoron phase boundary. This is not a coincidence. It is the geometry.