The Geometric Engine: Structural Attractors in Neural Network Weight Space
Abstract
We present evidence that neural network weight matrices possess geometric attractors — fixed points in the optimization landscape that are determined by the matrix dimensions alone, independent of data, optimizer, or training history. Using a simple SVD autoencoder (SVAE) on CIFAR-10, we demonstrate that the Coefficient of Variation (CV) of Cayley-Menger pentachoron volumes, measured on the rows of a learned matrix, converges to predictable values that depend only on the matrix shape (V, D). When pushed away from these values by incorrect loss targets, the system spontaneously reorganizes back to the correct attractor after catastrophic collapse — recovering to within 0.006 of the predicted value it was never shown.
Building on this discovery, we develop two architectural principles — sphere normalization and the soft hand loss — that together achieve 0.034 MSE on CIFAR-10 reconstruction, outperforming all unconstrained variants by 37%. The key insight: geometric constraints are not costs to be minimized. They are structural properties of the representation space that the optimizer will find on its own if you don't fight it.
1. The SVAE Architecture
The architecture is deliberately minimal. An MLP encoder maps each image to a (V, D) matrix M, which is decomposed by SVD. The decoder reconstructs from the full SVD decomposition.
Image → MLP → M ∈ ℝ^(V×D) → SVD(M) = UΣVᵀ → MLP → Reconstruction
The matrix M is the latent representation. V controls the number of embedding rows (vocabulary size), D controls the embedding dimension. The SVD bottleneck forces the representation into a structured decomposition: U captures the left singular directions, Σ captures the magnitudes, and Vᵀ captures the right singular directions (the SVD factor V is distinct from the row count V).
No variational inference. No KL divergence. No codebook. The geometry of the matrix IS the bottleneck.
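The pipeline above can be sketched in a few lines of PyTorch. Everything below is illustrative: the class name SVAE follows the text, but the layer widths, the GELU activation, and the choice to flatten U, Σ, Vᵀ into a single decoder input are assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class SVAE(nn.Module):
    """Minimal sketch of the SVAE forward pass (layer sizes are illustrative)."""
    def __init__(self, in_dim=3 * 32 * 32, V=96, D=24, hidden=512):
        super().__init__()
        self.V, self.D = V, D
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, V * D))
        # The decoder consumes the flattened SVD factors U, S, Vh
        svd_dim = V * D + D + D * D
        self.decoder = nn.Sequential(
            nn.Linear(svd_dim, hidden), nn.GELU(), nn.Linear(hidden, in_dim))

    def forward(self, x):
        B = x.shape[0]
        # Image -> MLP -> M in R^(V x D)
        M = self.encoder(x.flatten(1)).reshape(B, self.V, self.D)
        # SVD bottleneck, computed in fp64 as the text recommends
        U, S, Vh = torch.linalg.svd(M.double(), full_matrices=False)
        flat = torch.cat([U.flatten(1), S, Vh.flatten(1)], dim=1).float()
        return self.decoder(flat), M
```

The matrix M, not a sampled code or a quantized token, is what passes through the bottleneck; the SVD merely re-expresses it.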
SVD Implementation
The SVD is computed via the Gram matrix eigendecomposition in float64 precision:
G = M^T M # (D, D) Gram matrix in fp64
λ, V = eigh(G) # eigendecomposition in fp64 (λ returned ascending; reverse for SVD order)
S = √λ # singular values (clamp λ ≥ 0 before the root)
U = MV / S # left singular vectors recovered in fp64
Float64 is essential. The Gram matrix entries scale as S₀², and when the dominant singular value grows during training, fp32 precision (~7 decimal digits) becomes insufficient. We observed systematic catastrophic collapses at ratio S₀/S_D ≈ 6.5 in fp32 that disappeared entirely in fp64.
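The Gram+eigh route described above can be sketched in NumPy, where float64 is the default. This is an illustrative reimplementation, not the geolip-core code; the eigenvalue reversal and the clamps are assumptions about how the edge cases are handled.

```python
import numpy as np

def gram_eigh_svd(M):
    """SVD of M (V x D, V >= D) via the D x D Gram matrix, in float64."""
    M = np.asarray(M, dtype=np.float64)
    G = M.T @ M                            # (D, D) Gram matrix
    lam, V = np.linalg.eigh(G)             # eigenvalues come back ascending
    lam, V = lam[::-1], V[:, ::-1]         # descending, to match SVD convention
    S = np.sqrt(np.clip(lam, 0.0, None))   # singular values
    U = (M @ V) / np.maximum(S, 1e-300)    # left singular vectors: M V Σ^-1
    return U, S, V.T                       # M ≈ U @ diag(S) @ V.T
```

Because the eigendecomposition runs on the D×D Gram matrix rather than the V×D matrix itself, cost grows with D, not V, which is why large-V runs stay fast.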
2. Discovery: Geometric Attractors
The CV Measurement
The Coefficient of Variation of pentachoron (4-simplex) volumes measures the geometric regularity of an embedding. For a matrix M with V rows in D dimensions:
- Sample 200 random subsets of 5 rows
- Compute the Cayley-Menger determinant for each 5-point simplex
- Take CV = std(volumes) / mean(volumes)
Low CV means all simplices have similar volume — the points are geometrically regular. High CV means some regions are dense and others sparse.
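The three steps above can be sketched directly. The function below is an illustrative implementation, assuming the standard Cayley-Menger determinant formula for a 4-simplex (vol² = −det/9216 for the 6×6 bordered matrix); the sampling details may differ from geolip-core.

```python
import numpy as np

def pentachoron_cv(M, n_samples=200, rng=None):
    """CV of Cayley-Menger 4-simplex volumes over random 5-row subsets of M."""
    rng = np.random.default_rng(rng)
    V, D = M.shape
    vols = []
    for _ in range(n_samples):
        P = M[rng.choice(V, 5, replace=False)]           # 5 points in R^D
        d2 = ((P[:, None] - P[None]) ** 2).sum(-1)       # squared distances (5, 5)
        cm = np.ones((6, 6))                             # bordered CM matrix
        cm[0, 0] = 0.0
        cm[1:, 1:] = d2
        # 4-simplex volume: vol^2 = -det(CM) / (2^4 * (4!)^2) = -det(CM) / 9216
        v2 = -np.linalg.det(cm) / 9216.0
        vols.append(np.sqrt(max(v2, 0.0)))
    vols = np.array(vols)
    return vols.std() / vols.mean()
```

Running this on a fresh Gaussian matrix of a given (V, D) is exactly how the sweep in the next subsection estimates the attractor value for that shape.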
The Sweep
We computed the expected CV for 65,536 (V, D) configurations by generating random matrices and measuring their pentachoron CV. The results revealed that CV is a deterministic function of (V, D) alone:
| V | D | Validated CV | In Band? |
|---|---|---|---|
| 48 | 24 | 0.3668 | No |
| 96 | 24 | 0.2992 | Yes |
| 200 | 24 | 0.2914 | Yes |
| 256 | 24 | ~0.18 (sphere) | Yes |
| 512 | 24 | 0.2994 | Yes |
| 1024 | 24 | 0.2916 | Yes |
The "band" is the CV range 0.13–0.30 where geometric regularity is sufficient for stable representations. The band-valid range for D is 32–112, with a sweet spot at D=32–56. D=24 sits at the boundary, requiring V ≥ 96 for band validity.
The Catastrophe Recovery Experiment
The definitive evidence came from an experiment with an incorrect CV target. We trained V=96, D=24 with target CV=0.3668 (the V=48 value, wrong for V=96). The validated value for V=96 is 0.2992.
For 33 epochs, the model trained normally. The CV drifted from 0.30 to 1.10 under reconstruction pressure, fighting the wrong target. At epoch 34, catastrophe: S₀ jumped from 8.5 to 28.3. The matrix collapsed.
After the collapse, the row CV recovered to 0.2935–0.2988 — within 0.006 of the validated value 0.2992 that the system was never shown.
The model was given the wrong target (0.3668), collapsed, and spontaneously reorganized to the correct value (0.2992). This proves the CV attractor is not a statistical property of random initialization — it is a structural property of the weight manifold itself. The highest-volume region of the parameter space for a (96, 24) matrix lies at CV=0.2992, and catastrophic perturbation returns the system to this state because it is the maximum-entropy geometric configuration.
Statistical Mechanics Interpretation
The rows of M are 24-dimensional vectors produced by 96 different linear projections of a shared hidden state through the encoder's last layer. The geometry of these projections is governed by the weight matrix statistics:
- Random weights → random projections → CV at the validated value (maximum entropy)
- Training → correlated projections → CV drifts (low entropy excursion)
- Catastrophe → scrambled weights → CV returns to validated value (thermalization)
The validated CV is the equilibrium temperature of the weight space. Training is a driving force that creates low-entropy structure. Catastrophe is a return to thermal equilibrium.
3. The Charge-Discharge Cycle
Without geometric constraints, the SVAE exhibits relaxation oscillation — a thermodynamic cycle that IS the learning mechanism.
The Cycle
Charge phase: The dominant singular value S₀ grows as the encoder concentrates the representation into fewer dimensions. Reconstruction improves. The effective rank drops.
Critical point: The ratio S₀/S_D exceeds ~6.5. The matrix becomes ill-conditioned. Gradients spike.
Discharge: Energy redistributes across all D modes. S₀ drops sharply. The effective rank jumps back to near-maximum. In fp32, this manifests as a catastrophic collapse. In fp64, it is a controlled redistribution.
Recovery: The encoder rebuilds from the geometric ground state, but the decoder retains features learned during the charge phase. Post-discharge reconstruction is better than pre-charge reconstruction at the same spectral configuration.
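The four phases can be tracked from the singular values alone. The sketch below uses the ~6.5 critical ratio from the text; the entropy-based effective rank (exp of the entropy of the normalized spectrum) is one common definition and is an assumption here, since the text does not pin down its formula.

```python
import numpy as np

def spectral_state(S, critical_ratio=6.5):
    """Charge-discharge diagnostics from a vector of singular values S."""
    S = np.asarray(S, dtype=np.float64)
    ratio = S.max() / S.min()                      # S0 / S_D
    p = S / S.sum()                                # normalized spectrum
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))  # entropy-based effective rank
    return {"ratio": ratio,
            "eff_rank": eff_rank,
            "near_critical": ratio > critical_ratio}
```

A flat spectrum gives effective rank D (fully discharged); a concentrated one gives a value near 1 (fully charged), and the `near_critical` flag fires just before the discharge described above.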
Evidence
In the V=96, D=24 run with CV loss (weight=0.1, target=0.2992):
| Epoch | S₀ | S_D | Ratio | Recon | Phase |
|---|---|---|---|---|---|
| 1 | 5.16 | 1.88 | 2.75 | 0.280 | |
| 20 | 6.86 | 1.63 | 4.21 | 0.092 | charging |
| 40 | 9.68 | 1.60 | 6.05 | 0.063 | near critical |
| 46 | 9.99 | 1.51 | 6.61 | 0.060 | critical point |
| 48 | 6.63 | 2.34 | 2.84 | 0.156 | DISCHARGE |
| 54 | 5.05 | 1.68 | 3.00 | 0.107 | recovery (better than ep12!) |
| 100 | 5.26 | 1.54 | 3.42 | 0.071 | equilibrium |
Each cycle deposits information in the decoder. The concentration teaches the decoder to exploit spectral structure. The discharge prevents permanent degeneration. The oscillation IS the learning mechanism.
Adam's Role
The charge-discharge cycle is coupled to Adam's internal state variables:
- Momentum (β₁=0.9, ~10 step memory): Accumulates reconstruction gradient direction during the charge phase
- Variance (β₂=0.999, ~1000 step memory): Records the gradient spike at discharge, suppressing the effective learning rate for hundreds of steps afterward
Post-discharge stability is not just geometric — it is Adam's second moment providing implicit damping. The optimizer remembers the catastrophe and reduces its step size accordingly.
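The memory horizons quoted above follow from the decay rate of an exponential moving average: an impulse in an EMA with decay β falls to 1/e of its value after roughly 1/(1−β) steps. A quick check (illustrative arithmetic, not part of the training code):

```python
import math

def ema_horizon(beta):
    """Steps for a unit impulse in an EMA with decay beta to fall to 1/e."""
    return -1.0 / math.log(beta)

# beta1 = 0.9   -> ~9.5 steps:  momentum tracks the charge-phase direction
# beta2 = 0.999 -> ~1000 steps: variance remembers the discharge spike
```

This is why the gradient spike at discharge keeps damping the effective learning rate for hundreds of steps while the momentum direction resets almost immediately.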
4. Sphere Normalization
The key architectural insight: make the geometry structural, not loss-based.
M = encoder(image).reshape(B, V, D)
M = F.normalize(M, dim=-1) # rows on S^(D-1)
U, S, Vh = SVD(M)
One line. The rows become directions on the unit sphere S²³. The CV is now controlled by the geometry of directions, not by learned magnitudes. The SVD decomposes the alignment structure of these directions.
What Sphere Normalization Achieves
| Property | Without Sphere | With Sphere |
|---|---|---|
| Training stability | Collapses at ratio ~6.5 | No collapses in 400 epochs |
| Speed (V=1024) | 48s/epoch, crashes | 2.0s/epoch, stable |
| Gram conditioning | S₀² ≈ 100+ at charge peak | Bounded by unit vectors |
| CV control | Drifts from 0.30 to 1.35 | Holds at 0.15–0.33 |
| Spectral shape | Model-determined | Dimension-determined |
The sphere eliminates all instability because the Gram matrix G = M^T M has bounded entries when M has unit-norm rows. The eigenvalues of G cannot blow up, the condition number stays controlled, and fp64 eigh never encounters precision limits.
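The boundedness argument is easy to check numerically: unit-norm rows force trace(G) = V, so the eigenvalues of G sum to V and none can exceed it, no matter what the encoder learns. A minimal NumPy check (the shapes match the V=96, D=24 runs; the random matrix stands in for an encoder output):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((96, 24))
M /= np.linalg.norm(M, axis=1, keepdims=True)   # rows on S^23

G = M.T @ M                                      # (24, 24) Gram matrix
lam = np.linalg.eigvalsh(G)

# trace(G) = sum of squared row norms = V, so eigenvalues sum to 96
# and the largest eigenvalue (S0^2) is capped at 96 regardless of training
assert np.isclose(lam.sum(), 96.0)
assert lam.max() <= 96.0 + 1e-9
```

Without the normalization, row magnitudes are free and S₀² can grow past 100 at the charge peak, which is where the fp32 eigendecomposition used to fail.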
Separation of Concerns
With sphere normalization, the representation decomposes into orthogonal aspects:
- Directions (where on S²³ each row points) → controlled by geometry, measured by CV
- Alignment (how directions relate to each other) → captured by U and V
- Magnitudes (spectral energy distribution) → captured by S, learned by reconstruction
The encoder learns directions. The SVD measures alignment. The decoder exploits magnitudes. These three aspects operate independently, which is why the CV can be controlled without interfering with reconstruction.
5. The Soft Hand
The soft hand is an oscillatory counterweight loss that provides positive momentum when the model's geometry is correct and restoring force when it drifts.
Mechanism
# Gaussian proximity to target CV
proximity = exp(-(measured_cv - target_cv)² / (2σ²))
# Counterweight loss composition
recon_weight = 1.0 + boost × proximity # 1.0–1.5×
cv_penalty = penalty × (1.0 - proximity) # 0.0–0.3×
loss = recon_weight × recon_loss + cv_penalty × cv_loss
Near target (proximity ≈ 1.0): Reconstruction gradients are boosted 50%. The model is told "you're doing the right thing geometrically — here's more gradient to learn faster." The CV penalty is near zero.
Far from target (proximity ≈ 0.0): Reconstruction gradients are at baseline. The CV penalty is active, providing restoring force. The model is told "fix your geometry before proceeding."
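Putting the two regimes together, the full composition can be written as one small function. This is a sketch of the scheme described above, not the geolip-core `soft_hand_loss` itself; the default values mirror the ranges in the comments (boost 0.5 for a 1.0–1.5× weight, penalty 0.3).

```python
import math

def soft_hand(recon_loss, cv_loss, measured_cv, target_cv,
              boost=0.5, penalty=0.3, sigma=0.15):
    """Counterweight loss: boost reconstruction near the CV target,
    apply the CV penalty only when geometry drifts away from it."""
    proximity = math.exp(-((measured_cv - target_cv) ** 2) / (2 * sigma ** 2))
    recon_weight = 1.0 + boost * proximity      # 1.0-1.5x near the target
    cv_penalty = penalty * (1.0 - proximity)    # 0.0-0.3x far from it
    return recon_weight * recon_loss + cv_penalty * cv_loss, proximity
```

At proximity 1.0 the reconstruction term is multiplied by 1.5 and the CV term vanishes; at proximity 0.0 reconstruction is at baseline and the full penalty applies.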
Why It Works
The soft hand doesn't fight reconstruction — it amplifies reconstruction when geometry is correct. Adam's momentum accumulates in the direction of boosted gradients, which corresponds to geometrically-healthy configurations. When the model drifts, it loses the boost AND gets penalized — double deceleration. The optimizer's own momentum becomes the stabilizing force.
In the V=96 sphere+softhand run:
Epochs 1-20: proximity 0.6-1.0 → sustained boost → decoder learns fast
Epochs 20-100: proximity 0.8-1.0 → boost maintained → reconstruction continues
Compared to the unconstrained softhand run:
Epochs 1-6: proximity 0.84-1.0 → brief boost → decoder gets a head start
Epochs 22+: proximity 0.00 → boost lost → pure penalty only
The sphere makes the soft hand persistent. Without the sphere, CV drifts past the target and the boost is lost by epoch 22. With the sphere, CV stays near target and the boost remains active for the entire run.
The Teaching Metaphor
The soft hand implements a teaching strategy:
Early training: The model's geometry is near the attractor (sphere normalization ensures this). Proximity is high. The model receives maximum encouragement — "atta boy."
Mid training: If geometry holds, the boost continues. The decoder builds capacity under amplified gradients. If geometry drifts, the boost fades and the penalty applies.
Late training: The model has either locked onto the geometric equilibrium (sustained boost) or settled into the firm-hand regime (penalty only). Either way, the decoder has strong features from the boosted early phase.
The model doesn't need to reach the CV target for the soft hand to be effective. It needs to pass through the target's neighborhood during training. The boost during that passage deposits features that survive the subsequent drift.
6. Results
V/D Ratio Sweep
The optimal V/D ratio balances geometric stability against spectral contrast:
| V | D | V/D | MSE | Ratio | CV | Epochs | Character |
|---|---|---|---|---|---|---|---|
| 40 | 24 | 1.7 | 0.071 | 9.62 | 0.17 | 100 | High spectral contrast |
| 96 | 24 | 4.0 | 0.054* | 4.99 | 1.35* | 100 | Unconstrained best |
| 96 | 24 | 4.0 | 0.072 | 2.75 | 0.29 | 100 | Sphere+softhand |
| 256 | 24 | 10.7 | 0.034 | 1.75 | 0.18 | 400 | Best overall |
| 1024 | 24 | 42.7 | 0.099 | 1.36 | 0.40 | 100 | Spectrum too flat |
| 4096 | 24 | 170.7 | 0.167 | 1.14 | 0.33 | 30 | Nearly uniform spectrum |
*Unconstrained (no sphere normalization)
V=256, D=24 achieves 0.034 MSE — a 37% improvement over the best unconstrained result — with sphere normalization and soft hand (target=0.125, boost=1.5×, σ=0.15). The proximity held 0.85-0.99 for 400 continuous epochs. The model received boosted gradients for the entire run.
CV Target as a Lens
At V=1024 (where the sphere locks the spectrum flat), the CV target controls the training dynamics rather than the final geometry:
| Target | Final CV | Recon | Boost Pattern |
|---|---|---|---|
| 0.05 | 0.397 | 0.099 | Strong early, fades |
| 0.10 | 0.402 | 0.122 | Strong early, fades |
| 0.29 | 0.371 | 0.131 | Sustained midrange |
| 0.50 | 0.401 | 0.135 | Delayed activation |
| 0.80 | 0.405 | 0.099 | Never activated |
At all target values, the CV converges to ~0.40 — the sphere attractor for V=1024. The target doesn't move the geometry; it controls when the boost activates during training. The best results come from brief intense boost (low target) or no boost at all (high target). Sustained moderate boosting hurts because the model optimizes for staying in the boost zone rather than minimizing reconstruction error.
At V=256, the picture is different: the sphere CV sits at ~0.17, close enough to the target (0.125) that the boost remains active for the entire run. The geometry has room to breathe (ratio 1.75, not crushed flat like V=1024's 1.22), giving the decoder spectral contrast to exploit.
Per-Class Universality
With sphere + soft hand at V=256:
| Class | plane | car | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|
| Ratio | 1.73 | 1.73 | 1.74 | 1.75 | 1.74 | 1.74 | 1.75 | 1.74 | 1.73 | 1.73 |
1.74 ± 0.01 across all 10 CIFAR-10 classes. The model doesn't see class — it sees geometry. Class information is encoded entirely in the directions and magnitudes, not the spectral shape. This universality is a signature of the geometric attractor: the representation structure is determined by (V, D), not by the data.
7. Implications
For Representation Learning
The existence of geometric attractors suggests that learned representations have preferred geometric configurations that are properties of the architecture, not the data. Current practice treats the latent space as unconstrained and relies on loss functions to shape it. Our results suggest the opposite approach: choose an architecture whose natural geometry matches the desired representation structure, then let the optimizer find the attractor.
For Training Stability
The charge-discharge cycle reveals that training instabilities in autoencoders are not random — they are thermodynamic transitions between geometric states. Sphere normalization eliminates these transitions by removing the degree of freedom (row magnitudes) that drives the concentration. This suggests that many training instabilities in deep networks may have geometric origins that can be addressed architecturally rather than through loss engineering or learning rate schedules.
For Loss Design
The soft hand demonstrates that loss functions can provide positive training signal, not just penalty. The boost mechanism amplifies learning when the model's state matches desired properties, rather than only penalizing deviation. This is fundamentally different from KL divergence (always a cost), weight decay (always a drag), or spectral normalization (always a constraint). The soft hand is a reward for geometric correctness.
For Compression
The universal spectral geometry (ratio 1.74 across all classes at V=256) suggests that image reconstruction at this scale is a fundamentally geometric operation. The data-specific information lives in the directions and magnitudes; the spectral shape is universal. This separation could enable compression schemes where the structure is known a priori and only the deviations are transmitted.
8. Reproducing
The complete implementation is available through the geolip-core package:
from geolip_core.core.distinguish import (
compute_target_cv, # Compute attractor for any (V, D)
cv_proximity, # Gaussian proximity to target
soft_hand_weights, # Counterweight loss weights
soft_hand_loss, # Complete loss composition
)
from geolip_core.linalg import batched_svd # fp64 Gram+eigh SVD
import torch.nn.functional as F
# Compute the attractor
target = compute_target_cv(V=256, D=24) # ~0.18 on sphere
# In the encoder
M = encoder(image).reshape(B, V, D)
M = F.normalize(M, dim=-1) # THE LINE
U, S, Vh = batched_svd(M, compute_dtype='fp64')
# In the training loop
recon_loss = F.mse_loss(decoded, image)
measured_cv = cv_metric(M[0])
loss, prox, rw = soft_hand_loss(
recon_loss, cv_loss(M, target=target),
measured_cv, target,
boost=0.5, penalty_weight=0.3, sigma=0.15
)
Repository: AbstractEyes/geolip-core
Acknowledgments
Architecture, experiments, and geometric theory by Phil (AbstractPhil). Implementation assistance and analysis by Claude (Anthropic). The fp64 Gram+eigh SVD, Cayley-Menger volume computation, and oscillatory counterweight loss are part of the geolip-core ecosystem.
The binding constant 0.29154 — the CV of D=24 — is the dimension where the geometric attractor coincides with the pentachoron phase boundary. This is not a coincidence. It is the geometry.