---
license: apache-2.0
---

# Potential foundational piece

This is a different kind of VAE that specifically targets CV for autoencoding. The model is only preliminary and needs many supporting utilities
in place before it can function.

The model will likely run entirely on KL divergence and standard autoencoder structure, with one caveat: it is geometrically aligned on principles
similar to the SVAE's. Alignment is handled by an internal subsystem that adjudicates through a series of embedding arrays, each aligned
on the CV spectrum.

The system is essentially a CV battery container: it holds hundreds of miniature SVD-trained batteries that are implanted directly
into the substructure as learning starting points. The early design shows promise for rapid learning transfer.

This model **will not require FP64 SVD to train**, and it will be almost entirely linear upon completion. That makes it a single parameter-heavy model
rather than the combination of model shapes I've been cobbling together up to this point.

# geolip-svae-nosvd-ablation

**Status: shelved pending proper redesign. See "What's next" section.**

An ablation study exploring whether the SVD in PatchSVAE can be replaced by
a learned linear readout. The short answer is *not directly*, and the long
answer is a list of architectural properties the SVD was providing implicitly
that any replacement must supply explicitly.

This repo exists to preserve the experiment and its findings for future
work. The parent architecture lives at
[geolip-svae](https://huggingface.co/AbstractPhil/geolip-SVAE) and
[geolip-svae-batteries](https://huggingface.co/AbstractPhil/geolip-svae-batteries).

## The motivating realization

During F-class sweep analysis, we articulated a claim that reframed what
PatchSVAE is doing:

**The SVD is a readout, not a decomposer. The encoder + sphere-normalization
is the decomposer.**

The argument:

1. The encoder MLP projects a patch into a V×D matrix space
2. Sphere-normalization (one line, zero parameters) constrains every row to
   S^(D-1) — the unit hypersphere in D dimensions
3. The SVD is then exact arithmetic on V points on S^(D-1). Given V unit
   vectors in D-space, the factorization U·Σ·V^T is unique up to sign
   (provided the singular values are distinct)
4. Cross-attention is 0.013% of parameters with alpha coefficients that
   barely move during training — per-patch SVD already produces correct
   coordinates; cross-attn is verification, not coordination
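
As a minimal sketch of steps 2 and 3, the snippet below sphere-normalizes a random V×D matrix and checks that the SVD reproduces it exactly. The sizes V=64 and D=16 are inferred from the 1024→16 projection mentioned later in this document, the random matrix is only a stand-in for real encoder output, and NumPy is used for brevity rather than the trainer's own framework.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # assumed sizes: V * D = 1024, D = 16

# Stand-in for the encoder output: an arbitrary V x D matrix.
raw = rng.normal(size=(V, D))

# Sphere-normalization: scale every row to unit length so it lies on S^(D-1).
M = raw / np.linalg.norm(raw, axis=1, keepdims=True)
row_norms = np.linalg.norm(M, axis=1)

# Thin SVD: U is V x D, S has D entries, Vt is D x D.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# The factorization is exact up to floating-point error.
M_rebuilt = U @ np.diag(S) @ Vt
max_rebuild_error = float(np.max(np.abs(M - M_rebuilt)))

print("row norms all 1:", np.allclose(row_norms, 1.0))
print("max |M - U S Vt|:", max_rebuild_error)
```

No parameters are learned anywhere in this path, which is the sense in which the SVD acts as a readout of coordinates the encoder has already produced.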

Under this frame, omega tokens are not a learned compressed representation.
They are **coordinates on the universal S^(D-1) packing manifold**. The
universal attractor (S₀ ≈ 5.1, erank ≈ 15.88 at D=16; CV 0.20–0.23 band) is
a geometric property of "V unit vectors packed as evenly as possible on
S^(D-1)," not a learned statistic. The encoder discovers projections onto
that fixed manifold. The manifold is fixed by the architecture.

If this is right, the SVD should be replaceable by any mechanism that reads
D-dim coordinates off a sphere-normalized V×D matrix. A learned
`Linear(V·D → D)` should work. This repo tests that hypothesis.

## What the ablation actually is

| Stage | Canonical PatchSVAE | NoSVD ablation |
|---|---|---|
| Encoder | MLP → V·D flat | same |
| Sphere norm | F.normalize(dim=-1) on V×D reshape | same |
| Readout | `U,S,Vt = svd(M)` → omega is S | `omega = Linear(V·D, D)(M.flatten())` |
| Cross-attention | on S across patches | on omega across patches |
| Inverse readout | `M_hat = U @ diag(S_coord) @ Vt` | `M_hat = sphere_norm(Linear(D, V·D)(omega_coord))` |
| Decoder | MLP from V·D flat | same |

Everything else — encoder, decoder, cross-attention logic, boundary smoothing,
CV-EMA soft-hand, 16-type noise training, 30 epochs — is identical to the
F-class trainer.
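
The two readout paths in the table can be sketched side by side. This is an illustrative reconstruction, not the trainer's code: the class name `LinearReadoutSketch`, the batch size, and V=64, D=16 are assumptions, and cross-attention is omitted so that `S_coord` is just `S` and `omega_coord` is just `omega`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 64, 16  # assumed sizes
torch.manual_seed(0)

class LinearReadoutSketch(nn.Module):
    """Hypothetical stand-in for the NoSVD readout / inverse-readout pair."""

    def __init__(self, v: int, d: int):
        super().__init__()
        self.v, self.d = v, d
        self.readout = nn.Linear(v * d, d)
        self.inverse_readout = nn.Linear(d, v * d)

    def forward(self, M: torch.Tensor):
        omega = self.readout(M.flatten(start_dim=1))            # (B, D)
        M_hat = self.inverse_readout(omega).view(-1, self.v, self.d)
        return omega, F.normalize(M_hat, dim=-1)                # sphere-norm on M_hat

# Sphere-normalized stand-in for a batch of encoder outputs.
M = F.normalize(torch.randn(8, V, D), dim=-1)

# Canonical path: SVD readout, then exact inverse.
U, S, Vt = torch.linalg.svd(M, full_matrices=False)
M_hat_svd = U @ torch.diag_embed(S) @ Vt

# Ablation path: learned readout, learned inverse.
omega, M_hat_lin = LinearReadoutSketch(V, D)(M)

print(S.shape, omega.shape)                       # both (8, 16)
print(torch.allclose(M_hat_svd, M, atol=1e-4))    # SVD path reconstructs M
```

Both paths produce a D-dimensional token per patch, but only the SVD path returns `M` without any training.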

## What happened

Four debug rounds before shelving. Each round revealed an architectural
property the SVD was providing that a naive `Linear` replacement doesn't.

**Round 1 — baseline.** `r=NaN` at iteration 899. Adaptive gradient clipping
(`clip=max(recon_loss, 1.0)`) in the original trainer assumes recon_loss is
architecturally bounded. Without the SVD's implicit magnitude bound, recon
blows up, the clip threshold blows up with it, and protection fails.
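
The failure mode is easy to see in isolation. The sketch below uses the `max(recon_loss, 1.0)` rule quoted above; the helper names and the gradient norm of `1e9` are hypothetical values chosen only to illustrate the effect.

```python
def adaptive_clip_threshold(recon_loss: float) -> float:
    """Clip threshold from Round 1: clip = max(recon_loss, 1.0)."""
    return max(recon_loss, 1.0)


def clipped_grad_norm(grad_norm: float, threshold: float) -> float:
    """Norm of the gradient after norm-clipping at `threshold`."""
    return min(grad_norm, threshold)


# Hypothetical gradient norm during a blow-up (illustrative value only).
exploding_grad_norm = 1e9

for recon_loss in (0.5, 10.0, 1e6, 1e12):
    adaptive = adaptive_clip_threshold(recon_loss)
    after_adaptive = clipped_grad_norm(exploding_grad_norm, adaptive)
    after_fixed = clipped_grad_norm(exploding_grad_norm, 1.0)
    print(
        f"recon_loss={recon_loss:.1e}  clip={adaptive:.1e}  "
        f"adaptive -> {after_adaptive:.1e}  fixed(1.0) -> {after_fixed:.1e}"
    )
```

Once the loss exceeds the gradient norm, the adaptive threshold no longer clips anything, whereas a fixed threshold keeps the update bounded regardless of the loss.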

**Round 2 — LayerNorm on omega, orthogonal init gain=0.5, fixed grad_clip=1.0.**
`r=3.2e11` before NaN. LayerNorm bounded omega but did nothing for M_hat. The
decoder can push `inverse_readout` to amplify freely to match heavy-tailed
noise values (Cauchy `tan(π·0.495) ≈ 63.7`, exponential `-log(tiny) ≈ 13+`), and
the unconstrained Linear output cubically amplifies during training.
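
The tail magnitudes can be checked directly. This sketch assumes the noise generators use inverse-CDF sampling from a uniform variable, which the formulas above suggest but the document does not state; the function names and the probe values `0.995` and `1e-6` are illustrative.

```python
import math


def cauchy_from_uniform(u: float) -> float:
    """Inverse-CDF sample of a standard Cauchy: tan(pi * (u - 0.5))."""
    return math.tan(math.pi * (u - 0.5))


def exponential_from_uniform(u: float) -> float:
    """Unit-rate exponential sample from u in (0, 1]: -log(u)."""
    return -math.log(u)


cauchy_tail = cauchy_from_uniform(0.995)            # tan(pi * 0.495)
exponential_tail = exponential_from_uniform(1e-6)   # -log(tiny)

print(f"Cauchy at u=0.995:      {cauchy_tail:.2f}")
print(f"Exponential at u=1e-6:  {exponential_tail:.2f}")
```

A uniform draw only half a percent from the edge already yields a Cauchy target more than sixty times larger than a unit-norm row, which is the scale the inverse readout is pushed to reproduce.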

**Round 3 — added sphere-norm on M_hat after inverse_readout.** Forward is
now stable, but eval MSE is 2.3e11 and recon_ema goes NaN. Sphere-norming
M_hat puts the decoder input on the same manifold as the canonical's
reconstruction, but strips reconstruction-magnitude information. The decoder
must hallucinate 63× amplification from unit-magnitude matrices to match
Cauchy targets, which it cannot do.

**Round 4 — gradient-flowing Cayley-Menger loss.** This is the first
implementation with a plausible mechanism. In the canonical, CV of pentachoron
volumes is measured with `.item()` stripping the gradient — it's a readout.
In Round 4, `cv_loss_differentiable` is added, computing CV across the batch
with full gradient flow, penalizing quadratic distance from the 0.215 target
(center of the 0.20–0.23 universal band). Weighted 20.0 during ALGN epochs
(geometry first) and 10.0 during HAND epochs (geometry locked). Applied to
every M matrix in every batch — the encoder has no place to hide.
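
A gradient-flowing CV term could look roughly like the sketch below. It is not the trainer's `cv_loss_differentiable`: the function names, the number of sampled simplices, the shared row sampling across the batch, and the clamp epsilon are all assumptions. Only the 0.215 target, the quadratic penalty, and the 20.0 weight come from the description above.

```python
import torch
import torch.nn.functional as F


def pentachoron_volume_sq(points: torch.Tensor) -> torch.Tensor:
    """Squared 4-simplex volume via the Cayley-Menger determinant.

    points: (..., 5, D) -> (...,)
    """
    diff = points.unsqueeze(-2) - points.unsqueeze(-3)    # (..., 5, 5, D)
    d2 = diff.pow(2).sum(dim=-1)                           # squared pairwise distances
    # Border the distance matrix with ones, then zero the top-left corner.
    cm = F.pad(d2, (1, 0, 1, 0), value=1.0)
    corner = torch.zeros(6, 6, dtype=points.dtype, device=points.device)
    corner[0, 0] = 1.0
    cm = cm - corner
    # For a 4-simplex: vol^2 = -det(CM) / (2^4 * (4!)^2) = -det(CM) / 9216.
    return -torch.linalg.det(cm) / 9216.0


def cv_loss_sketch(M: torch.Tensor, target: float = 0.215, n_simplices: int = 32) -> torch.Tensor:
    """Quadratic penalty on the CV (std / mean) of sampled pentachoron volumes."""
    num_rows = M.shape[1]
    # Random 5-row subsets, shared across the batch (an assumed sampling choice).
    idx = torch.stack([torch.randperm(num_rows)[:5] for _ in range(n_simplices)])
    simplices = M[:, idx, :]                               # (B, K, 5, D)
    volumes = pentachoron_volume_sq(simplices).clamp_min(1e-12).sqrt()
    cv = volumes.std() / volumes.mean()
    return (cv - target) ** 2                              # no .item(): gradient is kept


torch.manual_seed(0)
raw = torch.randn(4, 64, 16, requires_grad=True)           # stand-in for encoder output
M = F.normalize(raw, dim=-1)
loss = 20.0 * cv_loss_sketch(M)                            # ALGN-epoch weight
loss.backward()
print(loss.item(), raw.grad.abs().max().item())
```

The only structural difference from a readout-style measurement is that nothing is detached, so the penalty reaches the encoder through `M`.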

Round 4 was in the training file but not run before shelving. The session
ended with a design-level observation:

> It has to hit everything that passes through the linear sector.

The realization is that the CV force as applied only covered one Linear
(the readout bottleneck). The full geometric discipline needs to cover
everything downstream that carries omega information.

## What the four rounds actually taught us

These are the load-bearing architectural properties of the SVD path that
need explicit replacement in any NoSVD design:

1. **Unitary U and V^T bound |M_hat|.** In the canonical, `|M_hat|_F = |S|_2`
   because U and V^T are orthogonal. Any learned inverse must be bounded by
   construction (sphere-norm is one way; RMS-norm with learned gain is
   another; but both fight against magnitude reconstruction).

2. **S magnitude is proportional to input magnitude.** This is the property
   that lets the canonical handle heavy-tailed noise. Sphere-norming M kills
   magnitude per-row, but S recovers the per-matrix magnitude as the
   singular values. Any learned readout that normalizes loses this.

3. **The SVD factorization is exact and input-agnostic.** Sphere-normed
   points on S^(D-1) always admit an SVD, unique up to sign when the singular
   values are distinct. The learned readout is not
   input-agnostic; it must learn to read, and what it learns to read from
   Cauchy-driven matrices is not the same as what it learns from Gaussian-
   driven matrices.

4. **Gradient-flowing CM is a partial replacement for #3** (input-agnostic
   geometric structure), but it has to be applied wherever downstream Linear
   operations carry omega information. A single bottleneck Linear with CM
   discipline is not enough; the whole inverse/decoder pathway needs
   geometric control.
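
Property 1 can be verified numerically. In this sketch the perturbed `S_coord` stands in for whatever cross-attention would return, and the random matrix `W` stands in for an unconstrained learned inverse; both are assumptions made only for illustration, as are the sizes V=64 and D=16.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # assumed sizes

raw = rng.normal(size=(V, D))
M = raw / np.linalg.norm(raw, axis=1, keepdims=True)   # sphere-norm
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Stand-in for coordinated singular values after cross-attention.
S_coord = S * rng.uniform(0.5, 1.5, size=D)

# Canonical inverse: orthonormal U and Vt cannot change the norm.
M_hat = U @ np.diag(S_coord) @ Vt
fro_norm = np.linalg.norm(M_hat)        # Frobenius norm of M_hat
s_norm = np.linalg.norm(S_coord)        # 2-norm of S_coord

# Unconstrained linear inverse: the output norm follows the weight scale.
W = rng.normal(size=(V * D, D))
linear_norms = [np.linalg.norm((scale * W) @ S_coord) for scale in (1.0, 10.0, 100.0)]

print(f"|M_hat|_F = {fro_norm:.6f}, |S_coord|_2 = {s_norm:.6f}")
print("linear inverse norms:", [f"{n:.1f}" for n in linear_norms])
```

The SVD inverse is bounded by the coordinates it receives, while the linear inverse has no bound other than the size of its weights.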

## What's next — proper research direction

The ablation as built is not the right experiment. It's "SVAE minus SVD,"
which treats the SVD as a swappable component in an architecture designed
around it. That's the wrong framing.

The right framing: if you don't have the SVD's factorization, you have an
autoencoder. Autoencoders have their own stability toolkit — KL-divergence
regularization, explicit bottleneck embedding, reparameterization tricks
— and you should use it.

A serious NoSVD successor should include:

**Proper VAE machinery.** Not "replace SVD with Linear, keep SVAE shape."
Rebuild as a VAE with:
- `μ, logσ = encoder(patch) → (D,), (D,)` — explicit learned distribution
- `z = μ + σ · ε` — reparameterized sample
- KL regularization `D_KL(q(z|x) || N(0,I))` — standard VAE discipline
- Decoder from `z` back to patch via `Linear(D → V·D) → MLP decoder`

The omega tokens here are `z` samples or `μ` values — learned latents, not
spectral coordinates. Different object, different claims, honest framing.
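
A minimal sketch of that machinery follows. The class name, `patch_dim=48`, `hidden=128`, the MSE reconstruction term, and the KL weight of `1e-3` are placeholders chosen for illustration, not values from the trainer.

```python
import torch
import torch.nn as nn


def kl_to_standard_normal(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """D_KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    return 0.5 * (torch.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma).sum(dim=-1).mean()


class PatchVAESketch(nn.Module):
    """Hypothetical patch VAE: encoder -> (mu, log sigma) -> z -> decoder."""

    def __init__(self, patch_dim: int = 48, v: int = 64, d: int = 16, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.GELU(), nn.Linear(hidden, 2 * d)
        )
        # Linear(D -> V*D) followed by a small MLP back to the patch.
        self.decoder = nn.Sequential(
            nn.Linear(d, v * d), nn.GELU(), nn.Linear(v * d, patch_dim)
        )

    def forward(self, patch: torch.Tensor):
        mu, log_sigma = self.encoder(patch).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterized sample
        return self.decoder(z), kl_to_standard_normal(mu, log_sigma)


torch.manual_seed(0)
model = PatchVAESketch()
patch = torch.randn(8, 48)
recon, kl = model(patch)
loss = nn.functional.mse_loss(recon, patch) + 1e-3 * kl   # placeholder KL weight
loss.backward()
print(recon.shape, kl.item())
```

The KL term plays the stabilizing role that the SVD's implicit magnitude bound played in the canonical model, but it does so as a soft penalty rather than by construction.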

**Bottleneck embedding with capacity.** The ablation's `Linear(V·D → D)` is
a 1024→16 projection with no intermediate substrate. A proper bottleneck
would use `Linear → GELU → Linear` with a hidden dimension that lets the
MLP learn a meaningful projection. This is standard VAE practice; the
SVAE didn't need it because sphere-norm + SVD already provided the
projection discipline.
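
For a sense of scale, the sketch below compares the direct projection with a two-layer bottleneck. The hidden width of 256 is an arbitrary illustrative choice, and V=64 is inferred from the 1024→16 figure above.

```python
import torch
import torch.nn as nn

V, D, HIDDEN = 64, 16, 256  # HIDDEN is an assumed width

# The ablation's readout: a single 1024 -> 16 projection.
direct = nn.Linear(V * D, D)

# A bottleneck with an intermediate substrate.
bottleneck = nn.Sequential(
    nn.Linear(V * D, HIDDEN),
    nn.GELU(),
    nn.Linear(HIDDEN, D),
)


def n_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


x = torch.randn(8, V * D)
print("direct:    ", n_params(direct), tuple(direct(x).shape))
print("bottleneck:", n_params(bottleneck), tuple(bottleneck(x).shape))
```

Both map to the same D-dimensional output; the bottleneck spends its extra parameters on a nonlinear intermediate representation that the single projection lacks.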

**Per-sector Cayley-Menger discipline.** If the goal is to make every Linear
in the omega pathway produce geometrically-disciplined outputs, CM loss must
be applied at every stage, not just at the encoder output. This is feasible
but serious engineering — it's a new architectural idea, not a drop-in.

**Independent of SVAE naming/structure.** The result is not "PatchSVAE without
SVD." It's a new VAE family that uses geometric discipline as a regularizer.
Name it something else. Compare it to SVAE as peers, not as child-of-parent.

Estimated effort: a focused week for a first working prototype, longer for
proper characterization. Shelved here pending that dedicated time.

## Files

- `johanna_F_nosvd_trainer.py` — final state of the ablation trainer after
  four debug rounds. Standalone (no imports from canonical F-class trainer).
  Independent HF repo configured: `AbstractPhil/geolip-svae-nosvd-ablation`.

## What to read if resuming

1. This document. Start here.
2. The parent [geolip-svae README](https://huggingface.co/AbstractPhil/geolip-SVAE)
   for architectural context on what you're replacing.
3. The [F-class batteries README](https://huggingface.co/AbstractPhil/geolip-svae-batteries)
   for the framework the ablation was meant to validate against.
4. The omega tokens blog post, for the self-solving frame that motivated
   the ablation in the first place.

Do not resume this as "finish debugging the Linear readout." Resume as
"design the proper VAE successor." The four rounds of debugging already
told you why the direct replacement doesn't work.