---
license: apache-2.0
---
# Potential foundational piece
This is a different format of VAE that specifically targets CV for autoencoding. The model is only preliminary and requires many supporting utilities to be
in place before it can function.
This model will likely run entirely on KL_DIV and standard AE structure, with one caveat: it is geometrically aligned on
principles similar to the SVAE's. It also carries an internal subsystem dedicated to adjudication through a series of
embedding arrays, each aligned on the CV spectrum.
This system is essentially a CV battery container that handles hundreds of miniature SVD-trained batteries, implanted directly
into the substructure as learning starting points. The early design shows promise for rapid learning transfer.
This model **will not require FP64 SVD to train**, and it will be almost entirely linear upon completion. That makes it a single param-heavy model
rather than the combination of model shapes I've been cobbling together up to this point.
# geolip-svae-nosvd-ablation
**Status: shelved pending proper redesign. See "What's next" section.**
An ablation study exploring whether the SVD in PatchSVAE can be replaced by
a learned linear readout. The short answer is *not directly*, and the long
answer is a list of architectural properties the SVD was providing implicitly
that any replacement must supply explicitly.
This repo exists to preserve the experiment and its findings for future
work. The parent architecture lives at
[geolip-svae](https://huggingface.co/AbstractPhil/geolip-SVAE) and
[geolip-svae-batteries](https://huggingface.co/AbstractPhil/geolip-svae-batteries).
## The motivating realization
During F-class sweep analysis, we articulated a claim that reframed what
PatchSVAE is doing:
**The SVD is a readout, not a decomposer. The encoder + sphere-normalization
is the decomposer.**
The argument:
1. The encoder MLP projects a patch into a V×D matrix space
2. Sphere-normalization (one line, zero parameters) constrains every row to
S^(D-1) — the unit hypersphere in D dimensions
3. The SVD is then exact arithmetic on V points on S^(D-1). Given V unit
vectors in D-space, the factorization U·Σ·V^T is unique up to sign whenever
the singular values are distinct
4. Cross-attention is 0.013% of parameters with alpha coefficients that
barely move during training — per-patch SVD already produces correct
coordinates; cross-attn is verification, not coordination
Under this frame, omega tokens are not a learned compressed representation.
They are **coordinates on the universal S^(D-1) packing manifold**. The
universal attractor (S₀ ≈ 5.1, erank ≈ 15.88 at D=16; CV 0.20–0.23 band) is
a geometric property of "V unit vectors packed as evenly as possible on
S^(D-1)," not a learned statistic. The encoder discovers projections onto
that fixed manifold. The manifold is fixed by the architecture.
If this is right, the SVD should be replaceable by any mechanism that reads
D-dim coordinates off a sphere-normalized V×D matrix. A learned
`Linear(V·D → D)` should work. This repo tests that hypothesis.
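The sphere-normalize-then-SVD steps from the argument above can be sketched as follows. This is a minimal illustration, not the canonical PatchSVAE code: the sizes `V = 64` and `D = 16` are inferred from the 1024→16 readout mentioned later in this README, and a random tensor stands in for the encoder MLP output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 64, 16  # assumed sizes (V * D = 1024)


def sphere_normalize(m: torch.Tensor) -> torch.Tensor:
    """Project every row of a (V, D) matrix onto the unit sphere S^(D-1)."""
    return F.normalize(m, dim=-1)


def svd_readout(m: torch.Tensor):
    """Thin SVD: U is (V, D), S is (D,), Vt is (D, D)."""
    return torch.linalg.svd(m, full_matrices=False)


encoder_out = torch.randn(V, D)   # stand-in for the encoder MLP output
M = sphere_normalize(encoder_out)
U, S, Vt = svd_readout(M)
M_rebuilt = U @ torch.diag(S) @ Vt

print(M.norm(dim=-1)[:4])              # every row has unit length
print((M - M_rebuilt).abs().max())     # reconstruction error at float32 precision
```

The readout step has no parameters: once the rows sit on the sphere, the factorization is plain linear algebra.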
## What the ablation actually is
| Stage | Canonical PatchSVAE | NoSVD ablation |
|---|---|---|
| Encoder | MLP → V·D flat | same |
| Sphere norm | F.normalize(dim=-1) on V×D reshape | same |
| Readout | `U,S,Vt = svd(M)` → omega is S | `omega = Linear(V·D, D)(M.flatten())` |
| Cross-attention | on S across patches | on omega across patches |
| Inverse readout | `M_hat = U @ diag(S_coord) @ Vt` | `M_hat = sphere_norm(Linear(D, V·D)(omega_coord))` |
| Decoder | MLP from V·D flat | same |
Everything else — encoder, decoder, cross-attention logic, boundary smoothing,
CV-EMA soft-hand, 16-type noise training, 30 epochs — is identical to the
F-class trainer.
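The two readout columns in the table can be sketched side by side. This is a hedged reconstruction from the table alone, not the code in `johanna_F_nosvd_trainer.py`: the class names, the default sizes, and the batched `(B, V, D)` layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 64, 16  # assumed sizes (V * D = 1024)


class SVDReadout(nn.Module):
    """Canonical path: parameter-free readout, omega is S."""

    def forward(self, M: torch.Tensor):
        return torch.linalg.svd(M, full_matrices=False)  # U, S, Vt

    def inverse(self, U, S_coord, Vt):
        return U @ torch.diag_embed(S_coord) @ Vt


class LinearReadout(nn.Module):
    """Ablation path: learned Linear in both directions."""

    def __init__(self, v: int = V, d: int = D):
        super().__init__()
        self.v, self.d = v, d
        self.readout = nn.Linear(v * d, d)
        self.inverse_readout = nn.Linear(d, v * d)

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        return self.readout(M.flatten(start_dim=-2))

    def inverse(self, omega_coord: torch.Tensor) -> torch.Tensor:
        m_hat = self.inverse_readout(omega_coord)
        m_hat = m_hat.reshape(*omega_coord.shape[:-1], self.v, self.d)
        return F.normalize(m_hat, dim=-1)  # sphere-norm on M_hat


M = F.normalize(torch.randn(2, V, D), dim=-1)
U, S, Vt = SVDReadout()(M)
omega = LinearReadout()(M)
print(S.shape, omega.shape)  # both are (2, 16)
```

Both paths hand a D-dimensional vector per patch to cross-attention. Only the SVD path returns the factors `U` and `Vt` that make the inverse exact.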
## What happened
Four debug rounds before shelving. Each round revealed an architectural
property the SVD was providing that a naive `Linear` replacement doesn't.
**Round 1 — baseline.** `r=NaN` at iteration 899. Adaptive gradient clipping
(`clip=max(recon_loss, 1.0)`) in the original trainer assumes recon_loss is
architecturally bounded. Without the SVD's implicit magnitude bound, recon
blows up, the clip threshold blows up with it, and protection fails.
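The Round 1 failure mode can be illustrated in isolation. In this sketch the gradient and the loss value are made-up numbers, and `grad_norm_after_clip` is a hypothetical helper. Only `torch.nn.utils.clip_grad_norm_` and the `max(recon_loss, 1.0)` rule come from the setup described above.

```python
import torch


def grad_norm_after_clip(raw_grad: torch.Tensor, max_norm: float) -> float:
    """Clip a single parameter's gradient and return the resulting norm."""
    param = torch.nn.Parameter(torch.zeros_like(raw_grad))
    param.grad = raw_grad.clone()
    torch.nn.utils.clip_grad_norm_([param], max_norm=max_norm)
    return param.grad.norm().item()


raw_grad = torch.full((100,), 1000.0)  # gradient norm of 1e4
recon_loss = 1.0e6                     # hypothetical diverging reconstruction loss

# Adaptive rule: the threshold follows the loss, so nothing is clipped.
adaptive = grad_norm_after_clip(raw_grad, max_norm=max(recon_loss, 1.0))
# Fixed rule (used from Round 2 onward): the gradient is scaled to norm ~1.
fixed = grad_norm_after_clip(raw_grad, max_norm=1.0)

print(adaptive, fixed)
```

Once the loss exceeds the gradient norm, the adaptive threshold no longer constrains the update at all.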
**Round 2 — LayerNorm on omega, orthogonal init gain=0.5, fixed grad_clip=1.0.**
`r=3.2e11` before NaN. LayerNorm bounded omega but did nothing for M_hat. The
decoder can push `inverse_readout` to amplify freely to match heavy-tailed
noise values (Cauchy `tan(π·0.495) ≈ 63.7`, exponential `-log(tiny) ≈ 13+`), and
the unconstrained Linear output cubically amplifies during training.
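For a sense of scale, the heavy-tailed targets can be computed from the standard inverse-CDF formulas. This sketch assumes the noise is drawn by inverse-transform sampling from a uniform variate, which the README does not state. The function names are hypothetical.

```python
import math


def cauchy_quantile(p: float) -> float:
    """Inverse CDF of the standard Cauchy distribution."""
    return math.tan(math.pi * (p - 0.5))


def exponential_from_uniform(u: float) -> float:
    """Inverse-transform sample of a unit-rate exponential from u in (0, 1]."""
    return -math.log(u)


print(cauchy_quantile(0.995))           # about 63.66
print(exponential_from_uniform(1e-6))   # about 13.82
```

A decoder fed unit-scale inputs has to produce outputs of this size to match such targets.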
**Round 3 — added sphere-norm on M_hat after inverse_readout.** Forward is
now stable, but eval MSE is 2.3e11 and recon_ema goes NaN. Sphere-norming
M_hat puts the decoder input on the same manifold as the canonical's
reconstruction, but strips reconstruction-magnitude information. The decoder
must hallucinate 63× amplification from unit-magnitude matrices to match
Cauchy targets, which it cannot do.
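The loss of magnitude information in Round 3 is easy to verify directly. In this small sketch the scale factor 63 echoes the Cauchy figure above and the matrix is random.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 64, 16  # assumed sizes

raw = torch.randn(V, D)
unit_scale = F.normalize(raw, dim=-1)
large_scale = F.normalize(63.0 * raw, dim=-1)

# Row-wise normalization is invariant to positive rescaling, so the
# decoder receives the same input for both matrices.
print(torch.allclose(unit_scale, large_scale, atol=1e-5))
```

Whatever scale the inverse readout produced, the decoder never sees it.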
**Round 4 — gradient-flowing Cayley-Menger loss.** This is the first
implementation with a plausible mechanism. In the canonical, CV of pentachoron
volumes is measured with `.item()` stripping the gradient — it's a readout.
In Round 4, `cv_loss_differentiable` is added. It computes CV across the batch
with full gradient flow and penalizes the squared distance from the 0.215 target
(center of the 0.20–0.23 universal band). Weighted 20.0 during ALGN epochs
(geometry first) and 10.0 during HAND epochs (geometry locked). Applied to
every M matrix in every batch — the encoder has no place to hide.
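One way a gradient-flowing Cayley-Menger CV loss could look is sketched below. This is not the `cv_loss_differentiable` from the trainer, whose body is not reproduced in this README. The random 5-row sampling scheme, `n_samples`, and the function names are all assumptions. Only the 0.215 target and the pentachoron (4-simplex) volume come from the text.

```python
import torch
import torch.nn.functional as F


def pentachoron_volumes(points: torch.Tensor) -> torch.Tensor:
    """Volumes of 4-simplices from (N, 5, D) vertex sets via Cayley-Menger."""
    diff = points[:, :, None, :] - points[:, None, :, :]
    d2 = diff.pow(2).sum(-1)                      # (N, 5, 5) squared distances
    cm = torch.ones(points.shape[0], 6, 6, dtype=points.dtype, device=points.device)
    cm[:, 0, 0] = 0.0
    cm[:, 1:, 1:] = d2                            # bordered distance matrix
    vol_sq = -torch.linalg.det(cm) / 9216.0       # (-1)^5 / (2^4 * (4!)^2)
    return vol_sq.clamp_min(1e-12).sqrt()


def cv_loss_sketch(M: torch.Tensor, target: float = 0.215, n_samples: int = 64) -> torch.Tensor:
    """Squared distance between the batch CV of simplex volumes and the target.

    M is a (B, V, D) batch of sphere-normalized matrices.
    """
    B, V, _ = M.shape
    # Draw n_samples random 5-row subsets per matrix (no repeats within a subset).
    idx = torch.rand(B, n_samples, V, device=M.device).argsort(dim=-1)[..., :5]
    batch_idx = torch.arange(B, device=M.device)[:, None, None]
    simplices = M[batch_idx, idx]                 # (B, n_samples, 5, D)
    vols = pentachoron_volumes(simplices.flatten(0, 1))
    cv = vols.std() / (vols.mean() + 1e-8)
    return (cv - target).pow(2)                   # no .item(), so gradients flow


M = F.normalize(torch.randn(4, 64, 16), dim=-1).requires_grad_(True)
loss = 20.0 * cv_loss_sketch(M)                   # ALGN-epoch weight from the text
loss.backward()
print(loss.item(), M.grad.abs().max().item())
```

The only difference from a readout-style CV measurement is that the value is never detached, so the encoder receives a gradient from the geometry term.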
Round 4 was in the training file but not run before shelving. The session
ended with a design-level observation:
> It has to hit everything that passes through the linear sector.
The realization is that the CV force as applied only covered one Linear
(the readout bottleneck). The full geometric discipline needs to cover
everything downstream that carries omega information.
## What the four rounds actually taught us
These are the load-bearing architectural properties of the SVD path that
need explicit replacement in any NoSVD design:
1. **Unitary U and V^T bound |M_hat|.** In the canonical, `|M_hat|_F = |S|_2`
because U and V^T are orthogonal. Any learned inverse must be bounded by
construction (sphere-norm is one way; RMS-norm with learned gain is
another; but both fight against magnitude reconstruction).
2. **S magnitude is proportional to input magnitude.** This is the property
that lets the canonical handle heavy-tailed noise. Sphere-norming M kills
magnitude per-row, but S recovers the per-matrix magnitude as the
singular values. Any learned readout that normalizes loses this.
3. **The SVD factorization is exact and input-agnostic.** Sphere-normed
points on S^(D-1) always admit an SVD, unique up to sign when the singular
values are distinct. The learned readout is not
input-agnostic; it must learn to read, and what it learns to read from
Cauchy-driven matrices is not the same as what it learns from Gaussian-
driven matrices.
4. **Gradient-flowing CM is a partial replacement for #3** (input-agnostic
geometric structure), but it has to apply everywhere that downstream Linear
operations carry omega information. A single bottleneck Linear with CM
discipline is not enough; the whole inverse/decoder pathway needs
geometric control.
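Property 1 can be checked numerically. In this sketch `S_coord` is a random stand-in for the cross-attention output and the sizes are assumed. The unconstrained weight matrix at the end is a hypothetical contrast, not the trainer's `inverse_readout`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 64, 16  # assumed sizes

M = F.normalize(torch.randn(V, D), dim=-1)
U, S, Vt = torch.linalg.svd(M, full_matrices=False)

S_coord = 3.0 * torch.rand(D)             # stand-in for coordinated singular values
M_hat = U @ torch.diag(S_coord) @ Vt

# Orthonormal U columns and orthogonal Vt preserve the norm of S_coord.
print(M_hat.norm().item(), S_coord.norm().item())

# Contrast: an unconstrained linear inverse scales with its weights.
W = torch.randn(V * D, D)
small = (W @ S_coord).norm()
large = ((10.0 * W) @ S_coord).norm()
print((large / small).item())             # about 10: nothing bounds the output
```

Any learned inverse has to reintroduce the bound that the orthogonal factors supply automatically.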
## What's next — proper research direction
The ablation as built is not the right experiment. It's "SVAE minus SVD,"
which treats the SVD as a swappable component in an architecture designed
around it. That's the wrong framing.
The right framing: if you don't have the SVD's factorization, you have an
autoencoder. Autoencoders have their own stability toolkit — KL-divergence
regularization, explicit bottleneck embedding, reparameterization tricks
— and you should use it.
A serious NoSVD successor should include:
**Proper VAE machinery.** Not "replace SVD with Linear, keep SVAE shape."
Rebuild as a VAE with:
- `μ, logσ = encoder(patch) → (D,), (D,)` — explicit learned distribution
- `z = μ + σ · ε` — reparameterized sample
- KL regularization `D_KL(q(z|x) || N(0,I))` — standard VAE discipline
- Decoder from `z` back to patch via `Linear(D → V·D) → MLP decoder`
The omega tokens here are `z` samples or `μ` values — learned latents, not
spectral coordinates. Different object, different claims, honest framing.
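The four bullets above map onto a small module like the one below. This is a sketch of standard VAE machinery under assumed sizes (`patch_dim=768`, `hidden=256`, `V=64`, `D=16`). The class and function names are hypothetical and nothing here has been trained or validated.

```python
import torch
import torch.nn as nn


def kl_to_standard_normal(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """D_KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over batch."""
    sigma_sq = (2.0 * log_sigma).exp()
    return 0.5 * (sigma_sq + mu.pow(2) - 1.0 - 2.0 * log_sigma).sum(-1).mean()


class PatchVAESketch(nn.Module):
    def __init__(self, patch_dim: int = 768, v: int = 64, d: int = 16, hidden: int = 256):
        super().__init__()
        # Encoder emits mu and log sigma, each of size D.
        self.encoder = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.GELU(), nn.Linear(hidden, 2 * d)
        )
        # Decoder: Linear(D -> V*D) followed by an MLP back to the patch.
        self.expand = nn.Linear(d, v * d)
        self.decoder = nn.Sequential(
            nn.GELU(), nn.Linear(v * d, hidden), nn.GELU(), nn.Linear(hidden, patch_dim)
        )

    def forward(self, patch: torch.Tensor):
        mu, log_sigma = self.encoder(patch).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterized sample
        recon = self.decoder(self.expand(z))
        return recon, kl_to_standard_normal(mu, log_sigma), z


model = PatchVAESketch()
patch = torch.randn(8, 768)
recon, kl, z = model(patch)
loss = (recon - patch).pow(2).mean() + 1e-3 * kl          # KL weight is arbitrary here
print(recon.shape, z.shape, kl.item())
```

The KL term does the job the SVD's implicit magnitude bound did in the canonical model: it keeps the latent from drifting to arbitrary scale.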
**Bottleneck embedding with capacity.** The ablation's `Linear(V·D → D)` is
a 1024→16 projection with no intermediate substrate. A proper bottleneck
would use `Linear → GELU → Linear` with a hidden dimension that lets the
MLP learn a meaningful projection. This is standard VAE practice; the
SVAE didn't need it because sphere-norm + SVD already provided the
projection discipline.
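The difference between the ablation's direct projection and a bottleneck with capacity is small in code. The hidden width of 256 is an assumption. `V = 64` follows from the 1024→16 figure with `D = 16`.

```python
import torch
import torch.nn as nn

V, D, HIDDEN = 64, 16, 256  # HIDDEN is an assumed width

# The ablation's readout: a single 1024 -> 16 projection.
direct = nn.Linear(V * D, D)

# A bottleneck with an intermediate substrate.
bottleneck = nn.Sequential(
    nn.Linear(V * D, HIDDEN),
    nn.GELU(),
    nn.Linear(HIDDEN, D),
)


def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


x = torch.randn(2, V * D)
print(direct(x).shape, bottleneck(x).shape)
print(n_params(direct), n_params(bottleneck))
```

The nonlinearity matters more than the extra parameters: without it, two stacked Linear layers collapse back into a single 1024→16 map.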
**Per-sector Cayley-Menger discipline.** If the goal is to make every Linear
in the omega pathway produce geometrically-disciplined outputs, CM loss must
be applied at every stage, not just at the encoder output. This is feasible
but serious engineering — it's a new architectural idea, not a drop-in.
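A rough sketch of what "CM loss at every stage" could mean in practice follows. Everything here is hypothetical: the pathway layout, viewing each stage's output as a sphere-normalized `(V, D)` matrix, the sampling scheme, and the names. It only shows the wiring, not a validated design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cv_of_pentachora(M: torch.Tensor, n_samples: int = 32) -> torch.Tensor:
    """CV of Cayley-Menger 4-simplex volumes over random 5-row subsets of (B, V, D)."""
    B, V, _ = M.shape
    idx = torch.rand(B, n_samples, V, device=M.device).argsort(dim=-1)[..., :5]
    pts = M[torch.arange(B, device=M.device)[:, None, None], idx].flatten(0, 1)
    d2 = (pts[:, :, None, :] - pts[:, None, :, :]).pow(2).sum(-1)
    cm = torch.ones(pts.shape[0], 6, 6, dtype=M.dtype, device=M.device)
    cm[:, 0, 0] = 0.0
    cm[:, 1:, 1:] = d2
    vols = (-torch.linalg.det(cm) / 9216.0).clamp_min(1e-12).sqrt()
    return vols.std() / (vols.mean() + 1e-8)


class OmegaPathwaySketch(nn.Module):
    """Stack of Linear stages, each contributing its own geometric penalty."""

    def __init__(self, v: int = 64, d: int = 16, n_stages: int = 3):
        super().__init__()
        self.v, self.d = v, d
        dims = [d] + [v * d] * n_stages
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(n_stages)
        )

    def forward(self, omega: torch.Tensor, target: float = 0.215):
        geo_loss = omega.new_zeros(())
        x = omega
        for i, layer in enumerate(self.layers):
            x = layer(x if i == 0 else F.gelu(x))
            # View this stage's output as a sphere-normalized (V, D) matrix.
            M = F.normalize(x.reshape(-1, self.v, self.d), dim=-1)
            geo_loss = geo_loss + (cv_of_pentachora(M) - target).pow(2)
        return x, geo_loss


pathway = OmegaPathwaySketch()
out, geo_loss = pathway(torch.randn(4, 16))
geo_loss.backward()
print(out.shape, geo_loss.item())
```

Because every stage adds its own term, each Linear in the pathway receives a geometry gradient rather than only the readout bottleneck.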
**Independent of SVAE naming/structure.** The result is not "PatchSVAE without
SVD." It's a new VAE family that uses geometric discipline as a regularizer.
Name it something else. Compare it to SVAE as peers, not as child-of-parent.
Estimated effort: a focused week for a first working prototype, longer for
proper characterization. Shelved here pending that dedicated time.
## Files
- `johanna_F_nosvd_trainer.py` — final state of the ablation trainer after
four debug rounds. Standalone (no imports from canonical F-class trainer).
Independent HF repo configured: `AbstractPhil/geolip-svae-nosvd-ablation`.
## What to read if resuming
1. This document. Start here.
2. The parent [geolip-svae README](https://huggingface.co/AbstractPhil/geolip-SVAE)
for architectural context on what you're replacing.
3. The [F-class batteries README](https://huggingface.co/AbstractPhil/geolip-svae-batteries)
for the framework the ablation was meant to validate against.
4. The omega tokens blog post for the self-solving frame that
motivated the ablation in the first place.
Do not resume this as "finish debugging the Linear readout." Resume as
"design the proper VAE successor." The four rounds of debugging already
told you why the direct replacement doesn't work.