	---
	license: apache-2.0
	---

	# Potential foundational piece

	This is a different kind of VAE that specifically targets the CV statistic for autoencoding. The model is only preliminary, and it needs a good deal of
	supporting infrastructure in place before it can function.

	The model will likely run entirely on KL divergence and standard autoencoder structure, with one caveat: it is geometrically aligned using principles
	similar to the SVAE. A dedicated internal subsystem handles adjudication through a series of embedding arrays, each aligned on the CV spectrum.

	In essence, the system is a CV battery container. It holds hundreds of miniature SVD-trained batteries that are implanted directly
	into the substructure as learning starting points. The early design shows promise for rapid learning transfer.

	The model will not require FP64 SVD to train and will be almost entirely linear once complete. That makes it a single, parameter-heavy model
	rather than the combination of model shapes I've been cobbling together up to this point.

	# geolip-svae-nosvd-ablation

	Status: shelved pending proper redesign. See "What's next" section.

	An ablation study exploring whether the SVD in PatchSVAE can be replaced by
	a learned linear readout. The short answer is "not directly." The long
	answer is a list of architectural properties that the SVD was providing implicitly
	and that any replacement must supply explicitly.

	This repo exists to preserve the experiment and its findings for future
	work. The parent architecture lives at
	[geolip-svae](https://huggingface.co/AbstractPhil/geolip-SVAE) and
	[geolip-svae-batteries](https://huggingface.co/AbstractPhil/geolip-svae-batteries).

	## The motivating realization

	During F-class sweep analysis, we articulated a claim that reframed what
	PatchSVAE is doing:

	**The SVD is a readout, not a decomposer. The encoder + sphere-normalization
	is the decomposer.**

	The argument:

	1. The encoder MLP projects a patch into a V×D matrix space
	2. Sphere-normalization (one line, zero parameters) constrains every row to
	S^(D-1) — the unit hypersphere in D dimensions
	3. The SVD is then exact arithmetic on V points on S^(D-1). Given V unit
	vectors in D-space, the factorization U·Σ·V^T is unique up to sign whenever the singular values are distinct
	4. Cross-attention is 0.013% of parameters with alpha coefficients that
	barely move during training — per-patch SVD already produces correct
	coordinates; cross-attn is verification, not coordination
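Steps 2 and 3 can be checked numerically. The following is a minimal NumPy sketch, not the trainer's code: the sizes `V = 64` and `D = 16` are illustrative, and a random matrix stands in for the encoder output. It confirms that the SVD reconstructs a sphere-normalized matrix to floating-point precision, and that the squared singular values sum to V because every row has unit norm.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # illustrative sizes, not taken from the trainer

# Stand-in for the encoder output: any V x D matrix will do here.
M_raw = rng.normal(size=(V, D))

# Sphere-normalization: scale every row to unit length, so each row lies on S^(D-1).
M = M_raw / np.linalg.norm(M_raw, axis=-1, keepdims=True)

# Thin SVD: U is V x D, S holds D singular values, Vt is D x D.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# The factorization reconstructs M up to floating-point error.
M_rebuilt = U @ np.diag(S) @ Vt
recon_error = float(np.max(np.abs(M - M_rebuilt)))

# Each row contributes 1 to the squared Frobenius norm, so sum(S^2) equals V.
sum_sq = float(np.sum(S ** 2))

print(f"max reconstruction error: {recon_error:.2e}")
print(f"sum of squared singular values: {sum_sq:.4f} (V = {V})")
```

Note that the symbol V is overloaded in the text: it is both the number of rows and the right singular factor. The sketch uses `Vt` for the factor to keep the two apart.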

	Under this frame, omega tokens are not a learned compressed representation.
	They are coordinates on the universal S^(D-1) packing manifold. The
	universal attractor (S₀ ≈ 5.1, erank ≈ 15.88 at D=16; CV 0.20–0.23 band) is
	a geometric property of "V unit vectors packed as evenly as possible on
	S^(D-1)," not a learned statistic. The encoder discovers projections onto
	that fixed manifold. The manifold is fixed by the architecture.

	If this is right, the SVD should be replaceable by any mechanism that reads
	D-dim coordinates off a sphere-normalized V×D matrix. A learned
	`Linear(V·D → D)` should work. This repo tests that hypothesis.

	## What the ablation actually is

	| Stage | Canonical PatchSVAE | NoSVD ablation |
	|---|---|---|
	| Encoder | MLP → V·D flat | same |
	| Sphere norm | F.normalize(dim=-1) on V×D reshape | same |
	| Readout | `U,S,Vt = svd(M)` → omega is S | `omega = Linear(V·D, D)(M.flatten())` |
	| Cross-attention | on S across patches | on omega across patches |
	| Inverse readout | `M_hat = U @ diag(S_coord) @ Vt` | `M_hat = sphere_norm(Linear(D, V·D)(omega_coord))` |
	| Decoder | MLP from V·D flat | same |

	Everything else — encoder, decoder, cross-attention logic, boundary smoothing,
	CV-EMA soft-hand, 16-type noise training, 30 epochs — is identical to the
	F-class trainer.
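The two readout paths in the table can be sketched side by side for a single patch. This is an illustration under stated assumptions rather than the trainer's implementation: the sizes are illustrative, random weights (`W_read`, `W_inv`, both hypothetical names) stand in for learned parameters, and cross-attention is skipped, so `omega` is passed straight to the inverse readout.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # illustrative sizes

def sphere_norm(x):
    """Scale each row to unit length (NumPy analogue of F.normalize(dim=-1))."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Shared front end: a sphere-normalized V x D matrix (random stand-in for the encoder).
M = sphere_norm(rng.normal(size=(V, D)))

# Canonical readout: omega is the vector of singular values.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
omega_svd = S
M_hat_svd = U @ np.diag(omega_svd) @ Vt

# NoSVD readout: Linear(V*D -> D) on the flattened matrix.
W_read = rng.normal(scale=0.02, size=(D, V * D))  # hypothetical weights
b_read = np.zeros(D)
omega_lin = W_read @ M.reshape(-1) + b_read

# NoSVD inverse readout: Linear(D -> V*D), reshape to V x D, then sphere-norm.
W_inv = rng.normal(scale=0.02, size=(V * D, D))   # hypothetical weights
b_inv = np.zeros(V * D)
M_hat_lin = sphere_norm((W_inv @ omega_lin + b_inv).reshape(V, D))

print(omega_svd.shape, omega_lin.shape)  # both have D entries
print(M_hat_svd.shape, M_hat_lin.shape)  # both are V x D
```

Both paths produce a D-dimensional omega and a V×D matrix for the decoder. The difference is that the SVD path reuses the input-dependent factors `U` and `Vt`, while the Linear path must encode everything in fixed weights.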

	## What happened

	Four debug rounds before shelving. Each round revealed an architectural
	property the SVD was providing that a naive `Linear` replacement doesn't.

	Round 1 — baseline. `r=NaN` at iteration 899. Adaptive gradient clipping
	(`clip=max(recon_loss, 1.0)`) in the original trainer assumes recon_loss is
	architecturally bounded. Without the SVD's implicit magnitude bound, recon
	blows up, the clip threshold blows up with it, and protection fails.
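The failure mode in Round 1 is easy to see in isolation. The sketch below uses plain Python with made-up loss and gradient-norm values (the `diverging_run` pairs are hypothetical, not logged data) to compare the adaptive rule with a fixed threshold of 1.0.

```python
def adaptive_clip_threshold(recon_loss: float) -> float:
    """Threshold rule quoted above: clip = max(recon_loss, 1.0)."""
    return max(recon_loss, 1.0)

def clipped_norm(grad_norm: float, threshold: float) -> float:
    """Gradient norm that survives norm-clipping at the given threshold."""
    return min(grad_norm, threshold)

# Hypothetical (recon_loss, raw gradient norm) pairs for a run that is diverging.
diverging_run = [(0.5, 2.0), (10.0, 50.0), (1e4, 1e5), (1e8, 1e9)]

for recon_loss, grad_norm in diverging_run:
    adaptive = clipped_norm(grad_norm, adaptive_clip_threshold(recon_loss))
    fixed = clipped_norm(grad_norm, 1.0)
    print(f"loss={recon_loss:10.1e}  adaptive clip -> {adaptive:10.1e}  fixed clip -> {fixed:.1f}")
```

Once the loss is unbounded, the adaptive threshold rises with it and the surviving gradient norm grows by the same factor, whereas the fixed threshold keeps it at 1.0.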

	Round 2 — LayerNorm on omega, orthogonal init gain=0.5, fixed grad_clip=1.0.
	`r=3.2e11` before NaN. LayerNorm bounded omega but did nothing for M_hat. The
	decoder can push `inverse_readout` to amplify freely to match heavy-tailed
	noise values (Cauchy `tan(π·0.49) ≈ 32`, exponential `-log(tiny) ≈ 13+`), and
	the unconstrained Linear output cubically amplifies during training.

	Round 3 — added sphere-norm on M_hat after inverse_readout. Forward is
	now stable, but eval MSE is 2.3e11 and recon_ema goes NaN. Sphere-norming
	M_hat puts the decoder input on the same manifold as the canonical's
	reconstruction, but strips reconstruction-magnitude information. The decoder
	must hallucinate roughly 32× amplification (`tan(π·0.49) ≈ 31.8`) from unit-magnitude matrices to match
	Cauchy targets, which it cannot do.
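The magnitude loss in Round 3 can be shown directly. In this minimal NumPy sketch (illustrative sizes, a random stand-in for the `inverse_readout` output, and an arbitrary `scale`), two pre-normalization matrices that differ only by a scalar factor become identical after sphere-norm, so the decoder receives no signal about which one it was given.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # illustrative sizes

def sphere_norm(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

pre_norm = rng.normal(size=(V, D))  # stand-in for the inverse_readout output
scale = 30.0                        # arbitrary factor, e.g. a heavy-tailed target

small = sphere_norm(pre_norm)
large = sphere_norm(scale * pre_norm)

# Row-wise normalization cancels any positive scalar factor.
same_after_norm = bool(np.allclose(small, large))
print(same_after_norm)
```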

	Round 4 — gradient-flowing Cayley-Menger loss. This is the first
	implementation with a plausible mechanism. In the canonical, CV of pentachoron
	volumes is measured with `.item()` stripping the gradient — it's a readout.
	In Round 4, `cv_loss_differentiable` is added, computing CV across the batch
	with full gradient flow, penalizing quadratic distance from the 0.215 target
	(center of the 0.20–0.23 universal band). Weighted 20.0 during ALGN epochs
	(geometry first) and 10.0 during HAND epochs (geometry locked). Applied to
	every M matrix in every batch — the encoder has no place to hide.
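As a rough illustration of the quantity being penalized, the following NumPy sketch computes pentachoron volumes with the Cayley-Menger determinant and applies a quadratic penalty around the 0.215 target. Several things here are assumptions, not the trainer's `cv_loss_differentiable`: the way pentachora are sampled (random 5-row subsets of M), the sample count, and the helper names. NumPy has no autograd, so this shows the forward arithmetic only. A gradient-flowing version would express the same steps with differentiable tensor ops.

```python
import numpy as np

def pentachoron_volume(points: np.ndarray) -> float:
    """Volume of the 4-simplex spanned by 5 points, via the Cayley-Menger determinant."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # (5, 5) squared pairwise distances
    cm = np.ones((6, 6))
    cm[0, 0] = 0.0
    cm[1:, 1:] = sq_dist
    # For a 4-simplex: vol^2 = -det(CM) / (2^4 * (4!)^2) = -det(CM) / 9216
    vol_sq = -np.linalg.det(cm) / 9216.0
    return float(np.sqrt(max(vol_sq, 0.0)))

def cv_of_volumes(M: np.ndarray, n_samples: int = 200, rng=None) -> float:
    """Coefficient of variation (std / mean) of volumes over random 5-row subsets of M."""
    if rng is None:
        rng = np.random.default_rng(0)
    vols = np.array([
        pentachoron_volume(M[rng.choice(M.shape[0], size=5, replace=False)])
        for _ in range(n_samples)
    ])
    return float(vols.std() / vols.mean())

def cv_penalty(M: np.ndarray, target: float = 0.215, weight: float = 20.0) -> float:
    """Quadratic distance of the CV from the target, scaled by the loss weight."""
    return weight * (cv_of_volumes(M) - target) ** 2

rng = np.random.default_rng(1)
M = rng.normal(size=(64, 16))                      # illustrative V x D matrix
M /= np.linalg.norm(M, axis=-1, keepdims=True)     # sphere-normalize the rows
penalty = cv_penalty(M)
print(f"CV = {cv_of_volumes(M):.3f}, penalty = {penalty:.4f}")
```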

	Round 4 was in the training file but not run before shelving. The session
	ended with a design-level observation:

	> It has to hit everything that passes through the linear sector.

	The realization is that the CV force as applied only covered one Linear
	(the readout bottleneck). The full geometric discipline needs to cover
	everything downstream that carries omega information.

	## What the four rounds actually taught us

	These are the load-bearing architectural properties of the SVD path that
	need explicit replacement in any NoSVD design:

	1. Unitary U and V^T bound ‖M_hat‖. In the canonical, `‖M_hat‖_F = ‖S‖_2`
	because U and V^T are orthogonal. Any learned inverse must be bounded by
	construction (sphere-norm is one way; RMS-norm with learned gain is
	another; but both fight against magnitude reconstruction).

	2. S magnitude is proportional to input magnitude. This is the property
	that lets the canonical handle heavy-tailed noise. Sphere-norming M kills
	magnitude per-row, but S recovers the per-matrix magnitude as the
	singular values. Any learned readout that normalizes loses this.

	3. The SVD factorization is exact and input-agnostic. Sphere-normed
	points on S^(D-1) always admit an SVD; the singular values are unique, and
	the factors are unique up to sign when those values are distinct. The
	learned readout is not input-agnostic; it must learn to read, and what it
	learns to read from Cauchy-driven matrices is not the same as what it
	learns from Gaussian-driven matrices.

	4. Gradient-flowing CM is a partial replacement for #3 (input-agnostic
	geometric structure), but it has to apply everywhere downstream Linear
	operations carry omega information. A single bottleneck Linear with CM
	discipline is not enough; the whole inverse/decoder pathway needs
	geometric control.
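Property 1 in the list above can be verified numerically. This NumPy sketch (illustrative sizes; `S_coord` is a perturbed copy of `S` standing in for post-attention coordinates, and `W_inv` is a hypothetical random weight matrix) checks that the SVD inverse readout ties the output norm to the coordinate norm, while a plain Linear inverse has no such tie.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # illustrative sizes

M = rng.normal(size=(V, D))
M /= np.linalg.norm(M, axis=-1, keepdims=True)
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Whatever coordinates go through U and Vt, the Frobenius norm of the result
# equals the 2-norm of the coordinates.
S_coord = S + rng.normal(scale=0.1, size=D)  # stand-in for post-attention coordinates
M_hat = U @ np.diag(S_coord) @ Vt
svd_gap = abs(np.linalg.norm(M_hat) - np.linalg.norm(S_coord))

# A generic Linear(D -> V*D) offers no such bound: scaling the weights scales the output.
W_inv = rng.normal(size=(V * D, D))          # hypothetical weights
norm_1x = np.linalg.norm(W_inv @ S_coord)
norm_10x = np.linalg.norm((10.0 * W_inv) @ S_coord)

print(f"| ||M_hat||_F - ||S_coord||_2 | = {svd_gap:.2e}")
print(f"Linear output norm: {norm_1x:.2f} -> {norm_10x:.2f} after scaling weights by 10")
```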

	## What's next — proper research direction

	The ablation as built is not the right experiment. It's "SVAE minus SVD,"
	which treats the SVD as a swappable component in an architecture designed
	around it. That's the wrong framing.

	The right framing: if you don't have the SVD's factorization, you have an
	autoencoder. Autoencoders have their own stability toolkit — KL-divergence
	regularization, explicit bottleneck embedding, reparameterization tricks
	— and you should use it.

	A serious NoSVD successor should include:

	Proper VAE machinery. Not "replace SVD with Linear, keep SVAE shape."
	Rebuild as a VAE with:
	- `μ, logσ = encoder(patch) → (D,), (D,)` — explicit learned distribution
	- `z = μ + σ · ε` — reparameterized sample
	- KL regularization `D_KL(q(z|x) || N(0,I))` — standard VAE discipline
	- Decoder from `z` back to patch via `Linear(D → V·D) → MLP decoder`
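The sampling and regularization steps in this list follow the standard diagonal-Gaussian VAE recipe, sketched here in NumPy. Random vectors stand in for the encoder outputs, the function names are illustrative, and `D = 16` is an assumed latent size. The KL term uses the usual closed form for a diagonal Gaussian against N(0, I).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed latent size

def reparameterize(mu, log_sigma, rng):
    """z = mu + sigma * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma))

# Stand-ins for the encoder outputs for one patch.
mu = rng.normal(scale=0.5, size=D)
log_sigma = rng.normal(scale=0.1, size=D)

z = reparameterize(mu, log_sigma, rng)
kl = kl_to_standard_normal(mu, log_sigma)
print(f"z has {z.shape[0]} entries, KL = {kl:.4f}")
```

The KL term is zero only when `mu = 0` and `log_sigma = 0`, which is what pulls the latent distribution toward the prior.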

	The omega tokens here are `z` samples or `μ` values — learned latents, not
	spectral coordinates. Different object, different claims, honest framing.

	Bottleneck embedding with capacity. The ablation's `Linear(V·D → D)` is
	a 1024→16 projection with no intermediate substrate. A proper bottleneck
	would use `Linear → GELU → Linear` with a hidden dimension that lets the
	MLP learn a meaningful projection. This is standard VAE practice; the
	SVAE didn't need it because sphere-norm + SVD already provided the
	projection discipline.
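For a sense of scale, this NumPy sketch compares the direct 1024→16 projection with a `Linear → GELU → Linear` bottleneck. The hidden width of 256 is a hypothetical choice, since the text does not name one. The weights are random, and GELU uses the common tanh approximation.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init_linear(rng, in_dim, out_dim):
    """Random (weight, bias) pair standing in for a learned Linear layer."""
    W = rng.normal(scale=1.0 / np.sqrt(in_dim), size=(out_dim, in_dim))
    b = np.zeros(out_dim)
    return W, b

def n_params(*layers):
    """Total number of weights and biases across the given layers."""
    return sum(W.size + b.size for W, b in layers)

rng = np.random.default_rng(0)
in_dim, hidden, out_dim = 1024, 256, 16  # hidden = 256 is a hypothetical choice

direct = init_linear(rng, in_dim, out_dim)
fc1 = init_linear(rng, in_dim, hidden)
fc2 = init_linear(rng, hidden, out_dim)

x = rng.normal(size=in_dim)  # stand-in for a flattened V*D matrix
omega_direct = direct[0] @ x + direct[1]
omega_mlp = fc2[0] @ gelu(fc1[0] @ x + fc1[1]) + fc2[1]

print("direct params:", n_params(direct))
print("bottleneck params:", n_params(fc1, fc2))
```

Both variants emit a 16-dimensional omega. The bottleneck version spends its extra parameters on a nonlinear intermediate representation.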

	Per-sector Cayley-Menger discipline. If the goal is to make every Linear
	in the omega pathway produce geometrically-disciplined outputs, CM loss must
	be applied at every stage, not just at the encoder output. This is feasible
	but serious engineering — it's a new architectural idea, not a drop-in.

	Independent of SVAE naming/structure. The result is not "PatchSVAE without
	SVD." It's a new VAE family that uses geometric discipline as a regularizer.
	Name it something else. Compare it to SVAE as peers, not as child-of-parent.

	Estimated effort: a focused week for a first working prototype, longer for
	proper characterization. Shelved here pending that dedicated time.

	## Files

	- `johanna_F_nosvd_trainer.py` — final state of the ablation trainer after
	four debug rounds. Standalone (no imports from canonical F-class trainer).
	Independent HF repo configured: `AbstractPhil/geolip-svae-nosvd-ablation`.

	## What to read if resuming

	1. This document. Start here.
	2. The parent [geolip-svae README](https://huggingface.co/AbstractPhil/geolip-SVAE)
	for architectural context on what you're replacing.
	3. The [F-class batteries README](https://huggingface.co/AbstractPhil/geolip-svae-batteries)
	for the framework the ablation was meant to validate against.
	4. The omega tokens blog post for the self-solving frame that
	motivated the ablation in the first place.

	Do not resume this as "finish debugging the Linear readout." Resume as
	"design the proper VAE successor." The four rounds of debugging already
	told you why the direct replacement doesn't work.