| --- |
| license: apache-2.0 |
| --- |
| |
| # Potential foundational piece |
|
|
| This is a VAE variant that specifically targets CV for autoencoding. The model is only preliminary and requires many supporting utilities to be |
| instantiated before it can function. |
|
|
| This model will likely rely on KL divergence and standard autoencoder structure, with the caveat that it is geometrically aligned throughout, using |
| principles similar to the SVAE. It adds an internal subsystem dedicated to adjudication through a series of |
| embedding arrays, each aligned on the CV spectrum. |
| |
| This system is essentially a CV battery container: it holds hundreds of miniature SVD-trained batteries that are implanted directly |
| into the substructure as learning starting points. The early design shows promise for rapid learning transfer. |
| |
| This model **will not require FP64 SVD to train** and will be almost entirely linear upon completion, which makes it a unique, param-heavy model |
| rather than the combination of model shapes I've been cobbling together up to this point. |
| |
| # geolip-svae-nosvd-ablation |
| |
| **Status: shelved pending proper redesign. See the "What's next" section.** |
| |
| An ablation study exploring whether the SVD in PatchSVAE can be replaced by |
| a learned linear readout. The short answer is *not directly*, and the long |
| answer is a list of architectural properties the SVD was providing implicitly |
| that any replacement must supply explicitly. |
| |
| This repo exists to preserve the experiment and its findings for future |
| work. The parent architecture lives at |
| [geolip-svae](https://huggingface.co/AbstractPhil/geolip-SVAE) and |
| [geolip-svae-batteries](https://huggingface.co/AbstractPhil/geolip-svae-batteries). |
| |
| ## The motivating realization |
| |
| During F-class sweep analysis, we articulated a claim that reframed what |
| PatchSVAE is doing: |
| |
| **The SVD is a readout, not a decomposer. The encoder + sphere-normalization |
| is the decomposer.** |
| |
| The argument: |
| |
| 1. The encoder MLP projects a patch into a V×D matrix space |
| 2. Sphere-normalization (one line, zero parameters) constrains every row to |
| S^(D-1) — the unit hypersphere in D dimensions |
| 3. The SVD is then exact arithmetic on V points on S^(D-1). Given V unit |
| vectors in D-space, the singular values are unique, and the factorization U·Σ·V^T is unique up to sign when they are distinct |
| 4. Cross-attention is 0.013% of parameters with alpha coefficients that |
| barely move during training — per-patch SVD already produces correct |
| coordinates; cross-attn is verification, not coordination |
| |
| Under this frame, omega tokens are not a learned compressed representation. |
| They are **coordinates on the universal S^(D-1) packing manifold**. The |
| universal attractor (S₀ ≈ 5.1, erank ≈ 15.88 at D=16; CV 0.20–0.23 band) is |
| a geometric property of "V unit vectors packed as evenly as possible on |
| S^(D-1)," not a learned statistic. The encoder discovers projections onto |
| that fixed manifold. The manifold is fixed by the architecture. |
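
Steps 2 and 3 of the argument, together with the spectral statistics quoted above, can be sketched in a few lines. This is a minimal illustration and not the trainer's code: it uses NumPy instead of PyTorch, a random Gaussian matrix stands in for the encoder output, `V = 64` is inferred from the 1024→16 figure quoted later in this README, and `effective_rank` assumes the exp-of-entropy definition, which this README does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # assumed sizes: V * D = 1024 with D = 16

# Stand-in for the encoder output; the real encoder is an MLP.
M_raw = rng.standard_normal((V, D))

# Sphere-normalization: project every row onto the unit sphere S^(D-1).
M = M_raw / np.linalg.norm(M_raw, axis=1, keepdims=True)

# Thin SVD: U is (V, D), S is (D,), Vt is (D, D).
U, S, Vt = np.linalg.svd(M, full_matrices=False)

def effective_rank(singular_values):
    """exp(Shannon entropy) of the normalized singular values (assumed definition)."""
    p = singular_values / singular_values.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

erank = effective_rank(S)
print("S0 =", S[0], "erank =", erank)
```

One property is visible directly in this sketch: because every row has unit norm, the squared singular values always sum to V, so the spectrum can only redistribute a fixed budget.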
| |
| If this is right, the SVD should be replaceable by any mechanism that reads |
| D-dim coordinates off a sphere-normalized V×D matrix. A learned |
| `Linear(V·D → D)` should work. This repo tests that hypothesis. |
| |
| ## What the ablation actually is |
| |
| | Stage | Canonical PatchSVAE | NoSVD ablation | |
| |---|---|---| |
| | Encoder | MLP → V·D flat | same | |
| | Sphere norm | F.normalize(dim=-1) on V×D reshape | same | |
| | Readout | `U,S,Vt = svd(M)` → omega is S | `omega = Linear(V·D, D)(M.flatten())` | |
| | Cross-attention | on S across patches | on omega across patches | |
| | Inverse readout | `M_hat = U @ diag(S_coord) @ Vt` | `M_hat = sphere_norm(Linear(D, V·D)(omega_coord))` | |
| | Decoder | MLP from V·D flat | same | |
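
The two readout columns of the table can be put side by side as a rough sketch. It is written in NumPy under stated assumptions: `V = 64` and `D = 16`, random matrices (`W_read`, `W_inv`, both hypothetical names) stand in for the learned `Linear` weights, biases are omitted, and the cross-attention step that turns `S` into `S_coord` is skipped.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 64, 16  # assumed sizes (V * D = 1024)

def sphere_norm(X):
    """Normalize each row to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

M = sphere_norm(rng.standard_normal((V, D)))

# Canonical path: the readout is the singular-value vector S.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
omega_svd = S                               # shape (D,)
M_hat_svd = U @ np.diag(omega_svd) @ Vt     # exact when S is left unchanged

# NoSVD path: linear readout and inverse (random weights, no bias).
W_read = rng.standard_normal((D, V * D)) / np.sqrt(V * D)
W_inv = rng.standard_normal((V * D, D)) / np.sqrt(D)
omega_lin = W_read @ M.reshape(-1)          # shape (D,)
M_hat_lin = sphere_norm((W_inv @ omega_lin).reshape(V, D))

print(omega_svd.shape, omega_lin.shape, M_hat_lin.shape)
```

The canonical path reconstructs `M` exactly with zero parameters, while the linear path has to learn both directions of the mapping.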
|
|
| Everything else — encoder, decoder, cross-attention logic, boundary smoothing, |
| CV-EMA soft-hand, 16-type noise training, 30 epochs — is identical to the |
| F-class trainer. |
|
|
| ## What happened |
|
|
| Four debug rounds before shelving. Each round revealed an architectural |
| property the SVD was providing that a naive `Linear` replacement doesn't. |
|
|
| **Round 1 — baseline.** `r=NaN` at iteration 899. Adaptive gradient clipping |
| (`clip=max(recon_loss, 1.0)`) in the original trainer assumes recon_loss is |
| architecturally bounded. Without the SVD's implicit magnitude bound, recon |
| blows up, the clip threshold blows up with it, and protection fails. |
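
The Round 1 failure mode can be shown with plain arithmetic. The sketch below assumes norm-based clipping (the same idea as `torch.nn.utils.clip_grad_norm_`) and a hypothetical gradient norm that scales with the loss; neither detail is confirmed by this README.

```python
def clipped_norm(grad_norm, max_norm):
    """Gradient norm after norm-clipping to max_norm."""
    return min(grad_norm, max_norm)

def adaptive_threshold(recon_loss):
    """Threshold rule quoted above: clip = max(recon_loss, 1.0)."""
    return max(recon_loss, 1.0)

for recon_loss in (0.05, 1.0, 1e3, 1e9):
    grad_norm = 10.0 * recon_loss  # hypothetical: gradients scale with the loss
    adaptive = clipped_norm(grad_norm, adaptive_threshold(recon_loss))
    fixed = clipped_norm(grad_norm, 1.0)
    print(f"recon={recon_loss:g}  adaptive={adaptive:g}  fixed={fixed:g}")
```

With the adaptive rule the surviving gradient norm grows with the loss, so the clip offers no protection once the loss is unbounded. A fixed threshold stays at 1.0 regardless.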
| |
| **Round 2 — LayerNorm on omega, orthogonal init gain=0.5, fixed grad_clip=1.0.** |
| `r=3.2e11` before NaN. LayerNorm bounded omega but did nothing for M_hat. The |
| decoder can push `inverse_readout` to amplify freely to match heavy-tailed |
| noise values (Cauchy `tan(π·0.495) ≈ 63`, exponential `-log(tiny) ≈ 13+`), and |
| the unconstrained Linear output is amplified cubically during training. |
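
The heavy-tail magnitudes can be checked with the inverse CDFs. This sketch assumes the noise is drawn by inverse-CDF sampling with clamped uniform inputs, which this README does not state. It evaluates the Cauchy offset at both 0.49 and 0.495 because the two differ by a factor of two.

```python
import math

def cauchy_quantile_offset(p):
    """Standard Cauchy inverse CDF at probability 0.5 + p, i.e. tan(pi * p)."""
    return math.tan(math.pi * p)

def exponential_sample(u):
    """Inverse-CDF style draw -log(u) for u in (0, 1]."""
    return -math.log(u)

print(cauchy_quantile_offset(0.49))    # about 31.8
print(cauchy_quantile_offset(0.495))   # about 63.7
print(exponential_sample(1e-6))        # about 13.8
```

Either way, the targets sit one to two orders of magnitude above unit scale, which is the regime an unbounded inverse readout is pushed into.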
|
|
| **Round 3 — added sphere-norm on M_hat after inverse_readout.** Forward is |
| now stable, but eval MSE is 2.3e11 and recon_ema goes NaN. Sphere-norming |
| M_hat puts the decoder input on the same manifold as the canonical's |
| reconstruction, but strips reconstruction-magnitude information. The decoder |
| must hallucinate 63× amplification from unit-magnitude matrices to match |
| Cauchy targets, which it cannot do. |
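
Why sphere-norming `M_hat` strips magnitude can be seen in a toy example: row normalization is invariant to any positive rescaling of its input. The matrix values below are made up for illustration.

```python
import math

def sphere_norm(rows):
    """Scale each row to unit Euclidean length."""
    out = []
    for row in rows:
        n = math.sqrt(sum(x * x for x in row))
        out.append([x / n for x in row])
    return out

M_hat = [[0.3, -1.2, 0.5], [2.0, 0.1, -0.7]]     # toy inverse-readout output
amplified = [[63.0 * x for x in row] for row in M_hat]

a = sphere_norm(M_hat)
b = sphere_norm(amplified)
same = all(math.isclose(x, y, rel_tol=1e-12, abs_tol=1e-12)
           for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(same)
```

Both inputs map to the same point on the sphere, so whatever scale the inverse readout produced is invisible to the decoder.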
|
|
| **Round 4 — gradient-flowing Cayley-Menger loss.** This is the first |
| implementation with a plausible mechanism. In the canonical, CV of pentachoron |
| volumes is measured with `.item()` stripping the gradient — it's a readout. |
| In Round 4, `cv_loss_differentiable` is added, computing CV across the batch |
| with full gradient flow, penalizing the squared distance from the 0.215 target |
| (center of the 0.20–0.23 universal band). Weighted 20.0 during ALGN epochs |
| (geometry first) and 10.0 during HAND epochs (geometry locked). Applied to |
| every M matrix in every batch — the encoder has no place to hide. |
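
The forward computation behind a CV penalty of this kind might look like the sketch below. It is not the trainer's `cv_loss_differentiable`: NumPy has no autograd, so only the forward pass is shown, and the same operations would need to be written in PyTorch without `.item()` for gradients to flow. Several details are assumptions: CV is taken to be std / mean of the volumes, pentachora are formed from five randomly chosen rows of one matrix, and `n_simplices` is a hypothetical parameter.

```python
import numpy as np

def pentachoron_volume(points):
    """Volume of the 4-simplex spanned by 5 points, via the Cayley-Menger determinant."""
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(-1)                 # (5, 5) squared distances
    cm = np.ones((6, 6))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    vol2 = -np.linalg.det(cm) / 9216.0       # 9216 = 2^4 * (4!)^2
    return np.sqrt(max(vol2, 0.0))

def cv_penalty(M, n_simplices=64, target=0.215, seed=0):
    """Squared distance of the volume CV (std / mean) from the target."""
    rng = np.random.default_rng(seed)
    vols = np.array([
        pentachoron_volume(M[rng.choice(len(M), size=5, replace=False)])
        for _ in range(n_simplices)
    ])
    cv = vols.std() / vols.mean()
    return (cv - target) ** 2, cv

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 16))
M /= np.linalg.norm(M, axis=1, keepdims=True)
penalty, cv = cv_penalty(M)
print(f"cv={cv:.3f} penalty={penalty:.4f}")
```

In training, the phase weights quoted above (20.0 and 10.0) would multiply this penalty before it is added to the reconstruction loss.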
|
|
| Round 4 was in the training file but not run before shelving. The session |
| ended with a design-level observation: |
|
|
| > It has to hit everything that passes through the linear sector. |
|
|
| The realization is that the CV force as applied only covered one Linear |
| (the readout bottleneck). The full geometric discipline needs to cover |
| everything downstream that carries omega information. |
|
|
| ## What the four rounds actually taught us |
|
|
| These are the load-bearing architectural properties of the SVD path that |
| need explicit replacement in any NoSVD design: |
|
|
| 1. **Unitary U and V^T bound |M_hat|.** In the canonical, `|M_hat|_F = |S|_2` |
| because U and V^T are orthogonal. Any learned inverse must be bounded by |
| construction (sphere-norm is one way; RMS-norm with learned gain is |
| another; but both fight against magnitude reconstruction). |
| |
| 2. **S magnitude is proportional to input magnitude.** This is the property |
| that lets the canonical handle heavy-tailed noise. Sphere-norming M kills |
| magnitude per-row, but S recovers the per-matrix magnitude as the |
| singular values. Any learned readout that normalizes loses this. |
|
|
| 3. **The SVD factorization is exact and input-agnostic.** Sphere-normed |
| points on S^(D-1) always admit an SVD with uniquely determined singular values. The learned readout is not |
| input-agnostic; it must learn to read, and what it learns to read from |
| Cauchy-driven matrices is not the same as what it learns from Gaussian- |
| driven matrices. |
|
|
| 4. **Gradient-flowing CM is a partial replacement for #3** (input-agnostic |
| geometric structure), but it has to apply wherever downstream Linear |
| operations carry omega information. A single bottleneck Linear with CM |
| discipline is not enough; the whole inverse/decoder pathway needs |
| geometric control. |
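
Property 1 in the list above is easy to verify numerically. The sketch uses NumPy with assumed sizes `V = 64`, `D = 16`, a hypothetical perturbation standing in for whatever produces `S_coord`, and a random matrix `W_inv` standing in for a learned inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 64, 16  # assumed sizes

M = rng.standard_normal((V, D))
M /= np.linalg.norm(M, axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Perturb the spectrum the way a coordinating stage might (hypothetical values).
S_coord = S * rng.uniform(0.9, 1.1, size=D)
M_hat = U @ np.diag(S_coord) @ Vt
fro = np.linalg.norm(M_hat)        # Frobenius norm of the reconstruction
spec = np.linalg.norm(S_coord)     # 2-norm of the coordinate vector

# A learned inverse has no such tie: scaling its weights scales the output.
W_inv = rng.standard_normal((V * D, D))
out_1 = np.linalg.norm(W_inv @ S_coord)
out_100 = np.linalg.norm((100.0 * W_inv) @ S_coord)

print(fro, spec, out_100 / out_1)
```

The SVD path ties the output norm to the coordinate norm by construction. The linear path ties it only to whatever the optimizer does to the weights.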
|
|
| ## What's next — proper research direction |
|
|
| The ablation as built is not the right experiment. It's "SVAE minus SVD," |
| which treats the SVD as a swappable component in an architecture designed |
| around it. That's the wrong framing. |
|
|
| The right framing: if you don't have the SVD's factorization, you have an |
| autoencoder. Autoencoders have their own stability toolkit — KL-divergence |
| regularization, explicit bottleneck embedding, reparameterization tricks |
| — and you should use it. |
|
|
| A serious NoSVD successor should include: |
|
|
| **Proper VAE machinery.** Not "replace SVD with Linear, keep SVAE shape." |
| Rebuild as a VAE with: |
| - `μ, logσ = encoder(patch) → (D,), (D,)` — explicit learned distribution |
| - `z = μ + σ · ε` — reparameterized sample |
| - KL regularization `D_KL(q(z|x) || N(0,I))` — standard VAE discipline |
| - Decoder from `z` back to patch via `Linear(D → V·D) → MLP decoder` |
|
|
| The omega tokens here are `z` samples or `μ` values — learned latents, not |
| spectral coordinates. Different object, different claims, honest framing. |
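
The reparameterized sample and the KL term listed above have standard closed forms. Here is a minimal NumPy sketch, assuming a diagonal Gaussian posterior parameterized by `mu` and `log_sigma` as in the bullets; the encoder and decoder are left out, and the example values are arbitrary.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(
        mu ** 2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma, axis=-1
    )

rng = np.random.default_rng(0)
D = 16
mu = 0.1 * rng.standard_normal(D)   # arbitrary example posterior mean
log_sigma = np.full(D, -1.0)        # arbitrary example log standard deviation
z = reparameterize(mu, log_sigma, rng)
kl = kl_to_standard_normal(mu, log_sigma)
print(z.shape, float(kl))
```

The KL term is zero only when the posterior equals the prior, which is what gives the latent a bounded scale without any SVD.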
|
|
| **Bottleneck embedding with capacity.** The ablation's `Linear(V·D → D)` is |
| a 1024→16 projection with no hidden layer. A proper bottleneck |
| would use `Linear → GELU → Linear` with a hidden dimension that lets the |
| MLP learn a meaningful projection. This is standard VAE practice; the |
| SVAE didn't need it because sphere-norm + SVD already provided the |
| projection discipline. |
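
A `Linear → GELU → Linear` bottleneck of the kind described could be sketched as follows. The hidden width of 256 is a hypothetical choice, the weights are random and untrained, and GELU uses the common tanh approximation. NumPy stands in for PyTorch here.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class Bottleneck:
    """Linear -> GELU -> Linear with randomly initialized weights."""

    def __init__(self, d_in=1024, d_hidden=256, d_out=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in)
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.standard_normal((d_hidden, d_out)) / np.sqrt(d_hidden)
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

# Batch of 4 flattened V*D inputs (random stand-ins).
x = np.random.default_rng(1).standard_normal((4, 1024))
bottleneck = Bottleneck()
omega = bottleneck(x)
n_params = sum(p.size for p in (bottleneck.W1, bottleneck.b1, bottleneck.W2, bottleneck.b2))
print(omega.shape, n_params)
```

For a VAE head the output width would typically be `2 * D`, split into `μ` and `logσ`.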
|
|
| **Per-sector Cayley-Menger discipline.** If the goal is to make every Linear |
| in the omega pathway produce geometrically-disciplined outputs, CM loss must |
| be applied at every stage, not just at the encoder output. This is feasible |
| but serious engineering — it's a new architectural idea, not a drop-in. |
|
|
| **Independent of SVAE naming/structure.** The result is not "PatchSVAE without |
| SVD." It's a new VAE family that uses geometric discipline as a regularizer. |
| Name it something else. Compare it to SVAE as peers, not as child-of-parent. |
|
|
| Estimated effort: a focused week for a first working prototype, longer for |
| proper characterization. Shelved here pending that dedicated time. |
|
|
| ## Files |
|
|
| - `johanna_F_nosvd_trainer.py` — final state of the ablation trainer after |
| four debug rounds. Standalone (no imports from canonical F-class trainer). |
| Independent HF repo configured: `AbstractPhil/geolip-svae-nosvd-ablation`. |
|
|
| ## What to read if resuming |
|
|
| 1. This document. Start here. |
| 2. The parent [geolip-svae README](https://huggingface.co/AbstractPhil/geolip-SVAE) |
| for architectural context on what you're replacing. |
| 3. The [F-class batteries README](https://huggingface.co/AbstractPhil/geolip-svae-batteries) |
| for the framework the ablation was meant to validate against. |
| 4. The omega tokens blog post, for the self-solving frame perspective that |
| motivated the ablation in the first place. |
|
|
| Do not resume this as "finish debugging the Linear readout." Resume as |
| "design the proper VAE successor." The four rounds of debugging already |
| told you why the direct replacement doesn't work. |