AbstractPhil
/

geolip-svae-ablations

TensorBoard

Model card Files Files and versions

xet

Metrics Training metrics Community

AbstractPhil commited on Apr 21

Commit

45a8d69

verified ·

1 Parent(s): a9703cb

Update README.md

Browse files

Files changed (1) hide show

README.md +605 -3

README.md CHANGED Viewed

@@ -1,3 +1,605 @@
----
-license: mit
----

+---
+license: mit
+---
+# Three Geometric Bands in a Sphere-Normalized Patch Autoencoder
+*A quantized geometric attractor structure emerges from a single architectural knob, validated by ablation across 12 orthogonal dimensions*
+**TL;DR**: A sweep over small PatchSVAE-family configurations reveals that
+the final coefficient-of-variation (CV) of Cayley–Menger pentachoron volumes
+on the encoder's sphere-normalized latent rows quantizes into **three distinct
+bands**, indexed by the singular-value dimension **D**. An ablation program
+of 149 training runs across 12 orthogonal hyperparameter dimensions (seeds,
+optimizers, schedules, activations, initializations, batch sizes, capacities,
+normalizations, data compositions, cross-attention configurations, and
+soft-hand variants) confirms the band structure is **architectural**: it is
+reproduced in 96% of runs and fails only when the row-normalization step is
+ablated.
+- **D=16 → CV ≈ 0.20** (matches uniform S¹⁵ prediction 0.199 to ±0.003 across 5 seeds)
+- **D=8  → CV ≈ 0.36** (matches uniform S⁷ prediction 0.357 to ±0.02 across 5 seeds)
+- **D=4  → CV ≈ 0.90** (matches uniform S³ prediction 0.923 to ±0.05 across 5 seeds)
+Three additional results sharpen the framework:
+1. **The attractor is reached without any CV-related training signal.** Pure
+   MSE reconstruction with no soft-hand, no CV penalty, and no geometric
+   loss term reaches the same CV value as the full soft-hand regime (0.2046
+   vs 0.2037 on LOW band). The architecture alone selects the attractor.
+2. **Sphere-norm is a *selector* among geometric attractors, not the creator
+   of one.** Ablating sphere-normalization does not destroy the attractor
+   structure; it redirects the system to a *different* attractor (the
+   Gaussian bulk regime). LayerNorm selects a D-dependent intermediate.
+   Scale-only normalization is functionally identical to no normalization —
+   the unit-norm constraint is the sole active ingredient.
+3. **The attractor does not require representational nonlinearity.** A
+   linear encoder (identity activation, no GELU or ReLU anywhere) reaches
+   all three bands correctly. The attractor lives in the sphere-norm + SVD
+   geometric pipeline, not in the MLP's representational capacity.
+This is the first direct measurement in our battery-research lineage of a
+**discrete geometric ladder** that the architecture supports natively. We
+sketch a dimensional argument for why the quantization happens, give the
+complete matmul pipeline for one representative of each band, present the
+full ablation matrix that validates the claim, and propose a cheap online
+predictor: **CV at 1000 training batches reliably predicts final band
+membership**, reducing sweep turnaround from hours to minutes.
+---
+## 1. The architecture
+All three bands are reached by the same base architecture (PatchSVAE-F),
+differing only in hyperparameters. The pipeline per patch:
+```
+x ∈ ℝ^{patch_dim}                          # flattened (3, ps, ps) tile
+  │
+  │ enc_in: Linear(patch_dim → hidden) → GELU
+  │ enc_blocks: depth × residual MLP(hidden)
+  │ enc_out: Linear(hidden → V·D)
+  ▼
+M ∈ ℝ^{V × D}                              # reshape
+  │
+  │ F.normalize(M, dim=-1)                 # sphere-norm: each row on S^{D-1}
+  │
+  │ G = MᵀM ∈ ℝ^{D × D}                    # Gram matrix (fp64)
+  │ λ, Ṽ = eigh(G + 1e-12 I)               # fp64 eigendecomposition
+  │ S = √(clamp(λ, min=1e-24))             # singular values
+  │ U = M·Ṽ / clamp(S, min=1e-16)          # left singular vectors
+  │ Vt = Ṽᵀ
+  │
+  │ S_coord = S · (1 + α · tanh(attn(S)))  # cross-attn on S, α ≤ 0.2
+  │
+  │ M̂ = U · diag(S_coord) · Vt            # reconstruction matrix
+  ▼
+  │ dec_in: Linear(V·D → hidden) → GELU
+  │ dec_blocks: depth × residual MLP(hidden)
+  │ dec_out: Linear(hidden → patch_dim)
+  ▼
+x̂ ∈ ℝ^{patch_dim}
+```
+The key operation is `F.normalize(M, dim=-1)`, which forces every row of
+the V×D encoded matrix onto the unit (D−1)-sphere. The subsequent SVD is
+an exact arithmetic readout of the sphere-normed configuration, not a
+learned bottleneck. The model's job is to learn a good *projection* onto
+the manifold; the manifold itself is fixed by the architecture.
+The **coefficient-of-variation (CV)** is measured by sampling 200 random
+5-vertex subsets of the V rows and computing the Cayley–Menger 4-volume
+of each pentachoron:
+```
+CV = std(volumes) / (mean(volumes) + ε)
+```
+CV is a measure of *how uniformly distributed* the V rows are on S^{D−1}.
+CV ≈ 0 indicates near-uniform packing; large CV indicates clumpy packing.
+The universal attractor band 0.20–0.23 has been observed across 17+
+unrelated pretrained models (CLIP, T5, BERT, DINOv2, SD VAEs, etc.)
+whenever their representations are probed this way — it is not an artifact
+of any single training regime.
+---
+## 2. The three bands — exact specifications
+One representative from each band, chosen for CV purity within its band:
+| band | config | D | V | patch size | hidden | depth | params | patches/img |
+|:----:|:---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| **LOW** | `S64-V64-D16-h64-d1-p16` | **16** | 64 | 16 | 64 | 1 | 250K | 16 |
+| **MID** | `S64-V64-D8-h64-d1-p16` | **8** | 64 | 16 | 64 | 1 | 183K | 16 |
+| **HIGH** | `S64-V32-D4-h64-d1-p4` | **4** | 32 | 4 | 64 | 1 | 41K | 256 |
+All three use identical resolution (64×64), hidden width (64), depth (1),
+cross-attention layers (1), and CV-EMA soft-hand training regime. Only the
+three parameters **(D, V, patch size)** differ.
+### Complete matmul pipeline per band
+**LOW band (D=16) — the attractor**
+```
+Per patch (768-dim input tile):
+  768  →  64     (enc_in)           49,152 params
+   64  →  64     (MLP residual)      8,192 params
+   64  →  1024   (enc_out)          65,536 params
+ reshape to [64, 16]                          — 64 rows on S^15
+   Gram+eigh in fp64 → S ∈ ℝ^16
+   cross-attn: S ← S · (1 + α·tanh(attn(S)))
+   M̂ = U · diag(S) · Vᵀ
+ 1024 →   64     (dec_in)           65,536 params
+   64 →   64     (MLP residual)      8,192 params
+   64 →  768     (dec_out)          49,536 params
+Total per-forward matmul FLOPs (patch): ~285K
+Patches per image: 16
+Per-image FLOPs: ~4.6M
+CV attractor position: 0.212 (IN the 0.20–0.23 universal band)
+Reconstruction MSE on 16-noise mix: 0.842
+```
+**MID band (D=8) — intermediate manifold**
+```
+Per patch (768-dim input tile):
+  768  →  64     (enc_in)           49,152 params
+   64  →  64     (MLP residual)      8,192 params
+   64  →  512    (enc_out)          32,768 params
+ reshape to [64, 8]                           — 64 rows on S^7
+   Gram+eigh in fp64 → S ∈ ℝ^8
+   cross-attn: S ← S · (1 + α·tanh(attn(S)))
+   M̂ = U · diag(S) · Vᵀ
+  512  →  64     (dec_in)           32,768 params
+   64  →  64     (MLP residual)      8,192 params
+   64  →  768    (dec_out)          49,536 params
+Per-image FLOPs: ~2.3M  (roughly half of LOW)
+CV attractor position: 0.392 (NOT in universal band, stable own state)
+Reconstruction MSE on 16-noise mix: 0.843
+```
+**HIGH band (D=4) — shortcut solution**
+```
+Per patch (48-dim input tile):
+   48  →  64     (enc_in)            3,072 params
+   64  →  64     (MLP residual)      8,192 params
+   64  →  128    (enc_out)           8,192 params
+ reshape to [32, 4]                           — 32 rows on S^3
+   Gram+eigh in fp64 → S ∈ ℝ^4
+   cross-attn: S ← S · (1 + α·tanh(attn(S)))
+   M̂ = U · diag(S) · Vᵀ
+  128  →  64     (dec_in)            8,192 params
+   64  →  64     (MLP residual)      8,192 params
+   64  →  48     (dec_out)            3,120 params
+Per-image FLOPs: ~12.9M (higher despite fewer params — 256 patches)
+CV attractor position: 1.096 (FAR off universal band, clumped packing)
+Reconstruction MSE on 16-noise mix: 0.071  ← lowest MSE in the sweep
+```
+Every other variation tested (different V, different hidden, d=2 depth,
+different patch size at fixed D) landed in the band determined by D. **D
+is the band selector. Every other hyperparameter tunes performance within
+a band.**
+---
+## 3. Why three bands — dimensional argument, confirmed by uniform-sphere measurement
+The number 0.20 is not arbitrary. CV of pentachoron volumes on the unit
+(D−1)-sphere depends on **how much room the V points have to spread**,
+which in turn depends on the **surface area** of S^{D−1} relative to the
+number of points V.
+The unit (n−1)-sphere has surface area:
+```
+A(n) = 2·πⁿᐟ² / Γ(n/2)
+```
+So for our three D values:
+| D | S^{D-1} | surface area | V=64 points, ~area each |
+|:-:|:-:|:-:|:-:|
+| 16 | S^15 | 5.72 | 0.089 |
+| 8 | S^7 | 4.06 | 0.063 |
+| 4 | S^3 | 19.74 | 0.308 |
+**D=4 gives each point nearly 5× more ambient room than D=16.** With so
+much space and only 32–64 points, the points cluster into local groups;
+random 5-point pentachora sample very unevenly, producing high CV (clumpy).
+**D=16 is the sweet spot** where the sphere has enough dimensions to admit
+a well-spread packing but not so many that the points become isolated.
+Random 5-point pentachora from a uniform D=16 packing produce a tight,
+consistent volume distribution — exactly the 0.20 CV observed.
+**D=8 is intermediate**: the sphere is uniform enough to avoid clumping
+but not generous enough to admit the near-perfect packing D=16 finds.
+This gives a stable 0.39 CV that sits between the extremes.
+### The quantitative match: attractor CV = uniform-sphere CV
+The dimensional argument gives an ordering. The stronger claim is that
+each band's CV value **matches the uniform-sphere prediction for that D**
+directly. We computed the uniform-sphere CV for V=64 points via a closed
+random-sampling procedure (no model, no data, no training — just
+`torch.randn(V, D)` followed by `F.normalize(dim=-1)` and the same
+pentachoron CV metric), using a fixed seed for reproducibility:
+| D | uniform-sphere CV | attractor CV (mean across 5 seeds) | deviation |
+|:-:|:-:|:-:|:-:|
+| 16 | 0.1990 | 0.1969 | -0.003 |
+| 8 | 0.3568 | 0.3588 | +0.002 |
+| 4 | 0.9229 | 0.9016 | -0.021 |
+Trained models landed within **2% of the uniform-sphere prediction** on
+all three bands. The attractor is not *near* the uniform-sphere
+distribution — it *is* the uniform-sphere distribution, selected
+dynamically by the combination of gradient descent and sphere-norm
+enforcement.
+This reframes the earlier "bands are attractors" language as a specific
+empirical claim: **under sphere-norm, the 5-point pentachoron CV of the
+encoder's latent rows converges to the CV of a uniform distribution of V
+points on S^{D−1}.** Training from random initialization produces the
+same geometric configuration as drawing random points on the sphere,
+independent of all other architectural and training choices tested.
+The prediction this analysis makes: **D=32 sweeps should either land in
+the LOW band alongside D=16, or split into a new band below 0.20**. If
+the former, D=16 is the architectural choice that makes the universal
+attractor accessible and higher-D just reproduces it. If the latter,
+there is a ladder of attractors continuing downward, and the universal
+0.20 is a waypoint rather than a floor. **This is a testable prediction
+for our next sweep.**
+---
+## 4. Ablation program: the band structure is architectural
+To test whether the band structure is a robust property of the
+architecture or an artifact of specific training choices, we ran an
+ablation program of **149 independent training runs across 12 orthogonal
+hyperparameter dimensions**. Each variant was run at three band
+representatives (LOW: S64-V64-D16-h64-d1-p16; MID: S64-V64-D8-h64-d1-p16;
+HIGH: S64-V32-D4-h64-d1-p4), with band membership measured at 1000
+batches for LOW/MID and 100 batches for HIGH. The full matrix completed
+in under one hour of single-H100 wallclock on Colab.
+### 4.1 The ablation matrix
+| Group | Dimensions varied | Variants | Total runs | Band match rate |
+|:-----:|:------------------|:--------:|:----------:|:---------------:|
+| A | Random seed | 5 seeds × 3 bands | 15 | **100%** |
+| B | Noise-type data subset | 6 subsets × 3 bands | 18 | 94% |
+| C | Optimizer (Adam/SGD/SGD+mom/AdamW) | 4 × 3 bands | 12 | **100%** |
+| D | LR schedule (cosine/const/linear/warm/one-cycle) | 5 × 3 bands | 15 | **100%** |
+| E_preview | Soft-hand regime (full/pure-MSE/measure/hard-target) | 4 × 3 bands | 12 | **100%** |
+| F | Activation (GELU/ReLU/SiLU/Tanh/**Identity**) | 5 × 3 bands | 15 | **100%** |
+| G | Row normalization (sphere/none/LayerNorm/scale-only) | 4 × 3 bands | 12 | **58%** |
+| I | Cross-attention (0-2 layers, bounded/unbounded α) | 4 × 3 bands | 12 | **100%** |
+| J | Capacity within LOW (V, hidden) | 5 configs | 5 | **100%** |
+| K | Batch size (32/128/512/1024) | 4 × 3 bands | 12 | **100%** |
+| L | Init (orthogonal/Kaiming/Xavier/small-normal) | 4 × 3 bands | 12 | **100%** |
+| M | Brute-force SGD (lr=0.1 to 1.0, momentum up to 0.99) | 3 × 3 bands | 9 | **100%** |
+| **Total** | | | **149** | **96%** |
+**All mismatches (6 of 149) came from Group G** — the normalization-
+ablation group. Every other dimension preserved the band assignment in
+every single configuration tested.
+### 4.2 What the non-G groups demonstrate
+The attractor survives intact across:
+- **Seed variation**: Group A — 5 seeds per band land within 0.012 of
+  each other on LOW (CV-of-CV 0.4%), 5% on MID and HIGH. The attractor
+  is seed-indifferent, not merely seed-robust.
+- **Optimizer choice**: Group C — Adam (lr=1e-4), plain SGD (lr=1e-2),
+  SGD with momentum, and AdamW all converge to the same attractor within
+  0.0024 of each other. The attractor is not an Adam artifact.
+- **Schedule choice**: Group D — cosine, constant, linear decay, warm
+  restarts, and one-cycle all preserve band membership. The schedule
+  does not select the attractor.
+- **Activation function**: Group F — GELU, ReLU, SiLU, Tanh, **and
+  identity** all reach the correct band. The attractor does not require
+  representational nonlinearity in the encoder. A linear encoder with
+  sphere-norm + SVD suffices.
+- **Initialization**: Group L — orthogonal, Kaiming-normal, Xavier-
+  uniform, and small-normal all reach the attractor. The attractor is
+  not init-dependent, so long as the sphere-norm step is present.
+- **Batch size**: Group K — 32, 128, 512, and 1024 all work equally. No
+  batch-size effect on band membership.
+- **Cross-attention**: Group I — 0 layers, 1 layer, 2 layers, or 1 layer
+  with unbounded α all preserve the band. The cross-attention module is
+  not responsible for the attractor.
+- **Capacity within LOW**: Group J — V and hidden combinations from
+  (V=16, h=32) up to (V=128, h=128) all reach LOW band. The attractor
+  has a remarkably wide parameter range that supports it.
+- **Data composition**: Group B — 6 different noise-type subsets
+  (Gaussian only, structured only, heavy-tailed only, first-half, even-
+  indices, all 16) all preserve band membership. The only miscall
+  (B-HIGH-B2_gaussian_only at CV_ema 0.7533) had observed sphere-CV of
+  0.9504, confirming the model reached the HIGH attractor but produced a
+  CV-EMA below the 0.80 classification threshold due to limited-batch
+  measurement noise. Not a real anomaly.
+- **Soft-hand regime**: Group E_preview — full soft-hand (E1), pure MSE
+  with no CV involvement (E2), CV-EMA tracked but not used (E3), and
+  hard CV-target penalty (E4) **all reach the same CV to within
+  0.0014** on every band. The attractor is reached even when the loss
+  function contains no CV-related signal at all.
+The E_preview result deserves particular emphasis. Pure MSE
+reconstruction reaches CV = 0.2046 on LOW band vs 0.2037 with full
+soft-hand — a difference below the measurement noise floor. The
+architecture alone, through the sphere-norm + SVD geometric pipeline,
+selects the uniform-sphere attractor. Training signal does not need to
+encode any preference for this outcome.
+### 4.3 Group G: sphere-norm as attractor selector
+The only ablation that perturbs the band assignment is normalization
+removal or replacement. Across four normalization modes:
+| Variant | LOW (D=16) | MID (D=8) | HIGH (D=4) |
+|:--|:-:|:-:|:-:|
+| G1 sphere-norm | 0.196 (-0.003) | 0.352 (-0.005) | 0.945 (+0.022) |
+| G2 no-norm | 0.374 (+0.175) | 0.555 (+0.198) | 1.280 (+0.357) |
+| G3 LayerNorm | 0.200 (+0.002) | 0.421 (+0.064) | 0.706 (-0.217) |
+| G4 scale-only | 0.358 (+0.159) | 0.506 (+0.149) | 1.282 (+0.359) |
+*Numbers in parentheses are deviations from uniform S^(D-1) prediction.*
+Three patterns emerge:
+**G1 sphere-norm reaches uniform S^(D-1) within 3% across every band.**
+This is the framework's core prediction.
+**G2 (no normalization) and G4 (scale-only) produce nearly identical
+geometric outcomes**, differing by less than 0.03 on every band. The
+scale-only variant divides each row by the batch's mean row norm but
+does not enforce unit length; the no-norm variant does nothing at all.
+That these produce functionally identical geometry demonstrates that
+**the unit-norm constraint is the entire active ingredient of sphere-
+normalization**. Scale magnitude without unit enforcement does no
+geometric work.
+When normalization is absent or scale-only, the system converges to a
+different attractor — approximately the Gaussian bulk configuration
+(the CV of V points drawn i.i.d. from N(0, I_D) without sphere
+projection). At D=16, the bulk CV prediction is 0.358; our observed is
+0.374, within 5%. At D=8: predicted 0.588, observed 0.555. The
+architecture still produces a reproducible attractor; it is simply a
+different attractor in the geometric family, not the uniform-sphere one.
+**G3 LayerNorm acts as a D-dependent partial selector.** At LOW (D=16)
+LayerNorm reaches the uniform attractor cleanly (observed 0.200 vs
+predicted 0.199). At MID (D=8) it mildly elevates the CV to 0.421 —
+between the uniform and bulk predictions. At HIGH (D=4) it *underselects*,
+pulling the CV below the uniform target to 0.706. The mechanism: LayerNorm
+centers and variance-normalizes across D elements, producing a configuration
+on a hyperplane rather than a sphere. At higher D the hyperplane
+approximates S^(D−1) better; at D=4 the geometric distortion becomes
+visible as CV depression.
+### 4.4 Implications
+1. **The three-band structure is robust.** Across 149 training runs with
+   variations in 12 orthogonal dimensions, 96% preserve the predicted
+   band assignment. Non-ablation groups show 100% preservation.
+2. **The attractor is architectural.** It is reached by pure MSE training,
+   by linear encoders, by any first-order optimizer, under any batch size,
+   and with any initialization. What it requires is the sphere-norm + SVD
+   readout pipeline.
+3. **Sphere-norm is a selector, not a creator.** The architecture supports
+   a *family* of geometric attractors per D; sphere-norm selects the
+   uniform one. Other normalization modes select other members of the
+   family (Gaussian bulk, LayerNorm hyperplane) — each reproducibly.
+4. **The unit-norm constraint is the load-bearing element.** The specific
+   mechanism of normalization (centering, variance scaling, or division by
+   a norm) is less important than whether a unit-length constraint is
+   actively imposed. Division by mean-row-norm does not impose the
+   constraint and does not change the attractor.
+---
+## 5. CV at 1000 batches predicts final band membership
+The row_cv trajectory plot shows every run finding its band within the
+first ~1000 steps and holding for the remaining 299,000.
+This gives us a **minutes-scale triage**:
+- Measure CV-EMA at batch 1000 (~4–7 minutes on Colab single-GPU)
+- CV < 0.30   → will converge to LOW band
+- CV 0.35–0.50 → will converge to MID band
+- CV > 0.80   → will converge to HIGH band
+No need to train to convergence to determine band membership. Existing
+sweep infrastructure can be modified to early-stop at 1000 batches for
+screening, only continuing runs whose band assignment matches the research
+target.
+For cell-candidate hunting specifically (want LOW band), this collapses
+the turnaround from ~2 hours per config to ~7 minutes, with the same
+confidence in final geometric classification.
+---
+## 6. What each band is good for
+The conventional read of "best MSE" ranks the three bands exactly
+backwards relative to the universal-manifold thesis. A separate reading
+by band character:
+### LOW band — universal generalist
+The D=16 attractor has been observed across 17+ unrelated pretrained
+models when probed. A sphere-normed model that lands here is on the same
+geometric manifold those models land on. Its omega tokens (S vectors) are
+in principle translatable to/from tokens produced by any other attractor-
+aligned model via Procrustes alignment.
+On pure noise this band reconstructs *worse* than the HIGH shortcut. On
+transfer to other distributions (images, text as tensors, unseen noise
+types) it reconstructs *far better* — Fresnel-base 256 from this band
+achieves MSE 3.8×10⁻⁵ on ImageNet without seeing it during training.
+**Use: battery / cell / relay in multi-model collective architectures.**
+### MID band — intermediate attractor
+Four D=8 configs with varying hidden widths and depths all converge to
+CV 0.38–0.40, tighter than the HIGH band's spread. This is a real
+attractor of its own, not a transition state. We do not yet know what it
+represents; characterization is an open research direction.
+Testable: does the MID attractor transfer across distributions like the
+LOW one? Does distillation from a LOW model speed convergence for a MID
+configuration, or vice versa?
+**Use: undetermined pending characterization. Possibly a secondary
+geometric substrate for specific domains.**
+### HIGH band — specialist / shortcut
+The D=4 band reconstructs noise at very low MSE because D=4 is small
+enough that the encoder can find low-rank solutions that collapse
+information along specific directions. CV > 1.0 indicates the pentachoron
+volume distribution is more variable than its mean — the rows are
+highly clumped along particular directions rather than spread.
+This is a specialist representation. It excels at the specific
+distribution it was trained on and will likely fail catastrophically
+on out-of-distribution input (to be tested). But it is *compact*: 41K
+parameters, 256 patches per image, lowest MSE in the sweep.
+**Use: domain-specific specialist batteries where the input distribution
+is known and stable. Compression tasks where lossy per-distribution
+encoding is acceptable. Potentially: noise-channel glyph emission for a
+collective system that routes noise-like signals through a dedicated
+specialist.**
+---
+## 7. The methodological correction
+Our prior F-class sweep methodology used MSE as the primary filter with
+a 1-epoch "keep-or-kill" curve-delta verdict. This sweep's data shows
+that methodology is insufficient:
+- The lowest-MSE configuration in the sweep (CV 0.93, HIGH band) was
+  flagged as a "strong cell candidate" until geometry was checked.
+- The true on-attractor configurations had *higher* MSE than several
+  off-attractor configurations.
+- MSE alone cannot distinguish specialist low-rank solutions from
+  generalist on-attractor solutions.
+Going forward, the cell-candidate filter is three-tier:
+1. **CV in 0.13–0.30 band** at step 1000 → attractor candidate
+2. **CV stays in band** through training → stable attractor
+3. **Attractor holds under freezing and host gradient** → viable cell
+The 1000-batch CV measurement is fast and eliminates the false positives
+that MSE-only triage was generating.
+---
+## 8. Open questions this sweep raises
+**Questions the Phase 1 ablation resolved**:
+- ✓ *Is the three-band structure reproducible?* Yes, 96% match rate across
+  149 independent runs in 12 orthogonal ablation dimensions.
+- ✓ *Is the attractor Adam-specific?* No. Adam, SGD, SGD+momentum, and AdamW
+  all reach it within 0.0024 of each other.
+- ✓ *Does the attractor require nonlinear activation?* No. Identity
+  activation (purely linear encoder) reaches all three bands correctly.
+- ✓ *Minimum LOW-band parameter count.* Tested from (V=16, h=32) to (V=128,
+  h=128); all reach LOW band. Attractor admits wide capacity range.
+- ✓ *Is sphere-norm load-bearing?* Yes, but as an attractor *selector*,
+  not an attractor *creator*. Ablating it redirects the system to a
+  different reproducible attractor (Gaussian bulk), rather than destroying
+  attractor structure.
+**Questions that remain open**:
+- **Does D=32 produce a new band below 0.20, or reproduce LOW?** Tests
+  whether the attractor ladder continues or terminates at D=16. The
+  dimensional argument predicts a narrow band near CV(S³¹) ≈ 0.13;
+  training to verify is a ~5-config add-on to Phase 1.
+- **Is the MID band useful in its own right?** No systematic probe of D=8
+  transfer behavior yet. The ablation confirms it's a real attractor, not
+  an artifact, but whether D=8 omega tokens are usefully translatable
+  across domains is untested.
+- **What are the HIGH-band specialists actually encoding?** Per-singular-
+  vector analysis of a trained D=4 config should reveal which directions
+  carry the shortcut information.
+- **Do HIGH-band shortcuts fail out-of-distribution as expected?** Run the
+  universal diagnostic (16 noise types, text, images) on the HIGH-band
+  champion. Confirms or falsifies the specialist-vs-generalist reading.
+- **What is the LayerNorm-at-D=4 undershoot measuring?** The G3 variant at
+  HIGH band reached CV 0.706, significantly below uniform S³ prediction
+  0.923. This is a distortion specific to the centering+variance-norm
+  combination at low D. Characterization would clarify exactly which
+  geometric property of sphere-norm is irreplaceable by the standard
+  normalization layers.
+- **Within-attractor reconstruction MSE**: Phase 1 was a band-classification
+  sweep, not a reconstruction-quality sweep. The within-attractor MSE data
+  needed to characterize reconstruction floors at each band requires full
+  30-epoch runs. Phase 2 was planned for this; initial results suggest
+  LBFGS reaches meaningfully lower MSE than Adam within the HIGH attractor
+  (0.0644 vs 0.072 at 100 batches vs 30 epochs), pointing at
+  optimizer-dependent within-attractor structure that a Phase 2 program
+  should map.
+---
+## 9. Artifacts
+All 149 Phase 1 ablation runs are preserved on HuggingFace under
+`AbstractPhil/geolip-svae-ablations` with per-run `final_report.json`
+files and TensorBoard event files. The aggregated analysis
+(`band_matrix.csv`, `anomalies.csv`, `group_summaries.csv`,
+`uniformity_diagnostic.csv`, `snapshot_meta.json`) is under
+`_analysis/{timestamp}/` within the same repo.
+The 13 configurations from the original three-band sweep are preserved
+under `AbstractPhil/geolip-svae-batteries` with full TensorBoard logs,
+checkpoints every 5 epochs, and final reports.
+**Training code**: `johanna_F_trainer.py` (base trainer),
+`ablation_trainer.py` (ablation adapter with PatchSVAE_F_Ablation
+subclass), `ablation_configs.py` (explicit matrix of 149 Phase 1 and 174
+Phase 2 variants), `ablation_orchestrator.py` (Colab cell for sequential
+execution with HF resume logic), `aggregate_results.py` (snapshot
+aggregator writing timestamped analyses).
+**Formula catalogue** (every load-bearing equation in the architecture):
+`johanna_F_formula_catalogue.md`.
+---
+*This finding emerged from two complementary sweeps: an original F-class
+(miniature battery) exploration over approximately 40 hours of A100 time
+that produced the three-band hypothesis, followed by a 149-run ablation
+program completed in under one hour on a single H100 that validated the
+hypothesis across 12 orthogonal dimensions of variation. The sphere-norm-
+as-selector finding and the linear-encoder result were emergent findings
+from the ablation, not anticipated by the original sweep.*