Update README.md

45a8d69 verified about 1 month ago

28 kB

	---
	license: mit
	---
	# Three Geometric Bands in a Sphere-Normalized Patch Autoencoder

	A quantized geometric attractor structure emerges from a single architectural knob, validated by ablation across 12 orthogonal dimensions

	TL;DR: A sweep over small PatchSVAE-family configurations reveals that
	the final coefficient-of-variation (CV) of Cayley–Menger pentachoron volumes
	on the encoder's sphere-normalized latent rows quantizes into **three distinct
	bands, indexed by the singular-value dimension D**. An ablation program
	of 149 training runs across 12 orthogonal hyperparameter dimensions (seeds,
	optimizers, schedules, activations, initializations, batch sizes, capacities,
	normalizations, data compositions, cross-attention configurations, and
	soft-hand variants) confirms the band structure is architectural: it is
	reproduced in 96% of runs and fails only when the row-normalization step is
	ablated.

	- D=16 → CV ≈ 0.20 (matches uniform S¹⁵ prediction 0.199 to ±0.003 across 5 seeds)
	- D=8 → CV ≈ 0.36 (matches uniform S⁷ prediction 0.357 to ±0.02 across 5 seeds)
	- D=4 → CV ≈ 0.90 (matches uniform S³ prediction 0.923 to ±0.05 across 5 seeds)

	Three additional results sharpen the framework:

	1. The attractor is reached without any CV-related training signal. Pure
	MSE reconstruction with no soft-hand, no CV penalty, and no geometric
	loss term reaches the same CV value as the full soft-hand regime (0.2046
	vs 0.2037 on LOW band). The architecture alone selects the attractor.

	2. *Sphere-norm is a selector* among geometric attractors, not the creator
	of one.** Ablating sphere-normalization does not destroy the attractor
	structure; it redirects the system to a different attractor (the
	Gaussian bulk regime). LayerNorm selects a D-dependent intermediate.
	Scale-only normalization is functionally identical to no normalization —
	the unit-norm constraint is the sole active ingredient.

	3. The attractor does not require representational nonlinearity. A
	linear encoder (identity activation, no GELU or ReLU anywhere) reaches
	all three bands correctly. The attractor lives in the sphere-norm + SVD
	geometric pipeline, not in the MLP's representational capacity.

	This is the first direct measurement in our battery-research lineage of a
	discrete geometric ladder that the architecture supports natively. We
	sketch a dimensional argument for why the quantization happens, give the
	complete matmul pipeline for one representative of each band, present the
	full ablation matrix that validates the claim, and propose a cheap online
	predictor: **CV at 1000 training batches reliably predicts final band
	membership**, reducing sweep turnaround from hours to minutes.

	---

	## 1. The architecture

	All three bands are reached by the same base architecture (PatchSVAE-F),
	differing only in hyperparameters. The pipeline per patch:

	```
	x ∈ ℝ^{patch_dim} # flattened (3, ps, ps) tile
	│
	│ enc_in: Linear(patch_dim → hidden) → GELU
	│ enc_blocks: depth × residual MLP(hidden)
	│ enc_out: Linear(hidden → V·D)
	▼
	M ∈ ℝ^{V × D} # reshape
	│
	│ F.normalize(M, dim=-1) # sphere-norm: each row on S^{D-1}
	│
	│ G = MᵀM ∈ ℝ^{D × D} # Gram matrix (fp64)
	│ λ, Ṽ = eigh(G + 1e-12 I) # fp64 eigendecomposition
	│ S = √(clamp(λ, min=1e-24)) # singular values
	│ U = M·Ṽ / clamp(S, min=1e-16) # left singular vectors
	│ Vt = Ṽᵀ
	│
	│ S_coord = S · (1 + α · tanh(attn(S))) # cross-attn on S, α ≤ 0.2
	│
	│ M̂ = U · diag(S_coord) · Vt # reconstruction matrix
	▼
	│ dec_in: Linear(V·D → hidden) → GELU
	│ dec_blocks: depth × residual MLP(hidden)
	│ dec_out: Linear(hidden → patch_dim)
	▼
	x̂ ∈ ℝ^{patch_dim}
	```

	The key operation is `F.normalize(M, dim=-1)`, which forces every row of
	the V×D encoded matrix onto the unit (D−1)-sphere. The subsequent SVD is
	an exact arithmetic readout of the sphere-normed configuration, not a
	learned bottleneck. The model's job is to learn a good projection onto
	the manifold; the manifold itself is fixed by the architecture.

	The coefficient-of-variation (CV) is measured by sampling 200 random
	5-vertex subsets of the V rows and computing the Cayley–Menger 4-volume
	of each pentachoron:

	```
	CV = std(volumes) / (mean(volumes) + ε)
	```

	CV is a measure of how uniformly distributed the V rows are on S^{D−1}.
	CV ≈ 0 indicates near-uniform packing; large CV indicates clumpy packing.
	The universal attractor band 0.20–0.23 has been observed across 17+
	unrelated pretrained models (CLIP, T5, BERT, DINOv2, SD VAEs, etc.)
	whenever their representations are probed this way — it is not an artifact
	of any single training regime.

	---

	## 2. The three bands — exact specifications

	One representative from each band, chosen for CV purity within its band:

	\| band \| config \| D \| V \| patch size \| hidden \| depth \| params \| patches/img \|
	\|:----:\|:---\|:-:\|:-:\|:-:\|:-:\|:-:\|:-:\|:-:\|
	\| LOW \| `S64-V64-D16-h64-d1-p16` \| 16 \| 64 \| 16 \| 64 \| 1 \| 250K \| 16 \|
	\| MID \| `S64-V64-D8-h64-d1-p16` \| 8 \| 64 \| 16 \| 64 \| 1 \| 183K \| 16 \|
	\| HIGH \| `S64-V32-D4-h64-d1-p4` \| 4 \| 32 \| 4 \| 64 \| 1 \| 41K \| 256 \|

	All three use identical resolution (64×64), hidden width (64), depth (1),
	cross-attention layers (1), and CV-EMA soft-hand training regime. Only the
	three parameters (D, V, patch size) differ.

	### Complete matmul pipeline per band

	LOW band (D=16) — the attractor

	```
	Per patch (768-dim input tile):
	768 → 64 (enc_in) 49,152 params
	64 → 64 (MLP residual) 8,192 params
	64 → 1024 (enc_out) 65,536 params
	reshape to [64, 16] — 64 rows on S^15
	Gram+eigh in fp64 → S ∈ ℝ^16
	cross-attn: S ← S · (1 + α·tanh(attn(S)))
	M̂ = U · diag(S) · Vᵀ
	1024 → 64 (dec_in) 65,536 params
	64 → 64 (MLP residual) 8,192 params
	64 → 768 (dec_out) 49,536 params

	Total per-forward matmul FLOPs (patch): ~285K
	Patches per image: 16
	Per-image FLOPs: ~4.6M
	CV attractor position: 0.212 (IN the 0.20–0.23 universal band)
	Reconstruction MSE on 16-noise mix: 0.842
	```

	MID band (D=8) — intermediate manifold

	```
	Per patch (768-dim input tile):
	768 → 64 (enc_in) 49,152 params
	64 → 64 (MLP residual) 8,192 params
	64 → 512 (enc_out) 32,768 params
	reshape to [64, 8] — 64 rows on S^7
	Gram+eigh in fp64 → S ∈ ℝ^8
	cross-attn: S ← S · (1 + α·tanh(attn(S)))
	M̂ = U · diag(S) · Vᵀ
	512 → 64 (dec_in) 32,768 params
	64 → 64 (MLP residual) 8,192 params
	64 → 768 (dec_out) 49,536 params

	Per-image FLOPs: ~2.3M (roughly half of LOW)
	CV attractor position: 0.392 (NOT in universal band, stable own state)
	Reconstruction MSE on 16-noise mix: 0.843
	```

	HIGH band (D=4) — shortcut solution

	```
	Per patch (48-dim input tile):
	48 → 64 (enc_in) 3,072 params
	64 → 64 (MLP residual) 8,192 params
	64 → 128 (enc_out) 8,192 params
	reshape to [32, 4] — 32 rows on S^3
	Gram+eigh in fp64 → S ∈ ℝ^4
	cross-attn: S ← S · (1 + α·tanh(attn(S)))
	M̂ = U · diag(S) · Vᵀ
	128 → 64 (dec_in) 8,192 params
	64 → 64 (MLP residual) 8,192 params
	64 → 48 (dec_out) 3,120 params

	Per-image FLOPs: ~12.9M (higher despite fewer params — 256 patches)
	CV attractor position: 1.096 (FAR off universal band, clumped packing)
	Reconstruction MSE on 16-noise mix: 0.071 ← lowest MSE in the sweep
	```

	Every other variation tested (different V, different hidden, d=2 depth,
	different patch size at fixed D) landed in the band determined by D. **D
	is the band selector. Every other hyperparameter tunes performance within
	a band.**

	---

	## 3. Why three bands — dimensional argument, confirmed by uniform-sphere measurement

	The number 0.20 is not arbitrary. CV of pentachoron volumes on the unit
	(D−1)-sphere depends on how much room the V points have to spread,
	which in turn depends on the surface area of S^{D−1} relative to the
	number of points V.

	The unit (n−1)-sphere has surface area:

	```
	A(n) = 2·πⁿᐟ² / Γ(n/2)
	```

	So for our three D values:

	\| D \| S^{D-1} \| surface area \| V=64 points, ~area each \|
	\|:-:\|:-:\|:-:\|:-:\|
	\| 16 \| S^15 \| 5.72 \| 0.089 \|
	\| 8 \| S^7 \| 4.06 \| 0.063 \|
	\| 4 \| S^3 \| 19.74 \| 0.308 \|

	D=4 gives each point nearly 5× more ambient room than D=16. With so
	much space and only 32–64 points, the points cluster into local groups;
	random 5-point pentachora sample very unevenly, producing high CV (clumpy).

	D=16 is the sweet spot where the sphere has enough dimensions to admit
	a well-spread packing but not so many that the points become isolated.
	Random 5-point pentachora from a uniform D=16 packing produce a tight,
	consistent volume distribution — exactly the 0.20 CV observed.

	D=8 is intermediate: the sphere is uniform enough to avoid clumping
	but not generous enough to admit the near-perfect packing D=16 finds.
	This gives a stable 0.39 CV that sits between the extremes.

	### The quantitative match: attractor CV = uniform-sphere CV

	The dimensional argument gives an ordering. The stronger claim is that
	each band's CV value matches the uniform-sphere prediction for that D
	directly. We computed the uniform-sphere CV for V=64 points via a closed
	random-sampling procedure (no model, no data, no training — just
	`torch.randn(V, D)` followed by `F.normalize(dim=-1)` and the same
	pentachoron CV metric), using a fixed seed for reproducibility:

	\| D \| uniform-sphere CV \| attractor CV (mean across 5 seeds) \| deviation \|
	\|:-:\|:-:\|:-:\|:-:\|
	\| 16 \| 0.1990 \| 0.1969 \| -0.003 \|
	\| 8 \| 0.3568 \| 0.3588 \| +0.002 \|
	\| 4 \| 0.9229 \| 0.9016 \| -0.021 \|

	Trained models landed within 2% of the uniform-sphere prediction on
	all three bands. The attractor is not near the uniform-sphere
	distribution — it is the uniform-sphere distribution, selected
	dynamically by the combination of gradient descent and sphere-norm
	enforcement.

	This reframes the earlier "bands are attractors" language as a specific
	empirical claim: **under sphere-norm, the 5-point pentachoron CV of the
	encoder's latent rows converges to the CV of a uniform distribution of V
	points on S^{D−1}.** Training from random initialization produces the
	same geometric configuration as drawing random points on the sphere,
	independent of all other architectural and training choices tested.

	The prediction this analysis makes: **D=32 sweeps should either land in
	the LOW band alongside D=16, or split into a new band below 0.20**. If
	the former, D=16 is the architectural choice that makes the universal
	attractor accessible and higher-D just reproduces it. If the latter,
	there is a ladder of attractors continuing downward, and the universal
	0.20 is a waypoint rather than a floor. **This is a testable prediction
	for our next sweep.**

	---

	## 4. Ablation program: the band structure is architectural

	To test whether the band structure is a robust property of the
	architecture or an artifact of specific training choices, we ran an
	ablation program of **149 independent training runs across 12 orthogonal
	hyperparameter dimensions**. Each variant was run at three band
	representatives (LOW: S64-V64-D16-h64-d1-p16; MID: S64-V64-D8-h64-d1-p16;
	HIGH: S64-V32-D4-h64-d1-p4), with band membership measured at 1000
	batches for LOW/MID and 100 batches for HIGH. The full matrix completed
	in under one hour of single-H100 wallclock on Colab.

	### 4.1 The ablation matrix

	\| Group \| Dimensions varied \| Variants \| Total runs \| Band match rate \|
	\|:-----:\|:------------------\|:--------:\|:----------:\|:---------------:\|
	\| A \| Random seed \| 5 seeds × 3 bands \| 15 \| 100% \|
	\| B \| Noise-type data subset \| 6 subsets × 3 bands \| 18 \| 94% \|
	\| C \| Optimizer (Adam/SGD/SGD+mom/AdamW) \| 4 × 3 bands \| 12 \| 100% \|
	\| D \| LR schedule (cosine/const/linear/warm/one-cycle) \| 5 × 3 bands \| 15 \| 100% \|
	\| E_preview \| Soft-hand regime (full/pure-MSE/measure/hard-target) \| 4 × 3 bands \| 12 \| 100% \|
	\| F \| Activation (GELU/ReLU/SiLU/Tanh/Identity) \| 5 × 3 bands \| 15 \| 100% \|
	\| G \| Row normalization (sphere/none/LayerNorm/scale-only) \| 4 × 3 bands \| 12 \| 58% \|
	\| I \| Cross-attention (0-2 layers, bounded/unbounded α) \| 4 × 3 bands \| 12 \| 100% \|
	\| J \| Capacity within LOW (V, hidden) \| 5 configs \| 5 \| 100% \|
	\| K \| Batch size (32/128/512/1024) \| 4 × 3 bands \| 12 \| 100% \|
	\| L \| Init (orthogonal/Kaiming/Xavier/small-normal) \| 4 × 3 bands \| 12 \| 100% \|
	\| M \| Brute-force SGD (lr=0.1 to 1.0, momentum up to 0.99) \| 3 × 3 bands \| 9 \| 100% \|
	\| Total \| \| \| 149 \| 96% \|

	All mismatches (6 of 149) came from Group G — the normalization-
	ablation group. Every other dimension preserved the band assignment in
	every single configuration tested.

	### 4.2 What the non-G groups demonstrate

	The attractor survives intact across:

	- Seed variation: Group A — 5 seeds per band land within 0.012 of
	each other on LOW (CV-of-CV 0.4%), 5% on MID and HIGH. The attractor
	is seed-indifferent, not merely seed-robust.
	- Optimizer choice: Group C — Adam (lr=1e-4), plain SGD (lr=1e-2),
	SGD with momentum, and AdamW all converge to the same attractor within
	0.0024 of each other. The attractor is not an Adam artifact.
	- Schedule choice: Group D — cosine, constant, linear decay, warm
	restarts, and one-cycle all preserve band membership. The schedule
	does not select the attractor.
	- Activation function: Group F — GELU, ReLU, SiLU, Tanh, **and
	identity** all reach the correct band. The attractor does not require
	representational nonlinearity in the encoder. A linear encoder with
	sphere-norm + SVD suffices.
	- Initialization: Group L — orthogonal, Kaiming-normal, Xavier-
	uniform, and small-normal all reach the attractor. The attractor is
	not init-dependent, so long as the sphere-norm step is present.
	- Batch size: Group K — 32, 128, 512, and 1024 all work equally. No
	batch-size effect on band membership.
	- Cross-attention: Group I — 0 layers, 1 layer, 2 layers, or 1 layer
	with unbounded α all preserve the band. The cross-attention module is
	not responsible for the attractor.
	- Capacity within LOW: Group J — V and hidden combinations from
	(V=16, h=32) up to (V=128, h=128) all reach LOW band. The attractor
	has a remarkably wide parameter range that supports it.
	- Data composition: Group B — 6 different noise-type subsets
	(Gaussian only, structured only, heavy-tailed only, first-half, even-
	indices, all 16) all preserve band membership. The only miscall
	(B-HIGH-B2_gaussian_only at CV_ema 0.7533) had observed sphere-CV of
	0.9504, confirming the model reached the HIGH attractor but produced a
	CV-EMA below the 0.80 classification threshold due to limited-batch
	measurement noise. Not a real anomaly.
	- Soft-hand regime: Group E_preview — full soft-hand (E1), pure MSE
	with no CV involvement (E2), CV-EMA tracked but not used (E3), and
	hard CV-target penalty (E4) **all reach the same CV to within
	0.0014** on every band. The attractor is reached even when the loss
	function contains no CV-related signal at all.

	The E_preview result deserves particular emphasis. Pure MSE
	reconstruction reaches CV = 0.2046 on LOW band vs 0.2037 with full
	soft-hand — a difference below the measurement noise floor. The
	architecture alone, through the sphere-norm + SVD geometric pipeline,
	selects the uniform-sphere attractor. Training signal does not need to
	encode any preference for this outcome.

	### 4.3 Group G: sphere-norm as attractor selector

	The only ablation that perturbs the band assignment is normalization
	removal or replacement. Across four normalization modes:

	\| Variant \| LOW (D=16) \| MID (D=8) \| HIGH (D=4) \|
	\|:--\|:-:\|:-:\|:-:\|
	\| G1 sphere-norm \| 0.196 (-0.003) \| 0.352 (-0.005) \| 0.945 (+0.022) \|
	\| G2 no-norm \| 0.374 (+0.175) \| 0.555 (+0.198) \| 1.280 (+0.357) \|
	\| G3 LayerNorm \| 0.200 (+0.002) \| 0.421 (+0.064) \| 0.706 (-0.217) \|
	\| G4 scale-only \| 0.358 (+0.159) \| 0.506 (+0.149) \| 1.282 (+0.359) \|

	Numbers in parentheses are deviations from uniform S^(D-1) prediction.

	Three patterns emerge:

	G1 sphere-norm reaches uniform S^(D-1) within 3% across every band.
	This is the framework's core prediction.

	**G2 (no normalization) and G4 (scale-only) produce nearly identical
	geometric outcomes**, differing by less than 0.03 on every band. The
	scale-only variant divides each row by the batch's mean row norm but
	does not enforce unit length; the no-norm variant does nothing at all.
	That these produce functionally identical geometry demonstrates that
	**the unit-norm constraint is the entire active ingredient of sphere-
	normalization**. Scale magnitude without unit enforcement does no
	geometric work.

	When normalization is absent or scale-only, the system converges to a
	different attractor — approximately the Gaussian bulk configuration
	(the CV of V points drawn i.i.d. from N(0, I_D) without sphere
	projection). At D=16, the bulk CV prediction is 0.358; our observed is
	0.374, within 5%. At D=8: predicted 0.588, observed 0.555. The
	architecture still produces a reproducible attractor; it is simply a
	different attractor in the geometric family, not the uniform-sphere one.

	G3 LayerNorm acts as a D-dependent partial selector. At LOW (D=16)
	LayerNorm reaches the uniform attractor cleanly (observed 0.200 vs
	predicted 0.199). At MID (D=8) it mildly elevates the CV to 0.421 —
	between the uniform and bulk predictions. At HIGH (D=4) it underselects,
	pulling the CV below the uniform target to 0.706. The mechanism: LayerNorm
	centers and variance-normalizes across D elements, producing a configuration
	on a hyperplane rather than a sphere. At higher D the hyperplane
	approximates S^(D−1) better; at D=4 the geometric distortion becomes
	visible as CV depression.

	### 4.4 Implications

	1. The three-band structure is robust. Across 149 training runs with
	variations in 12 orthogonal dimensions, 96% preserve the predicted
	band assignment. Non-ablation groups show 100% preservation.

	2. The attractor is architectural. It is reached by pure MSE training,
	by linear encoders, by any first-order optimizer, under any batch size,
	and with any initialization. What it requires is the sphere-norm + SVD
	readout pipeline.

	3. Sphere-norm is a selector, not a creator. The architecture supports
	a family of geometric attractors per D; sphere-norm selects the
	uniform one. Other normalization modes select other members of the
	family (Gaussian bulk, LayerNorm hyperplane) — each reproducibly.

	4. The unit-norm constraint is the load-bearing element. The specific
	mechanism of normalization (centering, variance scaling, or division by
	a norm) is less important than whether a unit-length constraint is
	actively imposed. Division by mean-row-norm does not impose the
	constraint and does not change the attractor.

	---

	## 5. CV at 1000 batches predicts final band membership

	The row_cv trajectory plot shows every run finding its band within the
	first ~1000 steps and holding for the remaining 299,000.

	This gives us a minutes-scale triage:

	- Measure CV-EMA at batch 1000 (~4–7 minutes on Colab single-GPU)
	- CV < 0.30 → will converge to LOW band
	- CV 0.35–0.50 → will converge to MID band
	- CV > 0.80 → will converge to HIGH band

	No need to train to convergence to determine band membership. Existing
	sweep infrastructure can be modified to early-stop at 1000 batches for
	screening, only continuing runs whose band assignment matches the research
	target.

	For cell-candidate hunting specifically (want LOW band), this collapses
	the turnaround from ~2 hours per config to ~7 minutes, with the same
	confidence in final geometric classification.

	---

	## 6. What each band is good for

	The conventional read of "best MSE" ranks the three bands exactly
	backwards relative to the universal-manifold thesis. A separate reading
	by band character:

	### LOW band — universal generalist

	The D=16 attractor has been observed across 17+ unrelated pretrained
	models when probed. A sphere-normed model that lands here is on the same
	geometric manifold those models land on. Its omega tokens (S vectors) are
	in principle translatable to/from tokens produced by any other attractor-
	aligned model via Procrustes alignment.

	On pure noise this band reconstructs worse than the HIGH shortcut. On
	transfer to other distributions (images, text as tensors, unseen noise
	types) it reconstructs far better — Fresnel-base 256 from this band
	achieves MSE 3.8×10⁻⁵ on ImageNet without seeing it during training.

	Use: battery / cell / relay in multi-model collective architectures.

	### MID band — intermediate attractor

	Four D=8 configs with varying hidden widths and depths all converge to
	CV 0.38–0.40, tighter than the HIGH band's spread. This is a real
	attractor of its own, not a transition state. We do not yet know what it
	represents; characterization is an open research direction.

	Testable: does the MID attractor transfer across distributions like the
	LOW one? Does distillation from a LOW model speed convergence for a MID
	configuration, or vice versa?

	**Use: undetermined pending characterization. Possibly a secondary
	geometric substrate for specific domains.**

	### HIGH band — specialist / shortcut

	The D=4 band reconstructs noise at very low MSE because D=4 is small
	enough that the encoder can find low-rank solutions that collapse
	information along specific directions. CV > 1.0 indicates the pentachoron
	volume distribution is more variable than its mean — the rows are
	highly clumped along particular directions rather than spread.

	This is a specialist representation. It excels at the specific
	distribution it was trained on and will likely fail catastrophically
	on out-of-distribution input (to be tested). But it is compact: 41K
	parameters, 256 patches per image, lowest MSE in the sweep.

	**Use: domain-specific specialist batteries where the input distribution
	is known and stable. Compression tasks where lossy per-distribution
	encoding is acceptable. Potentially: noise-channel glyph emission for a
	collective system that routes noise-like signals through a dedicated
	specialist.**

	---

	## 7. The methodological correction

	Our prior F-class sweep methodology used MSE as the primary filter with
	a 1-epoch "keep-or-kill" curve-delta verdict. This sweep's data shows
	that methodology is insufficient:

	- The lowest-MSE configuration in the sweep (CV 0.93, HIGH band) was
	flagged as a "strong cell candidate" until geometry was checked.
	- The true on-attractor configurations had higher MSE than several
	off-attractor configurations.
	- MSE alone cannot distinguish specialist low-rank solutions from
	generalist on-attractor solutions.

	Going forward, the cell-candidate filter is three-tier:

	1. CV in 0.13–0.30 band at step 1000 → attractor candidate
	2. CV stays in band through training → stable attractor
	3. Attractor holds under freezing and host gradient → viable cell

	The 1000-batch CV measurement is fast and eliminates the false positives
	that MSE-only triage was generating.

	---

	## 8. Open questions this sweep raises

	Questions the Phase 1 ablation resolved:

	- ✓ Is the three-band structure reproducible? Yes, 96% match rate across
	149 independent runs in 12 orthogonal ablation dimensions.
	- ✓ Is the attractor Adam-specific? No. Adam, SGD, SGD+momentum, and AdamW
	all reach it within 0.0024 of each other.
	- ✓ Does the attractor require nonlinear activation? No. Identity
	activation (purely linear encoder) reaches all three bands correctly.
	- ✓ Minimum LOW-band parameter count. Tested from (V=16, h=32) to (V=128,
	h=128); all reach LOW band. Attractor admits wide capacity range.
	- ✓ Is sphere-norm load-bearing? Yes, but as an attractor selector,
	not an attractor creator. Ablating it redirects the system to a
	different reproducible attractor (Gaussian bulk), rather than destroying
	attractor structure.

	Questions that remain open:

	- Does D=32 produce a new band below 0.20, or reproduce LOW? Tests
	whether the attractor ladder continues or terminates at D=16. The
	dimensional argument predicts a narrow band near CV(S³¹) ≈ 0.13;
	training to verify is a ~5-config add-on to Phase 1.

	- Is the MID band useful in its own right? No systematic probe of D=8
	transfer behavior yet. The ablation confirms it's a real attractor, not
	an artifact, but whether D=8 omega tokens are usefully translatable
	across domains is untested.

	- What are the HIGH-band specialists actually encoding? Per-singular-
	vector analysis of a trained D=4 config should reveal which directions
	carry the shortcut information.

	- Do HIGH-band shortcuts fail out-of-distribution as expected? Run the
	universal diagnostic (16 noise types, text, images) on the HIGH-band
	champion. Confirms or falsifies the specialist-vs-generalist reading.

	- What is the LayerNorm-at-D=4 undershoot measuring? The G3 variant at
	HIGH band reached CV 0.706, significantly below uniform S³ prediction
	0.923. This is a distortion specific to the centering+variance-norm
	combination at low D. Characterization would clarify exactly which
	geometric property of sphere-norm is irreplaceable by the standard
	normalization layers.

	- Within-attractor reconstruction MSE: Phase 1 was a band-classification
	sweep, not a reconstruction-quality sweep. The within-attractor MSE data
	needed to characterize reconstruction floors at each band requires full
	30-epoch runs. Phase 2 was planned for this; initial results suggest
	LBFGS reaches meaningfully lower MSE than Adam within the HIGH attractor
	(0.0644 vs 0.072 at 100 batches vs 30 epochs), pointing at
	optimizer-dependent within-attractor structure that a Phase 2 program
	should map.

	---

	## 9. Artifacts

	All 149 Phase 1 ablation runs are preserved on HuggingFace under
	`AbstractPhil/geolip-svae-ablations` with per-run `final_report.json`
	files and TensorBoard event files. The aggregated analysis
	(`band_matrix.csv`, `anomalies.csv`, `group_summaries.csv`,
	`uniformity_diagnostic.csv`, `snapshot_meta.json`) is under
	`_analysis/{timestamp}/` within the same repo.

	The 13 configurations from the original three-band sweep are preserved
	under `AbstractPhil/geolip-svae-batteries` with full TensorBoard logs,
	checkpoints every 5 epochs, and final reports.

	Training code: `johanna_F_trainer.py` (base trainer),
	`ablation_trainer.py` (ablation adapter with PatchSVAE_F_Ablation
	subclass), `ablation_configs.py` (explicit matrix of 149 Phase 1 and 174
	Phase 2 variants), `ablation_orchestrator.py` (Colab cell for sequential
	execution with HF resume logic), `aggregate_results.py` (snapshot
	aggregator writing timestamped analyses).

	Formula catalogue (every load-bearing equation in the architecture):
	`johanna_F_formula_catalogue.md`.

	---

	*This finding emerged from two complementary sweeps: an original F-class
	(miniature battery) exploration over approximately 40 hours of A100 time
	that produced the three-band hypothesis, followed by a 149-run ablation
	program completed in under one hour on a single H100 that validated the
	hypothesis across 12 orthogonal dimensions of variation. The sphere-norm-
	as-selector finding and the linear-encoder result were emergent findings
	from the ablation, not anticipated by the original sweep.*