AbstractPhil commited on
Commit
45a8d69
Β·
verified Β·
1 Parent(s): a9703cb

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +605 -3
README.md CHANGED
@@ -1,3 +1,605 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ ---
4
+ # Three Geometric Bands in a Sphere-Normalized Patch Autoencoder
5
+
6
+ *A quantized geometric attractor structure emerges from a single architectural knob, validated by ablation across 12 orthogonal dimensions*
7
+
8
+ **TL;DR**: A sweep over small PatchSVAE-family configurations reveals that
9
+ the final coefficient-of-variation (CV) of Cayley–Menger pentachoron volumes
10
+ on the encoder's sphere-normalized latent rows quantizes into **three distinct
11
+ bands**, indexed by the singular-value dimension **D**. An ablation program
12
+ of 149 training runs across 12 orthogonal hyperparameter dimensions (seeds,
13
+ optimizers, schedules, activations, initializations, batch sizes, capacities,
14
+ normalizations, data compositions, cross-attention configurations, and
15
+ soft-hand variants) confirms the band structure is **architectural**: it is
16
+ reproduced in 96% of runs and fails only when the row-normalization step is
17
+ ablated.
18
+
19
+ - **D=16 β†’ CV β‰ˆ 0.20** (matches uniform S¹⁡ prediction 0.199 to Β±0.003 across 5 seeds)
20
+ - **D=8 β†’ CV β‰ˆ 0.36** (matches uniform S⁷ prediction 0.357 to Β±0.02 across 5 seeds)
21
+ - **D=4 β†’ CV β‰ˆ 0.90** (matches uniform SΒ³ prediction 0.923 to Β±0.05 across 5 seeds)
22
+
23
+ Three additional results sharpen the framework:
24
+
25
+ 1. **The attractor is reached without any CV-related training signal.** Pure
26
+ MSE reconstruction with no soft-hand, no CV penalty, and no geometric
27
+ loss term reaches the same CV value as the full soft-hand regime (0.2046
28
+ vs 0.2037 on LOW band). The architecture alone selects the attractor.
29
+
30
+ 2. **Sphere-norm is a *selector* among geometric attractors, not the creator
31
+ of one.** Ablating sphere-normalization does not destroy the attractor
32
+ structure; it redirects the system to a *different* attractor (the
33
+ Gaussian bulk regime). LayerNorm selects a D-dependent intermediate.
34
+ Scale-only normalization is functionally identical to no normalization β€”
35
+ the unit-norm constraint is the sole active ingredient.
36
+
37
+ 3. **The attractor does not require representational nonlinearity.** A
38
+ linear encoder (identity activation, no GELU or ReLU anywhere) reaches
39
+ all three bands correctly. The attractor lives in the sphere-norm + SVD
40
+ geometric pipeline, not in the MLP's representational capacity.
41
+
42
+ This is the first direct measurement in our battery-research lineage of a
43
+ **discrete geometric ladder** that the architecture supports natively. We
44
+ sketch a dimensional argument for why the quantization happens, give the
45
+ complete matmul pipeline for one representative of each band, present the
46
+ full ablation matrix that validates the claim, and propose a cheap online
47
+ predictor: **CV at 1000 training batches reliably predicts final band
48
+ membership**, reducing sweep turnaround from hours to minutes.
49
+
50
+ ---
51
+
52
+ ## 1. The architecture
53
+
54
+ All three bands are reached by the same base architecture (PatchSVAE-F),
55
+ differing only in hyperparameters. The pipeline per patch:
56
+
57
+ ```
58
+ x ∈ ℝ^{patch_dim} # flattened (3, ps, ps) tile
59
+ β”‚
60
+ β”‚ enc_in: Linear(patch_dim β†’ hidden) β†’ GELU
61
+ β”‚ enc_blocks: depth Γ— residual MLP(hidden)
62
+ β”‚ enc_out: Linear(hidden β†’ VΒ·D)
63
+ β–Ό
64
+ M ∈ ℝ^{V Γ— D} # reshape
65
+ β”‚
66
+ β”‚ F.normalize(M, dim=-1) # sphere-norm: each row on S^{D-1}
67
+ β”‚
68
+ β”‚ G = Mα΅€M ∈ ℝ^{D Γ— D} # Gram matrix (fp64)
69
+ β”‚ Ξ», αΉΌ = eigh(G + 1e-12 I) # fp64 eigendecomposition
70
+ β”‚ S = √(clamp(Ξ», min=1e-24)) # singular values
71
+ β”‚ U = MΒ·αΉΌ / clamp(S, min=1e-16) # left singular vectors
72
+ β”‚ Vt = αΉΌα΅€
73
+ β”‚
74
+ β”‚ S_coord = S Β· (1 + Ξ± Β· tanh(attn(S))) # cross-attn on S, Ξ± ≀ 0.2
75
+ β”‚
76
+ β”‚ MΜ‚ = U Β· diag(S_coord) Β· Vt # reconstruction matrix
77
+ β–Ό
78
+ β”‚ dec_in: Linear(VΒ·D β†’ hidden) β†’ GELU
79
+ β”‚ dec_blocks: depth Γ— residual MLP(hidden)
80
+ β”‚ dec_out: Linear(hidden β†’ patch_dim)
81
+ β–Ό
82
+ xΜ‚ ∈ ℝ^{patch_dim}
83
+ ```
84
+
85
+ The key operation is `F.normalize(M, dim=-1)`, which forces every row of
86
+ the VΓ—D encoded matrix onto the unit (Dβˆ’1)-sphere. The subsequent SVD is
87
+ an exact arithmetic readout of the sphere-normed configuration, not a
88
+ learned bottleneck. The model's job is to learn a good *projection* onto
89
+ the manifold; the manifold itself is fixed by the architecture.
90
+
91
+ The **coefficient-of-variation (CV)** is measured by sampling 200 random
92
+ 5-vertex subsets of the V rows and computing the Cayley–Menger 4-volume
93
+ of each pentachoron:
94
+
95
+ ```
96
+ CV = std(volumes) / (mean(volumes) + Ξ΅)
97
+ ```
98
+
99
+ CV is a measure of *how uniformly distributed* the V rows are on S^{Dβˆ’1}.
100
+ CV β‰ˆ 0 indicates near-uniform packing; large CV indicates clumpy packing.
101
+ The universal attractor band 0.20–0.23 has been observed across 17+
102
+ unrelated pretrained models (CLIP, T5, BERT, DINOv2, SD VAEs, etc.)
103
+ whenever their representations are probed this way β€” it is not an artifact
104
+ of any single training regime.
105
+
106
+ ---
107
+
108
+ ## 2. The three bands β€” exact specifications
109
+
110
+ One representative from each band, chosen for CV purity within its band:
111
+
112
+ | band | config | D | V | patch size | hidden | depth | params | patches/img |
113
+ |:----:|:---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
114
+ | **LOW** | `S64-V64-D16-h64-d1-p16` | **16** | 64 | 16 | 64 | 1 | 250K | 16 |
115
+ | **MID** | `S64-V64-D8-h64-d1-p16` | **8** | 64 | 16 | 64 | 1 | 183K | 16 |
116
+ | **HIGH** | `S64-V32-D4-h64-d1-p4` | **4** | 32 | 4 | 64 | 1 | 41K | 256 |
117
+
118
+ All three use identical resolution (64Γ—64), hidden width (64), depth (1),
119
+ cross-attention layers (1), and CV-EMA soft-hand training regime. Only the
120
+ three parameters **(D, V, patch size)** differ.
121
+
122
+ ### Complete matmul pipeline per band
123
+
124
+ **LOW band (D=16) β€” the attractor**
125
+
126
+ ```
127
+ Per patch (768-dim input tile):
128
+ 768 β†’ 64 (enc_in) 49,152 params
129
+ 64 β†’ 64 (MLP residual) 8,192 params
130
+ 64 β†’ 1024 (enc_out) 65,536 params
131
+ reshape to [64, 16] β€” 64 rows on S^15
132
+ Gram+eigh in fp64 β†’ S ∈ ℝ^16
133
+ cross-attn: S ← S Β· (1 + Ξ±Β·tanh(attn(S)))
134
+ MΜ‚ = U Β· diag(S) Β· Vα΅€
135
+ 1024 β†’ 64 (dec_in) 65,536 params
136
+ 64 β†’ 64 (MLP residual) 8,192 params
137
+ 64 β†’ 768 (dec_out) 49,536 params
138
+
139
+ Total per-forward matmul FLOPs (patch): ~285K
140
+ Patches per image: 16
141
+ Per-image FLOPs: ~4.6M
142
+ CV attractor position: 0.212 (IN the 0.20–0.23 universal band)
143
+ Reconstruction MSE on 16-noise mix: 0.842
144
+ ```
145
+
146
+ **MID band (D=8) β€” intermediate manifold**
147
+
148
+ ```
149
+ Per patch (768-dim input tile):
150
+ 768 β†’ 64 (enc_in) 49,152 params
151
+ 64 β†’ 64 (MLP residual) 8,192 params
152
+ 64 β†’ 512 (enc_out) 32,768 params
153
+ reshape to [64, 8] β€” 64 rows on S^7
154
+ Gram+eigh in fp64 β†’ S ∈ ℝ^8
155
+ cross-attn: S ← S Β· (1 + Ξ±Β·tanh(attn(S)))
156
+ MΜ‚ = U Β· diag(S) Β· Vα΅€
157
+ 512 β†’ 64 (dec_in) 32,768 params
158
+ 64 β†’ 64 (MLP residual) 8,192 params
159
+ 64 β†’ 768 (dec_out) 49,536 params
160
+
161
+ Per-image FLOPs: ~2.3M (roughly half of LOW)
162
+ CV attractor position: 0.392 (NOT in universal band, stable own state)
163
+ Reconstruction MSE on 16-noise mix: 0.843
164
+ ```
165
+
166
+ **HIGH band (D=4) β€” shortcut solution**
167
+
168
+ ```
169
+ Per patch (48-dim input tile):
170
+ 48 β†’ 64 (enc_in) 3,072 params
171
+ 64 β†’ 64 (MLP residual) 8,192 params
172
+ 64 β†’ 128 (enc_out) 8,192 params
173
+ reshape to [32, 4] β€” 32 rows on S^3
174
+ Gram+eigh in fp64 β†’ S ∈ ℝ^4
175
+ cross-attn: S ← S Β· (1 + Ξ±Β·tanh(attn(S)))
176
+ MΜ‚ = U Β· diag(S) Β· Vα΅€
177
+ 128 β†’ 64 (dec_in) 8,192 params
178
+ 64 β†’ 64 (MLP residual) 8,192 params
179
+ 64 β†’ 48 (dec_out) 3,120 params
180
+
181
+ Per-image FLOPs: ~12.9M (higher despite fewer params β€” 256 patches)
182
+ CV attractor position: 1.096 (FAR off universal band, clumped packing)
183
+ Reconstruction MSE on 16-noise mix: 0.071 ← lowest MSE in the sweep
184
+ ```
185
+
186
+ Every other variation tested (different V, different hidden, d=2 depth,
187
+ different patch size at fixed D) landed in the band determined by D. **D
188
+ is the band selector. Every other hyperparameter tunes performance within
189
+ a band.**
190
+
191
+ ---
192
+
193
+ ## 3. Why three bands β€” dimensional argument, confirmed by uniform-sphere measurement
194
+
195
+ The number 0.20 is not arbitrary. CV of pentachoron volumes on the unit
196
+ (Dβˆ’1)-sphere depends on **how much room the V points have to spread**,
197
+ which in turn depends on the **surface area** of S^{Dβˆ’1} relative to the
198
+ number of points V.
199
+
200
+ The unit (nβˆ’1)-sphere has surface area:
201
+
202
+ ```
203
+ A(n) = 2Β·Ο€βΏαŸΒ² / Ξ“(n/2)
204
+ ```
205
+
206
+ So for our three D values:
207
+
208
+ | D | S^{D-1} | surface area | V=64 points, ~area each |
209
+ |:-:|:-:|:-:|:-:|
210
+ | 16 | S^15 | 5.72 | 0.089 |
211
+ | 8 | S^7 | 4.06 | 0.063 |
212
+ | 4 | S^3 | 19.74 | 0.308 |
213
+
214
+ **D=4 gives each point nearly 5Γ— more ambient room than D=16.** With so
215
+ much space and only 32–64 points, the points cluster into local groups;
216
+ random 5-point pentachora sample very unevenly, producing high CV (clumpy).
217
+
218
+ **D=16 is the sweet spot** where the sphere has enough dimensions to admit
219
+ a well-spread packing but not so many that the points become isolated.
220
+ Random 5-point pentachora from a uniform D=16 packing produce a tight,
221
+ consistent volume distribution β€” exactly the 0.20 CV observed.
222
+
223
+ **D=8 is intermediate**: the sphere is uniform enough to avoid clumping
224
+ but not generous enough to admit the near-perfect packing D=16 finds.
225
+ This gives a stable 0.39 CV that sits between the extremes.
226
+
227
+ ### The quantitative match: attractor CV = uniform-sphere CV
228
+
229
+ The dimensional argument gives an ordering. The stronger claim is that
230
+ each band's CV value **matches the uniform-sphere prediction for that D**
231
+ directly. We computed the uniform-sphere CV for V=64 points via a closed
232
+ random-sampling procedure (no model, no data, no training β€” just
233
+ `torch.randn(V, D)` followed by `F.normalize(dim=-1)` and the same
234
+ pentachoron CV metric), using a fixed seed for reproducibility:
235
+
236
+ | D | uniform-sphere CV | attractor CV (mean across 5 seeds) | deviation |
237
+ |:-:|:-:|:-:|:-:|
238
+ | 16 | 0.1990 | 0.1969 | -0.003 |
239
+ | 8 | 0.3568 | 0.3588 | +0.002 |
240
+ | 4 | 0.9229 | 0.9016 | -0.021 |
241
+
242
+ Trained models landed within **2% of the uniform-sphere prediction** on
243
+ all three bands. The attractor is not *near* the uniform-sphere
244
+ distribution β€” it *is* the uniform-sphere distribution, selected
245
+ dynamically by the combination of gradient descent and sphere-norm
246
+ enforcement.
247
+
248
+ This reframes the earlier "bands are attractors" language as a specific
249
+ empirical claim: **under sphere-norm, the 5-point pentachoron CV of the
250
+ encoder's latent rows converges to the CV of a uniform distribution of V
251
+ points on S^{Dβˆ’1}.** Training from random initialization produces the
252
+ same geometric configuration as drawing random points on the sphere,
253
+ independent of all other architectural and training choices tested.
254
+
255
+ The prediction this analysis makes: **D=32 sweeps should either land in
256
+ the LOW band alongside D=16, or split into a new band below 0.20**. If
257
+ the former, D=16 is the architectural choice that makes the universal
258
+ attractor accessible and higher-D just reproduces it. If the latter,
259
+ there is a ladder of attractors continuing downward, and the universal
260
+ 0.20 is a waypoint rather than a floor. **This is a testable prediction
261
+ for our next sweep.**
262
+
263
+ ---
264
+
265
+ ## 4. Ablation program: the band structure is architectural
266
+
267
+ To test whether the band structure is a robust property of the
268
+ architecture or an artifact of specific training choices, we ran an
269
+ ablation program of **149 independent training runs across 12 orthogonal
270
+ hyperparameter dimensions**. Each variant was run at three band
271
+ representatives (LOW: S64-V64-D16-h64-d1-p16; MID: S64-V64-D8-h64-d1-p16;
272
+ HIGH: S64-V32-D4-h64-d1-p4), with band membership measured at 1000
273
+ batches for LOW/MID and 100 batches for HIGH. The full matrix completed
274
+ in under one hour of single-H100 wallclock on Colab.
275
+
276
+ ### 4.1 The ablation matrix
277
+
278
+ | Group | Dimensions varied | Variants | Total runs | Band match rate |
279
+ |:-----:|:------------------|:--------:|:----------:|:---------------:|
280
+ | A | Random seed | 5 seeds Γ— 3 bands | 15 | **100%** |
281
+ | B | Noise-type data subset | 6 subsets Γ— 3 bands | 18 | 94% |
282
+ | C | Optimizer (Adam/SGD/SGD+mom/AdamW) | 4 Γ— 3 bands | 12 | **100%** |
283
+ | D | LR schedule (cosine/const/linear/warm/one-cycle) | 5 Γ— 3 bands | 15 | **100%** |
284
+ | E_preview | Soft-hand regime (full/pure-MSE/measure/hard-target) | 4 Γ— 3 bands | 12 | **100%** |
285
+ | F | Activation (GELU/ReLU/SiLU/Tanh/**Identity**) | 5 Γ— 3 bands | 15 | **100%** |
286
+ | G | Row normalization (sphere/none/LayerNorm/scale-only) | 4 Γ— 3 bands | 12 | **58%** |
287
+ | I | Cross-attention (0-2 layers, bounded/unbounded Ξ±) | 4 Γ— 3 bands | 12 | **100%** |
288
+ | J | Capacity within LOW (V, hidden) | 5 configs | 5 | **100%** |
289
+ | K | Batch size (32/128/512/1024) | 4 Γ— 3 bands | 12 | **100%** |
290
+ | L | Init (orthogonal/Kaiming/Xavier/small-normal) | 4 Γ— 3 bands | 12 | **100%** |
291
+ | M | Brute-force SGD (lr=0.1 to 1.0, momentum up to 0.99) | 3 Γ— 3 bands | 9 | **100%** |
292
+ | **Total** | | | **149** | **96%** |
293
+
294
+ **All mismatches (6 of 149) came from Group G** β€” the normalization-
295
+ ablation group. Every other dimension preserved the band assignment in
296
+ every single configuration tested.
297
+
298
+ ### 4.2 What the non-G groups demonstrate
299
+
300
+ The attractor survives intact across:
301
+
302
+ - **Seed variation**: Group A β€” 5 seeds per band land within 0.012 of
303
+ each other on LOW (CV-of-CV 0.4%), 5% on MID and HIGH. The attractor
304
+ is seed-indifferent, not merely seed-robust.
305
+ - **Optimizer choice**: Group C β€” Adam (lr=1e-4), plain SGD (lr=1e-2),
306
+ SGD with momentum, and AdamW all converge to the same attractor within
307
+ 0.0024 of each other. The attractor is not an Adam artifact.
308
+ - **Schedule choice**: Group D β€” cosine, constant, linear decay, warm
309
+ restarts, and one-cycle all preserve band membership. The schedule
310
+ does not select the attractor.
311
+ - **Activation function**: Group F β€” GELU, ReLU, SiLU, Tanh, **and
312
+ identity** all reach the correct band. The attractor does not require
313
+ representational nonlinearity in the encoder. A linear encoder with
314
+ sphere-norm + SVD suffices.
315
+ - **Initialization**: Group L β€” orthogonal, Kaiming-normal, Xavier-
316
+ uniform, and small-normal all reach the attractor. The attractor is
317
+ not init-dependent, so long as the sphere-norm step is present.
318
+ - **Batch size**: Group K β€” 32, 128, 512, and 1024 all work equally. No
319
+ batch-size effect on band membership.
320
+ - **Cross-attention**: Group I β€” 0 layers, 1 layer, 2 layers, or 1 layer
321
+ with unbounded Ξ± all preserve the band. The cross-attention module is
322
+ not responsible for the attractor.
323
+ - **Capacity within LOW**: Group J β€” V and hidden combinations from
324
+ (V=16, h=32) up to (V=128, h=128) all reach LOW band. The attractor
325
+ has a remarkably wide parameter range that supports it.
326
+ - **Data composition**: Group B β€” 6 different noise-type subsets
327
+ (Gaussian only, structured only, heavy-tailed only, first-half, even-
328
+ indices, all 16) all preserve band membership. The only miscall
329
+ (B-HIGH-B2_gaussian_only at CV_ema 0.7533) had observed sphere-CV of
330
+ 0.9504, confirming the model reached the HIGH attractor but produced a
331
+ CV-EMA below the 0.80 classification threshold due to limited-batch
332
+ measurement noise. Not a real anomaly.
333
+ - **Soft-hand regime**: Group E_preview β€” full soft-hand (E1), pure MSE
334
+ with no CV involvement (E2), CV-EMA tracked but not used (E3), and
335
+ hard CV-target penalty (E4) **all reach the same CV to within
336
+ 0.0014** on every band. The attractor is reached even when the loss
337
+ function contains no CV-related signal at all.
338
+
339
+ The E_preview result deserves particular emphasis. Pure MSE
340
+ reconstruction reaches CV = 0.2046 on LOW band vs 0.2037 with full
341
+ soft-hand β€” a difference below the measurement noise floor. The
342
+ architecture alone, through the sphere-norm + SVD geometric pipeline,
343
+ selects the uniform-sphere attractor. Training signal does not need to
344
+ encode any preference for this outcome.
345
+
346
+ ### 4.3 Group G: sphere-norm as attractor selector
347
+
348
+ The only ablation that perturbs the band assignment is normalization
349
+ removal or replacement. Across four normalization modes:
350
+
351
+ | Variant | LOW (D=16) | MID (D=8) | HIGH (D=4) |
352
+ |:--|:-:|:-:|:-:|
353
+ | G1 sphere-norm | 0.196 (-0.003) | 0.352 (-0.005) | 0.945 (+0.022) |
354
+ | G2 no-norm | 0.374 (+0.175) | 0.555 (+0.198) | 1.280 (+0.357) |
355
+ | G3 LayerNorm | 0.200 (+0.002) | 0.421 (+0.064) | 0.706 (-0.217) |
356
+ | G4 scale-only | 0.358 (+0.159) | 0.506 (+0.149) | 1.282 (+0.359) |
357
+
358
+ *Numbers in parentheses are deviations from uniform S^(D-1) prediction.*
359
+
360
+ Three patterns emerge:
361
+
362
+ **G1 sphere-norm reaches uniform S^(D-1) within 3% across every band.**
363
+ This is the framework's core prediction.
364
+
365
+ **G2 (no normalization) and G4 (scale-only) produce nearly identical
366
+ geometric outcomes**, differing by less than 0.03 on every band. The
367
+ scale-only variant divides each row by the batch's mean row norm but
368
+ does not enforce unit length; the no-norm variant does nothing at all.
369
+ That these produce functionally identical geometry demonstrates that
370
+ **the unit-norm constraint is the entire active ingredient of sphere-
371
+ normalization**. Scale magnitude without unit enforcement does no
372
+ geometric work.
373
+
374
+ When normalization is absent or scale-only, the system converges to a
375
+ different attractor β€” approximately the Gaussian bulk configuration
376
+ (the CV of V points drawn i.i.d. from N(0, I_D) without sphere
377
+ projection). At D=16, the bulk CV prediction is 0.358; our observed is
378
+ 0.374, within 5%. At D=8: predicted 0.588, observed 0.555. The
379
+ architecture still produces a reproducible attractor; it is simply a
380
+ different attractor in the geometric family, not the uniform-sphere one.
381
+
382
+ **G3 LayerNorm acts as a D-dependent partial selector.** At LOW (D=16)
383
+ LayerNorm reaches the uniform attractor cleanly (observed 0.200 vs
384
+ predicted 0.199). At MID (D=8) it mildly elevates the CV to 0.421 β€”
385
+ between the uniform and bulk predictions. At HIGH (D=4) it *underselects*,
386
+ pulling the CV below the uniform target to 0.706. The mechanism: LayerNorm
387
+ centers and variance-normalizes across D elements, producing a configuration
388
+ on a hyperplane rather than a sphere. At higher D the hyperplane
389
+ approximates S^(Dβˆ’1) better; at D=4 the geometric distortion becomes
390
+ visible as CV depression.
391
+
392
+ ### 4.4 Implications
393
+
394
+ 1. **The three-band structure is robust.** Across 149 training runs with
395
+ variations in 12 orthogonal dimensions, 96% preserve the predicted
396
+ band assignment. Non-ablation groups show 100% preservation.
397
+
398
+ 2. **The attractor is architectural.** It is reached by pure MSE training,
399
+ by linear encoders, by any first-order optimizer, under any batch size,
400
+ and with any initialization. What it requires is the sphere-norm + SVD
401
+ readout pipeline.
402
+
403
+ 3. **Sphere-norm is a selector, not a creator.** The architecture supports
404
+ a *family* of geometric attractors per D; sphere-norm selects the
405
+ uniform one. Other normalization modes select other members of the
406
+ family (Gaussian bulk, LayerNorm hyperplane) β€” each reproducibly.
407
+
408
+ 4. **The unit-norm constraint is the load-bearing element.** The specific
409
+ mechanism of normalization (centering, variance scaling, or division by
410
+ a norm) is less important than whether a unit-length constraint is
411
+ actively imposed. Division by mean-row-norm does not impose the
412
+ constraint and does not change the attractor.
413
+
414
+ ---
415
+
416
+ ## 5. CV at 1000 batches predicts final band membership
417
+
418
+ The row_cv trajectory plot shows every run finding its band within the
419
+ first ~1000 steps and holding for the remaining 299,000.
420
+
421
+ This gives us a **minutes-scale triage**:
422
+
423
+ - Measure CV-EMA at batch 1000 (~4–7 minutes on Colab single-GPU)
424
+ - CV < 0.30 β†’ will converge to LOW band
425
+ - CV 0.35–0.50 β†’ will converge to MID band
426
+ - CV > 0.80 β†’ will converge to HIGH band
427
+
428
+ No need to train to convergence to determine band membership. Existing
429
+ sweep infrastructure can be modified to early-stop at 1000 batches for
430
+ screening, only continuing runs whose band assignment matches the research
431
+ target.
432
+
433
+ For cell-candidate hunting specifically (want LOW band), this collapses
434
+ the turnaround from ~2 hours per config to ~7 minutes, with the same
435
+ confidence in final geometric classification.
436
+
437
+ ---
438
+
439
+ ## 6. What each band is good for
440
+
441
+ The conventional read of "best MSE" ranks the three bands exactly
442
+ backwards relative to the universal-manifold thesis. A separate reading
443
+ by band character:
444
+
445
+ ### LOW band β€” universal generalist
446
+
447
+ The D=16 attractor has been observed across 17+ unrelated pretrained
448
+ models when probed. A sphere-normed model that lands here is on the same
449
+ geometric manifold those models land on. Its omega tokens (S vectors) are
450
+ in principle translatable to/from tokens produced by any other attractor-
451
+ aligned model via Procrustes alignment.
452
+
453
+ On pure noise this band reconstructs *worse* than the HIGH shortcut. On
454
+ transfer to other distributions (images, text as tensors, unseen noise
455
+ types) it reconstructs *far better* β€” Fresnel-base 256 from this band
456
+ achieves MSE 3.8Γ—10⁻⁡ on ImageNet without seeing it during training.
457
+
458
+ **Use: battery / cell / relay in multi-model collective architectures.**
459
+
460
+ ### MID band β€” intermediate attractor
461
+
462
+ Four D=8 configs with varying hidden widths and depths all converge to
463
+ CV 0.38–0.40, tighter than the HIGH band's spread. This is a real
464
+ attractor of its own, not a transition state. We do not yet know what it
465
+ represents; characterization is an open research direction.
466
+
467
+ Testable: does the MID attractor transfer across distributions like the
468
+ LOW one? Does distillation from a LOW model speed convergence for a MID
469
+ configuration, or vice versa?
470
+
471
+ **Use: undetermined pending characterization. Possibly a secondary
472
+ geometric substrate for specific domains.**
473
+
474
+ ### HIGH band β€” specialist / shortcut
475
+
476
+ The D=4 band reconstructs noise at very low MSE because D=4 is small
477
+ enough that the encoder can find low-rank solutions that collapse
478
+ information along specific directions. CV > 1.0 indicates the pentachoron
479
+ volume distribution is more variable than its mean β€” the rows are
480
+ highly clumped along particular directions rather than spread.
481
+
482
+ This is a specialist representation. It excels at the specific
483
+ distribution it was trained on and will likely fail catastrophically
484
+ on out-of-distribution input (to be tested). But it is *compact*: 41K
485
+ parameters, 256 patches per image, lowest MSE in the sweep.
486
+
487
+ **Use: domain-specific specialist batteries where the input distribution
488
+ is known and stable. Compression tasks where lossy per-distribution
489
+ encoding is acceptable. Potentially: noise-channel glyph emission for a
490
+ collective system that routes noise-like signals through a dedicated
491
+ specialist.**
492
+
493
+ ---
494
+
495
+ ## 7. The methodological correction
496
+
497
+ Our prior F-class sweep methodology used MSE as the primary filter with
498
+ a 1-epoch "keep-or-kill" curve-delta verdict. This sweep's data shows
499
+ that methodology is insufficient:
500
+
501
+ - The lowest-MSE configuration in the sweep (CV 0.93, HIGH band) was
502
+ flagged as a "strong cell candidate" until geometry was checked.
503
+ - The true on-attractor configurations had *higher* MSE than several
504
+ off-attractor configurations.
505
+ - MSE alone cannot distinguish specialist low-rank solutions from
506
+ generalist on-attractor solutions.
507
+
508
+ Going forward, the cell-candidate filter is three-tier:
509
+
510
+ 1. **CV in 0.13–0.30 band** at step 1000 β†’ attractor candidate
511
+ 2. **CV stays in band** through training β†’ stable attractor
512
+ 3. **Attractor holds under freezing and host gradient** β†’ viable cell
513
+
514
+ The 1000-batch CV measurement is fast and eliminates the false positives
515
+ that MSE-only triage was generating.
516
+
517
+ ---
518
+
519
+ ## 8. Open questions this sweep raises
520
+
521
+ **Questions the Phase 1 ablation resolved**:
522
+
523
+ - βœ“ *Is the three-band structure reproducible?* Yes, 96% match rate across
524
+ 149 independent runs in 12 orthogonal ablation dimensions.
525
+ - βœ“ *Is the attractor Adam-specific?* No. Adam, SGD, SGD+momentum, and AdamW
526
+ all reach it within 0.0024 of each other.
527
+ - βœ“ *Does the attractor require nonlinear activation?* No. Identity
528
+ activation (purely linear encoder) reaches all three bands correctly.
529
+ - βœ“ *Minimum LOW-band parameter count.* Tested from (V=16, h=32) to (V=128,
530
+ h=128); all reach LOW band. Attractor admits wide capacity range.
531
+ - βœ“ *Is sphere-norm load-bearing?* Yes, but as an attractor *selector*,
532
+ not an attractor *creator*. Ablating it redirects the system to a
533
+ different reproducible attractor (Gaussian bulk), rather than destroying
534
+ attractor structure.
535
+
536
+ **Questions that remain open**:
537
+
538
+ - **Does D=32 produce a new band below 0.20, or reproduce LOW?** Tests
539
+ whether the attractor ladder continues or terminates at D=16. The
540
+ dimensional argument predicts a narrow band near CV(SΒ³ΒΉ) β‰ˆ 0.13;
541
+ training to verify is a ~5-config add-on to Phase 1.
542
+
543
+ - **Is the MID band useful in its own right?** No systematic probe of D=8
544
+ transfer behavior yet. The ablation confirms it's a real attractor, not
545
+ an artifact, but whether D=8 omega tokens are usefully translatable
546
+ across domains is untested.
547
+
548
+ - **What are the HIGH-band specialists actually encoding?** Per-singular-
549
+ vector analysis of a trained D=4 config should reveal which directions
550
+ carry the shortcut information.
551
+
552
+ - **Do HIGH-band shortcuts fail out-of-distribution as expected?** Run the
553
+ universal diagnostic (16 noise types, text, images) on the HIGH-band
554
+ champion. Confirms or falsifies the specialist-vs-generalist reading.
555
+
556
+ - **What is the LayerNorm-at-D=4 undershoot measuring?** The G3 variant at
557
+ HIGH band reached CV 0.706, significantly below uniform SΒ³ prediction
558
+ 0.923. This is a distortion specific to the centering+variance-norm
559
+ combination at low D. Characterization would clarify exactly which
560
+ geometric property of sphere-norm is irreplaceable by the standard
561
+ normalization layers.
562
+
563
+ - **Within-attractor reconstruction MSE**: Phase 1 was a band-classification
564
+ sweep, not a reconstruction-quality sweep. The within-attractor MSE data
565
+ needed to characterize reconstruction floors at each band requires full
566
+ 30-epoch runs. Phase 2 was planned for this; initial results suggest
567
+ LBFGS reaches meaningfully lower MSE than Adam within the HIGH attractor
568
+ (0.0644 vs 0.072 at 100 batches vs 30 epochs), pointing at
569
+ optimizer-dependent within-attractor structure that a Phase 2 program
570
+ should map.
571
+
572
+ ---
573
+
574
+ ## 9. Artifacts
575
+
576
+ All 149 Phase 1 ablation runs are preserved on HuggingFace under
577
+ `AbstractPhil/geolip-svae-ablations` with per-run `final_report.json`
578
+ files and TensorBoard event files. The aggregated analysis
579
+ (`band_matrix.csv`, `anomalies.csv`, `group_summaries.csv`,
580
+ `uniformity_diagnostic.csv`, `snapshot_meta.json`) is under
581
+ `_analysis/{timestamp}/` within the same repo.
582
+
583
+ The 13 configurations from the original three-band sweep are preserved
584
+ under `AbstractPhil/geolip-svae-batteries` with full TensorBoard logs,
585
+ checkpoints every 5 epochs, and final reports.
586
+
587
+ **Training code**: `johanna_F_trainer.py` (base trainer),
588
+ `ablation_trainer.py` (ablation adapter with PatchSVAE_F_Ablation
589
+ subclass), `ablation_configs.py` (explicit matrix of 149 Phase 1 and 174
590
+ Phase 2 variants), `ablation_orchestrator.py` (Colab cell for sequential
591
+ execution with HF resume logic), `aggregate_results.py` (snapshot
592
+ aggregator writing timestamped analyses).
593
+
594
+ **Formula catalogue** (every load-bearing equation in the architecture):
595
+ `johanna_F_formula_catalogue.md`.
596
+
597
+ ---
598
+
599
+ *This finding emerged from two complementary sweeps: an original F-class
600
+ (miniature battery) exploration over approximately 40 hours of A100 time
601
+ that produced the three-band hypothesis, followed by a 149-run ablation
602
+ program completed in under one hour on a single H100 that validated the
603
+ hypothesis across 12 orthogonal dimensions of variation. The sphere-norm-
604
+ as-selector finding and the linear-encoder result were emergent findings
605
+ from the ablation, not anticipated by the original sweep.*