AbstractPhil committed
Commit db13661 · verified · 1 Parent(s): 2b5d572

Update README.md

Files changed (1)
  1. README.md +72 -16

README.md CHANGED
@@ -1,5 +1,57 @@
  ---
  license: apache-2.0
  ---
 
  # GeoLIP Core — Geometric Linear Interpolative Patchwork
@@ -97,11 +149,11 @@ Pipeline per forward pass:
  6. Gated residual: `gate × patchwork_output + (1-gate) × input_patch`
  7. Global skip connection: `input + blended_output`
 
- *Why it exists*: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28). The relay measures and gates — it never averages angular relationships. Empirically preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention.
 
  *SLERP stroboscope*: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture — the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.
 
- *Cold gating*: Gates initialize at sigmoid(-3) ≈ 0.047. At initialization, the relay is nearly transparent — 95.3% of the signal passes through unchanged. This means stacking N relays at init is approximately the identity function. The gates open during training only where the relay's geometric processing provides useful signal. This prevents early training instability from deep stacks.
 
  *Critical rule*: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).
 
@@ -135,7 +187,7 @@ emb_proj(dim→relay/2) + tri_proj(A→relay/4) + raw_mag(1) → ctx_proj → re
 
  *Stats bias*: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.
 
- *Why per-compartment, not per-anchor*: Per-anchor magnitude (512+ scalars) is too fine-grained — the model hallucinates confidence at individual anchor resolution. Per-compartment (4-8 scalars) is coarse enough to represent regional confidence without overfitting to individual anchor positions.
 
  ### AnchorPush
 
@@ -157,7 +209,7 @@ Periodically repositions anchors toward class centroids computed from accumulate
 
  **`FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)`**
 
- 3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. Achieved 69.8% single-view accuracy on CIFAR-100 but was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling.
 
  Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.
 
@@ -203,9 +255,11 @@ Raw Cayley-Menger determinant computation. Given (B, N, D) point sets, returns (
 
  *Why CV exists*: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).
 
- *The natural basin*: Extensive experimentation (43 configurations, noise prediction with zero data structure) established that smooth optimization on S^(d-1) naturally converges to CV ≈ 0.23 at d=128 with no CV loss applied. This is the equilibrium between the sphere's rigidity (fixed curvature) and gradient descent's smoothness (continuous updates). Setting cv_target=0.80 and applying weight=100 in a full training run still produced CV=0.20 — the sphere's geometry overrides the loss.
 
- *The floor*: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11. This is the hard geometric floor — the maximum volume regularity achievable by smooth functions on S^(d-1).
 
  *Batched computation*: The default `batched=True` eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via `argsort(rand)`, all Cayley-Menger matrices are constructed in parallel, and a single `torch.linalg.det` call on shape `(n_samples, 6, 6)` computes all volumes at once. Measured speedup: **141x** at n=200 samples. The `batched=False` fallback exists for debugging and validation.
 
@@ -297,7 +351,9 @@ The complete image classification pipeline: pixels in, logits out, with full geo
 
  **`ConvEncoder(output_dim)`**
 
- 8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final adaptive average pooling to 1×1, followed by a linear projection to `output_dim` with LayerNorm.
 
  L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies `F.normalize()` after extracting the raw magnitude.
 
@@ -335,11 +391,11 @@ The full pipeline. Combines ConvEncoder + MagnitudeFlow + InternalConstellationC
 
  ## Empirical Constants
 
- Two constants have been validated across 17+ models, all architectures and modalities:
 
- **CV ≈ 0.20–0.23** — The natural coefficient of variation of pentachoron volumes on S^(d-1) under smooth optimization. Observed at effective dimension ~16 across contrastive models, language models, diffusion models, and VAEs. Setting cv_target=0.80 with weight=100 still produces CV=0.20. The constant is a property of the sphere's curvature interacting with gradient descent's smoothness, not a hyperparameter.
 
- **0.29154 radians** — The binding/separation phase boundary. Below this angular distance, an embedding is structurally bound to its nearest anchor (local geometry dominates). Above it, task pressure has moved the embedding beyond local curvature (classification dominates). Observed independently in contrastive training, language modeling, ODE flow matching, and alpha parameter convergence across architectures.
 
  ---
 
@@ -347,14 +403,14 @@ Two constants have been validated across 17+ models, all architectures and modal
 
  These are not preferences — they are empirically validated constraints. Violating them produces measurable degradation.
 
- 1. **Never use attention in geometric pipelines.** Softmax averaging destroys angular structure. Each attention layer without residual halves effective dimensionality. Use ConstellationRelay or RelayLayer instead.
 
- 2. **Never use global average pooling in geometric encoders.** It collapses spatial structure. Flatten or use spatial statistics (mean+std per channel minimum). Confirmed empirically: 243-d avg pool drops accuracy from ~70% to ~29% vs 15552-d flatten.
 
- 3. **Never apply weight decay to anchor parameters.** Anchors live on S^(d-1). Decay pulls them toward the origin, destroying normalization. Use `make_optimizer()` or manually separate param groups.
 
- 4. **Never let classification gradient reach anchor positions.** Detach assignment targets in the bridge. Let push handle global anchor repositioning. CE gradient on anchors creates classification shortcuts that destroy the Voronoi structure.
 
- 5. **Always L2-normalize before triangulation.** Triangulation measures angular position. Unnormalized embeddings mix magnitude and direction, making distances meaningless.
 
- 6. **Relay gates must initialize cold.** `gate_init=-3.0` (sigmoid ≈ 0.047). Hot initialization causes deep relay stacks to diverge before the network has learned meaningful features.
 
  ---
  license: apache-2.0
+ library_name: pytorch
+ pipeline_tag: image-classification
+ language:
+ - en
+ tags:
+ - geometric-deep-learning
+ - constellation
+ - patchwork
+ - hypersphere
+ - contrastive-learning
+ - image-classification
+ - cifar-100
+ - relay
+ - no-attention
+ - geolip
+ - parameter-efficient
+ - simplex-volume
+ - cayley-menger
+ - angular-measurement
+ datasets:
+ - cifar100
+ metrics:
+ - accuracy
+ model-index:
+ - name: GeoLIPImageEncoder
+   results:
+   - task:
+       type: image-classification
+       name: Image Classification
+     dataset:
+       type: cifar100
+       name: CIFAR-100
+       config: default
+       split: test
+     metrics:
+     - type: accuracy
+       value: 66.1
+       name: Top-1 Accuracy
+       verified: false
+   - task:
+       type: image-classification
+       name: Image Classification (kNN)
+     dataset:
+       type: cifar100
+       name: CIFAR-100
+       config: default
+       split: test
+     metrics:
+     - type: accuracy
+       value: 61.2
+       name: kNN Accuracy (k=1)
+       verified: false
  ---
 
  # GeoLIP Core — Geometric Linear Interpolative Patchwork
 
  6. Gated residual: `gate × patchwork_output + (1-gate) × input_patch`
  7. Global skip connection: `input + blended_output`
 
+ *Why it exists*: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28 per layer). With residual connections, the residual signal dominates ~11:1, making attention a small perturbation — the residual is doing the geometric preservation, not the attention. The relay replaces the attention mechanism entirely: it measures and gates rather than averaging angular relationships. Empirically preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention stacks without residual.
 
  *SLERP stroboscope*: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture — the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.
 
+ *Cold gating*: Gates initialize at sigmoid(-3) ≈ 0.047. The relay blends `gate × patchwork_output + (1-gate) × input_patch` and adds it to the global skip. At initialization, 95.3% of the blended signal comes from the input patch (near-identity), and only 4.7% from the random untrained patchwork. This means stacking N relays at init produces minimal noise accumulation. The gates open during training only where the relay's geometric processing provides useful signal, allowing selective depth utilization.
 
  *Critical rule*: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).
 
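A minimal sketch of the cold-gated blend and the SLERP phase sampling described above, in PyTorch. `ColdGatedRelay`, `slerp`, and the `nn.Linear` stand-in for the patchwork are illustrative names, not the repository's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between unit vectors a and b at phase t."""
    cos_ab = (a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos_ab)
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b


class ColdGatedRelay(nn.Module):
    """Illustrative relay step: x + gate * patchwork(x) + (1 - gate) * x."""

    def __init__(self, dim: int, gate_init: float = -3.0):
        super().__init__()
        # sigmoid(-3) ~ 0.047: at init, ~95.3% of the blend is the input patch
        self.gate_logit = nn.Parameter(torch.full((1,), gate_init))
        self.patchwork = nn.Linear(dim, dim)  # stand-in for the geometric patchwork

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)
        blended = gate * self.patchwork(x) + (1 - gate) * x  # gated residual
        return x + blended  # global skip connection


# Stroboscope: sample anchor positions at 3 phases along home -> learned
home = F.normalize(torch.randn(16, 64), dim=-1)
learned = F.normalize(torch.randn(16, 64), dim=-1)
phases = [slerp(home, learned, t) for t in (0.0, 0.5, 1.0)]

relay = ColdGatedRelay(64)
out = relay(F.normalize(torch.randn(8, 64), dim=-1))
print(torch.sigmoid(relay.gate_logit).item())  # ~0.047 at init
```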
 
  *Stats bias*: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.
 
+ *Why per-compartment, not per-anchor or global*: Three approaches were tested empirically. A transformer-based magnitude predictor (3 tokens, self-attention) averaged across tokens, producing near-uniform magnitude across compartments and inverted confidence (wrong predictions scored higher than correct, Δ = -0.14). Per-compartment independent MLPs reduced the inversion (Δ = -0.07) but couldn't eliminate it. The relay stack achieved Δ ≈ +0.001 (hallucination eliminated) with genuine compartment differentiation (std = 0.69). The failure mode of the transformer was specifically that attention averaged the 3-token context, destroying per-region signal. Per-compartment granularity (4-8 scalars) is coarse enough to prevent overfitting to individual anchor noise while fine enough to capture regional confidence differences.
 
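A sketch of the granularity argument, assuming hypothetical names (`CompartmentMagnitude`, `push_stats`); the real MagnitudeFlow wiring is richer than this:

```python
import torch
import torch.nn as nn


class CompartmentMagnitude(nn.Module):
    """Illustrative per-compartment magnitude head: K regional scalars, not per-anchor."""

    def __init__(self, ctx_dim: int, n_compartments: int = 8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(ctx_dim, ctx_dim), nn.GELU(),
            nn.Linear(ctx_dim, n_compartments),  # 4-8 scalars: regional confidence
        )

    def forward(self, ctx: torch.Tensor, push_stats: torch.Tensor) -> torch.Tensor:
        # push_stats: (n_compartments,) momentum statistics from the last anchor push,
        # added as a bias so magnitude tracks how fast each region is moving
        return self.head(ctx) + push_stats


mag = CompartmentMagnitude(ctx_dim=128, n_compartments=8)
out = mag(torch.randn(4, 128), torch.zeros(8))
print(out.shape)  # (4, 8): one magnitude scalar per compartment
```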
  ### AnchorPush
 
 
  **`FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)`**
 
+ 3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. The full pipeline (Conv8 encoder + 6-step flow variant + learned head) achieved 69.8% single-view accuracy on CIFAR-100, but the approach was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling (relay preserves 99.4% geometry at depth 16; flow accuracy degrades with step count beyond 6, and cross-constellation testing revealed 45% angular displacement — cos(pre, post) = 0.555).
 
  Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.
 
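A sketch of tangent-plane Euler integration on S^(d-1) with a sinusoidal time embedding. The velocity field is a stand-in linear layer, and `euler_flow_on_sphere` / `timestep_embedding` are illustrative names:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_embedding(t: float, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of a scalar timestep (standard construction)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = t * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)])


def euler_flow_on_sphere(x: torch.Tensor, field: nn.Module, n_steps: int = 3) -> torch.Tensor:
    """Integrate dx/dt = v in the tangent plane of S^(d-1), renormalizing each step."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t_emb = timestep_embedding(i * dt, 32).unsqueeze(0).expand(x.shape[0], -1)
        v = field(torch.cat([x, t_emb], dim=-1))
        v_tan = v - (v * x).sum(-1, keepdim=True) * x  # drop the radial component
        x = F.normalize(x + dt * v_tan, dim=-1)        # stay on the sphere
    return x


field = nn.Linear(64 + 32, 64)  # stand-in velocity field
x = F.normalize(torch.randn(8, 64), dim=-1)
print(euler_flow_on_sphere(x, field).norm(dim=-1))  # all ~1.0
```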
 
  *Why CV exists*: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).
 
+ *The natural basin*: Extensive experimentation (43 configurations, pure noise prediction with zero data structure) established that smooth optimization on S^(d-1) at d=128 naturally converges to CV ≈ 0.24 (mean 0.2393 across 5 seeds, spread 0.013) with no CV loss applied. In trained full models with structured data, CV converges to the tighter 0.20–0.23 band observed across 17+ architectures. The natural basin is strong: at low CV weight (≤0.01), the loss cannot displace CV from the basin regardless of target — targets from 0.00 to 1.00 all produce CV in [0.23, 0.25]. However, CV is not immovable: at weight ≥ 1.0, the optimizer CAN escape the basin (w=1.0/t=0.80 → CV ≈ 0.71; w=100/t=0.80 → CV ≈ 0.82). The key finding is that the basin exists as an attractor independent of the loss function — it emerges from the interaction between the sphere's curvature and gradient descent's smoothness.
 
+ *The floor*: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11–0.12. This is the hard geometric floor — the maximum volume regularity achievable by smooth functions on S^(d-1). Below this, simplex volumes are constrained by the curvature itself.
+
+ *Dimension dependence*: The basin is not universal across dimensions. The noise sweep showed: d=16 → CV=0.37, d=32 → CV=0.26, d=64 → CV=0.23, d=128 → CV=0.24, d=256 → CV=0.27, d=512 → CV=0.33. The minimum is around d=64–128, corresponding to effective geometric dimension ~16–47. At very low dimension, the sphere is too curved for regular volumes. At very high dimension, the model uses only a subspace (effective dim ~41–50 regardless of ambient dim), and the unused dimensions introduce volume irregularity.
 
  *Batched computation*: The default `batched=True` eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via `argsort(rand)`, all Cayley-Menger matrices are constructed in parallel, and a single `torch.linalg.det` call on shape `(n_samples, 6, 6)` computes all volumes at once. Measured speedup: **141x** at n=200 samples. The `batched=False` fallback exists for debugging and validation.
 
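A sketch of the batched path, using the standard Cayley-Menger identity for a 4-simplex (V² = -det(M)/9216, with M the 6×6 bordered matrix of squared pairwise distances); `pentachoron_volume_cv` is an illustrative name, not the module's API:

```python
import torch
import torch.nn.functional as F


def pentachoron_volume_cv(emb: torch.Tensor, n_samples: int = 200) -> torch.Tensor:
    """Coefficient of variation of sampled pentachoron (4-simplex) volumes.

    emb: (N, D) unit-normalized embeddings, N >= 5.
    """
    N, D = emb.shape
    # Sample n_samples pentachora at once via argsort(rand): first 5 of each permutation
    idx = torch.rand(n_samples, N).argsort(dim=-1)[:, :5]  # (n_samples, 5)
    pts = emb[idx]                                         # (n_samples, 5, D)
    d2 = torch.cdist(pts, pts).pow(2)                      # (n_samples, 5, 5)

    # Bordered Cayley-Menger matrix: first row/col of ones, zero corner
    M = torch.ones(n_samples, 6, 6, dtype=emb.dtype)
    M[:, 0, 0] = 0.0
    M[:, 1:, 1:] = d2
    det = torch.linalg.det(M)                              # one batched det call
    vol = (-det / 9216.0).clamp_min(0.0).sqrt()            # (n_samples,)
    return vol.std() / vol.mean()


emb = F.normalize(torch.randn(512, 128), dim=-1)
print(pentachoron_volume_cv(emb))  # CV of sampled volumes on S^127
```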
 
  **`ConvEncoder(output_dim)`**
 
+ 8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final `AdaptiveAvgPool2d(1)` reduces spatial dimensions to 1×1, followed by a linear projection to `output_dim` with LayerNorm.
+
+ The adaptive average pooling here operates on conv feature maps — statistical spatial features where pooling is standard dimensionality reduction. This is distinct from the Design Rule against global average pooling on geometric features (triangulation outputs, scattering coefficients), where pooling destroys component-wise structure that downstream geometric layers depend on. The conv→pool→project pipeline produces the input to S^(d-1); geometric constraints apply after normalization, not before.
 
  L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies `F.normalize()` after extracting the raw magnitude.
 
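A sketch matching the description above; the padding choice and demo input size are assumptions, and `ConvEncoderSketch` is not the repository class:

```python
import torch
import torch.nn as nn


def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """(conv3x3-BN-GELU) x 2 + MaxPool, per the description above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
        nn.MaxPool2d(2),
    )


class ConvEncoderSketch(nn.Module):
    """8 conv layers in 4 blocks, 64 -> 128 -> 256 -> 384, pool to 1x1, project."""

    def __init__(self, output_dim: int):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 384),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Sequential(
            nn.Flatten(), nn.Linear(384, output_dim), nn.LayerNorm(output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No L2 normalization here: the raw norm carries encoder confidence
        return self.proj(self.pool(self.blocks(x)))


enc = ConvEncoderSketch(256)
feats = enc(torch.randn(2, 3, 32, 32))
print(feats.shape, feats.norm(dim=-1))  # (2, 256), unnormalized norms
```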
 
  ## Empirical Constants
 
+ Two constants have emerged across extensive experimentation:
 
+ **CV ≈ 0.20–0.23 (trained models)** — The coefficient of variation of pentachoron volumes converges to this band in trained models across contrastive learning, language models, diffusion models, and VAEs (17+ models profiled via Procrustes analysis). On pure noise with no data structure, the natural basin is slightly wider at ~0.24. The constant reflects a dimension-dependent equilibrium between the sphere's curvature and gradient descent's smoothness. At the effective dimensions used by trained models (~16–47), the basin is strongest. The CV loss can be used to regularize toward this band, but at low weight the optimizer converges there naturally — and at high weight the optimizer CAN be forced away from it (at the cost of degraded task performance). The practical recommendation: use cv_weight ≤ 0.01, or omit CV loss entirely and monitor it as a health metric.
 
+ **0.29154 radians** — The binding/separation phase boundary. Below this angular distance from its nearest anchor, an embedding's local geometry is dominated by the sphere's curvature (the anchor's Voronoi cell provides structural context). Above it, the embedding has been displaced far enough that task pressure dominates over local curvature. Observed independently in contrastive training (MinimalShunts), language modeling (T5 generation), ODE flow matching (alpha convergence), and CLIP projection geometry. Its complement (0.70846) appears as the separation threshold. Whether this is a true universal constant or an artifact of the specific architectures tested remains an open question — it is treated as theory-level in ongoing research but should be verified independently.
 
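A sketch of using the boundary as a diagnostic: the fraction of embeddings bound to their nearest anchor, with normalization applied before any angular measurement. `bound_fraction` is an illustrative helper, not part of the library:

```python
import torch
import torch.nn.functional as F

BINDING_BOUNDARY = 0.29154  # radians, per the constant above


def bound_fraction(emb: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Fraction of embeddings within the binding radius of their nearest anchor."""
    emb = F.normalize(emb, dim=-1)      # angular measurement requires unit norm
    anchors = F.normalize(anchors, dim=-1)
    cos = emb @ anchors.T               # cosine similarity to every anchor
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).min(dim=-1).values
    return (theta < BINDING_BOUNDARY).float().mean()


emb = F.normalize(torch.randn(1024, 128), dim=-1)
anchors = F.normalize(torch.randn(100, 128), dim=-1)
print(bound_fraction(emb, anchors))  # random points in d=128: almost none bound
```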
  ---
 
 
  These are not preferences — they are empirically validated constraints. Violating them produces measurable degradation.
 
+ 1. **Never use attention in geometric pipelines.** Softmax-weighted averaging destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28). With residual connections, the residual dominates ~11:1, meaning the attention itself contributes almost nothing — the residual is preserving geometry despite the attention, not because of it. The relay architecture provides direct geometric processing without this compromise. Use ConstellationRelay or RelayLayer instead.
 
+ 2. **Never use global average pooling on geometric features.** This applies to triangulation outputs, scattering features, and any representation where spatial or component-wise structure carries geometric meaning. Confirmed empirically on wavelet scattering features: 243-d global average pool dropped accuracy from ~70% to ~29% compared to 15552-d flatten, because pooling destroys the per-coefficient spatial structure that geometric downstream layers depend on. Use flatten or spatial statistics (mean+std per channel minimum) instead. **Note:** Standard spatial pooling in conv encoders (e.g., `AdaptiveAvgPool2d(1)` after conv feature maps) is not subject to this rule — conv feature maps are statistical representations, not geometric ones. The ConvEncoder uses adaptive avg pooling before projecting to S^(d-1), which is correct. The rule applies after projection to the sphere, where geometric structure must be preserved.
 
+ 3. **Never apply weight decay to anchor parameters.** Anchors live on S^(d-1). Weight decay adds λ×w to the gradient, pulling parameters toward the origin. For anchors, this shrinks their norm below 1.0 between L2-renormalization steps, creating a systematic bias in the gradient direction (toward the origin rather than along the sphere). Over many steps, this interferes with the optimizer's angular momentum. Use `make_optimizer()` or manually place anchor parameters in a `weight_decay=0` param group (see the sketch after this list). This applies to both constellation anchors and relay layer anchors.
 
+ 4. **Never let classification gradient reach anchor positions.** The bridge detaches assignment targets: `assign_target = assign.detach()`. Without detachment, CE gradient flows backward through the bridge → patchwork → triangulation → anchors, repositioning anchors to create classification shortcuts (e.g., collapsing multiple class anchors together to simplify the decision boundary). Push handles global anchor repositioning based on class centroids; gradient handles fine local adjustment via internal losses (spread, attraction, assignment BCE). Mixing CE gradient into anchor positions conflates these two distinct mechanisms.
 
+ 5. **Always L2-normalize before triangulation.** Triangulation computes `emb @ anchors.T`, producing cosine similarities. If embeddings are not unit-normalized, this product mixes magnitude and direction: a high-magnitude embedding appears "closer" to all anchors than a low-magnitude one, regardless of angular position. The resulting distances are no longer purely angular and the patchwork's compartment specialization breaks down. The MagnitudeFlow exists specifically to recover magnitude information through a separate geometric channel rather than contaminating the angular measurements.
 
+ 6. **Relay gates must initialize cold.** `gate_init=-3.0` (sigmoid ≈ 0.047). At initialization, the relay's patchwork outputs are random — they carry no useful geometric signal. The relay uses additive gating: `out = x + gate × patchwork + (1-gate) × patches`. The global skip preserves x, but each layer adds `gate × random_noise` to the residual stream. With hot gates (sigmoid(0) = 0.5), an 8-layer stack accumulates ~8 × 0.5 = 4x the patch magnitude as random noise on top of the signal, causing early training instability. With cold gates (0.047), the accumulated noise after 8 layers is ~8 × 0.047 = 0.38x — well below the signal magnitude. The gates open selectively during training as the relay learns meaningful features, naturally titrating the contribution of each layer.
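
A sketch of the param-group separation behind rule 3, assuming anchor tensors can be identified by name; `make_optimizer_sketch` stands in for the repository's `make_optimizer()`:

```python
import torch
import torch.nn as nn


def make_optimizer_sketch(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.05):
    """Separate param groups so anchor parameters never receive weight decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Assumed convention: anchor tensors carry "anchor" in their parameter name
        (no_decay if "anchor" in name else decay).append(p)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},  # anchors stay on S^(d-1)
        ],
        lr=lr,
    )
```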