AbstractPhil committed
Commit db13661 · verified · 1 Parent(s): 2b5d572

Update README.md

Files changed (1)
  1. README.md +72 -16

README.md CHANGED
@@ -1,5 +1,57 @@
  ---
  license: apache-2.0
  ---
 
  # GeoLIP Core — Geometric Linear Interpolative Patchwork
@@ -97,11 +149,11 @@ Pipeline per forward pass:
  6. Gated residual: `gate × patchwork_output + (1-gate) × input_patch`
  7. Global skip connection: `input + blended_output`
 
- *Why it exists*: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28). The relay measures and gates — it never averages angular relationships. Empirically preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention.
 
  *SLERP stroboscope*: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture — the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.
 
- *Cold gating*: Gates initialize at sigmoid(-3) ≈ 0.047. At initialization, the relay is nearly transparent — 95.3% of the signal passes through unchanged. This means stacking N relays at init is approximately the identity function. The gates open during training only where the relay's geometric processing provides useful signal. This prevents early training instability from deep stacks.
 
  *Critical rule*: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).
 
@@ -135,7 +187,7 @@ emb_proj(dim→relay/2) + tri_proj(A→relay/4) + raw_mag(1) → ctx_proj → re
 
  *Stats bias*: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.
 
- *Why per-compartment, not per-anchor*: Per-anchor magnitude (512+ scalars) is too fine-grained — the model hallucinates confidence at individual anchor resolution. Per-compartment (4-8 scalars) is coarse enough to represent regional confidence without overfitting to individual anchor positions.
 
  ### AnchorPush
 
@@ -157,7 +209,7 @@ Periodically repositions anchors toward class centroids computed from accumulate
 
  **`FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)`**
 
- 3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. Achieved 69.8% single-view accuracy on CIFAR-100 but was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling.
 
  Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.
 
@@ -203,9 +255,11 @@ Raw Cayley-Menger determinant computation. Given (B, N, D) point sets, returns (
 
  *Why CV exists*: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).
 
- *The natural basin*: Extensive experimentation (43 configurations, noise prediction with zero data structure) established that smooth optimization on S^(d-1) naturally converges to CV ≈ 0.23 at d=128 with no CV loss applied. This is the equilibrium between the sphere's rigidity (fixed curvature) and gradient descent's smoothness (continuous updates). Setting cv_target=0.80 and applying weight=100 in a full training run still produced CV=0.20 — the sphere's geometry overrides the loss.
 
- *The floor*: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11. This is the hard geometric floor — the maximum volume regularity achievable by smooth functions on S^(d-1).
 
  *Batched computation*: The default `batched=True` eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via `argsort(rand)`, all Cayley-Menger matrices are constructed in parallel, and a single `torch.linalg.det` call on shape `(n_samples, 6, 6)` computes all volumes at once. Measured speedup: **141x** at n=200 samples. The `batched=False` fallback exists for debugging and validation.
 
@@ -297,7 +351,9 @@ The complete image classification pipeline: pixels in, logits out, with full geo
 
  **`ConvEncoder(output_dim)`**
 
- 8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final adaptive average pooling to 1×1, followed by a linear projection to `output_dim` with LayerNorm.
 
  L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies `F.normalize()` after extracting the raw magnitude.
 
@@ -335,11 +391,11 @@ The full pipeline. Combines ConvEncoder + MagnitudeFlow + InternalConstellationC
 
  ## Empirical Constants
 
- Two constants have been validated across 17+ models, all architectures and modalities:
 
- **CV ≈ 0.20–0.23** — The natural coefficient of variation of pentachoron volumes on S^(d-1) under smooth optimization. Observed at effective dimension ~16 across contrastive models, language models, diffusion models, and VAEs. Setting cv_target=0.80 with weight=100 still produces CV=0.20. The constant is a property of the sphere's curvature interacting with gradient descent's smoothness, not a hyperparameter.
 
- **0.29154 radians** — The binding/separation phase boundary. Below this angular distance, an embedding is structurally bound to its nearest anchor (local geometry dominates). Above it, task pressure has moved the embedding beyond local curvature (classification dominates). Observed independently in contrastive training, language modeling, ODE flow matching, and alpha parameter convergence across architectures.
 
  ---
 
@@ -347,14 +403,14 @@ Two constants have been validated across 17+ models, all architectures and modal
 
  These are not preferences — they are empirically validated constraints. Violating them produces measurable degradation.
 
- 1. **Never use attention in geometric pipelines.** Softmax averaging destroys angular structure. Each attention layer without residual halves effective dimensionality. Use ConstellationRelay or RelayLayer instead.
 
- 2. **Never use global average pooling in geometric encoders.** It collapses spatial structure. Flatten or use spatial statistics (mean+std per channel minimum). Confirmed empirically: 243-d avg pool drops accuracy from ~70% to ~29% vs 15552-d flatten.
 
- 3. **Never apply weight decay to anchor parameters.** Anchors live on S^(d-1). Decay pulls them toward the origin, destroying normalization. Use `make_optimizer()` or manually separate param groups.
 
- 4. **Never let classification gradient reach anchor positions.** Detach assignment targets in the bridge. Let push handle global anchor repositioning. CE gradient on anchors creates classification shortcuts that destroy the Voronoi structure.
 
- 5. **Always L2-normalize before triangulation.** Triangulation measures angular position. Unnormalized embeddings mix magnitude and direction, making distances meaningless.
 
- 6. **Relay gates must initialize cold.** `gate_init=-3.0` (sigmoid ≈ 0.047). Hot initialization causes deep relay stacks to diverge before the network has learned meaningful features.
 
  ---
  license: apache-2.0
+ library_name: pytorch
+ pipeline_tag: image-classification
+ language:
+ - en
+ tags:
+ - geometric-deep-learning
+ - constellation
+ - patchwork
+ - hypersphere
+ - contrastive-learning
+ - image-classification
+ - cifar-100
+ - relay
+ - no-attention
+ - geolip
+ - parameter-efficient
+ - simplex-volume
+ - cayley-menger
+ - angular-measurement
+ datasets:
+ - cifar100
+ metrics:
+ - accuracy
+ model-index:
+ - name: GeoLIPImageEncoder
+   results:
+   - task:
+       type: image-classification
+       name: Image Classification
+     dataset:
+       type: cifar100
+       name: CIFAR-100
+       config: default
+       split: test
+     metrics:
+     - type: accuracy
+       value: 66.1
+       name: Top-1 Accuracy
+       verified: false
+   - task:
+       type: image-classification
+       name: Image Classification (kNN)
+     dataset:
+       type: cifar100
+       name: CIFAR-100
+       config: default
+       split: test
+     metrics:
+     - type: accuracy
+       value: 61.2
+       name: kNN Accuracy (k=1)
+       verified: false
  ---
 
  # GeoLIP Core — Geometric Linear Interpolative Patchwork
 
  6. Gated residual: `gate × patchwork_output + (1-gate) × input_patch`
  7. Global skip connection: `input + blended_output`
 
+ *Why it exists*: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28 per layer). With residual connections, the residual signal dominates ~11:1, making attention a small perturbation — the residual is doing the geometric preservation, not the attention. The relay replaces the attention mechanism entirely: it measures and gates rather than averaging angular relationships. Empirically preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention stacks without residual.
 
  *SLERP stroboscope*: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture — the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.
 
+ *Cold gating*: Gates initialize at sigmoid(-3) ≈ 0.047. The relay blends `gate × patchwork_output + (1-gate) × input_patch` and adds it to the global skip. At initialization, 95.3% of the blended signal comes from the input patch (near-identity), and only 4.7% from the random untrained patchwork. This means stacking N relays at init produces minimal noise accumulation. The gates open during training only where the relay's geometric processing provides useful signal, allowing selective depth utilization.
 
  *Critical rule*: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).
 
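A minimal sketch of the cold-gated blend and the SLERP phase sampling described above, in PyTorch. `ColdGatedRelay`, `slerp`, and the `nn.Linear` stand-in for the patchwork are illustrative names, not the repository's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between unit vectors a and b at phase t."""
    cos_ab = (a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos_ab)
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b


class ColdGatedRelay(nn.Module):
    """Illustrative relay step: x + gate * patchwork(x) + (1 - gate) * x."""

    def __init__(self, dim: int, gate_init: float = -3.0):
        super().__init__()
        # sigmoid(-3) ~ 0.047: at init, ~95.3% of the blend is the input patch
        self.gate_logit = nn.Parameter(torch.full((1,), gate_init))
        self.patchwork = nn.Linear(dim, dim)  # stand-in for the geometric patchwork

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)
        blended = gate * self.patchwork(x) + (1 - gate) * x  # gated residual
        return x + blended  # global skip connection


# Stroboscope: sample anchor positions at 3 phases along home -> learned
home = F.normalize(torch.randn(16, 64), dim=-1)
learned = F.normalize(torch.randn(16, 64), dim=-1)
phases = [slerp(home, learned, t) for t in (0.0, 0.5, 1.0)]

relay = ColdGatedRelay(64)
out = relay(F.normalize(torch.randn(8, 64), dim=-1))
print(torch.sigmoid(relay.gate_logit).item())  # ~0.047 at init
```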
 
  *Stats bias*: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.
 
+ *Why per-compartment, not per-anchor or global*: Three approaches were tested empirically. A transformer-based magnitude predictor (3 tokens, self-attention) averaged across tokens, producing near-uniform magnitude across compartments and inverted confidence (wrong predictions scored higher than correct, Δ = -0.14). Per-compartment independent MLPs reduced the inversion (Δ = -0.07) but couldn't eliminate it. The relay stack achieved Δ ≈ +0.001 (hallucination eliminated) with genuine compartment differentiation (std = 0.69). The failure mode of the transformer was specifically that attention averaged the 3-token context, destroying per-region signal. Per-compartment granularity (4-8 scalars) is coarse enough to prevent overfitting to individual anchor noise while fine enough to capture regional confidence differences.
 
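A sketch of the granularity argument, assuming hypothetical names (`CompartmentMagnitude`, `push_stats`); the real MagnitudeFlow wiring is richer than this:

```python
import torch
import torch.nn as nn


class CompartmentMagnitude(nn.Module):
    """Illustrative per-compartment magnitude head: K regional scalars, not per-anchor."""

    def __init__(self, ctx_dim: int, n_compartments: int = 8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(ctx_dim, ctx_dim), nn.GELU(),
            nn.Linear(ctx_dim, n_compartments),  # 4-8 scalars: regional confidence
        )

    def forward(self, ctx: torch.Tensor, push_stats: torch.Tensor) -> torch.Tensor:
        # push_stats: (n_compartments,) momentum statistics from the last anchor push,
        # added as a bias so magnitude tracks how fast each region is moving
        return self.head(ctx) + push_stats


mag = CompartmentMagnitude(ctx_dim=128, n_compartments=8)
out = mag(torch.randn(4, 128), torch.zeros(8))
print(out.shape)  # (4, 8): one magnitude scalar per compartment
```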
  ### AnchorPush
 
 
  **`FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)`**
 
+ 3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. The full pipeline (Conv8 encoder + 6-step flow variant + learned head) achieved 69.8% single-view accuracy on CIFAR-100, but the approach was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling (relay preserves 99.4% geometry at depth 16; flow accuracy degrades with step count beyond 6, and cross-constellation testing revealed 45% angular displacement — cos(pre, post) = 0.555).
 
  Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.
 
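A sketch of tangent-plane Euler integration on S^(d-1) with a sinusoidal time embedding. The velocity field is a stand-in linear layer, and `euler_flow_on_sphere` / `timestep_embedding` are illustrative names:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_embedding(t: float, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of a scalar timestep (standard construction)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = t * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)])


def euler_flow_on_sphere(x: torch.Tensor, field: nn.Module, n_steps: int = 3) -> torch.Tensor:
    """Integrate dx/dt = v in the tangent plane of S^(d-1), renormalizing each step."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t_emb = timestep_embedding(i * dt, 32).unsqueeze(0).expand(x.shape[0], -1)
        v = field(torch.cat([x, t_emb], dim=-1))
        v_tan = v - (v * x).sum(-1, keepdim=True) * x  # drop the radial component
        x = F.normalize(x + dt * v_tan, dim=-1)        # stay on the sphere
    return x


field = nn.Linear(64 + 32, 64)  # stand-in velocity field
x = F.normalize(torch.randn(8, 64), dim=-1)
print(euler_flow_on_sphere(x, field).norm(dim=-1))  # all ~1.0
```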
 
  *Why CV exists*: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).
 
+ *The natural basin*: Extensive experimentation (43 configurations, pure noise prediction with zero data structure) established that smooth optimization on S^(d-1) at d=128 naturally converges to CV ≈ 0.24 (mean 0.2393 across 5 seeds, spread 0.013) with no CV loss applied. In trained full models with structured data, CV converges to the tighter 0.20–0.23 band observed across 17+ architectures. The natural basin is strong: at low CV weight (≤0.01), the loss cannot displace CV from the basin regardless of target — targets from 0.00 to 1.00 all produce CV in [0.23, 0.25]. However, CV is not immovable: at weight ≥ 1.0, the optimizer CAN escape the basin (w=1.0/t=0.80 → CV ≈ 0.71; w=100/t=0.80 → CV ≈ 0.82). The key finding is that the basin exists as an attractor independent of the loss function — it emerges from the interaction between the sphere's curvature and gradient descent's smoothness.
 
+ *The floor*: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11–0.12. This is the hard geometric floor — the maximum volume regularity achievable by smooth functions on S^(d-1). Below this, simplex volumes are constrained by the curvature itself.
+
+ *Dimension dependence*: The basin is not universal across dimensions. The noise sweep showed: d=16 → CV=0.37, d=32 → CV=0.26, d=64 → CV=0.23, d=128 → CV=0.24, d=256 → CV=0.27, d=512 → CV=0.33. The minimum is around d=64–128, corresponding to effective geometric dimension ~16–47. At very low dimension, the sphere is too curved for regular volumes. At very high dimension, the model uses only a subspace (effective dim ~41–50 regardless of ambient dim), and the unused dimensions introduce volume irregularity.
 
  *Batched computation*: The default `batched=True` eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via `argsort(rand)`, all Cayley-Menger matrices are constructed in parallel, and a single `torch.linalg.det` call on shape `(n_samples, 6, 6)` computes all volumes at once. Measured speedup: **141x** at n=200 samples. The `batched=False` fallback exists for debugging and validation.
 
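A sketch of the batched path, using the standard Cayley-Menger identity for a 4-simplex (V² = -det(M)/9216, with M the 6×6 bordered matrix of squared pairwise distances); `pentachoron_volume_cv` is an illustrative name, not the module's API:

```python
import torch
import torch.nn.functional as F


def pentachoron_volume_cv(emb: torch.Tensor, n_samples: int = 200) -> torch.Tensor:
    """Coefficient of variation of sampled pentachoron (4-simplex) volumes.

    emb: (N, D) unit-normalized embeddings, N >= 5.
    """
    N, D = emb.shape
    # Sample n_samples pentachora at once via argsort(rand): first 5 of each permutation
    idx = torch.rand(n_samples, N).argsort(dim=-1)[:, :5]  # (n_samples, 5)
    pts = emb[idx]                                         # (n_samples, 5, D)
    d2 = torch.cdist(pts, pts).pow(2)                      # (n_samples, 5, 5)

    # Bordered Cayley-Menger matrix: first row/col of ones, zero corner
    M = torch.ones(n_samples, 6, 6, dtype=emb.dtype)
    M[:, 0, 0] = 0.0
    M[:, 1:, 1:] = d2
    det = torch.linalg.det(M)                              # one batched det call
    vol = (-det / 9216.0).clamp_min(0.0).sqrt()            # (n_samples,)
    return vol.std() / vol.mean()


emb = F.normalize(torch.randn(512, 128), dim=-1)
print(pentachoron_volume_cv(emb))  # CV of sampled volumes on S^127
```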
 
  **`ConvEncoder(output_dim)`**
 
+ 8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final `AdaptiveAvgPool2d(1)` reduces spatial dimensions to 1×1, followed by a linear projection to `output_dim` with LayerNorm.
+
+ The adaptive average pooling here operates on conv feature maps — statistical spatial features where pooling is standard dimensionality reduction. This is distinct from the Design Rule against global average pooling on geometric features (triangulation outputs, scattering coefficients), where pooling destroys component-wise structure that downstream geometric layers depend on. The conv→pool→project pipeline produces the input to S^(d-1); geometric constraints apply after normalization, not before.
 
  L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies `F.normalize()` after extracting the raw magnitude.
 
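A sketch matching the description above; the padding choice and demo input size are assumptions, and `ConvEncoderSketch` is not the repository class:

```python
import torch
import torch.nn as nn


def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """(conv3x3-BN-GELU) x 2 + MaxPool, per the description above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
        nn.MaxPool2d(2),
    )


class ConvEncoderSketch(nn.Module):
    """8 conv layers in 4 blocks, 64 -> 128 -> 256 -> 384, pool to 1x1, project."""

    def __init__(self, output_dim: int):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 384),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Sequential(
            nn.Flatten(), nn.Linear(384, output_dim), nn.LayerNorm(output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No L2 normalization here: the raw norm carries encoder confidence
        return self.proj(self.pool(self.blocks(x)))


enc = ConvEncoderSketch(256)
feats = enc(torch.randn(2, 3, 32, 32))
print(feats.shape, feats.norm(dim=-1))  # (2, 256), unnormalized norms
```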
 
  ## Empirical Constants
 
+ Two constants have emerged across extensive experimentation:
 
+ **CV ≈ 0.20–0.23 (trained models)** — The coefficient of variation of pentachoron volumes converges to this band in trained models across contrastive learning, language models, diffusion models, and VAEs (17+ models profiled via Procrustes analysis). On pure noise with no data structure, the natural basin is slightly wider at ~0.24. The constant reflects a dimension-dependent equilibrium between the sphere's curvature and gradient descent's smoothness. At the effective dimensions used by trained models (~16–47), the basin is strongest. The CV loss can be used to regularize toward this band, but at low weight the optimizer converges there naturally — and at high weight the optimizer CAN be forced away from it (at the cost of degraded task performance). The practical recommendation: use cv_weight ≤ 0.01, or omit CV loss entirely and monitor it as a health metric.
 
+ **0.29154 radians** — The binding/separation phase boundary. Below this angular distance from its nearest anchor, an embedding's local geometry is dominated by the sphere's curvature (the anchor's Voronoi cell provides structural context). Above it, the embedding has been displaced far enough that task pressure dominates over local curvature. Observed independently in contrastive training (MinimalShunts), language modeling (T5 generation), ODE flow matching (alpha convergence), and CLIP projection geometry. Its complement (0.70846) appears as the separation threshold. Whether this is a true universal constant or an artifact of the specific architectures tested remains an open question — it is treated as theory-level in ongoing research but should be verified independently.
 
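A sketch of using the boundary as a diagnostic: the fraction of embeddings bound to their nearest anchor, with normalization applied before any angular measurement. `bound_fraction` is an illustrative helper, not part of the library:

```python
import torch
import torch.nn.functional as F

BINDING_BOUNDARY = 0.29154  # radians, per the constant above


def bound_fraction(emb: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Fraction of embeddings within the binding radius of their nearest anchor."""
    emb = F.normalize(emb, dim=-1)      # angular measurement requires unit norm
    anchors = F.normalize(anchors, dim=-1)
    cos = emb @ anchors.T               # cosine similarity to every anchor
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).min(dim=-1).values
    return (theta < BINDING_BOUNDARY).float().mean()


emb = F.normalize(torch.randn(1024, 128), dim=-1)
anchors = F.normalize(torch.randn(100, 128), dim=-1)
print(bound_fraction(emb, anchors))  # random points in d=128: almost none bound
```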
  ---
 
 
  These are not preferences — they are empirically validated constraints. Violating them produces measurable degradation.
 
+ 1. **Never use attention in geometric pipelines.** Softmax-weighted averaging destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28). With residual connections, the residual dominates ~11:1, meaning the attention itself contributes almost nothing — the residual is preserving geometry despite the attention, not because of it. The relay architecture provides direct geometric processing without this compromise. Use ConstellationRelay or RelayLayer instead.
 
+ 2. **Never use global average pooling on geometric features.** This applies to triangulation outputs, scattering features, and any representation where spatial or component-wise structure carries geometric meaning. Confirmed empirically on wavelet scattering features: 243-d global average pool dropped accuracy from ~70% to ~29% compared to 15552-d flatten, because pooling destroys the per-coefficient spatial structure that geometric downstream layers depend on. Use flatten or spatial statistics (mean+std per channel minimum) instead. **Note:** Standard spatial pooling in conv encoders (e.g., `AdaptiveAvgPool2d(1)` after conv feature maps) is not subject to this rule — conv feature maps are statistical representations, not geometric ones. The ConvEncoder uses adaptive avg pooling before projecting to S^(d-1), which is correct. The rule applies after projection to the sphere, where geometric structure must be preserved.
 
+ 3. **Never apply weight decay to anchor parameters.** Anchors live on S^(d-1). Weight decay adds λ×w to the gradient, pulling parameters toward the origin. For anchors, this shrinks their norm below 1.0 between L2-renormalization steps, creating a systematic bias in the gradient direction (toward the origin rather than along the sphere). Over many steps, this interferes with the optimizer's angular momentum. Use `make_optimizer()` or manually place anchor parameters in a `weight_decay=0` param group (see the sketch after this list). This applies to both constellation anchors and relay layer anchors.
 
+ 4. **Never let classification gradient reach anchor positions.** The bridge detaches assignment targets: `assign_target = assign.detach()`. Without detachment, CE gradient flows backward through the bridge → patchwork → triangulation → anchors, repositioning anchors to create classification shortcuts (e.g., collapsing multiple class anchors together to simplify the decision boundary). Push handles global anchor repositioning based on class centroids; gradient handles fine local adjustment via internal losses (spread, attraction, assignment BCE). Mixing CE gradient into anchor positions conflates these two distinct mechanisms.
 
+ 5. **Always L2-normalize before triangulation.** Triangulation computes `emb @ anchors.T`, producing cosine similarities. If embeddings are not unit-normalized, this product mixes magnitude and direction: a high-magnitude embedding appears "closer" to all anchors than a low-magnitude one, regardless of angular position. The resulting distances are no longer purely angular and the patchwork's compartment specialization breaks down. The MagnitudeFlow exists specifically to recover magnitude information through a separate geometric channel rather than contaminating the angular measurements.
 
+ 6. **Relay gates must initialize cold.** `gate_init=-3.0` (sigmoid ≈ 0.047). At initialization, the relay's patchwork outputs are random — they carry no useful geometric signal. The relay uses additive gating: `out = x + gate × patchwork + (1-gate) × patches`. The global skip preserves x, but each layer adds `gate × random_noise` to the residual stream. With hot gates (sigmoid(0) = 0.5), an 8-layer stack accumulates ~8 × 0.5 = 4x the patch magnitude as random noise on top of the signal, causing early training instability. With cold gates (0.047), the accumulated noise after 8 layers is ~8 × 0.047 = 0.38x — well below the signal magnitude. The gates open selectively during training as the relay learns meaningful features, naturally titrating the contribution of each layer.
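
A sketch of the param-group separation behind rule 3, assuming anchor tensors can be identified by name; `make_optimizer_sketch` stands in for the repository's `make_optimizer()`:

```python
import torch
import torch.nn as nn


def make_optimizer_sketch(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.05):
    """Separate param groups so anchor parameters never receive weight decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Assumed convention: anchor tensors carry "anchor" in their parameter name
        (no_decay if "anchor" in name else decay).append(p)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},  # anchors stay on S^(d-1)
        ],
        lr=lr,
    )
```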