Update README.md
---
license: apache-2.0
---

# GeoLIP Core – Geometric Linear Interpolative Patchwork

GeoLIP is a geometric deep learning framework built on a single premise: **the unit hypersphere has structure, and that structure is computable**. Rather than scaling parameters to approximate statistical boundaries between classes, GeoLIP places learnable reference points (anchors) on S^(d-1) and measures the angular relationships between embeddings and those anchors. The resulting distance patterns, interpreted through compartmentalized patchwork networks, produce compact, reusable geometric features that generalize without requiring billions of parameters.

This repository contains the stable core: three files, 1315 lines, covering every component from anchor initialization through magnitude prediction to the full three-domain training objective. Every piece is composable, every loss is interchangeable, and every design decision traces back to empirically validated geometric principles.

## Architecture Overview

```
Pixels → ConvEncoder → features → L2 normalize → S^(d-1)
                        ↓
      MagnitudeFlow (relay stack, no attention)
            ↓                         ↓
  per-anchor magnitude            embedding
            ↓                         ↓
tri = (1 - cos) × magnitude     cos → soft assignment
            ↓                         ↓
        Patchwork                soft_assign
     (compartments)                   ↓
            ↓              ┌──────────┴──────────┐
    bridge prediction          task head → logits
                               (reads all three)
```

Three independent loss domains shape the model cooperatively:

- **External** (task): cross-entropy + embedding InfoNCE
- **Geometric** (structure): patchwork InfoNCE + bridge
- **Internal** (self-organization): assignment crispness, triangulation consistency, attraction, CV regularization, anchor spread

The constellation discovers its own Voronoi structure through internal losses. The task head reads that structure but does not write to it. This separation prevents classification shortcuts from hijacking the anchor geometry.

---

## File 1: `geolip_core.py` – Geometric Building Blocks

Everything structural. No losses, no training loops. Pure composable components that can be assembled into any geometric architecture.

### Activations

**`SquaredReLU`** – `x → ReLU(x)²`. Empirically the strongest activation across all GeoLIP patchwork configurations. Squaring amplifies the separation between active and inactive neurons, producing sharper compartment specialization than GELU or standard ReLU. Used as the default throughout.

**`StarReLU`** – `x → ReLU(x)² × scale + bias`, with learnable scale and bias. Runner-up to SquaredReLU in bulk activation tests. The learnable parameters let the activation adapt its dynamic range per layer, which is useful when compartments operate at different magnitude scales.

**`make_activation(name)`** – Factory function. Supports `squared_relu`, `star_relu`, `gelu`, `relu`, `sigmoid`. Every patchwork and task head references activations by name through this factory, making architecture-wide activation swaps trivial.

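The definitions above are simple enough to sketch in full. An illustrative implementation following the text's formulas (the repository's exact code may differ in details such as parameter shapes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLU(nn.Module):
    """x -> relu(x)^2: amplifies active units, zeroes inactive ones."""
    def forward(self, x):
        return F.relu(x) ** 2

class StarReLU(nn.Module):
    """x -> relu(x)^2 * scale + bias, with learnable scale and bias."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))
    def forward(self, x):
        return F.relu(x) ** 2 * self.scale + self.bias

def make_activation(name: str) -> nn.Module:
    """Look up an activation by name, as the patchworks and task heads do."""
    table = {"squared_relu": SquaredReLU, "star_relu": StarReLU,
             "gelu": nn.GELU, "relu": nn.ReLU, "sigmoid": nn.Sigmoid}
    return table[name]()
```
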
### Anchor Initialization

Anchor placement on S^(d-1) at initialization determines the starting Voronoi tessellation. Poor initialization can leave large regions of the sphere unmeasured.

**`init_anchors_xavier(n, d)`** – Xavier normal, then L2-normalize. Fast and reasonable for moderate anchor counts; near-orthogonal in high dimensions by concentration of measure.

**`init_anchors_orthogonal(n, d)`** – QR decomposition for an exact orthonormal basis when n ≤ d. When n > d, fills the first d anchors with the orthonormal basis and adds random normalized vectors for the remainder. Guarantees zero mutual cosine similarity for the first d anchors.

**`init_anchors_repulsion(n, d)`** – QR initialization followed by 200 iterations of nearest-neighbor repulsion. Each step pushes every anchor away from its closest neighbor, producing even coverage of S^(d-1). The proven default. Costs ~50 ms at initialization but produces measurably better early-training geometry than Xavier or orthogonal alone.

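The repulsion step can be sketched as follows. This is an assumption-laden illustration (random rather than QR start, hypothetical `lr` step size) of the nearest-neighbor push described above, not the repository's exact routine:

```python
import torch
import torch.nn.functional as F

def init_anchors_repulsion(n: int, d: int, steps: int = 200,
                           lr: float = 0.01) -> torch.Tensor:
    """Normalized random start, then push each anchor away from its nearest neighbor."""
    anchors = F.normalize(torch.randn(n, d), dim=-1)
    for _ in range(steps):
        cos = anchors @ anchors.t()
        cos.fill_diagonal_(-2.0)             # ignore self-similarity
        nearest = cos.argmax(dim=1)          # closest neighbor per anchor
        away = anchors - anchors[nearest]    # direction away from that neighbor
        anchors = F.normalize(anchors + lr * away, dim=-1)
    return anchors
```
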
### Constellation

**`Constellation(n_anchors, dim, anchor_drop, anchor_init)`**

The fundamental measurement instrument. Anchors are learnable parameters living on S^(d-1), re-normalized on every forward pass. Triangulation computes the cosine similarity between each embedding and every anchor, producing an angular distance profile that uniquely identifies the embedding's position on the sphere.

*Why it exists*: A single embedding vector is a point. A triangulation against N anchors is a measurement: it describes where that point sits relative to N known reference positions. This is the difference between "a location" and "a location on a map." The patchwork reads the map, not the location.

*Anchor dropout*: During training, randomly masks a fraction of anchors (default 15%). This forces the patchwork to develop redundant measurement pathways rather than depending on any single anchor, and is proven to improve generalization.

*Critical rule*: Anchors are always L2-normalized before use. They live on S^(d-1), not in ambient R^d. Weight decay must be disabled for anchor parameters; decay pulls them toward the origin, destroying their geometric meaning.

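The triangulation itself reduces to one matrix product. A minimal sketch (anchor dropout and the configurable initializer are omitted for brevity; this is not the repository's full class):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Constellation(nn.Module):
    """Learnable anchors on the unit sphere; triangulate embeddings against them."""
    def __init__(self, n_anchors: int, dim: int):
        super().__init__()
        self.anchors = nn.Parameter(F.normalize(torch.randn(n_anchors, dim), dim=-1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        emb = F.normalize(emb, dim=-1)               # embeddings onto S^(d-1)
        anchors = F.normalize(self.anchors, dim=-1)  # re-normalize every forward pass
        return emb @ anchors.t()                     # (B, n_anchors) cosine profile
```
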
### Patchwork

**`Patchwork(n_anchors, n_comp, d_comp, activation)`**

Round-robin compartmentalized interpreter. Each compartment reads a disjoint subset of anchor distances (anchor k goes to compartment k % n_comp) through a 2-layer MLP with LayerNorm.

*Why it exists*: Raw triangulation distances are high-dimensional and redundant. The patchwork compresses them into compartment-specific features, where each compartment specializes in a different angular region of the sphere. This specialization is enforced by architecture (non-overlapping anchor subsets) and verified empirically (compartment correlation < 0.15 in trained models).

*Why round-robin*: Consecutive anchors tend to cluster spatially after push operations. Round-robin assignment (0→comp0, 1→comp1, ..., 8→comp0, 9→comp1, ...) ensures each compartment receives anchors distributed across the sphere rather than spatially concentrated, maximizing measurement diversity per compartment.

*Output*: `(B, n_comp × d_comp)` – a structured geometric descriptor with independently interpretable compartments.

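The round-robin split is one line of striding. A sketch of how the index subsets fall out (the helper name `round_robin_slices` is illustrative, not from the repository):

```python
import torch

def round_robin_slices(n_anchors: int, n_comp: int):
    """Disjoint anchor-index subsets: anchor k feeds compartment k % n_comp."""
    return [torch.arange(n_anchors)[c::n_comp] for c in range(n_comp)]

# Each compartment's MLP would then read tri[:, idx] for its own idx.
```
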
### RelayLayer

**`RelayLayer(input_dim, patch_dim, n_anchors, n_phases, pw_hidden, gate_init)`**

The core geometric processing primitive. Replaces attention entirely. Operates on S^(patch_dim-1) (default S^15, where the CV attractor is a geometric fact).

Pipeline per forward pass:

1. LayerNorm the input
2. Reshape into P patches of dimension `patch_dim`
3. L2-normalize each patch onto S^(patch_dim-1)
4. Triangulate against per-patch anchors at 3 SLERP phases (t = 0, 1/3, 2/3)
5. Interpret the triangulation with an independent patchwork MLP per patch
6. Gated residual: `gate × patchwork_output + (1 - gate) × input_patch`
7. Global skip connection: `input + blended_output`

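Steps 6-7 are the part that makes cold initialization work, and they can be sketched in isolation. The function name `gated_residual` is hypothetical; this only illustrates the blend-then-skip arithmetic described above:

```python
import torch

def gated_residual(patch_in: torch.Tensor, patch_out: torch.Tensor,
                   gate_logit: torch.Tensor) -> torch.Tensor:
    """Blend the processed patch with its input (step 6), then skip-connect (step 7)."""
    gate = torch.sigmoid(gate_logit)            # cold init uses gate_logit = -3.0
    blended = gate * patch_out + (1 - gate) * patch_in
    return patch_in + blended                   # skip connection around the blend
```

With a very negative gate logit the layer passes its input through nearly untouched, which is why a freshly initialized relay stack behaves almost like the identity.
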
*Why it exists*: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without a residual halves effective dimensionality (measured: 62 → 28). The relay measures and gates; it never averages angular relationships. Empirically it preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention.

*SLERP stroboscope*: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture: the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.

*Cold gating*: Gates initialize at sigmoid(-3) ≈ 0.047. At initialization the relay is nearly transparent: 95.3% of the signal passes through unchanged, so stacking N relays at init is approximately the identity function. The gates open during training only where the relay's geometric processing provides useful signal. This prevents early-training instability in deep stacks.

*Critical rule*: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).

### ConstellationRelay

**`ConstellationRelay(dim, n_anchors, n_comp, d_comp, gate_init, anchor_init, activation)`**

Sequence-aware wrapper around the relay concept, using the full Constellation + Patchwork pipeline instead of the einsum-based RelayLayer. Handles both `(B, D)` and `(B, S, D)` inputs, making it usable as a drop-in replacement for attention layers in any transformer-like architecture.

*Why both RelayLayer and ConstellationRelay exist*: RelayLayer is optimized for the MagnitudeFlow's internal processing (fixed patch_dim=16, einsum-based, maximum throughput). ConstellationRelay is the general-purpose version that accepts arbitrary dimensions and sequence inputs.

### MagnitudeFlow

**`MagnitudeFlow(dim, n_anchors, hidden_dim, n_heads, n_layers, mag_min, mag_max, n_comp)`**

Per-compartment magnitude prediction through a stack of RelayLayers. No attention anywhere in the stack.

The core insight: L2 normalization projects embeddings onto S^(d-1), destroying magnitude information. But the pre-normalization magnitude carries signal: it reflects encoder confidence. MagnitudeFlow recovers this signal geometrically. It takes the embedding direction, the triangulation profile, and the raw magnitude as context, processes them through N relay layers on S^15, and outputs per-compartment magnitude scalars.

These scalars weight the triangulation distances per compartment before the patchwork reads them: `tri_weighted = tri × magnitude`. Compartments receiving higher magnitude become more influential in the patchwork's interpretation. This gives the model a per-region confidence mechanism without any attention.

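The weighting step pairs the round-robin anchor-to-compartment map with the predicted scalars. A sketch under the same round-robin assumption as the Patchwork section (the helper name `weight_triangulation` is illustrative):

```python
import torch

def weight_triangulation(tri: torch.Tensor, comp_mag: torch.Tensor,
                         n_comp: int) -> torch.Tensor:
    """Expand per-compartment magnitudes (B, n_comp) to per-anchor (B, A) and weight tri."""
    n_anchors = tri.size(1)
    comp_of_anchor = torch.arange(n_anchors) % n_comp  # round-robin assignment
    per_anchor = comp_mag[:, comp_of_anchor]           # broadcast each scalar to its anchors
    return tri * per_anchor
```
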
*Architecture*:

```
emb_proj(dim → relay/2) + tri_proj(A → relay/4) + raw_mag(1) → ctx_proj → relay_dim
  → RelayLayer 1 (own anchors on S^15) → gated residual → skip
  → RelayLayer 2 (own anchors on S^15) → gated residual → skip
  → ...
  → reshape to (B, n_comp, 16) → per-compartment MLP → sigmoid → [mag_min, mag_max]
  → expand to per-anchor (B, A)
```

*Stats bias*: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.

*Why per-compartment, not per-anchor*: Per-anchor magnitude (512+ scalars) is too fine-grained; the model hallucinates confidence at individual-anchor resolution. Per-compartment magnitude (4-8 scalars) is coarse enough to represent regional confidence without overfitting to individual anchor positions.

### AnchorPush

**`AnchorPush(strategy, n_anchors, dim, **kwargs)`**

Periodically repositions anchors toward class centroids computed from accumulated embeddings. Three strategies:

**`raw`**: Fixed-learning-rate blend toward the target: `anchor = normalize(anchor + lr × (target - anchor))`. Simple, effective, no state.

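The `raw` update is exactly the formula above; a sketch, including the no-grad guard implied by "push writes directly to `anchors.data`":

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def raw_push(anchors: torch.Tensor, targets: torch.Tensor,
             lr: float = 0.1) -> torch.Tensor:
    """Blend each anchor toward its target centroid, then re-project onto the sphere."""
    return F.normalize(anchors + lr * (targets - anchors), dim=-1)
```
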
**`gru`**: Statistics-gated SLERP. Maintains an EMA of utilization and drift per anchor. The update gate z scales with misalignment plus underuse; the reset gate r controls blending with the previous position. Produces variable-speed updates: underused anchors move faster, well-placed anchors stay put.

**`momentum`**: SGD with momentum on S^(d-1). Accumulates residuals in tangent space with configurable decay. Reprojection onto the tangent plane at each step keeps the accumulator geometrically valid. Dead anchors (utilization below a floor) receive forced correction. The proven default for production training.

*Why push exists*: Anchors are learnable parameters that receive gradient from the internal losses (spread, attraction, assignment). But gradient-based anchor movement is slow and can get trapped in local optima. Push provides a periodic global correction based on the actual class structure of the embedding space, analogous to batch normalization providing periodic mean/variance correction.

*Critical rule*: Push writes directly to `anchors.data`, bypassing the optimizer. It operates in parameter space, not gradient space. The optimizer's momentum and adaptive learning rates for anchor parameters are therefore stale after each push; this is intentional. The optimizer handles fine local adjustment, push handles coarse global repositioning.

### FlowAttention (Historical)

**`FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)`**

3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. Achieved 69.8% single-view accuracy on CIFAR-100 but was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling.

Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.

### GeometricAutograd

**`GeometricAutograd`** – A custom autograd Function that is the identity in the forward pass but modifies gradients in the backward pass. Two corrections:

1. **Tangential projection**: Attenuates the radial component of gradients (the component pointing toward or away from the origin). On S^(d-1), only tangential movement is meaningful; radial gradients push the embedding off the sphere, which L2 renormalization then undoes. Removing them before they accumulate in optimizer state reduces wasted momentum.

2. **Anchor separation**: Projects out the component of the gradient pointing toward the nearest anchor. This prevents embeddings from collapsing onto their nearest anchor, maintaining measurement diversity.

### Utilities

**`param_count(module, name)`** – Counts total and trainable parameters. Prints a formatted line when a name is provided.

**`model_summary(model)`** – Prints a per-submodule parameter breakdown. Essential for verifying that the parameter budget is allocated as intended.

---

## File 2: `geolip_losses.py` – Losses & Regularization

Every loss function and monitoring metric, with uniform interfaces. All losses return differentiable scalar tensors. All metrics return Python floats.

### CV – Coefficient of Variation of Pentachoron Volumes

The signature geometric measurement. Samples random 5-point simplices (pentachora) from the embedding space, computes their volumes via Cayley-Menger determinants, and measures the coefficient of variation (std/mean) of those volumes.

**`cv_loss(emb, target=0.22, n_samples=64, batched=True)`**

Differentiable loss: `(CV - target)²`. Pushes the embedding distribution toward a target volume regularity.

**`cv_metric(emb, n_samples=200, batched=True)`**

Non-differentiable monitoring metric. Reports the raw CV value.

**`cv_multi_scale(emb, scales=(3,4,5,6,7,8), n_samples=100, batched=True)`**

CV computed at multiple simplex sizes. Healthy geometry shows CV in [0.18, 0.25] at all scales. Scale-dependent CV indicates that the embedding distribution has different structure at different resolutions.

**`cayley_menger_vol2(points)`**

Raw Cayley-Menger determinant computation. Given (B, N, D) point sets, returns (B,) squared simplex volumes. The mathematical foundation underlying all CV computation.

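For reference, the squared volume of the (N-1)-simplex spanned by N points follows from the bordered squared-distance matrix. A minimal sketch of the standard Cayley-Menger formula (the repository's batched version parallelizes this over samples; details may differ):

```python
import math
import torch

def cayley_menger_vol2(points: torch.Tensor) -> torch.Tensor:
    """(B, N, D) point sets -> (B,) squared volumes of the (N-1)-simplices."""
    B, N, _ = points.shape
    d2 = torch.cdist(points, points) ** 2                 # squared pairwise distances
    cm = torch.ones(B, N + 1, N + 1, dtype=points.dtype)  # border of ones
    cm[:, 0, 0] = 0.0
    cm[:, 1:, 1:] = d2
    k = N - 1                                             # simplex dimension
    coef = (-1) ** N / (2 ** k * math.factorial(k) ** 2)  # Cayley-Menger prefactor
    return coef * torch.linalg.det(cm)
```

For a right triangle with legs of length 1 the area is 0.5, so the squared volume is 0.25.
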
*Why CV exists*: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (a perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).

*The natural basin*: Extensive experimentation (43 configurations, noise prediction with zero data structure) established that smooth optimization on S^(d-1) naturally converges to CV ≈ 0.23 at d=128 with no CV loss applied. This is the equilibrium between the sphere's rigidity (fixed curvature) and gradient descent's smoothness (continuous updates). Setting cv_target=0.80 and applying weight=100 in a full training run still produced CV=0.20; the sphere's geometry overrides the loss.

*The floor*: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11. This is the hard geometric floor: the maximum volume regularity achievable by smooth functions on S^(d-1).

*Batched computation*: The default `batched=True` eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via `argsort(rand)`, all Cayley-Menger matrices are constructed in parallel, and a single `torch.linalg.det` call on shape `(n_samples, 6, 6)` computes all volumes at once. Measured speedup: **141x** at n=200 samples. The `batched=False` fallback exists for debugging and validation.

### InfoNCE

**`nce_loss(z1, z2, temperature=0.07, normalize=True)`**

Standard symmetric InfoNCE contrastive loss. Two augmented views of the same sample should produce similar embeddings; different samples should produce dissimilar ones. Returns both the loss and the accuracy (fraction of correctly matched pairs).

*Why it exists at three levels*: InfoNCE is applied to embeddings (external domain), patchwork outputs (geometric domain), and triangulations (internal domain). Each level enforces view consistency at a different stage of the pipeline. Embedding NCE ensures the encoder produces stable features. Patchwork NCE ensures the geometric interpretation is view-invariant. Triangulation NCE ensures the angular distance profile is stable across augmentations.

The temperature parameter controls sharpness: lower temperature makes the loss more sensitive to small similarity differences. The default of 0.07 for embeddings is sharper than the 0.1 used for assignments, reflecting the higher precision expected at the embedding level.

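A common formulation of the symmetric loss, matching the signature above (a sketch; the repository's exact reduction and return values may differ):

```python
import torch
import torch.nn.functional as F

def nce_loss(z1, z2, temperature=0.07, normalize=True):
    """Symmetric InfoNCE over paired views; returns (loss, match accuracy)."""
    if normalize:
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) cross-view similarities
    targets = torch.arange(z1.size(0))            # positives sit on the diagonal
    loss = 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
    acc = (logits.argmax(dim=1) == targets).float().mean().item()
    return loss, acc
```
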
### Classification

**`ce_loss(logits, targets)`** – Standard cross-entropy. Returns loss and accuracy.

**`ce_loss_paired(logits1, logits2, targets)`** – Averaged cross-entropy over two augmented views. Both views should classify correctly; averaging prevents the model from specializing on one augmentation style.

### Bridge

**`bridge_loss(bridge_logits, assign_targets, detach_targets=True)`**

The bridge forces the patchwork to understand the constellation's assignment. The patchwork receives triangulation distances and produces an interpretation. The bridge head takes that interpretation and predicts which anchor each embedding was assigned to. If the patchwork has learned to read the constellation's structure, this prediction is easy. If not, the bridge loss provides gradient that teaches it.

*Why detach*: Assignment targets are detached from the computation graph by default. This makes the bridge one-way: the constellation teaches the patchwork, but the patchwork cannot modify the constellation's assignment. Without detachment, classification gradients would flow backward through the bridge into anchor positions, defeating the separation between internal and external domains.

**`bridge_loss_paired(bridge1, bridge2, assign1, assign2)`** – Averaged over two views.

### Assignment

**`assign_bce_loss(soft_assign, cos_to_anchors)`**

Binary cross-entropy between the soft assignment (softmax over cosine similarities) and a hard one-hot target at the nearest anchor. Pushes assignments toward crispness: each embedding should clearly belong to one anchor, not be smeared across many.

*Why BCE, not CE*: The target is one-hot over A anchors (256-2048). Standard cross-entropy would treat this as a classification problem and apply log-softmax to raw logits, but the soft assignment is already a probability distribution (the output of a softmax), and the goal is to measure how close it is to a specific target distribution. BCE operates element-wise, measuring the divergence at every anchor position independently.

**`assign_nce_loss(assign1, assign2, temperature=0.1)`**

InfoNCE between assignments from two augmented views. Two views of the same image should produce the same assignment pattern. This is the internal domain's view-consistency signal.

### Attraction

**`attraction_loss(cos_to_anchors)`**

`1 - max_cos`, averaged over the batch. Pulls each embedding toward its nearest anchor. Without this force, embeddings can drift to regions of S^(d-1) far from any anchor, where triangulation distances are large and uninformative.

*Balance*: Attraction pulls embeddings toward anchors; spread pushes anchors apart. The equilibrium produces a Voronoi tessellation where each embedding is close to its designated anchor but anchors are maximally separated. Weight 0.25 in the standard configuration.

### Spread

**`spread_loss(anchors, target_cos=0.0)`**

`ReLU(cos_similarity - target_cos)`, averaged over all anchor pairs. Penalizes any pair of anchors whose cosine similarity exceeds the target (default 0.0, meaning orthogonal). Keeps anchors spread across the sphere rather than collapsing into clusters.

*Why ReLU*: Only similarity above the target is penalized. Anchors that are already orthogonal or anti-aligned receive zero gradient. This is a soft constraint, not a hard one: it permits temporary clustering during training when the task demands it, while providing a restoring force toward spread.

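The spread formula is short enough to write out directly; a sketch following the definition above (self-pairs are excluded, since an anchor's similarity with itself is always 1):

```python
import torch
import torch.nn.functional as F

def spread_loss(anchors: torch.Tensor, target_cos: float = 0.0) -> torch.Tensor:
    """Mean ReLU(cos - target_cos) over distinct anchor pairs."""
    a = F.normalize(anchors, dim=-1)
    cos = a @ a.t()
    off_diag = ~torch.eye(a.size(0), dtype=torch.bool)  # drop self-pairs
    return F.relu(cos[off_diag] - target_cos).mean()
```
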
### kNN Accuracy

**`knn_accuracy(embeddings, targets, k=1)`**

Non-differentiable metric. Classifies each embedding by the label of its nearest neighbor (or the majority vote of k neighbors) in embedding space. This validates the geometric structure independently of the task head: if kNN accuracy is high, the embedding space has learned a geometry that separates classes without requiring a learned classifier.

*Why it matters*: The gap between task-head accuracy and kNN accuracy measures how much the model depends on the learned classifier versus the raw geometric structure. A small gap means the geometry is doing the work. A large gap means the task head is compensating for poor geometry.

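The k=1 case reduces to a leave-one-out nearest-neighbor lookup. A sketch in cosine space (majority voting for k > 1 is omitted; the repository's version may differ):

```python
import torch
import torch.nn.functional as F

def knn_accuracy(embeddings: torch.Tensor, targets: torch.Tensor) -> float:
    """Leave-one-out 1-NN accuracy in cosine space."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()
    sim.fill_diagonal_(-2.0)               # exclude self-matches
    nearest = sim.argmax(dim=1)
    return (targets[nearest] == targets).float().mean().item()
```
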
### Three-Domain Compound Loss

**`three_domain_loss(output, targets, constellation, ...)`**

The complete cooperative loss function with all weight arguments exposed. Standalone alternative to `InternalConstellationCore.compute_loss()` for use outside the standard encoder pipeline.

Default weights:

```
EXTERNAL:  CE × 1.0 + NCE_emb × 0.5
GEOMETRIC: NCE_pw × 1.0 + bridge × 1.0
INTERNAL:  assign × 0.5 + assign_nce × 0.25 + NCE_tri × 0.5
           + attract × 0.25 + CV × 0.01 + spread × 0.01
```

These weights were tuned to give each domain approximately equal total gradient magnitude. Without explicit balancing, the internal domain (6 terms) dominates the external domain (2 terms) by raw term count.

---

## File 3: `geolip_encoder.py` – Trainable Model

The complete image classification pipeline: pixels in, logits out, with the full geometric structure exposed at every stage.

### ConvEncoder

**`ConvEncoder(output_dim)`**

8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final adaptive average pooling to 1×1, followed by a linear projection to `output_dim` with LayerNorm.

L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies `F.normalize()` after extracting the raw magnitude.

### InternalConstellationCore

**`InternalConstellationCore(num_classes, dim, n_anchors, n_comp, d_comp, ...)`**

The three-domain head that owns the constellation geometry. Contains:

- **Constellation**: anchors on S^(d-1), shaped by internal losses and push
- **Patchwork**: interprets magnitude-weighted triangulation distances
- **Bridge**: linear projection from patchwork to anchor space (proves the patchwork understands the constellation)
- **Task head**: MLP reading `[soft_assignment, patchwork, embedding]` → logits

The `forward_paired()` method processes two augmented views simultaneously, returning a dict with all intermediate representations needed by every loss term. The `compute_loss()` method computes all three domains with configurable per-term weights.

*Separation principle*: The task head reads the constellation's assignment but does not shape it. CE gradient flows through the task head to the encoder (improving features) and through the patchwork (improving interpretation), but constellation anchors are shaped only by internal losses plus push. This prevents the classifier from repositioning anchors as classification shortcuts.

### GeoLIPImageEncoder

**`GeoLIPImageEncoder(num_classes, output_dim, n_anchors, n_comp, d_comp, ...)`**

The full pipeline. Combines ConvEncoder + MagnitudeFlow + InternalConstellationCore into a single module with clean interfaces:

- `forward(x)` – Single-view eval: pixels → logits + all geometric outputs
- `forward_paired(v1, v2)` – Two-view training: paired inputs → dict for loss computation
- `compute_loss(output, targets, **kwargs)` – Three-domain loss with all weights exposed
- `make_optimizer(lr, weight_decay)` – Builds AdamW with proper anchor parameter exclusion
- `get_anchor_param_ids()` – Returns the param IDs that must have `weight_decay=0`
- `summary()` – Prints the parameter breakdown by submodule

*`make_optimizer()`*: Constellation anchors and relay-layer anchors live on spheres. Weight decay pulls parameters toward the origin, which would collapse anchors to zero norm and destroy their geometric meaning. `make_optimizer()` automatically identifies all anchor parameters and places them in a separate param group with `weight_decay=0`.

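The param-group split can be sketched with AdamW's per-group options. This standalone version takes the ID set that `get_anchor_param_ids()` is described as returning (the free-function form and defaults here are illustrative):

```python
import torch

def make_optimizer(model, anchor_param_ids, lr=3e-4, weight_decay=0.05):
    """AdamW with anchor parameters placed in a zero-weight-decay group."""
    anchors, rest = [], []
    for p in model.parameters():
        (anchors if id(p) in anchor_param_ids else rest).append(p)
    return torch.optim.AdamW(
        [{"params": rest, "weight_decay": weight_decay},
         {"params": anchors, "weight_decay": 0.0}],
        lr=lr)
```
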
---

## Empirical Constants

Two constants have been validated across 17+ models, covering all architectures and modalities tested:

**CV ≈ 0.20-0.23** – The natural coefficient of variation of pentachoron volumes on S^(d-1) under smooth optimization. Observed at effective dimension ~16 across contrastive models, language models, diffusion models, and VAEs. Setting cv_target=0.80 with weight=100 still produces CV=0.20. The constant is a property of the sphere's curvature interacting with gradient descent's smoothness, not a hyperparameter.

**0.29154 radians** – The binding/separation phase boundary. Below this angular distance, an embedding is structurally bound to its nearest anchor (local geometry dominates). Above it, task pressure has moved the embedding beyond local curvature (classification dominates). Observed independently in contrastive training, language modeling, ODE flow matching, and alpha-parameter convergence across architectures.

---

## Design Rules

These are not preferences; they are empirically validated constraints. Violating them produces measurable degradation.

1. **Never use attention in geometric pipelines.** Softmax averaging destroys angular structure. Each attention layer without a residual halves effective dimensionality. Use ConstellationRelay or RelayLayer instead.

2. **Never use global average pooling in geometric encoders.** It collapses spatial structure. Flatten, or use spatial statistics (mean + std per channel at minimum). Confirmed empirically: a 243-d average pool drops accuracy from ~70% to ~29% versus a 15552-d flatten.

3. **Never apply weight decay to anchor parameters.** Anchors live on S^(d-1). Decay pulls them toward the origin, destroying normalization. Use `make_optimizer()` or manually separate the param groups.

4. **Never let classification gradient reach anchor positions.** Detach assignment targets in the bridge. Let push handle global anchor repositioning. CE gradient on anchors creates classification shortcuts that destroy the Voronoi structure.

5. **Always L2-normalize before triangulation.** Triangulation measures angular position. Unnormalized embeddings mix magnitude and direction, making the distances meaningless.

6. **Relay gates must initialize cold.** `gate_init=-3.0` (sigmoid ≈ 0.047). Hot initialization causes deep relay stacks to diverge before the network has learned meaningful features.