GeoLIP Core — Geometric Linear Interpolative Patchwork

GeoLIP is a geometric deep learning framework built on a single premise: the unit hypersphere has structure, and that structure is computable. Rather than scaling parameters to approximate statistical boundaries between classes, GeoLIP places learnable reference points (anchors) on S^(d-1) and measures angular relationships between embeddings and those anchors. The resulting distance patterns — interpreted through compartmentalized patchwork networks — produce compact, reusable geometric features that generalize without requiring billions of parameters.

This repository contains the stable core: three files, 1315 lines, covering every component from anchor initialization through magnitude prediction to the full three-domain training objective. Every piece is composable, every loss is interchangeable, and every design decision traces back to empirically validated geometric principles.

Architecture Overview

Pixels → ConvEncoder → features → L2 normalize → S^(d-1)
                                        ↓
                    MagnitudeFlow (relay stack, no attention)
                         ↓                    ↓
                    per-anchor magnitude    embedding
                         ↓                    ↓
              tri = (1 - cos) × magnitude    cos → soft assignment
                         ↓                    ↓
                    Patchwork              soft_assign
                    (compartments)              ↓
                         ↓              ┌──────┴──────┐
                    bridge prediction   task head → logits
                                        (reads all three)

Three independent loss domains shape the model cooperatively:

  • External (task): cross-entropy + embedding InfoNCE
  • Geometric (structure): patchwork InfoNCE + bridge
  • Internal (self-organization): assignment crispness, triangulation consistency, attraction, CV regularization, anchor spread

The constellation discovers its own Voronoi structure through internal losses. The task head reads that structure but does not write to it. This separation prevents classification shortcuts from hijacking anchor geometry.


File 1: geolip_core.py β€” Geometric Building Blocks

Everything structural. No losses, no training loops. Pure composable components that can be assembled into any geometric architecture.

Activations

SquaredReLU — x → ReLU(x)². Empirically the strongest activation across all GeoLIP patchwork configurations. The squaring amplifies separation between active and inactive neurons, producing sharper compartment specialization than GELU or standard ReLU. Used as the default throughout.

StarReLU — x → ReLU(x)² × scale + bias with learnable scale and bias. Runner-up to SquaredReLU in bulk activation tests. The learnable parameters allow the activation to adapt its dynamic range per layer, useful when compartments operate at different magnitude scales.

make_activation(name) — Factory function. Supports squared_relu, star_relu, gelu, relu, sigmoid. Every patchwork and task head references activations by name through this factory, making architecture-wide activation swaps trivial.
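The two activations and the factory can be sketched as follows. Class and function names mirror the README; the exact signatures in geolip_core.py may differ, and the MLP sizes here are illustrative.

```python
import torch
import torch.nn as nn

class SquaredReLU(nn.Module):
    """x -> ReLU(x)^2: amplifies separation between active and inactive units."""
    def forward(self, x):
        return torch.relu(x) ** 2

class StarReLU(nn.Module):
    """x -> ReLU(x)^2 * scale + bias, with learnable scale and bias."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))
    def forward(self, x):
        return torch.relu(x) ** 2 * self.scale + self.bias

def make_activation(name: str) -> nn.Module:
    table = {
        "squared_relu": SquaredReLU,
        "star_relu": StarReLU,
        "gelu": nn.GELU,
        "relu": nn.ReLU,
        "sigmoid": nn.Sigmoid,
    }
    return table[name]()

act = make_activation("squared_relu")
print(act(torch.tensor([-2.0, 3.0])))  # tensor([0., 9.])
```

Swapping activations across an architecture then reduces to changing a single string.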

Anchor Initialization

Anchor placement on S^(d-1) at initialization determines the starting Voronoi tessellation. Poor initialization can leave large regions of the sphere unmeasured.

init_anchors_xavier(n, d) — Xavier normal, then L2-normalize. Fast, reasonable for moderate anchor counts. Near-orthogonal in high dimensions by concentration of measure.

init_anchors_orthogonal(n, d) — QR decomposition for an exact orthonormal basis when n ≤ d. When n > d, fills the first d anchors with the orthonormal basis and adds random normalized vectors for the remainder. Guarantees zero mutual cosine similarity for the first d anchors.

init_anchors_repulsion(n, d) — QR initialization followed by 200 iterations of nearest-neighbor repulsion. Each step pushes every anchor away from its closest neighbor, producing even coverage of S^(d-1). The proven default. Costs ~50ms at initialization but produces measurably better early-training geometry than Xavier or orthogonal alone.
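A minimal sketch of the repulsion scheme: QR start, then each iteration pushes every anchor directly away from its single nearest neighbor and renormalizes. The step size and schedule are assumptions; the real init_anchors_repulsion may differ.

```python
import torch
import torch.nn.functional as F

def init_anchors_repulsion(n, d, steps=200, lr=0.05):
    # Orthonormal start for the first min(n, d) anchors, random for the rest.
    a = torch.randn(max(n, d), d)
    q, _ = torch.linalg.qr(a.T)               # (d, d) orthonormal columns
    anchors = q.T[:n] if n <= d else torch.cat(
        [q.T, F.normalize(torch.randn(n - d, d), dim=-1)], dim=0)
    anchors = F.normalize(anchors, dim=-1)
    for _ in range(steps):
        cos = anchors @ anchors.T
        cos.fill_diagonal_(-2.0)              # exclude self from nearest-neighbor
        nn_idx = cos.argmax(dim=1)            # closest neighbor per anchor
        push = anchors - anchors[nn_idx]      # direction away from that neighbor
        anchors = F.normalize(anchors + lr * push, dim=-1)
    return anchors

anchors = init_anchors_repulsion(32, 16)
cos = anchors @ anchors.T
cos.fill_diagonal_(-1.0)
print("max pairwise cos:", cos.max().item())
```

After repulsion, the maximum pairwise cosine drops well below what random placement gives, i.e. coverage of the sphere is more even.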

Constellation

Constellation(n_anchors, dim, anchor_drop, anchor_init)

The fundamental measurement instrument. Anchors are learnable parameters living on S^(d-1), re-normalized every forward pass. Triangulation computes cosine similarity between each embedding and every anchor, producing an angular distance profile that uniquely identifies the embedding's position on the sphere.

Why it exists: A single embedding vector is a point. A triangulation against N anchors is a measurement — it describes where that point sits relative to N known reference positions. This is the difference between "a location" and "a location on a map." The patchwork reads the map, not the location.

Anchor dropout: During training, randomly masks a fraction of anchors (default 15%). Forces the patchwork to develop redundant measurement pathways rather than depending on any single anchor. Proven to improve generalization.

Critical rule: Anchors are always L2-normalized before use. They live on S^(d-1), not in ambient R^d. Weight decay must be disabled for anchor parameters — decay pulls them toward the origin, destroying their geometric meaning.
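The measurement step can be sketched as below, assuming the interface described above; the real class has more machinery, and the dropout form here (zero-masking cosines) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Constellation(nn.Module):
    def __init__(self, n_anchors, dim, anchor_drop=0.15):
        super().__init__()
        self.anchors = nn.Parameter(F.normalize(torch.randn(n_anchors, dim), dim=-1))
        self.anchor_drop = anchor_drop

    def forward(self, emb):
        # Re-normalize every forward pass: anchors live on S^(d-1).
        anchors = F.normalize(self.anchors, dim=-1)
        cos = emb @ anchors.T                    # (B, A) angular profile
        if self.training and self.anchor_drop > 0:
            mask = torch.rand(cos.shape[1], device=cos.device) > self.anchor_drop
            cos = cos * mask                     # masked anchors read as zero
        return cos

const = Constellation(n_anchors=64, dim=32).eval()
emb = F.normalize(torch.randn(8, 32), dim=-1)
cos = const(emb)
print(cos.shape)  # torch.Size([8, 64])
```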

Patchwork

Patchwork(n_anchors, n_comp, d_comp, activation)

Round-robin compartmentalized interpreter. Each compartment reads a disjoint subset of anchor distances (anchor k goes to compartment k % n_comp) through a 2-layer MLP with LayerNorm.

Why it exists: Raw triangulation distances are high-dimensional and redundant. The patchwork compresses them into compartment-specific features, where each compartment specializes in a different angular region of the sphere. This specialization is enforced by architecture (non-overlapping anchor subsets) and verified empirically (compartment correlation < 0.15 in trained models).

Why round-robin: Consecutive anchors tend to cluster spatially after push operations. Round-robin assignment (0→comp0, 1→comp1, ..., 8→comp0, 9→comp1, ...) ensures each compartment receives anchors distributed across the sphere rather than spatially concentrated. This maximizes measurement diversity per compartment.

Output: (B, n_comp × d_comp) — a structured geometric descriptor with independently interpretable compartments.
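The round-robin routing and per-compartment MLPs can be sketched as follows. The hidden size and the plain ReLU are illustrative stand-ins (the default activation in GeoLIP is squared_relu).

```python
import torch
import torch.nn as nn

class Patchwork(nn.Module):
    def __init__(self, n_anchors, n_comp, d_comp, hidden=64):
        super().__init__()
        # Round-robin: compartment c reads anchors c, c+n_comp, c+2*n_comp, ...
        self.slices = [torch.arange(c, n_anchors, n_comp) for c in range(n_comp)]
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.LayerNorm(len(idx)),
                nn.Linear(len(idx), hidden), nn.ReLU(),
                nn.Linear(hidden, d_comp),
            ) for idx in self.slices)

    def forward(self, tri):                       # tri: (B, n_anchors)
        outs = [mlp(tri[:, idx]) for mlp, idx in zip(self.mlps, self.slices)]
        return torch.cat(outs, dim=-1)            # (B, n_comp * d_comp)

pw = Patchwork(n_anchors=16, n_comp=4, d_comp=8)
print(pw(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```

Each compartment's input slice is disjoint by construction, which is what enforces the specialization described above.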

RelayLayer

RelayLayer(input_dim, patch_dim, n_anchors, n_phases, pw_hidden, gate_init)

The core geometric processing primitive. Replaces attention entirely. Operates on S^(patch_dim-1) (default S^15, where the CV attractor is a geometric fact).

Pipeline per forward pass:

  1. LayerNorm the input
  2. Reshape into P patches of dimension patch_dim
  3. L2-normalize each patch to S^(patch_dim-1)
  4. Triangulate against per-patch anchors at 3 SLERP phases (t=0, 1/3, 2/3)
  5. Independent patchwork MLP per patch interprets the triangulation
  6. Gated residual: gate × patchwork_output + (1-gate) × input_patch
  7. Global skip connection: input + blended_output

Why it exists: Attention (softmax-weighted averaging) destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28 per layer). With residual connections, the residual signal dominates ~11:1, making attention a small perturbation — the residual is doing the geometric preservation, not the attention. The relay replaces the attention mechanism entirely: it measures and gates rather than averaging angular relationships. Empirically preserves 99.4% cosine similarity at depth 16, compared to 7.4% for attention stacks without residual.

SLERP stroboscope: Each relay layer's anchors interpolate between their home position (initialization) and their current learned position via spherical linear interpolation. Triangulating at 3 phases along this path provides angular measurements that no single-shot triangulation can capture — the rate of change of distance as the anchor moves reveals curvature information invisible to static measurement.

Cold gating: Gates initialize at sigmoid(-3) ≈ 0.047. The relay blends gate × patchwork_output + (1-gate) × input_patch and adds it to the global skip. At initialization, 95.3% of the blended signal comes from the input patch (near-identity), and only 4.7% from the random untrained patchwork. This means stacking N relays at init produces minimal noise accumulation. The gates open during training only where the relay's geometric processing provides useful signal, allowing selective depth utilization.

Critical rule: Relay anchors (like constellation anchors) must be excluded from weight decay. They live on per-patch spheres S^(patch_dim-1).
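A compressed sketch of the pipeline above, under two simplifying assumptions: a single triangulation phase in place of the 3-phase SLERP stroboscope, and a linear stand-in for the per-patch patchwork MLP. The class name MiniRelay and all sizes are illustrative, not the geolip_core.py implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniRelay(nn.Module):
    def __init__(self, input_dim, patch_dim=16, n_anchors=8, gate_init=-3.0):
        super().__init__()
        assert input_dim % patch_dim == 0
        self.p, self.pd = input_dim // patch_dim, patch_dim
        self.norm = nn.LayerNorm(input_dim)
        self.anchors = nn.Parameter(
            F.normalize(torch.randn(self.p, n_anchors, patch_dim), dim=-1))
        self.patchwork = nn.Linear(n_anchors, patch_dim)   # stand-in for the MLP
        self.gate = nn.Parameter(torch.full((1,), gate_init))

    def forward(self, x):
        h = self.norm(x).view(-1, self.p, self.pd)         # steps 1-2
        patches = F.normalize(h, dim=-1)                   # step 3: S^(pd-1)
        anchors = F.normalize(self.anchors, dim=-1)
        tri = torch.einsum('bpd,pad->bpa', patches, anchors)  # step 4 (one phase)
        out = self.patchwork(tri)                          # step 5
        g = torch.sigmoid(self.gate)                       # ~0.047 cold at init
        blended = g * out + (1 - g) * patches              # step 6
        return x + blended.reshape(x.shape)                # step 7: global skip

relay = MiniRelay(64)
y = relay(torch.randn(4, 64))
print(y.shape)  # torch.Size([4, 64])
```

At init the gate sits near 0.047, so stacking these layers is close to a normalized identity plus skip, as the cold-gating paragraph describes.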

ConstellationRelay

ConstellationRelay(dim, n_anchors, n_comp, d_comp, gate_init, anchor_init, activation)

Sequence-aware wrapper around the relay concept, using the full Constellation + Patchwork pipeline instead of the einsum-based RelayLayer. Handles both (B, D) and (B, S, D) inputs, making it usable as a drop-in replacement for attention layers in any transformer-like architecture.

Why both RelayLayer and ConstellationRelay exist: RelayLayer is optimized for the MagnitudeFlow's internal processing (fixed patch_dim=16, einsum-based, maximum throughput). ConstellationRelay is the general-purpose version that accepts arbitrary dimensions and sequence inputs.

MagnitudeFlow

MagnitudeFlow(dim, n_anchors, hidden_dim, n_heads, n_layers, mag_min, mag_max, n_comp)

Per-compartment magnitude prediction through a stack of RelayLayers. No attention anywhere in the stack.

The core insight: L2 normalization projects embeddings onto S^(d-1), destroying magnitude information. But the pre-normalization magnitude carries signal — it reflects encoder confidence. MagnitudeFlow recovers this signal geometrically: it takes the embedding direction, the triangulation profile, and the raw magnitude as context, processes them through N relay layers on S^15, and outputs per-compartment magnitude scalars.

These scalars weight the triangulation distances per compartment before the patchwork reads them: tri_weighted = tri × magnitude. Compartments receiving higher magnitude become more influential in the patchwork's interpretation. This gives the model a per-region confidence mechanism without any attention.
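The weighting step itself is simple. This sketch expands per-compartment scalars to per-anchor weights using the round-robin mapping from the Patchwork (all names and sizes here are illustrative):

```python
import torch

B, A, n_comp = 2, 12, 4
tri = torch.rand(B, A)                        # angular distances, (1 - cos) style
mag_comp = torch.rand(B, n_comp) * 1.5 + 0.5  # per-compartment magnitude scalars

comp_of_anchor = torch.arange(A) % n_comp     # anchor k -> compartment k % n_comp
mag_anchor = mag_comp[:, comp_of_anchor]      # expand to per-anchor (B, A)
tri_weighted = tri * mag_anchor               # higher-magnitude compartments
                                              # become more influential
print(tri_weighted.shape)  # torch.Size([2, 12])
```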

Architecture:

emb_proj(dim→relay/2) + tri_proj(A→relay/4) + raw_mag(1) → ctx_proj → relay_dim
    → RelayLayer 1 (own anchors on S^15) → gated residual → skip
    → RelayLayer 2 (own anchors on S^15) → gated residual → skip
    → ...
    → reshape to (B, n_comp, 16) → per-compartment MLP → sigmoid → [mag_min, mag_max]
    → expand to per-anchor (B, A)

Stats bias: After each anchor push operation, MagnitudeFlow receives per-compartment momentum statistics from the push. These are added as a bias to the magnitude output, allowing the magnitude to account for how rapidly the anchor field is changing in each region.

Why per-compartment, not per-anchor or global: Three approaches were tested empirically. A transformer-based magnitude predictor (3 tokens, self-attention) averaged across tokens, producing near-uniform magnitude across compartments and inverted confidence (wrong predictions scored higher than correct, Δ=-0.14). Per-compartment independent MLPs reduced the inversion (Δ=-0.07) but couldn't eliminate it. The relay stack achieved Δ≈+0.001 (hallucination eliminated) with genuine compartment differentiation (std=0.69). The failure mode of the transformer was specifically that attention averaged the 3-token context, destroying per-region signal. Per-compartment granularity (4-8 scalars) is coarse enough to prevent overfitting to individual anchor noise while fine enough to capture regional confidence differences.

AnchorPush

AnchorPush(strategy, n_anchors, dim, **kwargs)

Periodically repositions anchors toward class centroids computed from accumulated embeddings. Three strategies:

raw: Fixed learning rate blend toward target. anchor = normalize(anchor + lr × (target - anchor)). Simple, effective, no state.

gru: Statistics-gated SLERP. Maintains EMA of utilization and drift per anchor. Update gate z scales with misalignment + underuse. Reset gate r controls blending with previous position. Produces variable-speed updates: underused anchors move faster, well-placed anchors stay put.

momentum: SGD with momentum on S^(d-1). Accumulates residuals in tangent space with configurable decay. Reprojection onto the tangent plane at each step keeps the accumulator geometrically valid. Dead anchors (utilization below floor) receive forced correction. The proven default for production training.

Why push exists: Anchors are learnable parameters that receive gradient from internal losses (spread, attraction, assignment). But gradient-based anchor movement is slow and can get trapped in local optima. Push provides a periodic global correction based on the actual class structure of the embedding space, analogous to batch normalization providing periodic mean/variance correction.

Critical rule: Push writes directly to anchors.data, bypassing the optimizer. It operates in parameter space, not gradient space. The optimizer's momentum and adaptive learning rates for anchor parameters are therefore stale after each push — this is intentional. The optimizer handles fine local adjustment; push handles coarse global repositioning.
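A hedged sketch of the momentum strategy: residuals are projected into the tangent plane at each anchor, accumulated with decay, and the accumulator is re-projected after the step so it stays geometrically valid. The decay value, step size, and exact reprojection order are assumptions; the dead-anchor correction is omitted. Note the write goes through .data, bypassing the optimizer as the rule above requires.

```python
import torch
import torch.nn.functional as F

def momentum_push(anchors, targets, velocity, lr=0.1, beta=0.9):
    a = anchors.data
    resid = targets - a
    # Remove the radial component (resid . a) a: keep only tangential motion.
    tangent = resid - (resid * a).sum(-1, keepdim=True) * a
    velocity.mul_(beta).add_(tangent)              # decayed accumulation
    new = F.normalize(a + lr * velocity, dim=-1)   # step, back onto S^(d-1)
    # Re-project the accumulator onto the tangent plane at the new position.
    velocity.sub_((velocity * new).sum(-1, keepdim=True) * new)
    anchors.data.copy_(new)                        # parameter-space write

anchors = torch.nn.Parameter(F.normalize(torch.randn(8, 16), dim=-1))
targets = F.normalize(torch.randn(8, 16), dim=-1)
vel = torch.zeros_like(anchors)
for _ in range(5):
    momentum_push(anchors, targets, vel)
print(anchors.data.norm(dim=-1))  # all ~1.0
```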

FlowAttention (Historical)

FlowAttention(dim, n_anchors, flow_dim, n_steps, time_dim, gate_init)

3-step Euler ODE integration in the tangent plane of S^(d-1), conditioned on sinusoidal timestep embeddings and push statistics. The full pipeline (Conv8 encoder + 6-step flow variant + learned head) achieved 69.8% single-view accuracy on CIFAR-100, but the approach was superseded by the relay architecture, which provides equivalent geometric processing without the ODE overhead and with better depth scaling (relay preserves 99.4% geometry at depth 16; flow accuracy degrades with step count beyond 6, and cross-constellation testing revealed 45% angular displacement — cos(pre,post)=0.555).

Retained for backward compatibility with existing checkpoints. New architectures should use RelayLayer or ConstellationRelay.

GeometricAutograd

GeometricAutograd — A custom autograd Function that is the identity in the forward pass but modifies gradients in the backward pass. Two corrections:

  1. Tangential projection: Attenuates the radial component of gradients (the component pointing toward/away from the origin). On S^(d-1), only tangential movement is meaningful — radial gradients push the embedding off the sphere, which L2 renormalization then undoes. Removing them before they accumulate in optimizer state reduces wasted momentum.

  2. Anchor separation: Projects out the component of the gradient pointing toward the nearest anchor. This prevents embeddings from collapsing onto their nearest anchor, maintaining measurement diversity.
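A sketch of the first correction only (tangential projection) as a torch.autograd.Function; the full attenuation factor and the anchor-separation correction are omitted, so this is an assumption about the mechanism, not the geolip_core.py implementation.

```python
import torch
import torch.nn.functional as F

class TangentialGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, emb):
        ctx.save_for_backward(emb)
        return emb.clone()                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad):
        (emb,) = ctx.saved_tensors
        unit = F.normalize(emb, dim=-1)
        radial = (grad * unit).sum(-1, keepdim=True) * unit
        return grad - radial                     # keep only the tangential part

emb = F.normalize(torch.randn(4, 8), dim=-1).requires_grad_()
out = TangentialGrad.apply(emb)
out.sum().backward()
residual = (emb.grad * emb.detach()).sum(-1)     # radial component of the grad
print(residual.abs().max().item() < 1e-5)        # True
```

After the projection, every gradient row is orthogonal to its embedding, so optimizer state only accumulates on-sphere motion.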

Utilities

param_count(module, name) — Counts total and trainable parameters. Prints a formatted line when name is provided.

model_summary(model) — Prints per-submodule parameter breakdown. Essential for verifying that the parameter budget is allocated as intended.


File 2: geolip_losses.py β€” Losses & Regularization

Every loss function and monitoring metric, with uniform interfaces. All losses return differentiable scalar tensors. All metrics return Python floats.

CV — Coefficient of Variation of Pentachoron Volumes

The signature geometric measurement. Samples random 5-point simplices (pentachora) from the embedding space, computes their volumes via Cayley-Menger determinants, and measures the coefficient of variation (std/mean) of those volumes.

cv_loss(emb, target=0.22, n_samples=64, batched=True)

Differentiable loss: (CV - target)². Pushes the embedding distribution toward a target volume regularity.

cv_metric(emb, n_samples=200, batched=True)

Non-differentiable monitoring metric. Reports the raw CV value.

cv_multi_scale(emb, scales=(3,4,5,6,7,8), n_samples=100, batched=True)

CV computed at multiple simplex sizes. Healthy geometry shows CV in [0.18, 0.25] at all scales. Scale-dependent CV indicates that the embedding distribution has different structure at different resolutions.

cayley_menger_vol2(points)

Raw Cayley-Menger determinant computation. Given (B, N, D) point sets, returns (B,) squared simplex volumes. The mathematical foundation underlying all CV computation.

Why CV exists: The coefficient of variation of simplex volumes on S^(d-1) measures how regularly the embeddings fill the sphere. CV ≈ 0 means all simplices have identical volume (perfectly uniform distribution). CV >> 1 means volumes vary wildly (tight clusters with voids).

The natural basin: Extensive experimentation (43 configurations, pure noise prediction with zero data structure) established that smooth optimization on S^(d-1) at d=128 naturally converges to CV ≈ 0.24 (mean 0.2393 across 5 seeds, spread 0.013) with no CV loss applied. In trained full models with structured data, CV converges to the tighter 0.20–0.23 band observed across 17+ architectures. The natural basin is strong: at low CV weight (≤0.01), the loss cannot displace CV from the basin regardless of target — targets from 0.00 to 1.00 all produce CV in [0.23, 0.25]. However, CV is not immovable: at weight ≥ 1.0, the optimizer CAN escape the basin (w=1.0/t=0.80 → CV≈0.71; w=100/t=0.80 → CV≈0.82). The key finding is that the basin exists as an attractor independent of the loss function — it emerges from the interaction between the sphere's curvature and gradient descent's smoothness.

The floor: Even with extreme force (weight=100, target=0.00), CV cannot be pushed below ~0.11–0.12. This is the hard geometric floor — the maximum volume regularity achievable by smooth functions on S^(d-1). Below this, simplex volumes are constrained by the curvature itself.

Dimension dependence: The basin is not universal across dimensions. The noise sweep showed: d=16 → CV=0.37, d=32 → CV=0.26, d=64 → CV=0.23, d=128 → CV=0.24, d=256 → CV=0.27, d=512 → CV=0.33. The minimum is around d=64–128, corresponding to effective geometric dimension ~16–47. At very low dimension, the sphere is too curved for regular volumes. At very high dimension, the model uses only a subspace (effective dim ~41–50 regardless of ambient dim), and the unused dimensions introduce volume irregularity.

Batched computation: The default batched=True eliminates the Python loop over samples. All n_samples pentachora are sampled simultaneously via argsort(rand), all Cayley-Menger matrices are constructed in parallel, and a single torch.linalg.det call on shape (n_samples, 6, 6) computes all volumes at once. Measured speedup: 141x at n=200 samples. The batched=False fallback exists for debugging and validation.
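The batched path can be sketched as below, using the standard Cayley-Menger constant for a 4-simplex (5 points give a 6×6 bordered matrix, matching the (n_samples, 6, 6) shape above). Function names mirror the README, but the signatures in geolip_losses.py may differ.

```python
import math
import torch

def cayley_menger_vol2(points):                  # points: (B, 5, D)
    B, N, _ = points.shape                       # N = 5 points -> 4-simplex
    d2 = torch.cdist(points, points) ** 2        # (B, 5, 5) squared distances
    cm = torch.ones(B, N + 1, N + 1, dtype=points.dtype)
    cm[:, 0, 0] = 0.0
    cm[:, 1:, 1:] = d2
    j = N - 1                                    # simplex dimension (4)
    factor = (-1) ** (j + 1) / (2 ** j * math.factorial(j) ** 2)
    return factor * torch.linalg.det(cm)         # one det call for all simplices

def cv_metric(emb, n_samples=200):
    # All pentachora sampled at once via argsort(rand): no Python loop.
    idx = torch.argsort(torch.rand(n_samples, emb.shape[0]), dim=1)[:, :5]
    vol = cayley_menger_vol2(emb[idx]).clamp_min(0).sqrt()
    return (vol.std() / vol.mean()).item()

emb = torch.nn.functional.normalize(torch.randn(512, 64), dim=-1)
print(round(cv_metric(emb), 2))  # raw noise on the sphere lands near the basin
```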

InfoNCE

nce_loss(z1, z2, temperature=0.07, normalize=True)

Standard symmetric InfoNCE contrastive loss. Two augmented views of the same sample should produce similar embeddings; different samples should produce dissimilar ones. Returns both the loss and the accuracy (fraction of correctly matched pairs).

Why it exists at three levels: InfoNCE is applied to embeddings (external domain), patchwork outputs (geometric domain), and triangulations (internal domain). Each level enforces view consistency at a different stage of the pipeline. Embedding NCE ensures the encoder produces stable features. Patchwork NCE ensures the geometric interpretation is view-invariant. Triangulation NCE ensures the angular distance profile is stable across augmentations.

The temperature parameter controls sharpness: lower temperature makes the loss more sensitive to small similarity differences. The default 0.07 for embeddings is sharper than the 0.1 used for assignments, reflecting the higher precision expected at the embedding level.
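The loss-plus-accuracy return signature follows the README; the internals below are the standard symmetric formulation and may differ in detail from geolip_losses.py.

```python
import torch
import torch.nn.functional as F

def nce_loss(z1, z2, temperature=0.07, normalize=True):
    if normalize:
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature             # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])          # positives on the diagonal
    loss = 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
    acc = (logits.argmax(dim=1) == targets).float().mean()
    return loss, acc

z = torch.randn(16, 32)
loss, acc = nce_loss(z, z + 0.01 * torch.randn_like(z))
print(acc.item())  # near-identical views match ~perfectly
```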

Classification

ce_loss(logits, targets) — Standard cross-entropy. Returns loss and accuracy.

ce_loss_paired(logits1, logits2, targets) — Averaged cross-entropy over two augmented views. Both views should classify correctly; averaging prevents the model from specializing on one augmentation style.

Bridge

bridge_loss(bridge_logits, assign_targets, detach_targets=True)

The bridge forces the patchwork to understand the constellation's assignment. The patchwork receives triangulation distances and produces an interpretation. The bridge head takes that interpretation and predicts which anchor each embedding was assigned to. If the patchwork has learned to read the constellation's structure, this prediction is easy. If not, the bridge loss provides gradient that teaches it.

Why detach: Assignment targets are detached from the computation graph by default. This makes the bridge one-way: the constellation teaches the patchwork, but the patchwork cannot modify the constellation's assignment. Without detachment, classification gradients would flow backward through the bridge into anchor positions, defeating the separation between internal and external domains.

bridge_loss_paired(bridge1, bridge2, assign1, assign2) — Averaged over two views.
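A sketch of the one-way property, assuming soft-assignment targets and soft-target cross-entropy (the actual bridge_loss internals may differ). The final print demonstrates the detachment: gradient reaches the bridge logits but never flows back into the assignment path.

```python
import torch
import torch.nn.functional as F

def bridge_loss(bridge_logits, soft_assign, detach_targets=True):
    targets = soft_assign.detach() if detach_targets else soft_assign
    return F.cross_entropy(bridge_logits, targets)   # soft-target CE

B, A = 8, 16
cos = torch.randn(B, A, requires_grad=True)          # stand-in for anchor cosines
soft_assign = (cos / 0.1).softmax(dim=1)
bridge_logits = torch.randn(B, A, requires_grad=True)

loss = bridge_loss(bridge_logits, soft_assign)
loss.backward()
print(bridge_logits.grad is not None, cos.grad is None)  # True True
```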

Assignment

assign_bce_loss(soft_assign, cos_to_anchors)

Binary cross-entropy between the soft assignment (softmax over cosine similarities) and a hard one-hot target at the nearest anchor. Pushes assignments toward crispness — each embedding should clearly belong to one anchor, not be smeared across many.

Why BCE, not CE: The target is one-hot over A anchors (256-2048). Standard cross-entropy treats this as a classification problem and applies log-softmax, which is numerically appropriate. But the soft assignment is already a probability distribution (output of softmax), and we want to measure how close it is to a specific target distribution. BCE operates element-wise, measuring the divergence at every anchor position independently.

assign_nce_loss(assign1, assign2, temperature=0.1)

InfoNCE between assignments from two augmented views. Two views of the same image should produce the same assignment pattern. This is the internal domain's view-consistency signal.

Attraction

attraction_loss(cos_to_anchors)

1 - max_cos, averaged over the batch. Pulls each embedding toward its nearest anchor. Without this force, embeddings can drift to regions of S^(d-1) far from any anchor, where triangulation distances are large and uninformative.

Balance: Attraction pulls embeddings toward anchors; spread pushes anchors apart. The equilibrium produces a Voronoi tessellation where each embedding is close to its designated anchor but anchors are maximally separated. Weight 0.25 in the standard configuration.

Spread

spread_loss(anchors, target_cos=0.0)

ReLU(cos_similarity - target_cos), averaged over all anchor pairs. Penalizes any pair of anchors whose cosine similarity exceeds the target (default 0.0, meaning orthogonal). Keeps anchors spread across the sphere rather than collapsing into clusters.

Why ReLU: Only penalizes similarity above the target. Anchors that are already orthogonal or anti-aligned receive zero gradient. This is a soft constraint, not a hard one — it permits temporary clustering during training when the task demands it, while providing a restoring force toward spread.
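The two opposing forces are each a one-liner. This sketch follows the formulas above (1 - max_cos, and a ReLU hinge over off-diagonal anchor pairs); the real functions may handle extra arguments.

```python
import torch
import torch.nn.functional as F

def attraction_loss(cos_to_anchors):
    # Pull each embedding toward its nearest anchor.
    return (1 - cos_to_anchors.max(dim=1).values).mean()

def spread_loss(anchors, target_cos=0.0):
    a = F.normalize(anchors, dim=-1)
    cos = a @ a.T
    off = cos[~torch.eye(cos.shape[0], dtype=torch.bool)]  # exclude self-pairs
    return F.relu(off - target_cos).mean()                 # only above-target pairs

anchors = F.normalize(torch.randn(32, 16), dim=-1)
emb = F.normalize(torch.randn(64, 16), dim=-1)
print(attraction_loss(emb @ anchors.T).item(), spread_loss(anchors).item())
```

An orthonormal anchor set incurs zero spread loss, matching the soft-constraint behavior described above.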

kNN Accuracy

knn_accuracy(embeddings, targets, k=1)

Non-differentiable metric. Classifies each embedding by the label of its nearest neighbor (or majority vote of k neighbors) in embedding space. This validates the geometric structure independently of the task head — if kNN accuracy is high, the embedding space has learned a geometry that separates classes without requiring a learned classifier.

Why it matters: The gap between task head accuracy and kNN accuracy measures how much the model depends on the learned classifier versus the raw geometric structure. A small gap means the geometry is doing the work. A large gap means the task head is compensating for poor geometry.
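The probe can be sketched in a few lines (cosine similarity as the distance, self-matches excluded; the real knn_accuracy may use a different metric or batching):

```python
import torch

def knn_accuracy(embeddings, targets, k=1):
    cos = embeddings @ embeddings.T
    cos.fill_diagonal_(-float('inf'))            # never match yourself
    idx = cos.topk(k, dim=1).indices             # (N, k) neighbor indices
    pred = targets[idx].mode(dim=1).values       # majority vote over neighbors
    return (pred == targets).float().mean().item()

# Two well-separated clusters -> the kNN probe should be perfect.
emb = torch.nn.functional.normalize(torch.cat([
    torch.randn(20, 8) + 5, torch.randn(20, 8) - 5]), dim=-1)
labels = torch.tensor([0] * 20 + [1] * 20)
print(knn_accuracy(emb, labels))  # 1.0
```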

Three-Domain Compound Loss

three_domain_loss(output, targets, constellation, ...)

The complete cooperative loss function with all weight arguments exposed. Standalone alternative to InternalConstellationCore.compute_loss() for use outside the standard encoder pipeline.

Default weights:

EXTERNAL:   CE × 1.0 + NCE_emb × 0.5
GEOMETRIC:  NCE_pw × 1.0 + bridge × 1.0
INTERNAL:   assign × 0.5 + assign_nce × 0.25 + NCE_tri × 0.5
            + attract × 0.25 + CV × 0.01 + spread × 0.01

These weights were tuned to give each domain approximately equal total gradient magnitude. Without explicit balancing, the internal domain (6 terms) dominates the external domain (2 terms) by raw term count.


File 3: geolip_encoder.py β€” Trainable Model

The complete image classification pipeline: pixels in, logits out, with full geometric structure exposed at every stage.

ConvEncoder

ConvEncoder(output_dim)

8-layer convolutional encoder in 4 blocks: (conv3×3-BN-GELU) × 2 + MaxPool. Channels progress 64 → 128 → 256 → 384. Final AdaptiveAvgPool2d(1) reduces spatial dimensions to 1×1, followed by a linear projection to output_dim with LayerNorm.

The adaptive average pooling here operates on conv feature maps — statistical spatial features where pooling is standard dimensionality reduction. This is distinct from the Design Rule against global average pooling on geometric features (triangulation outputs, scattering coefficients), where pooling destroys component-wise structure that downstream geometric layers depend on. The conv→pool→project pipeline produces the input to S^(d-1); geometric constraints apply after normalization, not before.

L2 normalization is intentionally NOT applied inside the encoder. The raw feature norm carries information (encoder confidence), which MagnitudeFlow uses as input. The caller applies F.normalize() after extracting the raw magnitude.

InternalConstellationCore

InternalConstellationCore(num_classes, dim, n_anchors, n_comp, d_comp, ...)

The three-domain head that owns the constellation geometry. Contains:

  • Constellation: anchors on S^(d-1), shaped by internal losses and push
  • Patchwork: interprets magnitude-weighted triangulation distances
  • Bridge: linear projection from patchwork to anchor space (proves patchwork understands the constellation)
  • Task head: MLP reading [soft_assignment, patchwork, embedding] → logits

The forward_paired() method processes two augmented views simultaneously, returning a dict with all intermediate representations needed by every loss term. The compute_loss() method computes all three domains with configurable per-term weights.

Separation principle: The task head reads the constellation's assignment but does not shape it. CE gradient flows through the task head to the encoder (improving features) and through the patchwork (improving interpretation), but constellation anchors are shaped only by internal losses + push. This prevents the classifier from repositioning anchors as classification shortcuts.

GeoLIPImageEncoder

GeoLIPImageEncoder(num_classes, output_dim, n_anchors, n_comp, d_comp, ...)

The full pipeline. Combines ConvEncoder + MagnitudeFlow + InternalConstellationCore into a single module with clean interfaces:

  • forward(x) — Single-view eval: pixels → logits + all geometric outputs
  • forward_paired(v1, v2) — Two-view training: paired inputs → dict for loss computation
  • compute_loss(output, targets, **kwargs) — Three-domain loss with all weights exposed
  • make_optimizer(lr, weight_decay) — Builds AdamW with proper anchor parameter exclusion
  • get_anchor_param_ids() — Returns param IDs that must have weight_decay=0
  • summary() — Prints parameter breakdown by submodule

make_optimizer(): Constellation anchors and relay layer anchors live on spheres. Weight decay pulls parameters toward the origin, which would collapse anchors to zero norm and destroy their geometric meaning. make_optimizer() automatically identifies all anchor parameters and places them in a separate param group with weight_decay=0.
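The exclusion mechanism amounts to two AdamW param groups. This sketch identifies anchors by parameter name as an illustrative stand-in; the real make_optimizer() uses get_anchor_param_ids(), and the Toy module, lr, and decay values are assumptions.

```python
import torch
import torch.nn as nn

def make_optimizer(model, lr=3e-4, weight_decay=0.05):
    anchor_params, other_params = [], []
    for name, p in model.named_parameters():
        (anchor_params if "anchors" in name else other_params).append(p)
    return torch.optim.AdamW([
        {"params": other_params, "weight_decay": weight_decay},
        {"params": anchor_params, "weight_decay": 0.0},  # stay on S^(d-1)
    ], lr=lr)

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(8, 16))
        self.head = nn.Linear(16, 4)

opt = make_optimizer(Toy())
print([g["weight_decay"] for g in opt.param_groups])  # [0.05, 0.0]
```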


Empirical Constants

Two constants have emerged across extensive experimentation:

CV ≈ 0.20–0.23 (trained models) — The coefficient of variation of pentachoron volumes converges to this band in trained models across contrastive learning, language models, diffusion models, and VAEs (17+ models profiled via Procrustes analysis). On pure noise with no data structure, the natural basin is slightly wider at 0.24. The constant reflects a dimension-dependent equilibrium between the sphere's curvature and gradient descent's smoothness. At the effective dimensions used by trained models (16–47), the basin is strongest. The CV loss can be used to regularize toward this band, but at low weight the optimizer converges there naturally — and at high weight the optimizer CAN be forced away from it (at the cost of degraded task performance). The practical recommendation: use cv_weight ≤ 0.01, or omit the CV loss entirely and monitor it as a health metric.

0.29154 radians — The binding/separation phase boundary. Below this angular distance from its nearest anchor, an embedding's local geometry is dominated by the sphere's curvature (the anchor's Voronoi cell provides structural context). Above it, the embedding has been displaced far enough that task pressure dominates over local curvature. Observed independently in contrastive training (MinimalShunts), language modeling (T5 generation), ODE flow matching (alpha convergence), and CLIP projection geometry. Its complement (0.70846) appears as the separation threshold. Whether this is a true universal constant or an artifact of the specific architectures tested remains an open question — it is treated as theory-level in ongoing research but should be verified independently.


Design Rules

These are not preferences — they are empirically validated constraints. Violating them produces measurable degradation.

  1. Never use attention in geometric pipelines. Softmax-weighted averaging destroys angular structure. Each attention layer without residual halves effective dimensionality (measured: 62→28). With residual connections, the residual dominates ~11:1, meaning the attention itself contributes almost nothing — the residual is preserving geometry despite the attention, not because of it. The relay architecture provides direct geometric processing without this compromise. Use ConstellationRelay or RelayLayer instead.

  2. Never use global average pooling on geometric features. This applies to triangulation outputs, scattering features, and any representation where spatial or component-wise structure carries geometric meaning. Confirmed empirically on wavelet scattering features: a 243-d global average pool dropped accuracy from ~70% to ~29% compared to a 15552-d flatten, because pooling destroys the per-coefficient spatial structure that geometric downstream layers depend on. Use flatten or spatial statistics (mean+std per channel minimum) instead. Note: Standard spatial pooling in conv encoders (e.g., AdaptiveAvgPool2d(1) after conv feature maps) is not subject to this rule — conv feature maps are statistical representations, not geometric ones. The ConvEncoder uses adaptive avg pooling before projecting to S^(d-1), which is correct. The rule applies after projection to the sphere, where geometric structure must be preserved.

  3. Never apply weight decay to anchor parameters. Anchors live on S^(d-1). Weight decay adds λ×w to the gradient, pulling parameters toward the origin. For anchors, this shrinks their norm below 1.0 between L2-renormalization steps, creating a systematic bias in the gradient direction (toward the origin rather than along the sphere). Over many steps, this interferes with the optimizer's angular momentum. Use make_optimizer() or manually place anchor parameters in a weight_decay=0 param group. This applies to both constellation anchors and relay layer anchors.

  4. Never let classification gradient reach anchor positions. The bridge detaches assignment targets: assign_target = assign.detach(). Without detachment, CE gradient flows backward through the bridge → patchwork → triangulation → anchors, repositioning anchors to create classification shortcuts (e.g., collapsing multiple class anchors together to simplify the decision boundary). Push handles global anchor repositioning based on class centroids; gradient handles fine local adjustment via internal losses (spread, attraction, assignment BCE). Mixing CE gradient into anchor positions conflates these two distinct mechanisms.

  5. Always L2-normalize before triangulation. Triangulation computes emb @ anchors.T, producing cosine similarities. If embeddings are not unit-normalized, this product mixes magnitude and direction: a high-magnitude embedding appears "closer" to all anchors than a low-magnitude one, regardless of angular position. The resulting distances are no longer purely angular and the patchwork's compartment specialization breaks down. The MagnitudeFlow exists specifically to recover magnitude information through a separate geometric channel rather than contaminating the angular measurements.

  6. Relay gates must initialize cold. gate_init=-3.0 (sigmoid ≈ 0.047). At initialization, the relay's patchwork outputs are random — they carry no useful geometric signal. The relay uses additive gating: out = x + gate × patchwork + (1-gate) × patches. The global skip preserves x, but each layer adds gate × random_noise to the residual stream. With hot gates (sigmoid(0) = 0.5), an 8-layer stack accumulates ~8 × 0.5 = 4x the patch magnitude as random noise on top of the signal, causing early training instability. With cold gates (0.047), the accumulated noise after 8 layers is ~8 × 0.047 = 0.38x — well below the signal magnitude. The gates open selectively during training as the relay learns meaningful features, naturally titrating the contribution of each layer.
