
Constellation Forms Catalogue

GeoLIP Architecture Reference — March 2026

Sources:

  • geometric-memory-ft1 (GM1)
  • geometric-memory-ft2 (GM2)
  • geometric-memory-ft3 (GM3)
  • procrustes-vit-hypersphere-ft1 (PVH)
  • constellation-diffusion-bottleneck (CDB)
  • Session benchmarks (SB)

Universal Constants

| Constant | Value | Source |
| --- | --- | --- |
| Pentachoron CV attractor | 0.20–0.23 | Geometry of S^15 itself (CDB §3) |
| Binding/separation boundary | 0.29154 radians | 5+ architectures (CDB §11) |
| Effective geometric dimension | ~16 | All trained models (CDB §3.3) |
| CV precision invariance | fp64 through 1-bit | CDB §3.2 |
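
A minimal sketch of how the pentachoron CV statistic can be computed, assuming anchors are grouped into consecutive runs of five points (the grouping scheme and function name are illustrative, not from the source):

```python
import torch

def pentachoron_cv(points: torch.Tensor) -> torch.Tensor:
    """Coefficient of variation of pentachoron (4-simplex) volumes.

    points: (N, D) L2-normalized rows on S^(D-1). Consecutive runs of
    five points are treated as one pentachoron (a modeling assumption).
    """
    vols = []
    for i in range(0, points.shape[0] - 4, 5):
        v = points[i:i + 5]
        edges = v[1:] - v[0]                       # (4, D) edge vectors from vertex 0
        gram = edges @ edges.T                     # (4, 4) Gram matrix
        # 4-simplex volume = sqrt(det(Gram)) / 4!
        vols.append(torch.sqrt(torch.det(gram).clamp_min(0)) / 24.0)
    vols = torch.stack(vols)
    return vols.std() / vols.mean().clamp_min(1e-12)
```

The CV attractor claim is that this ratio settles near 0.20–0.23 for trained constellations on S^15.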

Universal Rules

| Rule | Source |
| --- | --- |
| SquaredReLU in all constellation paths, never GELU | SB activation tests |
| Patchwork: Linear(tri, tri×2) → SquaredReLU → LN → Linear(tri×2, dim) | SB proven |
| Gate init: -3.0 (sigmoid ≈ 0.047) | SB proven |
| SLERP: only acos in fp32 (16KB tensor); everything else stays in compute dtype | SB fp32 fix |
| Adam, NO weight decay — geometry IS regularization | GM3 §2.4, PVH §12 |
| InfoNCE is the alignment FORCE; Procrustes is the REGULARIZER | GM1 §4.1 |
| CV loss on the BOTTLENECK, not the output | GM1 §4.2 |
| CV loss weight: micro (0.001 or below) | GM3 §2.2 |
| Procrustes calibration is non-negotiable for anchor init | PVH §5.1 |
| Anchor dropout (30%) prevents collapse | PVH §5.2 |
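
The patchwork and gate-init rules translate directly into a small module. This is a sketch following the stated shape (Linear(tri, tri×2) → SquaredReLU → LN → Linear(tri×2, dim)) and the -3.0 gate init; class names are illustrative:

```python
import torch
import torch.nn as nn

class SquaredReLU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) ** 2

class Patchwork(nn.Module):
    """Patchwork MLP per the rule table, with a gated output whose gate
    initializes at -3.0 so sigmoid(gate) ~ 0.047 at the start of training."""
    def __init__(self, tri_dim: int, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tri_dim, tri_dim * 2),
            SquaredReLU(),
            nn.LayerNorm(tri_dim * 2),
            nn.Linear(tri_dim * 2, dim),
        )
        self.gate = nn.Parameter(torch.tensor(-3.0))

    def forward(self, tri: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.gate) * self.net(tri)
```

The near-zero initial gate lets the surrounding residual path dominate early, so the patchwork contribution grows only as it learns something useful.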

Form 1: GeoLIP Core (Classification)

Source: CDB §2

Purpose: Minimal image classification pipeline. Proves the constellation works as a primary representation layer.

Pipeline:

Input image
  → Conv encoder (builds channel depth: 3→64→128→256)
  → AdaptiveAvgPool → Linear(encoder_out, D) → L2 normalize to S^(d-1)
  → Triangulate against N anchors at 3 SLERP phases → tri_dim profile
  → Patchwork MLP reads triangulation
  → Classifier head → logits

Key properties:

  • Every embedding on the unit sphere BEFORE the constellation sees it
  • The conv encoder builds channel depth — constellation operates on channel dimension
  • One global vector per image, not a sequence
  • No attention anywhere

Proven results: 91.5% CIFAR-10, 1.6M params, CV=0.2045, 62/64 active anchors

Loss: CE + CV on embeddings

When to use: Single-input classification where the input can be reduced to one D-dimensional vector on S^(d-1).
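
The triangulation step can be sketched as follows. The exact profile features are not specified in this catalogue, so this sketch assumes the profile is the cosine of the embedding against SLERP midpoints toward each anchor at three phases; the fp32-acos rule from the table is applied:

```python
import torch
import torch.nn.functional as F

def triangulate(emb: torch.Tensor, anchors: torch.Tensor,
                phases=(0.25, 0.5, 0.75)) -> torch.Tensor:
    """Angular profile of emb against anchors at several SLERP phases.

    emb: (B, D) on S^(D-1); anchors: (A, D) on S^(D-1).
    Returns (B, A * len(phases)). Feature choice is an assumption.
    """
    cos = emb @ anchors.T                                          # (B, A)
    # Per the rule table: only acos runs in fp32, everything else in compute dtype
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7).float()).to(emb.dtype)
    sin_t = torch.sin(theta).clamp_min(1e-7)
    feats = []
    for t in phases:
        # SLERP point between the embedding and each anchor at phase t
        w0 = torch.sin((1 - t) * theta) / sin_t                    # (B, A)
        w1 = torch.sin(t * theta) / sin_t
        mid = w0.unsqueeze(-1) * emb.unsqueeze(1) + w1.unsqueeze(-1) * anchors
        mid = F.normalize(mid, dim=-1)
        feats.append((mid * emb.unsqueeze(1)).sum(-1))             # cos(emb, mid)
    return torch.cat(feats, dim=-1)
```

With A anchors and 3 phases this yields the tri_dim = 3A profile the patchwork reads.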


Form 2: Expert Soup (Multi-Expert Fusion)

Source: PVH §1, §4

Purpose: Fuse multiple frozen pretrained experts into a shared geometric representation on S^(d-1).

Pipeline:

Input image
  → N frozen expert encoders (CLIP, DINOv2, SigLIP, etc.) → N × 768-d
  → GPA alignment at 768-d (iterative Procrustes to mutual mean)
  → PCA to D_ANCHOR dims
  → Per-expert Procrustes-initialized projectors (768 → D_ANCHOR)
  → L2 normalize → shared constellation on S^(D_ANCHOR-1)
  → Triangulate: each expert through its own Procrustes rotation
  → Patchwork reads fused triangulation
  → Classifier

Key properties:

  • Experts are FROZEN — never modified
  • Procrustes initialization essential (without: 1/256 active anchors, collapsed)
  • Anchor dropout (30%) → 508/512 active anchors
  • Effective dimensionality matches task complexity (76.9 for COCO's 80 classes)
  • Pipeline is almost entirely linear: 7 linear ops + 2 nonlinearities (GELU in patchwork + classifier)
  • Weight decay explicitly avoided

Proven results: mAP=0.84 ceiling (data-limited), perfect hypersphere verified (1000/1000 positive volumes), 508/512 active anchors

Loss: InfoNCE(fused, consensus) + MSE + BCE + Procrustes_align + CV + anchor_spread

Optimizer: Adam lr=1e-3, NO weight decay

When to use: Combining multiple pretrained encoders into a shared geometric space for downstream tasks.
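
The Procrustes-initialized projectors rest on the standard orthogonal Procrustes solution, which has a closed form via SVD. A minimal sketch (the function name is illustrative):

```python
import torch

def procrustes_rotation(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes: the rotation R minimizing ||source @ R - target||_F.

    source, target: (N, D) paired embedding matrices. The optimum is
    U @ Vh, where U, Vh come from the SVD of the cross-covariance.
    """
    u, _, vh = torch.linalg.svd(source.T @ target)
    return u @ vh
```

In the soup, each expert gets its own such rotation toward the shared space, which is what prevents the anchor collapse noted above.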


Form 3: Geometric Memory / Anchor Bank (Context Extension)

Source: GM1 §2, GM2 §2

Purpose: Extend a frozen encoder's context window by accumulating segment-level geometric addresses in a memory bank.

Pipeline:

Long document (N tokens, N >> encoder context)
  → Split into overlapping segments (sized to encoder window)
  → For each segment:
      → Frozen encoder forward → hidden states at multiple layers
      → Multi-layer fusion (learned weighted sum)
      → Memory tokens cross-attend to fused hidden states
      → Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
      → Anchor stored in geometric memory bank
      → GRU gate updates rolling memory state
  → Final output: encoder-compatible embedding

Key properties:

  • Frozen encoder, trainable memory wrapper
  • Depth-profile anchors encode HOW the encoder processed (not just WHAT)
  • CV loss on the BANK ANCHORS specifically — the bottleneck between segments
  • Without CV on bank: projector shortcut collapse (m_acc plateaus at 0.670)
  • With CV on bank: m_acc reaches 0.945
  • Segment size must produce 5+ anchors for CV computation (pentachoron needs 5 points)
  • Convergence order: CV locks first → m_acc climbs → s_cos climbs last

Proven results:

  • GEOLIP-BERT-8192: m_acc=0.927, CV=0.200 (512→8192 context)
  • GEOLIP-CLIP-ctx576: m_acc=0.945, CV=0.162 (77→576 context)

Loss: InfoNCE(student, teacher) + Procrustes_SVD + |CV(bank_anchors) - 0.20|

When to use: Extending frozen encoder context windows while preserving embedding space compatibility.
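
The per-segment accumulation loop can be sketched as below, with the depth-profile compressor stubbed as a single Linear and all component names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank(nn.Module):
    """Sketch of the Form 3 accumulation loop over segment summaries."""
    def __init__(self, dim: int):
        super().__init__()
        self.compress = nn.Linear(dim, dim)   # stand-in for the depth-profile compressor
        self.gru = nn.GRUCell(dim, dim)       # gate updating the rolling memory state

    def forward(self, segment_feats):
        # segment_feats: list of (B, dim) per-segment summaries from the frozen encoder
        state = segment_feats[0].new_zeros(segment_feats[0].shape[0],
                                           self.gru.hidden_size)
        bank = []
        for h in segment_feats:
            anchor = F.normalize(self.compress(h), dim=-1)  # anchors live on S^(dim-1)
            bank.append(anchor)                             # stored in the geometric bank
            state = self.gru(anchor, state)                 # rolling memory update
        return torch.stack(bank, dim=1), state              # (B, n_seg, dim), (B, dim)
```

The CV loss from the table would then be applied to the stacked bank anchors, which is why a run needs at least five segments (one pentachoron).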


Form 4: Sequence Reconstructor (Per-Position Output)

Source: GM2 Β§2

Purpose: Produce full per-position output sequences from memory state for diffusion cross-attention.

Pipeline:

Memory state (from Form 3 bank accumulation)
  → Context = cat(memory_tokens, bank_anchors, content_tokens)
  → 77 learned query tokens + positional encoding
      → Cross-attend to context (2 layers)
      → Self-attend among 77 output positions (2 layers)
  → Output: (B, 77, 768) — in frozen encoder's native distribution

Key properties:

  • Must produce output in the distribution the UNet was trained on
  • Training target: frozen encoder's own output on same caption (truncated to 77 tokens)
  • Two teachers: ModernBERT teaches what to remember, CLIP teaches how to say it
  • Two-phase training works for CLIP-L but NOT universally
  • Rule: if you need per-position output, train the per-position consumer from the start
  • Memory format shaped by gradient loudness, not architectural capacity

Proven results:

  • CLIP-L s_cos=0.734, tulips appeared in SD 1.5 from elements past token 77
  • Meridian (bigG): s_cos=0.425 (limited by 1280→1024 dimensional mismatch)

Loss: MSE(normalize(pred), normalize(target)) + cosine_similarity + InfoNCE(pooled)

When to use: When downstream consumer needs per-position sequences (diffusion cross-attention, token-level tasks).
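
The query-token decoder described in the pipeline can be sketched with standard attention blocks. Head counts, init scales, and the residual wiring are assumptions; only the 77-query cross-then-self structure comes from the source:

```python
import torch
import torch.nn as nn

class SeqReconstructor(nn.Module):
    """Sketch of Form 4: learned queries cross-attend to the memory context,
    then self-attend among output positions."""
    def __init__(self, dim: int = 768, n_queries: int = 77, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.pos = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, T, dim) = cat(memory_tokens, bank_anchors, content_tokens)
        b = context.shape[0]
        q = (self.queries + self.pos).unsqueeze(0).expand(b, -1, -1)
        for attn in self.cross:
            q = q + attn(q, context, context)[0]   # cross-attend to context
        for attn in self.self_attn:
            q = q + attn(q, q, q)[0]               # self-attend among positions
        return q                                   # (B, n_queries, dim)
```

Training then pushes this output toward the frozen encoder's own (B, 77, 768) sequence so the UNet sees its native distribution.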


Form 5: Constellation Relay (Per-Token Geometric Layer)

Source: CDB §4, SB

Purpose: Replace attention as a per-token processing layer. O(S) complexity. Preserves geometry at depth.

Pipeline:

Input (B, S, D) or (B, D)
  → LayerNorm
  → Chunk D into P patches of patch_dim (e.g., 16 × 16d = 256d)
  → L2 normalize each patch to S^(d-1)
  → Triangulate against anchors at 3 SLERP phases → tri_dim profile
  → Patchwork MLP reads triangulation
  → Gated residual (gate init -3.0)
  → Output = residual + gate * patchwork_out

Key properties:

  • Per-token, no cross-token interaction
  • O(S) time and memory — no S² term
  • Preserves 99.4% cosine similarity to input at depth 16 (vs 7.4% for attention)
  • 3.4× fewer parameters than vanilla attention
  • Geometric preservation is sequence-length invariant (identical from S=64 through S=131072)
  • Throughput crossover vs attention at S≈32K; 8.4× faster at S=131K
  • SquaredReLU wins: better anchor diversity (7.1 vs 4.6), better equivariance, 0.9999 reconstruction

Proven results: cos_to_orig=0.994 at depth 16, 8.4× faster than attention at S=131K

When to use: Processing token sequences where geometric preservation matters more than cross-token mixing. Stackable. The per-token processing layer.
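
The chunk-and-normalize step is the part of this pipeline that differs from Form 1, and it is easy to state precisely. A sketch (function name is illustrative):

```python
import torch
import torch.nn.functional as F

def patch_normalize(x: torch.Tensor, patch_dim: int = 16) -> torch.Tensor:
    """Chunk the channel dimension into patches and L2-normalize each patch
    onto S^(patch_dim-1), e.g. 256d -> 16 patches of 16d.

    Note: constellation "patches" are dimensional subspace slices of the
    channel axis, not spatial tiles.
    """
    *lead, d = x.shape
    assert d % patch_dim == 0, "channel dim must divide evenly into patches"
    patches = x.reshape(*lead, d // patch_dim, patch_dim)
    return F.normalize(patches, dim=-1)
```

Each normalized patch is then triangulated independently, which is how the relay stays per-token and O(S).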


Form 6: Cantor Constellation Router (Cross-Token Routing)

Source: SB (cantor_constellation_relay.py)

Purpose: O(S) cross-token routing through the constellation's own anchor hierarchy. Replaces attention's cross-token role.

Pipeline:

Input tokens (B, S, D) + triangulation profiles (B, S, tri_dim) from relay
  → Compute soft routing weights from phase-0 triangulation distances
  → For each level l in binary anchor tree (16→8→4→2→1):
      → Merge anchor weights into group weights at level l
      → Weighted scatter: tokens → group summaries (bmm)
      → Transform: per-level MLP(dim→dim×2→dim) + LN
      → Weighted gather: group summaries → token updates (bmm)
      → Gated residual at each level
  → Output: tokens with cross-token information

Key properties:

  • O(S × n_levels × D) where n_levels = log2(A) + 1 = 5 for A=16
  • No S² term anywhere — not in compute, not in memory
  • Triangulation from the per-token relay IS the routing key (zero redundant computation)
  • Binary tree over anchors defines hierarchy (16→8→4→2→1 groups)
  • At each level: scatter → transform → gather
  • Cantor routing holds at distance BETTER than attention (2× stronger at S=4096)
  • The router is a geometric REGULARIZER: cos_orig=0.9818 at 8 layers vs relay alone 0.6533
  • Geometry IMPROVES with more tokens (0.982→0.986 as S increases)

Proven results: 97.0% cross-token task acc, 0.986 cos preservation at 131K tokens, 5.2× faster than attention at 131K

When to use: Combined with Form 5 relay as a complete O(S) transformer layer replacement (ConstellationCantorRelay).
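
One level of the scatter/gather routing, and the binary-tree merge of anchor weights, can be sketched as below. The per-level MLP and gated residual are omitted, and the normalization choice is an assumption:

```python
import torch

def merge_weights(weights: torch.Tensor) -> torch.Tensor:
    """Merge anchor weights into parent groups of the binary tree (G -> G/2)."""
    b, s, g = weights.shape
    return weights.reshape(b, s, g // 2, 2).sum(-1)

def route_level(tokens: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """One scatter -> gather step of the router.

    tokens: (B, S, D); weights: (B, S, G) soft group assignments derived
    from the relay's phase-0 triangulation.
    """
    # Weighted scatter: tokens -> group summaries, normalized over the sequence
    norm = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    groups = torch.bmm(norm.transpose(1, 2), tokens)   # (B, G, D)
    # Weighted gather: group summaries -> per-token updates
    return torch.bmm(weights, groups)                  # (B, S, D)
```

Both bmm calls are linear in S, which is where the O(S × n_levels × D) cost comes from: the tree contributes the n_levels factor, never an S² term.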


Form 7: Diffusion Bottleneck / Geometric Lookup Table

Source: CDB §7–9

Purpose: The constellation as the sole information bottleneck of a diffusion model. NOT an autoencoder.

Pipeline:

Encoder features (256×8×8 = 16384-d)
  → Linear(16384, 256) → L2 normalize to S^15
  → Reshape (B, 16, 16) → per-patch S^15 normalization
  → Triangulate: 16 patches × 16 anchors × 3 phases = 768 dims
  → Concat(768 tri dims, conditioning dims)
  → Patchwork MLP → Linear(hidden, 16384) → reshape → decoder

Key properties:

  • Compression ratio: 16384 → 768 = 21.3×
  • cos_sim ≈ 0 to input — the bottleneck does NOT reconstruct
  • It's a geometric LOOKUP TABLE: triangulation profile is an address, patchwork generates from that address
  • Works for flow matching because training signal is velocity prediction, not reconstruction
  • Skip bypass experiment: given 268M linear bypass, model routed 88% through 768 constellation dims
  • Constellation-only cos_sim=0.945 to full model; skip-only cos_sim=0.598
  • The constellation provides a representational ADVANTAGE over unconstrained capacity

Proven results: Loss 0.1749 (beat 268M skip at 0.1757), 46% anchor convergence to 0.29154 in GLFM

Loss: Flow matching velocity loss (MSE on predicted vs target velocity)

When to use: Diffusion model bottleneck where geometric addressing replaces reconstruction.


Form 8: Geometric Lookup Flow Matching (GLFM)

Source: CDB §10

Purpose: Formalized three-stage flow matching variant where velocity prediction is driven by geometric address lookup.

Pipeline:

Stage 1 — Geometric Addressing:
  Encoder output → project to S^15 at two scales:
    Coarse: global avg pool → 256d → L2 norm → triangulate (768d)
    Fine: per-spatial → 256d → L2 norm → triangulate → aggregate (768d)
  Total address: 1536 dims of angular measurements

Stage 2 — Address Conditioning:
  Geometric address + sinusoidal timestep + class embed + noise-level bins
  → Fused projection to generator input dim

Stage 3 — Velocity Generation:
  Deep residual MLP generates velocity features from conditioned address
  4 residual blocks width 1024 → 16384-d spatial features → decoder

Key properties:

  • Explicit separation of addressing, conditioning, and generation
  • Multi-scale collapse observed: coarse↔fine cos=0.933 (needs pre-differentiated features like DINOv2)
  • 46% of anchors converged within ±0.05 of 0.29154 binding constant
  • 59% of anchors crossed binding boundary into task-specific territory

Proven results: Loss 0.1754, accelerated drift convergence vs pure bottleneck

When to use: Flow matching diffusion where you want explicit geometric addressing.


Form 9: From-Scratch Encoder (Pixel → Consensus)

Source: PVH §4.2

Purpose: Train a ViT from random initialization to reproduce the expert soup consensus embedding from raw pixels.

Pipeline:

Raw pixels
  → From-scratch ViT (no pretrained weights)
  → Project to D_ANCHOR dims → L2 normalize
  → Train against frozen soup consensus as differentiable teacher

Key properties:

  • The soup is the teacher — it provides the target embedding for each image
  • Gradient bottleneck: all gradient flows through D_ANCHOR-dimensional output
  • With 77M params and 128-d output: gradient density = 1.6×10⁻⁶ per param
  • Expansion warm-start works: 384-d→1024-d by padding, recovers in 5 epochs

Proven results: 1024-d ViT reached cos=0.663, mAP=0.500 (limited by gradient bottleneck and 118K COCO)

Loss: Same as soup training + geometric autograd

When to use: When you need a single encoder that reproduces multi-expert consensus from raw input.


Form 10: Dual-Teacher Consensus Distillation

Source: GM3 §4

Purpose: Two independently-trained models → GPA consensus → distill into a student that exceeds both.

Pipeline:

Teacher A (any config) + Teacher B (any config)
  → Extract embeddings on shared data
  → GPA-align iteratively until δ < 1e-8
  → Consensus = L2_normalize(mean_shape)
  → Student initializes anchors from k-means on consensus
  → Train with: CE + InfoNCE(emb, consensus) + MSE(emb, consensus) + micro CV
  → Geometric autograd: tang=0.01, sep=1.0

Key properties:

  • Student exceeds BOTH teachers (0.761 vs 0.699/0.649)
  • Student still ACCELERATING at epoch 30 (resonant dynamics)
  • Consensus is the geometric truth — what both agree on after removing rotational frames
  • Robust to catastrophic models: a 25.5% accuracy parent still contributed useful signal
  • Diverse parent selection beats top-N selection

Proven results: Student 0.761 from parents averaging 0.664; still accelerating at E30

When to use: When you have 2+ trained models and want a superior student.
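
The GPA-to-consensus step can be sketched with repeated orthogonal Procrustes toward the mutual mean, iterating to the stated δ < 1e-8 tolerance (the function name and loop cap are assumptions):

```python
import torch
import torch.nn.functional as F

def gpa_consensus(embs, tol: float = 1e-8, max_iters: int = 100) -> torch.Tensor:
    """Generalized Procrustes alignment over a list of (N, D) paired
    embedding matrices (one per teacher). Each set is rotated toward the
    mutual mean until the mean stops moving; the L2-normalized mean shape
    is the consensus."""
    aligned = [e.clone() for e in embs]
    prev = None
    for _ in range(max_iters):
        mean = F.normalize(torch.stack(aligned).mean(0), dim=-1)
        if prev is not None and (mean - prev).abs().max() < tol:
            break
        prev = mean
        for i, e in enumerate(embs):
            # Orthogonal Procrustes of each teacher toward the current mean
            u, _, vh = torch.linalg.svd(e.T @ mean)
            aligned[i] = e @ (u @ vh)
    return mean
```

The resulting consensus rows seed the student's anchors (via k-means) and serve as the InfoNCE/MSE target during distillation.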


Form 11: Multi-Generational Geometric Evolution

Source: GM3 §5

Purpose: Iterated consensus distillation across generations with data diversity.

Pipeline:

Gen 0: N founders trained independently → GPA → consensus anchors
Gen 1: M offspring from Gen 0 consensus + new founder (immigration)
Gen 2+: Previous gen offspring + founder → GPA → consensus → next gen
Each generation trains on differently-perturbed data

Key properties:

  • Monotonically improving across generations
  • Each generation inherits consensus-derived anchor coordinates
  • Fresh founders each generation prevent convergence collapse (gene flow)
  • Robust: catastrophic models don't poison the lineage
  • Diverse data across generations captures INVARIANT structure
  • CV converges toward 0.2 naturally across generations

Proven results: Gen 0 mean=0.664 → Gen 4 best=0.775; FUSE_distilled=0.830

When to use: When you want to compound geometric knowledge across training runs.


Form 12: Geometric Autograd (Optimizer)

Source: GM3 Β§2

Purpose: Gradient filtering that replaces weight decay with manifold-aware optimization.

Components:

Embedding backward:
  → Decompose gradient into tangential + radial relative to S^(d-1)
  → Pass tangential fully, attenuate radial by (1 - tang_strength)
  → If gradient moves toward nearest anchor: attenuate by sep_strength

Anchor backward:
  → Project gradient tangential to hypersphere at anchor position
  → Scale by drift_strength

Forward losses (all differentiable):
  → CV: |CV(pentachoron volumes) - 0.2| × 0.001
  → Spread: anchor cos² off-diagonal mean × 1e-3
  → Ortho: gram off-diagonal → 0 × 1e-3
  → Entropy: -Σ p·log(p) × 1e-4
  → Cluster var: -var(per-anchor mean cosine) × 1e-4

Key properties:

  • Adam + geometric autograd > AdamW consistently
  • Weight decay destroys the geometric harmonic the autograd creates
  • tang=0.01, sep=1.0 proven optimal
  • CV loss MUST be forward loss, never backward injection
  • Enables resonant dynamics: constructive interference compounds across epochs

When to use: Training any constellation-based model. The geometry IS the regularization.
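
The embedding-backward filter can be sketched as below. The tangential/radial split follows the component list; the sign convention for "moves toward nearest anchor" is an assumption (read here on the descent direction), as is the exact attenuation formula:

```python
import torch
import torch.nn.functional as F

def filter_embedding_grad(emb: torch.Tensor, grad: torch.Tensor,
                          nearest_anchor: torch.Tensor,
                          tang_strength: float = 0.01,
                          sep_strength: float = 1.0) -> torch.Tensor:
    """Sketch of the geometric-autograd embedding-backward filter.

    emb: (B, D) rows on S^(D-1); grad and nearest_anchor: same shape.
    """
    # Decompose the gradient relative to the sphere at emb
    radial = (grad * emb).sum(-1, keepdim=True) * emb   # along the sphere normal
    tangential = grad - radial
    # Pass tangential fully, attenuate radial by (1 - tang_strength)
    filtered = tangential + radial * (1.0 - tang_strength)
    # Attenuate descent motion toward the nearest anchor by sep_strength
    descent = -filtered
    toward = F.normalize(nearest_anchor - emb, dim=-1)
    comp = (descent * toward).sum(-1, keepdim=True).clamp_min(0.0)
    descent = descent - sep_strength * comp * toward    # sep=1.0 removes it fully
    return -descent
```

With the proven settings (tang=0.01, sep=1.0), nearly all radial signal passes but any descent component aimed at the nearest anchor is removed, which is the separation behavior the bullet list describes.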


Composition Map

| Task | Primary Form | Supporting Forms |
| --- | --- | --- |
| Image classification (single image) | Form 1 (Core) | Form 12 (Autograd) |
| Multi-expert fusion | Form 2 (Soup) | Form 12 |
| Context extension | Form 3 (Memory Bank) | Form 4 (Seq Reconstructor) |
| Diffusion cross-attention | Form 3 + Form 4 | |
| Sequence processing (long) | Form 5 (Relay) + Form 6 (Router) | |
| Diffusion bottleneck | Form 7 (Lookup Table) | Form 8 (GLFM) |
| Train encoder from scratch | Form 9 (From-Scratch) | Form 2 (Soup as teacher) |
| Model distillation | Form 10 (Consensus) | Form 12 |
| Compound improvement | Form 11 (Evolution) | Form 10 + Form 12 |

What the Constellation IS

The constellation is a set of learned anchor points on S^(d-1). It is simultaneously:

  1. A measurement instrument — triangulation computes angular distances to reference points
  2. A coordinate system — the triangulation profile IS the geometric address
  3. A lookup table — the patchwork generates from the address rather than reconstructing the input
  4. A routing topology — anchor proximity determines cross-token interaction (Cantor)
  5. A geometric regularizer — anchor structure prevents collapse and preserves manifold health

The constellation is NOT:

  • An autoencoder (cos_sim ≈ 0 to input in bottleneck form)
  • A positional encoding (it measures WHERE on S^(d-1), not WHERE in sequence)
  • Class prototypes (anchors ≠ classes; anchor count independent of class count)
  • Patches of an image (constellation "patches" = dimensional subspace slices, not spatial tiles)