
Constellation Forms Catalogue

GeoLIP Architecture Reference — March 2026

Sources:

  • geometric-memory-ft1 (GM1)
  • geometric-memory-ft2 (GM2)
  • geometric-memory-ft3 (GM3)
  • procrustes-vit-hypersphere-ft1 (PVH)
  • constellation-diffusion-bottleneck (CDB)
  • Session benchmarks (SB)

Universal Constants

| Constant | Value | Source |
| --- | --- | --- |
| Pentachoron CV attractor | 0.20–0.23 | Geometry of S^15 itself (CDB §3) |
| Binding/separation boundary | 0.29154 radians | 5+ architectures (CDB §11) |
| Effective geometric dimension | ~16 | All trained models (CDB §3.3) |
| CV precision invariance | fp64 through 1-bit | CDB §3.2 |
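
A minimal sketch of how the pentachoron CV statistic can be computed, assuming anchors are grouped into consecutive runs of five points (the grouping scheme and function name are illustrative, not from the source):

```python
import torch

def pentachoron_cv(points: torch.Tensor) -> torch.Tensor:
    """Coefficient of variation of pentachoron (4-simplex) volumes.

    points: (N, D) L2-normalized rows on S^(D-1). Consecutive runs of
    five points are treated as one pentachoron (a modeling assumption).
    """
    vols = []
    for i in range(0, points.shape[0] - 4, 5):
        v = points[i:i + 5]
        edges = v[1:] - v[0]                       # (4, D) edge vectors from vertex 0
        gram = edges @ edges.T                     # (4, 4) Gram matrix
        # 4-simplex volume = sqrt(det(Gram)) / 4!
        vols.append(torch.sqrt(torch.det(gram).clamp_min(0)) / 24.0)
    vols = torch.stack(vols)
    return vols.std() / vols.mean().clamp_min(1e-12)
```

The CV attractor claim is that this ratio settles near 0.20–0.23 for trained constellations on S^15.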

Universal Rules

| Rule | Source |
| --- | --- |
| SquaredReLU in all constellation paths, never GELU | SB activation tests |
| Patchwork: Linear(tri, tri×2) → SquaredReLU → LN → Linear(tri×2, dim) | SB proven |
| Gate init: -3.0 (sigmoid ≈ 0.047) | SB proven |
| SLERP: only acos in fp32 (16KB tensor); everything else stays in compute dtype | SB fp32 fix |
| Adam, NO weight decay — geometry IS regularization | GM3 §2.4, PVH §12 |
| InfoNCE is the alignment FORCE; Procrustes is the REGULARIZER | GM1 §4.1 |
| CV loss on the BOTTLENECK, not the output | GM1 §4.2 |
| CV loss weight: micro (0.001 or below) | GM3 §2.2 |
| Procrustes calibration is non-negotiable for anchor init | PVH §5.1 |
| Anchor dropout (30%) prevents collapse | PVH §5.2 |
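
The patchwork and gate-init rules translate directly into a small module. This is a sketch following the stated shape (Linear(tri, tri×2) → SquaredReLU → LN → Linear(tri×2, dim)) and the -3.0 gate init; class names are illustrative:

```python
import torch
import torch.nn as nn

class SquaredReLU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) ** 2

class Patchwork(nn.Module):
    """Patchwork MLP per the rule table, with a gated output whose gate
    initializes at -3.0 so sigmoid(gate) ~ 0.047 at the start of training."""
    def __init__(self, tri_dim: int, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tri_dim, tri_dim * 2),
            SquaredReLU(),
            nn.LayerNorm(tri_dim * 2),
            nn.Linear(tri_dim * 2, dim),
        )
        self.gate = nn.Parameter(torch.tensor(-3.0))

    def forward(self, tri: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.gate) * self.net(tri)
```

The near-zero initial gate lets the surrounding residual path dominate early, so the patchwork contribution grows only as it learns something useful.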

Form 1: GeoLIP Core (Classification)

Source: CDB §2

Purpose: Minimal image classification pipeline. Proves the constellation works as a primary representation layer.

Pipeline:

Input image
  → Conv encoder (builds channel depth: 3→64→128→256)
  → AdaptiveAvgPool → Linear(encoder_out, D) → L2 normalize to S^(d-1)
  → Triangulate against N anchors at 3 SLERP phases → tri_dim profile
  → Patchwork MLP reads triangulation
  → Classifier head → logits

Key properties:

  • Every embedding on the unit sphere BEFORE the constellation sees it
  • The conv encoder builds channel depth — constellation operates on channel dimension
  • One global vector per image, not a sequence
  • No attention anywhere

Proven results: 91.5% CIFAR-10, 1.6M params, CV=0.2045, 62/64 active anchors

Loss: CE + CV on embeddings

When to use: Single-input classification where the input can be reduced to one D-dimensional vector on S^(d-1).
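
The triangulation step can be sketched as follows. The exact profile features are not specified in this catalogue, so this sketch assumes the profile is the cosine of the embedding against SLERP midpoints toward each anchor at three phases; the fp32-acos rule from the table is applied:

```python
import torch
import torch.nn.functional as F

def triangulate(emb: torch.Tensor, anchors: torch.Tensor,
                phases=(0.25, 0.5, 0.75)) -> torch.Tensor:
    """Angular profile of emb against anchors at several SLERP phases.

    emb: (B, D) on S^(D-1); anchors: (A, D) on S^(D-1).
    Returns (B, A * len(phases)). Feature choice is an assumption.
    """
    cos = emb @ anchors.T                                          # (B, A)
    # Per the rule table: only acos runs in fp32, everything else in compute dtype
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7).float()).to(emb.dtype)
    sin_t = torch.sin(theta).clamp_min(1e-7)
    feats = []
    for t in phases:
        # SLERP point between the embedding and each anchor at phase t
        w0 = torch.sin((1 - t) * theta) / sin_t                    # (B, A)
        w1 = torch.sin(t * theta) / sin_t
        mid = w0.unsqueeze(-1) * emb.unsqueeze(1) + w1.unsqueeze(-1) * anchors
        mid = F.normalize(mid, dim=-1)
        feats.append((mid * emb.unsqueeze(1)).sum(-1))             # cos(emb, mid)
    return torch.cat(feats, dim=-1)
```

With A anchors and 3 phases this yields the tri_dim = 3A profile the patchwork reads.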


Form 2: Expert Soup (Multi-Expert Fusion)

Source: PVH §1, §4

Purpose: Fuse multiple frozen pretrained experts into a shared geometric representation on S^(d-1).

Pipeline:

Input image
  → N frozen expert encoders (CLIP, DINOv2, SigLIP, etc.) → N × 768-d
  → GPA alignment at 768-d (iterative Procrustes to mutual mean)
  → PCA to D_ANCHOR dims
  → Per-expert Procrustes-initialized projectors (768 → D_ANCHOR)
  → L2 normalize → shared constellation on S^(D_ANCHOR-1)
  → Triangulate: each expert through its own Procrustes rotation
  → Patchwork reads fused triangulation
  → Classifier

Key properties:

  • Experts are FROZEN — never modified
  • Procrustes initialization essential (without: 1/256 active anchors, collapsed)
  • Anchor dropout (30%) → 508/512 active anchors
  • Effective dimensionality matches task complexity (76.9 for COCO's 80 classes)
  • Pipeline is almost entirely linear: 7 linear ops + 2 nonlinearities (GELU in patchwork + classifier)
  • Weight decay explicitly avoided

Proven results: mAP=0.84 ceiling (data-limited), perfect hypersphere verified (1000/1000 positive volumes), 508/512 active anchors

Loss: InfoNCE(fused, consensus) + MSE + BCE + Procrustes_align + CV + anchor_spread

Optimizer: Adam lr=1e-3, NO weight decay

When to use: Combining multiple pretrained encoders into a shared geometric space for downstream tasks.
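
The Procrustes-initialized projectors rest on the standard orthogonal Procrustes solution, which has a closed form via SVD. A minimal sketch (the function name is illustrative):

```python
import torch

def procrustes_rotation(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes: the rotation R minimizing ||source @ R - target||_F.

    source, target: (N, D) paired embedding matrices. The optimum is
    U @ Vh, where U, Vh come from the SVD of the cross-covariance.
    """
    u, _, vh = torch.linalg.svd(source.T @ target)
    return u @ vh
```

In the soup, each expert gets its own such rotation toward the shared space, which is what prevents the anchor collapse noted above.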


Form 3: Geometric Memory / Anchor Bank (Context Extension)

Source: GM1 §2, GM2 §2

Purpose: Extend a frozen encoder's context window by accumulating segment-level geometric addresses in a memory bank.

Pipeline:

Long document (N tokens, N >> encoder context)
  → Split into overlapping segments (sized to encoder window)
  → For each segment:
      → Frozen encoder forward → hidden states at multiple layers
      → Multi-layer fusion (learned weighted sum)
      → Memory tokens cross-attend to fused hidden states
      → Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
      → Anchor stored in geometric memory bank
      → GRU gate updates rolling memory state
  → Final output: encoder-compatible embedding

Key properties:

  • Frozen encoder, trainable memory wrapper
  • Depth-profile anchors encode HOW the encoder processed (not just WHAT)
  • CV loss on the BANK ANCHORS specifically — the bottleneck between segments
  • Without CV on bank: projector shortcut collapse (m_acc plateaus at 0.670)
  • With CV on bank: m_acc reaches 0.945
  • Segment size must produce 5+ anchors for CV computation (pentachoron needs 5 points)
  • Convergence order: CV locks first → m_acc climbs → s_cos climbs last

Proven results:

  • GEOLIP-BERT-8192: m_acc=0.927, CV=0.200 (512→8192 context)
  • GEOLIP-CLIP-ctx576: m_acc=0.945, CV=0.162 (77→576 context)

Loss: InfoNCE(student, teacher) + Procrustes_SVD + |CV(bank_anchors) - 0.20|

When to use: Extending frozen encoder context windows while preserving embedding space compatibility.
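
The per-segment accumulation loop can be sketched as below, with the depth-profile compressor stubbed as a single Linear and all component names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank(nn.Module):
    """Sketch of the Form 3 accumulation loop over segment summaries."""
    def __init__(self, dim: int):
        super().__init__()
        self.compress = nn.Linear(dim, dim)   # stand-in for the depth-profile compressor
        self.gru = nn.GRUCell(dim, dim)       # gate updating the rolling memory state

    def forward(self, segment_feats):
        # segment_feats: list of (B, dim) per-segment summaries from the frozen encoder
        state = segment_feats[0].new_zeros(segment_feats[0].shape[0],
                                           self.gru.hidden_size)
        bank = []
        for h in segment_feats:
            anchor = F.normalize(self.compress(h), dim=-1)  # anchors live on S^(dim-1)
            bank.append(anchor)                             # stored in the geometric bank
            state = self.gru(anchor, state)                 # rolling memory update
        return torch.stack(bank, dim=1), state              # (B, n_seg, dim), (B, dim)
```

The CV loss from the table would then be applied to the stacked bank anchors, which is why a run needs at least five segments (one pentachoron).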


Form 4: Sequence Reconstructor (Per-Position Output)

Source: GM2 Β§2

Purpose: Produce full per-position output sequences from memory state for diffusion cross-attention.

Pipeline:

Memory state (from Form 3 bank accumulation)
  → Context = cat(memory_tokens, bank_anchors, content_tokens)
  → 77 learned query tokens + positional encoding
      → Cross-attend to context (2 layers)
      → Self-attend among 77 output positions (2 layers)
  → Output: (B, 77, 768) — in frozen encoder's native distribution

Key properties:

  • Must produce output in the distribution the UNet was trained on
  • Training target: frozen encoder's own output on same caption (truncated to 77 tokens)
  • Two teachers: ModernBERT teaches what to remember, CLIP teaches how to say it
  • Two-phase training works for CLIP-L but NOT universally
  • Rule: if you need per-position output, train the per-position consumer from the start
  • Memory format shaped by gradient loudness, not architectural capacity

Proven results:

  • CLIP-L s_cos=0.734, tulips appeared in SD 1.5 from elements past token 77
  • Meridian (bigG): s_cos=0.425 (limited by 1280→1024 dimensional mismatch)

Loss: MSE(normalize(pred), normalize(target)) + cosine_similarity + InfoNCE(pooled)

When to use: When downstream consumer needs per-position sequences (diffusion cross-attention, token-level tasks).
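
The query-token decoder described in the pipeline can be sketched with standard attention blocks. Head counts, init scales, and the residual wiring are assumptions; only the 77-query cross-then-self structure comes from the source:

```python
import torch
import torch.nn as nn

class SeqReconstructor(nn.Module):
    """Sketch of Form 4: learned queries cross-attend to the memory context,
    then self-attend among output positions."""
    def __init__(self, dim: int = 768, n_queries: int = 77, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.pos = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, T, dim) = cat(memory_tokens, bank_anchors, content_tokens)
        b = context.shape[0]
        q = (self.queries + self.pos).unsqueeze(0).expand(b, -1, -1)
        for attn in self.cross:
            q = q + attn(q, context, context)[0]   # cross-attend to context
        for attn in self.self_attn:
            q = q + attn(q, q, q)[0]               # self-attend among positions
        return q                                   # (B, n_queries, dim)
```

Training then pushes this output toward the frozen encoder's own (B, 77, 768) sequence so the UNet sees its native distribution.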


Form 5: Constellation Relay (Per-Token Geometric Layer)

Source: CDB §4, SB

Purpose: Replace attention as a per-token processing layer. O(S) complexity. Preserves geometry at depth.

Pipeline:

Input (B, S, D) or (B, D)
  → LayerNorm
  → Chunk D into P patches of patch_dim (e.g., 16 × 16d = 256d)
  → L2 normalize each patch to S^(d-1)
  → Triangulate against anchors at 3 SLERP phases → tri_dim profile
  → Patchwork MLP reads triangulation
  → Gated residual (gate init -3.0)
  → Output = residual + gate * patchwork_out

Key properties:

  • Per-token, no cross-token interaction
  • O(S) time and memory — no S² term
  • Preserves 99.4% cosine similarity to input at depth 16 (vs 7.4% for attention)
  • 3.4× fewer parameters than vanilla attention
  • Geometric preservation is sequence-length invariant (identical from S=64 through S=131072)
  • Throughput crossover vs attention at S≈32K; 8.4× faster at S=131K
  • SquaredReLU wins: better anchor diversity (7.1 vs 4.6), better equivariance, 0.9999 reconstruction

Proven results: cos_to_orig=0.994 at depth 16, 8.4× faster than attention at S=131K

When to use: Processing token sequences where geometric preservation matters more than cross-token mixing. Stackable. The per-token processing layer.
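
The chunk-and-normalize step is the part of this pipeline that differs from Form 1, and it is easy to state precisely. A sketch (function name is illustrative):

```python
import torch
import torch.nn.functional as F

def patch_normalize(x: torch.Tensor, patch_dim: int = 16) -> torch.Tensor:
    """Chunk the channel dimension into patches and L2-normalize each patch
    onto S^(patch_dim-1), e.g. 256d -> 16 patches of 16d.

    Note: constellation "patches" are dimensional subspace slices of the
    channel axis, not spatial tiles.
    """
    *lead, d = x.shape
    assert d % patch_dim == 0, "channel dim must divide evenly into patches"
    patches = x.reshape(*lead, d // patch_dim, patch_dim)
    return F.normalize(patches, dim=-1)
```

Each normalized patch is then triangulated independently, which is how the relay stays per-token and O(S).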


Form 6: Cantor Constellation Router (Cross-Token Routing)

Source: SB (cantor_constellation_relay.py)

Purpose: O(S) cross-token routing through the constellation's own anchor hierarchy. Replaces attention's cross-token role.

Pipeline:

Input tokens (B, S, D) + triangulation profiles (B, S, tri_dim) from relay
  → Compute soft routing weights from phase-0 triangulation distances
  → For each level l in binary anchor tree (16→8→4→2→1):
      → Merge anchor weights into group weights at level l
      → Weighted scatter: tokens → group summaries (bmm)
      → Transform: per-level MLP(dim→dim×2→dim) + LN
      → Weighted gather: group summaries → token updates (bmm)
      → Gated residual at each level
  → Output: tokens with cross-token information

Key properties:

  • O(S × n_levels × D) where n_levels = log2(A) + 1 = 5 for A=16
  • No S² term anywhere — not in compute, not in memory
  • Triangulation from the per-token relay IS the routing key (zero redundant computation)
  • Binary tree over anchors defines hierarchy (16→8→4→2→1 groups)
  • At each level: scatter → transform → gather
  • Cantor routing holds at distance BETTER than attention (2× stronger at S=4096)
  • The router is a geometric REGULARIZER: cos_orig=0.9818 at 8 layers vs relay alone 0.6533
  • Geometry IMPROVES with more tokens (0.982→0.986 as S increases)

Proven results: 97.0% cross-token task acc, 0.986 cos preservation at 131K tokens, 5.2× faster than attention at 131K

When to use: Combined with Form 5 relay as a complete O(S) transformer layer replacement (ConstellationCantorRelay).
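
One level of the scatter/gather routing, and the binary-tree merge of anchor weights, can be sketched as below. The per-level MLP and gated residual are omitted, and the normalization choice is an assumption:

```python
import torch

def merge_weights(weights: torch.Tensor) -> torch.Tensor:
    """Merge anchor weights into parent groups of the binary tree (G -> G/2)."""
    b, s, g = weights.shape
    return weights.reshape(b, s, g // 2, 2).sum(-1)

def route_level(tokens: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """One scatter -> gather step of the router.

    tokens: (B, S, D); weights: (B, S, G) soft group assignments derived
    from the relay's phase-0 triangulation.
    """
    # Weighted scatter: tokens -> group summaries, normalized over the sequence
    norm = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    groups = torch.bmm(norm.transpose(1, 2), tokens)   # (B, G, D)
    # Weighted gather: group summaries -> per-token updates
    return torch.bmm(weights, groups)                  # (B, S, D)
```

Both bmm calls are linear in S, which is where the O(S × n_levels × D) cost comes from: the tree contributes the n_levels factor, never an S² term.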


Form 7: Diffusion Bottleneck / Geometric Lookup Table

Source: CDB §7–9

Purpose: The constellation as the sole information bottleneck of a diffusion model. NOT an autoencoder.

Pipeline:

Encoder features (256×8×8 = 16384-d)
  → Linear(16384, 256) → L2 normalize to S^15
  → Reshape (B, 16, 16) → per-patch S^15 normalization
  → Triangulate: 16 patches × 16 anchors × 3 phases = 768 dims
  → Concat(768 tri dims, conditioning dims)
  → Patchwork MLP → Linear(hidden, 16384) → reshape → decoder

Key properties:

  • Compression ratio: 16384 → 768 = 21.3×
  • cos_sim ≈ 0 to input — the bottleneck does NOT reconstruct
  • It's a geometric LOOKUP TABLE: triangulation profile is an address, patchwork generates from that address
  • Works for flow matching because training signal is velocity prediction, not reconstruction
  • Skip bypass experiment: given 268M linear bypass, model routed 88% through 768 constellation dims
  • Constellation-only cos_sim=0.945 to full model; skip-only cos_sim=0.598
  • The constellation provides a representational ADVANTAGE over unconstrained capacity

Proven results: Loss 0.1749 (beat 268M skip at 0.1757), 46% anchor convergence to 0.29154 in GLFM

Loss: Flow matching velocity loss (MSE on predicted vs target velocity)

When to use: Diffusion model bottleneck where geometric addressing replaces reconstruction.


Form 8: Geometric Lookup Flow Matching (GLFM)

Source: CDB §10

Purpose: Formalized three-stage flow matching variant where velocity prediction is driven by geometric address lookup.

Pipeline:

Stage 1 — Geometric Addressing:
  Encoder output → project to S^15 at two scales:
    Coarse: global avg pool → 256d → L2 norm → triangulate (768d)
    Fine: per-spatial → 256d → L2 norm → triangulate → aggregate (768d)
  Total address: 1536 dims of angular measurements

Stage 2 — Address Conditioning:
  Geometric address + sinusoidal timestep + class embed + noise-level bins
  → Fused projection to generator input dim

Stage 3 — Velocity Generation:
  Deep residual MLP generates velocity features from conditioned address
  4 residual blocks width 1024 → 16384-d spatial features → decoder

Key properties:

  • Explicit separation of addressing, conditioning, and generation
  • Multi-scale collapse observed: coarse↔fine cos=0.933 (needs pre-differentiated features like DINOv2)
  • 46% of anchors converged within ±0.05 of 0.29154 binding constant
  • 59% of anchors crossed binding boundary into task-specific territory

Proven results: Loss 0.1754, accelerated drift convergence vs pure bottleneck

When to use: Flow matching diffusion where you want explicit geometric addressing.


Form 9: From-Scratch Encoder (Pixel → Consensus)

Source: PVH §4.2

Purpose: Train a ViT from random initialization to reproduce the expert soup consensus embedding from raw pixels.

Pipeline:

Raw pixels
  → From-scratch ViT (no pretrained weights)
  → Project to D_ANCHOR dims → L2 normalize
  → Train against frozen soup consensus as differentiable teacher

Key properties:

  • The soup is the teacher — it provides the target embedding for each image
  • Gradient bottleneck: all gradient flows through D_ANCHOR-dimensional output
  • With 77M params and 128-d output: gradient density = 1.6×10⁻⁶ per param
  • Expansion warm-start works: 384-d→1024-d by padding, recovers in 5 epochs

Proven results: 1024-d ViT reached cos=0.663, mAP=0.500 (limited by gradient bottleneck and 118K COCO)

Loss: Same as soup training + geometric autograd

When to use: When you need a single encoder that reproduces multi-expert consensus from raw input.


Form 10: Dual-Teacher Consensus Distillation

Source: GM3 §4

Purpose: Two independently-trained models → GPA consensus → distill into a student that exceeds both.

Pipeline:

Teacher A (any config) + Teacher B (any config)
  → Extract embeddings on shared data
  → GPA-align iteratively until δ < 1e-8
  → Consensus = L2_normalize(mean_shape)
  → Student initializes anchors from k-means on consensus
  → Train with: CE + InfoNCE(emb, consensus) + MSE(emb, consensus) + micro CV
  → Geometric autograd: tang=0.01, sep=1.0

Key properties:

  • Student exceeds BOTH teachers (0.761 vs 0.699/0.649)
  • Student still ACCELERATING at epoch 30 (resonant dynamics)
  • Consensus is the geometric truth — what both agree on after removing rotational frames
  • Robust to catastrophic models: a 25.5% accuracy parent still contributed useful signal
  • Diverse parent selection beats top-N selection

Proven results: Student 0.761 from parents averaging 0.664; still accelerating at E30

When to use: When you have 2+ trained models and want a superior student.
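
The GPA-to-consensus step can be sketched with repeated orthogonal Procrustes toward the mutual mean, iterating to the stated δ < 1e-8 tolerance (the function name and loop cap are assumptions):

```python
import torch
import torch.nn.functional as F

def gpa_consensus(embs, tol: float = 1e-8, max_iters: int = 100) -> torch.Tensor:
    """Generalized Procrustes alignment over a list of (N, D) paired
    embedding matrices (one per teacher). Each set is rotated toward the
    mutual mean until the mean stops moving; the L2-normalized mean shape
    is the consensus."""
    aligned = [e.clone() for e in embs]
    prev = None
    for _ in range(max_iters):
        mean = F.normalize(torch.stack(aligned).mean(0), dim=-1)
        if prev is not None and (mean - prev).abs().max() < tol:
            break
        prev = mean
        for i, e in enumerate(embs):
            # Orthogonal Procrustes of each teacher toward the current mean
            u, _, vh = torch.linalg.svd(e.T @ mean)
            aligned[i] = e @ (u @ vh)
    return mean
```

The resulting consensus rows seed the student's anchors (via k-means) and serve as the InfoNCE/MSE target during distillation.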


Form 11: Multi-Generational Geometric Evolution

Source: GM3 §5

Purpose: Iterated consensus distillation across generations with data diversity.

Pipeline:

Gen 0: N founders trained independently → GPA → consensus anchors
Gen 1: M offspring from Gen 0 consensus + new founder (immigration)
Gen 2+: Previous gen offspring + founder → GPA → consensus → next gen
Each generation trains on differently-perturbed data

Key properties:

  • Monotonically improving across generations
  • Each generation inherits consensus-derived anchor coordinates
  • Fresh founders each generation prevent convergence collapse (gene flow)
  • Robust: catastrophic models don't poison the lineage
  • Diverse data across generations captures INVARIANT structure
  • CV converges toward 0.2 naturally across generations

Proven results: Gen 0 mean=0.664 → Gen 4 best=0.775; FUSE_distilled=0.830

When to use: When you want to compound geometric knowledge across training runs.


Form 12: Geometric Autograd (Optimizer)

Source: GM3 Β§2

Purpose: Gradient filtering that replaces weight decay with manifold-aware optimization.

Components:

Embedding backward:
  → Decompose gradient into tangential + radial relative to S^(d-1)
  → Pass tangential fully, attenuate radial by (1 - tang_strength)
  → If gradient moves toward nearest anchor: attenuate by sep_strength

Anchor backward:
  → Project gradient tangential to hypersphere at anchor position
  → Scale by drift_strength

Forward losses (all differentiable):
  → CV: |CV(pentachoron volumes) - 0.2| × 0.001
  → Spread: anchor cos² off-diagonal mean × 1e-3
  → Ortho: gram off-diagonal → 0 × 1e-3
  → Entropy: -Σ p·log(p) × 1e-4
  → Cluster var: -var(per-anchor mean cosine) × 1e-4

Key properties:

  • Adam + geometric autograd > AdamW consistently
  • Weight decay destroys the geometric harmonic the autograd creates
  • tang=0.01, sep=1.0 proven optimal
  • CV loss MUST be forward loss, never backward injection
  • Enables resonant dynamics: constructive interference compounds across epochs

When to use: Training any constellation-based model. The geometry IS the regularization.
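
The embedding-backward filter can be sketched as below. The tangential/radial split follows the component list; the sign convention for "moves toward nearest anchor" is an assumption (read here on the descent direction), as is the exact attenuation formula:

```python
import torch
import torch.nn.functional as F

def filter_embedding_grad(emb: torch.Tensor, grad: torch.Tensor,
                          nearest_anchor: torch.Tensor,
                          tang_strength: float = 0.01,
                          sep_strength: float = 1.0) -> torch.Tensor:
    """Sketch of the geometric-autograd embedding-backward filter.

    emb: (B, D) rows on S^(D-1); grad and nearest_anchor: same shape.
    """
    # Decompose the gradient relative to the sphere at emb
    radial = (grad * emb).sum(-1, keepdim=True) * emb   # along the sphere normal
    tangential = grad - radial
    # Pass tangential fully, attenuate radial by (1 - tang_strength)
    filtered = tangential + radial * (1.0 - tang_strength)
    # Attenuate descent motion toward the nearest anchor by sep_strength
    descent = -filtered
    toward = F.normalize(nearest_anchor - emb, dim=-1)
    comp = (descent * toward).sum(-1, keepdim=True).clamp_min(0.0)
    descent = descent - sep_strength * comp * toward    # sep=1.0 removes it fully
    return -descent
```

With the proven settings (tang=0.01, sep=1.0), nearly all radial signal passes but any descent component aimed at the nearest anchor is removed, which is the separation behavior the bullet list describes.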


Composition Map

| Task | Primary Form | Supporting Forms |
| --- | --- | --- |
| Image classification (single image) | Form 1 (Core) | Form 12 (Autograd) |
| Multi-expert fusion | Form 2 (Soup) | Form 12 |
| Context extension | Form 3 (Memory Bank) | Form 4 (Seq Reconstructor) |
| Diffusion cross-attention | Form 3 + Form 4 | |
| Sequence processing (long) | Form 5 (Relay) + Form 6 (Router) | |
| Diffusion bottleneck | Form 7 (Lookup Table) | Form 8 (GLFM) |
| Train encoder from scratch | Form 9 (From-Scratch) | Form 2 (Soup as teacher) |
| Model distillation | Form 10 (Consensus) | Form 12 |
| Compound improvement | Form 11 (Evolution) | Form 10 + Form 12 |

What the Constellation IS

The constellation is a set of learned anchor points on S^(d-1). It is simultaneously:

  1. A measurement instrument — triangulation computes angular distances to reference points
  2. A coordinate system — the triangulation profile IS the geometric address
  3. A lookup table — the patchwork generates from the address rather than reconstructing the input
  4. A routing topology — anchor proximity determines cross-token interaction (Cantor)
  5. A geometric regularizer — anchor structure prevents collapse and preserves manifold health

The constellation is NOT:

  • An autoencoder (cos_sim ≈ 0 to input in bottleneck form)
  • A positional encoding (it measures WHERE on S^(d-1), not WHERE in sequence)
  • Class prototypes (anchors ≠ classes; anchor count independent of class count)
  • Patches of an image (constellation "patches" = dimensional subspace slices, not spatial tiles)