# Constellation Forms Catalogue
## GeoLIP Architecture Reference (March 2026)

Sources:
- geometric-memory-ft1 (GM1)
- geometric-memory-ft2 (GM2)
- geometric-memory-ft3 (GM3)
- procrustes-vit-hypersphere-ft1 (PVH)
- constellation-diffusion-bottleneck (CDB)
- Session benchmarks (SB)

---

## Universal Constants

| Constant | Value | Source |
|----------|-------|--------|
| Pentachoron CV attractor | 0.20–0.23 | Geometry of S^15 itself (CDB §3) |
| Binding/separation boundary | 0.29154 radians | 5+ architectures (CDB §11) |
| Effective geometric dimension | ~16 | All trained models (CDB §3.3) |
| CV precision invariance | Holds from fp64 down to 1-bit | CDB §3.2 |

## Universal Rules

| Rule | Source |
|------|--------|
| SquaredReLU in all constellation paths, never GELU | SB activation tests |
| Patchwork: Linear(tri, tri×2) → SquaredReLU → LN → Linear(tri×2, dim) | SB proven |
| Gate init: -3.0 (sigmoid ≈ 0.047) | SB proven |
| SLERP: only the acos runs in fp32 (a 16 KB tensor); everything else stays in the compute dtype | SB fp32 fix |
| Adam, NO weight decay: the geometry IS the regularization | GM3 §2.4, PVH §12 |
| InfoNCE is the alignment FORCE. Procrustes is the REGULARIZER. | GM1 §4.1 |
| CV loss goes on the BOTTLENECK, not the output | GM1 §4.2 |
| CV loss weight: micro (0.001 or below) | GM3 §2.2 |
| Procrustes calibration is non-negotiable for anchor init | PVH §5.1 |
| Anchor dropout (30%) prevents collapse | PVH §5.2 |
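The SLERP dtype rule can be sketched in a few lines. This is a minimal NumPy illustration, not code from the repo: the function name `slerp` and the epsilon guard are ours; what it demonstrates is the rule's placement of the single fp32 op (the arccos) inside an otherwise fp16 computation.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b.

    Per the SLERP rule above: only the arccos runs in fp32;
    every other op stays in the incoming compute dtype (here fp16).
    """
    dtype = a.dtype
    dot = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(dot.astype(np.float32))   # the one fp32 op
    so = np.sin(omega) + np.float32(1e-6)       # guard against a ~ b
    wa = (np.sin((1.0 - t) * omega) / so).astype(dtype)
    wb = (np.sin(t * omega) / so).astype(dtype)
    return wa * a + wb * b

a = np.array([1.0, 0.0], dtype=np.float16)
b = np.array([0.0, 1.0], dtype=np.float16)
mid = slerp(a, b, 0.5)   # midpoint on the arc, still fp16
```

The arccos is the only step whose output is catastrophically sensitive to fp16 rounding near dot ≈ ±1, which is presumably why it alone is promoted.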
---

## Form 1: GeoLIP Core (Classification)

**Source:** CDB §2

**Purpose:** Minimal image classification pipeline. Proves the constellation works as a primary representation layer.

**Pipeline:**
```
Input image
→ Conv encoder (builds channel depth: 3→64→128→256)
→ AdaptiveAvgPool → Linear(encoder_out, D) → L2 normalize to S^(D-1)
→ Triangulate against N anchors at 3 SLERP phases → tri_dim profile
→ Patchwork MLP reads triangulation
→ Classifier head → logits
```

**Key properties:**
- Every embedding is on the unit sphere BEFORE the constellation sees it
- The conv encoder builds channel depth; the constellation operates on the channel dimension
- One global vector per image, not a sequence
- No attention anywhere

**Proven results:** 91.5% CIFAR-10, 1.6M params, CV=0.2045, 62/64 active anchors

**Loss:** CE + CV on embeddings

**When to use:** Single-input classification where the input can be reduced to one D-dimensional vector on S^(D-1).
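The triangulation step can be sketched as follows. This NumPy version is illustrative only: the catalogue does not pin down the SLERP endpoints, so interpolating each anchor toward the anchor centroid at three phase values is our assumption, as are the names `l2norm` and `center`. What it does show faithfully is the output: angles to every anchor at 3 phases, concatenated into a tri_dim profile.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ANCHORS, D, B = 16, 32, 4

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def slerp(a, b, t):
    dot = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    om = np.arccos(dot)
    so = np.sin(om) + 1e-8
    return np.sin((1 - t) * om) / so * a + np.sin(t * om) / so * b

anchors = l2norm(rng.normal(size=(N_ANCHORS, D)))  # learned parameters in practice
center = l2norm(anchors.mean(axis=0))              # hypothetical SLERP endpoint

emb = l2norm(rng.normal(size=(B, D)))              # unit embeddings from the encoder

# Profile: angle to every anchor at 3 SLERP phases -> (B, 3 * N_ANCHORS)
phases = (0.0, 0.33, 0.66)
profile = np.concatenate(
    [np.arccos(np.clip(emb @ slerp(anchors, center, t).T, -1.0, 1.0))
     for t in phases], axis=-1)
```

With 16 anchors this yields a 48-dimensional angular profile per embedding, which is what the Patchwork MLP consumes.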
---

## Form 2: Expert Soup (Multi-Expert Fusion)

**Source:** PVH §1, §4

**Purpose:** Fuse multiple frozen pretrained experts into a shared geometric representation on S^(d-1).

**Pipeline:**
```
Input image
→ N frozen expert encoders (CLIP, DINOv2, SigLIP, etc.) → N × 768-d
→ GPA alignment at 768-d (iterative Procrustes to the mutual mean)
→ PCA to D_ANCHOR dims
→ Per-expert Procrustes-initialized projectors (768 → D_ANCHOR)
→ L2 normalize → shared constellation on S^(D_ANCHOR-1)
→ Triangulate: each expert through its own Procrustes rotation
→ Patchwork reads the fused triangulation
→ Classifier
```

**Key properties:**
- Experts are FROZEN; they are never modified
- Procrustes initialization is essential (without it: 1/256 active anchors, i.e. collapsed)
- Anchor dropout (30%) → 508/512 active anchors
- Effective dimensionality matches task complexity (76.9 for COCO's 80 classes)
- Pipeline is almost entirely linear: 7 linear ops + 2 nonlinearities (GELU in patchwork + classifier)
- Weight decay explicitly avoided

**Proven results:** mAP=0.84 ceiling (data-limited), perfect hypersphere verified (1000/1000 positive volumes), 508/512 active anchors

**Loss:** InfoNCE(fused, consensus) + MSE + BCE + Procrustes_align + CV + anchor_spread

**Optimizer:** Adam lr=1e-3, NO weight decay

**When to use:** Combining multiple pretrained encoders into a shared geometric space for downstream tasks.
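The "Procrustes-initialized projector" step rests on the closed-form orthogonal Procrustes solution. A minimal NumPy sketch, with illustrative sizes and random stand-ins for the expert embeddings and the post-GPA consensus coordinates (both would come from real data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D_EXPERT, D_ANCHOR = 200, 768, 128   # sizes are illustrative

expert = rng.normal(size=(N, D_EXPERT))   # frozen expert embeddings
target = rng.normal(size=(N, D_ANCHOR))   # consensus coords after GPA + PCA

# Orthogonal Procrustes: R = argmin ||expert @ R - target||_F s.t. R^T R = I.
# Closed form via SVD of the cross-covariance; using R to initialize the
# per-expert Linear(768, D_ANCHOR) is the "Procrustes-initialized projector".
U, _, Vt = np.linalg.svd(expert.T @ target, full_matrices=False)
R = U @ Vt                                # (768, 128), orthonormal columns

proj = expert @ R                         # the initialized projection
```

Initializing the projector this way starts every expert in a rotation-consistent frame, which is what the "without it: 1/256 active anchors" failure mode above is about.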
---

## Form 3: Geometric Memory / Anchor Bank (Context Extension)

**Source:** GM1 §2, GM2 §2

**Purpose:** Extend a frozen encoder's context window by accumulating segment-level geometric addresses in a memory bank.

**Pipeline:**
```
Long document (N tokens, N >> encoder context)
→ Split into overlapping segments (sized to the encoder window)
→ For each segment:
    → Frozen encoder forward → hidden states at multiple layers
    → Multi-layer fusion (learned weighted sum)
    → Memory tokens cross-attend to the fused hidden states
    → Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
    → Anchor stored in the geometric memory bank
    → GRU gate updates the rolling memory state
→ Final output: encoder-compatible embedding
```

**Key properties:**
- Frozen encoder, trainable memory wrapper
- Depth-profile anchors encode HOW the encoder processed, not just WHAT
- CV loss goes on the BANK ANCHORS specifically: the bottleneck between segments
- Without CV on the bank: projector-shortcut collapse (m_acc plateaus at 0.670)
- With CV on the bank: m_acc reaches 0.945
- Segment size must produce 5+ anchors for the CV computation (a pentachoron needs 5 points)
- Convergence order: CV locks first → m_acc climbs → s_cos climbs last

**Proven results:**
- GEOLIP-BERT-8192: m_acc=0.927, CV=0.200 (512→8192 context)
- GEOLIP-CLIP-ctx576: m_acc=0.945, CV=0.162 (77→576 context)

**Loss:** InfoNCE(student, teacher) + Procrustes_SVD + |CV(bank_anchors) - 0.20|

**When to use:** Extending frozen encoder context windows while preserving embedding space compatibility.
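The CV term in the loss above can be sketched concretely. The pentachoron-volume formula is standard simplex geometry; the grouping of bank anchors into consecutive windows of 5 is our assumption for illustration (the form only requires that a segment produce 5+ anchors so at least one pentachoron exists), and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pentachoron_volume(pts):
    """Volume of the 4-simplex spanned by 5 points in R^d (d >= 4)."""
    edges = pts[1:] - pts[0]                 # (4, d) edge vectors from vertex 0
    gram = edges @ edges.T                   # (4, 4) Gram matrix
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / 24.0   # sqrt(det) / 4!

def cv_loss(bank, target=0.20, weight=0.001):
    """|CV(pentachoron volumes) - target|, micro-weighted per the rules table."""
    vols = np.array([pentachoron_volume(bank[i:i + 5])
                     for i in range(len(bank) - 4)])
    cv = vols.std() / (vols.mean() + 1e-8)   # coefficient of variation
    return weight * abs(cv - target), cv

bank = l2norm(rng.normal(size=(12, 16)))     # 12 bank anchors on S^15
loss, cv = cv_loss(bank)
```

Note the micro weight (0.001): per the Universal Rules, CV is a nudge on the bottleneck, not a dominant objective.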
---

## Form 4: Sequence Reconstructor (Per-Position Output)

**Source:** GM2 §2

**Purpose:** Produce full per-position output sequences from memory state for diffusion cross-attention.

**Pipeline:**
```
Memory state (from Form 3 bank accumulation)
→ Context = cat(memory_tokens, bank_anchors, content_tokens)
→ 77 learned query tokens + positional encoding
→ Cross-attend to context (2 layers)
→ Self-attend among the 77 output positions (2 layers)
→ Output: (B, 77, 768) in the frozen encoder's native distribution
```

**Key properties:**
- Must produce output in the distribution the UNet was trained on
- Training target: the frozen encoder's own output on the same caption (truncated to 77 tokens)
- Two teachers: ModernBERT teaches what to remember, CLIP teaches how to say it
- Two-phase training works for CLIP-L but NOT universally
- Rule: if you need per-position output, train the per-position consumer from the start
- Memory format is shaped by gradient loudness, not architectural capacity

**Proven results:**
- CLIP-L s_cos=0.734; tulips appeared in SD 1.5 from elements past token 77
- Meridian (bigG): s_cos=0.425 (limited by the 1280→1024 dimensional mismatch)

**Loss:** MSE(normalize(pred), normalize(target)) + cosine_similarity + InfoNCE(pooled)

**When to use:** When the downstream consumer needs per-position sequences (diffusion cross-attention, token-level tasks).
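The core of the reconstructor is standard cross-attention from learned queries. A single-head, single-layer NumPy sketch under stated assumptions: the context length, the 0.02 init scale, and all weight names are ours; the real module stacks 2 cross-attention and 2 self-attention layers.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N_CTX, N_OUT = 768, 40, 77    # context length is illustrative

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# context = cat(memory_tokens, bank_anchors, content_tokens), already (N_CTX, D)
context = rng.normal(size=(N_CTX, D))
queries = rng.normal(size=(N_OUT, D)) * 0.02   # 77 learned query tokens
pos = rng.normal(size=(N_OUT, D)) * 0.02       # positional encoding

Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

q = (queries + pos) @ Wq
k, v = context @ Wk, context @ Wv
attn = softmax(q @ k.T / np.sqrt(D))           # (77, N_CTX) routing weights
out = attn @ v                                 # (77, 768) per-position output
```

The 77 queries are the mechanism by which a pooled memory state is re-expanded into the fixed-length sequence the UNet expects.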
---

## Form 5: Constellation Relay (Per-Token Geometric Layer)

**Source:** CDB §4, SB

**Purpose:** Replace attention as a per-token processing layer. O(S) complexity. Preserves geometry at depth.

**Pipeline:**
```
Input (B, S, D) or (B, D)
→ LayerNorm
→ Chunk D into P patches of patch_dim (e.g., 16 × 16d = 256d)
→ L2 normalize each patch to S^(patch_dim-1)
→ Triangulate against anchors at 3 SLERP phases → tri_dim profile
→ Patchwork MLP reads triangulation
→ Gated residual (gate init -3.0)
→ Output = residual + gate * patchwork_out
```

**Key properties:**
- Per-token; no cross-token interaction
- O(S) time and memory: no S² term
- Preserves 99.4% cosine similarity to the input at depth 16 (vs 7.4% for attention)
- 3.4× fewer parameters than vanilla attention
- Geometric preservation is sequence-length invariant (identical from S=64 through S=131072)
- Throughput crossover vs attention at S≈32K; 8.4× faster at S=131K
- SquaredReLU wins: better anchor diversity (7.1 vs 4.6), better equivariance, 0.9999 reconstruction

**Proven results:** cos_to_orig=0.994 at depth 16; 8.4× faster than attention at S=131K

**When to use:** Processing token sequences where geometric preservation matters more than cross-token mixing. Stackable. The per-token processing layer.
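The patchwork-plus-gated-residual core of the relay can be sketched directly from the Universal Rules. The triangulation itself is stubbed out here (a random profile stands in for the anchor angles); the weight init scales and the name `relay` are ours, but the block order (Linear → SquaredReLU → LN → Linear) and the -3.0 gate init follow the rules table.

```python
import numpy as np

rng = np.random.default_rng(4)
B, S, D, TRI = 2, 8, 256, 48        # TRI stands in for tri_dim

def layer_norm(x):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def squared_relu(x):
    return np.maximum(x, 0.0) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Patchwork per the universal rule:
# Linear(tri, tri*2) -> SquaredReLU -> LN -> Linear(tri*2, dim)
W1 = rng.normal(size=(TRI, TRI * 2)) / np.sqrt(TRI)
W2 = rng.normal(size=(TRI * 2, D)) / np.sqrt(TRI * 2)
gate = -3.0                          # gate init -3.0 -> sigmoid(gate) ~ 0.047

def relay(x, tri_profile):
    """One gated-residual relay step; tri_profile is the (B, S, TRI) angle profile."""
    h = tri_profile @ W1
    h = layer_norm(squared_relu(h)) @ W2
    return x + sigmoid(gate) * h     # residual dominates at init

x = rng.normal(size=(B, S, D))
tri = rng.normal(size=(B, S, TRI))
y = relay(x, tri)
```

The near-zero gate at init is plausibly why the stack preserves 99.4% cosine similarity at depth 16: each layer starts as a small perturbation of the identity.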
---

## Form 6: Cantor Constellation Router (Cross-Token Routing)

**Source:** SB (cantor_constellation_relay.py)

**Purpose:** O(S) cross-token routing through the constellation's own anchor hierarchy. Replaces attention's cross-token role.

**Pipeline:**
```
Input tokens (B, S, D) + triangulation profiles (B, S, tri_dim) from the relay
→ Compute soft routing weights from phase-0 triangulation distances
→ For each level l in the binary anchor tree (16→8→4→2→1):
    → Merge anchor weights into group weights at level l
    → Weighted scatter: tokens → group summaries (bmm)
    → Transform: per-level MLP(dim→dim×2→dim) + LN
    → Weighted gather: group summaries → token updates (bmm)
    → Gated residual at each level
→ Output: tokens with cross-token information
```

**Key properties:**
- O(S × n_levels × D), where n_levels = log2(A) + 1 = 5 for A=16
- No S² term anywhere: not in compute, not in memory
- The triangulation from the per-token relay IS the routing key (zero redundant computation)
- A binary tree over the anchors defines the hierarchy (16→8→4→2→1 groups)
- At each level: scatter → transform → gather
- Cantor routing holds at distance BETTER than attention (2× stronger at S=4096)
- The router is a geometric REGULARIZER: cos_orig=0.9818 at 8 layers vs 0.6533 for the relay alone
- Geometry IMPROVES with more tokens (0.982→0.986 as S increases)

**Proven results:** 97.0% cross-token task acc, 0.986 cos preservation at 131K tokens, 5.2× faster than attention at 131K

**When to use:** Combined with the Form 5 relay as a complete O(S) transformer-layer replacement (ConstellationCantorRelay).
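One level of the scatter → transform → gather loop can be sketched in NumPy. Assumptions are ours: the softmax over negated angles as the soft routing weights, the normalization of group summaries, the identity transform (the real router has a per-level MLP + LN there), and the fixed 0.047 gate value.

```python
import numpy as np

rng = np.random.default_rng(5)
B, S, D, A = 2, 32, 64, 16           # batch, tokens, dim, anchors

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Routing weights from phase-0 triangulation angles (smaller angle = more weight)
tri0 = rng.uniform(0, np.pi, size=(B, S, A))
w = softmax(-tri0)                    # (B, S, A)

# One level of the binary tree: merge A anchor weights into A//2 group weights
w_groups = w.reshape(B, S, A // 2, 2).sum(-1)          # (B, S, 8)

tokens = rng.normal(size=(B, S, D))

# Scatter: weighted sum of tokens into group summaries (a bmm)
norm = w_groups.sum(1, keepdims=True) + 1e-8           # (B, 1, 8)
summaries = np.einsum('bsg,bsd->bgd', w_groups, tokens) / norm.transpose(0, 2, 1)

# Transform: the per-level MLP(dim -> dim*2 -> dim) + LN would go here (identity in this sketch)

# Gather: broadcast group summaries back to tokens with the same weights
update = np.einsum('bsg,bgd->bsd', w_groups, summaries)
out = tokens + 0.047 * update         # gated residual (gate init -3.0)
```

Because tokens only ever meet group summaries, the cost per level is linear in S; stacking log2(A) + 1 levels gives the full hierarchy.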
---

## Form 7: Diffusion Bottleneck / Geometric Lookup Table

**Source:** CDB §7–9

**Purpose:** The constellation as the sole information bottleneck of a diffusion model. NOT an autoencoder.

**Pipeline:**
```
Encoder features (256×8×8 = 16384-d)
→ Linear(16384, 256) → reshape (B, 16, 16)
→ L2 normalize each 16-d patch to S^15
→ Triangulate: 16 patches × 16 anchors × 3 phases = 768 dims
→ Concat(768 tri dims, conditioning dims)
→ Patchwork MLP → Linear(hidden, 16384) → reshape → decoder
```

**Key properties:**
- Compression ratio: 16384 → 768 = 21.3×
- cos_sim ≈ 0 to the input: the bottleneck does NOT reconstruct
- It's a geometric LOOKUP TABLE: the triangulation profile is an address, and the patchwork generates from that address
- Works for flow matching because the training signal is velocity prediction, not reconstruction
- Skip-bypass experiment: given a 268M-param linear bypass, the model routed 88% of its signal through the 768 constellation dims
- Constellation-only cos_sim=0.945 to the full model; skip-only cos_sim=0.598
- The constellation provides a representational ADVANTAGE over unconstrained capacity

**Proven results:** Loss 0.1749 (beat the 268M skip at 0.1757); 46% anchor convergence to 0.29154 in GLFM

**Loss:** Flow matching velocity loss (MSE on predicted vs target velocity)

**When to use:** Diffusion model bottleneck where geometric addressing replaces reconstruction.
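The bottleneck bookkeeping above is worth checking explicitly; this is pure arithmetic from the section's own numbers, nothing assumed.

```python
# Form 7 bottleneck bookkeeping.
feat = 256 * 8 * 8                  # encoder feature dims
patches, anchors, phases = 16, 16, 3
address = patches * anchors * phases

assert feat == 16384                # 256x8x8
assert address == 768               # 16 patches x 16 anchors x 3 phases

ratio = feat / address
assert round(ratio, 1) == 21.3      # the 21.3x compression the form reports
```

Every one of the 768 address dims is an angle, which is why the form calls the profile an address rather than a code.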
---

## Form 8: Geometric Lookup Flow Matching (GLFM)

**Source:** CDB §10

**Purpose:** Formalized three-stage flow matching variant where velocity prediction is driven by geometric address lookup.

**Pipeline:**
```
Stage 1: Geometric Addressing
  Encoder output → project to S^15 at two scales:
    Coarse: global avg pool → 256d → L2 norm → triangulate (768d)
    Fine: per-spatial → 256d → L2 norm → triangulate → aggregate (768d)
  Total address: 1536 dims of angular measurements

Stage 2: Address Conditioning
  Geometric address + sinusoidal timestep + class embed + noise-level bins
  → Fused projection to the generator input dim

Stage 3: Velocity Generation
  Deep residual MLP generates velocity features from the conditioned address
  4 residual blocks, width 1024 → 16384-d spatial features → decoder
```

**Key properties:**
- Explicit separation of addressing, conditioning, and generation
- Multi-scale collapse observed: cos(coarse, fine)=0.933 (needs pre-differentiated features like DINOv2)
- 46% of anchors converged within ±0.05 of the 0.29154 binding constant
- 59% of anchors crossed the binding boundary into task-specific territory

**Proven results:** Loss 0.1754; accelerated drift convergence vs the pure bottleneck

**When to use:** Flow matching diffusion where you want explicit geometric addressing.
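The "velocity prediction" training signal both Form 7 and Form 8 rely on can be sketched as standard linear-path flow matching. The linear interpolation path and the endpoint convention are assumptions here; the catalogue only states that the loss is MSE on predicted vs target velocity.

```python
import numpy as np

rng = np.random.default_rng(6)

# Standard linear-path flow matching target (assumed convention):
# x_t = (1 - t) * noise + t * data, target velocity v = data - noise.
data = rng.normal(size=(8, 16384))
noise = rng.normal(size=(8, 16384))
t = rng.uniform(size=(8, 1))

x_t = (1 - t) * noise + t * data
v_target = data - noise

def velocity_loss(v_pred, v_target):
    return np.mean((v_pred - v_target) ** 2)
```

Because the target is a velocity field rather than the input itself, a bottleneck that addresses (cos_sim ≈ 0 to the input) rather than reconstructs is not a contradiction.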
---

## Form 9: From-Scratch Encoder (Pixel → Consensus)

**Source:** PVH §4.2

**Purpose:** Train a ViT from random initialization to reproduce the expert-soup consensus embedding from raw pixels.

**Pipeline:**
```
Raw pixels
→ From-scratch ViT (no pretrained weights)
→ Project to D_ANCHOR dims → L2 normalize
→ Train against the frozen soup consensus as a differentiable teacher
```

**Key properties:**
- The soup is the teacher: it provides the target embedding for each image
- Gradient bottleneck: all gradient flows through the D_ANCHOR-dimensional output
- With 77M params and a 128-d output: gradient density = 1.6×10⁻⁶ per param
- Expansion warm-start works: 384-d → 1024-d by padding; recovers in 5 epochs

**Proven results:** The 1024-d ViT reached cos=0.663, mAP=0.500 (limited by the gradient bottleneck and 118K COCO images)

**Loss:** Same as soup training + geometric autograd

**When to use:** When you need a single encoder that reproduces the multi-expert consensus from raw input.
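The gradient-density figure follows directly from the two numbers the form gives; a one-line arithmetic check:

```python
# Gradient-bottleneck arithmetic: all learning signal enters through the
# D_ANCHOR-dimensional output, so the signal per parameter is d_out / params.
params = 77e6
d_out = 128
density = d_out / params
# 128 / 77e6 = 1.66e-6, matching the ~1.6e-6 per param the form cites
```

This is the quantitative reason the section blames the gradient bottleneck, and not model capacity, for the 0.663 ceiling.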
---

## Form 10: Dual-Teacher Consensus Distillation

**Source:** GM3 §4

**Purpose:** Two independently-trained models → GPA consensus → distill into a student that exceeds both.

**Pipeline:**
```
Teacher A (any config) + Teacher B (any config)
→ Extract embeddings on shared data
→ GPA-align iteratively until δ < 1e-8
→ Consensus = L2_normalize(mean_shape)
→ Student initializes anchors from k-means on the consensus
→ Train with: CE + InfoNCE(emb, consensus) + MSE(emb, consensus) + micro CV
→ Geometric autograd: tang=0.01, sep=1.0
```

**Key properties:**
- The student exceeds BOTH teachers (0.761 vs 0.699/0.649)
- The student was still ACCELERATING at epoch 30 (resonant dynamics)
- Consensus is the geometric truth: what both models agree on after removing their rotational frames
- Robust to catastrophic models: a 25.5%-accuracy parent still contributed useful signal
- Diverse parent selection beats top-N selection

**Proven results:** Student 0.761 from parents averaging 0.664; still accelerating at E30

**When to use:** When you have 2+ trained models and want a superior student.
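The GPA step ("align iteratively until δ < 1e-8") can be sketched as classical Generalized Procrustes Analysis in NumPy. The function names are ours and the demo uses two synthetic shapes related by a random rotation; the stopping rule and the final L2-normalization follow the pipeline text.

```python
import numpy as np

rng = np.random.default_rng(7)

def procrustes_rotate(x, target):
    """Orthogonal rotation of x best aligning it to target (SVD closed form)."""
    U, _, Vt = np.linalg.svd(x.T @ target)
    return x @ (U @ Vt)

def gpa_consensus(shapes, tol=1e-8, max_iter=100):
    """GPA: rotate every shape to the running mean until the mean stops moving."""
    shapes = [s.copy() for s in shapes]
    mean = np.mean(shapes, axis=0)
    for _ in range(max_iter):
        shapes = [procrustes_rotate(s, mean) for s in shapes]
        new_mean = np.mean(shapes, axis=0)
        if np.linalg.norm(new_mean - mean) < tol:   # delta < 1e-8
            break
        mean = new_mean
    # Consensus = L2_normalize(mean_shape), per the pipeline above
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True)

base = rng.normal(size=(50, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # a random rotation frame
consensus = gpa_consensus([base, base @ Q])      # same shape, different frames
```

Two teachers that agree up to rotation collapse onto a single consensus shape, which is exactly the "removing their rotational frames" claim above.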
---

## Form 11: Multi-Generational Geometric Evolution

**Source:** GM3 §5

**Purpose:** Iterated consensus distillation across generations with data diversity.

**Pipeline:**
```
Gen 0: N founders trained independently → GPA → consensus anchors
Gen 1: M offspring from the Gen 0 consensus + a new founder (immigration)
Gen 2+: Previous generation's offspring + founder → GPA → consensus → next gen
Each generation trains on differently-perturbed data
```

**Key properties:**
- Monotonically improving across generations
- Each generation inherits consensus-derived anchor coordinates
- Fresh founders each generation prevent convergence collapse (gene flow)
- Robust: catastrophic models don't poison the lineage
- Diverse data across generations captures INVARIANT structure
- CV converges toward 0.2 naturally across generations

**Proven results:** Gen 0 mean=0.664 → Gen 4 best=0.775; FUSE_distilled=0.830

**When to use:** When you want to compound geometric knowledge across training runs.
---

## Form 12: Geometric Autograd (Optimizer)

**Source:** GM3 §2

**Purpose:** Gradient filtering that replaces weight decay with manifold-aware optimization.

**Components:**
```
Embedding backward:
→ Decompose the gradient into tangential + radial parts relative to S^(d-1)
→ Pass the tangential part fully; attenuate the radial part by (1 - tang_strength)
→ If the gradient moves toward the nearest anchor: attenuate by sep_strength

Anchor backward:
→ Project the gradient tangential to the hypersphere at the anchor position
→ Scale by drift_strength

Forward losses (all differentiable):
→ CV: |CV(pentachoron volumes) - 0.2| × 0.001
→ Spread: anchor cos² off-diagonal mean × 1e-3
→ Ortho: gram off-diagonal → 0, × 1e-3
→ Entropy: -Σ p·log(p) × 1e-4
→ Cluster var: -var(per-anchor mean cosine) × 1e-4
```

**Key properties:**
- Adam + geometric autograd beats AdamW consistently
- Weight decay destroys the geometric harmonic the autograd creates
- tang=0.01, sep=1.0 proven optimal
- The CV loss MUST be a forward loss, never a backward injection
- Enables resonant dynamics: constructive interference compounds across epochs

**When to use:** Training any constellation-based model. The geometry IS the regularization.
---

## Composition Map

| Task | Primary Form | Supporting Forms |
|------|-------------|-----------------|
| Image classification (single image) | Form 1 (Core) | Form 12 (Autograd) |
| Multi-expert fusion | Form 2 (Soup) | Form 12 |
| Context extension | Form 3 (Memory Bank) | Form 4 (Seq Reconstructor) |
| Diffusion cross-attention | Form 3 + Form 4 | |
| Sequence processing (long) | Form 5 (Relay) + Form 6 (Router) | |
| Diffusion bottleneck | Form 7 (Lookup Table) | Form 8 (GLFM) |
| Train encoder from scratch | Form 9 (From-Scratch) | Form 2 (Soup as teacher) |
| Model distillation | Form 10 (Consensus) | Form 12 |
| Compound improvement | Form 11 (Evolution) | Form 10 + Form 12 |
---

## What the Constellation IS

The constellation is a set of learned anchor points on S^(d-1). It is simultaneously:

1. **A measurement instrument:** triangulation computes angular distances to reference points
2. **A coordinate system:** the triangulation profile IS the geometric address
3. **A lookup table:** the patchwork generates from the address rather than reconstructing the input
4. **A routing topology:** anchor proximity determines cross-token interaction (Cantor)
5. **A geometric regularizer:** the anchor structure prevents collapse and preserves manifold health

The constellation is NOT:
- An autoencoder (cos_sim ≈ 0 to the input in the bottleneck form)
- A positional encoding (it measures WHERE on S^(d-1), not WHERE in the sequence)
- Class prototypes (anchors ≠ classes; the anchor count is independent of the class count)
- Patches of an image (constellation "patches" are dimensional subspace slices, not spatial tiles)
|