Procrustes ViT Shared Manifold Alignment Experimentation

Community Article Published March 15, 2026

An Analysis of Many Analyses

AbstractPhil — March 2026


Abstract

We present an extensive empirical investigation into geometric multi-expert alignment on the unit hypersphere S^(d-1). Through iterative architectural experimentation across 60+ training runs, we constructed verified perfect hyperspheres from multi-expert consensus embeddings, discovered emergent dimensional correspondence between effective embedding dimensionality and upstream architectural constraints, and quantified a +103.3 dimensional information gain from dual-stream expert disagreement that standard consensus averaging destroys. Our findings demonstrate that the Cayley-Menger pentachoron coefficient of variation (CV) serves as a reliable geometric health metric, that Procrustes-calibrated projector initialization is essential for constellation stability, and that the information-theoretic ceiling of multi-expert fusion is governed not by the downstream geometry but by the training paradigm of the dominant expert. All experiments were conducted on COCO 2017 (118K train, 5K val) with three vision experts: CLIP ViT-L/14, DINOv2 ViT-B/14, and SigLIP ViT-B/16.


Table of Contents

  1. Core Architecture: The GeoLIP Pipeline
  2. The Hypersphere Alignment System
  3. Geometric Formulas and Loss Functions
  4. Experimental Progression
  5. Key Findings
  6. The 76.9-Dimensional Discovery
  7. Dual-Stream Architecture and the +103.3 Dimension Gap
  8. Hypersphere Verification
  9. Structural Limitations
  10. Unexpected Outcomes
  11. Formulas Reference
  12. Architectural Variants and Results
  13. Implications and Future Directions

1. Core Architecture: The GeoLIP Pipeline

GeoLIP (Geometric Linear Interpolative Patchwork) operates on the unit hypersphere S^(d-1). The pipeline proceeds through distinct phases that mirror the CaptionBERT text alignment paradigm applied to vision:

Phase 1 — Expert Soup Construction: Three pretrained vision experts (CLIP ViT-L/14, DINOv2 ViT-B/14, SigLIP ViT-B/16) produce 768-d embeddings for each image. These are aligned via Generalized Procrustes Analysis (GPA) at 768-d, projected to a lower-dimensional consensus via PCA, and fused through Procrustes-calibrated projectors onto the hypersphere.

Phase 2 — Constellation Crystallization: 256-512 anchor points on S^(d-1) form a coordinate system. Each image's embedding is triangulated against all anchors (cosine distance), producing a high-dimensional distance vector that encodes the image's precise position on the sphere.

Phase 3 — Patchwork Interpretation: The triangulation distances are partitioned into compartments, each processed by a small MLP. The patchwork reads local geometric structure — which anchors are nearby, which are distant, how the triangulation pattern differs from other images.

Phase 4 — From-Scratch Encoder Training: A from-scratch ViT (no pretrained weights) learns to reproduce the consensus embedding from raw pixels, using the frozen soup as a differentiable teacher.

The entire geometric pipeline from expert features to classification logits passes through exactly two nonlinearities (GELU in patchwork + GELU in classifier) plus one geometric nonlinearity (ReLU in the Cayley-Menger volume computation). Everything else — projection, whitening, rotation, triangulation — is linear algebra on the sphere.


2. The Hypersphere Alignment System


2.1 Generalized Procrustes Analysis (GPA)

GPA iteratively aligns N expert embedding matrices to their mutual mean shape. Each iteration:

  1. Compute mean shape: M = (1/N) Σ X_i
  2. For each expert, compute whitened Procrustes alignment to M
  3. Apply alignment, measure total displacement δ
  4. Repeat until δ < 10^-8

In our experiments, GPA consistently converges within 15-20 iterations, achieving inter-expert consensus cosine of 0.965-0.975 at 768-d.
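
The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration using plain orthogonal Procrustes for the per-expert alignment step (the whitened variant of §2.2 drops into the same place); the function name and two-expert setup are ours:

```python
import numpy as np

def gpa(experts, iters=50, tol=1e-8):
    """Generalized Procrustes Analysis: iteratively rotate N centered
    embedding matrices onto their mutual mean shape."""
    aligned = [x - x.mean(axis=0) for x in experts]      # center each expert
    for _ in range(iters):
        mean_shape = np.mean(aligned, axis=0)            # 1. mean shape
        delta = 0.0
        for i, x in enumerate(aligned):
            u, _, vt = np.linalg.svd(x.T @ mean_shape)   # 2. Procrustes rotation
            new = x @ (u @ vt)                           # 3. apply alignment
            delta += np.linalg.norm(new - x)             #    measure displacement
            aligned[i] = new
        if delta < tol:                                  # 4. repeat until converged
            break
    return aligned
```

Two copies of the same shape that differ only by a rotation should coincide after a handful of iterations.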

2.2 Whitened Procrustes Alignment

For source embedding matrix S and target T:

  1. Center: S_c = S - μ_S, T_c = T - μ_T
  2. Whiten: Compute covariances C_S = S_c^T S_c / (N-1) and C_T = T_c^T T_c / (N-1), then S_w = S_c · C_S^(-1/2) and T_w = T_c · C_T^(-1/2)
  3. Rotate: SVD of T_w^T · S_w = UΣV^T, rotation R = UV^T
  4. Compose: Final projection W = (C_S^(-1/2) · R^T)^T

The symmetric inverse square root C^(-1/2) is computed via eigendecomposition: C = QΛQ^T → C^(-1/2) = QΛ^(-1/2)Q^T, with eigenvalue clamping at ε = 10^-6.
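
A NumPy sketch of the four steps, with the eigendecomposition-based C^(-1/2) and ε-clamping described above (the helper names are ours):

```python
import numpy as np

def inv_sqrt(c, eps=1e-6):
    """Symmetric inverse square root C^(-1/2) = Q Λ^(-1/2) Q^T,
    with eigenvalues clamped at eps."""
    lam, q = np.linalg.eigh(c)
    lam = np.clip(lam, eps, None)
    return q @ np.diag(lam ** -0.5) @ q.T

def whitened_procrustes(s, t):
    """Whitened Procrustes: returns W = (C_S^(-1/2) · R^T)^T, which maps
    centered source rows onto the whitened target."""
    s_c, t_c = s - s.mean(0), t - t.mean(0)              # 1. center
    c_s = s_c.T @ s_c / (len(s) - 1)
    c_t = t_c.T @ t_c / (len(t) - 1)
    s_w = s_c @ inv_sqrt(c_s)                            # 2. whiten both
    t_w = t_c @ inv_sqrt(c_t)
    u, _, vt = np.linalg.svd(t_w.T @ s_w)                # 3. rotate: R = U V^T
    r = u @ vt
    return (inv_sqrt(c_s) @ r.T).T                       # 4. compose
```

Applying the composed map as `s_c @ w.T` lands the source on the whitened target up to the recovered rotation.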

2.3 PCA Dimensionality Reduction

After GPA, the 768-d consensus is projected to D_ANCHOR dimensions via PCA:

  1. Center the consensus: X_c = X - μ_X
  2. SVD: X_c = UΣV^T
  3. Projection matrix: P = V[:D_ANCHOR] (top D_ANCHOR right singular vectors)
  4. Projected consensus: X_d = normalize(X · P^T)

At D_ANCHOR=256, PCA retains 100.0% of variance (768-d experts have at most ~256 independent dimensions after GPA alignment). At D_ANCHOR=128, retention is 99.88%.
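
A sketch of the reduction in NumPy; the normalize-after-projection step follows item 4 above (function name is ours):

```python
import numpy as np

def pca_project(x, d_anchor):
    """Project to d_anchor dims via top right singular vectors,
    renormalize onto the hypersphere, and report retained variance."""
    x_c = x - x.mean(0)                                   # 1. center
    _, s, vt = np.linalg.svd(x_c, full_matrices=False)    # 2. SVD
    p = vt[:d_anchor]                                     # 3. projection matrix
    retained = (s[:d_anchor] ** 2).sum() / (s ** 2).sum()
    x_d = x @ p.T                                         # 4. project ...
    x_d /= np.linalg.norm(x_d, axis=1, keepdims=True)     #    ... and normalize
    return x_d, p, retained
```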

2.4 Projector Initialization from Procrustes

Each expert's projector (768 → D_ANCHOR) is initialized from the composed Procrustes transformation:

W_proj = (C_S^{-1/2} · R^T)^T    shape: (D_ANCHOR, 768)
b_proj = -(μ_S · C_S^{-1/2} · R^T)  shape: (D_ANCHOR,)

This gives initial cosine similarity of 0.46-0.75 between projected embeddings and consensus targets (depending on D_ANCHOR), compared to ~0.0 for random initialization. The calibrated initialization is essential — without it, the constellation collapses to 1/256 active anchors within the first epoch.

2.5 Consensus CV Calibration

The pentachoron Coefficient of Variation (CV) of the consensus embedding is measured empirically and used as the target for the CV loss:

  • Consensus CV at 768-d: 0.2793
  • Consensus CV at 256-d: 0.2731-0.2878 (varies by GPA convergence)
  • Consensus CV at 128-d: 0.2731

These values sit slightly above the established pentachoron band of 0.20-0.23 for the universal attractor; the elevation reflects the multi-expert fusion context.


3. Geometric Formulas and Loss Functions

3.1 Cayley-Menger Determinant (Simplex Volume)

For V points p_1, ..., p_V in ℝ^d, the squared volume of the (V-1)-simplex:

d²_{ij} = ||p_i - p_j||²

CM = | 0  1    1    ...  1   |
     | 1  0    d²₁₂ ... d²₁ᵥ|
     | 1  d²₂₁ 0   ... d²₂ᵥ|
     | ⋮                     |
     | 1  d²ᵥ₁ d²ᵥ₂ ... 0  |

Vol² = (-1)^V / (2^{V-1} · ((V-1)!)²) · det(CM)

For the pentachoron (V=5): Vol² = -det(CM) / 9216 (the familiar det(CM)/288 factor is the tetrahedron case, V=4)

The volume is made differentiable via: Vol = √(ReLU(Vol²) + ε) where ε = 10^-12.

The ReLU is a geometric nonlinearity — it enforces the topological constraint that simplex volumes cannot be negative. This is not a learned activation but a physical law of the manifold.

3.2 Pentachoron Coefficient of Variation (CV)

For n_samples random 5-point subsets of the embedding set:
  vol_i = CM_volume(random_5_points)
  
CV = std(vol_1, ..., vol_n) / mean(vol_1, ..., vol_n)

CV loss: L_CV = |CV_measured - CV_target|

Lower CV indicates more uniform volume distribution (geometrically healthy). Higher CV indicates degenerate configurations (collapsed or clustered).
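
Both quantities can be sketched directly from the definitions above (NumPy; `cm_volume` and `pentachoron_cv` are our names):

```python
from math import factorial

import numpy as np

def cm_volume(points, eps=1e-12):
    """(V-1)-simplex volume of V points via the Cayley-Menger determinant,
    with the clamp Vol = sqrt(max(Vol^2, 0) + eps)."""
    v = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # squared dists
    cm = np.ones((v + 1, v + 1))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coef = (-1) ** v / (2 ** (v - 1) * factorial(v - 1) ** 2)
    return np.sqrt(max(coef * np.linalg.det(cm), 0.0) + eps)

def pentachoron_cv(emb, n_samples=200, rng=None):
    """CV = std/mean over volumes of random 5-point subsets."""
    if rng is None:
        rng = np.random.default_rng(0)
    vols = np.array([cm_volume(emb[rng.choice(len(emb), 5, replace=False)])
                     for _ in range(n_samples)])
    return vols.std() / vols.mean()
```

As a sanity check, the right tetrahedron on the coordinate axes has volume 1/6 under this formula.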

3.3 InfoNCE with Queue

L_InfoNCE = -1/(2B) Σ[log(exp(e_i·t_i/τ) / Σ_j exp(e_i·t_j/τ))
                     + log(exp(t_i·e_i/τ) / Σ_j exp(t_i·e_j/τ))]

With queue: negatives include the current batch (B) plus a rolling buffer (Q) of detached embeddings from previous batches, so each sample is scored against B + Q candidates (B - 1 + Q of them negatives). Queue size: 4096 in our experiments. Temperature τ = 0.07.
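
A minimal NumPy sketch of the symmetric objective, under the simplifying assumption that both directions share a single queue of detached negatives (the function name is ours):

```python
import numpy as np

def info_nce_queue(e, t, queue, tau=0.07):
    """Symmetric InfoNCE: positives on the diagonal, negatives drawn from
    the current batch plus a queue of detached past embeddings."""
    def side(a, pos, negs):
        logits = a @ np.concatenate([pos, negs]).T / tau   # (B, B + Q)
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        b = len(a)
        return -log_p[np.arange(b), np.arange(b)].mean()   # positive = column i
    return 0.5 * (side(e, t, queue) + side(t, e, queue))
```

Aligned pairs should score a much lower loss than mismatched ones.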

3.4 Geometric Autograd (Tangential + Separation)

A custom autograd function that modifies gradients to respect hypersphere geometry:

# Decompose gradient into radial and tangential components
radial = (grad · emb) · emb
tangential = grad - radial

# Suppress the radial component (keeps the embedding on the sphere):
# only a tang-fraction of the radial gradient survives
corrected = tangential + tang · radial

# Separation: push away from nearest anchor
if sep > 0:
    nearest = anchors[argmax(emb · anchors^T)]
    toward = corrected · nearest              # scalar component toward the anchor
    corrected = corrected - sep · ReLU(toward) · nearest

Parameters: tang = 0.01 (near-full tangential projection), sep = 1.0 (strong separation).

3.5 Whitened Procrustes Alignment Loss

A differentiable proxy for Procrustes alignment quality:

L_align = 1 - mean(cosine_similarity(emb - μ_emb, target - μ_target))

This centers both distributions and measures alignment of deviations from mean — a differentiable approximation to the full Procrustes objective.

3.6 Anchor Spread Loss

Penalizes anchor clustering:

sim = anchors_normalized · anchors_normalized^T
sim = sim - diag(sim)  # zero diagonal
L_spread = mean(sim²)

For large anchor sets (>1024), computed on a random subsample of 512 anchors per step.
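
As code (a NumPy sketch with our naming; the 512-anchor subsampling for large sets is included):

```python
import numpy as np

def anchor_spread(anchors, subsample=512, rng=None):
    """Penalize anchor clustering: mean squared off-diagonal cosine."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(anchors) > subsample:                      # large sets: random subsample
        anchors = anchors[rng.choice(len(anchors), subsample, replace=False)]
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sim = a @ a.T
    np.fill_diagonal(sim, 0.0)                        # zero the diagonal
    return (sim ** 2).mean()
```

Orthogonal anchors score zero; a clustered set scores near the off-diagonal fraction of the matrix.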

3.7 Combined Loss (Soup Training)

L = 1.0·L_InfoNCE + 0.5·L_MSE + 0.3·L_BCE + 0.5·L_align 
  + 0.001·L_CV + 0.001·L_spread

Optimizer: Adam, lr=10^-3, no weight decay (geometry IS the regularization).

3.8 Combined Loss (Encoder Training)

L = 1.0·L_InfoNCE + 0.5·L_MSE + 0.3·L_BCE + 0.5·L_align
  + 0.001·L_CV + geometric_autograd

Optimizer: Adam, lr=3×10^-4, warmup 500 steps + cosine decay.


4. Experimental Progression

4.1 Soup Variants

| Variant | Anchors | D_ANCHOR | Experts | Params | mAP | Active Anchors |
|---|---|---|---|---|---|---|
| Base (uncalibrated) | 256 | 128 | 3×1 perspective | 800K | 0.825 | 1/256 (collapsed) |
| Base (calibrated) | 256 | 128 | 3×1 perspective | 800K | 0.837 | 94/256 |
| Heavy fused | 512 | 256 | 3×3 perspectives | 3.2M | 0.840 | 508/512 |
| Massive ortho | 2048 | 256 | 3×3 perspectives | 17.5M | 0.838 | 1506/2048 |
| Dual-stream | 512 | 256 | 3×shared + 3×native | 12.1M | 0.838 | 201/512 |

Key finding: The mAP ceiling at ~0.84 is a data/task limitation, not an architectural one. Increasing anchors from 256 to 2048 and parameters from 800K to 17.5M yielded no improvement.

4.2 Encoder Variants

| Variant | Architecture | Params | Epochs | Best cos | Best mAP |
|---|---|---|---|---|---|
| Base ViT 384-d | 6L/384d/6h | 11.2M | 60 | 0.599 | 0.429 |
| Large ViT 1024-d | 6L/1024d/16h | 77.8M | 48+ | 0.663 | 0.500 |
| Base + geo injection | 6L/384d/6h + geo tokens | 11.4M | 10+ | 0.601 | 0.432 |
| Tiny fused | 4L/240d/4h + fused constellation | 4.2M | 20+ | 0.534 | 0.400 |
| Tiny + bank | 4L/240d/4h + bank | 3.8M | 5+ | 0.450 | 0.311 |

Key finding: The 1024-d encoder surpassed the 384-d ceiling but plateaued at cos=0.663. Train cosine continued rising (0.913 at E48) while val cosine flatlined — classic overfitting with 77M params on 118K images.


5. Key Findings

5.1 Procrustes Calibration is Non-Negotiable

Without calibration (random initialization):

  • Constellation collapses to 1/256 active anchors by epoch 1
  • Self-similarity: 0.969 (everything in one tight cone)
  • Effective dimensionality: 24/128
  • mAP: 0.825 (classifier reads micro-variations around single anchor)

With calibration (GPA + Procrustes + consensus-seeded anchors):

  • 226/256 active anchors at epoch 1
  • Self-similarity: distributed
  • Effective dimensionality: 77/256
  • mAP: 0.837 (geometry is alive)

The calibrated model starts at 0.788 mAP on epoch 1 and climbs. The uncalibrated model starts at 0.732 and stalls. Same architecture, same data, same losses. The only difference is initialization.

5.2 Anchor Dropout Prevents Collapse

Standard training: models converge to ~94 active anchors regardless of architecture size.

With 30% anchor dropout:

  • Heavy soup: 508/512 active (99.2%)
  • Fused tiny ViT: 166/256 active (64.8%)
  • Utilization entropy: 92.3% of maximum

Anchor dropout randomly masks anchors per batch, forcing the model to distribute representations across the full anchor set. This is analogous to DropConnect on a geometric graph.
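
A sketch of the mechanism: mask a random fraction of anchors per step when computing triangulation distances. The exact masking scheme (zeroing dropped columns) is our assumption; the text specifies only the 30% rate.

```python
import numpy as np

def triangulate_with_dropout(emb, anchors, p_drop=0.3, train=True, rng=None):
    """Triangulation (1 - cosine against unit-norm anchors) with per-step
    anchor dropout: a random p_drop fraction of anchors is masked out,
    forcing the model to spread load across the full anchor set."""
    if rng is None:
        rng = np.random.default_rng(0)
    tri = 1.0 - emb @ anchors.T                  # assumes unit-norm rows
    if train:
        keep = rng.random(anchors.shape[0]) >= p_drop
        tri = tri * keep                         # dropped anchors read as zero
    return tri
```

At evaluation time (`train=False`) the full distance vector is returned unmasked.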

5.3 The Gradient Bottleneck

All losses apply to the D_ANCHOR-dimensional pooled output. For the 1024-d encoder with 128-d output:

  • 77M transformer parameters receive gradient through a 128-d pinhole
  • Early transformer layers are gradient-starved
  • Train cosine climbs to 0.91 while val cosine plateaus at 0.66

DINOv2 avoids this with iBOT (per-patch loss). Our encoder gets gradient only at the pooled summary. The geometric injection experiment (prepending a geo token at each layer) did not break the ceiling — awareness of position is not the same as gradient bandwidth.

5.4 The Perfect Hypersphere

Verification on the heavy soup (512 anchors × 256-d):

Norms: mean=1.0000000000, std=2.54×10⁻⁸, max_dev=1.19×10⁻⁷
Pentachoron squared volumes (1000 samples):
  Positive: 1000  Negative: 0  Zero: 0
  mean=0.00842387  std=0.00096220
Pairwise squared distances: mean=1.992599 (theoretical: 2.0)

1000/1000 positive volumes. Zero degenerate simplices. Norms at unity to floating-point precision. The embedding space IS S^255: not approximately, not statistically, but exactly to the limits of floating-point arithmetic.

The 0.0074 deviation from the theoretical d²=2.0 is the "gravitational pull" of the data manifold on the anchor set. Anchors seeded from consensus samples inherit a slight positive correlation along the ~77 data-occupied dimensions.

5.5 Expert Uniqueness is Asymmetric

Leave-one-out uniqueness:
  Without CLIP:   0.0285 (SigLIP covers for it)
  Without SigLIP: 0.0280 (CLIP covers for it)
  Without DINOv2: 0.0346 (nobody covers for it — unique structure)

Cross-expert agreement:
  CLIP × SigLIP:  0.750 (both text-supervised — similar views)
  CLIP × DINOv2:  0.669 (different training paradigms)
  DINOv2 × SigLIP: 0.674 (different training paradigms)

DINOv2 is the most unique expert because it's the only one without a text bottleneck. But in the GPA consensus, it gets outvoted 2-to-1 by the text-supervised majority.

5.6 The mAP Ceiling at 0.84

Every architecture variant converges to the same mAP range:

  • Base soup: 0.837
  • Heavy fused: 0.840
  • Massive ortho (2048 anchors): 0.838
  • Dual-stream: 0.838

This ceiling is not architectural — it's the information content of 80 COCO classes evaluated on 5K images with 3 experts trained on orders-of-magnitude more data than our training set.


6. The 76.9-Dimensional Discovery

6.1 The Observation

The effective dimensionality of the fused embedding manifold, measured across all soup variants:

Heavy soup (512 × 256-d):     76.9 effective dimensions
Dual-stream (512 × 256-d):    71.6 effective dimensions
Base soup (256 × 128-d):      23.6 effective dimensions (collapsed)
Base calibrated (256 × 128-d): ~77 effective dimensions (estimated)

The collapsed base soup at 23.6 reflects degenerate geometry. All properly calibrated soups converge to approximately 77 effective dimensions regardless of ambient dimensionality (128-d or 256-d).
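
The article does not pin down its effective-dimensionality estimator; one common choice consistent with the reported behavior is the participation ratio of the covariance spectrum, sketched here as an assumption:

```python
import numpy as np

def effective_dim(emb):
    """Participation-ratio estimate of effective dimensionality:
    (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    sample covariance. One common estimator, not necessarily the
    exact metric used in the article."""
    x = emb - emb.mean(0)
    lam = np.linalg.eigvalsh(x.T @ x / (len(x) - 1))
    lam = np.clip(lam, 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()
```

An isotropic cloud in d dimensions scores close to d; a rank-1 ribbon scores close to 1, matching the collapsed-versus-calibrated contrast above.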

6.2 COCO's 80 Classes — The Task Dimensionality Hypothesis

COCO has 80 object classes. The effective dimensionality of 76.9 is suspiciously close. After accounting for correlated class pairs:

mouse × keyboard:       cos=0.994 (nearly identical embeddings)
baseball bat × glove:   cos=0.988
microwave × refrigerator: cos=0.974
spoon × bowl:           cos=0.972
skis × snowboard:       cos=0.962
fork × knife:           cos=0.953

Together these correlated pairs remove roughly 3-4 independent axes, reducing the count from 80 to approximately 76-77. The embedding manifold has learned exactly as many independent dimensions as the task provides independent labels, and no more.

6.3 Implications

If the effective dimensionality matches the task complexity:

  • D_ANCHOR > 80 provides no additional information for COCO classification
  • D_ANCHOR=128 is generous, D_ANCHOR=256 wastes 180 empty dimensions
  • The patchwork can overfit the empty dimensions (observed with the massive 14.6M patchwork)
  • Different datasets would yield different effective dimensionalities — ImageNet-1K would likely show ~900-950

This suggests D_ANCHOR should be set to approximately 1.0-1.5× the number of independent class distinctions, not to match the expert's hidden dimension.


7. Dual-Stream Architecture and the +103.3 Dimension Gap

7.1 Architecture

Each expert receives two projectors:

proj_shared:  768 → 256, Procrustes-initialized (consensus path)
proj_native:  768 → 256, Xavier-initialized (expert's own geometry)

The shared path learns where experts AGREE. The native path learns where each expert NATURALLY represents. The displacement (shared - native) IS the learned Procrustes transformation.
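
A structural sketch of the two paths in NumPy (class and attribute names are ours; the Procrustes-initialized weight is taken as given from §2.4, and the Xavier-uniform bound is the standard one):

```python
import numpy as np

class DualStreamProjector:
    """Each expert gets two 768 -> 256 projectors: a shared path seeded
    from Procrustes calibration and a native Xavier-initialized path.
    The displacement (shared - native) carries the expert's disagreement
    with the consensus."""
    def __init__(self, w_procrustes, d_in=768, d_out=256, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        bound = np.sqrt(6.0 / (d_in + d_out))               # Xavier uniform
        self.w_shared = w_procrustes                        # (d_out, d_in)
        self.w_native = rng.uniform(-bound, bound, (d_out, d_in))

    def forward(self, x):
        shared = x @ self.w_shared.T
        native = x @ self.w_native.T
        shared = shared / np.linalg.norm(shared, axis=-1, keepdims=True)
        native = native / np.linalg.norm(native, axis=-1, keepdims=True)
        return shared, native, shared - native              # displacement
```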

7.2 The Information Gap

Analysis of the dual-stream soup revealed:

Shared path effective dimensions:     71.6
Native diff effective dimensions:    162.2
Combined effective dimensions:       174.6
Information gain from native diffs: +103.3 dimensions

The consensus averaging destroyed 103.3 dimensions of structured, deterministic expert disagreement. This is not noise — it is reproducible to floating point precision and encodes what CLIP knows that DINOv2 doesn't, what DINOv2 sees that SigLIP misses.

7.3 Structure of the Disagreement

Native cross-expert agreement:
  CLIP × DINOv2:  cos=0.206 (strongly different)
  CLIP × SigLIP:  cos=0.208 (strongly different)
  DINOv2 × SigLIP: cos=0.793 (agree with each other)

Native triangulation divergence from shared:
  CLIP:   tri_cos=0.209 (very different from consensus)
  DINOv2: tri_cos=0.246
  SigLIP: tri_cos=0.320 (closest to consensus)

CLIP's native projection is the outlier. DINOv2 and SigLIP (despite completely different training paradigms — self-supervised vs sigmoid-contrastive) agree in native space at cos=0.793. CLIP's text-supervised training creates a fundamentally different representational geometry.

7.4 Why the Patchwork Couldn't Use It

The fused constellation computed expert triangulations through Procrustes rotations that converged to near-identity:

Expert triangulation correlation: 0.997 (effectively identical)
Pairwise diffs: std=0.000 (literally zero signal)

The consensus pressure from InfoNCE + MSE was overwhelming. The expert rotations learned to be the same rotation, making the "3 expert perspectives × 512 anchors = 1536-d" triangulation redundant — just "1 perspective × 512 anchors" repeated three times.

The dual-stream architecture explicitly separates shared and native paths with a displacement loss keeping shared×native cosine in the [0.3, 0.8] band. The native paths maintained differentiation, but the patchwork only read the shared triangulation in the current architecture.


8. Hypersphere Verification

8.1 Verification Protocol

# 1. Norm verification
norms = anchors.norm(dim=-1)
# Result: mean=1.0000000000, std=2.54e-08

# 2. Cayley-Menger volume verification (1000 random pentachorons)
for _ in range(1000):
    idx = random_5_from(N_ANCHORS)
    vol_sq = cayley_menger_det(anchors[idx])
# Result: 1000 positive, 0 negative, 0 zero

# 3. Distance consistency: on the unit sphere, d_sq = 2 - 2*cos(theta)
# Result: mean=1.9926 (expected 2.0 for uniform on S^255)

8.2 Geometric Health Metrics

| Metric | Heavy Soup | Dual-Stream | Meaning |
|---|---|---|---|
| Anchor pairwise cos | 0.001 | 0.0001 | Near-orthogonal (good) |
| Anchor eff_rank | 207.9/256 | 256.0/256 | Spanning full space |
| Anchor pentachoron CV | 0.067 | 0.029 | Low variance = uniform |
| Embedding eff_dim | 76.9/256 | 71.6/256 | Data manifold dimension |
| Global CV | 0.196 | 0.220 | Embedding volume regularity |
| Local/Global CV ratio | 1.81 | 1.96 | Clusters have internal structure |
| Utilization entropy | 92.3% | 73.4% | Anchor usage uniformity |
| Gini coefficient | 0.525 | 0.857 | Utilization inequality |

8.3 The Linearity of the Geometric Pipeline

The soup's forward pass has exactly 7 linear operations and 2 nonlinearities:

LINEAR:
  1. Expert projection (768 → D_ANCHOR)
  2. LayerNorm
  3. Expert whitening (D_ANCHOR × D_ANCHOR)
  4. Expert rotation (D_ANCHOR × D_ANCHOR)
  5. L2 normalization
  6. Triangulation (cosine with anchors)
  7. 1 - cosine (distance conversion)

NONLINEAR:
  8. Patchwork GELU (interpretation of geometry)
  9. Classifier GELU (interpretation of patchwork)
  
GEOMETRIC NONLINEAR (in loss only):
  10. Cayley-Menger ReLU (volume positivity constraint)

The geometry is preserved because every operation is an isometry (rotation), projection (whitening), or metric (cosine). Distortion is impossible by construction. The two GELUs sit at the boundary between geometry and interpretation — the patchwork reads the sphere, it doesn't modify it.


9. Structural Limitations

9.1 Data Scale

118K COCO images with 80 multi-label classes is insufficient to fill a 256-d hypersphere. The data occupies a ~77-d ribbon on S^255, leaving 180 dimensions empty. Larger datasets (CC12M, LAION) would likely fill more dimensions and justify higher D_ANCHOR.

9.2 Expert Homogeneity

Two of three experts (CLIP, SigLIP) are text-supervised, sharing the same information-theoretic bottleneck. The consensus is dominated by the text-supervised majority. Adding a structurally different expert (e.g., MAE, I-JEPA) would increase effective dimensionality and disagreement information.

9.3 Gradient Bottleneck in Encoder Training

The from-scratch ViT receives all gradient through the D_ANCHOR-dimensional output. With D_ANCHOR=128 and a 77M encoder, the gradient information density is ~1.6 × 10^-6 per parameter. Intermediate losses (per-patch, per-layer) are needed for larger encoders.

9.4 Loss Collapse in Expert Differentiation

Every attempt to maintain expert perspective differentiation during training failed:

  • Bank loss: collapsed to 0.0000 by epoch 4
  • Expert agreement loss: collapsed to 0.0004
  • Displacement loss: collapsed to 0.0000 by epoch 2
  • Cross-contrast loss: collapsed to 0.0000 by epoch 3

The consensus losses (InfoNCE, MSE, Procrustes) overwhelm any differentiation pressure. Expert perspectives can only be preserved through explicit architectural separation (dual-stream) rather than loss-based encouragement.

9.5 Anchor Crystallization Pattern

All models exhibit anchor count reduction during training:

Epoch 1:  226/256 active (calibrated base)
Epoch 5:  113/256 active
Epoch 20: 94/256 active (stabilized)

Without anchor dropout, models crystallize to approximately N_CLASSES active anchors. With 30% dropout, models maintain 60-90% utilization but the distribution remains uneven (Gini 0.52-0.86).


10. Unexpected Outcomes

10.1 Perfect Hypersphere from Training

We did not explicitly constrain the embedding space to be a perfect hypersphere. L2 normalization puts points ON S^(d-1), but the Cayley-Menger volumes, simplex regularity, and distance structure all emerged from training. The CV loss encourages volume uniformity but does not enforce positive-definiteness — that was a natural consequence of the linear geometric pipeline.

10.2 Cross-Model Weight Cosine = 0.000, Activation Procrustes = 0.999

From the deep model analysis of CLIP, DINOv2, and SigLIP:

Weight-space alignment:     cos ≈ 0.000 at every layer
Activation-space alignment: cos ≈ 0.999 after Procrustes at final layer

The three models encode identical geometry through completely different weight configurations. They agree on WHAT to represent but disagree entirely on HOW to represent it. Procrustes alignment bridges the gap — the geometric information is rotation-equivalent.

10.3 Anchor Count ≈ Number of Classes

Without dropout, the model consistently crystallizes to ~94 active anchors for 80 COCO classes. This is not exactly 80 because:

  • Some classes co-occur frequently (person + car) and share anchors
  • Some classes are rare (toaster: 8 images) and may not claim dedicated anchors
  • Sub-class structure (person outdoors vs person indoors) creates extra anchors

The convergence to approximately N_CLASSES active anchors suggests the constellation naturally discovers the task-relevant partition of the sphere.

10.4 DINOv2 + SigLIP Native Agreement

Despite completely different training paradigms:

  • DINOv2: self-supervised (DINO + iBOT), no text, no labels
  • SigLIP: sigmoid contrastive with text supervision

Their native (non-Procrustes-aligned) projections agree at cos=0.793. CLIP's native projection is the outlier at cos=0.206 against both. This suggests DINOv2 and SigLIP converge on similar visual structure through different training paths, while CLIP's text conditioning creates a fundamentally different representational geometry.

10.5 The Expansion Warm-Start

Expanding a trained 384-d encoder to 1024-d by padding weights with Xavier initialization works remarkably well:

384-d model at E20:  cos=0.599, mAP=0.429
1024-d at E1 (expanded): cos=0.533, mAP=0.353
1024-d at E5:        cos=0.593, mAP=0.428 (recovered in 5 epochs)
1024-d at E19:       cos=0.652, mAP=0.492 (surpassed ceiling)

83 weight tensors were expanded (old values in top-left corner, Xavier in new dimensions). The trained core preserved the learned geometry while the new dimensions explored.
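
The padding scheme can be sketched as follows (NumPy; `expand_weight` is our name, and the Xavier-uniform bound used for the new dimensions is the standard one, assumed to match the run):

```python
import numpy as np

def expand_weight(old, new_shape, rng=None):
    """Warm-start expansion: the trained tensor is preserved in the
    top-left corner, new dimensions get Xavier-uniform values."""
    if rng is None:
        rng = np.random.default_rng(0)
    fan_out, fan_in = new_shape[0], new_shape[-1]
    bound = np.sqrt(6.0 / (fan_in + fan_out))          # Xavier uniform bound
    new = rng.uniform(-bound, bound, size=new_shape)
    idx = tuple(slice(0, s) for s in old.shape)
    new[idx] = old                                      # preserve the trained core
    return new
```

The trained core then continues to carry the learned geometry while the freshly initialized dimensions explore.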


11. Formulas Reference

Complete Loss Stack

# Soup training (expert features → 80-class logits)
L_soup = 1.0·InfoNCE(fused, consensus_target, queue)
       + 0.5·MSE(fused, consensus_target)
       + 0.3·BCE(logits, coco_labels)
       + 0.5·ProcAlign(fused, consensus_target)
       + 0.001·|CV(fused) - CV_target|
       + 0.001·AnchorSpread(anchors)

# Encoder training (raw pixels → 128-d embedding)
L_encoder = same as soup + GeometricAutograd(tang=0.01, sep=1.0)

# Dual-stream additional losses
L_dual = 0.3·DisplacementLoss(shared, native)  # keep in [0.3, 0.8]
       + 0.3·CrossContrastLoss(native_pairs)   # keep in [0.2, 0.8]
       + 0.1·NativeDiversityLoss(native_list)   # encourage different agreements

Key Equations

Procrustes alignment:       R = U·V^T  where  T_w^T·S_w = UΣV^T
Triangulation:              tri_i = 1 - cos(emb, anchor_i) = 1 - emb·a_i^T
Pentachoron volume:         Vol = √(max(0, (-1)^5/(2^4·4!²)·det(CM)))
CV:                         CV = std(volumes) / mean(volumes)
Geometric autograd:         g_corrected = g_tangential + α·g_radial - β·g_toward_anchor   (α = tang, β = sep)
Anchor spread:              L = mean(off_diagonal(anchor·anchor^T)²)

12. Architectural Variants and Results

Soup Architectures Tested

ExpertProjector:     Linear(768→D) + LayerNorm + L2-norm
Constellation:       Parameter(N_ANCHORS, D) + L2-norm
Patchwork:           N_COMP × [Linear(in→2d) + GELU + Linear(2d→d) + LayerNorm]
FusedConstellation:  Constellation + 3 Procrustes rotations + anchor dropout
DualStreamProjector: proj_shared (Procrustes) + proj_native (Xavier)
MultiDepthPatchwork: coarse(16) + fine(64) [+ micro(128)] → projection
Classifier:          Linear(pw+D→pw) + GELU + LayerNorm + Dropout + Linear(pw→80)

From-Scratch Encoder Architectures Tested

TinyViT:    4L/240d/4h,  patch16, ~3M params
BaseViT:    6L/384d/6h,  patch16, ~11M params
LargeViT:   6L/1024d/16h, patch16, ~77M params
GeoInject:  BaseViT + geo token at each layer residual

Optimizer Configuration (Proven)

Soup:    Adam lr=1e-3, NO weight decay
Encoder: Adam lr=3e-4, warmup 500 steps, cosine decay, NO weight decay

Weight decay is explicitly avoided — the geometry provides all necessary regularization through the hypersphere constraint and Procrustes alignment.


13. Implications and Future Directions

13.1 The Geometric Consensus Principle

Multi-expert GPA alignment on the hypersphere naturally discovers the information-theoretic bottleneck of the dominant training paradigm. This is not designed — it emerges from averaging on S^(d-1). The consensus dimensionality reflects the intersection of what multiple models can agree on, which is bounded by the least expressive common signal (text supervision at ~77 tokens for CLIP-family models, or ~80 independent class labels for COCO).

13.2 Disagreement as Information

The +103.3 dimension gap proves that expert disagreement is not noise but structured, deterministic, and information-rich. Future architectures should explicitly preserve and utilize this disagreement rather than averaging it into the consensus. The dual-stream architecture demonstrates one approach; pairwise diff triangulation offers another.

13.3 The Linear Geometry Hypothesis

The soup achieves 0.84 mAP with only 2 nonlinearities in its forward pass. The geometric operations (projection, whitening, rotation, triangulation) are purely linear and provably structure-preserving. This suggests that for representation learning on the hypersphere, the geometry itself provides sufficient inductive bias — nonlinearity is needed only for interpretation of the geometric structure, not for its computation.

13.4 Scaling Laws for Geometric Systems

Our experiments suggest:

  • D_ANCHOR should match task complexity (1.0-1.5× independent classes), not expert dimension
  • N_ANCHORS should exceed active classes but excessive anchors overfit (2048 was worse than 512)
  • Encoder capacity must match data scale (77M params on 118K images overfits)
  • Anchor dropout is essential for utilization above the natural crystallization point
  • Expert diversity matters more than expert count (3 experts with 2 sharing text supervision = ~1.5 effective experts)

13.5 Open Questions

  • Does the 76.9-dimensional convergence hold for non-COCO datasets?
  • Can the +103.3 native dimensions be exploited without retraining the soup?
  • What is the minimum encoder capacity that avoids the gradient bottleneck?
  • Can the geometric autograd be extended to per-patch supervision (analogous to iBOT)?
  • Is there a theoretical relationship between pentachoron CV and downstream task performance?

Appendix: Repository Links


This document summarizes findings from an extended research session spanning multiple days of continuous experimentation, more than 200 GPU-hours of training across 60+ experimental runs, and the analysis of 17 architectural variants. All code, checkpoints, and tensorboard logs are available in the linked repositories.
