Geometric Fusion: Cross-Modal Alignment Through Shared Pentachoron Geometry

Community Article Published March 6, 2026

AbstractPhil March 2026


Abstract

We demonstrate that independently-trained language and vision models (BERT-large and DINOv2-large) can be bridged into a shared embedding space using a single-layer fusion transformer trained for one epoch on 5,000–40,000 image-caption pairs. The resulting system achieves R@1 = 1.0000 (perfect retrieval) on 40,775 held-out COCO test pairs and R@1 = 0.99997 on 31,014 Flickr30k pairs in zero-shot transfer. These results are invariant to random seed (±0.0000 across 5 seeds), require only one transformer layer (19M parameters), and emerge after a single gradient step. We ground this work in a geometric analysis of 17 neural network architectures spanning 5 architecture families and 6 training objectives, revealing universal structural invariants including a cross-modal QK eigenvalue lock at 0.500, spectral correlations of 0.94–0.98 across modalities, and a pentachoron coefficient of variation (CV) converging to a universal band of 0.20–0.23. The shared embedding space spontaneously reproduces this CV band, with text and image embeddings independently settling at CV ≈ 0.20 — the same geometric constant measured in transformers, UNets, and convolutional autoencoders before the fusion model existed.


1. Introduction

Cross-modal alignment — mapping language and vision into a shared representation space — has been dominated by approaches requiring massive paired datasets. CLIP (Radford et al., 2021) trained on 400 million image-text pairs. ALIGN (Jia et al., 2021) used 1.8 billion. The assumption has been that bridging modalities requires enormous co-training.

We challenge this assumption with two findings. First, we present a comprehensive geometric analysis of 17 neural network models revealing that language and vision models share deep structural invariants — not just similar architectures, but identical geometric constants in their weight topology. Second, we demonstrate that these shared invariants enable extreme data efficiency: a thin fusion transformer achieves perfect cross-modal retrieval with 80,000× less paired data than CLIP.

Our contributions:

  1. Geometric terrain analysis of 17 models across 5 architecture families (transformer encoder-decoder, encoder-only, adapted decoder, convolutional UNet, convolutional autoencoder) revealing universal invariants including cross-modal QK eigenvalue balance locked at 0.500, cross-layer weight decorrelation at ~0.000, and spectral correlations of 0.94–0.98.

  2. Procrustes alignment analysis showing BERT and DINOv2 attention weights are 61% structurally aligned (70% in deep layers, 75% for output projections), with 97% spectral correlation — despite never sharing a training signal.

  3. Pentachoron CV universal band at 0.20–0.23, measured via Cayley-Menger determinants across all architecture families, which emerges spontaneously in the learned shared space.

  4. Geometric fusion transformer achieving R@1 = 1.0000 on held-out COCO (40,775 pairs) and R@1 = 0.99997 on zero-shot Flickr30k (31,014 pairs), trained in one epoch on 40K pairs with frozen encoders.


2. Geometric Terrain Analysis

2.1 Models Profiled

We analyze the inactive (non-embedding) weight topology of 17 models:

Transformers: T5-Small (60.5M), T5-Base (222.9M), T5-v1.1-XXL (11.4B), BERT-large (336.2M), CLIP-ViT-B/16 (85.5M), CLIP-ViT-bigG/14 (1.84B), DINOv2-large (302.0M)

Adapted encoder-decoder: T5Gemma2-1B-1B (2.1B), T5Gemma2-4B-4B (7.5B)

Diffusion UNets: SD 1.5 (860M), SDXL (2.6B)

Convolutional autoencoders (VAEs): SD 1.5, SDXL, Flux.1, Flux.2

2.2 Universal Invariants

Cross-layer weight decorrelation. Adjacent attention layers have cosine similarity ~0.000 across all 17 models. Each layer learns a completely independent function. This holds for attention weights in transformers, UNets, and convolutional filters in VAEs.

Full neuron utilization. 0.00% dead neurons across T5, BERT, DINOv2, both UNets, and all four VAEs (exception: CLIP-ViT-B/16 at 3.6% — attributed to contrastive training at small scale).

Cross-modal QK eigenvalue balance. Whenever one representation space queries another (encoder→decoder, text→image), the QK manifold eigenvalue positive fraction locks at exactly 0.500 with symmetry deviation √2. Confirmed in T5-v1.1-XXL cross-attention (all 24 layers), T5Gemma2 (both scales, all layers), SD 1.5 UNet cross-attention (7 blocks), and SDXL UNet cross-attention (34 blocks). Six independent confirmations across three architecture families.

Spectral invariance. Singular value distributions correlate at 0.94–0.98 across all VAE pairs and BERT-DINOv2, regardless of training objective. The rank structure and energy distribution of weight matrices is architecture-determined, not training-determined.

2.3 Training-Specific Properties

T5 Q sparsity asymmetry. Q projection sparsity scales from 93.7% (T5-Small) to 100.0% (T5-v1.1-XXL) while K remains dense. This is absent in BERT, DINOv2, CLIP, T5Gemma2, UNets, and VAEs — it is specific to T5's span corruption pretraining on C4, not the encoder-decoder architecture.

UNet QK U-gradient. Self-attention positive eigenvalue fraction traces the U-path: repulsion-dominated downpath (0.451), maximum repulsion at bottleneck (0.477–0.483), rising to attraction-dominated uppath (0.549–0.581). The QK manifold traces the reconstruction gradient.

VAE decoder repulsion. Decoder bottleneck attention breaks toward repulsion (0.416–0.486), scaling with latent channel count and target resolution. Reconstruction requires spatial discrimination.


3. Procrustes Alignment

3.1 VAE Weight-Space Alignment

We compute pairwise orthogonal Procrustes alignment across four VAEs (SD 1.5, SDXL, Flux.1, Flux.2) with identical architecture (83.7M params) but different training.

Pair Raw Cosine Procrustes Cosine Spectral Corr
SD1.5 vs SDXL 0.053 0.697 0.958
SD1.5 vs Flux.1 0.091 0.730 0.964
SD1.5 vs Flux.2 -0.000 0.757 0.979
SDXL vs Flux.1 0.024 0.675 0.939
SDXL vs Flux.2 -0.001 0.705 0.937
Flux.1 vs Flux.2 0.000 0.736 0.957

Raw cosine is zero — the weight matrices are orthogonal in raw space. After optimal rotation, 70–76% of the structure aligns. The models found the same geometric solution in different coordinate systems.

3.2 Cross-Modal Alignment: BERT vs DINOv2

BERT-large (language, MLM) and DINOv2-large (vision, self-supervised) share identical architecture: 1024-d, 24 layers, 16 heads. We compute Procrustes on all 96 matching attention weight matrices.

Overall: Procrustes cosine = 0.613, spectral correlation = 0.974.

Depth profile: Alignment increases monotonically from 0.328 (layer 0) to 0.696 (layer 23). Deep layers converge across modalities.

Weight type hierarchy: O projections (0.677) > V (0.647) > K (0.565) ≈ Q (0.564). What attention does with information (O, V) is more universal than what it looks for (Q, K).


4. Pentachoron Geometry

4.1 Cayley-Menger Volume

We measure the coefficient of variation (CV) of pentachoron (4-simplex) volumes in embedding spaces via the Cayley-Menger determinant. For 5 points in d-dimensional space:

V2=(1)524(4!)2det(CM)V^2 = \frac{(-1)^5}{2^4 \cdot (4!)^2} \det(CM)

where CM is the bordered distance matrix. We sample 200 random 5-point subsets and compute CV = std(V) / mean(V).

4.2 Universal CV Band

Across all 17 models profiled, embedding spaces that reach training convergence exhibit pentachoron CV in the range 0.20–0.23. This constant appears in T5, BERT, CLIP, DINOv2, and is now confirmed in the learned cross-modal space.


5. Geometric Fusion Transformer

5.1 Architecture

Frozen encoders: BERT-large (336M) produces text token embeddings (seq_len, 1024). DINOv2-large (302M) produces image patch embeddings (257, 1024). Both frozen during fusion training.

Image pooling: 16 learned query tokens cross-attend into DINOv2's 257 patches, compressing to (16, 1024). Same mechanism as Perceiver/Q-Former.

Fusion sequence: [<|TEXT|>] [text_tok_1..32] [<|IMAGE|>] [img_tok_1..16] = 50 tokens. Bidirectional self-attention. Text tokens attend to image patches. Image tokens attend to text tokens.

Output: Hidden states at <|TEXT|> and <|IMAGE|> positions, projected through learned heads and L2-normalized to produce text and image embeddings.

Loss: InfoNCE contrastive + pentachoron margin (CV regularization) + Procrustes alignment (SVD-based rotational alignment). In practice, contrastive alone achieves identical retrieval performance; the geometric losses shape the manifold without affecting pairwise discrimination.

5.2 Training

  • 40,504 COCO-Caption pairs (val split), 85/15 train/val
  • 1 epoch, batch size 256, AdamW lr=3e-4, cosine schedule
  • FP16 mixed precision
  • Training time: 38 seconds on NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)

5.3 Results

Experiment 1 — Multi-seed (5 seeds × full 40K train → 40K test + 31K Flickr):

Seed COCO R@1 (N=40,775) Flickr R@1 (N=31,014) Cosine CV Joint
42 1.0000 0.99997 0.980 0.198
123 1.0000 0.99997 0.978 0.193
456 1.0000 0.99997 0.980 0.192
789 1.0000 0.99997 0.979 0.186
2024 1.0000 0.99997 0.980 0.207
Mean 1.0000 ± 0.0000 0.99997 ± 0.0000 0.979 0.195

Experiment 2 — Ablation (contrastive only, no geometric loss): R@1 = 1.0000. The geometric loss does not affect retrieval but shapes the manifold (CV = 0.191 without vs 0.195 with).

Experiment 3 — Layer depth:

Layers Params R@1 CV Joint
1 18,971,648 1.0000 0.162
2 27,371,520 1.0000 0.181
4 44,171,264 1.0000 0.193
6 60,971,008 1.0000 0.169

One layer is sufficient for perfect retrieval.

Experiment 4 — Training data scale:

Training Pairs R@1 (N=40,775) Cosine CV Joint
1,000 0.126 0.551 0.417
2,000 0.412 0.804 0.618
5,000 0.994 0.907 0.364
10,000 0.9996 0.959 0.351
20,000 1.000 0.977 0.271
40,504 1.000 0.979 0.198

Sharp phase transition between 2K and 5K. CV monotonically decreases with data.

Experiment 5 — Cross-dataset transfer (train COCO → test Flickr30k): COCO R@1 = 1.0000, Flickr R@1 = 0.99997. Transfer ratio = 1.000.

5.4 Emergent Geometric Constants

The pentachoron CV of the learned shared space converges to the universal band:

Space CV
COCO test — text embeddings 0.196–0.219
COCO test — image embeddings 0.183–0.219
COCO test — joint 0.186–0.207
Flickr30k — text embeddings 0.211–0.240
Flickr30k — image embeddings 0.211–0.236
Flickr30k — joint 0.207–0.248
Universal band (17 models) 0.20–0.23

Text and image embeddings independently settle near the universal band. The geometric constant emerges in a space that was created from scratch, constrained only by contrastive loss.


6. Discussion

6.1 Why One Epoch Suffices

The fusion transformer does not learn to understand language or vision. BERT and DINOv2 already understand their respective modalities after years of pretraining on billions of examples. The fusion transformer learns only the interface: which text tokens should attend to which image patches. This is a far simpler function than the full cross-modal mapping — it is a routing function, not a representation function.

The Procrustes analysis provides the theoretical grounding: the weight-space structure of BERT and DINOv2 is 61% aligned (70% in deep layers). The spectral profiles are 97% correlated. The machinery is already compatible. The fusion transformer needs only minimal paired examples to discover the correspondence.

6.2 The Pentachoron CV as Universal Constant

The convergence of the shared embedding space to CV ≈ 0.20 is the most unexpected finding. This constant was measured in the weight topology of 17 architectures — transformers, UNets, VAEs — trained on text, images, and combinations thereof. It now appears in the activation space of a fusion model trained for 38 seconds.

We hypothesize that the pentachoron CV band represents an optimal packing density for high-dimensional embeddings under gradient-based optimization. It is the geometric attractor of the loss landscape: any sufficiently trained embedding space converges to this regularity regardless of modality, architecture, or training objective.

6.3 Comparison with CLIP

System Training Data Params Trained COCO R@1
CLIP ViT-B/32 400M pairs ~400M ~0.37
CLIP ViT-L/14 400M pairs ~428M ~0.58
Ours (1 epoch) 40K pairs 19M 1.000

Our system uses 10,000× less training data and 21× fewer trainable parameters. However, the comparison is not direct: our frozen encoders (BERT + DINOv2) represent ~640M parameters of prior knowledge. The contribution is not the total compute but the data efficiency: the paired signal required to bridge two pre-trained models is minimal.

6.4 Limitations

The fusion architecture processes text and image tokens together through cross-attention. This means retrieval requires both modalities present at encoding time — it is not a dual-encoder architecture like CLIP. The system performs cross-modal matching, not independent embedding. This is a design choice, not a limitation: the cross-attention is what enables one-epoch learning.

The evaluation uses datasets where images are visually diverse (COCO, Flickr30k). Performance on fine-grained retrieval (distinguishing between visually similar images described by different captions) requires further evaluation.


7. Conclusion

We have shown that the geometric structure of neural representations is universal across modalities. Language models and vision models, trained independently on different data with different objectives, develop weight topologies that are 61% structurally aligned and 97% spectrally correlated. A thin fusion transformer exploits this shared structure to achieve perfect cross-modal retrieval from minimal paired data, and the resulting embedding space spontaneously reproduces geometric constants measured across 17 architectures before the fusion model existed.

The implications extend beyond retrieval. If a single layer of cross-attention can bridge language and vision in one epoch, the barrier to cross-modal AI is not data or compute — it is architecture. The knowledge is already in the models. We just need to build the right interface.


Reproducibility

All 17 model test codes, cached embeddings, and the preliminary test results JSON are available at: AbstractPhil/procrustes-analysis

The multimodal cross-representative shared space prototype with json results and repeatable process is available at: AbstractPhil/geolip-procrustes

Scripts: stage1_precompute.py (embedding cache), stage2_battery.py (full evaluation battery), geometric_fusion_transformer.py (model architecture), and 26 additional analysis scripts covering the 17-model geometric terrain analysis.

Community

Sign up or log in to comment