---
license: mit
tags:
- safetensors
- tensorboard
- geometric-deep-learning
- cross-modal
- multi-modal
- retrieval
- pentachoron
- procrustes
- bert
- dinov2
- whisper
- esm2
- codebert
- contrastive-learning
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# GEOLIP-Bertenstein - AKA the GEOLIP-Conduit prototype

**A multi-expert geometric fusion transformer that bridges 4 independently-trained encoders into a shared embedding space using BERT-large as a universal text hub.**

One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry.

The design creates an alignment bridge across modalities with different dimensionalities and independent internal structures, producing a shared representation that downstream systems can consume for useful cross-modal association. Alignment rests on whitened Procrustes analysis, which rotates each encoder's coordinate system into a shared structural frame with well-defined complexity trade-offs. This is not a mixture-of-experts (MoE): it is a collective, cooperative alignment of encoders through direct and indirect association.

## Results

| Expert Pair | R@1 | Cosine | Pentachoron CV |
|---|---|---|---|
| text ↔ audio (1.5K val) | **1.0000** | 0.972 | **0.203** |
| text ↔ code (5K val) | **0.9996** | 0.988 | **0.195** |
| text ↔ image (5K val) | **1.0000** | 0.986 | **0.196** |
| text ↔ protein (2.3K val) | **0.9987** | 0.979 | **0.200** |
| text ↔ image (40K test) | **1.0000** | 0.980 | **0.208** |

All CV values converge to the **0.20 ± 0.01 universal band** — the same geometric constant measured across 17 neural architectures before this model existed.

These preliminary results used only partial protein and audio datasets, so those sample sizes are smaller and the numbers less conclusive than a full-dataset run, with its greater data capacity for alignment, would yield.
## Architecture

```
                        ┌─────────────┐
┌──────┐                │   Shared    │                ┌──────┐
│ BERT │──text────────→ │   Fusion    │ ←────────img──│DINOv2│
│large │                │ Transformer │                │large │
└──────┘                │  (1 layer)  │                └──────┘
                        │   1024-d    │
┌──────┐                │  16 heads   │                ┌──────┐
│Whisp.│──audio───────→ │             │ ←───────prot──│ESM-2 │
│large │                │ Procrustes  │                │650M  │
└──────┘                │ pre-aligned │                └──────┘
                        │             │
┌──────┐                │             │
│Code- │──code────────→ │             │
│BERT  │                └─────────────┘
└──────┘
```

**Frozen encoders** (not trained, not included in this repo):

- **BERT-large** (336M) — universal text hub
- **DINOv2-large** (302M) — natural images
- **Whisper-large-v3** encoder (1.5B) — speech audio
- **ESM-2-650M** (652M) — protein sequences
- **CodeBERT-base** (125M) — source code

**Trainable fusion** (this repo): **41.5M params**

- 1 shared transformer layer (8.9M)
- 5 expert modules (text + 4 modalities, ~7M each)
- Includes Procrustes pre-alignment buffers per expert

## How It Works

1. **Procrustes Pre-Alignment**: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space — whitened Procrustes with centering.
2. **Expert Modules**: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.
3. **Fusion Sequence**: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` — bidirectional self-attention across all modalities. Text tokens attend to image patches; audio tokens attend to text tokens.
4. **Geometric Loss**: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.
5. **Text Hub**: All modalities pair with text during training. Cross-modal alignment (e.g., audio ↔ image) emerges transitively through the shared text space.
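Step 1 above is the classic orthogonal Procrustes problem. The sketch below is a minimal NumPy illustration under stated assumptions — centering plus Frobenius-norm whitening, then the SVD-based optimal rotation. The function name `procrustes_align` and the exact whitening scheme are illustrative; the repo's implementation (in `stage2_bertenstein.py`, PyTorch) may differ in detail.

```python
import numpy as np

def procrustes_align(X, Y, eps=1e-6):
    """Whitened orthogonal Procrustes with centering (illustrative sketch).

    X: (n, d) expert embeddings, Y: (n, d) BERT text embeddings.
    Returns the aligned copy of X and the orthogonal rotation R.
    """
    # Center both point sets
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Whiten: rescale each set to unit Frobenius norm
    Xw = Xc / (np.linalg.norm(Xc) + eps)
    Yw = Yc / (np.linalg.norm(Yc) + eps)
    # Optimal rotation R minimizing ||Xw @ R - Yw||_F  (SVD solution)
    U, _, Vt = np.linalg.svd(Xw.T @ Yw)
    R = U @ Vt
    return Xw @ R, R
```

For experts whose native width differs from 1024 (audio, code, protein), a PCA reduction or zero-padding step would precede this rotation, as listed in the pre-alignment table below.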
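The contrastive term in step 4 is standard symmetric InfoNCE over in-batch negatives. A minimal NumPy sketch — the function name `info_nce` and the temperature default are assumptions for illustration, not taken from the repo:

```python
import numpy as np

def info_nce(text_emb, modality_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (sketch).

    Row i of each matrix is a positive pair; every other row in the
    batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = modality_emb / np.linalg.norm(modality_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature  # (B, B) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the text→modality and modality→text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the full geometric loss this term would be summed with the pentachoron CV variance and Procrustes SVD alignment penalties described above.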
## Procrustes Pre-Alignment

| Expert | cos before | cos after | Dimension |
|---|---|---|---|
| audio | 0.0004 | **0.4404** | 1280 → 1024 (PCA down) |
| code | -0.0016 | **0.4036** | 768 → 1024 (zero-pad up) |
| image | 0.0038 | **0.4107** | 1024 → 1024 (direct) |
| protein | 0.0005 | **0.3771** | 1280 → 1024 (PCA down) |

## Training

- **Data**: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
- **Schedule**: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
- **Hardware**: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- **Time**: ~11 minutes total (3 × ~220s)

## Key Finding: Universal Pentachoron Geometry

The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to **0.20 ± 0.01** across:

- 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
- 5 architecture families (transformer, UNet, convolutional autoencoder)
- 4 modalities in this fusion model (audio, code, image, protein)

This constant appears to emerge in any embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
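The CV statistic can be reproduced from first principles: sample 5-point subsets (pentachora) from an embedding matrix, compute each 4-simplex volume with the Cayley-Menger determinant, and take std/mean over the sampled volumes. A minimal NumPy sketch — function names and the random-sampling scheme are illustrative assumptions; the repo's measurement pipeline may sample differently:

```python
import math
import numpy as np

def pentachoron_volume(pts):
    """Volume of a 4-simplex from its 5 vertices via Cayley-Menger.

    For an n-simplex: V^2 = (-1)^(n+1) / (2^n (n!)^2) * det(CM),
    where CM is the (n+2)x(n+2) bordered squared-distance matrix.
    """
    n = 4
    D = np.zeros((n + 2, n + 2))
    D[0, 1:] = D[1:, 0] = 1.0
    for i in range(n + 1):
        for j in range(n + 1):
            D[i + 1, j + 1] = np.sum((pts[i] - pts[j]) ** 2)
    v2 = (-1) ** (n + 1) / (2 ** n * math.factorial(n) ** 2) * np.linalg.det(D)
    return math.sqrt(max(v2, 0.0))  # clamp tiny negative values from round-off

def pentachoron_cv(emb, n_samples=200, seed=0):
    """CV (std/mean) of volumes of randomly sampled pentachora."""
    rng = np.random.default_rng(seed)
    vols = []
    for _ in range(n_samples):
        idx = rng.choice(len(emb), size=5, replace=False)
        vols.append(pentachoron_volume(emb[idx]))
    vols = np.asarray(vols)
    return vols.std() / vols.mean()
```

Sanity check: the five standard basis vectors of R^5 form a regular 4-simplex with edge length √2, whose volume is √5/24.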
## File Structure

```
geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md
```

## Usage

```python
from safetensors.torch import load_file
import torch

# Load model weights
state = load_file("checkpoints/final/model.safetensors")

# Reconstruct model (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig
# ... build model, load state_dict, run inference
```

Precomputed embedding caches (Arrow format) for all modalities:
[`AbstractPhil/geolip-bertenstein-cache`](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache)

## Geometric Terrain Analysis

The foundational profiling of 17 models and Procrustes alignment analysis:
[`AbstractPhil/procrustes-analysis`](https://huggingface.co/AbstractPhil/procrustes-analysis)

## Citation

```bibtex
@misc{abstractphil2026bertenstein,
  title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}
```

## License

MIT