---
license: mit
tags:
- safetensors
- tensorboard
- geometric-deep-learning
- cross-modal
- multi-modal
- retrieval
- pentachoron
- procrustes
- bert
- dinov2
- whisper
- esm2
- codebert
- contrastive-learning
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# GEOLIP-Bertenstein - AKA the GEOLIP-Conduit prototype

**A multi-expert geometric fusion transformer that bridges 4 independently-trained encoders into a shared embedding space using BERT-large as a universal text hub.**

One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry.

The design creates an alignment bridge across modalities with different dimensionalities and independent internal structures, producing a shared representation that downstream systems can consume for useful cross-modal association. Alignment rests on whitened Procrustes analysis, which rotates each encoder's coordinate system into a shared structural frame with well-defined complexity trade-offs. This is not a mixture-of-experts (MoE): it is a collective, cooperative alignment of encoders through direct and indirect association.

## Results

| Expert Pair | R@1 | Cosine | Pentachoron CV |
|---|---|---|---|
| text ↔ audio (1.5K val) | **1.0000** | 0.972 | **0.203** |
| text ↔ code (5K val) | **0.9996** | 0.988 | **0.195** |
| text ↔ image (5K val) | **1.0000** | 0.986 | **0.196** |
| text ↔ protein (2.3K val) | **0.9987** | 0.979 | **0.200** |
| text ↔ image (40K test) | **1.0000** | 0.980 | **0.208** |

All CV values converge to the **0.20 ± 0.01 universal band** — the same geometric constant measured across 17 neural architectures before this model existed.

These preliminary results used only partial protein and audio datasets, so those sample sizes are smaller and the numbers less conclusive than a full-dataset run, with its greater data capacity for alignment, would yield.
## Architecture

```
                        ┌─────────────┐
┌──────┐                │   Shared    │                ┌──────┐
│ BERT │──text────────→ │   Fusion    │ ←────────img──│DINOv2│
│large │                │ Transformer │                │large │
└──────┘                │  (1 layer)  │                └──────┘
                        │   1024-d    │
┌──────┐                │  16 heads   │                ┌──────┐
│Whisp.│──audio───────→ │             │ ←───────prot──│ESM-2 │
│large │                │ Procrustes  │                │650M  │
└──────┘                │ pre-aligned │                └──────┘
                        │             │
┌──────┐                │             │
│Code- │──code────────→ │             │
│BERT  │                └─────────────┘
└──────┘
```

**Frozen encoders** (not trained, not included in this repo):

- **BERT-large** (336M) — universal text hub
- **DINOv2-large** (302M) — natural images
- **Whisper-large-v3** encoder (1.5B) — speech audio
- **ESM-2-650M** (652M) — protein sequences
- **CodeBERT-base** (125M) — source code

**Trainable fusion** (this repo): **41.5M params**

- 1 shared transformer layer (8.9M)
- 5 expert modules (text + 4 modalities, ~7M each)
- Includes Procrustes pre-alignment buffers per expert

## How It Works

1. **Procrustes Pre-Alignment**: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space — whitened Procrustes with centering.
2. **Expert Modules**: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.
3. **Fusion Sequence**: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` — bidirectional self-attention across all modalities. Text tokens attend to image patches; audio tokens attend to text tokens.
4. **Geometric Loss**: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.
5. **Text Hub**: All modalities pair with text during training. Cross-modal alignment (e.g., audio ↔ image) emerges transitively through the shared text space.
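Step 1 above is the classic orthogonal Procrustes problem. The sketch below is a minimal NumPy illustration under stated assumptions — centering plus Frobenius-norm whitening, then the SVD-based optimal rotation. The function name `procrustes_align` and the exact whitening scheme are illustrative; the repo's implementation (in `stage2_bertenstein.py`, PyTorch) may differ in detail.

```python
import numpy as np

def procrustes_align(X, Y, eps=1e-6):
    """Whitened orthogonal Procrustes with centering (illustrative sketch).

    X: (n, d) expert embeddings, Y: (n, d) BERT text embeddings.
    Returns the aligned copy of X and the orthogonal rotation R.
    """
    # Center both point sets
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Whiten: rescale each set to unit Frobenius norm
    Xw = Xc / (np.linalg.norm(Xc) + eps)
    Yw = Yc / (np.linalg.norm(Yc) + eps)
    # Optimal rotation R minimizing ||Xw @ R - Yw||_F  (SVD solution)
    U, _, Vt = np.linalg.svd(Xw.T @ Yw)
    R = U @ Vt
    return Xw @ R, R
```

For experts whose native width differs from 1024 (audio, code, protein), a PCA reduction or zero-padding step would precede this rotation, as listed in the pre-alignment table below.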
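The contrastive term in step 4 is standard symmetric InfoNCE over in-batch negatives. A minimal NumPy sketch — the function name `info_nce` and the temperature default are assumptions for illustration, not taken from the repo:

```python
import numpy as np

def info_nce(text_emb, modality_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (sketch).

    Row i of each matrix is a positive pair; every other row in the
    batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = modality_emb / np.linalg.norm(modality_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature  # (B, B) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the text→modality and modality→text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the full geometric loss this term would be summed with the pentachoron CV variance and Procrustes SVD alignment penalties described above.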
## Procrustes Pre-Alignment

| Expert | cos before | cos after | Dimension |
|---|---|---|---|
| audio | 0.0004 | **0.4404** | 1280 → 1024 (PCA down) |
| code | -0.0016 | **0.4036** | 768 → 1024 (zero-pad up) |
| image | 0.0038 | **0.4107** | 1024 → 1024 (direct) |
| protein | 0.0005 | **0.3771** | 1280 → 1024 (PCA down) |

## Training

- **Data**: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
- **Schedule**: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
- **Hardware**: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- **Time**: ~11 minutes total (3 × ~220s)

## Key Finding: Universal Pentachoron Geometry

The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to **0.20 ± 0.01** across:

- 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
- 5 architecture families (transformer, UNet, convolutional autoencoder)
- 4 modalities in this fusion model (audio, code, image, protein)

This constant appears to emerge in any embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
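The CV statistic can be reproduced from first principles: sample 5-point subsets (pentachora) from an embedding matrix, compute each 4-simplex volume with the Cayley-Menger determinant, and take std/mean over the sampled volumes. A minimal NumPy sketch — function names and the random-sampling scheme are illustrative assumptions; the repo's measurement pipeline may sample differently:

```python
import math
import numpy as np

def pentachoron_volume(pts):
    """Volume of a 4-simplex from its 5 vertices via Cayley-Menger.

    For an n-simplex: V^2 = (-1)^(n+1) / (2^n (n!)^2) * det(CM),
    where CM is the (n+2)x(n+2) bordered squared-distance matrix.
    """
    n = 4
    D = np.zeros((n + 2, n + 2))
    D[0, 1:] = D[1:, 0] = 1.0
    for i in range(n + 1):
        for j in range(n + 1):
            D[i + 1, j + 1] = np.sum((pts[i] - pts[j]) ** 2)
    v2 = (-1) ** (n + 1) / (2 ** n * math.factorial(n) ** 2) * np.linalg.det(D)
    return math.sqrt(max(v2, 0.0))  # clamp tiny negative values from round-off

def pentachoron_cv(emb, n_samples=200, seed=0):
    """CV (std/mean) of volumes of randomly sampled pentachora."""
    rng = np.random.default_rng(seed)
    vols = []
    for _ in range(n_samples):
        idx = rng.choice(len(emb), size=5, replace=False)
        vols.append(pentachoron_volume(emb[idx]))
    vols = np.asarray(vols)
    return vols.std() / vols.mean()
```

Sanity check: the five standard basis vectors of R^5 form a regular 4-simplex with edge length √2, whose volume is √5/24.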
## File Structure

```
geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md
```

## Usage

```python
from safetensors.torch import load_file
import torch

# Load model weights
state = load_file("checkpoints/final/model.safetensors")

# Reconstruct model (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig
# ... build model, load state_dict, run inference
```

Precomputed embedding caches (Arrow format) for all modalities:
[`AbstractPhil/geolip-bertenstein-cache`](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache)

## Geometric Terrain Analysis

The foundational profiling of 17 models and Procrustes alignment analysis:
[`AbstractPhil/procrustes-analysis`](https://huggingface.co/AbstractPhil/procrustes-analysis)

## Citation

```bibtex
@misc{abstractphil2026bertenstein,
  title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}
```

## License

MIT