| --- |
| license: mit |
| tags: |
| - safetensors |
| - tensorboard |
| - geometric-deep-learning |
| - cross-modal |
| - multi-modal |
| - retrieval |
| - pentachoron |
| - procrustes |
| - bert |
| - dinov2 |
| - whisper |
| - esm2 |
| - codebert |
| - contrastive-learning |
| language: |
| - en |
| library_name: pytorch |
| pipeline_tag: feature-extraction |
| --- |
| |
| # GEOLIP-Bertenstein - AKA the GEOLIP-Conduit prototype |
|
|
| **A multi-expert geometric fusion transformer that bridges 4 independently-trained encoders into a shared embedding space using BERT-large as a universal text hub.** |
|
|
| One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry. |
|
|
The design builds an alignment bridge across multiple modalities, embedding dimensions, and independently trained structures, producing a single format that a downstream system can consume for useful association.
|
|
The system uses whitened Procrustes analysis to align multiple sources of information through a shared structural boundary, giving each pairing a well-defined and well-understood associative structure.
|
|
This is not a mixture-of-experts (MoE). It is a collective understanding and cooperation of alignments, achieved through both direct and indirect association.
|
|
| ## Results |
|
|
| | Expert Pair | R@1 | Cosine | Pentachoron CV | |
| |---|---|---|---| |
| text ↔ audio (1.5K val) | **1.0000** | 0.972 | **0.203** |
| text ↔ code (5K val) | **0.9996** | 0.988 | **0.195** |
| text ↔ image (5K val) | **1.0000** | 0.986 | **0.196** |
| text ↔ protein (2.3K val) | **0.9987** | 0.979 | **0.200** |
| text ↔ image (40K test) | **1.0000** | 0.980 | **0.208** |
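The R@1 column is the standard paired nearest-neighbour metric: the fraction of queries whose top-ranked gallery item (by cosine similarity) is their own pair. A minimal sketch of how such a number is computed (illustrative, not the repo's evaluation code):

```python
import torch

def recall_at_1(query_emb, gallery_emb):
    """R@1 for paired embeddings [N, D]: row i of the query matrix is
    the true match of row i of the gallery matrix."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    g = torch.nn.functional.normalize(gallery_emb, dim=-1)
    pred = (q @ g.T).argmax(dim=1)                  # nearest gallery item per query
    return (pred == torch.arange(len(q))).float().mean().item()

# Sanity check: identical pairs retrieve perfectly
e = torch.randn(100, 64)
r1 = recall_at_1(e, e)   # 1.0
```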
|
|
All CV values converge to the **0.20 ± 0.01 universal band**, the same geometric constant measured across 17 neural architectures before this model existed.
|
|
These preliminary results used only partial protein and audio datasets, so those sample sizes are smaller and the conclusions less firm than a full-dataset run would allow.
|
|
| ## Architecture |
|
|
| ``` |
| βββββββββββββββ |
| β Shared β |
| ββββββββ β Fusion β ββββββββ |
| β BERT βββtextβββ β Transformerβ βββimgβββDINOv2β |
| βlarge β β (1 layer) β βlarge β |
| ββββββββ β 1024-d β ββββββββ |
| β 16 heads β |
| ββββββββ β β ββββββββ |
| βWhisp.βββaudioββ β Procrustes β ββprotβββESM-2 β |
| βlarge β β pre-alignedβ β650M β |
| ββββββββ β β ββββββββ |
| β β |
| ββββββββ β β |
| βCode- βββcodeβββ β β |
| βBERT β βββββββββββββββ |
| ββββββββ |
| ``` |
|
|
| **Frozen encoders** (not trained, not included in this repo): |
- **BERT-large** (336M) – universal text hub
- **DINOv2-large** (302M) – natural images
- **Whisper-large-v3** encoder (1.5B) – speech audio
- **ESM-2-650M** (652M) – protein sequences
- **CodeBERT-base** (125M) – source code
|
|
| **Trainable fusion** (this repo): **41.5M params** |
| - 1 shared transformer layer (8.9M) |
| - 5 expert modules (text + 4 modalities, ~7M each) |
| - Includes Procrustes pre-alignment buffers per expert |
|
|
| ## How It Works |
|
|
1. **Procrustes Pre-Alignment**: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space, using whitened Procrustes with centering.
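   The classic orthogonal Procrustes solution comes from the SVD of the cross-covariance between the two (centered, whitened) point clouds. A minimal sketch under those assumptions; see `stage2_bertenstein.py` for the actual `ProcrustesAligner`:

   ```python
   import torch

   def procrustes_rotation(X, Y, eps=1e-6):
       """Orthogonal map R that best rotates expert space X onto text-hub
       space Y (both [N, D] paired embeddings), after centering + whitening.
       Illustrative sketch, not the repo's implementation."""
       Xc = X - X.mean(dim=0, keepdim=True)          # center
       Yc = Y - Y.mean(dim=0, keepdim=True)
       Xw = Xc / (Xc.std(dim=0, keepdim=True) + eps) # whiten per feature
       Yw = Yc / (Yc.std(dim=0, keepdim=True) + eps)
       U, _, Vt = torch.linalg.svd(Xw.T @ Yw)        # SVD of cross-covariance
       return U @ Vt                                 # [D, D] orthogonal rotation

   X, Y = torch.randn(512, 64), torch.randn(512, 64)
   R = procrustes_rotation(X, Y)                     # R @ R.T == I
   ```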
|
|
2. **Expert Modules**: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.
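   Learned cross-attention pooling can be sketched with a fixed set of query vectors attending over the variable-length encoder output (module name and details here are illustrative, not the repo's `ExpertModule`):

   ```python
   import torch
   import torch.nn as nn

   class AttentionPool(nn.Module):
       """Compress a variable-length token sequence (e.g. 257 image patches
       or 1500 audio frames) into a fixed number of tokens via learned
       queries. Minimal sketch."""
       def __init__(self, dim=1024, n_queries=16, n_heads=8):
           super().__init__()
           self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
           self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

       def forward(self, tokens):                          # tokens: [B, T, dim]
           q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
           pooled, _ = self.attn(q, tokens, tokens)        # queries attend to tokens
           return pooled                                   # [B, n_queries, dim]

   pool = AttentionPool(dim=64, n_queries=16, n_heads=4)
   out = pool(torch.randn(2, 257, 64))                     # 257 patches -> 16 tokens
   ```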
|
|
3. **Fusion Sequence**: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` → bidirectional self-attention across all modalities. Text tokens attend to image patches. Audio tokens attend to text tokens.
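   In shape terms, the fusion step is a concatenation of per-modality token blocks fed through a bidirectional transformer layer (toy dimensions, standard `nn.TransformerEncoderLayer` standing in for the repo's shared layer):

   ```python
   import torch

   B, dim = 2, 64
   text_tok = torch.randn(B, 1, dim)    # <|TEXT|> special token
   text_seq = torch.randn(B, 16, dim)   # 16 pooled text tokens
   img_tok  = torch.randn(B, 1, dim)    # <|IMAGE|> special token
   img_seq  = torch.randn(B, 16, dim)   # 16 pooled image tokens

   # One sequence, full bidirectional attention across all modalities
   fusion_input = torch.cat([text_tok, text_seq, img_tok, img_seq], dim=1)
   layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
   fused = layer(fusion_input)          # [B, 34, dim]
   ```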
|
|
| 4. **Geometric Loss**: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band. |
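   The InfoNCE term of that loss is the standard symmetric contrastive objective over a batch of paired embeddings; the pentachoron-CV and Procrustes terms are added on top. A sketch of the contrastive component only (temperature value is an assumption):

   ```python
   import torch
   import torch.nn.functional as F

   def info_nce(text_emb, other_emb, temperature=0.07):
       """Symmetric InfoNCE over paired embeddings [B, D]: row i of each
       matrix is a positive pair; all other rows are negatives."""
       t = F.normalize(text_emb, dim=-1)
       o = F.normalize(other_emb, dim=-1)
       logits = t @ o.T / temperature          # [B, B] similarity matrix
       labels = torch.arange(len(t))           # positives on the diagonal
       return 0.5 * (F.cross_entropy(logits, labels) +
                     F.cross_entropy(logits.T, labels))

   loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
   ```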
|
|
5. **Text Hub**: All modalities pair with text during training. Cross-modal alignment (e.g., audio↔image) emerges transitively through the shared text space.
|
|
| ## Procrustes Pre-Alignment |
|
|
| | Expert | cos before | cos after | Dimension | |
| |---|---|---|---| |
| audio | 0.0004 | **0.4404** | 1280 → 1024 (PCA down) |
| code | -0.0016 | **0.4036** | 768 → 1024 (zero-pad up) |
| image | 0.0038 | **0.4107** | 1024 → 1024 (direct) |
| protein | 0.0005 | **0.3771** | 1280 → 1024 (PCA down) |
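The dimension column describes how each expert is brought to the 1024-d hub width before rotation: PCA projection when the expert is wider, zero-padding when it is narrower. A sketch of that matching step, with toy dimensions in the example (the repo's exact preprocessing may differ):

```python
import torch

def match_dim(X, target=1024):
    """Bring an expert embedding matrix [N, D] to the hub width:
    PCA down when D > target (e.g. 1280 -> 1024 for audio/protein),
    zero-pad up when D < target (e.g. 768 -> 1024 for code)."""
    N, D = X.shape
    if D == target:
        return X                                   # direct (image)
    if D > target:
        Xc = X - X.mean(dim=0, keepdim=True)
        # Top principal directions from the SVD of the centered data
        _, _, Vt = torch.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:target].T                  # [N, target]
    return torch.cat([X, X.new_zeros(N, target - D)], dim=1)

down = match_dim(torch.randn(300, 80), target=64)  # PCA down
up   = match_dim(torch.randn(300, 48), target=64)  # zero-pad up
```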
|
|
| ## Training |
|
|
| - **Data**: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K) |
- **Schedule**: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
| - **Hardware**: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM) |
| - **Time**: ~11 minutes total (3 Γ ~220s) |
|
|
| ## Key Finding: Universal Pentachoron Geometry |
|
|
The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to **0.20 ± 0.01** across:
|
|
| - 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs) |
| - 5 architecture families (transformer, UNet, convolutional autoencoder) |
| - 4 modalities in this fusion model (audio, code, image, protein) |
|
|
| This constant emerges in ANY embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective. |
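The measurement itself can be sketched directly from the Cayley-Menger formula for an n-simplex, sampling random 5-point subsets of an embedding matrix (the repo's exact sampling scheme is an assumption here):

```python
import torch

def pentachoron_volume(points):
    """Volume of a 4-simplex from its 5 vertices [5, D] via the
    Cayley-Menger determinant."""
    d2 = torch.cdist(points, points) ** 2           # [5, 5] squared distances
    B = torch.ones(6, 6, dtype=points.dtype)
    B[0, 0] = 0.0
    B[1:, 1:] = d2
    # n = 4: (-1)^(n+1) * 2^n * (n!)^2 * V^2 = det(B)  =>  V^2 = -det(B) / 9216
    v2 = -torch.linalg.det(B) / 9216.0
    return torch.sqrt(torch.clamp(v2, min=0.0))     # clamp guards float noise

def pentachoron_cv(emb, n_samples=2000, seed=0):
    """CV (std / mean) of volumes over random 5-point samples of [N, D]."""
    g = torch.Generator().manual_seed(seed)
    vols = torch.stack([
        pentachoron_volume(emb[torch.randperm(emb.shape[0], generator=g)[:5]])
        for _ in range(n_samples)
    ])
    return (vols.std() / vols.mean()).item()

cv = pentachoron_cv(torch.randn(500, 32), n_samples=500)
```

Note that a raw Gaussian cloud will not sit in the 0.20 band; the claim is that trained embedding manifolds converge there.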
|
|
| ## File Structure |
|
|
| ``` |
| geolip-bertenstein/ |
| βββ checkpoints/ |
| β βββ epoch_001/ |
| β β βββ model.safetensors |
| β β βββ loss.safetensors |
| β β βββ training_state.pt |
| β β βββ config.json |
| β βββ epoch_002/ |
| β βββ epoch_003/ |
| β βββ final/ |
| β βββ model.safetensors |
| β βββ loss.safetensors |
| β βββ training_state.pt |
| β βββ config.json |
| β βββ aligner_audio.safetensors |
| β βββ aligner_code.safetensors |
| β βββ aligner_image.safetensors |
| β βββ aligner_protein.safetensors |
| βββ tensorboard/ |
| βββ bertenstein_results.json |
| βββ stage2_bertenstein.py |
| βββ README.md |
| ``` |
|
|
| ## Usage |
|
|
| ```python |
| from safetensors.torch import load_file |
| import torch |
| |
| # Load model weights |
| state = load_file("checkpoints/final/model.safetensors") |
| |
| # Reconstruct model (see stage2_bertenstein.py for full class definitions) |
| from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig |
| |
| # ... build model, load state_dict, run inference |
| ``` |
|
|
| Precomputed embedding caches (Arrow format) for all modalities: [`AbstractPhil/geolip-bertenstein-cache`](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache) |
|
|
| ## Geometric Terrain Analysis |
|
|
| The foundational profiling of 17 models and Procrustes alignment analysis: [`AbstractPhil/procrustes-analysis`](https://huggingface.co/AbstractPhil/procrustes-analysis) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{abstractphil2026bertenstein, |
| title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry}, |
| author={AbstractPhil}, |
| year={2026}, |
| url={https://huggingface.co/AbstractPhil/geolip-bertenstein} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT |