---
license: mit
tags:
- safetensors
- tensorboard
- geometric-deep-learning
- cross-modal
- multi-modal
- retrieval
- pentachoron
- procrustes
- bert
- dinov2
- whisper
- esm2
- codebert
- contrastive-learning
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---
# GEOLIP-Bertenstein (a.k.a. the GEOLIP-Conduit prototype)
**A multi-expert geometric fusion transformer that bridges 4 independently-trained encoders into a shared embedding space using BERT-large as a universal text hub.**
One layer. One epoch. Near-perfect cross-modal retrieval. Universal pentachoron geometry.
The design builds an alignment bridge across multiple modalities with different dimensionalities and independently trained structures, producing a shared format that downstream systems can consume for useful association.
It uses a form of whitened Procrustes analysis to align the experts through a shared structural boundary with well-defined, well-understood relationships between their complexity profiles.
This is not a mixture-of-experts (MoE); it is a collective understanding and cooperation of alignments through direct and indirect association.
## Results
| Expert Pair | R@1 | Cosine | Pentachoron CV |
|---|---|---|---|
| text ↔ audio (1.5K val) | **1.0000** | 0.972 | **0.203** |
| text ↔ code (5K val) | **0.9996** | 0.988 | **0.195** |
| text ↔ image (5K val) | **1.0000** | 0.986 | **0.196** |
| text ↔ protein (2.3K val) | **0.9987** | 0.979 | **0.200** |
| text ↔ image (40K test) | **1.0000** | 0.980 | **0.208** |
All CV values converge to the **0.20 ± 0.01 universal band**, the same geometric constant measured across 17 neural architectures before this model existed.
The current results used only partial protein and audio datasets for this preliminary run, which left those sample sizes smaller and the numbers less conclusive than a full-dataset run could yield, given the increased data available for alignment.
## Architecture
```
┌──────┐          ┌────────────┐          ┌──────┐
│ BERT │──text──→ │   Shared   │ ←──img───│DINOv2│
│large │          │   Fusion   │          │large │
└──────┘          │Transformer │          └──────┘
                  │ (1 layer)  │
┌──────┐          │   1024-d   │          ┌──────┐
│Whisp.│──audio─→ │  16 heads  │ ←─prot───│ESM-2 │
│large │          │            │          │650M  │
└──────┘          │ Procrustes │          └──────┘
                  │pre-aligned │
┌──────┐          │            │
│Code- │──code──→ │            │
│BERT  │          └────────────┘
└──────┘
```
**Frozen encoders** (not trained, not included in this repo):
- **BERT-large** (336M) – universal text hub
- **DINOv2-large** (302M) – natural images
- **Whisper-large-v3** encoder (1.5B) – speech audio
- **ESM-2-650M** (652M) – protein sequences
- **CodeBERT-base** (125M) – source code
**Trainable fusion** (this repo): **41.5M params**
- 1 shared transformer layer (8.9M)
- 5 expert modules (text + 4 modalities, ~7M each)
- Includes Procrustes pre-alignment buffers per expert
## How It Works
1. **Procrustes Pre-Alignment**: Before training, compute optimal orthogonal rotation to align each expert's coordinate system with BERT's text space. Whitened Procrustes with centering.
2. **Expert Modules**: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.
3. **Fusion Sequence**: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` – bidirectional self-attention across all modalities. Text tokens attend to image patches. Audio tokens attend to text tokens.
4. **Geometric Loss**: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.
5. **Text Hub**: All modalities pair with text during training. Cross-modal alignment (e.g., audio↔image) emerges transitively through the shared text space.
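The contrastive term in step 4 can be sketched as a symmetric InfoNCE over matched (text, modality) pairs. This is an illustrative sketch only; the temperature value and the exact weighting against the geometric terms are assumptions, not taken from `stage2_bertenstein.py`.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, modal_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, modality) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    averages the text->modality and modality->text cross-entropies.
    The temperature is an illustrative default, not the repo's value.
    """
    t = F.normalize(text_emb, dim=1)
    m = F.normalize(modal_emb, dim=1)
    logits = t @ m.T / temperature        # (B, B) cosine similarities
    labels = torch.arange(t.shape[0])     # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```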
## Procrustes Pre-Alignment
| Expert | cos before | cos after | Dimension |
|---|---|---|---|
| audio | 0.0004 | **0.4404** | 1280 → 1024 (PCA down) |
| code | -0.0016 | **0.4036** | 768 → 1024 (zero-pad up) |
| image | 0.0038 | **0.4107** | 1024 → 1024 (direct) |
| protein | 0.0005 | **0.3771** | 1280 → 1024 (PCA down) |
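The pre-alignment above has two stages: match each expert's dimension to the 1024-d hub (PCA down or zero-pad up, per the table), then solve the orthogonal Procrustes problem against paired BERT embeddings. A minimal sketch, assuming row normalization as a simple stand-in for full whitening; the repo's `ProcrustesAligner` may differ in details:

```python
import torch

def match_dim(X, target_dim=1024):
    """Bring (n, d) embeddings to the hub dimension: PCA down or zero-pad up."""
    n, d = X.shape
    if d > target_dim:
        Xc = X - X.mean(0, keepdim=True)
        # Top principal directions via SVD of the centered data matrix
        _, _, Vt = torch.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:target_dim].T
    if d < target_dim:
        return torch.cat([X, torch.zeros(n, target_dim - d, dtype=X.dtype)], dim=1)
    return X

def procrustes_align(X, Y):
    """Orthogonal rotation R minimizing ||X R - Y||_F (both (n, d)).

    Centers both sets and row-normalizes as a simple proxy for whitening,
    then takes the SVD of the cross-covariance; illustrative sketch only.
    """
    Xc = X - X.mean(0, keepdim=True)
    Yc = Y - Y.mean(0, keepdim=True)
    Xc = Xc / Xc.norm(dim=1, keepdim=True)
    Yc = Yc / Yc.norm(dim=1, keepdim=True)
    U, _, Vt = torch.linalg.svd(Xc.T @ Yc)
    return U @ Vt  # (d, d) orthogonal rotation
```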
## Training
- **Data**: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
- **Schedule**: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
- **Hardware**: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- **Time**: ~11 minutes total (3 × ~220 s)
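The cosine decay above can be reproduced with PyTorch's built-in scheduler. `total_steps` and the toy model are hypothetical placeholders; the actual run steps through 3 epochs of round-robin batches.

```python
import torch

# Cosine LR decay from 3e-4 to 1e-6, matching the schedule above.
model = torch.nn.Linear(8, 8)                      # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps = 1000                                 # hypothetical step count
sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=total_steps, eta_min=1e-6)

lrs = []
for _ in range(total_steps):
    opt.step()       # optimizer step first, then scheduler step
    sched.step()
    lrs.append(sched.get_last_lr()[0])
```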
## Key Finding: Universal Pentachoron Geometry
The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to **0.20 ± 0.01** across:
- 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
- 5 architecture families (transformer, UNet, convolutional autoencoder)
- 4 modalities in this fusion model (audio, code, image, protein)
This constant has appeared in every embedding space measured so far that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
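The CV metric described above can be sketched as follows: sample random 5-point subsets of an embedding batch, compute each pentachoron's volume from its Cayley-Menger determinant, and take std/mean. This is an illustrative sketch, not the repo's exact implementation (sampling strategy and batch handling are assumptions).

```python
import torch

def pentachoron_cv(emb, n_samples=2000, seed=0):
    """CV (std/mean) of 4-simplex volumes over random 5-point subsets.

    For 5 vertices with squared pairwise distances D2, the Cayley-Menger
    determinant gives V^2 = -det(CM) / 9216 for a 4-simplex, where CM is
    D2 bordered by ones with a zero corner. Illustrative sketch only.
    """
    g = torch.Generator().manual_seed(seed)
    n = emb.shape[0]
    vols = []
    for _ in range(n_samples):
        P = emb[torch.randperm(n, generator=g)[:5]].double()
        D2 = torch.cdist(P, P) ** 2          # (5, 5) squared distances
        CM = torch.ones(6, 6, dtype=torch.float64)
        CM[0, 0] = 0.0
        CM[1:, 1:] = D2
        v2 = torch.clamp(-torch.linalg.det(CM) / 9216.0, min=0.0)
        vols.append(torch.sqrt(v2))
    v = torch.stack(vols)
    return (v.std() / v.mean()).item()
```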
## File Structure
```
geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md
```
## Usage
```python
from safetensors.torch import load_file
import torch
# Load model weights
state = load_file("checkpoints/final/model.safetensors")
# Reconstruct model (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig
# ... build model, load state_dict, run inference
```
Precomputed embedding caches (Arrow format) for all modalities: [`AbstractPhil/geolip-bertenstein-cache`](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache)
## Geometric Terrain Analysis
The foundational profiling of 17 models and Procrustes alignment analysis: [`AbstractPhil/procrustes-analysis`](https://huggingface.co/AbstractPhil/procrustes-analysis)
## Citation
```bibtex
@misc{abstractphil2026bertenstein,
title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
author={AbstractPhil},
year={2026},
url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}
```
## License
MIT