| --- |
| license: mit |
| tags: |
| - safetensors |
| - tensorboard |
| - geometric-deep-learning |
| - cross-modal |
| - multi-modal |
| - retrieval |
| - pentachoron |
| - procrustes |
| - bert |
| - dinov2 |
| - whisper |
| - esm2 |
| - codebert |
| - contrastive-learning |
| language: |
| - en |
| library_name: pytorch |
| pipeline_tag: feature-extraction |
| --- |
| |
| # GEOLIP-Bertenstein - AKA the GEOLIP-Conduit prototype |
|
|
| **A multi-expert geometric fusion transformer that bridges 4 independently-trained encoders into a shared embedding space using BERT-large as a universal text hub.** |
|
|
| One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry. |
|
|
The design builds an alignment bridge across multiple modalities, embedding dimensions, and independently trained structures, producing a single format that a downstream system can consume for useful association.
|
|
The system uses whitened Procrustes analysis to align multiple sources of information through a shared structural boundary, giving each pairing a well-defined and well-understood associative structure.
|
|
This is not a mixture-of-experts (MoE). It is a collective understanding and cooperation of alignments, achieved through both direct and indirect association.
|
|
| ## Results |
|
|
| | Expert Pair | R@1 | Cosine | Pentachoron CV | |
| |---|---|---|---| |
| text ↔ audio (1.5K val) | **1.0000** | 0.972 | **0.203** |
| text ↔ code (5K val) | **0.9996** | 0.988 | **0.195** |
| text ↔ image (5K val) | **1.0000** | 0.986 | **0.196** |
| text ↔ protein (2.3K val) | **0.9987** | 0.979 | **0.200** |
| text ↔ image (40K test) | **1.0000** | 0.980 | **0.208** |
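The R@1 column is the standard paired nearest-neighbour metric: the fraction of queries whose top-ranked gallery item (by cosine similarity) is their own pair. A minimal sketch of how such a number is computed (illustrative, not the repo's evaluation code):

```python
import torch

def recall_at_1(query_emb, gallery_emb):
    """R@1 for paired embeddings [N, D]: row i of the query matrix is
    the true match of row i of the gallery matrix."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    g = torch.nn.functional.normalize(gallery_emb, dim=-1)
    pred = (q @ g.T).argmax(dim=1)                  # nearest gallery item per query
    return (pred == torch.arange(len(q))).float().mean().item()

# Sanity check: identical pairs retrieve perfectly
e = torch.randn(100, 64)
r1 = recall_at_1(e, e)   # 1.0
```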
|
|
All CV values converge to the **0.20 ± 0.01 universal band**, the same geometric constant measured across 17 neural architectures before this model existed.
|
|
These preliminary results used only partial protein and audio datasets, so those sample sizes are smaller and the conclusions less firm than a full-dataset run would allow.
|
|
| ## Architecture |
|
|
| ``` |
| βββββββββββββββ |
| β Shared β |
| ββββββββ β Fusion β ββββββββ |
| β BERT βββtextβββ β Transformerβ βββimgβββDINOv2β |
| βlarge β β (1 layer) β βlarge β |
| ββββββββ β 1024-d β ββββββββ |
| β 16 heads β |
| ββββββββ β β ββββββββ |
| βWhisp.βββaudioββ β Procrustes β ββprotβββESM-2 β |
| βlarge β β pre-alignedβ β650M β |
| ββββββββ β β ββββββββ |
| β β |
| ββββββββ β β |
| βCode- βββcodeβββ β β |
| βBERT β βββββββββββββββ |
| ββββββββ |
| ``` |
|
|
| **Frozen encoders** (not trained, not included in this repo): |
- **BERT-large** (336M) – universal text hub
- **DINOv2-large** (302M) – natural images
- **Whisper-large-v3** encoder (1.5B) – speech audio
- **ESM-2-650M** (652M) – protein sequences
- **CodeBERT-base** (125M) – source code
|
|
| **Trainable fusion** (this repo): **41.5M params** |
| - 1 shared transformer layer (8.9M) |
| - 5 expert modules (text + 4 modalities, ~7M each) |
| - Includes Procrustes pre-alignment buffers per expert |
|
|
| ## How It Works |
|
|
1. **Procrustes Pre-Alignment**: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space, using whitened Procrustes with centering.
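   The classic orthogonal Procrustes solution comes from the SVD of the cross-covariance between the two (centered, whitened) point clouds. A minimal sketch under those assumptions; see `stage2_bertenstein.py` for the actual `ProcrustesAligner`:

   ```python
   import torch

   def procrustes_rotation(X, Y, eps=1e-6):
       """Orthogonal map R that best rotates expert space X onto text-hub
       space Y (both [N, D] paired embeddings), after centering + whitening.
       Illustrative sketch, not the repo's implementation."""
       Xc = X - X.mean(dim=0, keepdim=True)          # center
       Yc = Y - Y.mean(dim=0, keepdim=True)
       Xw = Xc / (Xc.std(dim=0, keepdim=True) + eps) # whiten per feature
       Yw = Yc / (Yc.std(dim=0, keepdim=True) + eps)
       U, _, Vt = torch.linalg.svd(Xw.T @ Yw)        # SVD of cross-covariance
       return U @ Vt                                 # [D, D] orthogonal rotation

   X, Y = torch.randn(512, 64), torch.randn(512, 64)
   R = procrustes_rotation(X, Y)                     # R @ R.T == I
   ```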
|
|
2. **Expert Modules**: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.
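   Learned cross-attention pooling can be sketched with a fixed set of query vectors attending over the variable-length encoder output (module name and details here are illustrative, not the repo's `ExpertModule`):

   ```python
   import torch
   import torch.nn as nn

   class AttentionPool(nn.Module):
       """Compress a variable-length token sequence (e.g. 257 image patches
       or 1500 audio frames) into a fixed number of tokens via learned
       queries. Minimal sketch."""
       def __init__(self, dim=1024, n_queries=16, n_heads=8):
           super().__init__()
           self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
           self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

       def forward(self, tokens):                          # tokens: [B, T, dim]
           q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
           pooled, _ = self.attn(q, tokens, tokens)        # queries attend to tokens
           return pooled                                   # [B, n_queries, dim]

   pool = AttentionPool(dim=64, n_queries=16, n_heads=4)
   out = pool(torch.randn(2, 257, 64))                     # 257 patches -> 16 tokens
   ```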
|
|
3. **Fusion Sequence**: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` → bidirectional self-attention across all modalities. Text tokens attend to image patches. Audio tokens attend to text tokens.
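   In shape terms, the fusion step is a concatenation of per-modality token blocks fed through a bidirectional transformer layer (toy dimensions, standard `nn.TransformerEncoderLayer` standing in for the repo's shared layer):

   ```python
   import torch

   B, dim = 2, 64
   text_tok = torch.randn(B, 1, dim)    # <|TEXT|> special token
   text_seq = torch.randn(B, 16, dim)   # 16 pooled text tokens
   img_tok  = torch.randn(B, 1, dim)    # <|IMAGE|> special token
   img_seq  = torch.randn(B, 16, dim)   # 16 pooled image tokens

   # One sequence, full bidirectional attention across all modalities
   fusion_input = torch.cat([text_tok, text_seq, img_tok, img_seq], dim=1)
   layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
   fused = layer(fusion_input)          # [B, 34, dim]
   ```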
|
|
| 4. **Geometric Loss**: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band. |
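   The InfoNCE term of that loss is the standard symmetric contrastive objective over a batch of paired embeddings; the pentachoron-CV and Procrustes terms are added on top. A sketch of the contrastive component only (temperature value is an assumption):

   ```python
   import torch
   import torch.nn.functional as F

   def info_nce(text_emb, other_emb, temperature=0.07):
       """Symmetric InfoNCE over paired embeddings [B, D]: row i of each
       matrix is a positive pair; all other rows are negatives."""
       t = F.normalize(text_emb, dim=-1)
       o = F.normalize(other_emb, dim=-1)
       logits = t @ o.T / temperature          # [B, B] similarity matrix
       labels = torch.arange(len(t))           # positives on the diagonal
       return 0.5 * (F.cross_entropy(logits, labels) +
                     F.cross_entropy(logits.T, labels))

   loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
   ```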
|
|
5. **Text Hub**: All modalities pair with text during training. Cross-modal alignment (e.g., audio↔image) emerges transitively through the shared text space.
|
|
| ## Procrustes Pre-Alignment |
|
|
| | Expert | cos before | cos after | Dimension | |
| |---|---|---|---| |
| audio | 0.0004 | **0.4404** | 1280 → 1024 (PCA down) |
| code | -0.0016 | **0.4036** | 768 → 1024 (zero-pad up) |
| image | 0.0038 | **0.4107** | 1024 → 1024 (direct) |
| protein | 0.0005 | **0.3771** | 1280 → 1024 (PCA down) |
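The dimension column describes how each expert is brought to the 1024-d hub width before rotation: PCA projection when the expert is wider, zero-padding when it is narrower. A sketch of that matching step, with toy dimensions in the example (the repo's exact preprocessing may differ):

```python
import torch

def match_dim(X, target=1024):
    """Bring an expert embedding matrix [N, D] to the hub width:
    PCA down when D > target (e.g. 1280 -> 1024 for audio/protein),
    zero-pad up when D < target (e.g. 768 -> 1024 for code)."""
    N, D = X.shape
    if D == target:
        return X                                   # direct (image)
    if D > target:
        Xc = X - X.mean(dim=0, keepdim=True)
        # Top principal directions from the SVD of the centered data
        _, _, Vt = torch.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:target].T                  # [N, target]
    return torch.cat([X, X.new_zeros(N, target - D)], dim=1)

down = match_dim(torch.randn(300, 80), target=64)  # PCA down
up   = match_dim(torch.randn(300, 48), target=64)  # zero-pad up
```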
|
|
| ## Training |
|
|
| - **Data**: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K) |
- **Schedule**: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
| - **Hardware**: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM) |
| - **Time**: ~11 minutes total (3 Γ ~220s) |
|
|
| ## Key Finding: Universal Pentachoron Geometry |
|
|
The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to **0.20 ± 0.01** across:
|
|
| - 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs) |
| - 5 architecture families (transformer, UNet, convolutional autoencoder) |
| - 4 modalities in this fusion model (audio, code, image, protein) |
|
|
| This constant emerges in ANY embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective. |
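The measurement itself can be sketched directly from the Cayley-Menger formula for an n-simplex, sampling random 5-point subsets of an embedding matrix (the repo's exact sampling scheme is an assumption here):

```python
import torch

def pentachoron_volume(points):
    """Volume of a 4-simplex from its 5 vertices [5, D] via the
    Cayley-Menger determinant."""
    d2 = torch.cdist(points, points) ** 2           # [5, 5] squared distances
    B = torch.ones(6, 6, dtype=points.dtype)
    B[0, 0] = 0.0
    B[1:, 1:] = d2
    # n = 4: (-1)^(n+1) * 2^n * (n!)^2 * V^2 = det(B)  =>  V^2 = -det(B) / 9216
    v2 = -torch.linalg.det(B) / 9216.0
    return torch.sqrt(torch.clamp(v2, min=0.0))     # clamp guards float noise

def pentachoron_cv(emb, n_samples=2000, seed=0):
    """CV (std / mean) of volumes over random 5-point samples of [N, D]."""
    g = torch.Generator().manual_seed(seed)
    vols = torch.stack([
        pentachoron_volume(emb[torch.randperm(emb.shape[0], generator=g)[:5]])
        for _ in range(n_samples)
    ])
    return (vols.std() / vols.mean()).item()

cv = pentachoron_cv(torch.randn(500, 32), n_samples=500)
```

Note that a raw Gaussian cloud will not sit in the 0.20 band; the claim is that trained embedding manifolds converge there.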
|
|
| ## File Structure |
|
|
| ``` |
| geolip-bertenstein/ |
| βββ checkpoints/ |
| β βββ epoch_001/ |
| β β βββ model.safetensors |
| β β βββ loss.safetensors |
| β β βββ training_state.pt |
| β β βββ config.json |
| β βββ epoch_002/ |
| β βββ epoch_003/ |
| β βββ final/ |
| β βββ model.safetensors |
| β βββ loss.safetensors |
| β βββ training_state.pt |
| β βββ config.json |
| β βββ aligner_audio.safetensors |
| β βββ aligner_code.safetensors |
| β βββ aligner_image.safetensors |
| β βββ aligner_protein.safetensors |
| βββ tensorboard/ |
| βββ bertenstein_results.json |
| βββ stage2_bertenstein.py |
| βββ README.md |
| ``` |
|
|
| ## Usage |
|
|
| ```python |
| from safetensors.torch import load_file |
| import torch |
| |
| # Load model weights |
| state = load_file("checkpoints/final/model.safetensors") |
| |
| # Reconstruct model (see stage2_bertenstein.py for full class definitions) |
| from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig |
| |
| # ... build model, load state_dict, run inference |
| ``` |
|
|
| Precomputed embedding caches (Arrow format) for all modalities: [`AbstractPhil/geolip-bertenstein-cache`](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache) |
|
|
| ## Geometric Terrain Analysis |
|
|
| The foundational profiling of 17 models and Procrustes alignment analysis: [`AbstractPhil/procrustes-analysis`](https://huggingface.co/AbstractPhil/procrustes-analysis) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{abstractphil2026bertenstein, |
| title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry}, |
| author={AbstractPhil}, |
| year={2026}, |
| url={https://huggingface.co/AbstractPhil/geolip-bertenstein} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT |