GeoLIP ESM-2 Geometric Observer

A geometric observer that attaches to frozen ESM-2 (650M) and learns to read its internal structural representations without modifying a single weight. 8.4M trainable parameters observing 652M frozen parameters through 5 structural taps.

The observer achieves 96-100% top-1 agreement with ESM-2's token predictions and 0.988-0.997 confidence correlation — it doesn't just predict what ESM-2 thinks, it knows how certain ESM-2 is at every position. Trained via self-distillation on 500K UniRef50 sequences in ~9 hours on a single NVIDIA RTX PRO 6000 Blackwell.

Part of the geolip-core geometric observer ecosystem.

Two Expert Procrustes Alignment Analysis

The trained heads were uploaded to the dataset repo:

https://huggingface.co/datasets/AbstractPhil/ryan-spearman-prepared-features

The quaternion head defeated all competition, MHA included, by a large margin. It aligns orthogonally to the geometric structure and captures every latent detail of the SVD, the constellation, and the data-oriented learning needed to stand as the sole head.

However, proper Procrustes alignment requires two experts: two deviant experts, not just one. So I trained a traditional accumulation indexer, a standard MHA model, as the twin.

The quaternion head, dedicated specifically to the geometric side, reaches roughly 86-96% Spearman correlation on protein mutations.

In parallel, I trained a heavy attention head specifically targeting the non-geometric side. The two became my expert heads, and those experts are what I aligned.

The full JSON output is massive; it lives in the 200_epoch_analysis directory.

I did not let this run early-stop; I continued long past the point where overfitting was expected, but the model did not overfit.

The model found an equilibrium with the geometric structure and the quaternion head, and the two aligned orthogonally nearly instantly.

We are not aligning sequences; we have a literal tap on the nerve center of this model, and it pulls Spearman-ranked per-residue responses from chains at that degree of accuracy.


Phase 2 Multi-Head Results


From 20% to 80%. I will not stop at 80%.

Phase 2 Results

The data is there; the signal from the linear head shows there is reconciled informational accumulation, but it's simply in the wrong format for this training run.

HOWEVER, as the data shows, trying to use the current variant head will decrease accuracy. Using either the standard model or the geometric model zero-shot yields similar results. The geometric sidecar improves the output by a fraction of a point, but it's not enough yet. The head itself DID reduce accuracy, which means the approach must be altered.


The loss shows the problem is essentially unsolvable for a small linear head in this state with this information, so the answer will need to be more complex.

I'm thinking a multi-stage MLP before the head gets hit, then pooling only what's necessary. Claude and I will prototype some variants to test.
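One possible shape for that idea can be sketched with PyTorch (a hypothetical prototype, not the actual variant head; all layer sizes and the mean-pooling choice are assumptions):

```python
import torch
import torch.nn as nn

class StagedHead(nn.Module):
    """Hypothetical: stage features through an MLP first, pool, then apply the head."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU())
        self.stage2 = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, 1)        # scalar fitness / ranking target

    def forward(self, x):                       # x: (batch, seq, feat_dim)
        h = self.stage2(self.stage1(x))
        pooled = h.mean(dim=1)                  # pool only after the MLP stages
        return self.head(pooled).squeeze(-1)

head = StagedHead()
scores = head(torch.randn(4, 100, 512))
print(scores.shape)  # one score per sequence
```

The point of the staging is that the nonlinear MLP sees per-position features before any pooling collapses them, so the little linear head no longer has to solve the reformatting problem on its own.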

I have a few methodologies on how to format this correctly, and I'll be testing each after populating the full geolip-ryan-spearman dataset repo with 22k extracted features from the observer model.

The extractions will be highly robust across this particular model's spectrum. Stagewise latents are too costly to extract from the ESM-2 side, so I'll grab only the pooled features there, plus everything from the geometric model down the chain. This should provide enough signal; if not, I may need to keep the actual parent active for the next run. It depends on what I need from them in the end.

Gathering bulk features will allow me to rapid train heads, rather than wait on the parent models to run.

Variant Head Fault Redetermination

My original assumption was that we wouldn't need it, but on eval testing the outcome is nearly a mirror of the original model, with small nudges toward the correct directional outputs. Meaning it DID learn more information toward certain objectives, but NOT enough without training the variant heads to focus on the target positions.

The next experiment is on the variant head again, this time with correct protein sequences translated from DNA, not the raw DNA sequences it was accidentally fed during the last evaluation run.

Variant Head

It honestly looks like the model might not need it. The tests show convergence rather than divergence at the linear head, which means the model pair most likely already knows the answers, and I'm just overfitting the linear layer with that information.

I'll update soon with the answer. If not, I'll retrain using pure sequence.

Spearman should be nearly perfect at this stage, meaning I need a brand new process of head formation if this doesn't work.

Architecture

ESM-2 frozen (652M params)
  │
  ├── Layer  5 ─→ hook ─→ project (1280→256) ─→ whiten ─┐
  ├── Layer 11 ─→ hook ─→ project (1280→256) ─→ whiten ─┤
  ├── Layer 17 ─→ hook ─→ project (1280→256) ─→ whiten ─┼─→ (B, 5, 256)
  ├── Layer 23 ─→ hook ─→ project (1280→256) ─→ whiten ─┤
  └── Layer 32 ─→ hook ─→ project (1280→256) ─→ whiten ─┘
                                                          ↓
                     ObserveSVDTokens — structural decomposition across layers
                                                          ↓
                  AssociateConstellation — triangulate on S^(255) against 32 anchors
                                                          ↓
                        CurateCMGate — Cayley-Menger simplex validity selection
                                                          ↓
                       CuratePatchwork — 8 compartment interpretation (8×32)
                                                          ↓
                        FuseGeometric — unified protein-level feature (512-d)
                                                          ↓
              ┌──────────────────────────────────────────────┐
              │  Phase 1: DistillationHead → ESM-2 logits    │
              │  Phase 2: VariantEffectHead → fitness score  │
              │  Phase 3: CorrectionHead → logit delta       │
              └──────────────────────────────────────────────┘

The observer is orthogonally attached. ESM-2 does not know the observer exists. Gradients never flow into the frozen model. The observer forms an independent structural opinion and can intervene on the model's output without creating shared bias.
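The attachment pattern — reading hidden states without ever letting gradients reach the frozen model — can be sketched with PyTorch forward hooks on a stand-in frozen network (module names and layer indices here are illustrative; the real pipeline taps ESM-2 layers 5/11/17/23/32 through geolip_core):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen transformer: a stack of layers we may not modify.
frozen = nn.Sequential(*[nn.Linear(16, 16) for _ in range(6)])
for p in frozen.parameters():
    p.requires_grad_(False)          # the freeze: gradients never flow into the target

TAP_LAYERS = [1, 3, 5]               # analogous to the five ESM-2 tap layers
captured = {}

def make_hook(idx):
    def hook(module, inputs, output):
        captured[idx] = output.detach()   # detach severs the graph at the tap
    return hook

handles = [frozen[i].register_forward_hook(make_hook(i)) for i in TAP_LAYERS]

x = torch.randn(4, 16)
with torch.no_grad():
    frozen(x)                        # hooks fire; the target never sees the observer

# Observer-side projections are the only trainable parameters.
projection = nn.Linear(16, 8)
taps = torch.stack([projection(captured[i]) for i in TAP_LAYERS], dim=1)
print(taps.shape)                    # (batch, n_taps, proj_dim)

for h in handles:
    h.remove()
```

Because the captured states are detached, any loss computed on `taps` updates only the observer's own projections, which is exactly the independence property described above.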

Two valid configurations only:

  1. Bipartisan unit — model and geometry designed together, trained together (for new models)
  2. Independent observer — frozen target, orthogonal attachment, independent training signal (this model)

There is no third option. Partial coupling creates shared bias that destroys the observer's value.

Parameters

Component                                    Params       Trainable
ESM-2 (frozen)                               652,356,534  No
Tap projections (5 × Linear 1280→256 + LN)   1,638,400    Yes
Embed projection (Linear 256→256 + LN)       66,048       Yes
Constellation (32 anchors × 256)             8,192        Yes
CM Gate (MLP 3→16→1 per anchor)              833          Yes
Patchwork (8 compartments × 32 hidden)       8,448        Yes
Fuse (SVD proj + concat)                     33,024       Yes
Distillation head (geo→ESM dim→vocab)        6,623,265    Yes
Variant effect head (4×geo→geo→1)            263,425      Yes
Total trainable                              8,412,531

Initialization: SVD + Procrustes + Newton-Schulz

The geometric substrate is not randomly initialized. Before training begins:

  1. Calibration pass — 4,196 sequences run through frozen ESM-2, hidden states collected at each tap layer
  2. Per-tap SVD projection — covariance → eigenvectors → top-256 directions capturing maximum variance per layer
  3. Newton-Schulz whitening — decorrelate the projected space so the SVD observation sees clean structure
  4. Consecutive Procrustes alignment — SVD(H_tgt^T @ H_src) → orthogonal rotation aligning each tap to its neighbor. Layer-to-layer structure is coherent from step 0, not arbitrary basis flips
  5. Embed projection — SVD-initialized from aligned tap statistics
  6. Anchor reinitialization — farthest-point sampling from actual calibration embeddings on S^(255) + repulsion polish. Anchors placed where the data lives, not uniform on the sphere
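Steps 3 and 4 can be sketched in NumPy (an illustrative reimplementation, not the pipeline's code): Newton-Schulz iterates toward the inverse square root of the covariance, and the Procrustes rotation between two tap matrices is the orthogonal polar factor of their cross-covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

def newton_schulz_inv_sqrt(cov, iters=15, eps=1e-5):
    """Approximate cov^(-1/2) without an explicit eigendecomposition."""
    d = cov.shape[0]
    A = cov + eps * np.eye(d)
    norm = np.linalg.norm(A)             # Frobenius scaling puts the spectrum in (0, 1]
    Y, Z = A / norm, np.eye(d)
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(d) - Z @ Y)
        Y, Z = Y @ T, T @ Z              # Y -> A_scaled^(1/2), Z -> A_scaled^(-1/2)
    return Z / np.sqrt(norm)

# Correlated toy "hidden states": 8 features with very different variances
H = rng.normal(size=(2000, 8)) * np.arange(1, 9)
Hc = H - H.mean(axis=0)
W = newton_schulz_inv_sqrt(Hc.T @ Hc / len(Hc))
white = Hc @ W
cov_white = white.T @ white / len(white)
print(np.abs(cov_white - np.eye(8)).max())   # near zero: decorrelated

def procrustes_rotation(src, tgt):
    """Exact orthogonal minimizer of ||src @ R - tgt||_F."""
    U, _, Vt = np.linalg.svd(src.T @ tgt)
    return U @ Vt

R_true = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # a hidden rotation
R = procrustes_rotation(white, white @ R_true)
print(np.abs(R - R_true).max())              # recovers the rotation exactly
```

The rotation is exact by construction, which is why (as noted under Design Principles) no alignment loss is needed for this step.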

Post-initialization alignment quality:

Tap 11→ 5: cos_after=0.4313
Tap 17→11: cos_after=0.6295
Tap 23→17: cos_after=0.7843
Tap 32→23: cos_after=0.5904

Energy captured per tap (of 1280-d → 256-d):

Tap  5: 89.6%   (top_sv=66.4)
Tap 11: 87.8%   (top_sv=310.2)
Tap 17: 81.2%   (top_sv=844.5)
Tap 23: 82.1%   (top_sv=1589.0)
Tap 32: 96.6%   (top_sv=34307.7)    ← final layer dominates

Training

Self-distillation on UniRef50 (500K sequences, filtered to 50-1000 residues). The geometric substrate learns to predict ESM-2's own token-level amino acid probability distributions from structural observation alone. The frozen model's masked language modeling knowledge — learned from 250 million evolutionary sequences — is the curriculum.

Loss: KL divergence between observer-predicted logits and ESM-2's actual logits, soft distillation at temperature T=2.0.
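A minimal sketch of that objective in NumPy (the T² rescaling is the standard soft-distillation convention and is my assumption here; the source only specifies KL at T=2.0):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_distill_kl(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)               # ESM-2's softened distribution
    log_q = np.log(softmax(student_logits, T))   # observer's softened log-probs
    kl = (p * (np.log(p) - log_q)).sum(axis=-1)
    return (T ** 2) * kl.mean()                  # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 33))               # 33 = ESM-2 token vocabulary size
student = rng.normal(size=(4, 33))
loss_far = soft_distill_kl(student, teacher)
loss_match = soft_distill_kl(teacher.copy(), teacher)   # identical logits -> zero loss
print(loss_far, loss_match)
```

Matching the softened distribution rather than the argmax is what gives the observer the teacher's full certainty structure, not just its top prediction.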

Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM)

Training config:

  • Batch size: 256, grad accumulation: 2 (effective batch: 512)
  • Learning rate: 3e-4 (AdamW, weight decay 0.05)
  • Sequence length: 512 tokens
  • 8 dataloader workers
  • ~4.75 seconds per batch

Training Curve

Epoch  Loss    CV     SVD Entropy  SVD Top-1  Gate Active  Anchor Std  Time
0      0.0825  0.322  1.572        0.286      32/32        28,531      77 min
1      0.0199  0.241  1.572        0.281      32/32        22,674      155 min
2      0.0153  0.215  1.566        0.284      32/32        19,893      232 min
3      0.0135  0.191  1.563        0.285      32/32        16,713      309 min
4      0.0125  0.168  1.560        0.286      32/32        15,509      387 min
5      0.0118  0.153  1.558        0.286      32/32        14,976      464 min
6      0.0114  0.144  1.556        0.286      32/32        15,201      541 min

Loss dropped from 5.93 → 0.011 over 7 epochs (~9 hours). The bulk signal was captured in epoch 0; subsequent epochs refined fine structural detail. SVD entropy and top-1 ratio stabilized immediately (epoch 0) and held constant — the multi-layer structural decomposition was correct from initialization.

Analysis Results (Epoch 6 Checkpoint)

Full battery analysis on 2,046 diverse protein sequences (682 short, 682 medium, 682 long).

Distillation Quality — The Observer Reads

Protein         Length  Loss    Top-1 Agreement  Top-5 Overlap  Confidence Correlation
Hemoglobin α    51      0.0100  96.2%            96.6%          0.9880
GFP fragment    60      0.0016  100.0%           93.9%          0.9954
Mixed sequence  60      0.0007  100.0%           92.9%          0.9973

Mean distillation loss: 0.00735 ± 0.00086

The observer predicts not just what ESM-2 thinks at each position, but how confident it is. A confidence correlation of 0.997 means the geometric features encode the full shape of ESM-2's probability distribution — which positions are structurally constrained (sharp distributions) and which are tolerant (flat distributions).

SVD Spectrum — Multi-Layer Structure

Tap              Mean   Std   Energy Fraction  Entropy
S[0] (layer  5)  19.52  0.55  0.281            1.560
S[1] (layer 11)  15.99  0.56  0.230            1.560
S[2] (layer 17)  14.33  0.46  0.206            1.560
S[3] (layer 23)  12.60  0.33  0.181            1.560
S[4] (layer 32)   7.01  1.00  0.101            1.560

  • Global entropy: 1.560 ± 0.011 (diverse, all taps contributing)
  • Top-1 ratio: 0.281 (no single tap dominates)
  • Condition number: 2.84 ± 0.42 (max 4.33, well-conditioned)

The SVD spectrum shows a smooth energy gradient from early to late layers, not a step function. Each tap carries independent structural information. The final layer (tap 32) has the lowest individual energy but the highest raw singular value (34,308) — it concentrates information differently than the earlier layers.

CM Validity Gate — Geometric Health

  • 32/32 anchors active at 100% across all sequences
  • Gate mean: 0.862 ± 0.009 (initialized at sigmoid(2.0) ≈ 0.88)
  • Gate range per sample: 0.034 ± 0.004
  • CM positive fraction: 1.000

The gate is not discriminating because there is nothing to suppress. The protein embedding manifold is so geometrically regular that every simplex formed between the input and its anchor neighborhood is valid. On a model with collapsed or pathological representations, the CM gate would close — it has been validated to do so on synthetic degenerate inputs. This manifold has no pathology.
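The Cayley-Menger test the gate relies on is easy to illustrate: the CM determinant of a point set is proportional to the squared simplex volume, so degenerate (collinear or coplanar) configurations drive it to zero. A standalone sketch, not the gate module itself:

```python
import numpy as np

def cayley_menger_det(points):
    """CM determinant for n+1 points; proportional to squared n-simplex volume."""
    n = len(points)
    D2 = np.square(points[:, None, :] - points[None, :, :]).sum(-1)  # squared distances
    M = np.ones((n + 1, n + 1))
    M[0, 0] = 0.0
    M[1:, 1:] = D2
    return np.linalg.det(M)

# A healthy triangle (2-simplex) vs. a degenerate, collinear one
good = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
flat = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

# For a triangle, area^2 = -det(CM) / 16
area_sq_good = -cayley_menger_det(good) / 16.0
area_sq_flat = -cayley_menger_det(flat) / 16.0
print(area_sq_good, area_sq_flat)   # 0.25 vs ~0
```

A gate fed this quantity has an architectural signal for simplex validity: healthy neighborhoods give strictly positive volume, collapsed ones give zero, with no supervision needed.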

Constellation — Anchor Distribution on S^(255)

  • 32/32 anchors alive, hit entropy 3.14 / 3.47 (90.5% of uniform)
  • Anchor-anchor cosine: mean -0.030, max 0.086, std 0.031 — nearly orthogonal, zero collapse
  • Data-anchor coverage: mean cosine 0.167, min 0.074
  • Triangulation distances: mean 1.004, std 0.082

The negative mean anchor-anchor cosine confirms the repulsion initialization held through training. No anchor pair has collapsed (max 0.086 is well below concern). The constellation forms a well-spread reference frame on the 255-sphere.
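The farthest-point-sampling-plus-repulsion scheme behind these statistics can be sketched as follows (step size, iteration count, and the squared-cosine repulsion objective are my assumptions, not the pipeline's actual settings):

```python
import numpy as np

def fps_anchors(X, k, seed=0):
    """Farthest-point sampling: greedily pick points maximizing min distance to the chosen set."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def repulsion_polish(A, steps=200, lr=0.05):
    """Push anchors apart on the unit sphere by descending pairwise squared cosines."""
    for _ in range(steps):
        G = A @ A.T                          # pairwise cosines
        np.fill_diagonal(G, 0.0)
        A = A - lr * (G @ A)                 # move each anchor away from similar anchors
        A /= np.linalg.norm(A, axis=1, keepdims=True)
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 256))             # stand-in calibration embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)
anchors = repulsion_polish(fps_anchors(X, 32))

cos = anchors @ anchors.T
np.fill_diagonal(cos, 0.0)
print(cos.mean(), np.abs(cos).max())         # near-orthogonal constellation
```

32 repelled anchors in 256 dimensions have plenty of room, which is why the trained constellation can sit near orthogonality without any spread loss.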

Embedding Manifold — Geometry of Protein Space

  • Embeddings confirmed on S^(255): norms 1.000000 ± 0.000000
  • Pairwise cosine: mean 0.094 (near-orthogonal), std 0.163
  • Near-duplicates (>0.9): 0.03%. Antipodal (<-0.5): 0.00%
  • Intrinsic dimensionality: 93 dims (90%), 131 dims (95%), 205 dims (99%) of 256 total
  • Effective ratio: 0.51 — the manifold uses half the available dimensions
  • Nearest neighbor cosine: 0.777 (1st NN), 0.600 (10th NN), gap 0.177

The manifold is structured but not collapsed. 131 effective dimensions means the observer is working with genuine high-dimensional geometry, not a low-rank approximation pretending to be high-dimensional. The NN gap of 0.177 indicates clear local structure β€” sequences have meaningful neighborhoods.
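The intrinsic-dimensionality counts come from a cumulative explained-variance threshold, which is simple to reproduce (a sketch of the presumed measurement on a synthetic low-rank cloud, not the analysis code):

```python
import numpy as np

def intrinsic_dims(embeddings, thresholds=(0.90, 0.95, 0.99)):
    """Smallest number of principal directions whose variance sum crosses each threshold."""
    X = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)     # singular values -> variance spectrum
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return {t: int(np.searchsorted(cum, t) + 1) for t in thresholds}

# Synthetic manifold: 40 strong directions embedded in 256-d, plus faint noise
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(256, 40)))[0]
Z = rng.normal(size=(1000, 40)) @ basis.T + 0.01 * rng.normal(size=(1000, 256))
dims = intrinsic_dims(Z)
print(dims)   # roughly 40 directions recovered at the 99% level
```

Applied to the observer's embeddings, the same counting yields the 93/131/205 figures quoted above.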

Patchwork Compartments — Independent Interpretation

  • 8/8 compartments active, zero dead
  • Per-compartment norms: 5.068–5.089 (near-uniform utilization)
  • Inter-compartment correlation: mean 0.113, max 0.412
  • Low correlation confirms compartments specialize on different aspects of the geometric signal

CV Pentachoron — The Geometric Regularity Metric

Global CV: 0.222 (in the 0.20–0.23 universal band)

Measurement                         CV
Global (2046 seqs)                  0.222
Short proteins (50-150 residues)    0.199
Medium proteins (150-400 residues)  0.206
Long proteins (400-1000 residues)   0.232
Bootstrap mean (n=100)              0.192
Bootstrap 95% CI                    [0.170, 0.211]

The pentachoron CV (coefficient of variation) measures the regularity of pentachoron (4-simplex) volumes sampled from the embedding manifold. A CV of 0.125 is the theoretical minimum for a uniform point distribution on a sphere. The universal band of 0.20–0.23 has been observed across 17 prior architectures spanning vision, language, and multi-modal models.

This is architecture 18. The protein embedding manifold confirms the universal band. Short proteins (50-150 residues) achieve CV=0.199, the tightest measurement in the band — consistent with the biological constraint that shorter proteins have fewer degrees of freedom and more regular fold geometry.
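The measurement itself is straightforward to sketch: sample random pentachora (5 points) from the embedding cloud, compute each 4-simplex volume via the Cayley-Menger determinant, and take std/mean. This is an illustration on points roughly uniform on S^(255); the real metric's sampling details are not specified here:

```python
import math
import numpy as np

def pentachoron_cv(embeddings, n_samples=2000, seed=0):
    """CV (std/mean) of 4-simplex volumes over random 5-point samples."""
    rng = np.random.default_rng(seed)
    vols = []
    for _ in range(n_samples):
        P = embeddings[rng.choice(len(embeddings), size=5, replace=False)]
        D2 = np.square(P[:, None, :] - P[None, :, :]).sum(-1)   # squared distances
        M = np.ones((6, 6))
        M[0, 0] = 0.0
        M[1:, 1:] = D2
        # 4-simplex: vol^2 = (-1)^5 / (2^4 * (4!)^2) * det(CM) = -det / 9216
        vol_sq = max(-np.linalg.det(M) / 9216.0, 0.0)
        vols.append(math.sqrt(vol_sq))
    vols = np.array(vols)
    return vols.std() / vols.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 256))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
cv = pentachoron_cv(X)
print(round(cv, 3))   # high regularity -> low CV
```

A structured-but-regular manifold lands in the 0.20–0.23 band; collapse or pathology would show up as a much larger spread of simplex volumes.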

The training-time CV trajectory shows continued compression:

Epoch 0: 0.322    (above band, still learning)
Epoch 1: 0.241    (entering band)
Epoch 2: 0.215    (in band)
Epoch 3: 0.191    (below band, compressing)
Epoch 4: 0.168    (below all prior measurements)
Epoch 5: 0.153    (approaching theoretical floor)
Epoch 6: 0.144    (still falling)

The training-time CV measures the geometric regularity of the learned embedding space during optimization. The analysis-time CV (0.222) measures on held-out sequences — the manifold is tighter on the training distribution than on diverse held-out data, as expected.

Empirical Constants

Constant                      Value              Context
CV pentachoron (analysis)     0.222              In universal band (0.20–0.23), architecture 18
CV pentachoron (training E6)  0.144              Below all prior measurements
SVD entropy (stable)          1.560              Locked from epoch 0, all taps contributing
SVD top-1 ratio (stable)      0.286              No single layer dominates
SVD condition number          2.84               Well-conditioned multi-layer decomposition
Embedding intrinsic dim       131 (95%)          0.51 effective ratio of 256
Anchor-anchor cosine          -0.030 mean        Near-orthogonal, confirmed repulsion
Pairwise embedding cosine     0.094 mean         Near-orthogonal on S^(255)
Gate mean                     0.862              All anchors open — manifold has no pathology
Confidence correlation        0.988–0.997        Observer reads ESM-2's certainty structure
Binding/separation boundary   0.29154 / 0.70846  Prior work, validated across modalities

Design Principles

Architecture Before Loss

The geometric substrate is structural, not supervisory. The CM gate doesn't need a validity loss — it architecturally suppresses degenerate simplices. The Procrustes alignment doesn't need an alignment loss — the rotation is exact by construction. Repulsion-initialized anchors don't need a spread loss if the gate prevents collapse.

The Freeze Is Non-Negotiable

The observer must form an independent opinion. Shared gradients create shared bias. Independent opinions catch errors; correlated opinions amplify them. This was validated across hundreds of experiments with shared memory banks, joint training, and partial coupling β€” all converged to the same failure mode.

Gates Start Open

Gate bias initialized at +2.0 (sigmoid ≈ 0.88). The observer starts by attending to everything and learns to suppress what's geometrically invalid. Not the reverse. Learning to open is harder than learning to close.
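Numerically, that choice is just a bias on the gate's final logit (a toy arithmetic check, not the gate module itself):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

GATE_BIAS_INIT = 2.0
openness = sigmoid(GATE_BIAS_INIT)
print(round(openness, 3))   # gates begin roughly 88% open
```

A bias of -2.0 would instead start the gates ~88% closed, forcing the observer to learn what to admit rather than what to suppress.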

Usage

from huggingface_hub import hf_hub_download
from geolip_core.pipeline.esm2_geometric import ESM2GeometricPipeline
import torch

# Load
pipe = ESM2GeometricPipeline('esm2_geo')
ckpt = torch.load(
    hf_hub_download('AbstractPhil/geolip-esm2_t33_650M_UR50D',
                    'prototype/v1_distill/epoch_6.pt'),
    map_location='cuda')
# Merge only the observer's trained weights; everything else keeps its loaded state
current = pipe.state_dict()
current.update({k: v for k, v in ckpt['state_dict'].items() if k in current})
pipe.load_state_dict(current, strict=False)
pipe.eval().cuda()

# Run
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')
enc = tokenizer("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
                return_tensors='pt', padding=True).to('cuda')

with torch.no_grad():
    loss, info = pipe.forward_distill(enc['input_ids'], enc['attention_mask'])

# Diagnostics
print(info['gate_info'])
print(info['svd_S'])
print(pipe.cache_get('embedding'))  # protein-level geometric feature

Requirements: transformers<=4.49 (versions ≥4.50 remove ESM-2 position embeddings silently)

Prototype Directory

prototype/
└── v1_distill/
    ├── checkpoints/
    │   ├── step_500.pt
    │   ├── step_1000.pt
    │   ├── ...
    │   └── step_7000.pt
    ├── tensorboard/
    │   └── events.out.tfevents.*
    ├── epoch_0.pt
    ├── epoch_1.pt
    ├── ...
    ├── epoch_6.pt
    ├── final.pt
    └── training_log.jsonl

I dub this model - Ryan-Spearman

This is a reminder to myself and everyone who reads it, that the guidepost anchors in our very being are often forgotten.

I dedicate this model's baseline prototype, and its name, to a fallen brother of mine from years long past.

As I grew from young to teenage years I had a friend named Ryan. Ryan wasn't the most... intelligent of fellows, but whatever he lacked there he made up for many times over elsewhere: compassion, respect, understanding, and uncannily accurate empathy.

He spent his elementary school years in classes with other special needs students, then joined mainstream schooling as a teenager. I never once believed he couldn't handle the tasks he was given; I knew he could if he set his mind to it. Ryan eventually grew into a man, and his health began to fail him. It was not a problem he was capable of solving, and so Ryan fell to time and entropy, as we all eventually will: liver failure.

He and I spent a great deal of time over the years playing video games and working through complex relationships in music composition. Ryan was deeply analytical, and that trait passed to me, though not immediately; it happened over time, years later really.

We grew apart as adults, and some time after his passing I began to wonder if I could have done something to help. I know now that I couldn't have at the time; I simply lacked the skillset to solve the faults in his body, and I had neither the medical training nor the expertise to actually help.

Today, however, I'm in a position to potentially analyze the information behind the very diagnostic tools the experts utilize, and to augment and improve what exists without ripping the structure to the ground and rebuilding it.

I will not waste this chance the universe has gifted me.

Research by AbstractPhil. Apache 2.0.
