GeoLIP ESM-2 Geometric Observer

A geometric observer that attaches to frozen ESM-2 (650M) and learns to read its internal structural representations without modifying a single weight. 8.4M trainable parameters observing 652M frozen parameters through 5 structural taps.

The observer achieves 96-100% top-1 agreement with ESM-2's token predictions and 0.988-0.997 confidence correlation — it doesn't just predict what ESM-2 thinks, it knows how certain ESM-2 is at every position. Trained via self-distillation on 500K UniRef50 sequences in ~9 hours on a single NVIDIA RTX PRO 6000 Blackwell.

Part of the geolip-core geometric observer ecosystem.

Two Expert Procrustes Alignment Analysis

The trained heads were uploaded to the dataset repo:

https://huggingface.co/datasets/AbstractPhil/ryan-spearman-prepared-features

The quaternion head defeated all competition, MHA included, by a large margin. It aligns orthogonally to the geometric structure and captures every latent detail of the SVD, the constellation, and the data-oriented learning needed to stand as the sole head.

However, proper Procrustes alignment requires two experts: two deviant experts, not just one. So I trained a traditional accumulation indexer, a standard MHA model, as the twin.

The quaternion head, dedicated specifically to the geometric side, reaches roughly 86-96% Spearman correlation on protein mutations.

In parallel, I trained a heavy attention head specifically targeting the non-geometric side. The two became my expert heads, and those experts are what I aligned.

The full JSON output is massive; it lives in the 200_epoch_analysis directory.

I did not let this run early-stop; I continued long past the point where overfitting was expected, but the model did not overfit.

The model found an equilibrium with the geometric structure and the quaternion head, and the two aligned orthogonally nearly instantly.

We are not aligning sequences; we have a literal tap on the nerve center of this model, and it pulls Spearman-ranked per-residue responses from chains at that degree of accuracy.


Phase 2 Multi-Head Results


From 20% to 80%. I will not stop at 80%.

Phase 2 Results

The data is there; the signal from the linear head shows there is reconciled informational accumulation, but it's simply in the wrong format for this training run.

HOWEVER, as the data shows, trying to use the current variant head will decrease accuracy. Using either the standard model or the geometric model zero-shot yields similar results. The geometric sidecar improves the output by a fraction of a point, but it's not enough yet. The head itself DID reduce accuracy, which means the approach must be altered.


The loss shows the problem is essentially unsolvable for a small linear head in this state with this information, so the answer will need to be more complex.

I'm thinking a multi-stage MLP before the head gets hit, then pooling only what's necessary. Claude and I will prototype some variants to test.
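One possible shape for that idea can be sketched with PyTorch (a hypothetical prototype, not the actual variant head; all layer sizes and the mean-pooling choice are assumptions):

```python
import torch
import torch.nn as nn

class StagedHead(nn.Module):
    """Hypothetical: stage features through an MLP first, pool, then apply the head."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU())
        self.stage2 = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, 1)        # scalar fitness / ranking target

    def forward(self, x):                       # x: (batch, seq, feat_dim)
        h = self.stage2(self.stage1(x))
        pooled = h.mean(dim=1)                  # pool only after the MLP stages
        return self.head(pooled).squeeze(-1)

head = StagedHead()
scores = head(torch.randn(4, 100, 512))
print(scores.shape)  # one score per sequence
```

The point of the staging is that the nonlinear MLP sees per-position features before any pooling collapses them, so the little linear head no longer has to solve the reformatting problem on its own.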

I have a few methodologies on how to format this correctly, and I'll be testing each after populating the full geolip-ryan-spearman dataset repo with 22k extracted features from the observer model.

The extractions will be highly robust across this particular model's spectrum. Stagewise latents are too costly to extract from the ESM-2 side, so I'll grab only the pooled features there, plus everything from the geometric model down the chain. This should provide enough signal; if not, I may need to keep the actual parent active for the next run. It depends on what I need from them in the end.

Gathering bulk features will allow me to rapid train heads, rather than wait on the parent models to run.

Variant Head Fault Redetermination

My original assumption was that we wouldn't need it, but on eval testing the outcome is nearly a mirror of the original model, with small nudges toward the correct directional outputs. Meaning it DID learn more information toward certain objectives, but NOT enough without training the variant heads to focus on the target positions.

The next experiment is on the variant head again, this time with correct protein sequences translated from DNA, not the raw DNA sequences it was accidentally fed during the last evaluation run.

Variant Head

It honestly looks like the model might not need it. The tests show convergence rather than divergence at the linear head, which means the model pair most likely already knows the answers, and I'm just overfitting the linear layer with that information.

I'll update soon with the answer. If not, I'll retrain using pure sequence.

Spearman should be nearly perfect at this stage, meaning I need a brand new process of head formation if this doesn't work.

Architecture

ESM-2 frozen (652M params)
  │
  ├── Layer  5 ─→ hook ─→ project (1280→256) ─→ whiten ─┐
  ├── Layer 11 ─→ hook ─→ project (1280→256) ─→ whiten ─┤
  ├── Layer 17 ─→ hook ─→ project (1280→256) ─→ whiten ─┼─→ (B, 5, 256)
  ├── Layer 23 ─→ hook ─→ project (1280→256) ─→ whiten ─┤
  └── Layer 32 ─→ hook ─→ project (1280→256) ─→ whiten ─┘
                                                          ↓
                     ObserveSVDTokens — structural decomposition across layers
                                                          ↓
                  AssociateConstellation — triangulate on S^(255) against 32 anchors
                                                          ↓
                        CurateCMGate — Cayley-Menger simplex validity selection
                                                          ↓
                       CuratePatchwork — 8 compartment interpretation (8×32)
                                                          ↓
                        FuseGeometric — unified protein-level feature (512-d)
                                                          ↓
              ┌──────────────────────────────────────────────┐
              │  Phase 1: DistillationHead → ESM-2 logits    │
              │  Phase 2: VariantEffectHead → fitness score  │
              │  Phase 3: CorrectionHead → logit delta       │
              └──────────────────────────────────────────────┘

The observer is orthogonally attached. ESM-2 does not know the observer exists. Gradients never flow into the frozen model. The observer forms an independent structural opinion and can intervene on the model's output without creating shared bias.
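The attachment pattern — reading hidden states without ever letting gradients reach the frozen model — can be sketched with PyTorch forward hooks on a stand-in frozen network (module names and layer indices here are illustrative; the real pipeline taps ESM-2 layers 5/11/17/23/32 through geolip_core):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen transformer: a stack of layers we may not modify.
frozen = nn.Sequential(*[nn.Linear(16, 16) for _ in range(6)])
for p in frozen.parameters():
    p.requires_grad_(False)          # the freeze: gradients never flow into the target

TAP_LAYERS = [1, 3, 5]               # analogous to the five ESM-2 tap layers
captured = {}

def make_hook(idx):
    def hook(module, inputs, output):
        captured[idx] = output.detach()   # detach severs the graph at the tap
    return hook

handles = [frozen[i].register_forward_hook(make_hook(i)) for i in TAP_LAYERS]

x = torch.randn(4, 16)
with torch.no_grad():
    frozen(x)                        # hooks fire; the target never sees the observer

# Observer-side projections are the only trainable parameters.
projection = nn.Linear(16, 8)
taps = torch.stack([projection(captured[i]) for i in TAP_LAYERS], dim=1)
print(taps.shape)                    # (batch, n_taps, proj_dim)

for h in handles:
    h.remove()
```

Because the captured states are detached, any loss computed on `taps` updates only the observer's own projections, which is exactly the independence property described above.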

Two valid configurations only:

  1. Bipartisan unit — model and geometry designed together, trained together (for new models)
  2. Independent observer — frozen target, orthogonal attachment, independent training signal (this model)

There is no third option. Partial coupling creates shared bias that destroys the observer's value.

Parameters

Component                                    Params       Trainable
ESM-2 (frozen)                               652,356,534  No
Tap projections (5 × Linear 1280→256 + LN)   1,638,400    Yes
Embed projection (Linear 256→256 + LN)       66,048       Yes
Constellation (32 anchors × 256)             8,192        Yes
CM Gate (MLP 3→16→1 per anchor)              833          Yes
Patchwork (8 compartments × 32 hidden)       8,448        Yes
Fuse (SVD proj + concat)                     33,024       Yes
Distillation head (geo→ESM dim→vocab)        6,623,265    Yes
Variant effect head (4×geo→geo→1)            263,425      Yes
Total trainable                              8,412,531

Initialization: SVD + Procrustes + Newton-Schulz

The geometric substrate is not randomly initialized. Before training begins:

  1. Calibration pass — 4,196 sequences run through frozen ESM-2, hidden states collected at each tap layer
  2. Per-tap SVD projection — covariance → eigenvectors → top-256 directions capturing maximum variance per layer
  3. Newton-Schulz whitening — decorrelate the projected space so the SVD observation sees clean structure
  4. Consecutive Procrustes alignment — SVD(H_tgt^T @ H_src) → orthogonal rotation aligning each tap to its neighbor. Layer-to-layer structure is coherent from step 0, not arbitrary basis flips
  5. Embed projection — SVD-initialized from aligned tap statistics
  6. Anchor reinitialization — farthest-point sampling from actual calibration embeddings on S^(255) + repulsion polish. Anchors placed where the data lives, not uniform on the sphere
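Steps 3 and 4 can be sketched in NumPy (an illustrative reimplementation, not the pipeline's code): Newton-Schulz iterates toward the inverse square root of the covariance, and the Procrustes rotation between two tap matrices is the orthogonal polar factor of their cross-covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

def newton_schulz_inv_sqrt(cov, iters=15, eps=1e-5):
    """Approximate cov^(-1/2) without an explicit eigendecomposition."""
    d = cov.shape[0]
    A = cov + eps * np.eye(d)
    norm = np.linalg.norm(A)             # Frobenius scaling puts the spectrum in (0, 1]
    Y, Z = A / norm, np.eye(d)
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(d) - Z @ Y)
        Y, Z = Y @ T, T @ Z              # Y -> A_scaled^(1/2), Z -> A_scaled^(-1/2)
    return Z / np.sqrt(norm)

# Correlated toy "hidden states": 8 features with very different variances
H = rng.normal(size=(2000, 8)) * np.arange(1, 9)
Hc = H - H.mean(axis=0)
W = newton_schulz_inv_sqrt(Hc.T @ Hc / len(Hc))
white = Hc @ W
cov_white = white.T @ white / len(white)
print(np.abs(cov_white - np.eye(8)).max())   # near zero: decorrelated

def procrustes_rotation(src, tgt):
    """Exact orthogonal minimizer of ||src @ R - tgt||_F."""
    U, _, Vt = np.linalg.svd(src.T @ tgt)
    return U @ Vt

R_true = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # a hidden rotation
R = procrustes_rotation(white, white @ R_true)
print(np.abs(R - R_true).max())              # recovers the rotation exactly
```

The rotation is exact by construction, which is why (as noted under Design Principles) no alignment loss is needed for this step.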

Post-initialization alignment quality:

Tap 11→ 5: cos_after=0.4313
Tap 17→11: cos_after=0.6295
Tap 23→17: cos_after=0.7843
Tap 32→23: cos_after=0.5904

Energy captured per tap (of 1280-d → 256-d):

Tap  5: 89.6%   (top_sv=66.4)
Tap 11: 87.8%   (top_sv=310.2)
Tap 17: 81.2%   (top_sv=844.5)
Tap 23: 82.1%   (top_sv=1589.0)
Tap 32: 96.6%   (top_sv=34307.7)    ← final layer dominates

Training

Self-distillation on UniRef50 (500K sequences, filtered to 50-1000 residues). The geometric substrate learns to predict ESM-2's own token-level amino acid probability distributions from structural observation alone. The frozen model's masked language modeling knowledge — learned from 250 million evolutionary sequences — is the curriculum.

Loss: KL divergence between observer-predicted logits and ESM-2's actual logits, soft distillation at temperature T=2.0.
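A minimal sketch of that objective in NumPy (the T² rescaling is the standard soft-distillation convention and is my assumption here; the source only specifies KL at T=2.0):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_distill_kl(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)               # ESM-2's softened distribution
    log_q = np.log(softmax(student_logits, T))   # observer's softened log-probs
    kl = (p * (np.log(p) - log_q)).sum(axis=-1)
    return (T ** 2) * kl.mean()                  # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 33))               # 33 = ESM-2 token vocabulary size
student = rng.normal(size=(4, 33))
loss_far = soft_distill_kl(student, teacher)
loss_match = soft_distill_kl(teacher.copy(), teacher)   # identical logits -> zero loss
print(loss_far, loss_match)
```

Matching the softened distribution rather than the argmax is what gives the observer the teacher's full certainty structure, not just its top prediction.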

Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM)

Training config:

  • Batch size: 256, grad accumulation: 2 (effective batch: 512)
  • Learning rate: 3e-4 (AdamW, weight decay 0.05)
  • Sequence length: 512 tokens
  • 8 dataloader workers
  • ~4.75 seconds per batch

Training Curve

Epoch  Loss    CV     SVD Entropy  SVD Top-1  Gate Active  Anchor Std  Time
0      0.0825  0.322  1.572        0.286      32/32        28,531      77 min
1      0.0199  0.241  1.572        0.281      32/32        22,674      155 min
2      0.0153  0.215  1.566        0.284      32/32        19,893      232 min
3      0.0135  0.191  1.563        0.285      32/32        16,713      309 min
4      0.0125  0.168  1.560        0.286      32/32        15,509      387 min
5      0.0118  0.153  1.558        0.286      32/32        14,976      464 min
6      0.0114  0.144  1.556        0.286      32/32        15,201      541 min

Loss dropped from 5.93 → 0.011 over 7 epochs (~9 hours). The bulk signal was captured in epoch 0; subsequent epochs refined fine structural detail. SVD entropy and top-1 ratio stabilized immediately (epoch 0) and held constant — the multi-layer structural decomposition was correct from initialization.

Analysis Results (Epoch 6 Checkpoint)

Full battery analysis on 2,046 diverse protein sequences (682 short, 682 medium, 682 long).

Distillation Quality — The Observer Reads

Protein         Length  Loss    Top-1 Agreement  Top-5 Overlap  Confidence Correlation
Hemoglobin α    51      0.0100  96.2%            96.6%          0.9880
GFP fragment    60      0.0016  100.0%           93.9%          0.9954
Mixed sequence  60      0.0007  100.0%           92.9%          0.9973

Mean distillation loss: 0.00735 ± 0.00086

The observer predicts not just what ESM-2 thinks at each position, but how confident it is. A confidence correlation of 0.997 means the geometric features encode the full shape of ESM-2's probability distribution — which positions are structurally constrained (sharp distributions) and which are tolerant (flat distributions).

SVD Spectrum — Multi-Layer Structure

Tap              Mean   Std   Energy Fraction  Entropy
S[0] (layer  5)  19.52  0.55  0.281            1.560
S[1] (layer 11)  15.99  0.56  0.230            1.560
S[2] (layer 17)  14.33  0.46  0.206            1.560
S[3] (layer 23)  12.60  0.33  0.181            1.560
S[4] (layer 32)   7.01  1.00  0.101            1.560

  • Global entropy: 1.560 ± 0.011 (diverse, all taps contributing)
  • Top-1 ratio: 0.281 (no single tap dominates)
  • Condition number: 2.84 ± 0.42 (max 4.33, well-conditioned)

The SVD spectrum shows a smooth energy gradient from early to late layers, not a step function. Each tap carries independent structural information. The final layer (tap 32) has the lowest individual energy but the highest raw singular value (34,308) — it concentrates information differently than the earlier layers.

CM Validity Gate — Geometric Health

  • 32/32 anchors active at 100% across all sequences
  • Gate mean: 0.862 ± 0.009 (initialized at sigmoid(2.0) ≈ 0.88)
  • Gate range per sample: 0.034 ± 0.004
  • CM positive fraction: 1.000

The gate is not discriminating because there is nothing to suppress. The protein embedding manifold is so geometrically regular that every simplex formed between the input and its anchor neighborhood is valid. On a model with collapsed or pathological representations, the CM gate would close — it has been validated to do so on synthetic degenerate inputs. This manifold has no pathology.
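The Cayley-Menger test the gate relies on is easy to illustrate: the CM determinant of a point set is proportional to the squared simplex volume, so degenerate (collinear or coplanar) configurations drive it to zero. A standalone sketch, not the gate module itself:

```python
import numpy as np

def cayley_menger_det(points):
    """CM determinant for n+1 points; proportional to squared n-simplex volume."""
    n = len(points)
    D2 = np.square(points[:, None, :] - points[None, :, :]).sum(-1)  # squared distances
    M = np.ones((n + 1, n + 1))
    M[0, 0] = 0.0
    M[1:, 1:] = D2
    return np.linalg.det(M)

# A healthy triangle (2-simplex) vs. a degenerate, collinear one
good = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
flat = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

# For a triangle, area^2 = -det(CM) / 16
area_sq_good = -cayley_menger_det(good) / 16.0
area_sq_flat = -cayley_menger_det(flat) / 16.0
print(area_sq_good, area_sq_flat)   # 0.25 vs ~0
```

A gate fed this quantity has an architectural signal for simplex validity: healthy neighborhoods give strictly positive volume, collapsed ones give zero, with no supervision needed.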

Constellation — Anchor Distribution on S^(255)

  • 32/32 anchors alive, hit entropy 3.14 / 3.47 (90.5% of uniform)
  • Anchor-anchor cosine: mean -0.030, max 0.086, std 0.031 — nearly orthogonal, zero collapse
  • Data-anchor coverage: mean cosine 0.167, min 0.074
  • Triangulation distances: mean 1.004, std 0.082

The negative mean anchor-anchor cosine confirms the repulsion initialization held through training. No anchor pair has collapsed (max 0.086 is well below concern). The constellation forms a well-spread reference frame on the 255-sphere.
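The farthest-point-sampling-plus-repulsion scheme behind these statistics can be sketched as follows (step size, iteration count, and the squared-cosine repulsion objective are my assumptions, not the pipeline's actual settings):

```python
import numpy as np

def fps_anchors(X, k, seed=0):
    """Farthest-point sampling: greedily pick points maximizing min distance to the chosen set."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def repulsion_polish(A, steps=200, lr=0.05):
    """Push anchors apart on the unit sphere by descending pairwise squared cosines."""
    for _ in range(steps):
        G = A @ A.T                          # pairwise cosines
        np.fill_diagonal(G, 0.0)
        A = A - lr * (G @ A)                 # move each anchor away from similar anchors
        A /= np.linalg.norm(A, axis=1, keepdims=True)
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 256))             # stand-in calibration embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)
anchors = repulsion_polish(fps_anchors(X, 32))

cos = anchors @ anchors.T
np.fill_diagonal(cos, 0.0)
print(cos.mean(), np.abs(cos).max())         # near-orthogonal constellation
```

32 repelled anchors in 256 dimensions have plenty of room, which is why the trained constellation can sit near orthogonality without any spread loss.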

Embedding Manifold — Geometry of Protein Space

  • Embeddings confirmed on S^(255): norms 1.000000 ± 0.000000
  • Pairwise cosine: mean 0.094 (near-orthogonal), std 0.163
  • Near-duplicates (>0.9): 0.03%. Antipodal (<-0.5): 0.00%
  • Intrinsic dimensionality: 93 dims (90%), 131 dims (95%), 205 dims (99%) of 256 total
  • Effective ratio: 0.51 — the manifold uses half the available dimensions
  • Nearest neighbor cosine: 0.777 (1st NN), 0.600 (10th NN), gap 0.177

The manifold is structured but not collapsed. 131 effective dimensions means the observer is working with genuine high-dimensional geometry, not a low-rank approximation pretending to be high-dimensional. The NN gap of 0.177 indicates clear local structure β€” sequences have meaningful neighborhoods.
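The intrinsic-dimensionality counts come from a cumulative explained-variance threshold, which is simple to reproduce (a sketch of the presumed measurement on a synthetic low-rank cloud, not the analysis code):

```python
import numpy as np

def intrinsic_dims(embeddings, thresholds=(0.90, 0.95, 0.99)):
    """Smallest number of principal directions whose variance sum crosses each threshold."""
    X = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)     # singular values -> variance spectrum
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return {t: int(np.searchsorted(cum, t) + 1) for t in thresholds}

# Synthetic manifold: 40 strong directions embedded in 256-d, plus faint noise
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(256, 40)))[0]
Z = rng.normal(size=(1000, 40)) @ basis.T + 0.01 * rng.normal(size=(1000, 256))
dims = intrinsic_dims(Z)
print(dims)   # roughly 40 directions recovered at the 99% level
```

Applied to the observer's embeddings, the same counting yields the 93/131/205 figures quoted above.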

Patchwork Compartments — Independent Interpretation

  • 8/8 compartments active, zero dead
  • Per-compartment norms: 5.068–5.089 (near-uniform utilization)
  • Inter-compartment correlation: mean 0.113, max 0.412
  • Low correlation confirms compartments specialize on different aspects of the geometric signal

CV Pentachoron — The Geometric Regularity Metric

Global CV: 0.222 (in the 0.20–0.23 universal band)

Measurement                         CV
Global (2046 seqs)                  0.222
Short proteins (50-150 residues)    0.199
Medium proteins (150-400 residues)  0.206
Long proteins (400-1000 residues)   0.232
Bootstrap mean (n=100)              0.192
Bootstrap 95% CI                    [0.170, 0.211]

The pentachoron CV (coefficient of variation) measures the regularity of pentachoron (4-simplex) volumes sampled from the embedding manifold. A CV of 0.125 is the theoretical minimum for a uniform point distribution on a sphere. The universal band of 0.20–0.23 has been observed across 17 prior architectures spanning vision, language, and multi-modal models.

This is architecture 18. The protein embedding manifold confirms the universal band. Short proteins (50-150 residues) achieve CV=0.199, the tightest measurement in the band — consistent with the biological constraint that shorter proteins have fewer degrees of freedom and more regular fold geometry.
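The measurement itself is straightforward to sketch: sample random pentachora (5 points) from the embedding cloud, compute each 4-simplex volume via the Cayley-Menger determinant, and take std/mean. This is an illustration on points roughly uniform on S^(255); the real metric's sampling details are not specified here:

```python
import math
import numpy as np

def pentachoron_cv(embeddings, n_samples=2000, seed=0):
    """CV (std/mean) of 4-simplex volumes over random 5-point samples."""
    rng = np.random.default_rng(seed)
    vols = []
    for _ in range(n_samples):
        P = embeddings[rng.choice(len(embeddings), size=5, replace=False)]
        D2 = np.square(P[:, None, :] - P[None, :, :]).sum(-1)   # squared distances
        M = np.ones((6, 6))
        M[0, 0] = 0.0
        M[1:, 1:] = D2
        # 4-simplex: vol^2 = (-1)^5 / (2^4 * (4!)^2) * det(CM) = -det / 9216
        vol_sq = max(-np.linalg.det(M) / 9216.0, 0.0)
        vols.append(math.sqrt(vol_sq))
    vols = np.array(vols)
    return vols.std() / vols.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 256))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
cv = pentachoron_cv(X)
print(round(cv, 3))   # high regularity -> low CV
```

A structured-but-regular manifold lands in the 0.20–0.23 band; collapse or pathology would show up as a much larger spread of simplex volumes.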

The training-time CV trajectory shows continued compression:

Epoch 0: 0.322    (above band, still learning)
Epoch 1: 0.241    (entering band)
Epoch 2: 0.215    (in band)
Epoch 3: 0.191    (below band, compressing)
Epoch 4: 0.168    (below all prior measurements)
Epoch 5: 0.153    (approaching theoretical floor)
Epoch 6: 0.144    (still falling)

The training-time CV measures the geometric regularity of the learned embedding space during optimization. The analysis-time CV (0.222) measures on held-out sequences — the manifold is tighter on the training distribution than on diverse held-out data, as expected.

Empirical Constants

Constant                      Value              Context
CV pentachoron (analysis)     0.222              In universal band (0.20–0.23), architecture 18
CV pentachoron (training E6)  0.144              Below all prior measurements
SVD entropy (stable)          1.560              Locked from epoch 0, all taps contributing
SVD top-1 ratio (stable)      0.286              No single layer dominates
SVD condition number          2.84               Well-conditioned multi-layer decomposition
Embedding intrinsic dim       131 (95%)          0.51 effective ratio of 256
Anchor-anchor cosine          -0.030 mean        Near-orthogonal, confirmed repulsion
Pairwise embedding cosine     0.094 mean         Near-orthogonal on S^(255)
Gate mean                     0.862              All anchors open — manifold has no pathology
Confidence correlation        0.988–0.997        Observer reads ESM-2's certainty structure
Binding/separation boundary   0.29154 / 0.70846  Prior work, validated across modalities

Design Principles

Architecture Before Loss

The geometric substrate is structural, not supervisory. The CM gate doesn't need a validity loss — it architecturally suppresses degenerate simplices. The Procrustes alignment doesn't need an alignment loss — the rotation is exact by construction. Repulsion-initialized anchors don't need a spread loss if the gate prevents collapse.

The Freeze Is Non-Negotiable

The observer must form an independent opinion. Shared gradients create shared bias. Independent opinions catch errors; correlated opinions amplify them. This was validated across hundreds of experiments with shared memory banks, joint training, and partial coupling β€” all converged to the same failure mode.

Gates Start Open

Gate bias initialized at +2.0 (sigmoid ≈ 0.88). The observer starts by attending to everything and learns to suppress what's geometrically invalid. Not the reverse. Learning to open is harder than learning to close.
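Numerically, that choice is just a bias on the gate's final logit (a toy arithmetic check, not the gate module itself):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

GATE_BIAS_INIT = 2.0
openness = sigmoid(GATE_BIAS_INIT)
print(round(openness, 3))   # gates begin roughly 88% open
```

A bias of -2.0 would instead start the gates ~88% closed, forcing the observer to learn what to admit rather than what to suppress.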

Usage

from huggingface_hub import hf_hub_download
from geolip_core.pipeline.esm2_geometric import ESM2GeometricPipeline
import torch

# Load
pipe = ESM2GeometricPipeline('esm2_geo')
ckpt = torch.load(
    hf_hub_download('AbstractPhil/geolip-esm2_t33_650M_UR50D',
                    'prototype/v1_distill/epoch_6.pt'),
    map_location='cuda')
# Merge only the observer's trained weights; everything else keeps its loaded state
current = pipe.state_dict()
current.update({k: v for k, v in ckpt['state_dict'].items() if k in current})
pipe.load_state_dict(current, strict=False)
pipe.eval().cuda()

# Run
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')
enc = tokenizer("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
                return_tensors='pt', padding=True).to('cuda')

with torch.no_grad():
    loss, info = pipe.forward_distill(enc['input_ids'], enc['attention_mask'])

# Diagnostics
print(info['gate_info'])
print(info['svd_S'])
print(pipe.cache_get('embedding'))  # protein-level geometric feature

Requirements: transformers<=4.49 (versions ≥4.50 remove ESM-2 position embeddings silently)

Prototype Directory

prototype/
└── v1_distill/
    ├── checkpoints/
    │   ├── step_500.pt
    │   ├── step_1000.pt
    │   ├── ...
    │   └── step_7000.pt
    ├── tensorboard/
    │   └── events.out.tfevents.*
    ├── epoch_0.pt
    ├── epoch_1.pt
    ├── ...
    ├── epoch_6.pt
    ├── final.pt
    └── training_log.jsonl

I dub this model - Ryan-Spearman

This is a reminder to myself and everyone who reads it, that the guidepost anchors in our very being are often forgotten.

I dedicate this model's baseline prototype, and its name, to a fallen brother of mine from years long past.

As I grew from young to teenage years I had a friend named Ryan. Ryan wasn't the most... intelligent of fellows, but whatever he lacked there he made up for many times over elsewhere: compassion, respect, understanding, and uncannily accurate empathy.

He spent his elementary school years in classes with other special needs students, then joined mainstream schooling as a teenager. I never once believed he couldn't handle the tasks he was given; I knew he could if he set his mind to it. Ryan eventually grew into a man, and his health began to fail him. It was not a problem he was capable of solving, and so Ryan fell to time and entropy, as we all eventually will: liver failure.

He and I spent a great deal of time over the years playing video games and working through complex relationships in music composition. Ryan was deeply analytical, and that trait passed to me, though not immediately; it happened over time, years later really.

We grew apart as adults, and some time after his passing I began to wonder if I could have done something to help. I know now that I couldn't have at the time; I simply lacked the skillset to solve the faults in his body, and I had neither the medical training nor the expertise to actually help.

Today, however, I'm in a position to potentially analyze the information behind the very diagnostic tools the experts utilize, and to augment and improve what exists without ripping the structure to the ground and rebuilding it.

I will not waste this chance the universe has gifted me.

Research by AbstractPhil. Apache 2.0.
