Ryan Spearman: Geometric Variant Effect Prediction Through Quaternion-Composed Dual Expert Alignment

Community Article Published March 28, 2026

Phil (AbstractPhil) — March 2026

Named for Ryan Spears (1987–2023), who taught that the best analysis comes from listening carefully to what the signal is actually telling you. The Spearman correlation that evaluates this model makes the name doubly fitting.


Abstract

We present Ryan Spearman, a variant effect prediction system that achieves ρ = 0.993 Spearman correlation on training proteins and ρ = 0.309 mean across 84 unseen ProteinGym assays — from a head trained on only five proteins. The system introduces three architectural innovations: (1) a quaternion-composed multi-arm attention head with geometric FiLM conditioning from a GeoLIP observer, (2) a Procrustes-aligned dual expert that combines independent ESM-2 and geometric observers via learned Cayley orthogonal rotation, and (3) the empirical demonstration that dual-observer alignment transfers better than either observer alone, winning 76 of 84 unseen protein assays. The entire system — ESM-2 backbone, geometric observer, and prediction heads — costs approximately $16 in compute on a single GPU.

1. The Problem

Predicting how amino acid mutations affect protein function is one of the central problems in computational biology. Deep mutational scanning (DMS) experiments measure the fitness consequences of thousands of mutations in a single protein, but only a fraction of the proteome has been experimentally characterized. Computational methods must generalize from limited observations to the full space of possible mutations across all proteins.

The current state of the art on the ProteinGym benchmark (217 DMS assays, ~2.7M variants) achieves mean Spearman correlations of 0.45–0.58, with S3F-MSA at approximately 0.58, AlphaMissense at 0.514, and ESM-2 zero-shot at 0.45–0.52. These methods require multiple sequence alignments, protein structures, large supervised datasets, or massive model scale.

We ask: can a small prediction head, trained on features from a frozen protein language model and geometric observer, learn transferable patterns about mutation effects?

2. Architecture

2.1 The Observer Pipeline

The foundation is ESM-2 (650M parameters), frozen, serving as a protein language model that encodes sequence context across 33 transformer layers. On top of ESM-2 sits the GeoLIP geometric observer — an 8M-parameter self-distillation pipeline, trained in Phase 1 for 13 hours, that observes ESM-2's internal representations without modifying them.

The observer produces, for any protein sequence:

  • Geometric features (768-d): fused representation on the unit hypersphere
  • Gate values (32-d): per-anchor activation from Cayley-Menger validated gating
  • Patchwork compartments (256-d): spatial decomposition of the geometric space
  • Embedding on S^(255) (256-d): unit hypersphere position
  • SVD spectrum (5-d): singular value decomposition of the representation
  • Constellation geometry: cosine similarity to 32 anchors, triangulation distances, soft assignments, and gated triangulation — describing the protein's position on the geometric manifold relative to fixed reference points
  • Observer logits: the observer's independent opinion about amino acid probabilities at each position

2.2 Feature Extraction

For each variant, we extract features under three conditions at the mutation site:

  1. Wild-type (WT): ESM-2 hidden states at the mutation position with the original sequence
  2. Masked: ESM-2 hidden states with the mutation position masked — what does the model predict should be here?
  3. Mutant (MUT): ESM-2 hidden states with the substituted amino acid

Each condition produces 33 layers × 1280 dimensions at the mutation position. The observer produces geometric features for both WT and MUT full sequences. Combined: 44 feature fields per variant, capturing how ESM-2's entire internal hierarchy responds to the mutation, plus the geometric observer's structural opinion about both the original and mutated protein.
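As a minimal sketch of what each variant's feature stack looks like, the snippet below stands in random arrays for the three ESM-2 forward passes (the real pipeline runs the model once per condition) and shows the classic masked-marginal score — log p(mutant) − log p(wild-type) at the masked position — that the opinion stream builds on. All dimensions match the text; everything else is placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, vocab = 33, 1280, 33  # ESM-2 650M: 33 layers, 1280-d hidden states

# Stand-ins for ESM-2 hidden states at the mutation site under each condition.
wt_states     = rng.standard_normal((n_layers, d_model))
masked_states = rng.standard_normal((n_layers, d_model))
mut_states    = rng.standard_normal((n_layers, d_model))

# Per-variant layer stack fed to the heads: 3 conditions x 33 layers x 1280 dims.
layer_stack = np.stack([wt_states, masked_states, mut_states])
assert layer_stack.shape == (3, n_layers, d_model)

def masked_marginal(logits, wt_idx, mut_idx):
    """Masked-marginal score: log p(mut aa) - log p(wt aa) at the masked site."""
    log_probs = logits - (np.log(np.exp(logits - logits.max()).sum()) + logits.max())
    return log_probs[mut_idx] - log_probs[wt_idx]

logits = rng.standard_normal(vocab)          # placeholder logits at the masked position
score = masked_marginal(logits, wt_idx=4, mut_idx=7)
```

A positive score means the model prefers the mutant residue at that position; the heads consume the full layer stack rather than this single scalar.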

2.3 The GeoQuat Head

The GeoQuaternionHead (7.4M parameters) uses four geometrically-conditioned attention arms composed via Hamilton product:

Geometric Context Encoder: Three parallel streams process the observer's output:

  • Constellation stream: per-anchor features (cosine similarity, triangulation distance, soft assignment, gated triangulation) for WT and MUT, plus their deltas. 12 features per anchor × 32 anchors, processed by a shared MLP then pooled.
  • Structural stream: geometric features, gate values, patchwork, embedding, and SVD spectrum for both WT and MUT.
  • Opinion stream: ESM-2 and observer logits at the masked position, plus their disagreement.

These fuse to a 128-d conditioning vector that describes the structural context of this specific mutation.

Four Arms: Each arm is a 2-block transformer attending over the 33 ESM-2 layers at the mutation site, with FiLM (Feature-wise Linear Modulation) conditioning from the geometric context between attention blocks. The geometry tells each arm how to interpret its condition:

  • w-arm: masked layers — "what should be here?" (conservation signal)
  • i-arm: WT layers — "what is here?" (context signal)
  • j-arm: MUT layers — "what's here after mutation?" (consequence signal)
  • k-arm: WT minus MUT layers — "what changed?" (displacement signal)
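The FiLM step between attention blocks can be sketched in a few lines. The generator below is a single hypothetical linear layer (the real head's conditioning network is not specified here); the essential operation is the per-channel scale and shift, broadcast over the 33 layer tokens of one arm:

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, d_arm, n_tokens = 128, 256, 33   # 128-d geometric context; arm width is assumed

# Hypothetical FiLM generator: maps the geometric context to a per-channel
# scale (gamma) and shift (beta) for one arm.
W = rng.standard_normal((d_ctx, 2 * d_arm)) * 0.02
b = np.zeros(2 * d_arm)

def film(arm_tokens, geo_context):
    """Feature-wise Linear Modulation: y = (1 + gamma) * x + beta."""
    gamma, beta = np.split(geo_context @ W + b, 2)
    return (1.0 + gamma) * arm_tokens + beta    # broadcast over the layer tokens

geo_context = rng.standard_normal(d_ctx)            # from the context encoder
arm_tokens = rng.standard_normal((n_tokens, d_arm)) # one arm's layer sequence
modulated = film(arm_tokens, geo_context)
assert modulated.shape == arm_tokens.shape
```

The (1 + gamma) parameterization keeps the modulation near identity at initialization, so the arms start as plain attention and learn how much geometric steering to apply.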

Quaternion Composition: Each arm projects to 64-d, forming the w, i, j, k components of 64 independent quaternions. These are normalized to unit length and composed with a learned rotation quaternion via Hamilton product:

q_composed = R ⊗ q_expert

where ⊗ is the non-commutative quaternion product. The Hamilton product preserves cross-terms between arms — the interaction between the MUT consequence (j) and displacement (k) contributes to the composed signal in ways that linear combination cannot capture. The learned rotation R aligns the expert quaternion space to the fitness prediction axis.
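The composition step above can be sketched directly. Assuming, for illustration, that the learned rotation R is a single unit quaternion broadcast over all 64 expert quaternions (the head may learn per-quaternion rotations; that detail is not specified here):

```python
import numpy as np

def hamilton(r, q):
    """Hamilton product r ⊗ q for quaternion arrays of shape (..., 4)."""
    a, b, c, d = np.moveaxis(r, -1, 0)
    w, x, y, z = np.moveaxis(q, -1, 0)
    return np.stack([
        a * w - b * x - c * y - d * z,
        a * x + b * w + c * z - d * y,
        a * y - b * z + c * w + d * x,
        a * z + b * y - c * x + d * w,
    ], axis=-1)

rng = np.random.default_rng(0)

# Hypothetical arm outputs: four arms each project to 64-d, giving the
# w, i, j, k components of 64 independent quaternions.
arm_w, arm_i, arm_j, arm_k = rng.standard_normal((4, 64))
q_expert = np.stack([arm_w, arm_i, arm_j, arm_k], axis=-1)    # (64, 4)
q_expert /= np.linalg.norm(q_expert, axis=-1, keepdims=True)  # unit quaternions

R = rng.standard_normal(4)          # learned rotation quaternion (random here)
R /= np.linalg.norm(R)

q_composed = hamilton(R, q_expert)  # R broadcast over the 64 quaternions
# Norm preservation: unit ⊗ unit stays on the unit 3-sphere.
assert np.allclose(np.linalg.norm(q_composed, axis=-1), 1.0)

flat = q_composed.reshape(-1)       # 256-d vector (64 x 4) for the decode MLP
assert flat.shape == (256,)
```

Swapping the operands gives a different result — the non-commutativity the text relies on is visible numerically as hamilton(R, q) ≠ hamilton(q, R).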

Decode: The 256-d composed quaternion (64 × 4 components) concatenated with the 128-d geometric context feeds through an MLP to produce a scalar fitness prediction.

2.4 The E3 Baseline Head

The E3_Baseline (1.6M parameters) serves as a control and Procrustes partner. It uses the same three-condition cross-attention architecture (WT, masked, MUT interleaved as 99 tokens through 2 transformer blocks) but with no geometric features. Pure multi-head attention on raw ESM-2 layers.

2.5 Procrustes-Aligned Dual Expert

The ProcrustesAligner (511K parameters) combines two frozen experts:

  1. Both experts produce representations from the same variant features
  2. Each representation is projected to a common 256-d space
  3. A Cayley orthogonal rotation aligns expert B's space to expert A's: Q = (I − A)(I + A)^(−1) where A is skew-symmetric. This guarantees a pure rotation — no scaling, no shearing, no rank collapse. 32,640 free parameters (the upper triangle of A).
  4. Newton-Schulz whitening decorrelates the concatenated 512-d space during training
  5. An MLP predicts fitness from the aligned, whitened features

The principle: when two independent observers agree after alignment, the agreement is signal. When they disagree, the structured pattern of disagreement is also informative. The Cayley rotation finds the optimal rigid transformation to make both observers' opinions maximally useful for prediction.
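Steps 3 and 4 can be sketched concretely. The Cayley map below is the standard parameterization described in the text; the Newton–Schulz iteration is a generic inverse-square-root routine standing in for the whitening step (the aligner's exact training-time whitening schedule is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Cayley map: an upper-triangular parameter block defines skew-symmetric A,
# and Q = (I - A)(I + A)^(-1) is exactly orthogonal with det +1.
theta = rng.standard_normal((d, d)) * 0.05
A = np.triu(theta, k=1)
A = A - A.T                          # skew-symmetric: d*(d-1)/2 = 32,640 free params
I = np.eye(d)
Q = (I - A) @ np.linalg.inv(I + A)

assert np.allclose(Q @ Q.T, I, atol=1e-8)          # pure rotation, no scaling/shearing
assert np.isclose(np.linalg.det(Q), 1.0, atol=1e-6)

def inv_sqrt_ns(C, iters=15):
    """Newton-Schulz iteration for C^(-1/2) of an SPD matrix (whitening sketch)."""
    c = np.linalg.norm(C)            # Frobenius normalization ensures convergence
    Y, Z = C / c, np.eye(C.shape[0])
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(C.shape[0]) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Z / np.sqrt(c)

X = rng.standard_normal((1000, 8))               # toy feature batch
C = X.T @ X / 1000 + 1e-3 * np.eye(8)            # its covariance
W = inv_sqrt_ns(C)
assert np.allclose(W @ C @ W, np.eye(8), atol=1e-4)   # decorrelated
```

The orthogonality check mirrors the training observation in Section 5.3: the determinant stays at 1.000 by construction, not by regularization.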

3. Training Data

All heads train on pre-extracted features from five MaveDB proteins:

| Protein | Function | Variants | Region |
|---|---|---|---|
| BRCA1 | Tumor suppressor | 1,222 | RING domain |
| PTEN | Phosphatase | 10,469 | Full length |
| SUMO1 | Ubiquitin-like modifier | 1,919 | Full length |
| TPK1 | Thiamine pyrophosphokinase | 5,408 | Full length |
| UBE2I (UBC9) | E2 conjugating enzyme | 3,002 | Full length |

Total: 22,020 single amino acid substitutions with experimentally measured fitness scores.

Critical data processing note: MaveDB target sequences are DNA nucleotides, not protein. PTEN's listed "1656 amino acids" are actually 1656 nucleotides — 552 codons encoding a 551-residue protein plus a stop codon. A codon translation step (covering all 64 codons) was essential: without it, only 54 BRCA1 variants could be matched instead of 1,222.
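The translation step is small but load-bearing. A minimal version using the standard genetic code (the compact 64-character table below is the canonical TCAG ordering) looks like this:

```python
# Minimal codon translation for MaveDB target sequences (DNA, not protein).
from itertools import product

bases = "TCAG"
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(bases, repeat=3), amino_acids)
}
assert len(CODON_TABLE) == 64

def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

assert translate("ATGGCCTGA") == "MA"   # Met-Ala-stop
```

With this in place, a 1656-nucleotide MaveDB target resolves to its 551-residue protein and the listed variants match the translated sequence.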

4. Head Architecture Search

We evaluated 12 architectures before arriving at GeoQuat:

| Head | Architecture | Params | Mean ρ |
|---|---|---|---|
| A. LogitDelta | Linear on logit differences | 5 | 0.277 |
| B. LayerAttn | Attention over 33 layers (masked only) | 428K | 0.812 |
| C. Disagree | Observer logit fusion | 826K | 0.552 |
| D. FullStratum | Conv over all channels | 1.7M | 0.838 |
| E1. MHA | Self-attention over layers | 1.2M | 0.844 |
| E2. DualCross | WT↔MUT cross-attention | 1.3M | 0.821 |
| E3. TriCondition | 3-condition cross-attention | 1.6M | 0.903 |
| E4. KitchenSink | E3 + everything | 3.4M | 0.831 |
| Quaternion | 4-arm Hamilton product | 6.1M | 0.899 |
| GeoE3 | E3 + FiLM conditioning | 2.6M | 0.860 |
| GeoQuat | Quaternion + FiLM + constellation | 7.4M | 0.916 |

Key finding: FiLM conditioning on the interleaved E3 sequence HURTS (−0.057 from E3 baseline). FiLM on individual quaternion arms HELPS (+0.037). The principle: context before composition, not after. The geometry should inform each arm's interpretation before the Hamilton product composes them.

5. Training Results

5.1 GeoQuat — 100 Epochs, Full Data

Training on all 22,020 variants with no validation split, no early stopping, cosine annealing over 100 epochs:

| Protein | ρ |
|---|---|
| BRCA1_RING | 0.976 |
| PTEN | 0.996 |
| SUMO1 | 0.999 |
| TPK1 | 0.997 |
| UBE2I | 0.998 |
| MEAN | 0.993 |

The Hamilton product found the deep rotational basin. The early-stopped version (32 epochs, ρ = 0.916) was leaving substantial structural alignment on the table. The quaternion algebra acts as a structural regularizer — four arms cannot independently memorize because the non-commutative product couples their outputs.

5.2 E3 Baseline — 100 Epochs, Full Data

| Protein | ρ |
|---|---|
| BRCA1_RING | 0.971 |
| PTEN | 0.977 |
| SUMO1 | 0.991 |
| TPK1 | 0.979 |
| UBE2I | 0.983 |
| MEAN | 0.980 |

The 0.013 gap between E3 (0.980) and GeoQuat (0.993) quantifies the geometric conditioning's contribution. FiLM gives GeoQuat the last degree of structural precision that raw MHA cannot reach.

5.3 Procrustes Alignment — 200 Epochs

Matched experts (both 100-epoch trained), no early stopping:

| Protein | ρ |
|---|---|
| BRCA1_RING | 0.956 |
| PTEN | 0.943 |
| SUMO1 | 0.981 |
| TPK1 | 0.962 |
| UBE2I | 0.970 |
| MEAN | 0.962 |

Training performance is lower than either expert alone. This is expected: the aligner reads representations from two frozen experts, each with their own noise patterns. The training ceiling is bounded by the weaker expert's representation quality. The Procrustes alignment's value is not in training performance — it is in generalization.

The Cayley rotation ‖R − I‖ evolved from 1.3 → 4.1 over 200 epochs, indicating a substantial rotation in SO(256). The determinant remained 1.000 throughout — pure rotation, as guaranteed by the Cayley parameterization.

6. ProteinGym Benchmark

The critical test: generalization to 84 unseen protein assays from the ProteinGym v0.1 substitution benchmark. No retraining. No fine-tuning. The frozen heads score every variant in every assay using the same pipeline that produced the training features.
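The metric itself is worth making explicit: each assay is scored by the Spearman correlation between predicted and measured fitness, and the headline number is the mean over assays. A self-contained, tie-aware version (in practice one would use scipy.stats.spearmanr; this sketch just exposes the rank-then-correlate mechanics):

```python
def ranks(values):
    """1-based ranks with averaged ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1           # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, target):
    """Spearman ρ: Pearson correlation of the rank vectors."""
    rp, rt = ranks(pred), ranks(target)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5

# Monotone transforms leave ρ untouched — the metric only sees rank order.
assert abs(spearman([1, 2, 3, 4], [1, 4, 9, 16]) - 1.0) < 1e-12
assert abs(spearman([1, 2, 3, 4], [4, 3, 2, 1]) + 1.0) < 1e-12
```

Because only rank order matters, a head can score well even if its raw outputs are on a completely different scale than the assay's fitness units — which is exactly why frozen heads can be benchmarked on unseen assays without calibration.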

6.1 Multi-Head Results

| Head | Mean ρ | Median ρ | P25 | P75 | Wins vs GeoQuat |
|---|---|---|---|---|---|
| Procrustes matched | 0.309 | 0.302 | 0.180 | 0.416 | 76/84 |
| GeoQuat E100 | 0.277 | 0.276 | 0.139 | 0.371 | — |
| Procrustes E200 (mismatched) | 0.273 | 0.252 | 0.132 | 0.380 | — |
| E3 E100 | 0.245 | 0.236 | 0.136 | 0.321 | — |

6.2 Comparison with Published Methods

| Method | Mean ρ | Training Data | Structural Input |
|---|---|---|---|
| S3F-MSA (SOTA) | ~0.58 | Zero-shot | Sequence + Structure + MSA + Surface |
| AlphaMissense | 0.514 | Primate variants | AlphaFold structure |
| TranceptEVE L | ~0.49 | Zero-shot + MSA | Sequence + MSA |
| ESM-2 zero-shot | ~0.45–0.52 | Zero-shot | Sequence only |
| Ryan Spearman Procrustes | 0.309 | 5 proteins | ESM-2 + Geometric observer |
| Ryan Spearman GeoQuat | 0.277 | 5 proteins | ESM-2 + Geometric observer |

6.3 Key Observations

Procrustes alignment wins 90% of assays. The dual-expert alignment transfers better than either expert alone on 76 of 84 unseen proteins. This validates the Bertenstein principle: when two independent observers agree after rotation, the agreement represents transferable structure.

GeoQuat outperforms E3 by 13%. Geometric FiLM conditioning from the observer produces more transferable attention patterns than unconstrained MHA. E3's attention memorizes position-specific patterns; GeoQuat's attention learns structurally-grounded patterns that partially transfer across protein families.

The mismatched Procrustes (old E3 + new GeoQuat) performs WORSE than GeoQuat alone. When one expert is much weaker than the other, the alignment dilutes the stronger signal. Matched-strength experts are essential.

Same-lab assays on training proteins show near-perfect transfer. SUMO1_HUMAN_Weile_2017 achieves ρ = 0.937 (Procrustes) and 0.950 (GeoQuat) — different assay, same protein family, same lab methodology. UBC9_HUMAN_Weile_2017 achieves 0.924/0.935. The head perfectly learned how these protein families respond to mutations; the assay methodology is the remaining variable.

Viral proteins remain challenging. HIV envelope (ρ ≈ 0.02–0.07), SARS-CoV-2 spike (ρ ≈ 0.09–0.13), and disordered proteins are far outside the training distribution. ESM-2 itself has limited coverage of viral protein families.

6.4 Top Performers on Unseen Proteins

| Assay | Protein | GeoQuat | Procrustes | Published Best |
|---|---|---|---|---|
| BLAT_ECOLX_Firnberg_2014 | Beta-lactamase | 0.490 | 0.550 | ~0.55 |
| BLAT_ECOLX_Stiffler_2015 | Beta-lactamase | 0.489 | 0.548 | ~0.55 |
| P53_HUMAN_Kotler_2018 | Tumor suppressor | 0.477 | 0.512 | ~0.45 |
| TRPC_SACS2_Chan_2017 | Thermophilic enzyme | 0.436 | 0.482 | — |
| A4_HUMAN_Seuma_2021 | Amyloid precursor | 0.431 | 0.479 | — |

On beta-lactamases, the Procrustes head approaches published zero-shot SOTA despite training on zero beta-lactamase data. The structural patterns learned from five diverse human/yeast proteins transfer to bacterial enzymes.

7. Architectural Insights

7.1 The Quaternion Regularizer

MHA on 33 ESM-2 layers with 5 proteins of training data is a memorization machine. The quaternion algebra constrains it. The four arms must produce components that compose meaningfully under the non-commutative Hamilton product. Arm w's output has to make algebraic sense when multiplied with arm j's output:

w = w₁w₂ − x₁x₂ − y₁y₂ − z₁z₂

The cross-terms (y₁z₂ − z₁y₂) mean the interaction between MUT consequence and displacement contributes to the composed signal. A linear combination cannot capture these interactions. The algebra forces the arms to learn complementary representations rather than independently memorizing the same patterns.

Empirical evidence: arm norms during training show differentiation — |w| = 6.1, |i| = 4.9, |j| = 6.4, |k| = 6.2 by epoch 20. The WT arm (i) has the smallest magnitude, consistent with providing baseline context rather than discriminative signal.

7.2 FiLM as Structural Attention Guidance

The geometric features are protein-level (mean-pooled), not position-specific. The FiLM conditioning tells each arm "you're in this kind of protein with this structural profile on the hypersphere." On training proteins, this produces 5 distinct conditioning contexts. On unseen proteins, the geometric observer produces a novel conditioning vector from the protein's actual structural features.

The 0.277 → 0.309 improvement from GeoQuat to Procrustes suggests that the geometric conditioning partially transfers. The FiLM doesn't tell the attention "PTEN position 130" — it tells it "highly conserved phosphatase active site region near anchor 7 on the constellation." That structural description has analogs in unseen proteins.

7.3 Cayley Rotation as Universal Alignment

The Procrustes aligner learns a 256×256 orthogonal rotation parameterized via the Cayley map. This is guaranteed to be a pure rotation with no distortion. The rotation discovers the optimal rigid transformation to align two representation spaces.

On training proteins, the rotation converges to ‖R − I‖ ≈ 4.1 — a substantial rotation, far from identity. The two experts see mutations from genuinely different perspectives. The rotation finds the coordinate system where their complementary information combines most effectively.

The transfer result (76/84 wins) demonstrates that this alignment generalizes. The relationship between how E3 and GeoQuat represent mutations is a structural property of the architectures, not a property of specific proteins. The Cayley rotation discovers this relationship once; it applies everywhere.

8. Computational Cost

| Component | Time | Hardware | Cost |
|---|---|---|---|
| Phase 1: Observer distillation | ~13 hours | Single GPU | ~$13 |
| Feature extraction (22K variants) | ~1 hour | Single GPU | ~$1 |
| GeoQuat 100 epochs | 9 minutes | Single GPU | ~$0.15 |
| E3 100 epochs | 4 minutes | Single GPU | ~$0.07 |
| Procrustes 200 epochs | 11 minutes | Single GPU | ~$0.18 |
| ProteinGym benchmark (84 assays) | 76 minutes | Single GPU | ~$1.25 |
| Total | ~15.5 hours | Single Blackwell | ~$16 |

The head training (GeoQuat + E3 + Procrustes combined) takes 24 minutes on pre-extracted features. The architecture search (12 heads) took under 30 minutes total. The bottleneck is Phase 1 observer distillation, which is a one-time cost amortized across all downstream tasks.

9. Limitations and Future Directions

Training diversity is the bottleneck, not architecture. The 0.309 mean on ProteinGym is bounded by having trained on 5 proteins. The architecture achieves 0.993 on training data and transfers structural patterns to unseen proteins — but 5 proteins cannot cover the diversity of protein families, folds, and functions in the benchmark. Training on 30–40 diverse ProteinGym assays with leave-one-out cross-validation is the clear next step.

Protein-level geometric features limit position-specific reasoning. The observer's structural features are mean-pooled over the full sequence. The FiLM conditioning is identical for all variants in the same protein. Per-position observer features (tap stack at the mutation site) would give the arms local structural context, not just global protein context.

The v0.1 benchmark covers 87 assays, not the full 217. The newer ProteinGym v1.3 with 217 assays and additional baselines would provide a more comprehensive comparison. The download infrastructure needs updating to access the full benchmark.

Early stopping may be better for generalization. The early-stopped GeoQuat (32 epochs, ρ = 0.916) achieved ρ = 0.278 on ProteinGym — nearly identical to the 100-epoch version at 0.277. The additional training epochs improved training performance from 0.916 → 0.993 but did not improve generalization. For the Procrustes alignment, the matched 100-epoch experts may have been too overfit to provide complementary signal on unseen proteins.

10. Reproduction

All code, checkpoints, and pre-extracted features are publicly available:

  • Observer weights: AbstractPhil/geolip-esm2_t33_650M_UR50D
  • Pre-extracted features + head checkpoints: AbstractPhil/ryan-spearman-prepared-features
  • GeoLIP core library: github.com/AbstractEyes/geolip-core (pip install geolip-core)

Head checkpoints on HuggingFace:

  • heads/GeoQuat_epoch100.pt — 7.4M params, 100 epochs on all data
  • heads/E3_epoch100.pt — 1.6M params, 100 epochs on all data
  • heads/Procrustes_matched_epoch200.pt — 511K params, 200 epochs, matched experts
  • heads/E3_Baseline_best.pt — E3 with early stopping (original)

To reproduce from scratch:

  1. Install: pip install geolip-core "transformers<=4.49" huggingface_hub scipy
  2. Extract features: python extract_features.py
  3. Train heads: python train_full_epochs.py
  4. Benchmark: python eval_proteingym.py

11. Conclusion

Ryan Spearman demonstrates that small prediction heads on frozen protein language models can learn transferable variant effect patterns. The quaternion-composed architecture with geometric FiLM conditioning outperforms unconstrained attention by providing structural regularization through algebraic constraints. The Procrustes-aligned dual expert extends this further, finding transferable structure in the agreement between independent observers that neither captures alone.

The system trains on 5 proteins in 24 minutes and generalizes to 84 unseen proteins. It is not competitive with state-of-the-art methods that use multiple sequence alignments, protein structures, and large-scale supervised data. What it does support is the principle that scaling training diversity, not model complexity, is the path to closing this gap.

The geometric observer, the quaternion algebra, and the Cayley rotation are all grounded in the same mathematical framework: unit hypersphere geometry, group-theoretic composition, and orthogonal alignment. Ryan Spearman is the first application of the GeoLIP ecosystem to a practical biological prediction task, and the results suggest that geometric deep learning has a meaningful role in protein variant effect prediction.


Appendix A: Geometric Constants

Empirically validated across 17+ models and all architectures:

  • CV pentachoron band: 0.20–0.23 (universal attractor on S^(d−1))
  • Binding/separation constant: 0.29154 (complement 0.70846)
  • Cross-modal QK eigenvalue lock: 0.500 (universal across profiled models)
  • Cayley rotation convergence: ‖R − I‖ ≈ 4.1 at 200 epochs in 256-d SO(256)

Appendix B: Error Catalogue

| Error | Impact | Resolution |
|---|---|---|
| MaveDB sequences are DNA, not protein | 96% variant loss | Codon translation table |
| Wild-type marginal scoring (unmasked) | ρ = 0.10 | Masked marginal scoring |
| Global average pooling in geometric encoder | 70% → 29% accuracy | Spatial statistics or flatten |
| FiLM on interleaved E3 | −0.057 from baseline | FiLM on individual arms |
| Overwritten HF checkpoints | Lost good weights | Verify provenance before push |
| Early stopping on val ρ | Missed deep quaternion basin | Fixed epoch training |

Appendix C: The Hamilton Product

For quaternion q = (w, x, y, z) and rotation r = (a, b, c, d):

composed_w = aw − bx − cy − dz
composed_x = ax + bw + cz − dy
composed_y = ay − bz + cw + dx
composed_z = az + by − cx + dw

Properties exploited by Ryan Spearman:

  • Non-commutative: r ⊗ q ≠ q ⊗ r — order of composition matters
  • Norm-preserving: |r ⊗ q| = |r| · |q| — unit quaternions compose to unit quaternions
  • Cross-coupling: the y₁z₂ − z₁y₂ terms create interactions between arms that linear combination cannot
  • Rotation group: unit quaternions form SU(2), the double cover of SO(3) — every quaternion IS a rotation
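These properties can be checked numerically in a few lines (a verification sketch, not part of the model):

```python
import numpy as np

def hamilton(r, q):
    """Hamilton product r ⊗ q for single quaternions (w, x, y, z)."""
    a, b, c, d = r
    w, x, y, z = q
    return np.array([
        a * w - b * x - c * y - d * z,
        a * x + b * w + c * z - d * y,
        a * y - b * z + c * w + d * x,
        a * z + b * y - c * x + d * w,
    ])

rng = np.random.default_rng(0)
r = rng.standard_normal(4); r /= np.linalg.norm(r)
q = rng.standard_normal(4); q /= np.linalg.norm(q)

# Non-commutative: r ⊗ q and q ⊗ r generally differ.
assert not np.allclose(hamilton(r, q), hamilton(q, r))
# Norm-preserving: unit quaternions compose to unit quaternions.
assert np.isclose(np.linalg.norm(hamilton(r, q)), 1.0)
# i² = j² = k² = ijk = -1 — the bridge carving, verified.
i, j, k = np.eye(4)[1], np.eye(4)[2], np.eye(4)[3]
minus_one = np.array([-1.0, 0.0, 0.0, 0.0])
assert np.allclose(hamilton(i, i), minus_one)
assert np.allclose(hamilton(j, j), minus_one)
assert np.allclose(hamilton(k, k), minus_one)
assert np.allclose(hamilton(hamilton(i, j), k), minus_one)
```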

William Rowan Hamilton spent years trying to extend complex numbers to three dimensions before discovering that four dimensions were necessary. He carved i² = j² = k² = ijk = −1 into Brougham Bridge in Dublin on October 16, 1843. The algebra has been exact for 183 years. It does not approximate, iterate, or converge. It composes.
