GeoLIP Geometric Transformer v1.1
Prototype: a dual-stream transformer architecture with constellation-routed attention, quaternion composition, and per-layer Cayley alignment. Built on the GeoLIP geometric observation pipeline.
This is a research prototype. Settings are fixed for initial validation. Future versions will expand configuration, improve component integration, and introduce autotuning based on task characteristics rather than hard-set hyperparameters.
Not CM Valid
Requires replanning.
Geometric Residuals added
This is similar to the routing that was developed and tested to nearly 99.995% residual preservation down the chain. It should exhibit behavior similar to RoPE while being considerably more robust under certain stress levels, and it will work directly in conjunction with RoPE after a bit of compensation.
Early Experiments Show
When this transformer overfits, the quaternion weights continue to refine geometrically.
The system continues to learn even when overfit. The losses likely need to be tweaked to reflect that.
I'm setting up a full geometric, statistics, and rulings-based balancer to detect what is causing the overfitting and why. This will include a series of deep-level probes, analysis, determinants, structural assessments, CV, CM, SVD solidity, encoding similarity, embedding similarity, anchors active and collapsed, contribution levels, and more.
I'll get to the bottom of it.
Compilation Available
When using the geofractal compiler, simply call model.compile() and the transformer will be compiled.
Eager mode ran around 22 s per epoch on CIFAR-10; after compilation with the geofractal compilation system it runs around 7.1 s per epoch.
Architecture
The Geometric Transformer is a sequence-to-sequence model that processes token sequences through two parallel attention streams: one driven by standard content-based attention, the other routed by a learned geometric reference frame. The streams are aligned via orthogonal rotation and composed through quaternion algebra, producing an output that carries both content and structural signal.
Each layer implements the full GeoLIP observation pipeline: project to manifold, observe through a constellation, curate via patchwork, condition attention with FiLM, and compose through the Hamilton product.
Input tokens (B, L, D)
        |
+----------------------------------------------------+
|             GeometricTransformerLayer              |
|                                                    |
|  ManifoldProjection --------> S^(d-1) per position |
|        |                                           |
|  ConstellationObserver -----> full observation dict|
|        |                                           |
|  PositionGeometricContext --> FiLM context         |
|        |                                           |
|   +----+-----+                                     |
|   |          |                                     |
| Stream A  Stream B                                 |
| Content   Geometric                                |
|  (MHA)    (FiLM Q,K)                               |
|   |          |                                     |
|   |    CayleyOrthogonal                            |
|   |    (align B -> A)                              |
|   |          |                                     |
|   +----+-----+                                     |
|        |                                           |
|  QuaternionCompose                                 |
|  (w=content, i=geo, j=disagree, k=agree)           |
|        |                                           |
|  Decode + Gated Residual                           |
+----------------------------------------------------+
        |
Output tokens (B, L, D) + geo_state dict (11 fields)
Components
ManifoldProjection
The Input stage. Each position's hidden state is projected from model space into the constellation's embedding space and L2-normalized to sit on the unit hypersphere S^(d-1).
This is the tap: it reads the transformer's internal representation at every position without modifying it. The projection is per-position and per-layer, meaning each layer's constellation observes a different view of the representation as it evolves through the network.
The L2 normalization is not cosmetic. The constellation's triangulation operates in cosine distance space. Without normalization, magnitude differences between positions would dominate the distance computation and the constellation would measure intensity rather than direction. On the sphere, every point is equidistant from the origin, so the only signal is angular relationship, which is the structural signal we want.
ManifoldProjection(d_model=256, manifold_dim=128)
# Linear(256 → 128) → LayerNorm → L2 normalize
# Output: (B, L, 128) on S^127
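As a shape-level sanity check, the projection pipeline above can be sketched in plain NumPy (the actual module is a PyTorch Linear + LayerNorm; the weight names and the affine-free LayerNorm here are illustrative):

```python
import numpy as np

def manifold_project(x, W, b, eps=1e-6):
    """Sketch of ManifoldProjection: Linear -> LayerNorm -> L2 normalize.

    x: (B, L, d_model) hidden states; W: (d_model, manifold_dim).
    Returns per-position points on the unit sphere S^(manifold_dim - 1).
    """
    h = x @ W + b                                # Linear(256 -> 128)
    mu = h.mean(-1, keepdims=True)               # LayerNorm (no affine, for brevity)
    sd = h.std(-1, keepdims=True)
    h = (h - mu) / (sd + eps)
    return h / np.linalg.norm(h, axis=-1, keepdims=True)  # onto S^(d-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 256))
W = rng.normal(size=(256, 128)) * 0.05
z = manifold_project(x, W, np.zeros(128))
# Every output position now has unit norm, so only direction carries signal.
```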
ConstellationObserver
The Association and Curation stages of the GeoLIP pipeline, composed into a single observation.
The constellation is a set of learned anchor points on S^(d-1), initialized via repulsion to maximize angular separation. Given an embedding on the sphere, triangulation computes the cosine distance to every anchor. This distance profile (how far the point is from each reference) is the constellation's measurement of that point's position in the geometric reference frame.
Triangulation alone gives raw distances. The curation stage interprets them through patchwork compartments: small MLPs that each read a round-robin partition of the distance profile and produce a local interpretation. The bridge predicts what the soft assignment should be from the patchwork's perspective, creating a self-consistency check between the two views.
The observer returns a full observation dictionary per position: the embedding on the sphere, cosine similarities, triangulation distances, soft assignment, nearest anchor, patchwork features, and bridge prediction. Nothing is discarded; downstream components choose what they need.
The constellation is the primary state of the geometric pipeline. It owns the reference frame. Everything else (the FiLM conditioning, the attention routing, the quaternion composition) reads from or conditions upon this frame.
ConstellationObserver(dim=128, n_anchors=16, n_comp=4, d_comp=16)
# Output dict keys:
# embedding (B*L, 128) position on S^(d-1)
# triangulation (B*L, 16) cosine distances to anchors
# cos_to_anchors (B*L, 16) raw cosine similarities
# assignment (B*L, 16) soft assignment via temperature softmax
# nearest (B*L,) closest anchor index
# patchwork (B*L, 64) compartment features
# bridge (B*L, 16) patchwork's assignment estimate
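The triangulation and soft-assignment parts of the observation can be sketched as follows (a minimal NumPy illustration of the math; the temperature value and function names are assumptions, not the library API):

```python
import numpy as np

def observe(z, anchors, temp=0.1):
    """Sketch of the observer's triangulation step.

    z: (N, d) points on S^(d-1); anchors: (A, d) unit anchor points.
    Returns cosine similarities, triangulation distances, soft assignment,
    and the nearest-anchor index per point.
    """
    cos = z @ anchors.T                  # (N, A) cosine similarity (unit vectors)
    tri = 1.0 - cos                      # cosine distance profile
    logits = -tri / temp                 # closer anchor -> higher logit
    e = np.exp(logits - logits.max(-1, keepdims=True))
    assignment = e / e.sum(-1, keepdims=True)    # temperature softmax
    return cos, tri, assignment, tri.argmin(-1)  # nearest = smallest distance
```

Lowering the temperature sharpens the assignment toward the nearest anchor; raising it spreads mass across the constellation.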
PositionGeometricContext
Curation into conditioning. Takes the full observation dictionary and fuses it into a per-position context vector suitable for FiLM modulation.
The fusion is deliberate about what it combines. The anchor stream processes cosine similarities, soft assignment, and triangulation distances together: these are the "where is this point relative to the reference frame" features. The structural stream processes patchwork features and the raw embedding: these are the "what does this region of the sphere look like" features. The two streams are fused into a single context vector.
This separation matters because the anchor features are about position (where on the manifold) while the structural features are about character (what the local geometry looks like). FiLM needs both to route attention effectively β it needs to know not just that a token is near anchor 7, but what anchor 7's neighborhood contains.
PositionGeometricContext(n_anchors=16, pw_dim=64, manifold_dim=128, context_dim=64)
# anchor_mlp: (16*3=48) → 64 position features
# struct_mlp: (64+128=192) → 64 character features
# fuse: (64+64=128) → 64 combined context
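The two-stream fusion amounts to two concat-and-project paths joined by a final projection. A NumPy sketch with the dimensions above (the single-layer MLPs and weight names are illustrative placeholders for the module's actual MLPs):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def geo_context(obs, Wa, Ws, Wf):
    """Sketch of PositionGeometricContext's fusion.

    anchor stream: [cos, assignment, triangulation] -> position features
    struct stream: [patchwork, embedding]           -> character features
    fuse: concat of both                            -> context for FiLM
    """
    anchor = relu(np.concatenate([obs['cos_to_anchors'],
                                  obs['assignment'],
                                  obs['triangulation']], -1) @ Wa)  # (N, 48) -> (N, 64)
    struct = relu(np.concatenate([obs['patchwork'],
                                  obs['embedding']], -1) @ Ws)      # (N, 192) -> (N, 64)
    return relu(np.concatenate([anchor, struct], -1) @ Wf)          # (N, 128) -> (N, 64)
```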
ContentAttention (Stream A)
Standard multi-head self-attention with FFN. No geometric conditioning. This stream sees what a conventional transformer would see β token relationships based purely on learned content projections.
Its role in the architecture is to provide the baseline interpretation. Content attention captures the patterns that standard transformers are good at: semantic similarity, positional relationships learned through data, and the statistical regularities of the training distribution.
Stream A is deliberately left unmodified so the quaternion composition has a clean content signal to compare against the geometric signal. If both streams were conditioned on geometry, there would be no reference point for measuring disagreement.
GeometricAttention (Stream B)
Attention with FiLM modulation on Q and K from the curated constellation context. V is left unmodulated.
The design principle is proven from the Ryan Spearman variant effect prediction system: FiLM on individual arms BEFORE composition, not after. The geometric context controls WHERE attention flows (through Q and K), but does not modify WHAT is attended to (V stays pure). This separation means the constellation routes attention to geometrically relevant positions without contaminating the value signal with geometric artifacts.
The FFN within this stream also receives FiLM conditioning between its layers, allowing the nonlinearity to be geometry-aware. The constellation context modulates the intermediate representation after the first linear layer and before the second, conditioning the activation pattern on geometric position.
FiLM layers are identity-initialized (γ=1, β=0), meaning Stream B starts as a copy of what Stream A would produce. The geometric routing emerges during training as the FiLM parameters learn to diverge from identity based on the constellation's observations.
GeometricAttention(d_model=256, n_heads=8, context_dim=64)
# Q = FiLM(W_q(x), geo_ctx) geometry routes queries
# K = FiLM(W_k(x), geo_ctx) geometry routes keys
# V = W_v(x) values stay pure
# FFN: Linear → FiLM(·, geo_ctx) → Linear
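A single-head NumPy sketch of Stream B's routing principle (the FiLM parameter matrices Gq/Bq/Gk/Bk are illustrative names, not the library API): FiLM modulates Q and K, V stays pure, and with zeroed FiLM weights the stream reduces exactly to plain attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def geometric_attention(x, ctx, Wq, Wk, Wv, Gq, Bq, Gk, Bk):
    """Single-head sketch of Stream B: FiLM on Q and K, V left pure.

    x: (L, D) tokens, ctx: (L, C) geometric context.
    Gq/Bq/Gk/Bk: (C, d) maps from context to per-feature gamma/beta.
    """
    q = (1.0 + ctx @ Gq) * (x @ Wq) + ctx @ Bq   # geometry routes queries
    k = (1.0 + ctx @ Gk) * (x @ Wk) + ctx @ Bk   # geometry routes keys
    v = x @ Wv                                   # values stay pure
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v
```

With Gq = Bq = Gk = Bk = 0 this is ordinary scaled dot-product attention, which is the identity-init behavior described below.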
CayleyOrthogonal
Guaranteed rotation in SO(d) via the Cayley map. Aligns Stream B's basis to Stream A's basis before composition.
The two attention streams operate in the same vector space but develop different internal coordinate systems during training. Without alignment, their outputs are not directly comparable; subtracting them (the disagreement arm) would mix basis artifacts with genuine disagreement. The Cayley rotation ensures the comparison happens in a shared coordinate frame.
The Cayley map constructs the rotation from a skew-symmetric matrix A:
Q = (I - A)(I + A)^(-1)
This guarantees det(Q) = 1 always: it is a proper rotation, never a reflection, never a scaling. The skew-symmetric constraint means the learned parameters (the upper triangle of A) can take any value and the resulting Q is always a valid rotation matrix. There are no failure modes, no gradient clipping needed, no projection steps.
The rotation is input-independent: it depends only on learned parameters, not on the batch. During eval, the rotation matrix is cached since parameters don't change between forward passes.
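The Cayley map's guarantees can be verified numerically. A minimal NumPy sketch (free parameters fill the upper triangle, exactly as described above; function name is illustrative):

```python
import numpy as np

def cayley(params, d):
    """Cayley map: free upper-triangle params -> skew-symmetric A
    -> Q = (I - A) @ inv(I + A), always a rotation in SO(d).

    I + A is always invertible because a skew-symmetric A has purely
    imaginary eigenvalues, so no eigenvalue of I + A can be zero.
    """
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = params
    A = A - A.T                           # enforce A^T = -A
    I = np.eye(d)
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
d = 6
Q = cayley(rng.normal(size=d * (d - 1) // 2), d)
# Q is orthogonal (Q @ Q.T = I) with det(Q) = +1, for any parameter values.
```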
QuaternionCompose
Four-arm Hamilton product composition. The structural regularizer that prevents independent memorization of the two streams.
The four arms are:
- w (content): What does standard attention think?
- i (aligned geometric): What does geometric attention think, in the content basis?
- j (disagreement): content − aligned_geometric. Where do they diverge?
- k (agreement): content ⊙ aligned_geometric. Where do they converge?
Each arm is projected to a quaternion component, normalized, and multiplied against a learned rotation quaternion via the Hamilton product. The Hamilton product is non-commutative (q₁q₂ ≠ q₂q₁), which means the four arms cannot be processed independently. The w-arm's contribution to the output depends on what the i-arm produced, which depends on the j-arm, and so on. This algebraic coupling is the regularization mechanism.
In the Ryan Spearman protein variant experiments, this coupling was the difference between 0.245 (E3 baseline, pure MHA, worst generalizer) and 0.309 (Procrustes with matched quaternion experts, best generalizer across 84 unseen proteins). MHA memorizes without constraint. The Hamilton product forces complementary representations.
The disagreement arm (j) empirically carries the most transferable signal. It represents what the geometric observation sees that content attention misses: the structural surprise. In the protein experiments, this surprise signal generalized to unseen proteins better than either stream alone.
The composition is fully vectorized: a single batched Hamilton product over all quaternion dimensions simultaneously, with no Python loops.
QuaternionCompose(input_dim=256, quat_dim=32)
# 4 projections: Linear(256 β 32) each
# Batched Hamilton product: (N, 4, 32) × (N, 4, 32) → (N, 4, 32)
# Output: (N, 128) flattened
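The batched Hamilton product itself is a few stacked element-wise expressions. A NumPy sketch over (N, 4, quat_dim) arrays, matching the shapes quoted above:

```python
import numpy as np

def hamilton(q, r):
    """Batched Hamilton product over (N, 4, quat_dim) arrays.

    Component 0 is w (real part), components 1..3 are i, j, k.
    All products are element-wise per quaternion dimension, so the
    whole batch is computed with no Python loops.
    """
    w1, x1, y1, z1 = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    w2, x2, y2, z2 = r[:, 0], r[:, 1], r[:, 2], r[:, 3]
    return np.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,   # w
        w1*x2 + x1*w2 + y1*z2 - z1*y2,   # i
        w1*y2 - x1*z2 + y1*w2 + z1*x2,   # j
        w1*z2 + x1*y2 - y1*x2 + z1*w2,   # k
    ], axis=1)
```

Non-commutativity is easy to confirm: swapping the operands changes the result, which is exactly the arm-coupling property the text relies on.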
FiLMLayer
Feature-wise Linear Modulation. Produces γ and β from a conditioning vector and applies γ ⊙ x + β element-wise.
Every FiLM layer in the architecture is identity-initialized: γ starts at 1, β starts at 0. This means the geometric conditioning has zero effect at initialization: the model begins as a standard transformer and learns to incorporate geometric signal as training progresses. This is critical for stability: the constellation's initial random anchor positions produce meaningless observations, and identity-init prevents those meaningless observations from corrupting the attention patterns during early training.
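The identity-init property is mechanical: parameterize γ as 1 plus a learned offset, start all weights at zero, and FiLM is exactly the identity. A minimal sketch (weight names are illustrative):

```python
import numpy as np

def film(x, ctx, Wg, bg, Wb, bb):
    """FiLM with identity initialization.

    gamma = 1 + ctx @ Wg + bg, beta = ctx @ Wb + bb.
    With all weights and biases at zero, gamma = 1 and beta = 0,
    so film(x, ...) == x exactly at the start of training.
    """
    gamma = 1.0 + ctx @ Wg + bg
    beta = ctx @ Wb + bb
    return gamma * x + beta
```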
Gated Residual
The layer output is a learned gate between the input and the decoded quaternion composition:
g = σ(Linear([x; decoded]))
x_out = g ⊙ decoded + (1 - g) ⊙ x
The gate allows the model to selectively ignore the geometric transformer's contribution at positions where it is not helpful. Early in training, the gate can stay near 0.5 (pass-through), and as the constellation and quaternion learn useful representations, the gate opens to incorporate them.
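The gate equations above amount to a sigmoid over the concatenated input and decoded signal. A NumPy sketch (gate weights are illustrative); note that with zeroed gate parameters g = 0.5 everywhere, the even-blend starting point described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual(x, decoded, Wg, bg):
    """g = sigmoid(Linear([x; decoded])); out = g*decoded + (1-g)*x.

    x, decoded: (N, D). Wg: (2D, D), bg: (D,). With Wg = 0 and bg = 0,
    g = 0.5 and the output is an even blend of input and decoded signal.
    """
    g = sigmoid(np.concatenate([x, decoded], axis=-1) @ Wg + bg)
    return g * decoded + (1.0 - g) * x
```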
Cross-Layer Cayley Rotation
Between consecutive layers, an additional CayleyOrthogonal rotation aligns the output basis of layer i to the expected input basis of layer i+1. Each layer's constellation develops its own reference frame, and these frames are generally not aligned. The cross-layer rotation is the learned coordinate transform between frames.
Full Geometric State
Each layer returns a complete geo_state dictionary with 11 fields. Nothing is compressed and discarded: task heads, analysis tools, and downstream layers have access to the full geometric residual at every depth.
| Field | Shape | Description |
|---|---|---|
| embedding | (B, L, manifold_dim) | Per-position point on S^(d-1) |
| geo_ctx | (B, L, context_dim) | Compressed FiLM conditioning vector |
| triangulation | (B, L, A) | Cosine distances to all anchors |
| cos_to_anchors | (B, L, A) | Raw cosine similarities |
| assignment | (B, L, A) | Soft assignment via temperature softmax |
| nearest | (B, L) | Nearest anchor index per position |
| patchwork | (B, L, pw_dim) | Compartment interpretation features |
| bridge | (B, L, A) | Patchwork's estimate of the assignment |
| content | (B, L, D) | Stream A output |
| geometric | (B, L, D) | Stream B output (pre-rotation) |
| composed | (B, L, 4×quat_dim) | Raw quaternion composition |
Design Rationale
Why This Format Exists
The geometric transformer exists as a prototyping structure: a stable chassis for testing geometric deep learning components in the context of attention-based sequence processing.
Constructing models from geometric primitives (constellations, patchwork, CM gating, SVD observation, quaternion algebra, Cayley rotation) introduces many moving parts, many potential faults, and many mechanisms that can interact in unexpected ways. A constellation with poorly initialized anchors produces degenerate triangulations. A FiLM layer with bad conditioning collapses attention to uniform. A quaternion composition with unaligned inputs produces noise. Any of these failures is silent: the model still trains, just poorly.
The transformer structure provides kinship with the most extensively validated architecture in deep learning. By embedding geometric components into a framework that follows transformer conventions (residual connections, layer normalization, attention patterns, positional embeddings), we inherit the stability properties that make transformers trainable. The gated residual means any geometric component can fail gracefully by producing zero useful signal, and the model degrades to a standard transformer rather than collapsing entirely.
This stability is what makes the format useful for prototyping. Components can be swapped, added, or removed without restructuring the training pipeline. The constellation can be replaced with a different association mechanism. The quaternion composition can be replaced with a different fusion strategy. The Cayley alignment can be replaced with Procrustes or whitening. Each substitution is localized because the interface contracts are simple: tensors in, tensors out, geo_state dict for diagnostics.
Experiments Leading to This Design
The architecture consolidates findings from several lines of investigation:
Ryan Spearman Variant Effect Prediction. Prediction heads on frozen ESM-2 (650M) using the GeoLIP observer pipeline. The GeoQuaternionHead achieved ρ = 0.309 mean Spearman correlation across 84 unseen ProteinGym assays, with the matched Procrustes alignment winning 76 of 84 head-to-head comparisons. Key findings that shaped this transformer: FiLM on individual arms before composition transfers; FiLM after composition hurts; quaternion algebra acts as a structural regularizer that prevents MHA memorization; matched-strength experts are essential for Procrustes alignment.
Procrustes Analysis of 17 Models. Profiling T5, BERT, CLIP, DINOv2, SD/SDXL UNets, and VAEs revealed a universal cross-modal QK eigenvalue lock at 0.500 and showed that 70-76% of VAE weights are alignable via Procrustes rotation. This validated Cayley orthogonal rotation as the alignment mechanism: the structure is there, it just needs to be found.
GeoLIP Constellation Framework. Batched CV computation achieving a 141× speedup, and confirmation of the CV pentachoron band at 0.20-0.23 as a universal attractor across all tested architectures and modalities. The constellation's anchor frame on S^(d-1) with repulsion initialization produces stable reference frames that do not collapse during training.
ConvGeometricBackbone. Multi-depth geometric observation with SVDObserver, CM-gated anchor selection, and patchwork interpretation at every conv stage. Established the pattern of: observe structure → associate with constellation → curate via gating → condition the backbone. The geometric transformer adapts this pattern from spatial feature maps to token sequences.
Initial Validation: CIFAR-10
First test of the prototype. ConvSVD frontend (conv 3→64→64, SVDObserver rank-16, stride-4 patch projection) feeding 64+1 tokens into a 4-layer geometric transformer (d=256, 8 heads, 16 anchors).
85.23% test accuracy, 8.8M parameters, 35 minutes on a single GPU.
Geometric state inspection at convergence:
Stream agreement (cosine similarity between content and geometric streams):
Layer 0: 0.491
Layer 1: 0.471
Layer 2: 0.449
Layer 3: 0.409
Anchor entropy (uniform = 2.77):
Layer 0: 2.068
Layer 1: 2.266
Layer 2: 1.893
Layer 3: 1.924
The two streams diverge deeper in the network: agreement drops from 0.49 to 0.41, meaning the constellation is routing geometric attention to fundamentally different positions than content attention. The anchors are utilized with moderate specialization (entropy 1.9-2.3 out of a maximum 2.77), indicating the constellation found meaningful structure without collapsing.
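The entropy numbers above can be reproduced from the geo_state assignments. A small sketch of the metric (the function name is illustrative): the quoted maximum of 2.77 is ln(16) for a uniform distribution over 16 anchors.

```python
import numpy as np

def anchor_entropy(assignment):
    """Mean entropy of the per-position soft assignment over anchors.

    assignment: (N, A) rows of anchor probabilities. A uniform
    distribution over A = 16 anchors gives ln(16) ~= 2.7726, the
    maximum quoted in the results; lower values indicate specialization.
    """
    p = assignment / assignment.sum(-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(-1).mean())
```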
Test accuracy continued improving from 83.5% to 85.2% after training accuracy exceeded 95%, consistent with the quaternion regularization effect observed in the protein experiments.
Usage
from geolip_core.pipeline.components.geometric_transformer import (
GeometricTransformer, geo_transformer_small, geo_transformer_esm2
)
# Small prototype
model = geo_transformer_small('my_model', n_layers=4)
out = model(tokens) # (B, L, D) → (B, L, D)
# With full geometric state
out, geo_states = model(tokens, return_geo_state=True)
# geo_states: list of dicts, one per layer, 11 fields each
# ESM-2 scale (d=1280, 16 heads)
model = geo_transformer_esm2('esm2_geo', n_layers=6)
# Vision (d=384, for scatter/SVD patch tokens)
from geolip_core.pipeline.components.geometric_transformer import geo_transformer_vision
model = geo_transformer_vision('vit_geo', n_layers=4)
Router Structure
The model follows the geofractal BaseTower composition pattern. Components are accessed via bracket notation and can be swapped at runtime.
model['layer_0']['observer'] # ConstellationObserver for layer 0
model['layer_0']['projection'] # ManifoldProjection for layer 0
model['layer_0']['content'] # ContentAttention (Stream A)
model['layer_0']['geometric'] # GeometricAttention (Stream B)
model['layer_0']['rotation'] # CayleyOrthogonal (B → A alignment)
model['layer_0']['compose'] # QuaternionCompose
model['cross_rot_0'] # Cross-layer Cayley rotation (0 → 1)
model['final_norm'] # Output LayerNorm
Status
This is v1, a working prototype that validates the core hypothesis: constellation-routed dual-stream attention with quaternion composition produces models that learn meaningful geometric structure and resist dead overfitting.
Planned expansions:
- Autotuned anchor count, manifold dimension, and compartment configuration based on task
- CM determinant gating (AnchorGate) integrated into the per-position constellation observation
- SVD observation within the transformer layers (not just the Input stage)
- Per-position observer tap features for local geometric conditioning
- Integration with the full GeoLIP loss landscape (spread, bridge, contrastive)
- Compile/fuse optimization via geofractal's CompileRouter
Author: AbstractPhil
License: Apache 2.0
Framework: PyTorch, geolip-core, geofractal