Geometric Memory: Context Extension and Cross-Model Alignment Through Pentachoron Regularization
AbstractPhil, March 2026
Abstract
We present three systems built on a single architectural blueprint: frozen expert encoder + geometric memory bank + InfoNCE alignment + pentachoron CV regularization. GEOLIP-BERT-8192 extends BERT-large from 512 to 8,192 tokens via distillation from ModernBERT-large and Longformer-large, achieving m_acc=0.927 with CV=0.200. GEOLIP-CLIP-ctx576 extends CLIP-ViT-L/14 from 77 to 576 tokens via distillation from ModernBERT-large, achieving m_acc=0.945 (val=0.944) with CV=0.162. Both systems require no teacher at inference. We also present a series of controlled experiments on dual-ViT cross-topology alignment demonstrating that Procrustes alignment functions as a geometric regularizer (not an alignment force), that InfoNCE is the necessary force for cross-model alignment, and that geometric regularization on the bottleneck representation prevents projector shortcut collapse. Additionally, we report negative results on BERT compression via iterative SVD cascade, establishing a 60% function retention ceiling for independent per-matrix projection on compositional transformers. All systems converge to pentachoron CV values in the 0.16–0.20 range, consistent with the band first reported in our geometric terrain analysis of 17 architectures.
1. Introduction
Our prior work (Geometric Fusion: Cross-Modal Alignment Through Shared Pentachoron Geometry) established three findings: (1) the pentachoron coefficient of variation (CV) converges to a universal band of 0.20–0.23 across 17 independently trained models spanning 5 architecture families, (2) BERT-large and DINOv2-large weight matrices are 61% Procrustes-alignable despite having no shared training signal, and (3) a single-layer fusion transformer achieves R@1=1.0000 on cross-modal retrieval by exploiting this shared geometric structure.
This work extends those findings in four directions:
Context extension via geometric memory. We demonstrate that the same geometric principles enable context window extension: a memory bank with depth-profile anchors, regularized by pentachoron CV, can extend a frozen encoder's effective context by 8–16× while preserving alignment with a long-context teacher.
The blueprint. We formalize the architecture pattern that produced three production systems: frozen expert teacher provides the target, Procrustes initialization warm-starts the projector, InfoNCE provides the alignment force, and pentachoron CV on the bottleneck representation prevents collapse.
Procrustes as regularizer, not force. Controlled dual-ViT experiments demonstrate that Procrustes alignment loss shapes embedding geometry but cannot create cross-model alignment by itself. InfoNCE is the necessary and sufficient force. Procrustes contributes as an active geometric regularizer during training.
Compression limits. Iterative SVD cascade compression of BERT-base reveals a 60% function retention ceiling when projecting weight matrices independently, despite perfect preservation of per-layer pentachoron CV. The missing 40% is inter-layer compositional structure that independent per-matrix projection fails to capture in our current tests.
2. Geometric Memory Architecture
2.1 The Context Extension Problem
BERT-large has a 512-token context window. CLIP-ViT-L/14 has a 77-token context window. Long-context models exist (ModernBERT at 8,192 tokens, Longformer at 4,096), but replacing a frozen encoder breaks downstream compatibility. The goal: extend context while preserving the original embedding space.
2.2 Architecture
The memory system wraps a frozen encoder without modifying its internals:
```
Document (N tokens, N >> encoder context)
│
├── Split into segments (overlapping, sized to encoder window)
│
├── For each segment:
│   ├── Frozen encoder forward → hidden states at multiple layers
│   ├── Multi-layer fusion (learned weighted sum of extracted layers)
│   ├── Memory tokens cross-attend to fused hidden states
│   ├── Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
│   ├── Anchor stored in geometric memory bank
│   └── GRU gate updates rolling memory state
│
└── Final output: encoder-compatible embedding (same dimensionality)
```
Depth-profile anchors. Each segment produces an anchor that encodes not what the encoder output, but how the encoder processed the segment across all depths. For BERT, this is 8 hidden layers concatenated (8×1024=8192 dims) compressed to 1024. For CLIP, 6 layers (6×768=4608) compressed to 768. Two segments with identical final outputs but different processing trajectories produce different anchors.
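As an illustration, here is a minimal numpy sketch of the anchor computation described above. `depth_profile_anchor` and the random compressor `W` are hypothetical stand-ins for the learned compressor; only the shapes follow the BERT configuration (8 layers × 1024 dims → 1024):

```python
import numpy as np

def depth_profile_anchor(cls_per_layer, W):
    """Compress per-layer [CLS] states into one L2-normalized anchor.
    cls_per_layer: (num_layers, hidden), e.g. 8 x 1024 for BERT-large.
    W: (num_layers*hidden, hidden), a random stand-in for the learned compressor."""
    profile = cls_per_layer.reshape(-1)        # 8*1024 = 8192 dims
    anchor = profile @ W                       # compress to 1024 dims
    return anchor / np.linalg.norm(anchor)     # L2-normalize

rng = np.random.default_rng(0)
cls_states = rng.standard_normal((8, 1024))    # one [CLS] per extracted layer
W = rng.standard_normal((8 * 1024, 1024)) / np.sqrt(8 * 1024)
anchor = depth_profile_anchor(cls_states, W)   # shape (1024,), unit norm
```

Because the anchor is built from the whole depth profile, two segments with identical final-layer outputs but different intermediate trajectories yield different anchors.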
Bank cross-attention. Memory tokens (8 for CLIP, 16 for BERT) query the bank of past anchors via 2-layer cross-attention. This enables selective retrieval: which past segments are relevant for the current context.
GRU gate. Controls the update ratio between old memory state and new enrichment. Prevents catastrophic overwriting of past context.
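The bank retrieval and gated update can be sketched together in numpy (single-head, single-layer; the system uses 2 cross-attention layers, and all names and weight shapes here are illustrative assumptions, not the shipped implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, bank_size = 64, 16, 128   # BERT config: 16 memory tokens, 128-anchor bank

def cross_attend(queries, bank, Wq, Wk, Wv):
    """Memory tokens (queries) attend over stored anchors (keys/values)."""
    Q, K, V = queries @ Wq, bank @ Wk, bank @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over bank entries
    return w @ V                                # selective retrieval of past segments

def gru_gate(m_old, enrich, Wz, bz):
    """Gated convex update: z decides how much old memory survives,
    preventing catastrophic overwriting of past context."""
    z = 1.0 / (1.0 + np.exp(-(np.concatenate([m_old, enrich], -1) @ Wz + bz)))
    return z * m_old + (1.0 - z) * enrich

mem = rng.standard_normal((n_mem, d))
bank = rng.standard_normal((bank_size, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Wz = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

enriched = cross_attend(mem, bank, Wq, Wk, Wv)
mem_new = gru_gate(mem, enriched, Wz, np.zeros(d))   # rolling memory state
```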
2.3 Training
Both systems use frozen long-context teachers that see the full document in one pass:
Loss function:
```
L = InfoNCE(proj(student_cls), teacher_cls)      × 1.0
  + Procrustes_SVD(student_cls, teacher_cls)     × 0.3
  + |pentachoron_CV(bank_anchors) − 0.20|        × 0.05
```
The pentachoron CV loss is computed on the bank anchors specifically — the bottleneck representation between segments. This is the critical design choice (see Section 4).
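A minimal numpy sketch of the first and third terms (the Procrustes term is omitted for brevity). Computing CV over pairwise anchor distances is an assumption standing in for the full pentachoron computation, and all function names are illustrative:

```python
import numpy as np
from itertools import combinations

def info_nce(student_proj, teacher, tau=0.07):
    """Contrastive alignment force: projected student row i must pick out
    teacher row i against all other rows in the batch."""
    s = student_proj / np.linalg.norm(student_proj, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_p)))       # cross-entropy, targets = diagonal

def cv_loss(bank_anchors, target=0.20):
    """|CV - 0.20| on the bank anchors. CV over pairwise distances is a
    stand-in here; the paper computes it on pentachoron (4-simplex) cells."""
    d = [np.linalg.norm(bank_anchors[i] - bank_anchors[j])
         for i, j in combinations(range(len(bank_anchors)), 2)]
    return abs(float(np.std(d) / np.mean(d)) - target)

rng = np.random.default_rng(0)
student = rng.standard_normal((32, 768))
teacher = rng.standard_normal((32, 768))
anchors = rng.standard_normal((7, 768))          # 7 segments -> CV is active
total = 1.0 * info_nce(student, teacher) + 0.05 * cv_loss(anchors)
```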
3. Results
3.1 GEOLIP-BERT-8192
Configuration:
| Component | Detail |
|---|---|
| Student | BERT-large (frozen, 512 ctx, 1024-dim, 24 layers) |
| Teachers | ModernBERT-large (8192 ctx), Longformer-large (4096 ctx) |
| Memory | 49M trainable params, 16 memory tokens, 128-anchor bank |
| Data | WikiText-103 |
| Segments | 480 tokens per segment, 64-token overlap, 16 max segments |
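Given this configuration, the segmentation step might look like the following sketch (`split_segments` is a hypothetical helper, not the repository's code; the stride is segment length minus overlap):

```python
def split_segments(tokens, seg_len=480, overlap=64, max_segments=16):
    """Overlapping segmentation sized to the frozen encoder's window.
    Defaults mirror the GEOLIP-BERT-8192 config: 480-token segments,
    64-token overlap, at most 16 segments."""
    stride = seg_len - overlap
    starts = range(0, max(1, len(tokens) - overlap), stride)
    return [tokens[i:i + seg_len] for i in starts][:max_segments]

segs = split_segments(list(range(2000)))
# Consecutive segments share their 64-token overlap region.
```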
Procrustes pre-alignment:
| Pair | cos before | cos after |
|---|---|---|
| BERT → ModernBERT | 0.003 | 0.489 |
| BERT → Longformer | -0.001 | 0.521 |
Training results (1 epoch):
| Metric | Train | Val |
|---|---|---|
| ModernBERT match accuracy | 0.927 | 0.812 |
| Longformer match accuracy | 0.742 | 0.656 |
| Pentachoron CV | 0.200 | 0.175 |
Training CV converged to exactly 0.200, squarely inside the universal band. No teacher is required at inference: the memory system has internalized what ModernBERT computes with full 8,192-token attention.
Repository: AbstractPhil/geolip-bert-8192
3.2 GEOLIP-CLIP-ctx576
Configuration:
| Component | Detail |
|---|---|
| Student | CLIP-ViT-L/14 text encoder (frozen, 77 ctx, 768-dim, 12 layers) |
| Teacher | ModernBERT-large (4096 ctx, 1024-dim) |
| Memory | 34.5M trainable params, 8 memory tokens, 64-anchor bank |
| Data | CC12M LLaVA-next captions (50K train, 2K val, mean 96 tokens) |
| Segments | 18 tokens per segment, 4-token overlap, 32 max segments |
Procrustes pre-alignment:
| Pair | cos before | cos after |
|---|---|---|
| CLIP → ModernBERT | 0.001 | 0.816 |
This is the strongest Procrustes alignment observed across all experiments — CLIP's contrastive pre-training produces representations 81.6% rotationally alignable with ModernBERT despite entirely different architectures, tokenizers, and training objectives.
Training results (10 epochs, fully instrumented):
| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
|---|---|---|---|---|
| 1 | 0.628 | 0.835 | — | 0.185 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |
Val m_acc5=1.000: the correct ModernBERT match is in the top 5 every time across 2,000 held-out captions. Train/val gap of 0.001 indicates no overfitting.
The segment size discovery. Initial experiments with 55-token segments produced only 2 segments per caption. The pentachoron CV loss requires 5+ anchors to compute. With 2 anchors, CV was dead at 0.029 and training plateaued at m_acc=0.670 via projector shortcut (Section 4). Reducing to 18-token segments produced 7 segments per caption, activating the CV regularization. Result: m_acc jumped from 0.670 to 0.945.
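The segment arithmetic behind this discovery can be checked directly. The sketch below assumes stride = segment length minus overlap, with a final partial segment; the 4-token overlap for the 55-token run is an assumption (only the 18-token configuration's overlap is documented):

```python
import math

def n_segments(n_tokens, seg_len, overlap):
    """Count overlapping segments covering a caption of n_tokens."""
    stride = seg_len - overlap
    if n_tokens <= seg_len:
        return 1
    return 1 + math.ceil((n_tokens - seg_len) / stride)

# Mean CC12M caption length: 96 tokens.
n_segments(96, 18, 4)   # -> 7: enough anchors to activate the pentachoron CV loss
n_segments(96, 55, 4)   # -> 2: below the 5-anchor minimum, CV loss goes dead
```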
Repository: AbstractPhil/geolip-clip-vit-large-patch14-ctx576
Usage:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True,
)
model.to("cuda").eval()

embedding = model.encode("A long detailed caption that exceeds 77 tokens...")
# Returns: (768,), a drop-in for any CLIP-L pipeline
```
4. The Role of Geometric Regularization
4.1 Dual-ViT Experiments
To isolate the contribution of each loss component, we trained pairs of ViT encoders with different topological lensing (non-overlapping patch embedding vs. overlapping convolutional stem) on CIFAR-10, varying the alignment mechanism:
| Version | Alignment method | R@1 (proj) | R@1 (raw CLS) | CV |
|---|---|---|---|---|
| v1 | Procrustes loss | 0.000 | 0.000 | 0.25 |
| v2 | Procrustes + simplex anchor | 0.002 | 0.000 | 0.19 |
| v3 | InfoNCE + simplex | 0.999 | 0.000 | 0.19 |
| v4 | InfoNCE + perturbation (σ=0.5) | 1.000 | 0.000 | 0.17 |
Finding 1: Procrustes is a regularizer, not a force. v1 used Procrustes alignment as the training loss. P_cos remained at 0.094 for 30 epochs — no alignment was created. Procrustes measures alignability; it does not create it. However, the experiments that included Procrustes produced tighter CV (0.19 vs 0.25 for independent training). It shapes geometry without creating alignment.
Finding 2: InfoNCE is the necessary and sufficient force. Replacing the Procrustes loss with InfoNCE in v3 produced R@1=0.999 in the shared projection space, from 0.000 to near-perfect via a single change to the training loss.
Finding 3: Alignment lives where you put it. R@1 in the raw CLS space was 0.000 across all versions, including v4 with heavy noise injection (σ=0.5, 40% dropout) before the projection head. The two-layer projector absorbed all perturbation and maintained perfect projection-space alignment without any alignment propagating into the backbone encoders. Different topological lensings produce irreducibly different coordinate systems.
Finding 4: The simplex factory provides the scaffold. Regular pentachora (4-simplices) from the SimplexFactory served as geometric reference structures. All edges equal, maximum symmetry, guaranteed non-degenerate. The CV loss measured volume ratios relative to this reference, providing more structural information than raw CV alone.
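One standard construction of such a reference (a sketch only; the actual SimplexFactory API is not shown here): re-centering the five standard basis vectors of R^5 yields a regular pentachoron whose ten edges are all equal, giving a zero-CV geometric scaffold.

```python
import numpy as np
from itertools import combinations

# Regular 4-simplex (pentachoron): the 5 standard basis vectors of R^5,
# re-centered on their mean, give 5 vertices with all 10 pairwise
# distances equal to sqrt(2).
V = np.eye(5) - 1.0 / 5.0
edges = np.array([np.linalg.norm(V[i] - V[j])
                  for i, j in combinations(range(5), 2)])
edge_cv = edges.std() / edges.mean()   # 0 for a perfectly regular reference
```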
4.2 The Projector Shortcut Problem
The CLIP context extension experiments revealed a critical failure mode:
| Segment size | Segments per caption | CV active? | m_acc | Mechanism |
|---|---|---|---|---|
| 55 tokens | 2 | No (0.029) | 0.670 (plateau) | Projector shortcut |
| 18 tokens | 7 | Yes (0.19) | 0.945 (climbing) | Bank accumulation |
With 55-token segments, most captions produced only 2 segments — below the 5-anchor minimum for pentachoron computation. The CV loss returned ~0, providing no gradient. Without geometric constraint on the bank anchors, the teacher projector learned to map whatever the memory system produced directly to ModernBERT's space. The memory bank was decorative.
With 18-token segments, 7 segments per caption activated the CV loss. The bank anchors were forced to maintain geometric structure, which propagated upstream through the cross-attention into the memory accumulation mechanism. The projector received a geometrically shaped representation and could not bypass the bank.
Principle: Geometric regularization must be applied at the bottleneck — the representation that carries information between components. Regularizing only the output allows upstream shortcuts.
5. Cascade Compression
5.1 Iterative SVD Cascade
Separately from the memory systems, we investigated whether the pentachoron CV could guide model compression. The hypothesis: if a trained model's geometric structure (CV≈0.20) can be SVD-projected to smaller dimensions while preserving the CV, the function should transfer.
Toy MLP results (positive):
| Cascade | Dim | Accuracy | Compression | Heal epochs |
|---|---|---|---|---|
| Root model | 256 | 87.4% | 1× | — |
| 9-step cascade (256→64) | 64 | 84.6% | 11.1× | 8 total |
| Direct jump (256→64) | 64 | 29.6% | 11.1× | 1 |
| Scratch (64-dim, 1 epoch) | 64 | 38.1% | 11.1× | 1 |
At ratio r=0.95 (27 steps), the cascade exceeded the root model at 89.2% accuracy — acting as a regularizer that strips noise while retaining signal.
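The schedule arithmetic is easy to verify (a sketch; the per-step rounding rule is an assumption): a fixed per-step ratio of 0.25^(1/9) ≈ 0.857 reaches 64 dims in 9 steps, while the gentler r=0.95 takes 27.

```python
def cascade_steps(d_start, d_target, r):
    """Count cascade steps when the width shrinks by ratio r each step,
    clamped at the target dimension."""
    d, steps = d_start, 0
    while d > d_target:
        d = max(d_target, round(d * r))
        steps += 1
    return steps

cascade_steps(256, 64, 0.25 ** (1 / 9))  # the 9-step cascade
cascade_steps(256, 64, 0.95)             # r=0.95: 27 small steps
```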
BERT-base results (negative):
| Method | Top-1 retained | CV | Notes |
|---|---|---|---|
| Root BERT-base (768-dim) | 100% | 0.22 | Baseline |
| Independent SVD + MLM healing | 61.5% | 0.22–0.24 | CV preserved, composition broken |
| + teacher global projector | 62.5% | 0.22 | +1% from distillation |
| + per-layer projectors | 60.6% | 0.22 | All "fixes" negative |
| Shared basis (global rotation) | ~7% | 0.10–0.15 | LayerNorm destroyed |
| Two-level Procrustes | ~7% | 0.10–0.12 | Inter-layer mismatch |
5.2 The 60% Ceiling
The ceiling at 60–62% is invariant to healing method, loss function, projector architecture, and training budget. Five different approaches hit the same wall.
Diagnosis: CV remained at 0.22 throughout — per-layer geometry was perfectly preserved. The missing 40% is inter-layer compositional structure: how layer i's output feeds into layer i+1 through LayerNorm, residual connections, and attention. Independent SVD rotates each matrix into a different coordinate frame. 72 independent rotations break the compositional chain.
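A small numpy demonstration of this diagnosis: rotating each weight matrix into an independently chosen frame leaves its singular spectrum (the per-matrix geometry) untouched while destroying the composed function. The two-layer ReLU network is an illustrative stand-in for a transformer block, not BERT itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d)
relu = lambda v: np.maximum(v, 0.0)

y_ref = W2 @ relu(W1 @ x)                       # original composed function

# Rotate each matrix into its own, independently chosen orthogonal frame:
R1 = np.linalg.qr(rng.standard_normal((d, d)))[0]
R2 = np.linalg.qr(rng.standard_normal((d, d)))[0]
W1_rot, W2_rot = R1 @ W1, W2 @ R2.T
y_rot = W2_rot @ relu(W1_rot @ x)

# Per-matrix geometry (singular spectrum) is preserved by rotation...
spectra_match = np.allclose(np.linalg.svd(W1_rot, compute_uv=False),
                            np.linalg.svd(W1, compute_uv=False))
# ...but the composed function breaks because R2 != R1: the frames mismatch
# exactly where layer 1's output feeds layer 2's input.
```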
Key finding: Homogeneous operations (same SVD formula applied uniformly to all matrices) preserved more function than heterogeneous "corrections" (different methods for FFN vs attention, rotated biases, L1 pruning). An external review suggested five theoretically justified fixes; applying all five reduced retention from 61.5% to 60.6%. The "bugs" were load-bearing — two wrongs partially canceling in a co-adapted system.
5.3 Potential Directions
Hypothesis: A series of geometric-basin experiments, each tuned to capture independently represented differences between layers, could autoregressively reconcile those differences into a usably similar representation. This requires a full setup and multiple experiments; preliminary evidence suggests it is possible, but not yet enough to justify committing a week to a single project.
Possible process: A sequence of finetunes stepping from the upper to the lower dimensional floor, testing a multitude of adjacent techniques along the way (DARE, attention scaling, attention-head resizing) while carefully anchoring the alignment of every component to the CV deviation throughout the structure.
6. The Blueprint
Three production systems, one pattern:
| Component | GEOLIP-Bertenstein | GEOLIP-BERT-8192 | GEOLIP-CLIP-ctx576 |
|---|---|---|---|
| Student | BERT-large (hub) | BERT-large (512 ctx) | CLIP-L text (77 ctx) |
| Teachers | DINOv2, Whisper, ESM-2, CodeBERT | ModernBERT, Longformer | ModernBERT |
| Alignment force | InfoNCE | InfoNCE | InfoNCE |
| Geometric regularizer | Pentachoron CV | Pentachoron CV | Pentachoron CV + Procrustes SVD |
| Projector init | Procrustes | Procrustes | Procrustes (cos 0.816) |
| Result | R@1 = 1.000 | m_acc = 0.927 | m_acc = 0.945 |
| CV | 0.20 | 0.200 | 0.162 |
The pattern:
- Frozen expert teacher provides a stable reference frame. The teacher sees the full input; the student sees it through a constrained window.
- Procrustes initialization warm-starts the projector from static alignment analysis. CLIP→ModernBERT achieved cos 0.816 — the strongest pre-alignment observed.
- InfoNCE is the alignment force. It demands per-sample nearest-neighbor matching in the shared space. Procrustes measures alignment; InfoNCE creates it.
- Pentachoron CV on the bottleneck prevents projector shortcut collapse. The bottleneck is the bank anchors — the representation that carries information between segments. Without CV regularization here, the projector absorbs all alignment work and the memory bank becomes decorative.
7. Procrustes Alignment Summary
Across all experiments, Procrustes alignment reveals the baseline compatibility between encoder pairs:
| Pair | cos before | cos after | Context |
|---|---|---|---|
| BERT → ModernBERT | 0.003 | 0.489 | GEOLIP-BERT-8192 |
| BERT → Longformer | -0.001 | 0.521 | GEOLIP-BERT-8192 |
| CLIP → ModernBERT | 0.001 | 0.816 | GEOLIP-CLIP-ctx576 |
| VAE SD1.5 → Flux.2 | -0.000 | 0.757 | Prior work |
| BERT → DINOv2 | — | 0.613 | Prior work |
CLIP's contrastive pre-training produces representations that are substantially more alignable with language models than BERT's MLM pre-training. This may reflect multimodal grounding: CLIP was trained to align text with visual concepts, which appears to produce a representation geometry closer to that of other language models.
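The cos-before/cos-after numbers above come from orthogonal Procrustes analysis. A minimal numpy sketch on synthetic data follows; the metric shown (mean row-wise cosine before and after applying the fitted rotation) is our reading of the procedure:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: rotation R minimizing ||X @ R - Y||_F,
    obtained from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mean_cosine(A, B):
    """Mean cosine similarity between corresponding rows."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(np.sum(A * B, axis=1)))

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
R_true = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # hidden rotation
Y = X @ R_true + 0.1 * rng.standard_normal((256, 64))     # rotated + noise

before = mean_cosine(X, Y)                 # near zero: frames are unrelated
R = procrustes_align(X, Y)
after = mean_cosine(X @ R, Y)              # high: the rotation is recovered
```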
8. Conclusion
The geometric structure of learned representations is not an artifact of specific architectures or training objectives. It is a convergent property of gradient-based optimization on sufficiently large data. The pentachoron CV band (0.16–0.23) appears in every system we have measured — 17 pretrained models, 3 fusion systems, and 2 context-extension systems — regardless of modality, task, or architecture.
This structure enables a practical blueprint for cross-model systems: freeze an expert, measure Procrustes compatibility, initialize projectors from the alignment, apply InfoNCE as the contrastive force, and regularize the bottleneck geometry with pentachoron CV. The blueprint has now produced three systems with near-perfect alignment metrics, each trained in under 2 hours on a single GPU.
The deeper finding is about what these systems reveal: the barrier to cross-model alignment is not data, compute, or architecture. The representations are already geometrically compatible. The barrier is knowing where to apply the force (InfoNCE, not Procrustes), where to apply the constraint (bottleneck anchors, not final output), and what to measure (pentachoron CV as the structural invariant).
Reproducibility
| System | Repository | Key files |
|---|---|---|
| Geometric terrain analysis (17 models) | AbstractPhil/procrustes-analysis | Profiling scripts, cached results |
| GEOLIP-Bertenstein (4-modal fusion) | AbstractPhil/geolip-bertenstein | Architecture, training, evaluation |
| GEOLIP-BERT-8192 (context extension) | AbstractPhil/geolip-bert-8192 | Architecture, weights, training scripts |
| GEOLIP-CLIP-ctx576 (CLIP extension) | AbstractPhil/geolip-clip-vit-large-patch14-ctx576 | AutoModel, weights, metrics.json, TensorBoard |
GEOLIP-CLIP-ctx576 includes full training metrics (metrics.json with per-step and per-epoch data), TensorBoard event files, and is loadable via AutoModel.from_pretrained with trust_remote_code=True.
All experiments were conducted on a single NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM).