Geometric Memory: Context Extension and Cross-Model Alignment Through Pentachoron Regularization

Community Article Published March 10, 2026

AbstractPhil · March 2026


Abstract

We present three systems built on a single architectural blueprint: frozen expert encoder + geometric memory bank + InfoNCE alignment + pentachoron CV regularization. GEOLIP-BERT-8192 extends BERT-large from 512 to 8,192 tokens via distillation from ModernBERT-large and Longformer-large, achieving m_acc=0.927 with CV=0.200. GEOLIP-CLIP-ctx576 extends CLIP-ViT-L/14 from 77 to 576 tokens via distillation from ModernBERT-large, achieving m_acc=0.945 (val=0.944) with CV=0.162. Neither system requires a teacher at inference. We also present a series of controlled experiments on dual-ViT cross-topology alignment demonstrating that Procrustes alignment functions as a geometric regularizer (not an alignment force), that InfoNCE is the necessary force for cross-model alignment, and that geometric regularization on the bottleneck representation prevents projector shortcut collapse. Additionally, we report negative results on BERT compression via iterative SVD cascade, establishing a 60% function retention ceiling for independent per-matrix projection on compositional transformers. All systems land in a pentachoron CV band (0.16–0.20) consistent with the universal band first reported in our geometric terrain analysis of 17 architectures.


1. Introduction

Our prior work (Geometric Fusion: Cross-Modal Alignment Through Shared Pentachoron Geometry) established three findings: (1) the pentachoron coefficient of variation (CV) converges to a universal band of 0.20–0.23 across 17 independently trained models spanning 5 architecture families, (2) BERT-large and DINOv2-large weight matrices are 61% Procrustes-alignable despite having no shared training signal, and (3) a single-layer fusion transformer achieves R@1=1.0000 on cross-modal retrieval by exploiting this shared geometric structure.

This work extends those findings in four directions:

  1. Context extension via geometric memory. We demonstrate that the same geometric principles enable context window extension: a memory bank with depth-profile anchors, regularized by pentachoron CV, can extend a frozen encoder's effective context by 8–16× while preserving alignment with a long-context teacher.

  2. The blueprint. We formalize the architecture pattern that produced three production systems: frozen expert teacher provides the target, Procrustes initialization warm-starts the projector, InfoNCE provides the alignment force, and pentachoron CV on the bottleneck representation prevents collapse.

  3. Procrustes as regularizer, not force. Controlled dual-ViT experiments demonstrate that Procrustes alignment loss shapes embedding geometry but cannot create cross-model alignment by itself. InfoNCE is the necessary and sufficient force. Procrustes contributes as an active geometric regularizer during training.

  4. Compression limits. Iterative SVD cascade compression of BERT-base reveals a 60% function retention ceiling when projecting weight matrices independently, despite perfect preservation of per-layer pentachoron CV. The missing 40% is inter-layer compositional structure that independent per-matrix projection cannot capture.


2. Geometric Memory Architecture

2.1 The Context Extension Problem

BERT-large has a 512-token context window. CLIP-ViT-L/14 has a 77-token context window. Long-context models exist (ModernBERT at 8,192 tokens, Longformer at 4,096), but replacing a frozen encoder breaks downstream compatibility. The goal: extend context while preserving the original embedding space.

2.2 Architecture

The memory system wraps a frozen encoder without modifying its internals:

```
Document (N tokens, N >> encoder context)
    │
    ├── Split into segments (overlapping, sized to encoder window)
    │
    ├── For each segment:
    │   ├── Frozen encoder forward → hidden states at multiple layers
    │   ├── Multi-layer fusion (learned weighted sum of extracted layers)
    │   ├── Memory tokens cross-attend to fused hidden states
    │   ├── Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
    │   ├── Anchor stored in geometric memory bank
    │   └── GRU gate updates rolling memory state
    │
    └── Final output: encoder-compatible embedding (same dimensionality)
```

Depth-profile anchors. Each segment produces an anchor that encodes not what the encoder output, but how the encoder processed the segment across all depths. For BERT, this is 8 hidden layers concatenated (8×1024=8192 dims) compressed to 1024. For CLIP, 6 layers (6×768=4608) compressed to 768. Two segments with identical final outputs but different processing trajectories produce different anchors.
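As a minimal PyTorch sketch, the depth-profile compressor described above can be read as follows. The module name, layer count, and dimensions are illustrative (they follow the BERT numbers in the text: 8 layers × 1024 dims → one 1024-dim anchor), not the released implementation:

```python
import torch
import torch.nn as nn

class DepthProfileCompressor(nn.Module):
    """Compress per-layer CLS vectors into one L2-normalized anchor.

    Hypothetical sketch: 8 extracted layers x 1024 dims -> 1024-dim anchor,
    matching the BERT configuration described in the text.
    """
    def __init__(self, num_layers=8, hidden_dim=1024, anchor_dim=1024):
        super().__init__()
        self.proj = nn.Linear(num_layers * hidden_dim, anchor_dim)

    def forward(self, cls_per_layer):
        # cls_per_layer: (batch, num_layers, hidden_dim) -- the processing
        # trajectory across depths, not just the final output
        flat = cls_per_layer.flatten(start_dim=1)       # (batch, 8192)
        anchor = self.proj(flat)                        # (batch, 1024)
        return nn.functional.normalize(anchor, dim=-1)  # unit norm

comp = DepthProfileCompressor()
anchor = comp(torch.randn(2, 8, 1024))  # shape (2, 1024), unit-norm rows
```

Because the input is the concatenation of all extracted depths, two segments with identical final-layer outputs but different intermediate trajectories map to different anchors, which is the property the text emphasizes.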

Bank cross-attention. Memory tokens (8 for CLIP, 16 for BERT) query the bank of past anchors via 2-layer cross-attention. This enables selective retrieval: which past segments are relevant for the current context.

GRU gate. Controls the update ratio between old memory state and new enrichment. Prevents catastrophic overwriting of past context.
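The bank read and gated write can be sketched together. This is a single-layer illustration under assumed shapes (the text describes 2-layer cross-attention); all names and sizes are hypothetical, not the released code:

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """Hypothetical sketch of bank cross-attention plus GRU-gated update.

    Memory tokens query past anchors; a GRU cell gates how much of the
    retrieved enrichment overwrites the rolling memory state.
    """
    def __init__(self, dim=768, num_tokens=8, num_heads=8):
        super().__init__()
        self.mem_tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.GRUCell(dim, dim)

    def forward(self, state, bank):
        # state: (batch, num_tokens, dim) rolling memory state
        # bank:  (batch, num_anchors, dim) past segment anchors
        queries = state + self.mem_tokens             # condition queries on state
        enriched, _ = self.attn(queries, bank, bank)  # selective retrieval
        b, t, d = state.shape
        # GRUCell operates on flat vectors; fold tokens into the batch dim.
        # The gate decides old-vs-new mixing, preventing catastrophic overwrite.
        new_state = self.gate(enriched.reshape(b * t, d),
                              state.reshape(b * t, d))
        return new_state.reshape(b, t, d)

mu = MemoryUpdate()
out = mu(torch.zeros(2, 8, 768), torch.randn(2, 64, 768))  # (2, 8, 768)
```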

2.3 Training

Both systems use frozen long-context teachers that see the full document in one pass:

Loss function:

```
L = InfoNCE(proj(student_cls), teacher_cls)        × 1.0
  + Procrustes_SVD(student_cls, teacher_cls)       × 0.3
  + |pentachoron_CV(bank_anchors) - 0.20|          × 0.05
```

The pentachoron CV loss is computed on the bank anchors specifically — the bottleneck representation between segments. This is the critical design choice (see Section 4).
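A hedged sketch of this loss follows. The article does not spell out the pentachoron CV computation; the version below assumes one plausible reading (coefficient of variation of Cayley-Menger 4-simplex volumes over random 5-anchor subsets) and should be checked against the released code:

```python
import torch
import torch.nn.functional as F

def pentachoron_cv(anchors, num_samples=64):
    """CV (std/mean) of 4-simplex volumes over random 5-anchor subsets.

    ASSUMPTION: one plausible reading via the Cayley-Menger determinant;
    the released implementation may differ.
    """
    vols = []
    for _ in range(num_samples):
        p = anchors[torch.randperm(anchors.shape[0])[:5]]  # (5, dim)
        d2 = torch.cdist(p, p) ** 2                        # squared distances
        cm = torch.ones(6, 6, dtype=anchors.dtype)
        cm[0, 0] = 0.0
        cm[1:, 1:] = d2
        vol2 = -torch.linalg.det(cm) / 9216.0              # 4-simplex CM formula
        vols.append(torch.clamp(vol2, min=0.0).sqrt())
    vols = torch.stack(vols)
    return vols.std() / (vols.mean() + 1e-8)

def memory_loss(student_cls, teacher_cls, bank_anchors, proj, tau=0.07):
    """Weighted sum mirroring the loss above (weights 1.0 / 0.3 / 0.05)."""
    s = F.normalize(proj(student_cls), dim=-1)
    t = F.normalize(teacher_cls, dim=-1)
    info_nce = F.cross_entropy(s @ t.T / tau, torch.arange(len(s)))
    u, _, vh = torch.linalg.svd(s.T @ t)                   # best orthogonal map
    procrustes = (s @ (u @ vh) - t).pow(2).mean()
    cv_pen = (pentachoron_cv(bank_anchors) - 0.20).abs()
    return info_nce + 0.3 * procrustes + 0.05 * cv_pen

loss = memory_loss(torch.randn(8, 768), torch.randn(8, 1024),
                   torch.randn(32, 768), torch.nn.Linear(768, 1024))
```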


3. Results

3.1 GEOLIP-BERT-8192

Configuration:

| Component | Detail |
| --- | --- |
| Student | BERT-large (frozen, 512 ctx, 1024-dim, 24 layers) |
| Teachers | ModernBERT-large (8192 ctx), Longformer-large (4096 ctx) |
| Memory | 49M trainable params, 16 memory tokens, 128-anchor bank |
| Data | WikiText-103 |
| Segments | 480 tokens per segment, 64-token overlap, 16 max segments |

Procrustes pre-alignment:

| Pair | cos before | cos after |
| --- | --- | --- |
| BERT → ModernBERT | 0.003 | 0.489 |
| BERT → Longformer | -0.001 | 0.521 |

Training results (1 epoch):

| Metric | Train | Val |
| --- | --- | --- |
| ModernBERT match accuracy | 0.927 | 0.812 |
| Longformer match accuracy | 0.742 | 0.656 |
| Pentachoron CV | 0.200 | 0.175 |

CV converged to exactly 0.200 — the center of the universal band. No teacher is required at inference. The memory system internalized what ModernBERT computes with full 8,192-token attention.

Repository: AbstractPhil/geolip-bert-8192

3.2 GEOLIP-CLIP-ctx576

Configuration:

| Component | Detail |
| --- | --- |
| Student | CLIP-ViT-L/14 text encoder (frozen, 77 ctx, 768-dim, 12 layers) |
| Teacher | ModernBERT-large (4096 ctx, 1024-dim) |
| Memory | 34.5M trainable params, 8 memory tokens, 64-anchor bank |
| Data | CC12M LLaVA-next captions (50K train, 2K val, mean 96 tokens) |
| Segments | 18 tokens per segment, 4-token overlap, 32 max segments |

Procrustes pre-alignment:

| Pair | cos before | cos after |
| --- | --- | --- |
| CLIP → ModernBERT | 0.001 | 0.816 |

This is the strongest Procrustes alignment observed across all experiments — CLIP's contrastive pre-training produces representations 81.6% rotationally alignable with ModernBERT despite entirely different architectures, tokenizers, and training objectives.

Training results (10 epochs, fully instrumented):

| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
| --- | --- | --- | --- | --- |
| 1 | 0.628 | 0.835 | | 0.185 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |

Val m_acc5=1.000: the correct ModernBERT match is in the top 5 every time across 2,000 held-out captions. Train/val gap of 0.001 indicates no overfitting.

The segment size discovery. Initial experiments with 55-token segments produced only 2 segments per caption. The pentachoron CV loss requires 5+ anchors to compute. With 2 anchors, CV was dead at 0.029 and training plateaued at m_acc=0.670 via projector shortcut (Section 4). Reducing to 18-token segments produced 7 segments per caption, activating the CV regularization. Result: m_acc jumped from 0.670 to 0.945.
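The segment counts follow directly from the stride arithmetic. Assuming the 55-token runs used the same 4-token overlap quoted for the 18-token configuration (the text only states the overlap for the latter), a quick check reproduces both numbers for the mean 96-token caption:

```python
import math

def num_segments(n_tokens, seg_len, overlap):
    """Segments needed to cover n_tokens with fixed length and overlap."""
    stride = seg_len - overlap
    if n_tokens <= seg_len:
        return 1
    return 1 + math.ceil((n_tokens - seg_len) / stride)

# Mean 96-token caption from the CC12M setup above:
print(num_segments(96, seg_len=18, overlap=4))  # -> 7 segments (CV loss active)
print(num_segments(96, seg_len=55, overlap=4))  # -> 2 segments (CV loss dead)
```

Seven segments clears the 5-anchor minimum for the pentachoron computation; two does not, which is the whole difference between the plateaued and the climbing run.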

Repository: AbstractPhil/geolip-clip-vit-large-patch14-ctx576

Usage:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True)
model.to("cuda").eval()

embedding = model.encode("A long detailed caption that exceeds 77 tokens...")
# Returns: (768,) — drop into any CLIP-L pipeline
```

4. The Role of Geometric Regularization

4.1 Dual-ViT Experiments

To isolate the contribution of each loss component, we trained pairs of ViT encoders with different topological lensing (non-overlapping patch embedding vs. overlapping convolutional stem) on CIFAR-10, varying the alignment mechanism:

| Version | Alignment method | R@1 (proj) | R@1 (raw CLS) | CV |
| --- | --- | --- | --- | --- |
| v1 | Procrustes loss | 0.000 | 0.000 | 0.25 |
| v2 | Procrustes + simplex anchor | 0.002 | 0.000 | 0.19 |
| v3 | InfoNCE + simplex | 0.999 | 0.000 | 0.19 |
| v4 | InfoNCE + perturbation (σ=0.5) | 1.000 | 0.000 | 0.17 |

Finding 1: Procrustes is a regularizer, not a force. v1 used Procrustes alignment as the training loss. P_cos remained at 0.094 for 30 epochs — no alignment was created. Procrustes measures alignability; it does not create it. However, the experiments that included Procrustes produced tighter CV (0.19 vs 0.25 for independent training). It shapes geometry without creating alignment.

Finding 2: InfoNCE is the necessary and sufficient force. Replacing Procrustes loss with InfoNCE in v3 produced R@1=0.999 in the shared projection space — from 0.000 to near-perfect in a single architectural change.

Finding 3: Alignment lives where you put it. R@1 in the raw CLS space was 0.000 across all versions, including v4 with heavy noise injection (σ=0.5, 40% dropout) before the projection head. The two-layer projector absorbed all perturbation and maintained perfect projection-space alignment without any alignment propagating into the backbone encoders. Different topological lensings produce irreducibly different coordinate systems.

Finding 4: The simplex factory provides the scaffold. Regular pentachora (4-simplices) from the SimplexFactory served as geometric reference structures. All edges equal, maximum symmetry, guaranteed non-degenerate. The CV loss measured volume ratios relative to this reference, providing more structural information than raw CV alone.
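The reference structure is easy to reproduce with the textbook construction (this is the standard regular 4-simplex, not necessarily the SimplexFactory's internal parameterization):

```python
import numpy as np

def regular_pentachoron(dim=768):
    """Five vertices of a unit-edge regular 4-simplex embedded in R^dim.

    Standard construction: the 5 basis vectors of R^5 form a regular
    4-simplex (all pairwise distances sqrt(2)); center it, rescale to
    unit edges, and zero-pad up to the target embedding dimension.
    """
    assert dim >= 5
    v = np.zeros((5, dim))
    v[:, :5] = np.eye(5)
    v -= v.mean(axis=0, keepdims=True)  # center at the origin
    return v / np.sqrt(2)               # unit edge length

verts = regular_pentachoron()
d = np.linalg.norm(verts[:, None] - verts[None, :], axis=-1)
edges = d[np.triu_indices(5, k=1)]      # all 10 edges equal: length 1.0
```

All edges equal and non-degenerate by construction, so volume ratios measured against this reference carry the structural information the CV loss alone lacks.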

4.2 The Projector Shortcut Problem

The CLIP context extension experiments revealed a critical failure mode:

| Segment size | Segments per caption | CV active? | m_acc | Mechanism |
| --- | --- | --- | --- | --- |
| 55 tokens | 2 | No (0.029) | 0.670 (plateau) | Projector shortcut |
| 18 tokens | 7 | Yes (0.19) | 0.945 (climbing) | Bank accumulation |

With 55-token segments, most captions produced only 2 segments — below the 5-anchor minimum for pentachoron computation. The CV loss returned ~0, providing no gradient. Without geometric constraint on the bank anchors, the teacher projector learned to map whatever the memory system produced directly to ModernBERT's space. The memory bank was decorative.

With 18-token segments, 7 segments per caption activated the CV loss. The bank anchors were forced to maintain geometric structure, which propagated upstream through the cross-attention into the memory accumulation mechanism. The projector received a geometrically shaped representation and could not bypass the bank.

Principle: Geometric regularization must be applied at the bottleneck — the representation that carries information between components. Regularizing only the output allows upstream shortcuts.
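The failure mode reduces to a silent dead zone in the loss. A toy version of the guard (a hypothetical helper, not the released code) makes it visible:

```python
def cv_penalty(num_anchors, cv_value, target=0.20):
    """Hypothetical guard mirroring the failure mode described above:
    a pentachoron needs 5 vertices, so with fewer than 5 anchors the CV
    term silently drops out and nothing constrains the bank geometry."""
    if num_anchors < 5:
        return 0.0                 # no gradient: projector free to shortcut
    return abs(cv_value - target)  # active: bank anchors get shaped

print(cv_penalty(2, 0.029))  # -> 0.0 (55-token segments: CV loss dead)
print(cv_penalty(7, 0.19))   # -> ~0.01 (18-token segments: loss active)
```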


5. Cascade Compression

5.1 Iterative SVD Cascade

Separately from the memory systems, we investigated whether the pentachoron CV could guide model compression. The hypothesis: if a trained model's geometric structure (CV≈0.20) can be SVD-projected to smaller dimensions while preserving the CV, the function should transfer.

Toy MLP results (positive):

| Cascade | Dim | Accuracy | Compression | Heal epochs |
| --- | --- | --- | --- | --- |
| Root model | 256 | 87.4% | | |
| 9-step cascade (256→64) | 64 | 84.6% | 11.1× | 8 total |
| Direct jump (256→64) | 64 | 29.6% | 11.1× | 1 |
| Scratch (64-dim, 1 epoch) | 64 | 38.1% | 11.1× | 1 |

At ratio r=0.95 (27 steps), the cascade exceeded the root model at 89.2% accuracy — acting as a regularizer that strips noise while retaining signal.
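The schedule arithmetic behind these cascades is simple to verify:

```python
start, end = 256, 64

# Per-step ratio implied by a 9-step cascade from 256 -> 64 dims:
r9 = (end / start) ** (1 / 9)
print(round(r9, 3))                  # -> 0.857

# The finer r=0.95 cascade reaches the same floor in 27 steps:
print(round(start * 0.95 ** 27, 2))  # -> 64.09, i.e. the 64-dim floor
```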

BERT-base results (negative):

| Method | Top-1 retained | CV | Notes |
| --- | --- | --- | --- |
| Root BERT-base (768-dim) | 100% | 0.22 | Baseline |
| Independent SVD + MLM healing | 61.5% | 0.22–0.24 | CV preserved, composition broken |
| + teacher global projector | 62.5% | 0.22 | +1% from distillation |
| + per-layer projectors | 60.6% | 0.22 | All "fixes" negative |
| Shared basis (global rotation) | ~7% | 0.10–0.15 | LayerNorm destroyed |
| Two-level Procrustes | ~7% | 0.10–0.12 | Inter-layer mismatch |

5.2 The 60% Ceiling

The ceiling at 60–62% is invariant to healing method, loss function, projector architecture, and training budget. Five different approaches hit the same wall.

Diagnosis: CV remained at 0.22 throughout — per-layer geometry was perfectly preserved. The missing 40% is inter-layer compositional structure: how layer i's output feeds into layer i+1 through LayerNorm, residual connections, and attention. Independent SVD rotates each matrix into a different coordinate frame. 72 independent rotations break the compositional chain.
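A toy NumPy illustration of the diagnosis: per-matrix truncation is optimal layer by layer (Eckart-Young), yet the product of independently truncated factors is itself a rank-k candidate, so truncating the composition directly is always at least as good. Per-matrix optimality does not compose:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16
W1 = rng.standard_normal((128, 128))
W2 = rng.standard_normal((128, 128))

def truncate(W, k):
    """Best rank-k approximation of W (Eckart-Young optimal)."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    return u[:, :k] * s[:k] @ vt[:k]

M = W2 @ W1
# Each factor truncated in its own singular frame (per-layer optimal):
err_independent = np.linalg.norm(M - truncate(W2, k) @ truncate(W1, k))
# The composition truncated in a single shared frame:
err_joint = np.linalg.norm(M - truncate(M, k))

print(err_joint <= err_independent)  # -> True
```

This is only a two-matrix caricature of the 72-rotation problem, but it captures why per-layer CV preservation can coexist with broken end-to-end function.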

Key finding: Homogeneous operations (same SVD formula applied uniformly to all matrices) preserved more function than heterogeneous "corrections" (different methods for FFN vs attention, rotated biases, L1 pruning). An external review suggested five theoretically justified fixes; applying all five reduced retention from 61.5% to 60.6%. The "bugs" were load-bearing — two wrongs partially canceling in a co-adapted system.

5.3 Potential Directions

Hypothesis: A series of geometric basin experiments, each tuned to capture the independently represented inter-layer differences and then regress those differences into a usably similar shared representation, could recover the missing compositional structure. This requires a full experimental setup and multiple runs; I have preliminary evidence that it is possible, but not enough to commit a week to a single project.

Possible Process: A sequence of finetunes from the upper dimensional floor to the lower one, while testing a range of adjacent techniques (DARE merging, attention scaling, attention head resizing) and carefully anchoring the alignment of every stage to the CV deviation observed throughout the structure.


6. The Blueprint

Three production systems, one pattern:

| Component | GEOLIP-Bertenstein | GEOLIP-BERT-8192 | GEOLIP-CLIP-ctx576 |
| --- | --- | --- | --- |
| Student | BERT-large (hub) | BERT-large (512 ctx) | CLIP-L text (77 ctx) |
| Teachers | DINOv2, Whisper, ESM-2, CodeBERT | ModernBERT, Longformer | ModernBERT |
| Alignment force | InfoNCE | InfoNCE | InfoNCE |
| Geometric regularizer | Pentachoron CV | Pentachoron CV | Pentachoron CV + Procrustes SVD |
| Projector init | Procrustes | Procrustes | Procrustes (cos 0.816) |
| Result | R@1 = 1.000 | m_acc = 0.927 | m_acc = 0.945 |
| CV | 0.20 | 0.200 | 0.162 |

The pattern:

  1. Frozen expert teacher provides a stable reference frame. The teacher sees the full input; the student sees it through a constrained window.
  2. Procrustes initialization warm-starts the projector from static alignment analysis. CLIP→ModernBERT achieved cos 0.816 — the strongest pre-alignment observed.
  3. InfoNCE is the alignment force. It demands per-sample nearest-neighbor matching in the shared space. Procrustes measures alignment; InfoNCE creates it.
  4. Pentachoron CV on the bottleneck prevents projector shortcut collapse. The bottleneck is the bank anchors — the representation that carries information between segments. Without CV regularization here, the projector absorbs all alignment work and the memory bank becomes decorative.

7. Procrustes Alignment Summary

Across all experiments, Procrustes alignment reveals the baseline compatibility between encoder pairs:

| Pair | cos before | cos after | Context |
| --- | --- | --- | --- |
| BERT → ModernBERT | 0.003 | 0.489 | GEOLIP-BERT-8192 |
| BERT → Longformer | -0.001 | 0.521 | GEOLIP-BERT-8192 |
| CLIP → ModernBERT | 0.001 | 0.816 | GEOLIP-CLIP-ctx576 |
| VAE SD1.5 → Flux.2 | -0.000 | 0.757 | Prior work |
| BERT → DINOv2 | | 0.613 | Prior work |

CLIP's contrastive pre-training produces representations that are substantially more alignable with language models than BERT's MLM pre-training. This may reflect multimodal grounding: CLIP was trained to align text with visual concepts, which appears to produce a representation geometry closer to that of other language models.
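One way such before/after numbers can be measured is sketched below: solve orthogonal Procrustes on paired embeddings and compare mean row-wise cosine before and after the fitted rotation. This assumes dimension-matched pairs; the article's actual pipeline (including how 768-dim and 1024-dim spaces are matched) is not specified here:

```python
import numpy as np

def procrustes_cosines(A, B):
    """Mean row-wise cosine between paired embeddings before and after
    the best orthogonal map A -> B (orthogonal Procrustes via SVD).

    Sketch only: A and B are (n_samples, dim) with rows paired.
    """
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    u, _, vt = np.linalg.svd(A.T @ B)
    R = u @ vt                                        # best rotation A -> B
    cos = lambda X, Y: float(np.mean(np.sum(X * Y, axis=1)))
    return cos(A, B), cos(A @ R, B)

# Two random, unrelated spaces: near-zero cosine before; the fitted
# rotation can only raise it (identity is always a candidate rotation).
rng = np.random.default_rng(0)
before, after = procrustes_cosines(rng.standard_normal((1000, 64)),
                                   rng.standard_normal((1000, 64)))
```

The "cos after" column is thus a measure of static rotational compatibility, which is exactly why it predicts projector warm-start quality but not trained alignment.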


8. Conclusion

The geometric structure of learned representations is not an artifact of specific architectures or training objectives. It is a convergent property of gradient-based optimization on sufficiently large data. The pentachoron CV band (0.16–0.23) appears in every system we have measured — 17 pretrained models, 3 fusion systems, and 2 context-extension systems — regardless of modality, task, or architecture.

This structure enables a practical blueprint for cross-model systems: freeze an expert, measure Procrustes compatibility, initialize projectors from the alignment, apply InfoNCE as the contrastive force, and regularize the bottleneck geometry with pentachoron CV. The blueprint has now produced three systems with near-perfect alignment metrics, each trained in under 2 hours on a single GPU.

The deeper finding is about what these systems reveal: the barrier to cross-model alignment is not data, compute, or architecture. The representations are already geometrically compatible. The barrier is knowing where to apply the force (InfoNCE, not Procrustes), where to apply the constraint (bottleneck anchors, not final output), and what to measure (pentachoron CV as the structural invariant).


Reproducibility

| System | Repository | Key files |
| --- | --- | --- |
| Geometric terrain analysis (17 models) | AbstractPhil/procrustes-analysis | Profiling scripts, cached results |
| GEOLIP-Bertenstein (4-modal fusion) | AbstractPhil/geolip-bertenstein | Architecture, training, evaluation |
| GEOLIP-BERT-8192 (context extension) | AbstractPhil/geolip-bert-8192 | Architecture, weights, training scripts |
| GEOLIP-CLIP-ctx576 (CLIP extension) | AbstractPhil/geolip-clip-vit-large-patch14-ctx576 | AutoModel, weights, metrics.json, TensorBoard |

GEOLIP-CLIP-ctx576 includes full training metrics (metrics.json with per-step and per-epoch data), TensorBoard event files, and is loadable via AutoModel.from_pretrained with trust_remote_code=True.

All experiments were conducted on a single NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM).
