Geometric Memory: Context Extension and Cross-Model Alignment Through Pentachoron Regularization
AbstractPhil, March 2026
Abstract
We present three systems built on a single architectural blueprint: frozen expert encoder + geometric memory bank + InfoNCE alignment + pentachoron CV regularization. GEOLIP-BERT-8192 extends BERT-large from 512 to 8,192 tokens via distillation from ModernBERT-large and Longformer-large, achieving m_acc=0.927 with CV=0.200. GEOLIP-CLIP-ctx576 extends CLIP-ViT-L/14 from 77 to 576 tokens via distillation from ModernBERT-large, achieving m_acc=0.945 (val=0.944) with CV=0.162. Both systems require no teacher at inference. We also present a series of controlled experiments on dual-ViT cross-topology alignment demonstrating that Procrustes alignment functions as a geometric regularizer (not an alignment force), that InfoNCE is the necessary force for cross-model alignment, and that geometric regularization on the bottleneck representation prevents projector shortcut collapse. Additionally, we report negative results on BERT compression via iterative SVD cascade, establishing a 60% function retention ceiling for independent per-matrix projection on compositional transformers. All systems converge to pentachoron CV values in the 0.16–0.20 range, consistent with the band first reported in our geometric terrain analysis of 17 architectures.
1. Introduction
Our prior work (Geometric Fusion: Cross-Modal Alignment Through Shared Pentachoron Geometry) established three findings: (1) the pentachoron coefficient of variation (CV) converges to a universal band of 0.20–0.23 across 17 independently trained models spanning 5 architecture families, (2) BERT-large and DINOv2-large weight matrices are 61% Procrustes-alignable despite having no shared training signal, and (3) a single-layer fusion transformer achieves R@1=1.0000 on cross-modal retrieval by exploiting this shared geometric structure.
This work extends those findings in four directions:
Context extension via geometric memory. We demonstrate that the same geometric principles enable context window extension: a memory bank with depth-profile anchors, regularized by pentachoron CV, can extend a frozen encoder's effective context by 8–16× while preserving alignment with a long-context teacher.
The blueprint. We formalize the architecture pattern that produced three production systems: frozen expert teacher provides the target, Procrustes initialization warm-starts the projector, InfoNCE provides the alignment force, and pentachoron CV on the bottleneck representation prevents collapse.
Procrustes as regularizer, not force. Controlled dual-ViT experiments demonstrate that Procrustes alignment loss shapes embedding geometry but cannot create cross-model alignment by itself. InfoNCE is the necessary and sufficient force. Procrustes contributes as an active geometric regularizer during training.
Compression limits. Iterative SVD cascade compression of BERT-base reveals a 60% function retention ceiling when projecting weight matrices independently, despite perfect preservation of per-layer pentachoron CV. The missing 40% is inter-layer compositional structure that independent per-matrix projection fails to capture in our current tests.
2. Geometric Memory Architecture
2.1 The Context Extension Problem
BERT-large has a 512-token context window. CLIP-ViT-L/14 has a 77-token context window. Long-context models exist (ModernBERT at 8,192 tokens, Longformer at 4,096), but replacing a frozen encoder breaks downstream compatibility. The goal: extend context while preserving the original embedding space.
2.2 Architecture
The memory system wraps a frozen encoder without modifying its internals:
```
Document (N tokens, N >> encoder context)
│
├── Split into segments (overlapping, sized to encoder window)
│
├── For each segment:
│   ├── Frozen encoder forward → hidden states at multiple layers
│   ├── Multi-layer fusion (learned weighted sum of extracted layers)
│   ├── Memory tokens cross-attend to fused hidden states
│   ├── Depth-profile compressor: per-layer CLS → single anchor (L2-normalized)
│   ├── Anchor stored in geometric memory bank
│   └── GRU gate updates rolling memory state
│
└── Final output: encoder-compatible embedding (same dimensionality)
```
Depth-profile anchors. Each segment produces an anchor that encodes not what the encoder output, but how the encoder processed the segment across all depths. For BERT, this is 8 hidden layers concatenated (8×1024=8192 dims) compressed to 1024. For CLIP, 6 layers (6×768=4608) compressed to 768. Two segments with identical final outputs but different processing trajectories produce different anchors.
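As an illustration, here is a minimal numpy sketch of the anchor computation described above. `depth_profile_anchor` and the random compressor `W` are hypothetical stand-ins for the learned compressor; only the shapes follow the BERT configuration (8 layers × 1024 dims → 1024):

```python
import numpy as np

def depth_profile_anchor(cls_per_layer, W):
    """Compress per-layer [CLS] states into one L2-normalized anchor.
    cls_per_layer: (num_layers, hidden), e.g. 8 x 1024 for BERT-large.
    W: (num_layers*hidden, hidden), a random stand-in for the learned compressor."""
    profile = cls_per_layer.reshape(-1)        # 8*1024 = 8192 dims
    anchor = profile @ W                       # compress to 1024 dims
    return anchor / np.linalg.norm(anchor)     # L2-normalize

rng = np.random.default_rng(0)
cls_states = rng.standard_normal((8, 1024))    # one [CLS] per extracted layer
W = rng.standard_normal((8 * 1024, 1024)) / np.sqrt(8 * 1024)
anchor = depth_profile_anchor(cls_states, W)   # shape (1024,), unit norm
```

Because the anchor is built from the whole depth profile, two segments with identical final-layer outputs but different intermediate trajectories yield different anchors.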
Bank cross-attention. Memory tokens (8 for CLIP, 16 for BERT) query the bank of past anchors via 2-layer cross-attention. This enables selective retrieval: which past segments are relevant for the current context.
GRU gate. Controls the update ratio between old memory state and new enrichment. Prevents catastrophic overwriting of past context.
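The bank retrieval and gated update can be sketched together in numpy (single-head, single-layer; the system uses 2 cross-attention layers, and all names and weight shapes here are illustrative assumptions, not the shipped implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, bank_size = 64, 16, 128   # BERT config: 16 memory tokens, 128-anchor bank

def cross_attend(queries, bank, Wq, Wk, Wv):
    """Memory tokens (queries) attend over stored anchors (keys/values)."""
    Q, K, V = queries @ Wq, bank @ Wk, bank @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over bank entries
    return w @ V                                # selective retrieval of past segments

def gru_gate(m_old, enrich, Wz, bz):
    """Gated convex update: z decides how much old memory survives,
    preventing catastrophic overwriting of past context."""
    z = 1.0 / (1.0 + np.exp(-(np.concatenate([m_old, enrich], -1) @ Wz + bz)))
    return z * m_old + (1.0 - z) * enrich

mem = rng.standard_normal((n_mem, d))
bank = rng.standard_normal((bank_size, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Wz = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

enriched = cross_attend(mem, bank, Wq, Wk, Wv)
mem_new = gru_gate(mem, enriched, Wz, np.zeros(d))   # rolling memory state
```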
2.3 Training
Both systems use frozen long-context teachers that see the full document in one pass:
Loss function:
```
L = InfoNCE(proj(student_cls), teacher_cls)      × 1.0
  + Procrustes_SVD(student_cls, teacher_cls)     × 0.3
  + |pentachoron_CV(bank_anchors) − 0.20|        × 0.05
```
The pentachoron CV loss is computed on the bank anchors specifically — the bottleneck representation between segments. This is the critical design choice (see Section 4).
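A minimal numpy sketch of the first and third terms (the Procrustes term is omitted for brevity). Computing CV over pairwise anchor distances is an assumption standing in for the full pentachoron computation, and all function names are illustrative:

```python
import numpy as np
from itertools import combinations

def info_nce(student_proj, teacher, tau=0.07):
    """Contrastive alignment force: projected student row i must pick out
    teacher row i against all other rows in the batch."""
    s = student_proj / np.linalg.norm(student_proj, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_p)))       # cross-entropy, targets = diagonal

def cv_loss(bank_anchors, target=0.20):
    """|CV - 0.20| on the bank anchors. CV over pairwise distances is a
    stand-in here; the paper computes it on pentachoron (4-simplex) cells."""
    d = [np.linalg.norm(bank_anchors[i] - bank_anchors[j])
         for i, j in combinations(range(len(bank_anchors)), 2)]
    return abs(float(np.std(d) / np.mean(d)) - target)

rng = np.random.default_rng(0)
student = rng.standard_normal((32, 768))
teacher = rng.standard_normal((32, 768))
anchors = rng.standard_normal((7, 768))          # 7 segments -> CV is active
total = 1.0 * info_nce(student, teacher) + 0.05 * cv_loss(anchors)
```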
3. Results
3.1 GEOLIP-BERT-8192
Configuration:
| Component | Detail |
|---|---|
| Student | BERT-large (frozen, 512 ctx, 1024-dim, 24 layers) |
| Teachers | ModernBERT-large (8192 ctx), Longformer-large (4096 ctx) |
| Memory | 49M trainable params, 16 memory tokens, 128-anchor bank |
| Data | WikiText-103 |
| Segments | 480 tokens per segment, 64-token overlap, 16 max segments |
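Given this configuration, the segmentation step might look like the following sketch (`split_segments` is a hypothetical helper, not the repository's code; the stride is segment length minus overlap):

```python
def split_segments(tokens, seg_len=480, overlap=64, max_segments=16):
    """Overlapping segmentation sized to the frozen encoder's window.
    Defaults mirror the GEOLIP-BERT-8192 config: 480-token segments,
    64-token overlap, at most 16 segments."""
    stride = seg_len - overlap
    starts = range(0, max(1, len(tokens) - overlap), stride)
    return [tokens[i:i + seg_len] for i in starts][:max_segments]

segs = split_segments(list(range(2000)))
# Consecutive segments share their 64-token overlap region.
```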
Procrustes pre-alignment:
| Pair | cos before | cos after |
|---|---|---|
| BERT → ModernBERT | 0.003 | 0.489 |
| BERT → Longformer | -0.001 | 0.521 |
Training results (1 epoch):
| Metric | Train | Val |
|---|---|---|
| ModernBERT match accuracy | 0.927 | 0.812 |
| Longformer match accuracy | 0.742 | 0.656 |
| Pentachoron CV | 0.200 | 0.175 |
Training CV converged to exactly 0.200, squarely inside the universal band. No teacher is required at inference: the memory system has internalized what ModernBERT computes with full 8,192-token attention.
Repository: AbstractPhil/geolip-bert-8192
3.2 GEOLIP-CLIP-ctx576
Configuration:
| Component | Detail |
|---|---|
| Student | CLIP-ViT-L/14 text encoder (frozen, 77 ctx, 768-dim, 12 layers) |
| Teacher | ModernBERT-large (4096 ctx, 1024-dim) |
| Memory | 34.5M trainable params, 8 memory tokens, 64-anchor bank |
| Data | CC12M LLaVA-next captions (50K train, 2K val, mean 96 tokens) |
| Segments | 18 tokens per segment, 4-token overlap, 32 max segments |
Procrustes pre-alignment:
| Pair | cos before | cos after |
|---|---|---|
| CLIP → ModernBERT | 0.001 | 0.816 |
This is the strongest Procrustes alignment observed across all experiments — CLIP's contrastive pre-training produces representations 81.6% rotationally alignable with ModernBERT despite entirely different architectures, tokenizers, and training objectives.
Training results (10 epochs, fully instrumented):
| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
|---|---|---|---|---|
| 1 | 0.628 | 0.835 | — | 0.185 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |
Val m_acc5=1.000: the correct ModernBERT match is in the top 5 every time across 2,000 held-out captions. Train/val gap of 0.001 indicates no overfitting.
The segment size discovery. Initial experiments with 55-token segments produced only 2 segments per caption. The pentachoron CV loss requires 5+ anchors to compute. With 2 anchors, CV was dead at 0.029 and training plateaued at m_acc=0.670 via projector shortcut (Section 4). Reducing to 18-token segments produced 7 segments per caption, activating the CV regularization. Result: m_acc jumped from 0.670 to 0.945.
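The segment arithmetic behind this discovery can be checked directly. The sketch below assumes stride = segment length minus overlap, with a final partial segment; the 4-token overlap for the 55-token run is an assumption (only the 18-token configuration's overlap is documented):

```python
import math

def n_segments(n_tokens, seg_len, overlap):
    """Count overlapping segments covering a caption of n_tokens."""
    stride = seg_len - overlap
    if n_tokens <= seg_len:
        return 1
    return 1 + math.ceil((n_tokens - seg_len) / stride)

# Mean CC12M caption length: 96 tokens.
n_segments(96, 18, 4)   # -> 7: enough anchors to activate the pentachoron CV loss
n_segments(96, 55, 4)   # -> 2: below the 5-anchor minimum, CV loss goes dead
```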
Repository: AbstractPhil/geolip-clip-vit-large-patch14-ctx576
Usage:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True,
)
model.to("cuda").eval()

embedding = model.encode("A long detailed caption that exceeds 77 tokens...")
# Returns: (768,), a drop-in for any CLIP-L pipeline
```
4. The Role of Geometric Regularization
4.1 Dual-ViT Experiments
To isolate the contribution of each loss component, we trained pairs of ViT encoders with different topological lensing (non-overlapping patch embedding vs. overlapping convolutional stem) on CIFAR-10, varying the alignment mechanism:
| Version | Alignment method | R@1 (proj) | R@1 (raw CLS) | CV |
|---|---|---|---|---|
| v1 | Procrustes loss | 0.000 | 0.000 | 0.25 |
| v2 | Procrustes + simplex anchor | 0.002 | 0.000 | 0.19 |
| v3 | InfoNCE + simplex | 0.999 | 0.000 | 0.19 |
| v4 | InfoNCE + perturbation (σ=0.5) | 1.000 | 0.000 | 0.17 |
Finding 1: Procrustes is a regularizer, not a force. v1 used Procrustes alignment as the training loss. P_cos remained at 0.094 for 30 epochs — no alignment was created. Procrustes measures alignability; it does not create it. However, the experiments that included Procrustes produced tighter CV (0.19 vs 0.25 for independent training). It shapes geometry without creating alignment.
Finding 2: InfoNCE is the necessary and sufficient force. Replacing the Procrustes loss with InfoNCE in v3 produced R@1=0.999 in the shared projection space, from 0.000 to near-perfect via a single change to the training loss.
Finding 3: Alignment lives where you put it. R@1 in the raw CLS space was 0.000 across all versions, including v4 with heavy noise injection (σ=0.5, 40% dropout) before the projection head. The two-layer projector absorbed all perturbation and maintained perfect projection-space alignment without any alignment propagating into the backbone encoders. Different topological lensings produce irreducibly different coordinate systems.
Finding 4: The simplex factory provides the scaffold. Regular pentachora (4-simplices) from the SimplexFactory served as geometric reference structures. All edges equal, maximum symmetry, guaranteed non-degenerate. The CV loss measured volume ratios relative to this reference, providing more structural information than raw CV alone.
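One standard construction of such a reference (a sketch only; the actual SimplexFactory API is not shown here): re-centering the five standard basis vectors of R^5 yields a regular pentachoron whose ten edges are all equal, giving a zero-CV geometric scaffold.

```python
import numpy as np
from itertools import combinations

# Regular 4-simplex (pentachoron): the 5 standard basis vectors of R^5,
# re-centered on their mean, give 5 vertices with all 10 pairwise
# distances equal to sqrt(2).
V = np.eye(5) - 1.0 / 5.0
edges = np.array([np.linalg.norm(V[i] - V[j])
                  for i, j in combinations(range(5), 2)])
edge_cv = edges.std() / edges.mean()   # 0 for a perfectly regular reference
```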
4.2 The Projector Shortcut Problem
The CLIP context extension experiments revealed a critical failure mode:
| Segment size | Segments per caption | CV active? | m_acc | Mechanism |
|---|---|---|---|---|
| 55 tokens | 2 | No (0.029) | 0.670 (plateau) | Projector shortcut |
| 18 tokens | 7 | Yes (0.19) | 0.945 (climbing) | Bank accumulation |
With 55-token segments, most captions produced only 2 segments — below the 5-anchor minimum for pentachoron computation. The CV loss returned ~0, providing no gradient. Without geometric constraint on the bank anchors, the teacher projector learned to map whatever the memory system produced directly to ModernBERT's space. The memory bank was decorative.
With 18-token segments, 7 segments per caption activated the CV loss. The bank anchors were forced to maintain geometric structure, which propagated upstream through the cross-attention into the memory accumulation mechanism. The projector received a geometrically shaped representation and could not bypass the bank.
Principle: Geometric regularization must be applied at the bottleneck — the representation that carries information between components. Regularizing only the output allows upstream shortcuts.
5. Cascade Compression
5.1 Iterative SVD Cascade
Separately from the memory systems, we investigated whether the pentachoron CV could guide model compression. The hypothesis: if a trained model's geometric structure (CV≈0.20) can be SVD-projected to smaller dimensions while preserving the CV, the function should transfer.
Toy MLP results (positive):
| Cascade | Dim | Accuracy | Compression | Heal epochs |
|---|---|---|---|---|
| Root model | 256 | 87.4% | 1× | — |
| 9-step cascade (256→64) | 64 | 84.6% | 11.1× | 8 total |
| Direct jump (256→64) | 64 | 29.6% | 11.1× | 1 |
| Scratch (64-dim, 1 epoch) | 64 | 38.1% | 11.1× | 1 |
At ratio r=0.95 (27 steps), the cascade exceeded the root model at 89.2% accuracy — acting as a regularizer that strips noise while retaining signal.
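The schedule arithmetic is easy to verify (a sketch; the per-step rounding rule is an assumption): a fixed per-step ratio of 0.25^(1/9) ≈ 0.857 reaches 64 dims in 9 steps, while the gentler r=0.95 takes 27.

```python
def cascade_steps(d_start, d_target, r):
    """Count cascade steps when the width shrinks by ratio r each step,
    clamped at the target dimension."""
    d, steps = d_start, 0
    while d > d_target:
        d = max(d_target, round(d * r))
        steps += 1
    return steps

cascade_steps(256, 64, 0.25 ** (1 / 9))  # the 9-step cascade
cascade_steps(256, 64, 0.95)             # r=0.95: 27 small steps
```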
BERT-base results (negative):
| Method | Top-1 retained | CV | Notes |
|---|---|---|---|
| Root BERT-base (768-dim) | 100% | 0.22 | Baseline |
| Independent SVD + MLM healing | 61.5% | 0.22–0.24 | CV preserved, composition broken |
| + teacher global projector | 62.5% | 0.22 | +1% from distillation |
| + per-layer projectors | 60.6% | 0.22 | All "fixes" negative |
| Shared basis (global rotation) | ~7% | 0.10–0.15 | LayerNorm destroyed |
| Two-level Procrustes | ~7% | 0.10–0.12 | Inter-layer mismatch |
5.2 The 60% Ceiling
The ceiling at 60–62% is invariant to healing method, loss function, projector architecture, and training budget. Five different approaches hit the same wall.
Diagnosis: CV remained at 0.22 throughout — per-layer geometry was perfectly preserved. The missing 40% is inter-layer compositional structure: how layer i's output feeds into layer i+1 through LayerNorm, residual connections, and attention. Independent SVD rotates each matrix into a different coordinate frame. 72 independent rotations break the compositional chain.
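A small numpy demonstration of this diagnosis: rotating each weight matrix into an independently chosen frame leaves its singular spectrum (the per-matrix geometry) untouched while destroying the composed function. The two-layer ReLU network is an illustrative stand-in for a transformer block, not BERT itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d)
relu = lambda v: np.maximum(v, 0.0)

y_ref = W2 @ relu(W1 @ x)                       # original composed function

# Rotate each matrix into its own, independently chosen orthogonal frame:
R1 = np.linalg.qr(rng.standard_normal((d, d)))[0]
R2 = np.linalg.qr(rng.standard_normal((d, d)))[0]
W1_rot, W2_rot = R1 @ W1, W2 @ R2.T
y_rot = W2_rot @ relu(W1_rot @ x)

# Per-matrix geometry (singular spectrum) is preserved by rotation...
spectra_match = np.allclose(np.linalg.svd(W1_rot, compute_uv=False),
                            np.linalg.svd(W1, compute_uv=False))
# ...but the composed function breaks because R2 != R1: the frames mismatch
# exactly where layer 1's output feeds layer 2's input.
```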
Key finding: Homogeneous operations (same SVD formula applied uniformly to all matrices) preserved more function than heterogeneous "corrections" (different methods for FFN vs attention, rotated biases, L1 pruning). An external review suggested five theoretically justified fixes; applying all five reduced retention from 61.5% to 60.6%. The "bugs" were load-bearing — two wrongs partially canceling in a co-adapted system.
5.3 Potential Directions
Hypothesis: A series of geometric-basin experiments, each tuned to capture independently represented differences between layers, could autoregressively reconcile those differences into a usably similar representation. This requires a full setup and multiple experiments; preliminary evidence suggests it is possible, but not yet enough to justify committing a week to a single project.
Possible process: A sequence of finetunes stepping from the upper to the lower dimensional floor, testing a multitude of adjacent techniques along the way (DARE, attention scaling, attention-head resizing) while carefully anchoring the alignment of every component to the CV deviation throughout the structure.
6. The Blueprint
Three production systems, one pattern:
| Component | GEOLIP-Bertenstein | GEOLIP-BERT-8192 | GEOLIP-CLIP-ctx576 |
|---|---|---|---|
| Student | BERT-large (hub) | BERT-large (512 ctx) | CLIP-L text (77 ctx) |
| Teachers | DINOv2, Whisper, ESM-2, CodeBERT | ModernBERT, Longformer | ModernBERT |
| Alignment force | InfoNCE | InfoNCE | InfoNCE |
| Geometric regularizer | Pentachoron CV | Pentachoron CV | Pentachoron CV + Procrustes SVD |
| Projector init | Procrustes | Procrustes | Procrustes (cos 0.816) |
| Result | R@1 = 1.000 | m_acc = 0.927 | m_acc = 0.945 |
| CV | 0.20 | 0.200 | 0.162 |
The pattern:
- Frozen expert teacher provides a stable reference frame. The teacher sees the full input; the student sees it through a constrained window.
- Procrustes initialization warm-starts the projector from static alignment analysis. CLIP→ModernBERT achieved cos 0.816 — the strongest pre-alignment observed.
- InfoNCE is the alignment force. It demands per-sample nearest-neighbor matching in the shared space. Procrustes measures alignment; InfoNCE creates it.
- Pentachoron CV on the bottleneck prevents projector shortcut collapse. The bottleneck is the bank anchors — the representation that carries information between segments. Without CV regularization here, the projector absorbs all alignment work and the memory bank becomes decorative.
7. Procrustes Alignment Summary
Across all experiments, Procrustes alignment reveals the baseline compatibility between encoder pairs:
| Pair | cos before | cos after | Context |
|---|---|---|---|
| BERT → ModernBERT | 0.003 | 0.489 | GEOLIP-BERT-8192 |
| BERT → Longformer | -0.001 | 0.521 | GEOLIP-BERT-8192 |
| CLIP → ModernBERT | 0.001 | 0.816 | GEOLIP-CLIP-ctx576 |
| VAE SD1.5 → Flux.2 | -0.000 | 0.757 | Prior work |
| BERT → DINOv2 | — | 0.613 | Prior work |
CLIP's contrastive pre-training produces representations that are substantially more alignable with language models than BERT's MLM pre-training. This may reflect multimodal grounding: CLIP was trained to align text with visual concepts, which appears to produce a representation geometry closer to that of other language models.
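The cos-before/cos-after numbers above come from orthogonal Procrustes analysis. A minimal numpy sketch on synthetic data follows; the metric shown (mean row-wise cosine before and after applying the fitted rotation) is our reading of the procedure:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: rotation R minimizing ||X @ R - Y||_F,
    obtained from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mean_cosine(A, B):
    """Mean cosine similarity between corresponding rows."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(np.sum(A * B, axis=1)))

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
R_true = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # hidden rotation
Y = X @ R_true + 0.1 * rng.standard_normal((256, 64))     # rotated + noise

before = mean_cosine(X, Y)                 # near zero: frames are unrelated
R = procrustes_align(X, Y)
after = mean_cosine(X @ R, Y)              # high: the rotation is recovered
```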
8. Conclusion
The geometric structure of learned representations is not an artifact of specific architectures or training objectives. It is a convergent property of gradient-based optimization on sufficiently large data. The pentachoron CV band (0.16–0.23) appears in every system we have measured — 17 pretrained models, 3 fusion systems, and 2 context-extension systems — regardless of modality, task, or architecture.
This structure enables a practical blueprint for cross-model systems: freeze an expert, measure Procrustes compatibility, initialize projectors from the alignment, apply InfoNCE as the contrastive force, and regularize the bottleneck geometry with pentachoron CV. The blueprint has now produced three systems with near-perfect alignment metrics, each trained in under 2 hours on a single GPU.
The deeper finding is about what these systems reveal: the barrier to cross-model alignment is not data, compute, or architecture. The representations are already geometrically compatible. The barrier is knowing where to apply the force (InfoNCE, not Procrustes), where to apply the constraint (bottleneck anchors, not final output), and what to measure (pentachoron CV as the structural invariant).
Reproducibility
| System | Repository | Key files |
|---|---|---|
| Geometric terrain analysis (17 models) | AbstractPhil/procrustes-analysis | Profiling scripts, cached results |
| GEOLIP-Bertenstein (4-modal fusion) | AbstractPhil/geolip-bertenstein | Architecture, training, evaluation |
| GEOLIP-BERT-8192 (context extension) | AbstractPhil/geolip-bert-8192 | Architecture, weights, training scripts |
| GEOLIP-CLIP-ctx576 (CLIP extension) | AbstractPhil/geolip-clip-vit-large-patch14-ctx576 | AutoModel, weights, metrics.json, TensorBoard |
GEOLIP-CLIP-ctx576 includes full training metrics (metrics.json with per-step and per-epoch data), TensorBoard event files, and is loadable via AutoModel.from_pretrained with trust_remote_code=True.
All experiments were conducted on a single NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM).