| --- |
| license: apache-2.0 |
| tags: |
| - geometric-deep-learning |
| - memory-augmented |
| - bert |
| - modernbert |
| - longformer |
| - procrustes |
| - pentachoron |
| - distillation |
| - long-context |
| language: en |
| pipeline_tag: feature-extraction |
| --- |
| |
# NOTE
This version is not fully trained. It was trained on only a fraction of WikiText-103, so it will be highly unreliable overall,
but the proof of concept is present and a fully trained version will follow.
|
|
On that partial training data, it reaches 92% agreement with ModernBERT and 71% with Longformer, evaluated on a slice of the WikiText-103 validation set (roughly 3,000 articles).
Once fully saturated with the necessary information, this geometric memory should produce deeply complex encodings across many spectra.
For now it is only an approximation: an EFFECTIVE approximation, but still an approximation of what is to come.
|
|
# GEOLIP-BERT-8192: Teacher-Distilled Geometric Memory
|
|
| **Extends BERT-large (512 ctx) to 8192 effective context via geometric recurrent memory, |
| trained by distillation from frozen long-context teachers.** |
|
|
| ## Architecture |
|
|
| ``` |
| Document (up to 8192 tokens) |
| β |
| βββ ModernBERT-large βββ 8192 ctx, full bidirectional βββ teacher |
| β (frozen, 395M) |
| β |
| βββ Longformer-large βββ 4096 ctx, sliding+global ββββ teacher |
| β (frozen, 435M) |
| β |
| βββ BERT-large ββββββββ 480 ctx Γ 16 segments ββββββββ student |
| (frozen, 334M) |
| + Geometric Memory System (49345544 trainable) |
| ``` |
|
|
| ### Memory System Components |
|
|
| | Component | Params | Function | |
| |---|---|---| |
| **Depth Compressor** | ~19M | 8-layer CLS profile (8192-dim) → 1024-dim anchor |
| | **Geometric Bank** | ~17M | 128-anchor store + 2-layer cross-attention | |
| | **GRU Gate** | ~6M | Controls memory token updates between segments | |
| | **Layer Fusion** | ~1M | Learned weighted sum of 8 BERT layers | |
| **Teacher Projectors** | ~2M | Procrustes-initialized Linear(1024→1024) × 2 |
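For scale, a single `nn.GRUCell(1024, 1024)` holds roughly 6.3M parameters, consistent with the ~6M GRU Gate row above. The sketch below is a hypothetical illustration of how such a gate could blend each memory token with the current segment's summary; the class and argument names are mine, not the repo's API.

```python
import torch
import torch.nn as nn

class GRUMemoryGate(nn.Module):
    """Hypothetical sketch: a GRUCell decides how much of each memory token
    to overwrite with the current segment's summary vector."""

    def __init__(self, dim: int = 1024, num_tokens: int = 16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)  # ~6.3M params at dim=1024
        self.num_tokens = num_tokens

    def forward(self, memory: torch.Tensor, segment_summary: torch.Tensor) -> torch.Tensor:
        # memory: (B, num_tokens, dim); segment_summary: (B, dim)
        b, n, d = memory.shape
        # Broadcast the segment summary to every memory token, then let the
        # GRU's update/reset gates control per-token retention.
        inp = segment_summary.unsqueeze(1).expand(b, n, d).reshape(b * n, d)
        updated = self.cell(inp, memory.reshape(b * n, d))
        return updated.view(b, n, d)
```

The GRU's built-in update gate plays the role of a learned interpolation between the old memory state and the new segment evidence.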
|
|
| ### Multi-Layer Extraction |
|
|
Extracts from BERT layers **(2, 5, 8, 11, 14, 17, 20, 23)**,
spanning syntax → mid-semantic → deep-semantic → task output.
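The extraction-plus-fusion step (the ~1M-parameter Layer Fusion row above) can be sketched as a softmax-weighted sum over the selected hidden states. Everything here is an illustrative assumption, including the convention that `hidden_states[0]` is the embedding output so layer *i* sits at tuple index *i + 1*:

```python
import torch
import torch.nn as nn

EXTRACT_LAYERS = (2, 5, 8, 11, 14, 17, 20, 23)  # layers named in the section above

class LayerFusion(nn.Module):
    """Sketch of a learned weighted sum over 8 selected BERT-large layers."""

    def __init__(self, num_layers: int = len(EXTRACT_LAYERS)):
        super().__init__()
        # Zero-init logits -> uniform softmax weights at the start of training.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: tuple of 25 tensors (embeddings + 24 layers), each (B, T, 1024)
        stacked = torch.stack([hidden_states[i + 1] for i in EXTRACT_LAYERS])  # (8, B, T, 1024)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (stacked * w).sum(dim=0)  # (B, T, 1024)
```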
|
|
| ### Procrustes Pre-Alignment |
|
|
| Static orthogonal Procrustes computed once before training: |
- BERT → ModernBERT: cos 0.003 → **0.489**
- BERT → Longformer: cos -0.001 → **0.521**
|
|
| Used to initialize teacher projectors. Fine-tuned during training. |
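The static rotation can be reproduced with the classical SVD solution to the orthogonal Procrustes problem. A minimal sketch, assuming paired (N, 1024) CLS matrices from the same documents; `orthogonal_procrustes` is my name for the helper, not the repo's:

```python
import torch

def orthogonal_procrustes(student_cls: torch.Tensor, teacher_cls: torch.Tensor) -> torch.Tensor:
    """Solve min_R ||student @ R - teacher||_F over orthogonal R
    via the SVD of the cross-covariance matrix."""
    m = student_cls.T @ teacher_cls          # (D, D) cross-covariance
    u, _, vt = torch.linalg.svd(m)
    return u @ vt                            # orthogonal rotation R

# Hypothetical usage: seed a trainable projector with the static rotation.
student = torch.randn(3000, 1024)
teacher = torch.randn(3000, 1024)
rotation = orthogonal_procrustes(student, teacher)
proj = torch.nn.Linear(1024, 1024, bias=False)
with torch.no_grad():
    # nn.Linear computes x @ W.T, so store R.T to apply x @ R.
    proj.weight.copy_(rotation.T)
```

Initializing the projector this way gives the distillation loss a non-degenerate starting alignment, which is then fine-tuned end to end.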
|
|
| ## Training |
|
|
| - **Data:** WikiText-103 |
- **Loss:** InfoNCE distillation (student vs. teacher CLS) + pentachoron volume coefficient-of-variation regularizer
| - **Teachers:** ModernBERT-large (8192 ctx), Longformer-large (4096 ctx) |
| - **Student:** BERT-large (frozen) + geometric memory (trainable) |
| - **Hardware:** NVIDIA RTX PRO 6000 Blackwell (102 GB) |
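The InfoNCE distillation term can be sketched as a symmetric contrastive loss with in-batch negatives: each student CLS should match its own teacher CLS against the other documents in the batch. The temperature (0.07) and function name are assumptions, not values from the repo.

```python
import torch
import torch.nn.functional as F

def infonce_distill(student_cls: torch.Tensor,
                    teacher_cls: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between paired student/teacher CLS embeddings."""
    s = F.normalize(student_cls, dim=-1)
    t = F.normalize(teacher_cls, dim=-1)
    logits = s @ t.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(s.size(0), device=s.device)   # diagonal = positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```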
|
|
| ## Geometric Regularization |
|
|
Bank anchors are regularized via **Cayley-Menger pentachoron volumes**
(coefficient of variation → target 0.20). This maintains uniform geometric
structure in the anchor space, preventing collapse.
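The volume of a pentachoron (4-simplex on five anchors) follows from the Cayley-Menger determinant, and the regularizer can then push the coefficient of variation of sampled volumes toward 0.20. This is a sketch under stated assumptions: the simplex-sampling scheme and helper names are mine, not the repo's.

```python
import torch

def pentachoron_volume(points: torch.Tensor) -> torch.Tensor:
    """4-simplex volume of 5 points (5, D) via the Cayley-Menger determinant."""
    d2 = torch.cdist(points, points) ** 2       # (5, 5) squared pairwise distances
    cm = torch.ones(6, 6, dtype=points.dtype)   # bordered Cayley-Menger matrix
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    det = torch.linalg.det(cm)
    # V^2 = (-1)^(n+1) / (2^n (n!)^2) * det, with n = 4  ->  V^2 = -det / 9216
    return torch.sqrt(torch.clamp(-det / 9216.0, min=0.0))

def volume_cv(anchors: torch.Tensor, simplices: torch.Tensor) -> torch.Tensor:
    """Coefficient of variation over sampled pentachora (regularized toward 0.20)."""
    vols = torch.stack([pentachoron_volume(anchors[idx]) for idx in simplices])
    return vols.std() / (vols.mean() + 1e-8)
```

Equal volumes give CV 0; driving CV toward a small positive target keeps the anchors spread without forcing a perfectly rigid lattice.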
|
|
| ## Connection to GEOLIP-Bertenstein |
|
|
| This model applies the [Bertenstein](https://huggingface.co/AbstractPhil/geolip-bertenstein) |
| pattern to context length: |
| - **Bertenstein**: frozen modal experts teach a shared geometric space (cross-modal) |
| - **GEOLIP-BERT-8192**: frozen long-context experts teach a memory system (cross-context) |
|
|
| Both exploit the same insight: a frozen expert provides a **stable reference frame** |
| that prevents geometric collapse during self-supervised training. |
|
|
| ## Related Repos |
|
|
- [AbstractPhil/geolip-bertenstein](https://huggingface.co/AbstractPhil/geolip-bertenstein) – Multi-modal geometric fusion
- [AbstractPhil/procrustes-analysis](https://huggingface.co/AbstractPhil/procrustes-analysis) – 17-model Procrustes profiling
- [AbstractPhil/geolip-bertenstein-cache](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache) – Expert embedding caches
|
|
| ## Usage |
|
|
| ```python |
| from transformers import BertModel, BertTokenizer |
| from deep_bert_v3 import DeepBertV3, DeepBertV3Config |
| from safetensors.torch import load_file |
| |
| config = DeepBertV3Config() |
| model = DeepBertV3(config) |
| |
| # Load trained memory system weights |
| state = load_file("checkpoints/best/memory_system.safetensors") |
| model.load_state_dict(state, strict=False) |
| |
| tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased") |
| state = model.init_state(batch_size=1, device="cuda") |
| |
| # Process document segment by segment |
| for segment_text in document_segments: |
| tokens = tokenizer(segment_text, return_tensors="pt", max_length=480, |
| padding="max_length", truncation=True).to("cuda") |
| outputs, state = model(tokens["input_ids"], tokens["attention_mask"], state) |
| |
| # Final output encodes full document context |
| document_embedding = outputs["memory_output"] # (1, 1024) |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|