| --- |
| license: apache-2.0 |
| tags: |
| - geometric-deep-learning |
| - memory-augmented |
| - bert |
| - modernbert |
| - longformer |
| - procrustes |
| - pentachoron |
| - distillation |
| - long-context |
| language: en |
| pipeline_tag: feature-extraction |
| --- |
| |
# NOTE
This version is not fully trained. It was trained on only a fraction of WikiText-103, so it will be highly unreliable overall,
but the proof of concept is present and a fully trained version will follow.
|
|
On that partial training data, it reaches 92% agreement with ModernBERT and 71% with Longformer, evaluated on a slice of the WikiText-103 validation set (roughly 3,000 articles).
Once fully saturated with the necessary information, this geometric memory should produce deeply complex encodings across many spectra.
For now it is only an approximation: an EFFECTIVE approximation, but still an approximation of what is to come.
|
|
# GEOLIP-BERT-8192: Teacher-Distilled Geometric Memory
|
|
| **Extends BERT-large (512 ctx) to 8192 effective context via geometric recurrent memory, |
| trained by distillation from frozen long-context teachers.** |
|
|
| ## Architecture |
|
|
| ``` |
| Document (up to 8192 tokens) |
| β |
| βββ ModernBERT-large βββ 8192 ctx, full bidirectional βββ teacher |
| β (frozen, 395M) |
| β |
| βββ Longformer-large βββ 4096 ctx, sliding+global ββββ teacher |
| β (frozen, 435M) |
| β |
| βββ BERT-large ββββββββ 480 ctx Γ 16 segments ββββββββ student |
| (frozen, 334M) |
| + Geometric Memory System (49345544 trainable) |
| ``` |
|
|
| ### Memory System Components |
|
|
| | Component | Params | Function | |
| |---|---|---| |
| **Depth Compressor** | ~19M | 8-layer CLS profile (8192-dim) → 1024-dim anchor |
| | **Geometric Bank** | ~17M | 128-anchor store + 2-layer cross-attention | |
| | **GRU Gate** | ~6M | Controls memory token updates between segments | |
| | **Layer Fusion** | ~1M | Learned weighted sum of 8 BERT layers | |
| **Teacher Projectors** | ~2M | Procrustes-initialized Linear(1024→1024) × 2 |
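For scale, a single `nn.GRUCell(1024, 1024)` holds roughly 6.3M parameters, consistent with the ~6M GRU Gate row above. The sketch below is a hypothetical illustration of how such a gate could blend each memory token with the current segment's summary; the class and argument names are mine, not the repo's API.

```python
import torch
import torch.nn as nn

class GRUMemoryGate(nn.Module):
    """Hypothetical sketch: a GRUCell decides how much of each memory token
    to overwrite with the current segment's summary vector."""

    def __init__(self, dim: int = 1024, num_tokens: int = 16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)  # ~6.3M params at dim=1024
        self.num_tokens = num_tokens

    def forward(self, memory: torch.Tensor, segment_summary: torch.Tensor) -> torch.Tensor:
        # memory: (B, num_tokens, dim); segment_summary: (B, dim)
        b, n, d = memory.shape
        # Broadcast the segment summary to every memory token, then let the
        # GRU's update/reset gates control per-token retention.
        inp = segment_summary.unsqueeze(1).expand(b, n, d).reshape(b * n, d)
        updated = self.cell(inp, memory.reshape(b * n, d))
        return updated.view(b, n, d)
```

The GRU's built-in update gate plays the role of a learned interpolation between the old memory state and the new segment evidence.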
|
|
| ### Multi-Layer Extraction |
|
|
Extracts from BERT layers **(2, 5, 8, 11, 14, 17, 20, 23)**,
spanning syntax → mid-semantic → deep-semantic → task output.
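The extraction-plus-fusion step (the ~1M-parameter Layer Fusion row above) can be sketched as a softmax-weighted sum over the selected hidden states. Everything here is an illustrative assumption, including the convention that `hidden_states[0]` is the embedding output so layer *i* sits at tuple index *i + 1*:

```python
import torch
import torch.nn as nn

EXTRACT_LAYERS = (2, 5, 8, 11, 14, 17, 20, 23)  # layers named in the section above

class LayerFusion(nn.Module):
    """Sketch of a learned weighted sum over 8 selected BERT-large layers."""

    def __init__(self, num_layers: int = len(EXTRACT_LAYERS)):
        super().__init__()
        # Zero-init logits -> uniform softmax weights at the start of training.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: tuple of 25 tensors (embeddings + 24 layers), each (B, T, 1024)
        stacked = torch.stack([hidden_states[i + 1] for i in EXTRACT_LAYERS])  # (8, B, T, 1024)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (stacked * w).sum(dim=0)  # (B, T, 1024)
```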
|
|
| ### Procrustes Pre-Alignment |
|
|
| Static orthogonal Procrustes computed once before training: |
- BERT → ModernBERT: cos 0.003 → **0.489**
- BERT → Longformer: cos -0.001 → **0.521**
|
|
| Used to initialize teacher projectors. Fine-tuned during training. |
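The static rotation can be reproduced with the classical SVD solution to the orthogonal Procrustes problem. A minimal sketch, assuming paired (N, 1024) CLS matrices from the same documents; `orthogonal_procrustes` is my name for the helper, not the repo's:

```python
import torch

def orthogonal_procrustes(student_cls: torch.Tensor, teacher_cls: torch.Tensor) -> torch.Tensor:
    """Solve min_R ||student @ R - teacher||_F over orthogonal R
    via the SVD of the cross-covariance matrix."""
    m = student_cls.T @ teacher_cls          # (D, D) cross-covariance
    u, _, vt = torch.linalg.svd(m)
    return u @ vt                            # orthogonal rotation R

# Hypothetical usage: seed a trainable projector with the static rotation.
student = torch.randn(3000, 1024)
teacher = torch.randn(3000, 1024)
rotation = orthogonal_procrustes(student, teacher)
proj = torch.nn.Linear(1024, 1024, bias=False)
with torch.no_grad():
    # nn.Linear computes x @ W.T, so store R.T to apply x @ R.
    proj.weight.copy_(rotation.T)
```

Initializing the projector this way gives the distillation loss a non-degenerate starting alignment, which is then fine-tuned end to end.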
|
|
| ## Training |
|
|
| - **Data:** WikiText-103 |
- **Loss:** InfoNCE distillation (student vs. teacher CLS) + pentachoron volume coefficient-of-variation regularizer
| - **Teachers:** ModernBERT-large (8192 ctx), Longformer-large (4096 ctx) |
| - **Student:** BERT-large (frozen) + geometric memory (trainable) |
| - **Hardware:** NVIDIA RTX PRO 6000 Blackwell (102 GB) |
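The InfoNCE distillation term can be sketched as a symmetric contrastive loss with in-batch negatives: each student CLS should match its own teacher CLS against the other documents in the batch. The temperature (0.07) and function name are assumptions, not values from the repo.

```python
import torch
import torch.nn.functional as F

def infonce_distill(student_cls: torch.Tensor,
                    teacher_cls: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between paired student/teacher CLS embeddings."""
    s = F.normalize(student_cls, dim=-1)
    t = F.normalize(teacher_cls, dim=-1)
    logits = s @ t.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(s.size(0), device=s.device)   # diagonal = positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```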
|
|
| ## Geometric Regularization |
|
|
Bank anchors are regularized via **Cayley-Menger pentachoron volumes**
(coefficient of variation → target 0.20). This maintains uniform geometric
structure in the anchor space, preventing collapse.
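The volume of a pentachoron (4-simplex on five anchors) follows from the Cayley-Menger determinant, and the regularizer can then push the coefficient of variation of sampled volumes toward 0.20. This is a sketch under stated assumptions: the simplex-sampling scheme and helper names are mine, not the repo's.

```python
import torch

def pentachoron_volume(points: torch.Tensor) -> torch.Tensor:
    """4-simplex volume of 5 points (5, D) via the Cayley-Menger determinant."""
    d2 = torch.cdist(points, points) ** 2       # (5, 5) squared pairwise distances
    cm = torch.ones(6, 6, dtype=points.dtype)   # bordered Cayley-Menger matrix
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    det = torch.linalg.det(cm)
    # V^2 = (-1)^(n+1) / (2^n (n!)^2) * det, with n = 4  ->  V^2 = -det / 9216
    return torch.sqrt(torch.clamp(-det / 9216.0, min=0.0))

def volume_cv(anchors: torch.Tensor, simplices: torch.Tensor) -> torch.Tensor:
    """Coefficient of variation over sampled pentachora (regularized toward 0.20)."""
    vols = torch.stack([pentachoron_volume(anchors[idx]) for idx in simplices])
    return vols.std() / (vols.mean() + 1e-8)
```

Equal volumes give CV 0; driving CV toward a small positive target keeps the anchors spread without forcing a perfectly rigid lattice.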
|
|
| ## Connection to GEOLIP-Bertenstein |
|
|
| This model applies the [Bertenstein](https://huggingface.co/AbstractPhil/geolip-bertenstein) |
| pattern to context length: |
| - **Bertenstein**: frozen modal experts teach a shared geometric space (cross-modal) |
| - **GEOLIP-BERT-8192**: frozen long-context experts teach a memory system (cross-context) |
|
|
| Both exploit the same insight: a frozen expert provides a **stable reference frame** |
| that prevents geometric collapse during self-supervised training. |
|
|
| ## Related Repos |
|
|
- [AbstractPhil/geolip-bertenstein](https://huggingface.co/AbstractPhil/geolip-bertenstein) – Multi-modal geometric fusion
- [AbstractPhil/procrustes-analysis](https://huggingface.co/AbstractPhil/procrustes-analysis) – 17-model Procrustes profiling
- [AbstractPhil/geolip-bertenstein-cache](https://huggingface.co/datasets/AbstractPhil/geolip-bertenstein-cache) – Expert embedding caches
|
|
| ## Usage |
|
|
| ```python |
| from transformers import BertModel, BertTokenizer |
| from deep_bert_v3 import DeepBertV3, DeepBertV3Config |
| from safetensors.torch import load_file |
| |
| config = DeepBertV3Config() |
| model = DeepBertV3(config) |
| |
| # Load trained memory system weights |
| state = load_file("checkpoints/best/memory_system.safetensors") |
| model.load_state_dict(state, strict=False) |
| |
| tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased") |
| state = model.init_state(batch_size=1, device="cuda") |
| |
| # Process document segment by segment |
| for segment_text in document_segments: |
| tokens = tokenizer(segment_text, return_tensors="pt", max_length=480, |
| padding="max_length", truncation=True).to("cuda") |
| outputs, state = model(tokens["input_ids"], tokens["attention_mask"], state) |
| |
| # Final output encodes full document context |
| document_embedding = outputs["memory_output"] # (1, 1024) |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|