# K-Simplex Language Model Prototype

A geometric autoregressive language model using Cayley-Menger-validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically constrained structures that maintain mathematical validity throughout training.

## Overview

This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.

**Key Results:**
- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (due to the 50k BPE vocabulary)

---
## Architecture

### Conceptual Foundation

Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures**:

| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |

Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
### Token Flow

```
Token ID
    ↓
Embedding Layer (vocab_size × embed_dim)
    ↓
Positional Encoding
    ↓
┌───────────────────────────────────────────┐
│ TokenToKChannels                          │
│ Projects to [B, T, K, feat_dim]           │
│ Each position gets K simplex channels     │
└───────────────────────────────────────────┘
    ↓
┌───────────────────────────────────────────┐
│ GeoBlock × num_blocks                     │
│   ┌───────────────────────────────────┐   │
│   │ KChannelCrossAttention            │   │
│   │ K-levels attend to each other     │   │
│   │ (within each token position)      │   │
│   └───────────────────────────────────┘   │
│   ┌───────────────────────────────────┐   │
│   │ CausalSequenceAttention           │   │
│   │ Tokens attend causally            │   │
│   │ (across sequence, masked)         │   │
│   └───────────────────────────────────┘   │
│   ┌───────────────────────────────────┐   │
│   │ MLP                               │   │
│   └───────────────────────────────────┘   │
└───────────────────────────────────────────┘
    ↓
LM Head → Logits [B, T, vocab_size]
```
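The tensor shapes in this flow can be sketched as follows. This is a shape-only sketch: the two linear layers stand in for `TokenToKChannels` and the LM head, and the GeoBlock stack is elided, so none of this is the repository's actual implementation.

```python
import torch
import torch.nn as nn

# Shape-only sketch of the token flow above (hypothetical stand-in modules).
B, T, K, feat_dim, embed_dim, vocab = 2, 8, 4, 96, 384, 50257

tokens = torch.randint(0, vocab, (B, T))
x = nn.Embedding(vocab, embed_dim)(tokens)   # [B, T, embed_dim]
to_k = nn.Linear(embed_dim, K * feat_dim)    # stand-in for TokenToKChannels
h = to_k(x).view(B, T, K, feat_dim)          # [B, T, K, feat_dim]
# ... GeoBlock × num_blocks would transform h here ...
lm_head = nn.Linear(K * feat_dim, vocab)
logits = lm_head(h.flatten(2))               # [B, T, vocab_size]
```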
---

## Geometric Formulas

### Cayley-Menger Determinant

For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:

$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$

where the Cayley-Menger matrix is:

$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$

**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
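A minimal PyTorch sketch of this computation (a standalone helper for illustration, not the repository's actual implementation):

```python
import math
import torch

def cm_volume_squared(vertices: torch.Tensor) -> torch.Tensor:
    """Squared volume of a k-simplex via the Cayley-Menger determinant.

    vertices: [k+1, dim] tensor of vertex coordinates (dim >= k).
    Returns a scalar tensor; positive for a non-degenerate simplex.
    """
    k = vertices.shape[0] - 1
    # Pairwise squared distances d_ij^2
    d2 = torch.cdist(vertices, vertices).pow(2)
    # Border d2 with a row/column of ones and a 0 in the corner
    cm = torch.ones(k + 2, k + 2, dtype=vertices.dtype)
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coeff = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    return coeff * torch.det(cm)

# Sanity check: a unit right triangle (k=2) has area 1/2, so Vol² = 1/4.
tri = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vol2 = cm_volume_squared(tri)  # ≈ 0.25
```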
### Template Deformation

Each k-simplex starts from a regular (equilateral) template and learns deformations:

$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$

where:
- $v_i^{(\text{template})}$ = vertices of the regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from the neural network
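One way to realize the template and the deformation step. The standard-basis construction below is an assumption for illustration; the repository may build its regular simplices differently.

```python
import torch

BASE_DEFORM = 0.05  # deformation scale α from the formula above

def regular_simplex(k: int, dim: int) -> torch.Tensor:
    """Vertices of a regular k-simplex embedded in `dim` coordinates.

    Standard-basis construction: e_0..e_k in R^{k+1}, centered at the
    origin, then zero-padded to `dim` (requires dim >= k+1). All pairwise
    distances are equal, so the template is equilateral.
    """
    verts = torch.eye(k + 1)                 # [k+1, k+1]
    verts = verts - verts.mean(dim=0)        # center at the origin
    pad = torch.zeros(k + 1, dim - (k + 1))
    return torch.cat([verts, pad], dim=-1)   # [k+1, dim]

def deform(template: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """v_deformed = v_template + α · Δv (Δv predicted by the network)."""
    return template + BASE_DEFORM * delta
```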
### Geometric Gating

Features are gated by geometric validity:

$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$

where:
- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
- The sigmoid on Vol² acts as a soft validity mask
- Invalid simplices (Vol² < 0) have their features suppressed
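A sketch of the gating step; the shape of the gate projection `W` and the tensor layouts are assumptions, not the repository's actual signatures.

```python
import torch

def geometric_gate(features: torch.Tensor, d2: torch.Tensor,
                   vol2: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """output = features ⊙ gate(geo) ⊙ σ(Vol²·1e6).

    features: [..., feat_dim]; d2: [..., n_pairs]; vol2: [..., 1];
    W: [n_pairs + 1, feat_dim] (hypothetical gate projection).
    """
    geo = torch.cat([d2, vol2], dim=-1)      # [d² ‖ Vol²]
    gate = torch.sigmoid(geo @ W)            # gate(geo) = σ(W·[d² ‖ Vol²])
    validity = torch.sigmoid(vol2 * 1e6)     # ≈1 if Vol² > 0, ≈0 if Vol² < 0
    return features * gate * validity
```

The huge 1e6 multiplier makes the validity sigmoid an almost-hard step while keeping the whole gate differentiable.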
### Loss Function

$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
$$

where:
- $\mathcal{L}_{CE}$ = cross-entropy for next-token prediction
- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
- $\lambda = 0.1$ (validity weight)
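The combined objective can be sketched directly from these definitions (function name and argument layout are illustrative):

```python
import torch
import torch.nn.functional as F

VALIDITY_WEIGHT = 0.1  # λ

def total_loss(logits: torch.Tensor, targets: torch.Tensor,
               vol2: torch.Tensor) -> torch.Tensor:
    """L = L_CE + λ · mean(ReLU(-Vol²)).

    logits: [B, T, vocab]; targets: [B, T]; vol2: Vol² values of any shape.
    The hinge ReLU(-Vol²) is zero for valid simplices (Vol² > 0), so the
    penalty only activates when a simplex collapses.
    """
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    validity = F.relu(-vol2).mean()
    return ce + VALIDITY_WEIGHT * validity
```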
---

## Safe Deformation Analysis

Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones.

### Stability Zones by K-Depth

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|---------------------|-------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |
### Key Findings

1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. The geometry can safely handle 10-40× more deformation.
2. **Embedding Dimension as Stability Buffer**:
   ```
   edim / k_max = stability_ratio
   ratio ≥ 8× → Very stable, deform up to 2.0
   ratio ≥ 4× → Comfortable margin
   ratio ≥ 2× → Tight but functional
   ```
3. **Vol² Behavior Under Deformation**:
   - Low deform (0-0.15): Clear k-level hierarchy; Vol² decreases exponentially with k
   - Medium deform (0.15-0.35): **Optimal zone** with distinct geometric signatures per k
   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost
4. **Vol² Scaling**:
   ```
   k=1: Vol² ~ 1e+0 (edge length squared)
   k=2: Vol² ~ 1e-1 (triangle area squared)
   k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
   k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
   ```
   Exponential decay is expected and healthy.
### Recommended Production Settings

```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4   # k=1,2,3,4

# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4

# Experimental
BASE_DEFORM = learnable_per_k  # let the network find the optimal scale
edim = 2 * depth               # minimum viable
```
---

## Training Configuration

### Model Hyperparameters

```python
config = {
    "vocab_size": 50257,   # GPT-2 BPE tokenizer
    "max_seq_len": 256,
    "embed_dim": 384,
    "depth": 4,            # k=1,2,3,4
    "edim": 16,            # Vertex coordinate dimension
    "feat_dim": 96,        # Features per vertex
    "hidden": 384,
    "num_heads": 8,
    "num_blocks": 8,
    "dropout": 0.1,
}
```
### Training Hyperparameters

```python
training = {
    "batch_size": 48,
    "seq_len": 256,
    "lr": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 50,
    "grad_clip": 1.0,
    "ce_weight": 1.0,
    "validity_weight": 0.1,
    "scheduler": "CosineAnnealingLR",
    "stride": 128,  # 50%-overlapping windows (stride < seq_len)
}
```
---

## Results

### Training Progression

| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |

### Geometric Health

Throughout training:
- **Validity**: 100% at all k-levels
- **Vol² k=1**: ~0.92 (stable)
- **Vol² k=2**: ~0.16 (stable)
- **Vol² k=3**: ~0.03 (stable)
- **Vol² k=4**: ~0.001 (stable)
### Generation Quality

**Epoch 1:**
```
ROMEO: , If, and a head I am IAB, What,
```

**Epoch 15+:**
```
ROMEO: if thou swear'st the Duke of love of it.
MERCUTIO: Why, is it good.
ROMEO: And for the jest love that.
```

The model learns:
- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges
---

## Geometric Dimensions Output

Each k-level contributes to the final representation:

| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 d² + 1 vol² | Edge metric |
| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
| **Total** | **24** | | Pure geometry |

With feat_dim=96: output = 96 + 24 = 120 dims per k-level, × 4 k-levels = 480 total geometric dims per token.
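The bookkeeping in this table can be verified directly: a k-simplex has k(k+1)/2 pairwise distances plus one Vol² term.

```python
# Geo dims per k-level: k(k+1)/2 squared distances + 1 Vol² term.
geo_dims = {k: k * (k + 1) // 2 + 1 for k in range(1, 5)}
# geo_dims == {1: 2, 2: 4, 3: 7, 4: 11}, matching the table above.
total_geo = sum(geo_dims.values())   # 24
feat_dim = 96
per_k = feat_dim + total_geo         # 120 dims per k-level
total = per_k * 4                    # 480 geometric dims per token
```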
---

## File Structure

```
AbstractPhil/ksimplex-llm-prototype/
├── README.md                    # This file
├── trainer.py                   # Training script
├── inference.py                 # Generation script
├── config.json                  # Model configuration
├── checkpoints/
│   ├── checkpoint_epoch_001.pt
│   ├── checkpoint_epoch_008.pt  # Best val PPL
│   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json     # Generated text samples
```
---

## Usage

### Inference

```python
from inference import load_model, generate

model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")
text = generate(
    model,
    tokenizer,
    prompt="ROMEO: ",
    max_tokens=100,
    temperature=0.8,
    top_k=50,
)
print(text)
```

### Training

```bash
python trainer.py \
    --data shakespeare.txt \
    --epochs 50 \
    --batch_size 48 \
    --lr 3e-4
```
---

## Future Directions

### Planned Experiments

1. **Learnable Deformation Scale**: per-k learnable α parameter
2. **Volume Consistency Loss**: maintain k-level differentiation
   ```python
   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
   ```
3. **K-Depth Ablation**: test k=1,2,3 only (remove the k=4 noise floor)
4. **Vol² Normalization**: scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText

### Theoretical Questions

- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?
---

## Citation

```bibtex
@misc{ksimplex-llm-2026,
  author    = {AbstractPhil},
  title     = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```

---

## License

MIT License - free to use, modify, and distribute.

---

## Acknowledgments

Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.

*"The geometry is the representation."*