# K-Simplex Language Model Prototype
A geometric autoregressive language model using Cayley-Menger validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically-constrained structures that maintain mathematical validity throughout training.
## Overview
This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.
**Key Results:**
- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (due to 50k BPE vocabulary)
---
## Architecture
### Conceptual Foundation
Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures** where:
| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |
Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
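The vertex and distance-pair counts in the table follow directly from combinatorics: a k-simplex has k+1 vertices, and each vertex pair contributes one distance, giving C(k+1, 2) pairs. A quick check in plain Python (nothing model-specific assumed):

```python
from math import comb

# A k-simplex has k+1 vertices; every unordered vertex pair is one distance.
for k, name in [(1, "edge"), (2, "triangle"), (3, "tetrahedron"), (4, "5-cell")]:
    vertices = k + 1
    pairs = comb(vertices, 2)
    print(f"k={k} ({name}): {vertices} vertices, {pairs} distance pairs")
```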
### Token Flow
```
Token ID
    ↓
Embedding Layer (vocab_size × embed_dim)
    ↓
Positional Encoding
    ↓
┌────────────────────────────────────────┐
│ TokenToKChannels                       │
│ Projects to [B, T, K, feat_dim]        │
│ Each position gets K simplex channels  │
└────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────┐
│ GeoBlock × num_blocks                  │
│ ┌────────────────────────────────────┐ │
│ │ KChannelCrossAttention             │ │
│ │ K-levels attend to each other      │ │
│ │ (within each token position)       │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ CausalSequenceAttention            │ │
│ │ Tokens attend causally             │ │
│ │ (across sequence, masked)          │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ MLP                                │ │
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
    ↓
LM Head → Logits [B, T, vocab_size]
```
---
## Geometric Formulas
### Cayley-Menger Determinant
For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:
$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$
Where the Cayley-Menger matrix is:
$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$
**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
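A minimal NumPy sketch of this computation, verified against known cases (a unit edge has Vol² = 1, a unit right triangle has area 0.5, so Vol² = 0.25); the function name and array layout are illustrative, not the repository's API:

```python
import numpy as np
from math import factorial

def cayley_menger_vol2(vertices):
    """Squared volume of a k-simplex from its (k+1, d) vertex array."""
    k = vertices.shape[0] - 1
    # Pairwise squared distances d_ij^2.
    diff = vertices[:, None, :] - vertices[None, :, :]
    d2 = (diff ** 2).sum(-1)
    # Bordered Cayley-Menger matrix: 0 in the corner, 1s on the border.
    cm = np.ones((k + 2, k + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    sign = (-1) ** (k + 1)
    return sign * np.linalg.det(cm) / (2 ** k * factorial(k) ** 2)
```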
### Template Deformation
Each k-simplex starts from a regular (equilateral) template and learns deformations:
$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$
Where:
- $v_i^{(\text{template})}$ = vertices of regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from neural network
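One standard way to build a regular k-simplex template is to take the k+1 standard basis vectors of R^(k+1) (all pairwise distances √2) and zero-pad to `edim` coordinates. A sketch under that assumption (the repository's actual template construction may differ):

```python
import numpy as np

def regular_simplex(k, edim):
    """Regular k-simplex template: standard basis vectors of R^(k+1),
    zero-padded to edim coordinates. All pairwise distances equal sqrt(2)."""
    assert edim >= k + 1
    v = np.zeros((k + 1, edim))
    v[np.arange(k + 1), np.arange(k + 1)] = 1.0
    return v

BASE_DEFORM = 0.05  # the alpha value from the text above

def deform(template, delta, alpha=BASE_DEFORM):
    """Apply learned per-vertex offsets, scaled down to keep geometry valid."""
    return template + alpha * delta
```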
### Geometric Gating
Features are gated by geometric validity:
$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$
Where:
- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
- The sigmoid on Vol² acts as a soft validity mask
- Invalid simplices (Vol² < 0) have their features suppressed
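A minimal NumPy sketch of this gating for a single channel (the shapes and the learned weight matrix `W` are illustrative assumptions, not the repository's signatures):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def geometric_gate(features, d2, vol2, W):
    """features: (feat_dim,); d2: (n_pairs,); vol2: scalar; W: (n_pairs+1, feat_dim)."""
    geo = np.concatenate([d2, [vol2]])
    gate = sigmoid(geo @ W)            # learned gate over [d^2 || Vol^2]
    validity = sigmoid(vol2 * 1e6)     # soft mask: ~1 if Vol^2 > 0, ~0 if Vol^2 < 0
    return features * gate * validity
```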
### Loss Function
$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
$$
Where:
- $\mathcal{L}_{CE}$ = Cross-entropy for next-token prediction
- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
- $\lambda = 0.1$ (validity weight)
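The validity term can be sketched in a few lines of NumPy (the CE term is assumed precomputed by a standard cross-entropy call; function names are illustrative):

```python
import numpy as np

def validity_loss(vol2):
    """mean(ReLU(-Vol^2)): zero whenever every simplex is valid (Vol^2 > 0)."""
    return np.maximum(-vol2, 0.0).mean()

def total_loss(ce_loss, vol2, lam=0.1):
    """Composite objective: cross-entropy plus weighted validity penalty."""
    return ce_loss + lam * validity_loss(vol2)
```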
---
## Safe Deformation Analysis
Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones:
### Stability Zones by K-Depth
| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|---------------------|-------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |
### Key Findings
1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. The geometry can safely handle 10-40× more deformation.
2. **Embedding Dimension as Stability Buffer**:
```
edim / k_max = stability_ratio
ratio ≥ 8× → Very stable, deform up to 2.0
ratio ≥ 4× → Comfortable margin
ratio ≥ 2× → Tight but functional
```
3. **Vol² Behavior Under Deformation**:
- Low deform (0-0.15): Clear k-level hierarchy, Vol² decreases exponentially with k
- Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
- High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost
4. **Vol² Scaling**:
```
k=1: Vol² ~ 1e+0 (edge length squared)
k=2: Vol² ~ 1e-1 (triangle area squared)
k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
```
Exponential decay is expected and healthy.
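The stability-ratio rule above can be captured as a small helper (thresholds taken directly from finding 2; the function itself is illustrative, not part of the repository):

```python
def stability_zone(edim, k_max):
    """Classify an (edim, k_max) configuration by its stability ratio."""
    ratio = edim / k_max
    if ratio >= 8:
        return "very stable (deform up to ~2.0)"
    if ratio >= 4:
        return "comfortable margin"
    if ratio >= 2:
        return "tight but functional"
    return "below tested range"
```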
### Recommended Production Settings
```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4 # k=1,2,3,4
# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4
# Experimental
BASE_DEFORM = learnable_per_k # Allow network to find optimal
edim = 2 * depth # Minimum viable
```
---
## Training Configuration
### Model Hyperparameters
```python
config = {
"vocab_size": 50257, # GPT-2 BPE tokenizer
"max_seq_len": 256,
"embed_dim": 384,
"depth": 4, # k=1,2,3,4
"edim": 16, # Vertex coordinate dimension
"feat_dim": 96, # Features per vertex
"hidden": 384,
"num_heads": 8,
"num_blocks": 8,
"dropout": 0.1,
}
```
### Training Hyperparameters
```python
training = {
"batch_size": 48,
"seq_len": 256,
"lr": 3e-4,
"weight_decay": 0.1,
"num_epochs": 50,
"grad_clip": 1.0,
"ce_weight": 1.0,
"validity_weight": 0.1,
"scheduler": "CosineAnnealingLR",
    "stride": 128,  # 50% overlap between consecutive 256-token windows
}
```
---
## Results
### Training Progression
| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |
### Geometric Health
Throughout training:
- **Validity**: 100% at all k-levels
- **Vol² k=1**: ~0.92 (stable)
- **Vol² k=2**: ~0.16 (stable)
- **Vol² k=3**: ~0.03 (stable)
- **Vol² k=4**: ~0.001 (stable)
### Generation Quality
**Epoch 1:**
```
ROMEO: , If, and a head I am IAB, What,
```
**Epoch 15+:**
```
ROMEO: if thou swear'st the Duke of love of it.
MERCUTIO: Why, is it good.
ROMEO: And for the jest love that.
```
The model learns:
- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges
---
## Geometric Dimensions Output
Each k-level contributes to the final representation:
| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 d² + 1 vol² | Edge metric |
| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
| **Total** | **24** | | Pure geometry |
With feat_dim=96: Output = 96 + 24 = 120 dims per k-level, ×4 k-levels = 480 total geometric dims per token.
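This dimension bookkeeping is pure arithmetic and easy to verify: each k-level contributes C(k+1, 2) squared distances plus one squared volume. A quick check (no model code assumed):

```python
from math import comb

# Per-k geometry: C(k+1, 2) squared distances plus one squared volume.
geo_dims = [comb(k + 1, 2) + 1 for k in (1, 2, 3, 4)]  # [2, 4, 7, 11]
total_geo = sum(geo_dims)                               # 24 pure-geometry dims
per_k_output = 96 + total_geo                           # feat_dim + geometry = 120
per_token = 4 * per_k_output                            # 480 across all k-levels
```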
---
## File Structure
```
AbstractPhil/ksimplex-llm-prototype/
├── README.md                    # This file
├── trainer.py                   # Training script
├── inference.py                 # Generation script
├── config.json                  # Model configuration
├── checkpoints/
│   ├── checkpoint_epoch_001.pt
│   ├── checkpoint_epoch_008.pt  # Best val PPL
│   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json     # Generated text samples
```
---
## Usage
### Inference
```python
from inference import load_model, generate
model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")
text = generate(
model,
tokenizer,
prompt="ROMEO: ",
max_tokens=100,
temperature=0.8,
top_k=50
)
print(text)
```
### Training
```bash
python trainer.py \
--data shakespeare.txt \
--epochs 50 \
--batch_size 48 \
--lr 3e-4
```
---
## Future Directions
### Planned Experiments
1. **Learnable Deformation Scale**: Per-k learnable Ξ± parameter
2. **Volume Consistency Loss**: Maintain k-level differentiation
```python
coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
```
3. **K-Depth Ablation**: Test k=1,2,3 only (remove k=4 noise floor)
4. **Vol² Normalization**: Scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText
### Theoretical Questions
- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?
---
## Citation
```bibtex
@misc{ksimplex-llm-2026,
author = {AbstractPhil},
title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```
---
## License
MIT License - Free to use, modify, and distribute.
---
## Acknowledgments
Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.
*"The geometry is the representation."*