AbstractPhil
/

ksimplex-llm-prototype

AbstractPhil committed 9 days ago

Commit

5d154e8

1 Parent(s): 9e5a420

Create README.md

Files changed (1)

README.md +384 -0

README.md ADDED

	@@ -0,0 +1,384 @@

+# K-Simplex Language Model Prototype
+A geometric autoregressive language model built on k-simplex channels validated by the Cayley-Menger determinant. Instead of carrying a single flat hidden vector per token, the architecture carries geometrically constrained simplex structures that remain mathematically valid throughout training.
+## Overview
+This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.
+**Key Results:**
+- Shakespeare corpus: **Val PPL 113.74** at epoch 8
+- 100% geometric validity maintained throughout training
+- Coherent dialogue generation with proper character attribution
+- 54M parameters (a large share comes from the 50,257-token BPE vocabulary)
+---
+## Architecture
+### Conceptual Foundation
+Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures** where:
+| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
+|---------|-----------|----------|----------------|-------------------|
+| k=1 | Edge | 2 | 1 | 1D linear relationship |
+| k=2 | Triangle | 3 | 3 | 2D planar structure |
+| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
+| k=4 | 5-cell | 5 | 10 | 4D hypervolume |
+Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
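The vertex and distance-pair columns follow directly from k: a k-simplex has k + 1 vertices and one squared distance per unordered pair of vertices. A minimal sketch of that counting (the helper name `simplex_counts` is illustrative and not part of this repository):

```python
from math import comb


def simplex_counts(k: int) -> tuple[int, int]:
    """Return (vertices, distance_pairs) for a k-simplex."""
    vertices = k + 1
    # One pairwise distance per unordered pair of vertices.
    distance_pairs = comb(vertices, 2)
    return vertices, distance_pairs


for k in range(1, 5):
    vertices, pairs = simplex_counts(k)
    print(f"k={k}: {vertices} vertices, {pairs} distance pairs")
```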
+### Token Flow
+```
+Token ID
+    ↓
+Embedding Layer (vocab_size × embed_dim)
+    ↓
+Positional Encoding
+    ↓
+┌─────────────────────────────────────────┐
+│         TokenToKChannels                │
+│  Projects to [B, T, K, feat_dim]        │
+│  Each position gets K simplex channels  │
+└─────────────────────────────────────────┘
+    ↓
+┌─────────────────────────────────────────┐
+│         GeoBlock × num_blocks           │
+│  ┌─────────────────────────────────┐    │
+│  │ KChannelCrossAttention          │    │
+│  │ K-levels attend to each other   │    │
+│  │ (within each token position)    │    │
+│  └─────────────────────────────────┘    │
+│  ┌─────────────────────────────────┐    │
+│  │ CausalSequenceAttention         │    │
+│  │ Tokens attend causally          │    │
+│  │ (across sequence, masked)       │    │
+│  └─────────────────────────────────┘    │
+│  ┌─────────────────────────────────┐    │
+│  │ MLP                             │    │
+│  └─────────────────────────────────┘    │
+└─────────────────────────────────────────┘
+    ↓
+LM Head → Logits [B, T, vocab_size]
+```
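The diagram can be read as a sequence of tensor shapes. The sketch below only traces shapes; it assumes the GeoBlock stack preserves the `[B, T, K, feat_dim]` layout and uses the hyperparameter values from the Training Configuration section as defaults. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(batch, seq_len, vocab_size=50257, embed_dim=384,
                 depth=4, feat_dim=96):
    """List (stage, shape) pairs for one forward pass through the diagram."""
    return [
        ("token ids", (batch, seq_len)),
        ("embedding + positional encoding", (batch, seq_len, embed_dim)),
        ("TokenToKChannels", (batch, seq_len, depth, feat_dim)),
        # Assumption: each GeoBlock returns the same shape it receives.
        ("GeoBlock stack", (batch, seq_len, depth, feat_dim)),
        ("LM head logits", (batch, seq_len, vocab_size)),
    ]


for stage, shape in trace_shapes(batch=48, seq_len=256):
    print(f"{stage:34s} {shape}")
```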
+---
+## Geometric Formulas
+### Cayley-Menger Determinant
+For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:
+$$
+\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
+$$
+Where the Cayley-Menger matrix is:
+$$
+CM = \begin{pmatrix}
+0 & 1 & 1 & \cdots & 1 \\
+1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
+1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
+\vdots & \vdots & \vdots & \ddots & \vdots \\
+1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
+\end{pmatrix}
+$$
+**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
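To make the formula concrete, here is a small pure-Python sketch that builds the Cayley-Menger matrix from vertex coordinates and applies the prefactor above. It is not the repository's implementation (which presumably runs batched on tensors); the function names are illustrative, and the determinant uses plain Gaussian elimination.

```python
from math import factorial


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cayley_menger_vol2(vertices):
    """Squared volume of the k-simplex spanned by k + 1 vertices."""
    n = len(vertices)
    k = n - 1
    cm = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cm[0][i] = cm[i][0] = 1.0  # bordering row and column of ones
        for j in range(1, n + 1):
            if i != j:
                cm[i][j] = squared_distance(vertices[i - 1], vertices[j - 1])
    sign = (-1) ** (k + 1)
    return sign * determinant(cm) / (2 ** k * factorial(k) ** 2)


# Right triangle with legs of length 1: area 1/2, so Vol^2 = 1/4.
print(cayley_menger_vol2([[0, 0], [1, 0], [0, 1]]))
```

Collinear or otherwise flattened vertex sets give Vol² at or near zero, which is the case the validity criterion is meant to catch.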
+### Template Deformation
+Each k-simplex starts from a regular (equilateral) template and learns deformations:
+$$
+v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
+$$
+Where:
+- $v_i^{(\text{template})}$ = vertices of regular k-simplex
+- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
+- $\Delta v_i$ = learned offset from neural network
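A minimal sketch of the deformation step, under stated assumptions: the template is built from scaled standard basis vectors (one common way to get unit edge lengths, not necessarily the construction used here), zero-padded to `edim` coordinates, and random offsets stand in for the network's learned $\Delta v_i$. All function names are hypothetical.

```python
import random
from math import sqrt


def regular_simplex_template(k, edim):
    """k + 1 vertices with unit pairwise distance, padded to edim coordinates."""
    if edim < k + 1:
        raise ValueError("this construction needs edim >= k + 1")
    scale = 1.0 / sqrt(2.0)  # standard basis vectors are sqrt(2) apart
    return [[scale if c == i else 0.0 for c in range(edim)]
            for i in range(k + 1)]


def deform(template, offsets, alpha=0.05):
    """Apply v_deformed = v_template + alpha * delta_v to every vertex."""
    return [[v + alpha * d for v, d in zip(vertex, delta)]
            for vertex, delta in zip(template, offsets)]


rng = random.Random(0)
template = regular_simplex_template(k=4, edim=16)
# Stand-in for the learned offsets: one edim-sized vector per vertex.
offsets = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in template]
deformed = deform(template, offsets, alpha=0.05)
print(len(deformed), "vertices,", len(deformed[0]), "coordinates each")
```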
+### Geometric Gating
+Features are gated by geometric validity:
+$$
+\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
+$$
+Where:
+- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
+- The sigmoid on Vol² acts as a soft validity mask
+- Invalid simplices (Vol² < 0) have their features suppressed
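The soft validity mask can be illustrated without any tensor library. In this sketch the learned gate is passed in as precomputed per-feature values rather than derived from $W$, and the helper names are illustrative. Note that with a scale of $10^6$ the mask is effectively a step function: it saturates for |Vol²| well above $10^{-6}$, and equals exactly 0.5 at Vol² = 0.

```python
from math import exp


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    z = exp(x)
    return z / (1.0 + z)


def validity_mask(vol2, sharpness=1e6):
    """Soft mask sigma(Vol^2 * 1e6): near 1 for valid, near 0 for invalid."""
    return sigmoid(vol2 * sharpness)


def gate_features(features, learned_gate, vol2):
    """Element-wise features * gate(geo) * validity mask."""
    mask = validity_mask(vol2)
    return [f * g * mask for f, g in zip(features, learned_gate)]


for vol2 in (1e-3, 1e-7, 0.0, -1e-7, -1e-3):
    print(f"Vol^2={vol2:+.0e} -> mask={validity_mask(vol2):.4f}")
```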
+### Loss Function
+$$
+\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
+$$
+Where:
+- $\mathcal{L}_{CE}$ = Cross-entropy for next-token prediction
+- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes invalid simplices (negative $\text{Vol}^2$)
+- $\lambda = 0.1$ (validity weight)
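A scalar sketch of the combined loss, using plain Python lists in place of tensors (function names are illustrative). It shows that the validity term is zero whenever every Vol² is non-negative, so it only contributes once a simplex has been pushed into an invalid configuration.

```python
def validity_loss(vol2_values):
    """mean(ReLU(-Vol^2)) over all simplices."""
    return sum(max(0.0, -v) for v in vol2_values) / len(vol2_values)


def total_loss(ce_loss, vol2_values, validity_weight=0.1):
    """Cross-entropy plus the weighted validity penalty."""
    return ce_loss + validity_weight * validity_loss(vol2_values)


healthy = [0.92, 0.16, 0.03, 0.001]   # all valid: no penalty
broken = [0.92, 0.16, -0.02, -0.001]  # two invalid simplices
print(total_loss(4.73, healthy))
print(total_loss(4.73, broken))
```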
+---
+## Safe Deformation Analysis
+Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones:
+### Stability Zones by K-Depth
+| Configuration | Differentiation Zone | Collapse Threshold |
+|---------------|---------------------|-------------------|
+| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
+| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
+| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
+| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |
+### Key Findings
+1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. Per the table above, collapse sets in only at roughly 10× that scale with edim=16 (~0.50) and beyond 40× with edim=32 (>2.0).
+2. **Embedding Dimension as Stability Buffer**:
+   ```
+   edim / k_max = stability_ratio
+   ratio ≥ 8×  → Very stable, deform up to 2.0
+   ratio ≥ 4×  → Comfortable margin
+   ratio ≥ 2×  → Tight but functional
+   ```
+3. **Vol² Behavior Under Deformation**:
+   - Low deform (0-0.15): Clear k-level hierarchy, Vol² decreases exponentially with k
+   - Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
+   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost
+4. **Vol² Scaling**:
+   ```
+   k=1: Vol² ~ 1e+0 (edge length squared)
+   k=2: Vol² ~ 1e-1 (triangle area squared)
+   k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
+   k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
+   ```
+   Exponential decay is expected and healthy.
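For comparison, an undeformed regular k-simplex with edge length $a$ has squared volume $(k+1)\,a^{2k} / (2^k (k!)^2)$. The sketch below evaluates that closed form for unit edges; assuming the templates use roughly unit edge lengths, the results fall in the same orders of magnitude as the scaling listed in item 4.

```python
from math import factorial


def regular_simplex_vol2(k, edge=1.0):
    """Squared volume of a regular k-simplex with the given edge length."""
    return (k + 1) * edge ** (2 * k) / (2 ** k * factorial(k) ** 2)


for k in range(1, 5):
    print(f"k={k}: Vol^2 = {regular_simplex_vol2(k):.6f}")
```

The factorial in the denominator is what drives the rapid decay with k, independent of any deformation.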
+### Recommended Production Settings
+```python
+# Conservative (proven)
+BASE_DEFORM = 0.05
+edim = 16
+depth = 4  # k=1,2,3,4
+# Aggressive (tested safe)
+BASE_DEFORM = 0.15
+edim = 32
+depth = 4
+# Experimental
+BASE_DEFORM = "learnable_per_k"  # Placeholder: let the network learn one scale per k-level
+edim = 2 * depth  # Minimum viable
+```
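The three presets can be checked against the stability-ratio rule of thumb from the Key Findings. This is a sketch only: the helper names are hypothetical, and the label for ratios below 2 is an assumption, since that range is not covered by the findings.

```python
def stability_ratio(edim, k_max):
    """edim / k_max, the buffer between coordinate dimension and depth."""
    return edim / k_max


def stability_band(ratio):
    """Map a ratio onto the bands listed under Key Findings."""
    if ratio >= 8:
        return "very stable"
    if ratio >= 4:
        return "comfortable margin"
    if ratio >= 2:
        return "tight but functional"
    return "below tested range"  # assumption: not characterized in the findings


presets = {
    "conservative": (16, 4),
    "aggressive": (32, 4),
    "experimental": (2 * 4, 4),  # edim = 2 * depth
}
for name, (edim, depth) in presets.items():
    ratio = stability_ratio(edim, depth)
    print(f"{name}: ratio={ratio:.1f} -> {stability_band(ratio)}")
```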
+---
+## Training Configuration
+### Model Hyperparameters
+```python
+config = {
+    "vocab_size": 50257,      # GPT-2 BPE tokenizer
+    "max_seq_len": 256,
+    "embed_dim": 384,
+    "depth": 4,               # k=1,2,3,4
+    "edim": 16,               # Vertex coordinate dimension
+    "feat_dim": 96,           # Features per vertex
+    "hidden": 384,
+    "num_heads": 8,
+    "num_blocks": 8,
+    "dropout": 0.1,
+}
+```
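A quick back-of-the-envelope check on how much of the parameter budget the vocabulary accounts for. The sketch counts only the token embedding table; whether the LM head shares those weights is not stated here, so the untied case is shown as an assumption. The 54M total is the rounded figure from the overview.

```python
def embedding_params(vocab_size, embed_dim):
    """Number of weights in a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim


TOTAL_PARAMS = 54_000_000  # rounded total quoted in the overview

table = embedding_params(50257, 384)
print(f"embedding table: {table:,} parameters")
print(f"share of total:  {table / TOTAL_PARAMS:.1%}")
# Assumption: an untied LM head would add a second table of the same size.
print(f"with untied LM head: {2 * table / TOTAL_PARAMS:.1%}")
```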
+### Training Hyperparameters
+```python
+training = {
+    "batch_size": 48,
+    "seq_len": 256,
+    "lr": 3e-4,
+    "weight_decay": 0.1,
+    "num_epochs": 50,
+    "grad_clip": 1.0,
+    "ce_weight": 1.0,
+    "validity_weight": 0.1,
+    "scheduler": "CosineAnnealingLR",
+    "stride": 128,            # Half of seq_len: consecutive windows overlap by 128 tokens
+}
+```
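With `seq_len=256` and `stride=128`, each training window starts 128 tokens after the previous one, so consecutive windows share half their tokens. The sketch below shows the windowing arithmetic only; it is not the dataset code from `trainer.py`, and it ignores the extra token a next-token target would need.

```python
def window_starts(num_tokens, seq_len=256, stride=128):
    """Start offsets of every full-length window over a token stream."""
    last_start = num_tokens - seq_len
    if last_start < 0:
        return []  # stream shorter than one window
    return list(range(0, last_start + 1, stride))


def window_overlap(seq_len=256, stride=128):
    """Tokens shared by two consecutive windows."""
    return max(0, seq_len - stride)


starts = window_starts(1024)
print("window starts:", starts)
print("overlap between neighbours:", window_overlap(), "tokens")
```

Setting `stride` equal to `seq_len` would give non-overlapping windows and roughly half as many training sequences.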
+---
+## Results
+### Training Progression
+| Epoch | Train PPL | Val PPL | Status |
+|-------|-----------|---------|--------|
+| 1 | 492 | 299 | Learning |
+| 5 | 77 | 132 | Improving |
+| 8 | 44 | **114** | **Best** |
+| 15 | 15 | 145 | Overfitting |
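Perplexity is the exponential of the mean cross-entropy (in nats), so the table can be translated back into loss values. A short sketch, using the figures from the table and the best validation perplexity from the overview:

```python
from math import exp, log


def perplexity(cross_entropy):
    """PPL = exp(mean cross-entropy in nats)."""
    return exp(cross_entropy)


def cross_entropy_from_ppl(ppl):
    """Inverse mapping: mean cross-entropy implied by a perplexity."""
    return log(ppl)


best_val_ce = cross_entropy_from_ppl(113.74)
print(f"best val PPL 113.74 -> cross-entropy {best_val_ce:.3f} nats")

# Gap between train and val loss at epoch 8 and epoch 15.
for epoch, train_ppl, val_ppl in [(8, 44, 114), (15, 15, 145)]:
    gap = cross_entropy_from_ppl(val_ppl) - cross_entropy_from_ppl(train_ppl)
    print(f"epoch {epoch}: val - train = {gap:.2f} nats")
```

The widening gap between epoch 8 and epoch 15 is the overfitting the Status column refers to.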
+### Geometric Health
+Throughout training:
+- **Validity**: 100% at all k-levels
+- **Vol² k=1**: ~0.92 (stable)
+- **Vol² k=2**: ~0.16 (stable)
+- **Vol² k=3**: ~0.03 (stable)
+- **Vol² k=4**: ~0.001 (stable)
+### Generation Quality
+**Epoch 1:**
+```
+ROMEO: , If, and a head I am IAB, What,
+```
+**Epoch 15+:**
+```
+ROMEO: if thou swear'st the Duke of love of it.
+MERCUTIO: Why, is it good.
+ROMEO: And for the jest love that.
+```
+The model learns:
+- Character names and dialogue structure
+- Turn-taking conventions
+- Shakespearean vocabulary and cadence
+- Coherent multi-turn exchanges
+---
+## Geometric Dimensions Output
+Each k-level contributes to the final representation:
+| K | Geo Dim | Components | Info Content |
+|---|---------|------------|--------------|
+| 1 | 2 | 1 d² + 1 vol² | Edge metric |
+| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
+| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
+| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
+| **Total** | **24** | | Pure geometry |
+With feat_dim=96 and each k-level's geometric dims concatenated to its own features, the per-level widths are 98, 100, 103, and 107, for 4 × 96 + 24 = 408 dims per token, of which 24 are purely geometric.
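The Geo Dim column can be reproduced from k alone: each level contributes one squared distance per vertex pair plus a single Vol² value. A minimal sketch (the helper name `geo_dim` is illustrative):

```python
from math import comb


def geo_dim(k):
    """Geometric dims at level k: C(k + 1, 2) squared distances + 1 Vol^2."""
    return comb(k + 1, 2) + 1


dims = [geo_dim(k) for k in range(1, 5)]
print("per-level geo dims:", dims)
print("total geo dims:", sum(dims))
```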
+---
+## File Structure
+```
+AbstractPhil/ksimplex-llm-prototype/
+├── README.md                 # This file
+├── trainer.py               # Training script
+├── inference.py             # Generation script
+├── config.json              # Model configuration
+├── checkpoints/
+│   ├── checkpoint_epoch_001.pt
+│   ├── checkpoint_epoch_008.pt  # Best val PPL
+│   └── checkpoint_latest.pt
+└── samples/
+    └── samples_epoch_*.json  # Generated text samples
+```
+---
+## Usage
+### Inference
+```python
+from inference import load_model, generate
+model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")
+text = generate(
+    model,
+    tokenizer,
+    prompt="ROMEO: ",
+    max_tokens=100,
+    temperature=0.8,
+    top_k=50
+)
+print(text)
+```
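The `temperature` and `top_k` arguments suggest standard temperature-scaled top-k sampling. The sketch below shows that technique on a plain list of logits; it is an illustration of the general method, not the code in `inference.py`, and the function name `sample_top_k` is hypothetical.

```python
import random
from math import exp


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample a token index from the top_k highest temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = scaled[top[0]]
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 1.5]
print([sample_top_k(logits, top_k=2, rng=rng) for _ in range(10)])
```

Lower temperatures sharpen the distribution toward the highest logit, and `top_k=1` reduces to greedy decoding.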
+### Training
+```bash
+python trainer.py \
+    --data shakespeare.txt \
+    --epochs 50 \
+    --batch_size 48 \
+    --lr 3e-4
+```
+---
+## Future Directions
+### Planned Experiments
+1. **Learnable Deformation Scale**: Per-k learnable α parameter
+2. **Volume Consistency Loss**: Maintain k-level differentiation
+   ```python
+   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
+   ```
+3. **K-Depth Ablation**: Test k=1,2,3 only (remove k=4 noise floor)
+4. **Vol² Normalization**: Scale by k to equalize magnitudes
+5. **Larger Data**: WikiText-103, OpenWebText
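The volume consistency loss from item 2 can be tried out on plain numbers. This sketch mirrors the `torch` one-liner using the standard library; `statistics.stdev` applies the same sample (n − 1) correction that `torch.std` uses by default. It assumes every Vol² is positive, since the logarithm is undefined otherwise.

```python
from math import log
from statistics import stdev


def coherence_loss(vol2_values, eps=1e-10):
    """Negative spread of log Vol^2 across k-levels."""
    log_vols = [log(v + eps) for v in vol2_values]
    return -stdev(log_vols)


well_separated = [0.92, 0.16, 0.03, 0.001]  # values from Geometric Health
converged = [0.5, 0.4, 0.3, 0.2]            # hypothetical collapsed hierarchy
print(coherence_loss(well_separated))
print(coherence_loss(converged))
```

Because the loss is the negated spread, minimizing it rewards k-levels whose volumes stay on distinct scales, which is the differentiation the item aims to maintain.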
+### Theoretical Questions
+- Does the geometric structure provide better length generalization?
+- Can we interpret k-level activations semantically?
+- Does geometric validity correlate with generation quality?
+- Can we prune k-levels without performance loss?
+---
+## Citation
+```bibtex
+@misc{ksimplex-llm-2026,
+  author = {AbstractPhil},
+  title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
+  year = {2026},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
+}
+```
+---
+## License
+MIT License - Free to use, modify, and distribute.
+---
+## Acknowledgments
+Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.
+*"The geometry is the representation."*