# K-Simplex Language Model Prototype

A geometric autoregressive language model using Cayley-Menger-validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically constrained structures that maintain mathematical validity throughout training.

## Overview

This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.

**Key Results:**

- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (due to the 50k BPE vocabulary)

---

## Architecture

### Conceptual Foundation

Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures**:

| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |

Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
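The vertex and distance-pair counts in the table follow directly from simplex combinatorics: a k-simplex has k+1 vertices and C(k+1, 2) pairwise distances, one per vertex pair. A quick illustrative sketch (not part of the repo's code):

```python
from math import comb

def simplex_counts(k: int) -> tuple[int, int]:
    """A k-simplex has k+1 vertices and C(k+1, 2) pairwise distances."""
    vertices = k + 1
    return vertices, comb(vertices, 2)

for k, name in [(1, "edge"), (2, "triangle"), (3, "tetrahedron"), (4, "5-cell")]:
    v, p = simplex_counts(k)
    print(f"k={k} ({name}): {v} vertices, {p} distance pairs")
# k=4 (5-cell): 5 vertices, 10 distance pairs
```

This is also why the geometric feature width grows with k: each level contributes its C(k+1, 2) squared distances plus one Vol² term.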
### Token Flow

```
Token ID
    ↓
Embedding Layer (vocab_size × embed_dim)
    ↓
Positional Encoding
    ↓
┌─────────────────────────────────────────┐
│ TokenToKChannels                        │
│ Projects to [B, T, K, feat_dim]         │
│ Each position gets K simplex channels   │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│ GeoBlock × num_blocks                   │
│ ┌─────────────────────────────────┐     │
│ │ KChannelCrossAttention          │     │
│ │ K-levels attend to each other   │     │
│ │ (within each token position)    │     │
│ └─────────────────────────────────┘     │
│ ┌─────────────────────────────────┐     │
│ │ CausalSequenceAttention         │     │
│ │ Tokens attend causally          │     │
│ │ (across sequence, masked)       │     │
│ └─────────────────────────────────┘     │
│ ┌─────────────────────────────────┐     │
│ │ MLP                             │     │
│ └─────────────────────────────────┘     │
└─────────────────────────────────────────┘
    ↓
LM Head → Logits [B, T, vocab_size]
```

---

## Geometric Formulas

### Cayley-Menger Determinant

For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:

$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$

Where the Cayley-Menger matrix is:

$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$

**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
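The determinant formula can be checked numerically from raw vertex coordinates. Below is a minimal NumPy sketch (the function name is illustrative; the repo's actual implementation presumably operates on batched distance tensors):

```python
import numpy as np
from math import factorial

def cayley_menger_vol2(vertices: np.ndarray) -> float:
    """Squared volume of the k-simplex spanned by (k+1) vertices, shape (k+1, d).

    Returns Vol^2 = (-1)^(k+1) / (2^k (k!)^2) * det(CM); > 0 means non-degenerate.
    """
    k = vertices.shape[0] - 1
    # Squared pairwise distances d_ij^2
    diff = vertices[:, None, :] - vertices[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # Border d2 with a row/column of ones and a zero corner to form CM
    cm = np.ones((k + 2, k + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    sign = (-1) ** (k + 1)
    return sign * np.linalg.det(cm) / (2 ** k * factorial(k) ** 2)

# Unit edge (k=1): Vol^2 = length^2
print(cayley_menger_vol2(np.array([[0.0], [1.0]])))  # ≈ 1.0
# Unit equilateral triangle (k=2): area^2 = 3/16
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
print(cayley_menger_vol2(tri))                       # ≈ 0.1875
```

A collinear "triangle" returns Vol² ≈ 0, which is exactly the degenerate case the validity criterion screens out.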
### Template Deformation

Each k-simplex starts from a regular (equilateral) template and learns deformations:

$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$

Where:

- $v_i^{(\text{template})}$ = vertices of the regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from the neural network

### Geometric Gating

Features are gated by geometric validity:

$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$

Where:

- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \,\|\, \text{Vol}^2])$
- The sigmoid on Vol² acts as a soft validity mask
- Invalid simplices (Vol² < 0) have their features suppressed

### Loss Function

$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{\text{validity}}
$$

Where:

- $\mathcal{L}_{CE}$ = cross-entropy for next-token prediction
- $\mathcal{L}_{\text{validity}} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
- $\lambda = 0.1$ (validity weight)

---

## Safe Deformation Analysis

Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones.

### Stability Zones by K-Depth

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|----------------------|--------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |

### Key Findings

1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. The geometry can safely handle 10-40× more deformation.

2. **Embedding Dimension as Stability Buffer**:

   ```
   edim / k_max = stability_ratio
   ratio ≥ 8× → Very stable, deform up to 2.0
   ratio ≥ 4× → Comfortable margin
   ratio ≥ 2× → Tight but functional
   ```

3. **Vol² Behavior Under Deformation**:
   - Low deform (0-0.15): Clear k-level hierarchy, Vol² decreases exponentially with k
   - Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost

4. **Vol² Scaling**:

   ```
   k=1: Vol² ~ 1e+0  (edge length squared)
   k=2: Vol² ~ 1e-1  (triangle area squared)
   k=3: Vol² ~ 1e-2  (tetrahedron volume squared)
   k=4: Vol² ~ 1e-3  (5-cell hypervolume squared)
   ```

   Exponential decay is expected and healthy.

### Recommended Production Settings

```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4  # k=1,2,3,4

# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4

# Experimental
BASE_DEFORM = learnable_per_k  # Allow network to find optimal
edim = 2 * depth               # Minimum viable
```

---

## Training Configuration

### Model Hyperparameters

```python
config = {
    "vocab_size": 50257,  # GPT-2 BPE tokenizer
    "max_seq_len": 256,
    "embed_dim": 384,
    "depth": 4,           # k=1,2,3,4
    "edim": 16,           # Vertex coordinate dimension
    "feat_dim": 96,       # Features per vertex
    "hidden": 384,
    "num_heads": 8,
    "num_blocks": 8,
    "dropout": 0.1,
}
```

### Training Hyperparameters

```python
training = {
    "batch_size": 48,
    "seq_len": 256,
    "lr": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 50,
    "grad_clip": 1.0,
    "ce_weight": 1.0,
    "validity_weight": 0.1,
    "scheduler": "CosineAnnealingLR",
    "stride": 128,        # 50%-overlapping training windows (seq_len 256)
}
```

---

## Results

### Training Progression

| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |

### Geometric Health

Throughout training:

- **Validity**: 100% at all k-levels
- **Vol² k=1**: ~0.92 (stable)
- **Vol² k=2**: ~0.16 (stable)
- **Vol² k=3**: ~0.03 (stable)
- **Vol² k=4**: ~0.001 (stable)

### Generation Quality

**Epoch 1:**

```
ROMEO: , If, and a head I am IAB, What,
```

**Epoch 15+:**

```
ROMEO: if thou swear'st the Duke of love of it.

MERCUTIO: Why, is it good.

ROMEO: And for the jest love that.
```

The model learns:

- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges

---

## Geometric Dimensions Output

Each k-level contributes to the final representation:

| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 d² + 1 vol² | Edge metric |
| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
| **Total** | **24** | | Pure geometry |

With feat_dim=96: output = 96 + 24 = 120 dims per k-level, × 4 k-levels = 480 total geometric dims per token.

---

## File Structure

```
AbstractPhil/ksimplex-llm-prototype/
├── README.md                    # This file
├── trainer.py                   # Training script
├── inference.py                 # Generation script
├── config.json                  # Model configuration
├── checkpoints/
│   ├── checkpoint_epoch_001.pt
│   ├── checkpoint_epoch_008.pt  # Best val PPL
│   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json     # Generated text samples
```

---

## Usage

### Inference

```python
from inference import load_model, generate

model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")

text = generate(
    model, tokenizer,
    prompt="ROMEO: ",
    max_tokens=100,
    temperature=0.8,
    top_k=50
)
print(text)
```

### Training

```bash
python trainer.py \
    --data shakespeare.txt \
    --epochs 50 \
    --batch_size 48 \
    --lr 3e-4
```

---

## Future Directions

### Planned Experiments

1. **Learnable Deformation Scale**: Per-k learnable α parameter
2. **Volume Consistency Loss**: Maintain k-level differentiation

   ```python
   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
   ```

3. **K-Depth Ablation**: Test k=1,2,3 only (remove the k=4 noise floor)
4. **Vol² Normalization**: Scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText

### Theoretical Questions

- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?

---

## Citation

```bibtex
@misc{ksimplex-llm-2026,
  author = {AbstractPhil},
  title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```

---

## License

MIT License - free to use, modify, and distribute.

---

## Acknowledgments

Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.

*"The geometry is the representation."*