# K-Simplex Language Model Prototype

A geometric autoregressive language model using Cayley-Menger-validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically constrained structures that maintain mathematical validity throughout training.

## Overview

This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.

**Key Results:**
- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (due to the 50k BPE vocabulary)

---
## Architecture

### Conceptual Foundation

Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures**:

| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |

Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
### Token Flow

```
Token ID
    ↓
Embedding Layer (vocab_size × embed_dim)
    ↓
Positional Encoding
    ↓
┌───────────────────────────────────────────┐
│ TokenToKChannels                          │
│ Projects to [B, T, K, feat_dim]           │
│ Each position gets K simplex channels     │
└───────────────────────────────────────────┘
    ↓
┌───────────────────────────────────────────┐
│ GeoBlock × num_blocks                     │
│   ┌───────────────────────────────────┐   │
│   │ KChannelCrossAttention            │   │
│   │ K-levels attend to each other     │   │
│   │ (within each token position)      │   │
│   └───────────────────────────────────┘   │
│   ┌───────────────────────────────────┐   │
│   │ CausalSequenceAttention           │   │
│   │ Tokens attend causally            │   │
│   │ (across sequence, masked)         │   │
│   └───────────────────────────────────┘   │
│   ┌───────────────────────────────────┐   │
│   │ MLP                               │   │
│   └───────────────────────────────────┘   │
└───────────────────────────────────────────┘
    ↓
LM Head → Logits [B, T, vocab_size]
```
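The tensor shapes in this flow can be sketched as follows. This is a shape-only sketch: the two linear layers stand in for `TokenToKChannels` and the LM head, and the GeoBlock stack is elided, so none of this is the repository's actual implementation.

```python
import torch
import torch.nn as nn

# Shape-only sketch of the token flow above (hypothetical stand-in modules).
B, T, K, feat_dim, embed_dim, vocab = 2, 8, 4, 96, 384, 50257

tokens = torch.randint(0, vocab, (B, T))
x = nn.Embedding(vocab, embed_dim)(tokens)   # [B, T, embed_dim]
to_k = nn.Linear(embed_dim, K * feat_dim)    # stand-in for TokenToKChannels
h = to_k(x).view(B, T, K, feat_dim)          # [B, T, K, feat_dim]
# ... GeoBlock × num_blocks would transform h here ...
lm_head = nn.Linear(K * feat_dim, vocab)
logits = lm_head(h.flatten(2))               # [B, T, vocab_size]
```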
---

## Geometric Formulas

### Cayley-Menger Determinant

For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:

$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$

where the Cayley-Menger matrix is:

$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$

**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
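A minimal PyTorch sketch of this computation (a standalone helper for illustration, not the repository's actual implementation):

```python
import math
import torch

def cm_volume_squared(vertices: torch.Tensor) -> torch.Tensor:
    """Squared volume of a k-simplex via the Cayley-Menger determinant.

    vertices: [k+1, dim] tensor of vertex coordinates (dim >= k).
    Returns a scalar tensor; positive for a non-degenerate simplex.
    """
    k = vertices.shape[0] - 1
    # Pairwise squared distances d_ij^2
    d2 = torch.cdist(vertices, vertices).pow(2)
    # Border d2 with a row/column of ones and a 0 in the corner
    cm = torch.ones(k + 2, k + 2, dtype=vertices.dtype)
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coeff = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    return coeff * torch.det(cm)

# Sanity check: a unit right triangle (k=2) has area 1/2, so Vol² = 1/4.
tri = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vol2 = cm_volume_squared(tri)  # ≈ 0.25
```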
### Template Deformation

Each k-simplex starts from a regular (equilateral) template and learns deformations:

$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$

where:
- $v_i^{(\text{template})}$ = vertices of the regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from the neural network
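One way to realize the template and the deformation step. The standard-basis construction below is an assumption for illustration; the repository may build its regular simplices differently.

```python
import torch

BASE_DEFORM = 0.05  # deformation scale α from the formula above

def regular_simplex(k: int, dim: int) -> torch.Tensor:
    """Vertices of a regular k-simplex embedded in `dim` coordinates.

    Standard-basis construction: e_0..e_k in R^{k+1}, centered at the
    origin, then zero-padded to `dim` (requires dim >= k+1). All pairwise
    distances are equal, so the template is equilateral.
    """
    verts = torch.eye(k + 1)                 # [k+1, k+1]
    verts = verts - verts.mean(dim=0)        # center at the origin
    pad = torch.zeros(k + 1, dim - (k + 1))
    return torch.cat([verts, pad], dim=-1)   # [k+1, dim]

def deform(template: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """v_deformed = v_template + α · Δv (Δv predicted by the network)."""
    return template + BASE_DEFORM * delta
```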
### Geometric Gating

Features are gated by geometric validity:

$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$

where:
- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
- The sigmoid on Vol² acts as a soft validity mask
- Invalid simplices (Vol² < 0) have their features suppressed
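A sketch of the gating step; the shape of the gate projection `W` and the tensor layouts are assumptions, not the repository's actual signatures.

```python
import torch

def geometric_gate(features: torch.Tensor, d2: torch.Tensor,
                   vol2: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """output = features ⊙ gate(geo) ⊙ σ(Vol²·1e6).

    features: [..., feat_dim]; d2: [..., n_pairs]; vol2: [..., 1];
    W: [n_pairs + 1, feat_dim] (hypothetical gate projection).
    """
    geo = torch.cat([d2, vol2], dim=-1)      # [d² ‖ Vol²]
    gate = torch.sigmoid(geo @ W)            # gate(geo) = σ(W·[d² ‖ Vol²])
    validity = torch.sigmoid(vol2 * 1e6)     # ≈1 if Vol² > 0, ≈0 if Vol² < 0
    return features * gate * validity
```

The huge 1e6 multiplier makes the validity sigmoid an almost-hard step while keeping the whole gate differentiable.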
### Loss Function

$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
$$

where:
- $\mathcal{L}_{CE}$ = cross-entropy for next-token prediction
- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
- $\lambda = 0.1$ (validity weight)
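The combined objective can be sketched directly from these definitions (function name and argument layout are illustrative):

```python
import torch
import torch.nn.functional as F

VALIDITY_WEIGHT = 0.1  # λ

def total_loss(logits: torch.Tensor, targets: torch.Tensor,
               vol2: torch.Tensor) -> torch.Tensor:
    """L = L_CE + λ · mean(ReLU(-Vol²)).

    logits: [B, T, vocab]; targets: [B, T]; vol2: Vol² values of any shape.
    The hinge ReLU(-Vol²) is zero for valid simplices (Vol² > 0), so the
    penalty only activates when a simplex collapses.
    """
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    validity = F.relu(-vol2).mean()
    return ce + VALIDITY_WEIGHT * validity
```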
---

## Safe Deformation Analysis

Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones.

### Stability Zones by K-Depth

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|---------------------|-------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |
### Key Findings

1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. The geometry can safely handle 10-40× more deformation.
2. **Embedding Dimension as Stability Buffer**:
   ```
   edim / k_max = stability_ratio
   ratio ≥ 8× → Very stable, deform up to 2.0
   ratio ≥ 4× → Comfortable margin
   ratio ≥ 2× → Tight but functional
   ```
3. **Vol² Behavior Under Deformation**:
   - Low deform (0-0.15): Clear k-level hierarchy; Vol² decreases exponentially with k
   - Medium deform (0.15-0.35): **Optimal zone** with distinct geometric signatures per k
   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost
4. **Vol² Scaling**:
   ```
   k=1: Vol² ~ 1e+0 (edge length squared)
   k=2: Vol² ~ 1e-1 (triangle area squared)
   k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
   k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
   ```
   Exponential decay is expected and healthy.
### Recommended Production Settings

```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4   # k=1,2,3,4

# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4

# Experimental
BASE_DEFORM = learnable_per_k  # let the network find the optimal scale
edim = 2 * depth               # minimum viable
```
---

## Training Configuration

### Model Hyperparameters

```python
config = {
    "vocab_size": 50257,   # GPT-2 BPE tokenizer
    "max_seq_len": 256,
    "embed_dim": 384,
    "depth": 4,            # k=1,2,3,4
    "edim": 16,            # Vertex coordinate dimension
    "feat_dim": 96,        # Features per vertex
    "hidden": 384,
    "num_heads": 8,
    "num_blocks": 8,
    "dropout": 0.1,
}
```
### Training Hyperparameters

```python
training = {
    "batch_size": 48,
    "seq_len": 256,
    "lr": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 50,
    "grad_clip": 1.0,
    "ce_weight": 1.0,
    "validity_weight": 0.1,
    "scheduler": "CosineAnnealingLR",
    "stride": 128,  # 50%-overlapping windows (stride < seq_len)
}
```
---

## Results

### Training Progression

| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |

### Geometric Health

Throughout training:
- **Validity**: 100% at all k-levels
- **Vol² k=1**: ~0.92 (stable)
- **Vol² k=2**: ~0.16 (stable)
- **Vol² k=3**: ~0.03 (stable)
- **Vol² k=4**: ~0.001 (stable)
### Generation Quality

**Epoch 1:**
```
ROMEO: , If, and a head I am IAB, What,
```

**Epoch 15+:**
```
ROMEO: if thou swear'st the Duke of love of it.
MERCUTIO: Why, is it good.
ROMEO: And for the jest love that.
```

The model learns:
- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges
---

## Geometric Dimensions Output

Each k-level contributes to the final representation:

| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 d² + 1 vol² | Edge metric |
| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
| **Total** | **24** | | Pure geometry |

With feat_dim=96: output = 96 + 24 = 120 dims per k-level, × 4 k-levels = 480 total geometric dims per token.
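The bookkeeping in this table can be verified directly: a k-simplex has k(k+1)/2 pairwise distances plus one Vol² term.

```python
# Geo dims per k-level: k(k+1)/2 squared distances + 1 Vol² term.
geo_dims = {k: k * (k + 1) // 2 + 1 for k in range(1, 5)}
# geo_dims == {1: 2, 2: 4, 3: 7, 4: 11}, matching the table above.
total_geo = sum(geo_dims.values())   # 24
feat_dim = 96
per_k = feat_dim + total_geo         # 120 dims per k-level
total = per_k * 4                    # 480 geometric dims per token
```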
---

## File Structure

```
AbstractPhil/ksimplex-llm-prototype/
├── README.md                    # This file
├── trainer.py                   # Training script
├── inference.py                 # Generation script
├── config.json                  # Model configuration
├── checkpoints/
│   ├── checkpoint_epoch_001.pt
│   ├── checkpoint_epoch_008.pt  # Best val PPL
│   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json     # Generated text samples
```
---

## Usage

### Inference

```python
from inference import load_model, generate

model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")
text = generate(
    model,
    tokenizer,
    prompt="ROMEO: ",
    max_tokens=100,
    temperature=0.8,
    top_k=50,
)
print(text)
```

### Training

```bash
python trainer.py \
    --data shakespeare.txt \
    --epochs 50 \
    --batch_size 48 \
    --lr 3e-4
```
---

## Future Directions

### Planned Experiments

1. **Learnable Deformation Scale**: per-k learnable α parameter
2. **Volume Consistency Loss**: maintain k-level differentiation
   ```python
   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
   ```
3. **K-Depth Ablation**: test k=1,2,3 only (remove the k=4 noise floor)
4. **Vol² Normalization**: scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText

### Theoretical Questions

- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?
---

## Citation

```bibtex
@misc{ksimplex-llm-2026,
  author    = {AbstractPhil},
  title     = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```

---

## License

MIT License - free to use, modify, and distribute.

---

## Acknowledgments

Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.

*"The geometry is the representation."*