# K-Simplex Language Model Prototype

A geometric autoregressive language model built on Cayley-Menger validated k-simplex channels. The architecture replaces the flat per-token vectors of a traditional transformer with geometrically constrained structures that remain mathematically valid throughout training.

## Overview

This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.

**Key Results:**
- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (due to 50k BPE vocabulary)

---

## Architecture

### Conceptual Foundation

Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures** where:

| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |

Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
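
The vertex and distance-pair counts in the table follow from simple combinatorics: a k-simplex has k+1 vertices and one squared distance per unordered vertex pair. A minimal sketch (the helper name `simplex_counts` is illustrative, not part of the repository):

```python
from math import comb

NAMES = {1: "edge", 2: "triangle", 3: "tetrahedron", 4: "5-cell"}

def simplex_counts(k: int) -> tuple[int, int]:
    """Return (vertices, distance pairs) for a k-simplex."""
    vertices = k + 1
    pairs = comb(vertices, 2)  # one squared distance per unordered vertex pair
    return vertices, pairs

for k, name in NAMES.items():
    v, p = simplex_counts(k)
    print(f"k={k} {name}: {v} vertices, {p} distance pairs")
```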

### Token Flow

```
Token ID
    ↓
Embedding Layer (vocab_size × embed_dim)
    ↓
Positional Encoding
    ↓
┌─────────────────────────────────────────┐
│         TokenToKChannels                │
│  Projects to [B, T, K, feat_dim]        │
│  Each position gets K simplex channels  │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│         GeoBlock × num_blocks           │
│  ┌─────────────────────────────────┐    │
│  │ KChannelCrossAttention          │    │
│  │ K-levels attend to each other   │    │
│  │ (within each token position)    │    │
│  └─────────────────────────────────┘    │
│  ┌─────────────────────────────────┐    │
│  │ CausalSequenceAttention         │    │
│  │ Tokens attend causally          │    │
│  │ (across sequence, masked)       │    │
│  └─────────────────────────────────┘    │
│  ┌─────────────────────────────────┐    │
│  │ MLP                             │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
    ↓
LM Head → Logits [B, T, vocab_size]
```
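
The two attention stages in a GeoBlock run over different axes of the `[B, T, K, feat_dim]` tensor. The sketch below illustrates only that axis handling, using NumPy and a weight-free single-head attention. It assumes residual connections and omits the learned projections, multiple heads, MLP, and dropout of the real model; `attend` and `geo_block_attention` are hypothetical names.

```python
import numpy as np

def attend(x, mask=None):
    """Scaled dot-product self-attention over axis -2 (no learned weights)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def geo_block_attention(x):
    """x: [B, T, K, D]. Mix K-levels per position, then positions causally."""
    B, T, K, D = x.shape
    # 1) K-channel attention: axis -2 is K, so each (b, t) is handled independently.
    x = x + attend(x)
    # 2) Causal sequence attention: swap T into axis -2, one sequence per k-level.
    xt = np.swapaxes(x, 1, 2)                      # [B, K, T, D]
    causal = np.tril(np.ones((T, T), dtype=bool))  # position t sees only <= t
    xt = xt + attend(xt, mask=causal)
    return np.swapaxes(xt, 1, 2)                   # back to [B, T, K, D]

x = np.random.default_rng(0).normal(size=(2, 5, 4, 8))
print(geo_block_attention(x).shape)
```

Because the first stage never mixes positions and the second is masked, changing a later token leaves the outputs at all earlier positions untouched.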

---

## Geometric Formulas

### Cayley-Menger Determinant

For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:

$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$

Where the Cayley-Menger matrix is:

$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$

**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
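
The formula can be checked numerically. The sketch below uses NumPy rather than the model's own tensor code (an assumption made for portability), and the function name `cayley_menger_vol2` is illustrative. It builds the bordered matrix from pairwise squared distances and applies the coefficient above.

```python
import math
import numpy as np

def cayley_menger_vol2(vertices: np.ndarray) -> float:
    """Squared volume of the k-simplex whose k+1 vertices are the rows of `vertices`."""
    n = vertices.shape[0]                      # n = k + 1 vertices
    k = n - 1
    diff = vertices[:, None, :] - vertices[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)              # pairwise squared distances, [n, n]
    cm = np.ones((n + 1, n + 1))               # border of ones
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2                            # zero diagonal comes from d2 itself
    coeff = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    return float(coeff * np.linalg.det(cm))

# Unit right triangle: area 1/2, so Vol^2 should be approximately 0.25.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(cayley_menger_vol2(tri))
```

Three collinear points give a Vol² of zero (up to floating-point error), which is the degenerate case the validity criterion excludes.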

### Template Deformation

Each k-simplex starts from a regular (equilateral) template and learns deformations:

$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$

Where:
- $v_i^{(\text{template})}$ = vertices of regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from neural network
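
A minimal sketch of the deformation step, assuming a unit-edge template built from scaled basis vectors (one common construction; the repository may build its template differently). Random offsets stand in for the network output, and `regular_simplex` and `deform` are illustrative names.

```python
import numpy as np

BASE_DEFORM = 0.05  # alpha, as stated above

def regular_simplex(k: int, edim: int) -> np.ndarray:
    """k+1 vertices with unit edge length in `edim` dims (this construction needs edim >= k+1)."""
    if edim < k + 1:
        raise ValueError("this construction needs edim >= k + 1")
    verts = np.zeros((k + 1, edim))
    idx = np.arange(k + 1)
    verts[idx, idx] = 1.0 / np.sqrt(2.0)   # e_i / sqrt(2): every pair is distance 1 apart
    return verts - verts.mean(axis=0)      # center the template at the origin

def deform(template: np.ndarray, delta: np.ndarray, alpha: float = BASE_DEFORM) -> np.ndarray:
    """v_deformed = v_template + alpha * delta."""
    return template + alpha * delta

rng = np.random.default_rng(0)
template = regular_simplex(k=4, edim=16)
delta = rng.normal(size=template.shape)    # stand-in for the learned offsets
deformed = deform(template, delta)
print(deformed.shape)
```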

### Geometric Gating

Features are gated by geometric validity:

$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$

Where:
- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
- The sigmoid on Vol² acts as a soft validity mask
- Invalid simplices (Vol² < 0) have their features suppressed
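
The gating expression can be sketched for a single simplex as follows. This is a NumPy illustration under assumed shapes (one feature vector, one weight matrix `W` with no bias); the actual layer layout in the model may differ, and `geometric_gate` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    # tanh form avoids overflow when Vol^2 * 1e6 is in the thousands
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def geometric_gate(features, d2, vol2, W):
    """features: [F], d2: [P] squared distances, vol2: scalar, W: [F, P + 1]."""
    geo = np.concatenate([d2, [vol2]])   # [d^2 || Vol^2]
    gate = sigmoid(W @ geo)              # learned gate, one value per feature
    validity = sigmoid(vol2 * 1e6)       # soft validity mask
    return features * gate * validity

rng = np.random.default_rng(0)
features = rng.normal(size=8)
W = rng.normal(size=(8, 4))
print(geometric_gate(features, np.ones(3), 0.1875, W))   # valid triangle
print(geometric_gate(features, np.ones(3), -1e-3, W))    # invalid: suppressed
```

With the 10⁶ scale the mask saturates once |Vol²| exceeds roughly 10⁻⁵, so in practice it behaves almost like a hard switch on the sign of Vol².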

### Loss Function

$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
$$

Where:
- $\mathcal{L}_{CE}$ = Cross-entropy for next-token prediction
- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes invalid simplices (negative Vol²)
- $\lambda = 0.1$ (validity weight)
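
As a worked instance of the combined loss, the sketch below evaluates it on a handful of made-up Vol² values (the numbers are illustrative, not taken from training logs; function names are hypothetical).

```python
import numpy as np

VALIDITY_WEIGHT = 0.1  # lambda, as stated above

def validity_loss(vol2) -> float:
    """mean(ReLU(-Vol^2)): zero unless some simplex has negative Vol^2."""
    return float(np.mean(np.maximum(-np.asarray(vol2, dtype=float), 0.0)))

def total_loss(ce_loss: float, vol2, lam: float = VALIDITY_WEIGHT) -> float:
    return ce_loss + lam * validity_loss(vol2)

vol2 = [0.92, 0.16, -0.2, 0.001]   # one invalid entry
print(validity_loss(vol2))         # 0.2 / 4
print(total_loss(4.0, vol2))       # 4.0 + 0.1 * 0.05
```

When every simplex is valid the penalty vanishes and the objective reduces to plain cross-entropy.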

---

## Safe Deformation Analysis

Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones:

### Stability Zones by K-Depth

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|---------------------|-------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |

### Key Findings

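1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. With edim=32 the geometry tolerates 10-40× more deformation (0.5 to 2.0); with edim=16 collapse sets in near 0.50, so the usable margin is closer to 3-7× (0.15 to 0.35).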

2. **Embedding Dimension as Stability Buffer**:
   ```
   edim / k_max = stability_ratio
   
   ratio ≥ 8×  → Very stable, deform up to 2.0
   ratio ≥ 4×  → Comfortable margin
   ratio ≥ 2×  → Tight but functional
   ```

3. **Vol² Behavior Under Deformation**:
   - Low deform (0-0.15): Clear k-level hierarchy, Vol² decreases exponentially with k
   - Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost

4. **Vol² Scaling**:
   ```
   k=1: Vol² ~ 1e+0 (edge length squared)
   k=2: Vol² ~ 1e-1 (triangle area squared)
   k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
   k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
   ```
   Exponential decay is expected and healthy.
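
For reference, an undeformed regular k-simplex with edge length a has the closed-form squared volume Vol² = (k+1) · a^(2k) / (2^k · (k!)²). The sketch below evaluates it for unit edges, alongside the stability ratio from finding 2. The unit edge length is an assumption (the template's actual edge length is not stated here), and both function names are illustrative.

```python
from math import factorial

def regular_vol2(k: int, edge: float = 1.0) -> float:
    """Squared volume of a regular k-simplex with the given edge length."""
    return (k + 1) / (2 ** k * factorial(k) ** 2) * edge ** (2 * k)

def stability_ratio(edim: int, k_max: int) -> float:
    """edim / k_max, as defined in finding 2."""
    return edim / k_max

for k in range(1, 5):
    print(f"k={k}: Vol^2 = {regular_vol2(k):.6f}")

print(stability_ratio(16, 4), stability_ratio(32, 4))
```

For unit edges this gives 1, 3/16, 1/72, and 5/9216, which matches the order-of-magnitude decay listed in finding 4.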

### Recommended Production Settings

```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4  # k=1,2,3,4

# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4

# Experimental
BASE_DEFORM = "learnable_per_k"  # Placeholder: let the network learn one scale per k-level
edim = 2 * depth  # Minimum viable
```

---

## Training Configuration

### Model Hyperparameters

```python
config = {
    "vocab_size": 50257,      # GPT-2 BPE tokenizer
    "max_seq_len": 256,
    "embed_dim": 384,
    "depth": 4,               # k=1,2,3,4
    "edim": 16,               # Vertex coordinate dimension
    "feat_dim": 96,           # Features per vertex
    "hidden": 384,
    "num_heads": 8,
    "num_blocks": 8,
    "dropout": 0.1,
}
```

### Training Hyperparameters

```python
training = {
    "batch_size": 48,
    "seq_len": 256,
    "lr": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 50,
    "grad_clip": 1.0,
    "ce_weight": 1.0,
    "validity_weight": 0.1,
    "scheduler": "CosineAnnealingLR",
    "stride": 128,            # Half of seq_len: consecutive windows overlap by 128 tokens
}
```
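
To make the interaction of `seq_len` and `stride` concrete, the sketch below lists window start offsets over a token stream. It assumes a standard sliding-window dataset in which each sample takes `seq_len + 1` tokens (inputs plus shifted targets); the training script may slice differently, and `window_starts` is a hypothetical helper.

```python
def window_starts(num_tokens: int, seq_len: int = 256, stride: int = 128) -> list[int]:
    """Start offsets of every full window of seq_len + 1 tokens."""
    return list(range(0, num_tokens - seq_len, stride))

starts = window_starts(1000)
print(starts)
print("overlap between consecutive windows:", 256 - 128, "tokens")
```

With `stride=128` each token (away from the edges) appears in two windows; setting `stride=256` would make the windows disjoint.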

---

## Results

### Training Progression

| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |
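
Perplexity values can be converted back to mean cross-entropy, which is often easier to compare across runs. The sketch assumes the usual definition, PPL = exp(mean token-level cross-entropy in nats); the validation figures are those reported in this README.

```python
import math

def perplexity(mean_ce_nats: float) -> float:
    return math.exp(mean_ce_nats)

def ce_from_perplexity(ppl: float) -> float:
    return math.log(ppl)

for epoch, val_ppl in [(1, 299), (5, 132), (8, 113.74), (15, 145)]:
    print(f"epoch {epoch}: val CE = {ce_from_perplexity(val_ppl):.3f} nats/token")
```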

### Geometric Health

Throughout training:
- **Validity**: 100% at all k-levels
- **Vol² k=1**: ~0.92 (stable)
- **Vol² k=2**: ~0.16 (stable)
- **Vol² k=3**: ~0.03 (stable)
- **Vol² k=4**: ~0.001 (stable)

### Generation Quality

**Epoch 1:**
```
ROMEO: , If, and a head I am IAB, What,
```

**Epoch 15+:**
```
ROMEO: if thou swear'st the Duke of love of it.
MERCUTIO: Why, is it good.
ROMEO: And for the jest love that.
```

The model learns:
- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges

---

## Geometric Dimensions Output

Each k-level contributes to the final representation:

| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 d² + 1 vol² | Edge metric |
| 2 | 4 | 3 d² + 1 vol² | Triangle shape |
| 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
| 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
| **Total** | **24** | | Pure geometry |

With feat_dim=96, concatenating features and geometry gives 96 + Geo Dim per k-level (98, 100, 103, and 107 dims for k=1 to 4), or 4 × 96 + 24 = 408 dims per token, of which 24 are purely geometric.
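
The Geo Dim column can be reproduced directly: each k-level contributes one squared distance per vertex pair plus a single Vol² value. A short sketch (`geo_dim` is an illustrative name):

```python
from math import comb

def geo_dim(k: int) -> int:
    """Squared distances for all vertex pairs, plus one Vol^2."""
    return comb(k + 1, 2) + 1

dims = {k: geo_dim(k) for k in range(1, 5)}
print(dims, sum(dims.values()))  # {1: 2, 2: 4, 3: 7, 4: 11} 24
```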

---

## File Structure

```
AbstractPhil/ksimplex-llm-prototype/
├── README.md                 # This file
├── trainer.py               # Training script
├── inference.py             # Generation script
├── config.json              # Model configuration
├── checkpoints/
│   ├── checkpoint_epoch_001.pt
│   ├── checkpoint_epoch_008.pt  # Best val PPL
│   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json  # Generated text samples
```

---

## Usage

### Inference

```python
from inference import load_model, generate

model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")

text = generate(
    model, 
    tokenizer,
    prompt="ROMEO: ",
    max_tokens=100,
    temperature=0.8,
    top_k=50
)
print(text)
```
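
The `temperature` and `top_k` arguments usually correspond to the sampling step sketched below. This is a generic NumPy illustration of temperature-scaled top-k sampling, not the code in `inference.py`; `sample_next` is a hypothetical helper.

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=50, rng=None):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top_k = min(top_k, logits.size)
    keep = np.argpartition(logits, -top_k)[-top_k:]     # indices of the k largest logits
    probs = np.exp(logits[keep] - logits[keep].max())   # stable softmax over kept logits
    probs /= probs.sum()
    return int(rng.choice(keep, p=probs))

logits = np.array([0.1, 2.0, -1.0, 5.0, 3.0])
print(sample_next(logits, temperature=0.8, top_k=2, rng=np.random.default_rng(0)))
```

Lower temperatures sharpen the distribution toward the largest logit, and `top_k=1` reduces to greedy decoding.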

### Training

```bash
python trainer.py \
    --data shakespeare.txt \
    --epochs 50 \
    --batch_size 48 \
    --lr 3e-4
```

---

## Future Directions

### Planned Experiments

1. **Learnable Deformation Scale**: Per-k learnable α parameter
2. **Volume Consistency Loss**: Maintain k-level differentiation
   ```python
   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
   ```
3. **K-Depth Ablation**: Test k=1,2,3 only (remove k=4 noise floor)
4. **Vol² Normalization**: Scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText
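
The volume consistency loss from item 2 can be tried outside the training loop. The sketch below is a NumPy translation of that one-liner, assuming non-negative Vol² inputs (negative values would make the logarithm undefined); the sample values are illustrative.

```python
import numpy as np

def coherence_loss(vol2_stack, eps: float = 1e-10) -> float:
    """Negative spread of log Vol^2 across k-levels; lower means more differentiated."""
    log_vol2 = np.log(np.asarray(vol2_stack, dtype=float) + eps)
    return -float(np.std(log_vol2, ddof=1))  # ddof=1 matches torch.std's default

print(coherence_loss([0.92, 0.16, 0.03, 0.001]))  # well-separated k-levels
print(coherence_loss([0.1, 0.1, 0.1, 0.1]))       # collapsed hierarchy
```

Minimizing this term rewards k-levels whose volumes stay spread out on a log scale, which is the differentiation that the planned experiment aims to preserve.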

### Theoretical Questions

- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?

---

## Citation

```bibtex
@misc{ksimplex-llm-2026,
  author = {AbstractPhil},
  title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```

---

## License

MIT License - Free to use, modify, and distribute.

---

## Acknowledgments

Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.

*"The geometry is the representation."*