# K-Simplex Language Model Prototype

A geometric autoregressive language model using Cayley-Menger validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically constrained structures that maintain mathematical validity throughout training.

## Overview

This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge β†’ triangle β†’ tetrahedron β†’ 5-cell) with learnable deformations validated by the Cayley-Menger determinant.

**Key Results:**
- Shakespeare corpus: **Val PPL 113.74** at epoch 8
- 100% geometric validity maintained throughout training
- Coherent dialogue generation with proper character attribution
- 54M parameters (largely due to the 50k BPE vocabulary)

---

## Architecture

### Conceptual Foundation

Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures** where:

| K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
|---------|-----------|----------|----------------|-------------------|
| k=1 | Edge | 2 | 1 | 1D linear relationship |
| k=2 | Triangle | 3 | 3 | 2D planar structure |
| k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
| k=4 | 5-cell | 5 | 10 | 4D hypervolume |

Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.

### Token Flow

```
Token ID
    ↓
Embedding Layer (vocab_size Γ— embed_dim)
    ↓
Positional Encoding
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         TokenToKChannels                β”‚
β”‚  Projects to [B, T, K, feat_dim]        β”‚
β”‚  Each position gets K simplex channels  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         GeoBlock Γ— num_blocks           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ KChannelCrossAttention          β”‚    β”‚
β”‚  β”‚ K-levels attend to each other   β”‚    β”‚
β”‚  β”‚ (within each token position)    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ CausalSequenceAttention         β”‚    β”‚
β”‚  β”‚ Tokens attend causally          β”‚    β”‚
β”‚  β”‚ (across sequence, masked)       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ MLP                             β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
LM Head β†’ Logits [B, T, vocab_size]
```

---

## Geometric Formulas

### Cayley-Menger Determinant

For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:

$$
\text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
$$

Where the Cayley-Menger matrix is:

$$
CM = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
\end{pmatrix}
$$

**Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
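The determinant can be evaluated directly from the squared pairwise distances. A minimal NumPy sketch (the `simplex_vol2` helper is illustrative, not the model's actual implementation):

```python
import numpy as np
from math import factorial

def simplex_vol2(D2):
    """Squared volume of a k-simplex from its (k+1)x(k+1) matrix of
    squared pairwise distances D2, via the Cayley-Menger determinant."""
    k = D2.shape[0] - 1
    cm = np.ones((k + 2, k + 2))      # border of 1s
    cm[0, 0] = 0.0
    cm[1:, 1:] = D2                   # squared-distance block (0 diagonal)
    sign = (-1) ** (k + 1)
    return sign * np.linalg.det(cm) / (2 ** k * factorial(k) ** 2)

# Unit equilateral triangle (k=2): area = sqrt(3)/4, so Vol^2 = 3/16
D2 = np.ones((3, 3)) - np.eye(3)
print(simplex_vol2(D2))  # β‰ˆ 0.1875
```

Sanity checks like this one (a shape whose volume is known in closed form) are a cheap way to validate the sign and normalization conventions.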

### Template Deformation

Each k-simplex starts from a regular (equilateral) template and learns deformations:

$$
v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
$$

Where:
- $v_i^{(\text{template})}$ = vertices of regular k-simplex
- $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
- $\Delta v_i$ = learned offset from neural network
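The template-plus-offset construction can be sketched as follows. The regular-simplex constructor here (standard basis vectors of $\mathbb{R}^{k+1}$, all pairwise distances $\sqrt{2}$) is one convenient choice and is hypothetical; the actual template and the source of $\Delta v$ live in the model code:

```python
import numpy as np

BASE_DEFORM = 0.05  # alpha from the text above

def regular_simplex(k, edim):
    """Regular k-simplex template: the k+1 standard basis vectors of
    R^(k+1), embedded in the first k+1 coordinates of R^edim.
    Every pairwise distance is sqrt(2), so the simplex is equilateral."""
    v = np.zeros((k + 1, edim))
    v[:, :k + 1] = np.eye(k + 1)
    return v

def deform(template, delta, alpha=BASE_DEFORM):
    """v_deformed = v_template + alpha * delta (delta is learned in the model)."""
    return template + alpha * delta

tpl = regular_simplex(3, 16)                    # tetrahedron template in R^16
moved = deform(tpl, np.random.randn(4, 16))     # small learned perturbation
```

Keeping Ξ± small relative to the unit edge length is what keeps the deformed simplex close to the valid region.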

### Geometric Gating

Features are gated by geometric validity:

$$
\text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
$$

Where:
- $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
- The sigmoid on VolΒ² acts as a soft validity mask
- Invalid simplices (VolΒ² < 0) have their features suppressed
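A simplified sketch of the soft validity mask only (the learned gate $W$ is omitted; the $10^6$ scale follows the formula above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def validity_mask(vol2, scale=1e6):
    """Soft validity mask from the formula above: ~1 when Vol^2 > 0
    (valid simplex), ~0 when Vol^2 < 0 (collapsed). The large scale
    makes the sigmoid behave almost like a step function."""
    return sigmoid(vol2 * scale)

features = np.ones(4)
gated_valid = features * validity_mask(1e-4)   # features pass through
gated_bad = features * validity_mask(-1e-4)    # features suppressed
```

Because the mask is a sigmoid rather than a hard threshold, gradients still flow through borderline simplices instead of being cut off.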

### Loss Function

$$
\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
$$

Where:
- $\mathcal{L}_{CE}$ = Cross-entropy for next-token prediction
- $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
- $\lambda = 0.1$ (validity weight)
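The validity term only fires for collapsed simplices. A NumPy sketch of the combination (function name hypothetical):

```python
import numpy as np

def total_loss(ce_loss, vol2, lam=0.1):
    """L = L_CE + lambda * mean(ReLU(-Vol^2)).
    ReLU(-Vol^2) is zero for valid simplices (Vol^2 >= 0) and grows
    linearly as a simplex collapses further into Vol^2 < 0."""
    validity = np.mean(np.maximum(-np.asarray(vol2), 0.0))
    return ce_loss + lam * validity

# One collapsed simplex (Vol^2 = -0.2) among valid ones:
print(total_loss(2.0, [0.5, 0.1, -0.2]))  # β‰ˆ 2.0067
```

When all simplices are valid the penalty vanishes exactly, so the geometric term does not compete with cross-entropy at the healthy equilibrium.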

---

## Safe Deformation Analysis

Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones:

### Stability Zones by K-Depth

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|---------------------|-------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |

### Key Findings

1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. The geometry can safely handle 10-40Γ— more deformation.

2. **Embedding Dimension as Stability Buffer**:
   ```
   edim / k_max = stability_ratio
   
   ratio β‰₯ 8Γ—  β†’ Very stable, deform up to 2.0
   ratio β‰₯ 4Γ—  β†’ Comfortable margin
   ratio β‰₯ 2Γ—  β†’ Tight but functional
   ```

3. **VolΒ² Behavior Under Deformation**:
   - Low deform (0-0.15): Clear k-level hierarchy, VolΒ² decreases exponentially with k
   - Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
   - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost

4. **VolΒ² Scaling**:
   ```
   k=1: VolΒ² ~ 1e+0 (edge length squared)
   k=2: VolΒ² ~ 1e-1 (triangle area squared)
   k=3: VolΒ² ~ 1e-2 (tetrahedron volume squared)
   k=4: VolΒ² ~ 1e-3 (5-cell hypervolume squared)
   ```
   Exponential decay is expected and healthy.
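   For a regular unit-edge k-simplex the closed form $\text{Vol}^2 = (k+1) / ((k!)^2 \, 2^k)$ reproduces this roughly decade-per-k decay:

   ```python
   from math import factorial

   # Vol^2 of a regular k-simplex with unit edges, k = 1..4:
   # yields 1, 0.1875, ~0.0139, ~0.000543 - one order of magnitude per k
   for k in range(1, 5):
       vol2 = (k + 1) / (factorial(k) ** 2 * 2 ** k)
       print(f"k={k}: Vol^2 = {vol2:.4g}")
   ```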

### Recommended Production Settings

```python
# Conservative (proven)
BASE_DEFORM = 0.05
edim = 16
depth = 4  # k=1,2,3,4

# Aggressive (tested safe)
BASE_DEFORM = 0.15
edim = 32
depth = 4

# Experimental
BASE_DEFORM = learnable_per_k  # Allow network to find optimal
edim = 2 * depth  # Minimum viable
```

---

## Training Configuration

### Model Hyperparameters

```python
config = {
    "vocab_size": 50257,      # GPT-2 BPE tokenizer
    "max_seq_len": 256,
    "embed_dim": 384,
    "depth": 4,               # k=1,2,3,4
    "edim": 16,               # Vertex coordinate dimension
    "feat_dim": 96,           # Features per vertex
    "hidden": 384,
    "num_heads": 8,
    "num_blocks": 8,
    "dropout": 0.1,
}
```

### Training Hyperparameters

```python
training = {
    "batch_size": 48,
    "seq_len": 256,
    "lr": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 50,
    "grad_clip": 1.0,
    "ce_weight": 1.0,
    "validity_weight": 0.1,
    "scheduler": "CosineAnnealingLR",
    "stride": 128,            # 50% overlap between 256-token windows
}
```

---

## Results

### Training Progression

| Epoch | Train PPL | Val PPL | Status |
|-------|-----------|---------|--------|
| 1 | 492 | 299 | Learning |
| 5 | 77 | 132 | Improving |
| 8 | 44 | **114** | **Best** |
| 15 | 15 | 145 | Overfitting |

### Geometric Health

Throughout training:
- **Validity**: 100% at all k-levels
- **VolΒ² k=1**: ~0.92 (stable)
- **VolΒ² k=2**: ~0.16 (stable)
- **VolΒ² k=3**: ~0.03 (stable)
- **VolΒ² k=4**: ~0.001 (stable)

### Generation Quality

**Epoch 1:**
```
ROMEO: , If, and a head I am IAB, What,
```

**Epoch 15+:**
```
ROMEO: if thou swear'st the Duke of love of it.
MERCUTIO: Why, is it good.
ROMEO: And for the jest love that.
```

The model learns:
- Character names and dialogue structure
- Turn-taking conventions
- Shakespearean vocabulary and cadence
- Coherent multi-turn exchanges

---

## Geometric Dimensions Output

Each k-level contributes to the final representation:

| K | Geo Dim | Components | Info Content |
|---|---------|------------|--------------|
| 1 | 2 | 1 dΒ² + 1 volΒ² | Edge metric |
| 2 | 4 | 3 dΒ² + 1 volΒ² | Triangle shape |
| 3 | 7 | 6 dΒ² + 1 volΒ² | Tetrahedron form |
| 4 | 11 | 10 dΒ² + 1 volΒ² | 5-cell structure |
| **Total** | **24** | | Pure geometry |

With feat_dim=96: Output = 96 + 24 = 120 dims per k-level, Γ—4 k-levels = 480 total geometric dims per token.
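The per-k geometry width in the table is just the pair count plus one VolΒ² term:

```python
def geo_dim(k):
    # k(k+1)/2 pairwise squared distances + 1 Vol^2 term
    return k * (k + 1) // 2 + 1

dims = [geo_dim(k) for k in range(1, 5)]
print(dims, sum(dims))  # [2, 4, 7, 11] 24
```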

---

## File Structure

```
AbstractPhil/ksimplex-llm-prototype/
β”œβ”€β”€ README.md                 # This file
β”œβ”€β”€ trainer.py               # Training script
β”œβ”€β”€ inference.py             # Generation script
β”œβ”€β”€ config.json              # Model configuration
β”œβ”€β”€ checkpoints/
β”‚   β”œβ”€β”€ checkpoint_epoch_001.pt
β”‚   β”œβ”€β”€ checkpoint_epoch_008.pt  # Best val PPL
β”‚   └── checkpoint_latest.pt
└── samples/
    └── samples_epoch_*.json  # Generated text samples
```

---

## Usage

### Inference

```python
from inference import load_model, generate

model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")

text = generate(
    model, 
    tokenizer,
    prompt="ROMEO: ",
    max_tokens=100,
    temperature=0.8,
    top_k=50
)
print(text)
```

### Training

```bash
python trainer.py \
    --data shakespeare.txt \
    --epochs 50 \
    --batch_size 48 \
    --lr 3e-4
```

---

## Future Directions

### Planned Experiments

1. **Learnable Deformation Scale**: Per-k learnable Ξ± parameter
2. **Volume Consistency Loss**: Maintain k-level differentiation
   ```python
   coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
   ```
3. **K-Depth Ablation**: Test k=1,2,3 only (remove k=4 noise floor)
4. **VolΒ² Normalization**: Scale by k to equalize magnitudes
5. **Larger Data**: WikiText-103, OpenWebText

### Theoretical Questions

- Does the geometric structure provide better length generalization?
- Can we interpret k-level activations semantically?
- Does geometric validity correlate with generation quality?
- Can we prune k-levels without performance loss?

---

## Citation

```bibtex
@misc{ksimplex-llm-2026,
  author = {AbstractPhil},
  title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
}
```

---

## License

MIT License - Free to use, modify, and distribute.

---

## Acknowledgments

Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.

*"The geometry is the representation."*