AbstractPhil committed on
Commit
5d154e8
·
verified ·
1 Parent(s): 9e5a420

Create README.md

Files changed (1)
  1. README.md +384 -0
README.md ADDED
@@ -0,0 +1,384 @@
1
+ # K-Simplex Language Model Prototype
2
+
3
+ A geometric autoregressive language model using Cayley-Menger validated k-simplex channels. This architecture replaces traditional transformer embeddings with geometrically-constrained structures that maintain mathematical validity throughout training.
4
+
5
+ ## Overview
6
+
7
+ This model explores whether **geometric inductive bias** can improve language modeling by representing each token position as a hierarchy of k-simplices (edge → triangle → tetrahedron → 5-cell) with learnable deformations validated by the Cayley-Menger determinant.
8
+
9
+ **Key Results:**
10
+ - Shakespeare corpus: **Val PPL 113.74** at epoch 8
11
+ - 100% geometric validity maintained throughout training
12
+ - Coherent dialogue generation with proper character attribution
13
+ - 54M parameters (due to 50k BPE vocabulary)
14
+
15
+ ---
16
+
17
+ ## Architecture
18
+
19
+ ### Conceptual Foundation
20
+
21
+ Traditional transformers represent tokens as flat vectors. This architecture represents each token as a **stack of k-simplex structures** where:
22
+
23
+ | K-Level | Structure | Vertices | Distance Pairs | Geometric Meaning |
24
+ |---------|-----------|----------|----------------|-------------------|
25
+ | k=1 | Edge | 2 | 1 | 1D linear relationship |
26
+ | k=2 | Triangle | 3 | 3 | 2D planar structure |
27
+ | k=3 | Tetrahedron | 4 | 6 | 3D volumetric structure |
28
+ | k=4 | 5-cell | 5 | 10 | 4D hypervolume |
29
+
30
+ Each k-level captures progressively higher-dimensional geometric relationships, providing a structured representation space that traditional embeddings lack.
31
+
32
+ ### Token Flow
33
+
34
+ ```
35
+ Token ID
36
+
37
+ Embedding Layer (vocab_size × embed_dim)
38
+
39
+ Positional Encoding
40
+
41
+ ┌─────────────────────────────────────────┐
42
+ │ TokenToKChannels │
43
+ │ Projects to [B, T, K, feat_dim] │
44
+ │ Each position gets K simplex channels │
45
+ └─────────────────────────────────────────┘
46
+
47
+ ┌─────────────────────────────────────────┐
48
+ │ GeoBlock × num_blocks │
49
+ │ ┌─────────────────────────────────┐ │
50
+ │ │ KChannelCrossAttention │ │
51
+ │ │ K-levels attend to each other │ │
52
+ │ │ (within each token position) │ │
53
+ │ └─────────────────────────────────┘ │
54
+ │ ┌─────────────────────────────────┐ │
55
+ │ │ CausalSequenceAttention │ │
56
+ │ │ Tokens attend causally │ │
57
+ │ │ (across sequence, masked) │ │
58
+ │ └─────────────────────────────────┘ │
59
+ │ ┌─────────────────────────────────┐ │
60
+ │ │ MLP │ │
61
+ │ └─────────────────────────────────┘ │
62
+ └─────────────────────────────────────────┘
63
+
64
+ LM Head → Logits [B, T, vocab_size]
65
+ ```
66
+
67
+ ---
68
+
69
+ ## Geometric Formulas
70
+
71
+ ### Cayley-Menger Determinant
72
+
73
+ For a k-simplex with vertices $v_0, v_1, \ldots, v_k$, the squared volume is computed via:
74
+
75
+ $$
76
+ \text{Vol}^2 = \frac{(-1)^{k+1}}{2^k (k!)^2} \det(CM)
77
+ $$
78
+
79
+ Where the Cayley-Menger matrix is:
80
+
81
+ $$
82
+ CM = \begin{pmatrix}
83
+ 0 & 1 & 1 & \cdots & 1 \\
84
+ 1 & 0 & d_{01}^2 & \cdots & d_{0k}^2 \\
85
+ 1 & d_{01}^2 & 0 & \cdots & d_{1k}^2 \\
86
+ \vdots & \vdots & \vdots & \ddots & \vdots \\
87
+ 1 & d_{0k}^2 & d_{1k}^2 & \cdots & 0
88
+ \end{pmatrix}
89
+ $$
90
+
91
+ **Validity Criterion:** $\text{Vol}^2 > 0$ indicates a non-degenerate simplex.
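
The determinant formula above can be checked on small cases. The following is a minimal, dependency-free sketch, not the model's implementation (which presumably uses batched tensor operations); the helper names `det` and `squared_volume` are hypothetical.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def squared_volume(vertices):
    """Vol^2 of the k-simplex spanned by k+1 vertices (lists of floats)."""
    n = len(vertices)  # n = k + 1
    k = n - 1
    # Bordered matrix: first row/column of ones, squared distances elsewhere.
    cm = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        cm[0][i + 1] = cm[i + 1][0] = 1.0
        for j in range(n):
            cm[i + 1][j + 1] = sum(
                (p - q) ** 2 for p, q in zip(vertices[i], vertices[j])
            )
    sign = (-1) ** (k + 1)
    return sign * det(cm) / (2 ** k * math.factorial(k) ** 2)

# Right triangle with legs of length 1: area 1/2, so Vol^2 = 1/4.
vol2_valid = squared_volume([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Three collinear points: a degenerate triangle with Vol^2 = 0.
vol2_degenerate = squared_volume([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
```

Only the squared distances enter the matrix, so the same function works for any coordinate dimension at or above k.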
92
+
93
+ ### Template Deformation
94
+
95
+ Each k-simplex starts from a regular (equilateral) template and learns deformations:
96
+
97
+ $$
98
+ v_i^{(\text{deformed})} = v_i^{(\text{template})} + \alpha \cdot \Delta v_i
99
+ $$
100
+
101
+ Where:
102
+ - $v_i^{(\text{template})}$ = vertices of regular k-simplex
103
+ - $\alpha$ = deformation scale (BASE_DEFORM = 0.05)
104
+ - $\Delta v_i$ = learned offset from neural network
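
As an illustration of the deformation rule, here is a small sketch under stated assumptions: the regular template is built from scaled standard basis vectors (one of several possible constructions, not necessarily the one used in `trainer.py`), and random offsets stand in for the network's learned $\Delta v_i$. All function names are hypothetical.

```python
import math
import random

def regular_template(k, edim):
    """k+1 vertices with unit edge length, zero-padded to edim coordinates."""
    if edim < k + 1:
        raise ValueError("this construction needs edim >= k + 1")
    scale = 1.0 / math.sqrt(2.0)  # makes every pairwise distance exactly 1
    return [[scale if c == i else 0.0 for c in range(edim)] for i in range(k + 1)]

def deform(template, offsets, alpha=0.05):
    """v_deformed = v_template + alpha * delta_v, applied per coordinate."""
    return [
        [t + alpha * d for t, d in zip(tv, dv)]
        for tv, dv in zip(template, offsets)
    ]

def squared_distances(vertices):
    """All pairwise squared edge lengths d_ij^2 for i < j."""
    return [
        sum((p - q) ** 2 for p, q in zip(vertices[i], vertices[j]))
        for i in range(len(vertices))
        for j in range(i + 1, len(vertices))
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
template = regular_template(k=4, edim=16)
# Stand-in for the learned offsets: uniform noise in [-1, 1].
offsets = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in template]
deformed = deform(template, offsets, alpha=0.05)

template_d2 = squared_distances(template)  # ten values, all equal to 1
deformed_d2 = squared_distances(deformed)  # ten values near 1
```

With offsets bounded by 1 per coordinate, each vertex moves at most `alpha * sqrt(edim)` = 0.2, so every deformed edge length stays within 0.4 of the template's unit length.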
105
+
106
+ ### Geometric Gating
107
+
108
+ Features are gated by geometric validity:
109
+
110
+ $$
111
+ \text{output} = \text{features} \odot \text{gate}(\text{geo}) \odot \sigma(\text{Vol}^2 \cdot 10^6)
112
+ $$
113
+
114
+ Where:
115
+ - $\text{gate}(\text{geo}) = \sigma(W \cdot [d^2 \| \text{Vol}^2])$
116
+ - The sigmoid on Vol² acts as a soft validity mask
117
+ - Invalid simplices (Vol² < 0) have their features suppressed
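
The effect of the $10^6$ factor is easiest to see numerically. The sketch below assumes the learned gate values have already been computed and passes them in as plain numbers; `geometric_gate` and `sharpness` are hypothetical names, not identifiers from the repository.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def geometric_gate(features, learned_gate, vol2, sharpness=1e6):
    """features * gate(geo) * sigmoid(Vol^2 * sharpness), element-wise."""
    validity = sigmoid(vol2 * sharpness)  # soft validity mask in [0, 1]
    return [f * g * validity for f, g in zip(features, learned_gate)]

features = [1.0, 2.0]
gate = [0.5, 0.5]  # stand-in for sigmoid(W . [d^2 || Vol^2])

passed = geometric_gate(features, gate, vol2=1e-3)       # mask saturates at 1
suppressed = geometric_gate(features, gate, vol2=-1e-3)  # mask saturates at 0
borderline = geometric_gate(features, gate, vol2=0.0)    # mask is exactly 0.5
```

Because of the large multiplier, the mask behaves almost like a step function: even the small k=4 volumes reported later (around 1e-3) pass through essentially unchanged.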
118
+
119
+ ### Loss Function
120
+
121
+ $$
122
+ \mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{validity}
123
+ $$
124
+
125
+ Where:
126
+ - $\mathcal{L}_{CE}$ = Cross-entropy for next-token prediction
127
+ - $\mathcal{L}_{validity} = \text{mean}(\text{ReLU}(-\text{Vol}^2))$ penalizes collapsed simplices
128
+ - $\lambda = 0.1$ (validity weight)
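
A minimal sketch of how the two terms combine, using plain Python numbers in place of tensors. The cross-entropy value is supplied by the caller, and the function names are hypothetical.

```python
def validity_loss(vol2_values):
    """mean(ReLU(-Vol^2)): zero unless some simplex has negative Vol^2."""
    return sum(max(0.0, -v) for v in vol2_values) / len(vol2_values)

def total_loss(ce_loss, vol2_values, validity_weight=0.1):
    """L = L_CE + lambda * L_validity."""
    return ce_loss + validity_weight * validity_loss(vol2_values)

ce = 4.73  # illustrative cross-entropy value only

healthy = [0.92, 0.16, 0.03, 0.001]      # all simplices valid
collapsed = [0.92, 0.16, -0.02, -0.002]  # two invalid simplices

loss_healthy = total_loss(ce, healthy)      # penalty term is zero
loss_collapsed = total_loss(ce, collapsed)  # ce + 0.1 * (0.022 / 4)
```

When every simplex is valid the penalty vanishes and only cross-entropy drives the gradient, which matches the 100% validity reported in the results.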
129
+
130
+ ---
131
+
132
+ ## Safe Deformation Analysis
133
+
134
+ Extensive testing via the K-Simplex Geometric Explorer revealed critical stability zones:
135
+
136
+ ### Stability Zones by K-Depth
137
+
138
+ | Configuration | Differentiation Zone | Collapse Threshold |
139
+ |---------------|---------------------|-------------------|
140
+ | k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
141
+ | k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
142
+ | k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
143
+ | k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |
144
+
145
+ ### Key Findings
146
+
147
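+ 1. **Deformation Scale Safety**: BASE_DEFORM=0.05 is extremely conservative. Per the table above, collapse sets in near 0.50 (10×) at edim=16 and only above 2.0 (40×) at edim=32, so the geometry tolerates far larger deformations than the default.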
148
+
149
+ 2. **Embedding Dimension as Stability Buffer**:
150
+ ```
151
+ edim / k_max = stability_ratio
152
+
153
+ ratio ≥ 8× → Very stable, deform up to 2.0
154
+ ratio ≥ 4× → Comfortable margin
155
+ ratio ≥ 2× → Tight but functional
156
+ ```
157
+
158
+ 3. **Vol² Behavior Under Deformation**:
159
+ - Low deform (0-0.15): Clear k-level hierarchy, Vol² decreases exponentially with k
160
+ - Medium deform (0.15-0.35): **Optimal zone** - distinct geometric signatures per k
161
+ - High deform (>0.5): Noise dominates, k-levels converge, geometric meaning lost
162
+
163
+ 4. **Vol² Scaling**:
164
+ ```
165
+ k=1: Vol² ~ 1e+0 (edge length squared)
166
+ k=2: Vol² ~ 1e-1 (triangle area squared)
167
+ k=3: Vol² ~ 1e-2 (tetrahedron volume squared)
168
+ k=4: Vol² ~ 1e-3 (5-cell hypervolume squared)
169
+ ```
170
+ Exponential decay is expected and healthy.
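
The Vol² scaling in finding 4 can be compared against the closed form for an undeformed regular simplex with edge length $a$, $\text{Vol}^2 = \frac{(k+1)\,a^{2k}}{2^k (k!)^2}$. This is a standard result rather than something taken from the repository, and `regular_vol2` is a hypothetical helper name.

```python
import math

def regular_vol2(k, edge=1.0):
    """Squared volume of a regular k-simplex with the given edge length."""
    return (k + 1) * edge ** (2 * k) / (2 ** k * math.factorial(k) ** 2)

# Unit-edge templates for k = 1..4.
template_vol2 = {k: regular_vol2(k) for k in range(1, 5)}
# k=1: 1.0, k=2: 3/16, k=3: 4/288, k=4: 5/9216
```

The factorial in the denominator makes the volume shrink faster than any fixed geometric rate, so a steep drop from k=1 to k=4 is a property of the template itself rather than a sign of collapse.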
171
+
172
+ ### Recommended Production Settings
173
+
174
+ ```python
175
+ # Conservative (proven)
176
+ BASE_DEFORM = 0.05
177
+ edim = 16
178
+ depth = 4 # k=1,2,3,4
179
+
180
+ # Aggressive (tested safe)
181
+ BASE_DEFORM = 0.15
182
+ edim = 32
183
+ depth = 4
184
+
185
+ # Experimental
186
+ BASE_DEFORM = learnable_per_k # Allow network to find optimal
187
+ edim = 2 * depth # Minimum viable
188
+ ```
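
To relate these presets to the stability ratio from the Key Findings, the following sketch classifies each one. The band labels come from the findings above; the function names and the "below the documented range" fallback are assumptions made for illustration.

```python
def stability_ratio(edim, k_max):
    """edim / k_max, as defined in the Key Findings."""
    return edim / k_max

def stability_band(ratio):
    """Map a ratio onto the bands listed in the Key Findings."""
    if ratio >= 8:
        return "very stable"
    if ratio >= 4:
        return "comfortable margin"
    if ratio >= 2:
        return "tight but functional"
    return "below the documented range"  # assumption: not covered by the source

depth = 4
presets = {
    "conservative": stability_ratio(edim=16, k_max=depth),
    "aggressive": stability_ratio(edim=32, k_max=depth),
    "experimental": stability_ratio(edim=2 * depth, k_max=depth),
}
bands = {name: stability_band(ratio) for name, ratio in presets.items()}
```

Under this reading, the experimental setting `edim = 2 * depth` sits exactly on the lower edge of the "tight but functional" band.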
189
+
190
+ ---
191
+
192
+ ## Training Configuration
193
+
194
+ ### Model Hyperparameters
195
+
196
+ ```python
197
+ config = {
198
+ "vocab_size": 50257, # GPT-2 BPE tokenizer
199
+ "max_seq_len": 256,
200
+ "embed_dim": 384,
201
+ "depth": 4, # k=1,2,3,4
202
+ "edim": 16, # Vertex coordinate dimension
203
+ "feat_dim": 96, # Features per vertex
204
+ "hidden": 384,
205
+ "num_heads": 8,
206
+ "num_blocks": 8,
207
+ "dropout": 0.1,
208
+ }
209
+ ```
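
The overview attributes much of the 54M parameter count to the 50k vocabulary. A rough arithmetic sketch follows, assuming an untied LM head with the same shape as the embedding table; the source does not say whether the head is tied or what width it projects from, so the second figure is illustrative only.

```python
vocab_size = 50257
embed_dim = 384

# Token embedding table: one embed_dim vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Assumption: an untied LM head of shape (embed_dim, vocab_size), no bias.
assumed_head_params = embed_dim * vocab_size

vocab_bound_params = embedding_params + assumed_head_params
share_of_reported_total = vocab_bound_params / 54_000_000
```

Under that assumption, roughly 38.6M of the reported 54M parameters would be tied to the vocabulary rather than to the geometric blocks.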
210
+
211
+ ### Training Hyperparameters
212
+
213
+ ```python
214
+ training = {
215
+ "batch_size": 48,
216
+ "seq_len": 256,
217
+ "lr": 3e-4,
218
+ "weight_decay": 0.1,
219
+ "num_epochs": 50,
220
+ "grad_clip": 1.0,
221
+ "ce_weight": 1.0,
222
+ "validity_weight": 0.1,
223
+ "scheduler": "CosineAnnealingLR",
224
+ "stride": 128, # 50% overlap: consecutive 256-token windows share 128 tokens
225
+ }
226
+ ```
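
To make the interaction between `seq_len` and `stride` concrete, here is a sketch of sliding-window extraction over a token stream. It is an assumption about how the trainer slices its data, not code from `trainer.py`, and it ignores the extra token that next-token targets usually require.

```python
def window_starts(num_tokens, seq_len=256, stride=128):
    """Start offsets of every full window that fits in the token stream."""
    if num_tokens < seq_len:
        return []
    return list(range(0, num_tokens - seq_len + 1, stride))

starts = window_starts(1024)
# Tokens shared by two consecutive windows.
overlap = 256 - 128

# With stride equal to seq_len the windows no longer overlap.
disjoint_starts = window_starts(1024, seq_len=256, stride=256)
```

With `seq_len=256` and `stride=128`, each token (away from the stream boundaries) appears in two windows per epoch, which roughly doubles the number of training sequences compared with disjoint slicing.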
227
+
228
+ ---
229
+
230
+ ## Results
231
+
232
+ ### Training Progression
233
+
234
+ | Epoch | Train PPL | Val PPL | Status |
235
+ |-------|-----------|---------|--------|
236
+ | 1 | 492 | 299 | Learning |
237
+ | 5 | 77 | 132 | Improving |
238
+ | 8 | 44 | **114** | **Best** |
239
+ | 15 | 15 | 145 | Overfitting |
240
+
241
+ ### Geometric Health
242
+
243
+ Throughout training:
244
+ - **Validity**: 100% at all k-levels
245
+ - **Vol² k=1**: ~0.92 (stable)
246
+ - **Vol² k=2**: ~0.16 (stable)
247
+ - **Vol² k=3**: ~0.03 (stable)
248
+ - **Vol² k=4**: ~0.001 (stable)
249
+
250
+ ### Generation Quality
251
+
252
+ **Epoch 1:**
253
+ ```
254
+ ROMEO: , If, and a head I am IAB, What,
255
+ ```
256
+
257
+ **Epoch 15+:**
258
+ ```
259
+ ROMEO: if thou swear'st the Duke of love of it.
260
+ MERCUTIO: Why, is it good.
261
+ ROMEO: And for the jest love that.
262
+ ```
263
+
264
+ The model learns:
265
+ - Character names and dialogue structure
266
+ - Turn-taking conventions
267
+ - Shakespearean vocabulary and cadence
268
+ - Coherent multi-turn exchanges
269
+
270
+ ---
271
+
272
+ ## Geometric Dimensions Output
273
+
274
+ Each k-level contributes to the final representation:
275
+
276
+ | K | Geo Dim | Components | Info Content |
277
+ |---|---------|------------|--------------|
278
+ | 1 | 2 | 1 d² + 1 vol² | Edge metric |
279
+ | 2 | 4 | 3 d² + 1 vol² | Triangle shape |
280
+ | 3 | 7 | 6 d² + 1 vol² | Tetrahedron form |
281
+ | 4 | 11 | 10 d² + 1 vol² | 5-cell structure |
282
+ | **Total** | **24** | | Pure geometry |
283
+
284
+ With feat_dim=96, each k-level emits its 96 features plus its own geometric dims (98, 100, 103, and 107 dims for k=1 to 4), giving 4 × 96 + 24 = 408 dims per token, 24 of which are pure geometry.
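
The Geo Dim column follows directly from counting vertex pairs: a k-simplex has k+1 vertices, hence $\binom{k+1}{2}$ squared distances, plus one Vol² value. A short sketch that reproduces the table (the function name is hypothetical):

```python
import math

def geo_dim(k):
    """Number of geometric outputs for one k-simplex: C(k+1, 2) d^2 values + 1 Vol^2."""
    return math.comb(k + 1, 2) + 1

geo_dims = {k: geo_dim(k) for k in range(1, 5)}
total_geo_dims = sum(geo_dims.values())
```

The same formula gives 16 and 22 for k=5 and k=6, which is how the table would extend to the k=1-6 configurations mentioned in the stability analysis.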
285
+
286
+ ---
287
+
288
+ ## File Structure
289
+
290
+ ```
291
+ AbstractPhil/ksimplex-llm-prototype/
292
+ ├── README.md # This file
293
+ ├── trainer.py # Training script
294
+ ├── inference.py # Generation script
295
+ ├── config.json # Model configuration
296
+ ├── checkpoints/
297
+ │ ├── checkpoint_epoch_001.pt
298
+ │ ├── checkpoint_epoch_008.pt # Best val PPL
299
+ │ └── checkpoint_latest.pt
300
+ └── samples/
301
+ └── samples_epoch_*.json # Generated text samples
302
+ ```
303
+
304
+ ---
305
+
306
+ ## Usage
307
+
308
+ ### Inference
309
+
310
+ ```python
311
+ from inference import load_model, generate
312
+
313
+ model, tokenizer = load_model("AbstractPhil/ksimplex-llm-prototype")
314
+
315
+ text = generate(
316
+ model,
317
+ tokenizer,
318
+ prompt="ROMEO: ",
319
+ max_tokens=100,
320
+ temperature=0.8,
321
+ top_k=50
322
+ )
323
+ print(text)
324
+ ```
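
The `temperature` and `top_k` arguments suggest standard top-k sampling. The body of `generate` is not shown in this README, so the following is a generic sketch of that technique operating on a plain list of logits, with a hypothetical function name.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token index from the top_k highest logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
greedy = sample_top_k(logits, top_k=1, rng=rng)  # always the argmax
samples = [sample_top_k(logits, top_k=2, rng=rng) for _ in range(20)]
```

Lower temperatures sharpen the distribution toward the highest logit, while `top_k` removes the long tail entirely before sampling.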
325
+
326
+ ### Training
327
+
328
+ ```bash
329
+ python trainer.py \
330
+ --data shakespeare.txt \
331
+ --epochs 50 \
332
+ --batch_size 48 \
333
+ --lr 3e-4
334
+ ```
335
+
336
+ ---
337
+
338
+ ## Future Directions
339
+
340
+ ### Planned Experiments
341
+
342
+ 1. **Learnable Deformation Scale**: Per-k learnable α parameter
343
+ 2. **Volume Consistency Loss**: Maintain k-level differentiation
344
+ ```python
345
+ coherence_loss = -torch.std(torch.log(vol2_stack + 1e-10))
346
+ ```
347
+ 3. **K-Depth Ablation**: Test k=1,2,3 only (remove k=4 noise floor)
348
+ 4. **Vol² Normalization**: Scale by k to equalize magnitudes
349
+ 5. **Larger Data**: WikiText-103, OpenWebText
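
The volume consistency loss in item 2 rewards spread between the log-volumes of the k-levels. A dependency-free sketch of the same expression follows, using `statistics.stdev`, which applies the same n-1 correction as the default `torch.std`; the input lists are illustrative values, not measurements.

```python
import math
import statistics

def coherence_loss(vol2_stack, eps=1e-10):
    """-std(log(Vol^2 + eps)) across k-levels; lower means more differentiated."""
    # Requires Vol^2 + eps > 0, so invalid (negative) volumes must be handled first.
    logs = [math.log(v + eps) for v in vol2_stack]
    return -statistics.stdev(logs)

differentiated = [0.92, 0.16, 0.03, 0.001]  # distinct magnitude per k-level
converged = [0.1, 0.1, 0.1, 0.1]            # k-levels indistinguishable

loss_differentiated = coherence_loss(differentiated)
loss_converged = coherence_loss(converged)
```

Minimizing this term pushes the k-levels toward distinct volume magnitudes, so it would need a weight small enough not to override the validity penalty.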
350
+
351
+ ### Theoretical Questions
352
+
353
+ - Does the geometric structure provide better length generalization?
354
+ - Can we interpret k-level activations semantically?
355
+ - Does geometric validity correlate with generation quality?
356
+ - Can we prune k-levels without performance loss?
357
+
358
+ ---
359
+
360
+ ## Citation
361
+
362
+ ```bibtex
363
+ @misc{ksimplex-llm-2026,
364
+ author = {AbstractPhil},
365
+ title = {K-Simplex Language Model: Geometric Autoregression with Cayley-Menger Validation},
366
+ year = {2026},
367
+ publisher = {HuggingFace},
368
+ url = {https://huggingface.co/AbstractPhil/ksimplex-llm-prototype}
369
+ }
370
+ ```
371
+
372
+ ---
373
+
374
+ ## License
375
+
376
+ MIT License - Free to use, modify, and distribute.
377
+
378
+ ---
379
+
380
+ ## Acknowledgments
381
+
382
+ Built on the foundation of geometric deep learning research exploring k-simplex structures, pentachoron navigation, and Cayley-Menger determinant validation for neural network regularization.
383
+
384
+ *"The geometry is the representation."*