Upload docs/PAPER.md with huggingface_hub
docs/PAPER.md (+25 -3)

The use of the 600-cell as a reference frame for attention is, to our knowledge, novel. Previous work on polytopes in machine learning has focused on convex optimization (Lee and Sidford, 2019) and sampling (Chevallier et al., 2018), not attention mechanisms.

### 6.4 Concurrent Work: Percepta — "Can LLMs Be Computers?"

**Percepta** (Tzamos et al., 2026) independently arrived at $O(\log t)$ attention through a different geometric foundation. They built a WebAssembly interpreter inside transformer weights, executing compiled C programs directly in the model's inference loop at 32,000 tokens/sec on CPU --- millions of exact execution steps with zero errors.

Their key insight is identical to ours: **geometric structure in low-dimensional attention enables sublinear lookup.** They restrict attention heads to 2D, turning each max-dot-product query into a convex hull "supporting point" query solvable in $O(\log t)$ via a HullKVCache, while we project to 4D H4 space, turning each query into a Coxeter chamber navigation solvable in $O(\log t)$ via ChamberTree.

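To make the supporting-point idea concrete: when 2D keys are unit vectors stored sorted by angle, the key maximizing $q \cdot k$ is the one whose angle is nearest the query's angle, found by binary search. The sketch below is an illustration of that geometric fact under this circular-key assumption; `build_cache` and `supporting_point` are invented names, not Percepta's HullKVCache API.

```python
import bisect
import math

def build_cache(angles):
    """Sorted key angles -> (angles, unit-vector keys)."""
    angles = sorted(angles)
    keys = [(math.cos(a), math.sin(a)) for a in angles]
    return angles, keys

def supporting_point(cache, q):
    """Index of the key maximizing dot(q, key), in O(log t)."""
    angles, keys = cache
    phi = math.atan2(q[1], q[0]) % (2 * math.pi)
    i = bisect.bisect_left(angles, phi)
    # The maximizer is the cyclic predecessor or successor of phi's
    # insertion point; compare their actual dot products.
    cands = {(i - 1) % len(keys), i % len(keys)}
    return max(cands, key=lambda j: q[0] * keys[j][0] + q[1] * keys[j][1])

cache = build_cache([0.1 + 0.37 * n for n in range(17)])  # angles in [0, 2*pi)
q = (0.3, -0.9)
best = supporting_point(cache, q)
# Cross-check against the O(t) linear scan.
brute = max(range(17), key=lambda j: q[0] * cache[1][j][0] + q[1] * cache[1][j][1])
assert best == brute
```

The same logarithmic structure holds for general convex hulls, where the dot product is unimodal along the hull boundary; the circular layout just makes the search a one-line `bisect`.
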
| | Percepta (2D Convex Hull) | H4 Polytopic Attention (4D Coxeter) |
|---|---|---|
| **Geometry** | 2D convex hull | 4D H4 polytope (600-cell) |
| **Head dimension** | 2D (strict) | 4D (H4 space) |
| **Complexity** | $O(\log t)$ per query | $O(\log t)$ per query |
| **Purpose** | Execute programs in weights | Language generation + RAG |
| **Computation** | Exact (zero error, millions of steps) | Approximate (language generation) |
| **Throughput** | 32,000 tok/s (deterministic traces) | 585 tok/s (language generation) |
| **Expressiveness** | Turing complete at 2D | Rich language modeling at 4D |

Two independent groups identified the same bottleneck (linear attention cost) and arrived at the same solution class (geometric sublinear lookup) through different mathematics. This convergence is strong evidence that **geometric attention is a fundamental improvement**, not a task-specific trick.

**Key difference:** Percepta targets exact computation (program execution with zero error). We target approximate computation (language generation, retrieval, ranking). Their 2D heads are sufficient for deterministic programs but limited for rich language. Our 4D heads sacrifice some lookup speed for greater expressiveness.

**Synthesis opportunity:** The two approaches are naturally complementary. A hybrid system could use Percepta's 2D fast path for exact computation (arithmetic, logic, algorithm execution) and our H4 4D path for language generation and retrieval. When a language model needs to compute $15 \times 23$, it switches to the 2D execution path (32,000 tok/s, exact), then returns to the 4D language path for the explanation. This eliminates the need for external calculator tools --- the computation happens inside the model itself.

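As a purely hypothetical sketch of that dispatch, with `exact_path` and `language_path` as invented stand-ins for the two systems (neither is a real implementation):

```python
import re

def exact_path(expr):
    """Stand-in for the exact 2D execution path: integer arithmetic."""
    a, op, b = re.fullmatch(r"(\d+)\s*([*+-])\s*(\d+)", expr).groups()
    ops = {"*": lambda x, y: x * y,
           "+": lambda x, y: x + y,
           "-": lambda x, y: x - y}
    return ops[op](int(a), int(b))

def language_path(prompt):
    """Stand-in for the approximate 4D language path."""
    return f"[generated explanation for: {prompt}]"

def hybrid(prompt):
    """Route arithmetic spans to the exact path, all else to language."""
    m = re.search(r"\d+\s*[*+-]\s*\d+", prompt)
    if m:  # switch to the exact path for the arithmetic span
        value = exact_path(m.group())
        return language_path(prompt) + f" answer={value}"
    return language_path(prompt)

print(hybrid("what is 15 * 23?"))
```

The point of the sketch is only the control flow: the exact path handles the arithmetic span in-model, so no external calculator tool is invoked.
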
### 6.5 Lila-E8

**Lila-E8** (concurrent, 2025--2026) uses the E8 lattice structure as an attention bias, adding E8-derived geometric information to the attention score computation. This is a complementary but fundamentally different approach from ours:

Both approaches validate the core insight that E8 lattice geometry is useful for attention. Lila-E8 demonstrates that E8 structure improves attention quality within the standard $O(t^2)$ framework. Our work demonstrates that projecting E8 to H4 via the Coxeter eigenvalues enables a fundamentally faster attention algorithm. The approaches are complementary: Lila-E8's attention bias could potentially be applied within the candidate set that our ChamberTree selects, combining quality improvement with algorithmic speedup.

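A minimal sketch of that combination, assuming a stub `select_candidates` in place of the ChamberTree lookup and a stub `e8_bias` in place of the lattice-derived bias (both invented for illustration, not either system's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 256, 4
keys = rng.standard_normal((t, d))
q = rng.standard_normal(d)

def select_candidates(q, keys, m=16):
    """Stub for the geometric candidate selection (a real system
    would use an O(log t) tree; here: keys best matching q's signs)."""
    match = (np.sign(keys) == np.sign(q)).sum(axis=1)
    return np.argsort(match)[-m:]

def e8_bias(keys):
    """Stub for a lattice-derived additive score bias."""
    return -0.1 * np.abs(keys - np.round(keys)).sum(axis=1)

cand = select_candidates(q, keys)
scores = keys[cand] @ q + e8_bias(keys[cand])  # bias only on candidates
attn = np.exp(scores - scores.max())
attn /= attn.sum()                             # softmax over the candidate set
```

The structural point is that the bias is added after candidate selection, so the quality improvement costs $O(m)$ per query rather than $O(t)$.
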
### 6.6 Quantization

**BitNet** (Wang et al., 2023) and **BitNet b1.58** (Ma et al., 2024) demonstrate that ternary weights can match float quality with 2$\times$ width. **GPTQ** (Frantar et al., 2023) and **AWQ** (Lin et al., 2024) provide post-training quantization to 4-bit or lower. Our contribution is showing that ternary quantization is specifically compatible with geometric routing because chamber assignments depend on sign patterns that ternary preserves.

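The compatibility claim can be illustrated with absmean ternarization in the style of BitNet b1.58. Everything below (the toy `ternarize`, the random plane normals, the agreement metric) is illustrative, not the paper's kernel: chamber assignment is modeled as the sign pattern of dot products against hyperplane normals, and ternarization keeps each weight's sign, so the routing signal degrades gracefully.

```python
import numpy as np

def ternarize(w):
    """BitNet-b1.58-style absmean ternarization to {-1, 0, +1} * scale."""
    scale = np.abs(w).mean()
    return np.clip(np.round(w / scale), -1, 1) * scale

rng = np.random.default_rng(1)
normals = rng.standard_normal((8, 4))   # toy reflection-plane normals
x = rng.standard_normal(4)

chamber_float = np.sign(normals @ x)    # sign pattern = chamber assignment
chamber_tern = np.sign(ternarize(normals) @ x)
agreement = (chamber_float == chamber_tern).mean()
print(f"sign-pattern agreement: {agreement:.2f}")
```

Note the hedge: per-weight signs are preserved exactly, but the signs of the resulting dot products can still flip when a projection is near a chamber wall, which is why the sketch measures agreement rather than asserting it.
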
### 6.7 Autonomous Machine Learning

**Neural Architecture Search** (Zoph and Le, 2017; Liu et al., 2019) automates architecture design but typically requires GPU-hours to GPU-days. Our autoresearch loop is far more constrained (2-minute CPU experiments) and optimizes hyperparameters rather than architecture, but the key insight --- that frozen geometric structure reduces the search space to make autonomous optimization tractable --- may generalize.

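A minimal sketch of such a budgeted loop, assuming a tiny discrete space (frozen geometry leaves only a few knobs) and an invented `run_experiment` standing in for a short CPU training run with a synthetic objective:

```python
import random

# Hypothetical residual search space once the geometric structure is frozen.
SPACE = {
    "lr": [1e-3, 3e-3, 1e-2],
    "batch": [16, 32, 64],
    "depth": [4, 6],
}

def run_experiment(cfg):
    """Stand-in for a short training run: a synthetic validation loss."""
    return (cfg["lr"] - 3e-3) ** 2 + 0.01 / cfg["batch"] + 0.001 * cfg["depth"]

def autoresearch(trials=500, seed=0):
    """Budgeted random search: tractable because SPACE has 18 configs."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        loss = run_experiment(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

cfg, loss = autoresearch()
print(cfg, loss)
```

The contrast with NAS is the budget: with 18 configurations rather than an open architecture space, even naive random search over short experiments converges, which is the tractability argument above in miniature.
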