grapheneaffiliates committed
Commit c4a699e · verified · 1 Parent(s): ca4ccbb

Upload docs/PAPER.md with huggingface_hub

Files changed (1):
  1. docs/PAPER.md +25 -3
docs/PAPER.md CHANGED
@@ -525,7 +525,29 @@ The 6-layer, $d_\text{model}=128$ float configuration and the 4-layer, $d_\text{
 
 The use of the 600-cell as a reference frame for attention is, to our knowledge, novel. Previous work on polytopes in machine learning has focused on convex optimization (Lee and Sidford, 2019) and sampling (Chevallier et al., 2018), not attention mechanisms.
 
-### 6.4 Concurrent Work: Lila-E8
+### 6.4 Concurrent Work: Percepta — "Can LLMs Be Computers?"
+
+**Percepta** (Tzamos et al., 2026) independently arrived at $O(\log t)$ attention through a different geometric foundation. They built a WebAssembly interpreter inside transformer weights, executing compiled C programs directly in the model's inference loop at 32,000 tokens/sec on CPU --- millions of exact execution steps with zero errors.
+
+Their key insight is identical to ours: **geometric structure in low-dimensional attention enables sublinear lookup.** They restrict attention heads to 2D, turning each max-dot-product query into a convex hull "supporting point" query solvable in $O(\log t)$ via a HullKVCache. We project to 4D H4 space, turning each query into a Coxeter chamber navigation solvable in $O(\log t)$ via ChamberTree.
+
+| | Percepta (2D Convex Hull) | H4 Polytopic Attention (4D Coxeter) |
+|---|---|---|
+| **Geometry** | 2D convex hull | 4D H4 polytope (600-cell) |
+| **Head dimension** | 2D (strict) | 4D (H4 space) |
+| **Complexity** | $O(\log t)$ per query | $O(\log t)$ per query |
+| **Purpose** | Execute programs in weights | Language generation + RAG |
+| **Computation** | Exact (zero error, millions of steps) | Approximate (language generation) |
+| **Throughput** | 32,000 tok/s (deterministic traces) | 585 tok/s (language generation) |
+| **Expressiveness** | Turing complete at 2D | Rich language modeling at 4D |
+
+Two independent groups identified the same bottleneck (linear attention cost) and arrived at the same solution class (geometric sublinear lookup) through different mathematics. This convergence is strong evidence that **geometric attention is a fundamental improvement**, not a task-specific trick.
+
+**Key difference:** Percepta targets exact computation (program execution with zero error). We target approximate computation (language generation, retrieval, ranking). Their 2D heads are sufficient for deterministic programs but limited for rich language. Our 4D heads sacrifice some lookup speed for greater expressiveness.
+
+**Synthesis opportunity:** The two approaches are naturally complementary. A hybrid system could use Percepta's 2D fast path for exact computation (arithmetic, logic, algorithm execution) and our H4 4D path for language generation and retrieval. When a language model needs to compute $15 \times 23$, it switches to the 2D execution path (32,000 tok/s, exact), then returns to the 4D language path for the explanation. This eliminates the need for external calculator tools --- the computation happens inside the model itself.
+
+### 6.5 Lila-E8
 
 **Lila-E8** (concurrent, 2025--2026) uses the E8 lattice structure as an attention bias, adding E8-derived geometric information to the attention score computation. This is a complementary but fundamentally different approach from ours:
 
@@ -539,11 +561,11 @@ The use of the 600-cell as a reference frame for attention is, to our knowledge,
 
 Both approaches validate the core insight that E8 lattice geometry is useful for attention. Lila-E8 demonstrates that E8 structure improves attention quality within the standard $O(t^2)$ framework. Our work demonstrates that projecting E8 to H4 via the Coxeter eigenvalues enables a fundamentally faster attention algorithm. The approaches are complementary: Lila-E8's attention bias could potentially be applied within the candidate set that our ChamberTree selects, combining quality improvement with algorithmic speedup.
 
-### 6.5 Quantization
+### 6.6 Quantization
 
 **BitNet** (Wang et al., 2023) and **BitNet b1.58** (Ma et al., 2024) demonstrate that ternary weights can match float quality with 2$\times$ width. **GPTQ** (Frantar et al., 2023) and **AWQ** (Lin et al., 2024) provide post-training quantization to 4-bit or lower. Our contribution is showing that ternary quantization is specifically compatible with geometric routing because chamber assignments depend on sign patterns that ternary preserves.
 
-### 6.6 Autonomous Machine Learning
+### 6.7 Autonomous Machine Learning
 
 **Neural Architecture Search** (Zoph and Le, 2017; Liu et al., 2019) automates architecture design but typically requires GPU-hours to GPU-days. Our autoresearch loop is far more constrained (2-minute CPU experiments) and optimizes hyperparameters rather than architecture, but the key insight --- that frozen geometric structure reduces the search space to make autonomous optimization tractable --- may generalize.
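The "supporting point" query that section 6.4 credits to Percepta can be sketched in a few lines. This is an illustrative sketch only, not their HullKVCache: we assume a precomputed upper hull (vertices sorted by x, edge slopes strictly decreasing) and a query direction with positive y-component, so the dot products along the hull are unimodal and a ternary search finds the maximizer in $O(\log n)$ comparisons. The function name and data are ours.

```python
def extreme_point(hull, d):
    """Index of the hull vertex maximizing <p, d> in O(log n) comparisons.

    hull: upper-hull vertices sorted by x (edge slopes strictly decreasing).
    d: query direction with d[1] > 0, so f(i) = <hull[i], d> rises then falls
    (unimodal) and ternary search on the index applies.
    """
    def f(i):
        return hull[i][0] * d[0] + hull[i][1] * d[1]

    lo, hi = 0, len(hull) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            lo = m1 + 1   # maximum lies strictly to the right of m1
        else:
            hi = m2 - 1   # maximum lies at or to the left of m2 - 1
    return max(range(lo, hi + 1), key=f)

# Toy upper hull; edge slopes 1.5, 1/3, -1/3, -1.5 are strictly decreasing.
hull = [(0, 0), (2, 3), (5, 4), (8, 3), (10, 0)]
print(extreme_point(hull, (1, 1)))  # → 3, i.e. vertex (8, 3)
```

The same max-dot-product query, answered by a linear scan in standard attention, costs $O(t)$ per token; the hull structure is what buys the logarithm.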
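The quantization section's claim that ternary weights compose with sign-based routing can be made concrete. A minimal sketch, with hypothetical names and an absmean-style ternarization in the spirit of BitNet b1.58 (not the paper's actual routing code): ternary quantization factors the routing weights as $W \approx \alpha T$ with $T \in \{-1, 0, +1\}$ and $\alpha > 0$, and a positive scale never flips a sign, so a chamber assignment read off from $\operatorname{sign}(Wx)$ is reproduced exactly by the ternary matrix alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical routing weights: each row acts as a hyperplane normal, and the
# sign pattern of W @ x says which side of each hyperplane x falls on.
W = rng.standard_normal((8, 16))
x = rng.standard_normal(16)

# Absmean-style ternarization: W ~ alpha * T with T in {-1, 0, +1}, alpha > 0.
alpha = np.abs(W).mean()
T = np.clip(np.round(W / alpha), -1, 1)

# The positive scale alpha cannot flip a sign, so the chamber assignment
# computed from the quantized weights alpha * T equals the one from T alone.
chamber_quantized = np.sign((alpha * T) @ x)
chamber_ternary = np.sign(T @ x)
assert np.array_equal(chamber_quantized, chamber_ternary)
```

This scale-invariance is the easy half of the compatibility argument; whether $\operatorname{sign}(Tx)$ matches $\operatorname{sign}(Wx)$ for the full-precision $W$ is an empirical question the paper's experiments address.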
571