grapheneaffiliates committed
Commit 18393bc · verified · 1 Parent(s): 59aa889

Upload README.md with huggingface_hub

Files changed (1): README.md (+66, -1350)
1
- ---
2
- license: mit
3
- language:
4
- - en
5
- library_name: pytorch
6
- tags:
7
- - geometric-attention
8
- - h4-polytope
9
- - ternary-quantization
10
- - bitnet
11
- - cpu-inference
12
- - rag
13
- - coxeter-chambers
14
- - e8-lattice
15
- - lila-e8
16
- - project-olympus
17
- - smollm3
18
- - continuous-learning
19
- - percepta
20
- datasets:
21
- - roneneldan/TinyStories
22
- metrics:
23
- - perplexity
24
- pipeline_tag: text-generation
25
- ---
26
-
27
- # H4 Polytopic Attention
28
-
29
- **4D geometric attention with O(log t) queries via Coxeter chamber navigation and E8 lattice-indexed RAM**
30
-
31
- A transformer system that replaces standard softmax attention with 4D attention heads built on the H4 polytope (600-cell), using the E8 lattice as a memory backend. Includes a deterministic executor, a trainable hybrid architecture (frozen H4 geometry + learnable adapters), ternary quantization (BitNet b1.58), and a unified geometric RAG system where the same E8->H4 projection handles both document retrieval and attention. The golden ratio appears at every level of the architecture.
32
-
33
- **Author:** Timothy McGirl
34
- **Repository:** [grapheneaffiliate/h4-polytopic-attention](https://github.com/grapheneaffiliate/h4-polytopic-attention)
35
-
36
- ---
37
-
38
- ## Table of Contents
39
-
40
- 1. [Overview](#overview)
41
- 2. [Why H4?](#why-h4)
42
- 3. [Architecture](#architecture)
43
- 4. [The Seven Phases](#the-seven-phases)
44
- 5. [Mathematical Foundation](#mathematical-foundation)
45
- 6. [Directory Structure](#directory-structure)
46
- 7. [Installation](#installation)
47
- 8. [Usage](#usage)
48
- 9. [Instruction Set Architecture](#instruction-set-architecture)
49
- 10. [Benchmarks](#benchmarks)
50
- 11. [MCP Server Integration](#mcp-server-integration)
51
- 12. [API Reference](#api-reference)
52
- 13. [Theory Deep Dive](#theory-deep-dive)
53
- 14. [Autoresearch Results](#autoresearch-results)
54
- 15. [Citation](#citation)
55
-
56
- ---
57
-
58
- ## Overview
59
-
60
- Standard transformers use softmax attention with O(t) cost per query over t cached tokens. H4 Polytopic Attention replaces this with a geometric data structure --- the Coxeter chamber tree of the H4 reflection group --- that achieves O(log t) max-dot-product queries in 4D.
61
-
62
- The system functions as a complete virtual machine:
63
-
64
- - **Attention heads = RAM lookup** (4D, O(log t) via ChamberTree)
65
- - **FFN layers = ALU** (instruction decode + execute)
66
- - **Execution trace = token sequence** (each step is one token)
67
- - **E8 lattice = hierarchical memory** (8D Voronoi cell addressing)
68
-
69
- The key architectural insight is that memory addressing (8D, E8 lattice) and attention queries (4D, H4 chambers) are unified through the E8 -> H4 projection, which uses the Coxeter element eigenvalues cos(pi/5) = phi/2. They are not two separate systems bolted together --- they share the same golden-ratio geometry.
70
-
71
- ### What this IS
72
-
73
A complete, working system --- not a research prototype. Phases 1-4 proved geometric attention works as a deterministic computer. Phase 5 made it trainable. Phase 6 made it ternary (1.58-bit weights, 17x compression). Phase 7 unified retrieval and generation into a single geometric RAG pipeline. An 8-hour overnight training run on CPU produced a 24M-parameter ternary model that generates coherent English at perplexity 10.0, beating the published TinyStories-33M baseline with fewer parameters.
74
-
75
- ### What this is NOT
76
-
77
- This is not llama.cpp. Projects like llama.cpp and ollama take GPU-designed models and run them slowly on CPU. This architecture is **designed for CPU from the ground up**. The ChamberTree replaces the operation GPUs are best at (parallel matmul) with the operation CPUs are best at (branching tree traversal). Ternary weights replace float multiply-accumulate with integer add/subtract. The frozen geometric backbone means most of the model is static lookup tables, not learned weights. The model would actually run *slower* on a GPU because GPUs are bad at tree traversal. That's the inversion nobody else has.
78
-
79
- ---
80
-
81
- ## Why H4?
82
-
83
- The H4 reflection group is the largest finite reflection group in 4D, with |W(H4)| = 14,400 elements. Its associated polytope, the 600-cell, has 120 vertices on S3. This is the sweet spot for attention heads:
84
-
85
- | Dimension | Structure | Symmetries | Bits/query | Hull vertices |
86
- |-----------|-----------|------------|------------|---------------|
87
- | 2D (Percepta) | S1 circle | SO(2), continuous | ~1 | O(sqrt(t)) |
88
- | **4D (H4)** | **S3 3-sphere** | **W(H4) = 14,400** | **~2** | **14,400 chambers** |
89
- | 8D | E8 lattice | W(E8) = 696,729,600 | - | Memory backend |
90
-
91
- The 4D choice is optimal because:
92
-
93
- 1. **S3 has the Hopf fibration** (pi_3(S3) = Z), enabling hierarchical selection that S1 cannot express
94
- 2. **A single [f64; 4] fits in one 256-bit AVX2 register** --- zero wasted SIMD lanes
95
- 3. **H4 connects to E8** via the projection E8 -> H4, unifying attention with memory
96
- 4. **The golden ratio phi = (1+sqrt(5))/2** appears in every vertex coordinate, connecting the geometry to Fibonacci-spaced checkpoints and phi-recursive state encoding
97
-
98
- ---
99
-
100
- ## Architecture
101
-
102
- ```
103
- Program (ISA instructions)
104
- |
105
- v
106
- +-------------------+
107
- | State Encoder | Encode (IP, registers, opcode, step)
108
- | (d_model = 32) | as a d_model-dimensional vector
109
- +-------------------+
110
- |
111
- d_model vector
112
- |
113
- +---------------+---------------+
114
- | |
115
- v v
116
- +-------------------+ +--------------------+
117
- | H4 Attention Heads | | E8 Lattice Memory |
118
- | (4D per head) | | (8D Voronoi cells) |
119
- | ChamberTree lookup | | 240 kissing nbrs |
120
- | O(log t) queries | | O(1) bucket decode |
121
- +-------------------+ +--------------------+
122
- | cos(pi/5) |
123
- | = phi/2 projection |
124
- +---------------+---------------+
125
- |
126
- v
127
- +-------------------+
128
- | FFN Layers | Instruction decode + ALU
129
- | (opcode -> op) | (ADD, SUB, MUL, STORE, ...)
130
- +-------------------+
131
- |
132
- v
133
- Next execution state
134
- ```
135
-
136
- ### Data flow for a single execution step
137
-
138
- 1. **Encode** the current state (instruction pointer, register file, opcode, operands, step counter) as a d_model-dimensional vector using golden-angle spirals on S3
139
- 2. **Attention** queries look back at the execution trace via H4 ChamberTree (O(log t) per head)
140
- 3. **E8 memory** operations (STORE_MEM / LOAD_MEM) use Voronoi cell bucketing with 240-neighbor shell traversal
141
- 4. **FFN** decodes the opcode and executes the instruction (ADD, SUB, MUL, etc.)
142
- 5. **Append** the state vector to the execution trace and repeat
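
The loop above can be illustrated with a toy, self-contained NumPy sketch (not the repo's `H4Executor`): each step is encoded on a golden-angle spiral, the trace is queried by maximum dot product, and the new state is appended. The real system replaces the brute-force scan in step 2 with the ChamberTree.

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2

def encode_step(step: int, d_model: int = 32) -> np.ndarray:
    """Toy golden-angle spiral encoding of a step index (illustrative only)."""
    angles = step * (2 * np.pi / PHI) * np.arange(1, d_model // 2 + 1)
    v = np.concatenate([np.cos(angles), np.sin(angles)])
    return v / np.linalg.norm(v)

trace = []
for step in range(16):
    q = encode_step(step)
    if trace:
        # Attention-as-lookup: max dot product over the cached trace.
        # This linear scan is O(t); the ChamberTree reduces it to O(log t).
        best = max(range(len(trace)), key=lambda i: float(q @ trace[i]))
    trace.append(q)

print(f"{len(trace)} states in the execution trace")
```
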
143
-
144
- ---
145
-
146
- ## The Seven Phases
147
-
148
- ### Phase 1: H4 Geometry and Attention (Python PoC + Rust)
149
-
150
- The foundation: 600-cell vertex generation, Coxeter chamber navigation, O(log t) max-dot-product queries via hierarchical bucketing.
151
-
152
- **Key files:** `h4_polytopic_attention.py`, `h4.rs`, `vec4.rs`, `chamber_tree.rs`, `attention.rs`
153
-
154
- **What was proven:** 4D attention heads are quadratically more expressive than 2D heads, and the ChamberTree gives O(log t) queries with (5/16)^3 ~ 3% scan ratio at 3 levels.
155
-
156
- ### Phase 2: Weight Compiler
157
-
158
- Analytical construction of transformer weights that execute programs. No training --- weights are computed directly from the H4 geometry and the instruction set.
159
-
160
- **Key file:** `weight_compiler.py`
161
-
162
- **ISA:** LOAD, ADD, SUB, MUL, STORE, JMP, JNZ, HALT
163
-
164
- **Head allocation (8 heads):**
165
- - Heads 0-1: Instruction pointer lookup (find matching IP in history)
166
- - Heads 2-3: Register file access (find register state)
167
- - Heads 4-5: Operand fetch (fetch operand values from trace)
168
- - Heads 6-7: Control flow (branch prediction via Coxeter chamber)
169
-
170
- ### Phase 3: MCP Server + Hybrid LLM
171
-
172
- Exposes the executor as an MCP server for Claude Code. Claude handles reasoning; the H4 executor handles exact computation. Uses Max plan OAuth --- zero extra cost.
173
-
174
- **Key files:** `h4_mcp_server.py`, `hybrid_llm.py`
175
-
176
- **Tools exposed:** `h4_fibonacci`, `h4_compile_and_run`, `h4_geometry_info`, `h4_benchmark`, `h4_lattice_memory`
177
-
178
- ### Phase 4: E8 Lattice Memory
179
-
180
- The full implementation of lattice-indexed RAM. Memory operations use E8 Voronoi cell decoding for O(1) bucket addressing, with the E8 -> H4 projection unifying memory access and attention geometry.
181
-
182
- **Key files (Rust):** `vec8.rs`, `e8_lattice.rs`, `lattice_memory.rs`
183
- **Key files (Python):** Upgraded `E8LatticeIndex` in `h4_polytopic_attention.py`, `STORE_MEM`/`LOAD_MEM` in `weight_compiler.py`
184
-
185
- **New ISA opcodes:** STORE_MEM (R[a] -> E8 cell at addr R[b]), LOAD_MEM (E8 cell at addr R[a] -> R[dest])
186
-
187
- ### Phase 5: Trainable Hybrid Attention (PyTorch)
188
-
189
- Extends the frozen-backbone architecture to trainable token sequence modeling. The frozen H4 geometry provides spatial partitioning; learned adapters handle projection and weighting.
190
-
191
- **Key files:**
192
- - `h4_hybrid_attention.py` --- H4AttentionLayer (drop-in attention replacement) + H4TransformerBlock
193
- - `h4_language_model.py` --- Full LM: token embedding + golden-angle positional encoding + N x H4TransformerBlock + LM head
194
- - `train_cpu.py` --- Autoresearch training script (2-min CPU budget per experiment)
195
- - `benchmark_h4_vs_softmax.py` --- Speed/quality comparison at various context lengths
196
- - `utils/phi_positional.py` --- Golden-angle positional encoding using phi-inverse spacing
197
- - `utils/chamber_index.py` --- PyTorch-compatible ChamberTree bridge for top-k candidate filtering
198
-
199
- **Architecture (per H4AttentionLayer):**
200
-
201
- | Component | Type | Description |
202
- |-----------|------|-------------|
203
- | 600-cell vertices (120 x 4) | Frozen buffer | H4 polytope geometry |
204
- | H4 simple roots (4 x 4) | Frozen buffer | Coxeter reflection hyperplanes |
205
- | E8->H4 projection (4 x 8) | Frozen buffer | Golden-ratio eigenvalue projection |
206
- | W_q_proj, W_k_proj | Trainable | Project d_model -> H4 query/key space (R^4 per head) |
207
- | W_v_proj | Trainable | Project d_model -> value space |
208
- | W_nudge (n_heads x 4 x 4) | Trainable | Per-head query rotation in H4 space |
209
- | chamber_bonus (n_heads x 16) | Trainable | Per-head, per-chamber attention bias on keys |
210
- | W_out | Trainable | Output projection back to d_model |
211
-
212
- **What was proven:**
213
- - All gradients flow through trainable components (nudge, projections, chamber bonus)
214
- - W_nudge dominant directions align 96.5% with 600-cell vertices after training --- geometry attracts learning
215
- - Chamber entropy stays high (2.33/2.77 max) --- model uses the full geometric partition, not collapsing
216
- - ChamberTree scan ratio scales logarithmically: 43.6% at T=128, 3.1% at T=2048 (halves per doubling)
217
- - Python ChamberTree has high constant factors; Rust implementation needed for wall-clock advantage
218
-
219
- **Autoresearch loop** (`h4_program.md`): Autonomous experiment protocol where Claude Code iterates on the trainable adapters while the frozen geometry remains fixed. 2-minute CPU budget per experiment, ~24 experiments per overnight run.
220
-
221
- ### Phase 6: BitNet b1.58 Integration (Ternary Weights)
222
-
223
- Quantizes all trainable projections to ternary {-1, 0, +1} via BitNet b1.58's absmean method. The frozen geometry stays float32 (static lookup tables). Forward pass uses straight-through estimator (STE) for gradient flow.
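
A minimal PyTorch sketch of the absmean + STE scheme described above (simplified relative to the repo's `bitlinear.py`; activation quantization is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Illustrative BitNet b1.58 linear layer: absmean ternary weights + STE."""

    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)             # absmean scale
        w_q = torch.clamp(torch.round(w / scale), -1, 1)   # ternary {-1, 0, +1}
        # Straight-through estimator: ternary forward, full-precision backward.
        w_ste = w + (w_q * scale - w).detach()
        return F.linear(x, w_ste, self.bias)

layer = BitLinearSketch(64, 32)
out = layer(torch.randn(8, 64))
out.sum().backward()   # gradients reach the float weights through the STE
```
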
224
-
225
- **Key files:**
226
- - `bitlinear.py` --- BitLinear drop-in replacement for nn.Linear with STE training
227
- - `ternary_diagnostics.py` --- Chamber preservation tests, weight structure analysis, size comparison
228
- - `export_ternary.py` --- Export trained model with frozen ternary weights for deployment
229
-
230
- **What stays float32:** 600-cell vertices, simple roots, E8 projection (frozen buffers), chamber_bonus (too small to quantize), embeddings, layer norms, LM head.
231
-
232
- **What becomes ternary:** W_q_proj, W_k_proj, W_v_proj, W_out (attention projections), FFN layers (the bulk of parameters).
233
-
234
- **What was proven (initial verification):**
235
- - **Chamber preservation 97.9% at initialization** --- ternary barely perturbs near-identity nudge weights
236
- - **Geo alignment 96.7%** --- unchanged from float (0.965 vs 0.967), geometry survives ternary
237
- - **STE gradients verified** --- all trainable parameters receive healthy gradients through the quantization barrier
238
-
239
- **What was proven (autoresearch, 30 experiments):**
240
- - **0.003 bpb gap** --- val_bpb 0.065 (ternary, d_model=256) vs 0.062 (float, d_model=128) after autonomous hyperparameter search
241
- - **~17x compression** --- trainable weights: ~310 KB ternary vs ~1.4 MB float32
242
- - **BitNet 2x-width scaling law confirmed** --- doubling d_model from 128 to 256 closed the gap from 0.025 to 0.003
243
- - **Chamber preservation 76.2% after training** --- ternary model finds its own geometric routing, different from but equally effective as float
244
- - **LR cliff at 70% chamber preservation** --- below this threshold, routing becomes too noisy and quality degrades
245
- - **Dropout=0 is optimal** --- frozen geometric backbone acts as the regularizer
246
-
247
- **Inference path after ternary:** Token embeddings -> ternary projections (add/sub only) -> S3 normalize -> ternary nudge -> ChamberTree (sign comparisons, 3.1% scan) -> softmax over candidates -> ternary FFN (add/sub only) -> next token. Only float multiplies: root dot products (4x4) and softmax.
248
-
249
- ### Phase 7: Unified Geometric RAG (Retrieval + Ranking)
250
-
251
- The same E8->H4 projection that routes attention also indexes and retrieves documents. The H4 bi-encoder handles geometric retrieval (R@5=100%); a pre-trained cross-encoder handles precision reranking (R@1=98.5%). Combined: **98.5% accuracy on document search, no GPU, no API, $0/month.**
252
-
253
- **Key files:**
254
- - `rag/encoder.py` --- Encode documents into E8 lattice memory via golden-angle spiral embeddings
255
- - `rag/pipeline.py` --- End-to-end QA pipeline (retrieve + generate), CPU only
256
- - `rag/ranking_model.py` --- H4 bi-encoder: score (question, passage) in H4 geometric space
257
- - `rag/train_ranker.py` --- Bi-encoder contrastive training (InfoNCE) on SQuAD
258
- - `rag/cross_encoder.py` --- H4 cross-encoder reranker (joint question+passage attention)
259
- - `rag/train_cross_encoder.py` --- Cross-encoder fine-tuning with LM backbone
260
- - `rag/eval_rerankers.py` --- Head-to-head: H4 vs pre-trained MiniLM reranker
261
- - `rag/tokenizer.py` --- BPE tokenizer (tiktoken GPT-2 base, restricted vocab)
262
- - `rag/demo.py` --- Interactive CLI demo: point at documents, ask questions
263
- - `rag/cost_benchmark.py` --- H4 CPU vs GPU vs API cost comparison
264
-
265
- **How the production system works:**
266
- 1. Documents encode into 8D E8 Voronoi cells via `H4DocumentEncoder`
267
- 2. Questions project through E8->H4 (cos(pi/5) = phi/2) for geometric retrieval
268
- 3. H4 bi-encoder retrieves top-5 candidates --- **R@5 = 100%, 20ms** (the answer is always in the results)
269
- 4. Pre-trained cross-encoder (MiniLM-L6) reranks top-5 --- **R@1 = 98.5%, ~500ms**
270
- 5. Best candidate returned with source attribution
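
A minimal sketch of step 4 using an off-the-shelf cross-encoder via the `sentence-transformers` library (the checkpoint name below is an assumption; the repo's head-to-head comparison lives in `rag/eval_rerankers.py`):

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint: any MS MARCO MiniLM cross-encoder (~22M params) runs fine on CPU.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, top5_passages: list) -> str:
    """Rerank the H4 bi-encoder's top-5 candidates and return the best passage."""
    scores = reranker.predict([(question, p) for p in top5_passages])
    return top5_passages[int(scores.argmax())]
```
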
271
-
272
- **Head-to-head reranker comparison (same candidates from H4 bi-encoder):**
273
-
274
- | Reranker | R@1 | Params | Notes |
275
- |----------|-----|--------|-------|
276
- | Random baseline | 20.0% | --- | Chance on 5 candidates |
277
- | H4 cross-encoder (overnight) | **80% peak** (69% final) | 25M ternary | 5.9K SQuAD pairs, 8h CPU |
278
- | **Pre-trained MiniLM-L6** | **98.5%** | **22M float** | **Trained on 500K+ MS MARCO pairs** |
279
-
280
- The H4 geometric retrieval does the hard, novel part (finding the right documents via E8 lattice + ChamberTree, O(log t), ternary, CPU). The pre-trained model does the easy, proven part (picking the best one from 5 candidates). Our contribution is the retrieval geometry; the reranking uses proven off-the-shelf technology.
281
-
282
- **Autoresearch findings (12 bi-encoder experiments):**
283
- - Temperature is the dominant hyperparameter for ternary contrastive learning (0.15 optimal, 2x float default)
284
- - Bi-encoder R@1 peaks at ~40% regardless of scale (architectural ceiling, not training ceiling)
285
- - Bi-encoder R@5 = 100% at 3.7M params --- perfect retrieval
286
- - E8 lattice retrieval: 7.8ms per query, 240-neighbor Voronoi search
287
-
288
- **Full-scale language generation (overnight, 8 hours CPU):**
289
- - **Perplexity 10.0** on TinyStories (24M ternary params, d_model=512, 8 layers)
290
- - Beats TinyStories-33M published baseline (~15 PPL) at fewer params with ternary weights
291
- - Generates coherent stories: *"Once upon a time, there was a lazy cat named Tom. Tom liked to sleep all day..."*
292
-
293
- ---
294
-
295
- ## Mathematical Foundation
296
-
297
- ### The Golden Ratio
298
-
299
- phi = (1 + sqrt(5)) / 2 = 1.6180339887...
300
-
301
- It appears at every level:
302
-
303
- | Where | How |
304
- |-------|-----|
305
- | 600-cell vertices | Coordinates contain phi/2 and 1/(2*phi) |
306
- | Coxeter eigenvalues | cos(pi/5) = phi/2, cos(2*pi/5) = 1/(2*phi) |
307
- | E8 -> H4 projection | 4x8 matrix built from cos(k*pi/5) rotation blocks |
308
- | State encoding | Golden-angle spiral for well-separated IP directions |
309
- | Checkpoint spacing | Fibonacci-indexed levels grow with base phi |
310
- | ChamberTree rotation | Level angles: 0, pi/5, pi/5 * phi |
311
- | Attention scaling | Queries scaled by phi: q_vec * phi |
312
-
313
- ### The 600-Cell
314
-
315
- 120 vertices on the unit 3-sphere S3, in three orbits:
316
-
317
- ```
318
- Orbit 1: 8 vertices — permutations of (+-1, 0, 0, 0)
319
- Orbit 2: 16 vertices — (+-1/2, +-1/2, +-1/2, +-1/2)
320
- Orbit 3: 96 vertices — even permutations of (0, +-1/2, +-phi/2, +-1/(2*phi))
321
- Total: 120 vertices
322
- ```
323
-
324
- The dot products between any two vertices take exactly 8 distinct values:
325
-
326
- ```
327
- {-1, -phi/2, -1/2, -1/(2*phi), 0, 1/(2*phi), 1/2, phi/2}
328
- ```
329
-
330
Five of these (+-phi/2, +-1/(2*phi), and -1) are cosines of multiples of pi/5, reflecting the pentagonal symmetry of H4; the remaining values 0 and +-1/2 come from orthogonal and 60/120-degree vertex pairs.
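
Both the vertex count and the dot-product spectrum can be checked directly; a short NumPy sketch, independent of the repo's `h4.rs` / `h4_polytopic_attention.py`:

```python
import itertools
import numpy as np

PHI = (1 + 5 ** 0.5) / 2

def _parity(p):
    """+1 for even permutations, -1 for odd."""
    p, sign = list(p), 1
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]
            p[i], p[j] = p[j], p[i]
            sign = -sign
    return sign

def cell600_vertices() -> np.ndarray:
    verts = set()
    # Orbit 1: 8 permutations of (+-1, 0, 0, 0)
    for i, s in itertools.product(range(4), (1.0, -1.0)):
        v = [0.0] * 4
        v[i] = s
        verts.add(tuple(v))
    # Orbit 2: 16 vertices (+-1/2, +-1/2, +-1/2, +-1/2)
    verts.update(itertools.product((0.5, -0.5), repeat=4))
    # Orbit 3: 96 even permutations of (0, +-1/2, +-phi/2, +-1/(2*phi))
    for p in itertools.permutations(range(4)):
        if _parity(p) != 1:
            continue
        for s1, s2, s3 in itertools.product((1, -1), repeat=3):
            vals = (0.0, 0.5 * s1, (PHI / 2) * s2, (1 / (2 * PHI)) * s3)
            verts.add(tuple(round(vals[k], 12) for k in p))
    return np.array(sorted(verts))

V = cell600_vertices()
assert len(V) == 120                                    # 8 + 16 + 96 vertices on S^3
dots = {round(float(a @ b), 6) for a, b in itertools.combinations(V, 2)}
print(sorted(dots))                                     # exactly 8 distinct values
```
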
331
-
332
- ### The H4 Reflection Group
333
-
334
- The 4 simple roots of H4 define reflection hyperplanes that partition S3 into Coxeter chambers:
335
-
336
- ```
337
- alpha_1 = (1, -1, 0, 0) / sqrt(2)
338
- alpha_2 = (0, 1, -1, 0) / sqrt(2)
339
- alpha_3 = (0, 0, 1, 0)
340
- alpha_4 = (-1/2, -1/2, -1/2, (-1/(2*phi) + phi/2)) / norm
341
- ```
342
-
343
- The sign pattern of a vector's dot products with these 4 roots gives a 4-bit bucket index (0-15), partitioning S3 into 16 regions.
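
A minimal sketch of this level-0 bucketing (illustrative; the repo's versions live in `chamber_tree.rs` and `utils/chamber_index.py`):

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2

def h4_simple_roots() -> np.ndarray:
    """The four simple roots quoted above, each normalized to unit length."""
    roots = np.array([
        [1.0, -1.0, 0.0, 0.0],
        [0.0, 1.0, -1.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [-0.5, -0.5, -0.5, PHI / 2 - 1 / (2 * PHI)],
    ])
    return roots / np.linalg.norm(roots, axis=1, keepdims=True)

def chamber_bucket(v: np.ndarray, roots: np.ndarray) -> int:
    """4-bit bucket index from the sign pattern of <v, alpha_i>, i = 1..4."""
    signs = (v @ roots.T) >= 0.0
    return int((signs.astype(int) * np.array([1, 2, 4, 8])).sum())

q = np.random.default_rng(0).standard_normal(4)
q /= np.linalg.norm(q)                       # put the query on S^3
print(chamber_bucket(q, h4_simple_roots()))  # one of the 16 level-0 buckets
```
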
344
-
345
- ### ChamberTree: 3-Level Hierarchical Bucketing
346
-
347
- ```
348
- Level 0: 16 buckets (4 root splits, original roots)
349
- Level 1: 16 x 16 = 256 sub-buckets (roots rotated by pi/5)
350
- Level 2: 256 x 16 = 4,096 leaf buckets (roots rotated by pi/5 * phi)
351
- ```
352
-
353
- **Exact query:** visits all 16 buckets at each level (full scan)
354
- **Approximate query:** visits primary + 4 Hamming-1 neighbors = 5/16 per level
355
- - Over 3 levels: (5/16)^3 = 3.05% of keys scanned
356
- - Effective complexity: O(log t) for t cached entries
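
The 5/16 expansion per level and its 3-level product (the numbers quoted above) in a couple of lines:

```python
def hamming1_buckets(bucket: int) -> list:
    """Primary bucket plus the 4 buckets whose 4-bit index differs in one sign."""
    return [bucket] + [bucket ^ (1 << i) for i in range(4)]

print(hamming1_buckets(0b1010))  # [10, 11, 8, 14, 2] -> 5 of 16 buckets per level
print((5 / 16) ** 3)             # 0.0305... -> ~3% of keys scanned over 3 levels
```
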
357
-
358
- ### The E8 Lattice
359
-
360
- The densest sphere packing in 8D (Viazovska 2016). Decomposes as:
361
-
362
- ```
363
- E8 = D8 ∪ (D8 + [1/2]^8)
364
-
365
- where D8 = { x in Z^8 : x_1 + x_2 + ... + x_8 = 0 (mod 2) }
366
- ```
367
-
368
- **Closest-lattice-point decoder:** Given any point in R8, find the nearest E8 lattice point in O(1):
369
- 1. Round to nearest D8 point (integers with even sum, parity correction)
370
- 2. Round to nearest D8 + [1/2]^8 point (half-integers with even sum)
371
- 3. Return whichever is closer
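
A minimal NumPy sketch of this decoder, assuming the standard two-coset construction (the repo's implementations are `e8_lattice.rs` and `E8LatticeIndex.decode_to_lattice`):

```python
import numpy as np

def _decode_d8(x: np.ndarray) -> np.ndarray:
    """Nearest point of D8 = {z in Z^8 : sum(z) even} to x."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity fix: re-round the coordinate with the largest rounding error.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def decode_e8(x: np.ndarray) -> np.ndarray:
    """Closest E8 lattice point to x, checking both cosets D8 and D8 + [1/2]^8."""
    c0 = _decode_d8(x)
    c1 = _decode_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

x = np.random.default_rng(1).standard_normal(8)
print(decode_e8(x))   # all-integer or all-half-integer coordinates, even coordinate sum
```
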
372
-
373
- **Kissing number = 240:** Each lattice point has exactly 240 nearest neighbors, in two orbits:
374
- - 112 vectors: +-e_i +- e_j for i < j (pairs of unit vectors)
375
- - 128 vectors: (+-1/2)^8 with even number of minus signs
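
Both orbits can be enumerated and checked in a few lines (illustrative; the repo's `kissing_vectors()` lives in `e8_lattice.rs`):

```python
import itertools
import numpy as np

shell = []
# 112 vectors: +-e_i +- e_j for i < j
for i, j in itertools.combinations(range(8), 2):
    for si, sj in itertools.product((1.0, -1.0), repeat=2):
        v = np.zeros(8)
        v[i], v[j] = si, sj
        shell.append(v)
# 128 vectors: (+-1/2)^8 with an even number of minus signs
for signs in itertools.product((0.5, -0.5), repeat=8):
    if sum(s < 0 for s in signs) % 2 == 0:
        shell.append(np.array(signs))

assert len(shell) == 240                            # the E8 kissing number
assert all(np.isclose(v @ v, 2.0) for v in shell)   # minimal vectors: norm^2 = 2
```
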
376
-
377
- ### E8 -> H4 Projection
378
-
379
- The 4x8 projection matrix uses rotation blocks built from the Coxeter element eigenvalues:
380
-
381
- ```
382
- [ cos(pi/5) sin(pi/5) cos(2pi/5) sin(2pi/5) 0 0 0 0 ]
383
- P = [-sin(pi/5) cos(pi/5) -sin(2pi/5) cos(2pi/5) 0 0 0 0 ]
384
- [ 0 0 0 0 cos(pi/5) sin(pi/5) cos(2pi/5) sin(2pi/5) ]
385
- [ 0 0 0 0 -sin(pi/5) cos(pi/5) -sin(2pi/5) cos(2pi/5) ]
386
-
387
- where cos(pi/5) = phi/2 = 0.80902...
388
- cos(2pi/5) = 1/(2*phi) = 0.30902...
389
- ```
390
-
391
- This is the same projection that connects E8 root systems to icosahedral symmetry in the GSM physics framework. It preserves the golden-ratio structure, ensuring that memory embeddings in 8D map cleanly to 4D attention queries.
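
A direct NumPy transcription of the matrix above, with the golden-ratio identities checked (illustrative; the repo builds this projection in `vec8.rs` and `h4_polytopic_attention.py`):

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2
c1, s1 = np.cos(np.pi / 5), np.sin(np.pi / 5)            # cos(pi/5)  = phi/2
c2, s2 = np.cos(2 * np.pi / 5), np.sin(2 * np.pi / 5)    # cos(2pi/5) = 1/(2*phi)
assert np.isclose(c1, PHI / 2) and np.isclose(c2, 1 / (2 * PHI))

block = np.array([[ c1, s1,  c2, s2],
                  [-s1, c1, -s2, c2]])
P = np.zeros((4, 8))
P[0:2, 0:4] = block
P[2:4, 4:8] = block

v8 = np.random.default_rng(2).standard_normal(8)
v4 = P @ v8    # 8D memory embedding -> 4D attention query
```
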
392
-
393
- ### Phi-Recursive State Encoding
394
-
395
- Long execution traces are compressed using Fibonacci-indexed checkpoint levels:
396
-
397
- ```
398
- Level 0: stores every step (F_1 = 1 apart)
399
- Level 1: stores every phi steps (F_2 = 1 apart)
400
- Level 2: stores every phi^2 steps (F_3 = 2 apart)
401
- Level 3: stores every phi^3 steps (F_4 = 3 apart)
402
- ...
403
- Level k: stores every F_{k+1} steps
404
- ```
405
-
406
- **Storage:** O(t * log_phi(t)) instead of O(t^2)
407
- **Retrieval:** O(log_phi(t)) via Zeckendorf decomposition (every positive integer is a unique sum of non-consecutive Fibonacci numbers)
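
A minimal sketch of the greedy Zeckendorf decomposition behind that retrieval bound (per the API notes below, the same decomposition is used by `utils/phi_positional.py` for positions beyond its cache):

```python
def zeckendorf(n: int) -> list:
    """Greedy decomposition of n into non-consecutive Fibonacci numbers, largest first."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts

print(zeckendorf(100))   # [89, 8, 3] -> only O(log_phi t) checkpoint levels to touch
```
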
408
-
409
- ---
410
-
411
- ## Directory Structure
412
-
413
- ```
414
- h4-polytopic-attention/
415
- +-- README.md This file
416
- +-- RESULTS.md Autoresearch results (30 experiments)
417
- +-- h4_program.md Autonomous research protocol (Phases 5-7)
418
- +-- docs/
419
- | +-- PAPER.md Full arXiv paper draft
420
- | +-- ARCHITECTURE.md Detailed architecture guide
421
- | +-- h4_polytopic_attention_whitepaper.pdf Original whitepaper
422
- +-- python/
423
- | +-- h4_polytopic_attention.py Phase 1: Frozen geometry (600-cell, ChamberTree, E8)
424
- | +-- weight_compiler.py Phase 2: Analytical weights + H4Executor
425
- | +-- h4_mcp_server.py Phase 3: MCP server (5 tools)
426
- | +-- hybrid_llm.py Phase 3: Claude Agent SDK integration
427
- | +-- h4_hybrid_attention.py Phase 5: H4AttentionLayer + H4TransformerBlock
428
- | +-- h4_language_model.py Phase 5: Full LM architecture
429
- | +-- train_cpu.py Phase 5: Autoresearch training script (2-min CPU budget)
430
- | +-- benchmark_h4_vs_softmax.py Phase 5: Scaling comparison (Rust + Python + softmax)
431
- | +-- bitlinear.py Phase 6: BitLinear ternary {-1,0,+1} with STE
432
- | +-- ternary_diagnostics.py Phase 6: Chamber preservation + weight analysis
433
- | +-- export_ternary.py Phase 6: Export frozen ternary model
434
- | +-- prepare_data.py Data pipeline (synthetic, Shakespeare, TinyStories)
435
- | +-- baselines.py Softmax + linear attention baseline models
436
- | +-- compare_baselines.py Head-to-head comparison script
437
- | +-- utils/
438
- | +-- phi_positional.py Golden-angle positional encoding
439
- | +-- chamber_index.py ChamberTree bridge (Rust + Python fallback)
440
- | +-- rag/ Phase 7: Unified geometric RAG
441
- | +-- encoder.py Document encoding into E8 lattice memory
442
- | +-- pipeline.py End-to-end QA pipeline (retrieve + generate)
443
- | +-- ranking_model.py H4Ranker (contrastive scoring in H4 space)
444
- | +-- train_ranker.py Bi-encoder contrastive training on SQuAD
445
- | +-- cross_encoder.py Cross-encoder reranker (joint Q+P attention)
446
- | +-- train_cross_encoder.py Cross-encoder fine-tuning with LM backbone
447
- | +-- tokenizer.py BPE tokenizer (tiktoken, 4096 vocab)
448
- | +-- train_qa.py Generative QA training (F1 metric)
449
- | +-- prepare_qa.py SQuAD download + QA data preparation
450
- | +-- demo.py Interactive CLI demo
451
- | +-- cost_benchmark.py H4 CPU vs GPU vs API cost comparison
452
- +-- olympus/ Project Olympus: specialist system
453
- | +-- router.py Two-tier routing (100% accuracy)
454
- | +-- h4_swap.py Progressive H4 attention swap
455
- | +-- knowledge_index.py E8 lattice knowledge index (Wikipedia)
456
- | +-- train_specialist.py QLoRA training scaffold
457
- | +-- train_code_specialist.py Code specialist (training on GPU)
458
- | +-- train_math_specialist.py Math specialist (training on GPU)
459
- | +-- train_qa_specialist.py QA specialist (training on GPU)
460
- | +-- demo.py Interactive Olympus demo
461
- | +-- data/download_all.py Download all training data
462
- +-- PROJECT_OLYMPUS.md Full Olympus plan + legal audit
463
- +-- OLYMPUS_CONTINUOUS_LEARNING.md Self-improving system design
464
- +-- sample_docs/ Sample documents for RAG demo
465
- +-- checkpoints/ Saved model checkpoints
466
- +-- rust/
467
- | +-- Cargo.toml Dependencies: rayon, pyo3, numpy
468
- | +-- src/
469
- | +-- lib.rs PyO3 bridge: h4_rust Python module
470
- | +-- main.rs Benchmarks (Phases 1-4)
471
- | +-- vec4.rs 4D SIMD vector (AVX2-aligned)
472
- | +-- vec8.rs Phase 4: 8D E8 vector + projection
473
- | +-- h4.rs 600-cell generation + verification
474
- | +-- chamber_tree.rs 3-level ChamberTree with approx top-k
475
- | +-- attention.rs Multi-head H4 attention (rayon)
476
- | +-- e8_lattice.rs Phase 4: E8 decoder, 240 kissing vectors
477
- | +-- lattice_memory.rs Phase 4: Lattice-indexed RAM
478
- ```
479
-
480
- ---
481
-
482
- ## Installation
483
-
484
- ### Python
485
-
486
- ```bash
487
- # Clone the repository
488
- git clone https://github.com/grapheneaffiliate/h4-polytopic-attention.git
489
- cd h4-polytopic-attention
490
-
491
- # Install Python dependencies
492
- pip install numpy mcp
493
-
494
- # Phase 5 additionally requires PyTorch
495
- pip install torch
496
-
497
- # Run the proof-of-concept
498
- py python/h4_polytopic_attention.py
499
-
500
- # Run the weight compiler demo (Fibonacci)
501
- py python/weight_compiler.py
502
-
503
- # Run the Phase 5 training script (2-minute CPU budget)
504
- py python/train_cpu.py
505
-
506
- # Run the H4 vs softmax benchmark
507
- py python/benchmark_h4_vs_softmax.py
508
-
509
- # Run ternary diagnostics (Phase 6)
510
- py python/ternary_diagnostics.py
511
-
512
- # Train with ternary weights
513
- # Edit python/train_cpu.py: set USE_BITLINEAR = True
514
- py python/train_cpu.py
515
-
516
- # Phase 7: RAG — requires tiktoken for BPE tokenizer
517
- pip install tiktoken
518
-
519
- # Interactive RAG demo (point at any documents)
520
- py python/rag/demo.py --docs path/to/your/documents/
521
-
522
- # Train the passage ranker on SQuAD (10-minute CPU budget)
523
- py python/rag/train_ranker.py
524
-
525
- # Run cost benchmark (H4 CPU vs GPU vs API)
526
- py python/rag/cost_benchmark.py
527
- ```
528
-
529
- ### Rust
530
-
531
- ```bash
532
- cd rust
533
-
534
- # Build with optimizations (required for SIMD)
535
- cargo build --release
536
-
537
- # Run benchmarks (50k steps, ~2 minutes)
538
- cargo run --release
539
-
540
- # Run E8 lattice tests
541
- cargo test
542
-
543
- # Build the Python bridge (requires maturin)
544
- pip install maturin
545
- maturin develop --release
546
- # This installs h4_rust module — enables 10.6x speedup in Python benchmarks
547
- ```
548
-
549
- ### MCP Server (Claude Code Integration)
550
-
551
- Add to your Claude Code MCP settings (`~/.mcp.json` or settings):
552
-
553
- ```json
554
- {
555
- "mcpServers": {
556
- "h4-executor": {
557
- "command": "py",
558
- "args": ["C:/Users/atchi/h4-polytopic-attention/python/h4_mcp_server.py"]
559
- }
560
- }
561
- }
562
- ```
563
-
564
- After restarting Claude Code, you'll have access to `h4_fibonacci`, `h4_compile_and_run`, `h4_geometry_info`, `h4_benchmark`, and `h4_lattice_memory` tools.
565
-
566
- ---
567
-
568
- ## Usage
569
-
570
- ### Running Fibonacci through the H4 Executor (Python)
571
-
572
- ```python
573
- from weight_compiler import fibonacci_program, H4Executor
574
-
575
- prog = fibonacci_program(15) # Compute F(0)..F(16)
576
- executor = H4Executor(prog, d_model=32)
577
- result = executor.run(max_steps=200)
578
-
579
- print(f"F(16) = {int(result['registers'][1])}") # 987
580
- ```
581
-
582
- ### Using E8 Lattice Memory (Python)
583
-
584
- ```python
585
- from weight_compiler import Program, H4Executor
586
-
587
- prog = Program()
588
- prog.add("LOAD", a=42, dest=0) # R0 = 42
589
- prog.add("LOAD", a=100, dest=1) # R1 = 100 (address)
590
- prog.add("STORE_MEM", a=0, b=1) # mem[100] = 42 (via E8 lattice)
591
- prog.add("LOAD", a=0, dest=0) # R0 = 0 (clear)
592
- prog.add("LOAD_MEM", a=1, dest=0) # R0 = mem[100] (via E8 lookup)
593
- prog.add("HALT")
594
-
595
- executor = H4Executor(prog, d_model=32)
596
- result = executor.run(max_steps=50)
597
- print(f"R0 = {int(result['registers'][0])}") # 42
598
- print(f"Lattice stats: {result['lattice_memory']}")
599
- ```
600
-
601
- ### Querying the E8 Lattice Directly (Python)
602
-
603
- ```python
604
- from h4_polytopic_attention import E8LatticeIndex
605
- import numpy as np
606
-
607
- lattice = E8LatticeIndex()
608
-
609
- # Store 1000 values
610
- for i in range(1000):
611
- emb = np.random.randn(8)
612
- lattice.insert(emb, value=float(i), address=i)
613
-
614
- # Query: find nearest stored embedding
615
- query = np.random.randn(8)
616
- results = lattice.query_nearest(query, k=5)
617
- for dist, val, addr in results:
618
- print(f" dist^2={dist:.4f}, value={val}, addr={addr}")
619
-
620
- # Stats
621
- print(lattice.stats())
622
- # {'total_entries': 1000, 'occupied_cells': 412, 'utilization': 0.412, ...}
623
- ```
624
-
625
- ### Using the MCP Tools (Claude Code)
626
-
627
- Once the MCP server is configured, you can call tools directly:
628
-
629
- ```
630
- > Compute F(20) using the H4 executor
631
-
632
- Claude calls h4_fibonacci(n=20) -> returns F(21) = 10946, correct,
633
- computed in 147 steps through the 4D H4 attention transformer.
634
-
635
- > Run a program that stores 42 to memory address 7, then loads it back
636
-
637
- Claude calls h4_compile_and_run with STORE_MEM/LOAD_MEM instructions,
638
- returns registers showing the value was stored and retrieved via E8 lattice.
639
-
640
- > Show me the E8 lattice memory stats
641
-
642
- Claude calls h4_lattice_memory(action="info") -> returns kissing number 240,
643
- projection eigenvalues cos(pi/5) = phi/2, Voronoi cell structure.
644
- ```
645
-
646
- ### Using the H4 Hybrid Attention Layer (Phase 5, PyTorch)
647
-
648
- ```python
649
- import torch
650
- from h4_hybrid_attention import H4AttentionLayer, H4TransformerBlock
651
-
652
- # Drop-in attention layer
653
- layer = H4AttentionLayer(d_model=64, n_heads=8, d_value=16, top_k=32)
654
- x = torch.randn(2, 128, 64) # (batch, seq_len, d_model)
655
-
656
- # Full attention (for short sequences)
657
- out = layer(x, use_tree=False)
658
-
659
- # Tree-accelerated attention (O(log t) candidate filtering)
660
- out = layer(x, use_tree=True)
661
-
662
- # With geometric diagnostics
663
- out, diag = layer(x, use_tree=False, return_diagnostics=True)
664
- print(f"Chamber entropy: {diag['chamber_entropy']:.3f}")
665
- print(f"Nudge rank (per head): {diag['nudge_rank']}")
666
- print(f"Geo alignment (per head): {diag['geo_alignment']}")
667
- print(f"Scan ratio: {diag['scan_ratio']:.4f}")
668
- ```
669
-
670
- ### Training an H4 Language Model (Phase 5)
671
-
672
- ```python
673
import torch
from h4_language_model import H4LanguageModel
674
-
675
- model = H4LanguageModel(
676
- vocab_size=256, # character-level
677
- d_model=64,
678
- n_heads=8,
679
- n_layers=4,
680
- d_value=16,
681
- )
682
- print(model.count_params())
683
- # {'trainable': 116480, 'frozen': 0, 'buffers': 525401, 'total': 116480}
684
-
685
- # Forward pass
686
- input_ids = torch.randint(0, 256, (4, 128))
687
- logits = model(input_ids) # (4, 128, 256)
688
-
689
- # Autoregressive generation
690
- seed = torch.randint(0, 256, (1, 4))
691
- generated = model.generate(seed, max_new_tokens=100, temperature=0.8)
692
- ```
693
-
694
- ### Running the Autoresearch Loop (Phase 5)
695
-
696
- The `h4_program.md` defines an autonomous experiment protocol. Claude Code reads it and iterates:
697
-
698
- ```bash
699
- # Single experiment (2-minute budget)
700
- py python/train_cpu.py
701
-
702
- # Output includes parseable summary:
703
- # ---
704
- # val_bpb: 2.939237
705
- # chamber_entropy: 2.3290
706
- # avg_geo_alignment: 0.9652
707
- # num_params: 109760
708
- ```
709
-
710
- ### Geometric RAG Demo (Phase 7)
711
-
712
- ```bash
713
- # Interactive: point at documents, ask questions, get ranked passages
714
- py python/rag/demo.py --docs sample_docs/
715
-
716
- # Example output:
717
- # > What is the golden ratio?
718
- #
719
- # Rank 1 (score: 0.92): "The golden ratio, often denoted by phi..."
720
- # Source: golden_ratio.txt, chunk 0
721
- #
722
- # Retrieval: 7.8ms | Ranking: 12ms | Total: 20ms
723
- ```
724
-
725
- ```python
726
- # Programmatic usage
727
- from rag.pipeline import H4RAGPipeline
728
-
729
- pipeline = H4RAGPipeline(vocab_size=4096, stoi=stoi, itos=itos,
730
- d_model=128, n_layers=2, use_bitlinear=True)
731
- pipeline.index_directory('path/to/docs/')
732
- result = pipeline.answer("What is the golden ratio?")
733
- print(result.answer, result.sources, f"{result.total_time_ms:.0f}ms")
734
- ```
735
-
736
- ### Running Baseline Comparisons
737
-
738
- ```bash
739
- # Head-to-head: H4 vs softmax vs linear attention on Shakespeare
740
- py python/compare_baselines.py
741
-
742
- # Downloads Shakespeare automatically, trains all 4 configs (H4 float,
743
- # H4 ternary, softmax, linear) with identical model size and time budget,
744
- # prints ranked comparison table.
745
- ```
746
-
747
- ---
748
-
749
- ## Instruction Set Architecture
750
-
751
- | Opcode | Operands | Description | Encoding |
752
- |--------|----------|-------------|----------|
753
- | `LOAD` | a=imm, dest=reg | R[dest] = immediate value | 600-cell vertex[0] |
754
- | `ADD` | a=reg, b=reg, dest=reg | R[dest] = R[a] + R[b] | vertex[10] |
755
- | `SUB` | a=reg, b=reg, dest=reg | R[dest] = R[a] - R[b] | vertex[20] |
756
- | `MUL` | a=reg, b=reg, dest=reg | R[dest] = R[a] * R[b] | vertex[30] |
757
- | `STORE` | a=reg, dest=reg | R[dest] = R[a] (copy) | vertex[40] |
758
- | `JMP` | a=addr | IP = a | vertex[50] |
759
- | `JNZ` | a=reg, b=addr | if R[a] != 0: IP = b | vertex[60] |
760
- | `HALT` | - | Stop execution | vertex[70] |
761
- | `STORE_MEM` | a=reg, b=reg | mem[R[b]] = R[a] via E8 lattice | vertex[80] |
762
- | `LOAD_MEM` | a=reg, dest=reg | R[dest] = mem[R[a]] via E8 lattice | vertex[90] |
763
-
764
- **Registers:** 8 general-purpose (R0-R7), 64-bit floating point
765
-
766
- **State encoding:** Each opcode maps to a distinct 600-cell vertex, ensuring maximum angular separation between instruction types in 4D attention space.
767
-
768
- **Memory model:** STORE_MEM encodes the linear address as an 8D golden-angle spiral embedding, decodes it to the nearest E8 lattice point, and stores the value in that Voronoi cell. LOAD_MEM reverses the process, searching the primary cell plus 240 kissing neighbors.
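
A hypothetical sketch of the address path (the spiral construction here is an assumption for illustration, not the repo's exact encoding; the decode step is the nearest-point decoder described in the Mathematical Foundation section):

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2

def address_embedding(addr: int) -> np.ndarray:
    """Hypothetical 8D golden-angle spiral embedding of a linear memory address."""
    angles = addr * (2 * np.pi / PHI) * np.arange(1, 5)
    return np.concatenate([np.cos(angles), np.sin(angles)])

# STORE_MEM: embed the address, decode to the nearest E8 point, store in that Voronoi cell.
# LOAD_MEM:  embed the query address, search that cell plus its 240 kissing neighbors.
emb = address_embedding(100)
```
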
769
-
770
- ---
771
-
772
- ## Benchmarks
773
-
774
- ### Rust (50,000 steps, release build)
775
-
776
- | Benchmark | steps/s | vs Python |
777
- |-----------|---------|-----------|
778
- | Random keys, exact (all 16 buckets) | 120 | 4x |
779
- | Random keys, approx (5/16 buckets) | 1,080 | 32x |
780
- | Random keys, exact + rayon parallel | 354 | 10x |
781
- | Random keys, approx + rayon parallel | 2,743 | 81x |
782
- | Structured (Wasm-like), exact | 113 | 3x |
783
- | Structured (Wasm-like), approx | 950 | 28x |
784
- | Structured (Wasm-like), approx + parallel | 2,600 | 76x |
785
-
786
- **Python PoC baseline:** ~34 steps/s
787
- **Theoretical O(log t) speedup vs O(t):** 1,765x at 50,000 steps
788
-
789
- ### Phase 4: E8 Lattice Memory (Rust, 10,000 steps)
790
-
791
- | Operation | Rate | Hit rate |
792
- |-----------|------|----------|
793
- | Store (E8 decode + bucket insert + H4 project) | 189,753 ops/s | - |
794
- | Load (E8 decode + 240-neighbor scan) | 68,122 ops/s | 100% |
795
- | Unified query (E8 -> H4 projected attention) | 143,313 ops/s | 100% |
796
-
797
- **Lattice utilization:**
798
- - Occupied cells: 106 (of ~10,000 entries)
799
- - Max bucket size: 240 (= kissing number)
800
- - Primary cell hit rate: 100%
801
- - All kissing vector norms verified: norm^2 = 2
802
-
803
- ### Python MCP Server (via Claude Code)
804
-
805
- | Operation | Latency |
806
- |-----------|---------|
807
- | Encoding throughput | ~32,550 states/s |
808
- | Forward pass (50 steps) | 0.001s |
809
- | Forward pass (100 steps) | 0.003s |
810
- | Forward pass (250 steps) | 0.020s |
811
- | Forward pass (500 steps) | 0.083s |
812
-
813
- ### Phase 5: H4 vs Softmax Attention (Python, CPU)
814
-
815
- ChamberTree scan ratio (fraction of keys examined per query) at various sequence lengths:
816
-
817
- | seq_len | Softmax (ms) | H4 full (ms) | H4 tree (ms) | Scan ratio | Notes |
818
- |---------|-------------|--------------|--------------|------------|-------|
819
- | 64 | 0.5 | 1.1 | 1.1 | 100% | Tree not used (too short) |
820
- | 128 | 0.5 | 1.3 | 932 | 43.6% | Python overhead dominates |
821
- | 256 | 1.3 | 2.6 | 1758 | 23.4% | Scan ratio halving per doubling |
822
- | 512 | 5.2 | 8.0 | 3831 | 12.1% | Logarithmic pruning confirmed |
823
- | 1024 | 22.1 | 29.2 | 8821 | 6.2% | Continuing log scaling |
824
- | 2048 | 82.6 | 104.4 | 23434 | 3.1% | 97% of keys pruned |
825
-
826
- **Key finding:** The scan ratio scales as O(log t / t), confirming logarithmic candidate pruning works. The Python ChamberTree has high per-node overhead; the compiled Rust implementation delivers the wall-clock speedup (see below).
827
-
828
- ### Rust ChamberTree Wall-Clock Benchmarks (256 queries, k=32, amortized tree build)
829
-
830
- | n_keys | Exact Brute-Force (ms) | ChamberTree Approx (ms) | Speedup | Top-k Recall |
831
- |--------|----------------------|------------------------|---------|-------------|
832
- | 1,024 | 10.2 | 2.6 | 3.9x | 82.5% |
833
- | 4,096 | 34.6 | 5.2 | 6.7x | 91.1% |
834
- | 16,384 | 155.9 | 18.0 | 8.7x | 95.4% |
835
- | 65,536 | 760.2 | 71.6 | **10.6x** | **98.3%** |
836
-
837
- **Key findings:** Speedup increases with sequence length (O(log t) vs O(t)). Recall *improves* with length --- the opposite of most approximate attention methods. At 65K keys, the tree examines 3.1% of candidates and finds 98.3% of the true top-k.
838
-
839
- ### Shakespeare Head-to-Head (120s CPU training, same infrastructure)
840
-
841
- | Model | Attention | Params | Val Loss | Perplexity |
842
- |-------|-----------|--------|----------|-----------|
843
- | Softmax | O(t^2) | 797K | **2.329** | 10.3 |
844
- | Linear | O(t) | 797K | 2.332 | 10.3 |
845
- | H4 Float | O(log t) | 699K | 2.376 | 10.8 |
846
- | H4 Ternary | O(log t) + 1.58-bit | 699K | 2.394 | 11.0 |
847
-
848
- H4 is 2% behind softmax with 13% fewer parameters. The gap is largely throughput-driven at short sequences; at longer contexts the O(log t) advantage reverses this.
849
-
850
- ### Phase 7: SQuAD Passage Ranking (870K ternary params, 10-min CPU)
851
-
852
- | Metric | H4 Ranker (870K) | Random Chance | Improvement |
853
- |--------|-------------------|--------------|-------------|
854
- | Recall@1 | **41.5%** | 3.1% | 12x |
855
- | Recall@5 | **75.9%** | 15.6% | 5x |
856
- | MRR | **0.57** | 0.13 | 4x |
857
-
858
- The 870K model was the minimum viable proof. At scale (3.7M params, overnight training):
859
-
860
- | Metric | 870K (10 min) | 3.7M (overnight) | Notes |
861
- |--------|-------------|-----------------|-------|
862
- | R@1 | 41.5% | ~37% | Bi-encoder ceiling |
863
- | R@5 | 75.9% | **100%** | Never misses the answer |
864
- | MRR | 0.57 | **0.93** | Answer averages rank 1-2 |
865
-
866
- The bi-encoder's job is retrieval, not precision ranking. R@5=100% and MRR=0.93 means the answer is always in the results. A pre-trained cross-encoder (MiniLM-L6, 22M params) reranks the top-5 candidates to **98.5% R@1**. The H4 geometry handles retrieval; the pre-trained model handles precision.
867
-
868
- ### Phase 5: Initial Training Diagnostics (d_model=64, 2-min CPU)
869
-
870
- | Metric | Start | End (2 min) | Target |
871
- |--------|-------|-------------|--------|
872
- | val_loss | 3.20 | 1.96 | lower |
873
- | val_bpb | 4.62 | 2.94 | lower |
874
- | chamber_entropy | - | 2.33 / 2.77 | high (uniform chamber usage) |
875
- | W_nudge rank | 1.0 | 1.68 | high (rank-1 = focused direction) |
876
- | geo_alignment | - | 0.965 | > 0.9 (aligns with 600-cell) |
877
-
878
- *Note: These are initial verification results at d_model=64. The autoresearch loop subsequently found val_bpb=0.062 at d_model=128. See [RESULTS.md](RESULTS.md) for the full 30-experiment sweep.*
879
-
880
- ### Autoresearch: Float vs Ternary (30 experiments, autonomous)
881
-
882
- | | Float32 best | Ternary best |
883
- |---|-------------|-------------|
884
- | val_bpb | **0.062** | **0.065** |
885
- | Gap | - | 0.003 (4.7%) |
886
- | d_model | 128 | 256 |
887
- | Layers | 6 | 4 |
888
- | Compression | 1x | ~17x |
889
- | Chamber preservation | - | 76.2% |
890
- | Experiments | 16 | 13 |
891
-
892
- *Full results, methodology, and findings in [RESULTS.md](RESULTS.md).*
893
-
894
- ### Phase 6: Initial Chamber Preservation (10,000 random queries per head, at initialization)
895
-
896
- | Head | Preservation | Status |
897
- |------|-------------|--------|
898
- | 0-7 | 97.2% - 98.8% | All OK (>90% threshold) |
899
- | Mean | **97.9%** | Near-lossless at initialization |
900
-
901
- *Note: After training at LR=5e-3, chamber preservation drops to 76.2% as the ternary model finds its own optimal geometric routing. Quality is preserved (0.003 bpb gap). See [RESULTS.md](RESULTS.md) for the full chamber preservation analysis.*
902
-
903
- ---
904
-
905
- ## MCP Server Integration
906
-
907
- The MCP server exposes 5 tools to Claude Code:
908
-
909
- ### h4_fibonacci(n)
910
-
911
- Compute Fibonacci sequence F(0)..F(n+1) through the H4 transformer executor. Returns the result, correctness verification, execution steps, and full sequence.
912
-
913
- ### h4_compile_and_run(instructions, max_steps)
914
-
915
- Execute a custom program. Each instruction is `{opcode, a, b, dest}`. Returns final register state, step count, halt status, and lattice memory statistics.
916
-
917
- ### h4_geometry_info(aspect)
918
-
919
- Query H4 polytope geometry. Aspects: `vertices`, `chambers`, `dot_products`, `golden_ratio`, `all`.
920
-
921
- ### h4_benchmark(n_steps)
922
-
923
- Profile encoding throughput and forward pass timing at different trace lengths.
924
-
925
- ### h4_lattice_memory(action, n_entries)
926
-
927
- Phase 4 E8 lattice diagnostics:
928
- - `action="info"`: Return E8 lattice constants, projection eigenvalues, ISA opcodes
929
- - `action="benchmark"`: Store + load n_entries, return utilization stats
930
-
931
- ---
932
-
933
- ## API Reference
934
-
935
- ### Python
936
-
937
- #### `E8LatticeIndex`
938
-
939
- ```python
940
- class E8LatticeIndex:
941
- def __init__(max_cell_size=240)
942
- def decode_to_lattice(point: ndarray) -> tuple # R8 -> E8 lattice point
943
- def insert(embedding_8d, value, address=None) # Store in Voronoi cell
944
- def project_to_h4(embedding_8d) -> ndarray # 8D -> 4D projection
945
- def query_nearest(query_8d, k=1, search_neighbors=True) -> List
946
- def load_by_address(address) -> Optional[tuple] # Linear address lookup
947
- def stats() -> Dict # Utilization statistics
948
- ```
949
-
950
- #### `H4Executor`
951
-
952
- ```python
953
- class H4Executor:
954
- def __init__(program: Program, d_model=32)
955
- def execute_instruction() # Single ISA step
956
- def run(max_steps=1000) -> Dict # Full execution loop
957
- # Attributes:
958
- # .registers: ndarray[8] # Register file
959
- # .lattice_memory: E8LatticeIndex # Phase 4 RAM
960
- # .trace: List[ndarray] # Execution trace
961
- ```
962
-
963
- #### `Program`
964
-
965
- ```python
966
- class Program:
967
- def add(opcode: str, a=0, b=0, dest=0)
968
- # Opcodes: LOAD, ADD, SUB, MUL, STORE, STORE_MEM, LOAD_MEM, JMP, JNZ, HALT
969
- ```
970
-
971
- #### `H4AttentionLayer` (Phase 5, PyTorch)
972
-
973
- ```python
974
- class H4AttentionLayer(nn.Module):
975
- def __init__(d_model, n_heads=8, d_value=16, top_k=32, dropout=0.0,
976
- use_bitlinear=False)
977
- def forward(x, use_tree=True, return_diagnostics=False)
978
- # x: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
979
- # Frozen buffers: vertices (120x4), simple_roots (4x4), e8_h4_proj (4x8)
980
- # Trainable: W_q_proj, W_k_proj, W_v_proj, W_nudge, W_out, chamber_bonus
981
- # use_bitlinear=True swaps all Linear->BitLinear (ternary {-1,0,+1})
982
- ```
983
-
984
- #### `H4LanguageModel` (Phase 5, PyTorch)
985
-
986
- ```python
987
- class H4LanguageModel(nn.Module):
988
- def __init__(vocab_size, d_model=64, n_heads=8, n_layers=4, d_value=16,
989
- d_ffn=None, top_k=32, max_seq_len=8192, dropout=0.1,
990
- use_bitlinear=False)
991
- def forward(input_ids, use_tree=True, return_diagnostics=False)
992
- # input_ids: (batch, seq_len) -> logits: (batch, seq_len, vocab_size)
993
- def generate(input_ids, max_new_tokens=100, temperature=1.0, top_k_sample=0)
994
- def count_params() -> Dict # {'trainable': int, 'frozen': int, 'buffers': int}
995
- # use_bitlinear=True propagates ternary to all attention + FFN layers
996
- ```
997
-
998
- #### `PhiPositionalEncoding` (Phase 5, PyTorch)
999
-
1000
- ```python
1001
- class PhiPositionalEncoding(nn.Module):
1002
- def __init__(d_model, max_cached=8192)
1003
- def forward(seq_len, offset=0) -> Tensor # (seq_len, d_model)
1004
- def encode_position(position) -> Tensor # (d_model,)
1005
- # Uses golden-angle spiral: position n -> angle n * 2pi * phi^-1
1006
- # Beyond max_cached: Zeckendorf decomposition for O(log n) encoding
1007
- ```
1008
-
1009
- #### `BitLinear` (Phase 6, PyTorch)
1010
-
1011
- ```python
1012
- class BitLinear(nn.Module):
1013
- def __init__(in_features, out_features, bias=False)
1014
- def forward(x) -> Tensor # STE: quantized forward, float backward
1015
- def freeze() # Lock to pure ternary for inference
1016
- def unfreeze() # Return to training mode
1017
- @property
1018
- def ternary_stats -> Dict # {'neg1': float, 'zero': float, 'pos1': float}
1019
- # Weight quantization: scale = mean(|w|), w_q = RoundClip(w/scale, -1, +1)
1020
- # Activation quantization: per-token absmax to int8 [-127, 127]
1021
- ```
1022
-
1023
- ### Rust
1024
-
1025
- #### `LatticeMemory`
1026
-
1027
- ```rust
1028
- pub struct LatticeMemory {
1029
- pub fn new() -> Self
1030
- pub fn store(&mut self, embedding: Vec8, value: [f64; 4], address: u64)
1031
- pub fn load(&mut self, query: Vec8) -> Option<(f64, [f64; 4], u64, u64)>
1032
- pub fn load_by_address(&self, address: u64) -> Option<([f64; 4], u64)>
1033
- pub fn query_attention_exact(&self, query_8d: Vec8) -> Option<(f64, [f64; 4], u64)>
1034
- pub fn query_attention_approx(&self, query_8d: Vec8) -> Option<(f64, [f64; 4], u64)>
1035
- pub fn project(&self, embedding: Vec8) -> Vec4
1036
- pub fn stats(&self) -> LatticeMemoryStats
1037
- pub fn utilization(&self) -> f64
1038
- }
1039
- ```
1040
-
1041
- #### `LatticeAttention`
1042
-
1043
- ```rust
1044
- pub struct LatticeAttention {
1045
- pub fn new(d_model: usize) -> Self
1046
- pub fn insert(&mut self, embedding: &[f64])
1047
- pub fn store_mem(&mut self, embedding_8d: [f64; 8], value: [f64; 4], address: u64)
1048
- pub fn load_mem(&mut self, query_8d: [f64; 8]) -> Option<(f64, [f64; 4], u64, u64)>
1049
- pub fn query_exact(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>>
1050
- pub fn query_approx(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>>
1051
- pub fn query_unified(&self, query_8d: [f64; 8]) -> Option<(f64, [f64; 4], u64)>
1052
- pub fn memory_stats(&self) -> LatticeMemoryStats
1053
- }
1054
- ```
1055
-
1056
- #### `ChamberTree`
1057
-
1058
- ```rust
1059
- pub struct ChamberTree {
1060
- pub fn new(simple_roots: [Vec4; 4]) -> Self
1061
- pub fn insert(&mut self, key: Vec4, value: [f64; 4], timestamp: u64)
1062
- pub fn query_max_exact(&self, query: Vec4) -> Option<(f64, [f64; 4], u64)>
1063
- pub fn query_max_approx(&self, query: Vec4) -> Option<(f64, [f64; 4], u64)>
1064
- pub size: u64
1065
- }
1066
- ```
1067
-
1068
- #### E8 Lattice Functions
1069
-
1070
- ```rust
1071
- pub fn decode_to_e8(point: Vec8) -> LatticePoint // O(1) Voronoi cell decode
1072
- pub fn kissing_vectors() -> Vec<LatticePoint> // 240 nearest neighbors
1073
- pub fn lattice_add(a: LatticePoint, b: LatticePoint) -> LatticePoint
1074
- pub fn neighbor_shell(center: LatticePoint) -> Vec<LatticePoint>
1075
- ```
1076
-
1077
- ---
1078
-
1079
- ## Theory Deep Dive
1080
-
1081
- ### Why Transformers Are Computers
1082
-
1083
- The insight from Percepta ("Can LLMs Be Computers?") is that a transformer's components map directly to a von Neumann architecture:
1084
-
1085
- | Transformer | Computer |
1086
- |-------------|----------|
1087
- | KV cache | RAM |
1088
- | Attention query | Memory read (address decode) |
1089
- | KV insert | Memory write |
1090
- | FFN layer | ALU + instruction decode |
1091
- | Token sequence | Execution trace / clock |
1092
- | Softmax weights | Memory access pattern |
1093
-
1094
- H4 Polytopic Attention makes this concrete by replacing the O(t) softmax scan with O(log t) geometric lookup.
1095
-
1096
- ### Why the E8 -> H4 Projection Unifies Memory and Attention
1097
-
1098
- The E8 root system has a remarkable property: its Coxeter element has eigenvalues that are roots of unity of order 30 (the Coxeter number). When projected along the eigenspaces corresponding to cos(pi/5) and cos(2*pi/5), the 240 roots of E8 map to the vertices of the H4 polytope system.
1099
-
1100
- This means:
1101
- 1. An 8D memory embedding encodes a "full" representation in E8 space
1102
- 2. The projection to 4D preserves exactly the geometric structure that the H4 attention heads use
1103
- 3. Two memory entries that are "nearby" in E8 Voronoi geometry remain nearby after projection to H4 chamber space
1104
- 4. The kissing number 240 in E8 bounds the search space: you never need to check more than 240 neighbors
1105
-
1106
- This is not an arbitrary dimensionality reduction --- it is the unique projection that preserves the golden-ratio structure shared between E8 and H4.
1107
-
1108
- ### Lattice Memory vs. Hash Tables
1109
-
1110
- Traditional hash tables give O(1) lookup but destroy geometric locality --- similar keys can hash to wildly different buckets. E8 lattice decoding gives O(1) lookup while preserving locality:
1111
-
1112
- | Property | Hash Table | E8 Lattice Memory |
1113
- |----------|------------|-------------------|
1114
- | Lookup | O(1) | O(1) |
1115
- | Locality preservation | None | Voronoi cells are convex |
1116
- | Neighbor search | Not possible | 240 kissing vectors |
1117
- | Attention integration | Separate system | Same geometry via projection |
1118
- | Collision handling | Chaining/probing | Cell capacity (bounded by 240) |
1119
-
1120
- ### The Significance of 240
1121
-
1122
- The kissing number of E8 is 240 --- this is the maximum number of non-overlapping unit spheres that can touch a central unit sphere in 8D. It is also the number of roots of the E8 root system, the number of minimal vectors in the E8 lattice, and the bound on how many neighbor cells you ever need to check for a Voronoi cell query.
1123
-
1124
- The 240 roots decompose as:
1125
- - 112 = C(8,2) * 4 vectors of the form +-e_i +- e_j
1126
- - 128 = 2^8 / 2 half-integer vectors with even parity
1127
-
1128
This decomposition mirrors the two cosets of E8 = D8 ∪ (D8 + [1/2]^8).
1129
-
1130
- ### Concurrent Work: Percepta — "Can LLMs Be Computers?"
1131
-
1132
- **Percepta** (Tzamos et al., 2026) independently arrived at O(log t) attention through 2D convex hull geometry. They execute compiled C programs inside transformer weights at 32,000 tok/s on CPU --- millions of exact steps, zero errors.
1133
-
1134
- **The convergence:** Two independent groups identified the same bottleneck (linear attention cost) and arrived at the same solution (geometric sublinear lookup). They use 2D convex hull queries. We use 4D Coxeter chamber navigation. Both achieve O(log t).
1135
-
1136
- | | Percepta (2D) | H4 Polytopic (4D) |
1137
- |---|---|---|
1138
- | Geometry | 2D convex hull | 4D H4 polytope |
1139
- | Complexity | O(log t) | O(log t) |
1140
- | Purpose | Exact program execution | Language generation + RAG |
1141
- | Throughput | 32,000 tok/s (deterministic) | 585 tok/s (language) |
1142
-
1143
- **Why this matters:** Independent validation that geometric attention enables sublinear lookup. Not a task-specific trick --- a fundamental improvement.
1144
-
1145
- **Synthesis:** Their 2D execution path (exact arithmetic at 32K tok/s) + our 4D language path (generation + retrieval) = a hybrid system where the model computes 15 x 23 exactly in its own forward pass, then explains the answer in natural language. No external calculator needed.
1146
-
1147
- ### Lila-E8
1148
-
1149
- **Lila-E8** (concurrent work, 2025-2026) also uses the E8 lattice for attention, but in a fundamentally different way:
1150
-
1151
- | | Lila-E8 | H4 Polytopic Attention (this project) |
1152
- |---|---|---|
1153
- | **E8 role** | Attention bias (additive term in score matrix) | Memory addressing + algorithmic routing via E8->H4 projection |
1154
- | **Complexity** | O(t^2) --- full attention matrix still computed | **O(log t)** --- ChamberTree prunes 97% of candidates |
1155
- | **Mechanism** | E8 structure tells which tokens to upweight | E8->H4 projection partitions S^3 into navigable chambers |
1156
- | **Speed benefit** | Better quality at same cost | **10.6x wall-clock speedup** at 65K keys |
1157
-
1158
- Both approaches validate that E8 geometry is useful for attention. Lila-E8 improves attention quality within O(t^2). We use the E8->H4 projection (cos(pi/5) = phi/2) to make attention fundamentally faster. The approaches are complementary --- Lila-E8's bias could be applied within the candidate set our ChamberTree selects, combining quality with speed.
1159
-
1160
- ---
1161
-
1162
- ## Autoresearch Results
1163
-
1164
- Autonomous agents ran 42+ experiments across three tasks (language modeling, ternary optimization, passage ranking), all on CPU with no human intervention after launch.
1165
-
1166
- **Language modeling (30 experiments, ~56 min):**
1167
- - **1.752 -> 0.062 bpb** (float, 16 experiments): LR was the biggest lever (10x increase), followed by depth and dropout removal
1168
- - **0.088 -> 0.065 bpb** (ternary, 13 experiments): 2x width closed the quantization gap to 0.003
1169
- - **Dropout=0 is optimal**: the frozen geometric backbone IS the regularizer
1170
- - **Ternary wants 1.7x float LR**: STE quantization noise provides implicit regularization
1171
- - **Chamber preservation cliff at ~70%**: below this, geometric routing breaks down
1172
-
1173
- **Passage ranking (12 experiments, ~2 hrs):**
1174
- - **36.6% -> 41.5% R@1** on SQuAD: temperature was the only lever that mattered
1175
- - **Ternary contrastive learning needs 2x higher temperature** (0.15 vs 0.07) --- noisier ternary similarities need softer distributions
1176
- - **Throughput x quality-per-step** is the real objective on fixed time budgets (consistent finding across all three task types)
1177
-
Full methodology and experiment logs in [RESULTS.md](RESULTS.md).

---

## How This Was Built: Claude Code as Research Partner

This entire project --- from the first 600-cell vertex generation through the final PPL 10.0 model on Hugging Face --- was built in a single extended session using [Claude Code](https://claude.com/claude-code), Anthropic's CLI agent for software engineering. The workflow demonstrates what's possible when an AI assistant has the right tools and the right human guidance.

### The Claude Code Workflow

**Phase 1-6 (Architecture):** Claude Code wrote the core implementation files, verified shapes and gradients, ran benchmarks, and committed each phase with descriptive messages. Every piece was tested before moving to the next.

**Autoresearch (42+ experiments):** Claude Code launched autonomous subagents that each ran 2-minute training experiments, parsed results, decided keep/discard, committed improvements, and moved to the next hypothesis. The float sweep (16 experiments) and ternary sweep (13 experiments) ran back-to-back. The ranking sweep (12 experiments) ran separately. No human intervention during sweeps --- the agents found LR scaling, dropout removal, temperature tuning, and the BitNet 2x-width law on their own.

**Parallel tracks:** Three subagents ran simultaneously on non-overlapping files:
- Track A: Rust PyO3 ChamberTree bridge (compiled, benchmarked 10.6x speedup)
- Track B: Shakespeare baselines + data pipeline (head-to-head comparison)
- Track C: Full arXiv paper draft (~7,500 words with LaTeX math)

**Overnight training:** Claude Code configured and launched 8-hour CPU training runs, saving hourly checkpoints with evaluation and generated samples. The PPL 10.0 result came from an overnight run that Claude Code set up, monitored at 30 minutes to de-risk, then let complete autonomously.

**Cross-encoder reranker:** Based on an engineer's analysis of bi-encoder limitations (R@5=100% but R@1 plateauing), Claude Code built a cross-encoder that uses the PPL 10.0 checkpoint as its backbone and fine-tunes it on SQuAD binary classification --- a two-phase training strategy (freeze the backbone, then unfreeze) implemented and running within minutes of the suggestion.

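The freeze-then-unfreeze schedule is a standard pattern. A minimal PyTorch sketch, assuming a model with `.backbone` and `.head` submodules and a caller-supplied `train_one_epoch` function (all names illustrative, not the repo's actual training script):

```python
import torch

def two_phase_finetune(model, train_one_epoch, freeze_epochs=1, total_epochs=3):
    """Phase 1: train only the classification head on a frozen backbone.
    Phase 2: unfreeze the backbone and fine-tune everything at a lower LR."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
    for _ in range(freeze_epochs):
        train_one_epoch(model, opt)

    for p in model.backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(total_epochs - freeze_epochs):
        train_one_epoch(model, opt)
```
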
### Reproducibility

Every step is reproducible:
1. Clone the repo
2. Install dependencies (`pip install numpy torch tiktoken`)
3. Run any training script (`train_cpu.py`, `train_full_scale.py`, `rag/train_ranker.py`)
4. The autoresearch protocol is documented in `h4_program.md`
5. All experiment results are in git history with descriptive commit messages
6. The HF model page has a 7-step guide from zero to trained model

### Acknowledgments

This project was made possible by [Anthropic](https://anthropic.com) and the Claude model family. The ability of Claude Code (powered by Claude Opus) to write, test, debug, and iterate on complex mathematical software --- managing Rust FFI bridges, PyTorch autograd, E8 lattice geometry, and autonomous experiment loops --- is a testament to the work of Dario Amodei and the entire Anthropic team in building AI systems that are genuinely useful for technical work.

The autoresearch methodology was inspired by [Andrej Karpathy's autoresearch](https://github.com/karpathy/autoresearch) project, adapted for CPU-only training with frozen geometric backbones.

**Author:** Timothy McGirl
**AI Research Partner:** Claude Code (Claude Opus 4.6, Anthropic)

---

## Why This Matters: The RAG Cost Elimination

RAG (retrieval-augmented generation) is what most companies actually pay for right now. Not creative writing, not code generation --- document search. "Find the answer to this question in our 10,000 internal documents." That's the workload, and it's expensive.

### Current enterprise RAG stack

| Component | Service | Cost |
|-----------|---------|------|
| Embedding model | OpenAI ada-002 | $0.10/M tokens |
| Vector database | Pinecone / Weaviate | $70-300/month |
| LLM for generation | GPT-3.5/4 | $0.50-10/M tokens |
| **Total (mid-size company)** | | **$500-2,000/month** |

### H4 geometric RAG stack

| Component | Implementation | Monthly Cost |
|-----------|---------------|-------------|
| Document retrieval | E8 lattice memory, 7.8ms/query | **$0** |
| Passage ranking | H4 bi-encoder (R@5=100%) + MiniLM reranker (R@1=98.5%) | **$0** |
| Text generation | Ternary H4 model, 585 tok/s, PPL 10.0 | **$0** |
| Vector database | Not needed --- the E8 lattice IS the database | **$0** |
| **Total** | Runs on existing laptop | **$0/month** |

That's not a cost reduction. That's eliminating the cost entirely.

### Why the architecture matters for RAG

In a standard RAG system, retrieval and generation are completely separate systems. You pay for an embedding model to encode documents, pay for a vector database to store and search them, then pay for a different LLM to read the results and generate an answer. Three separate systems, three separate costs, three separate failure points.

H4 Polytopic Attention unifies all three through the E8->H4 projection. Documents go into the E8 lattice. Questions project through the same geometry to find relevant documents. The H4 attention model reads those documents and generates answers. **One geometry, one system, zero external dependencies.**

### The business case

A company with 50,000 internal documents currently pays ~$500/month ($6,000/year) for hosted RAG. With H4 on a single office server:
- $0/month ongoing
- ~$1,500 one-time for a CPU server (if they don't have one)
- **Payback in 3 months**

For companies with 500,000+ documents or heavy query volume ($2,000-5,000/month currently), payback is under a month.

### What's proven, what's next

**Proven:**
- E8 retrieval: R@5=100%, 20ms (Voronoi cell + 240-neighbor search; see the decoder sketch after this list)
- H4 cross-encoder: **80% R@1 peak** on 5.9K SQuAD pairs (25M ternary, breakthrough)
- MiniLM reranking: R@1=98.5% on same candidates (production accuracy)
- Language generation: PPL 10.0 on TinyStories (24M ternary, beats 33M baseline)
- ChamberTree: 10.6x wall-clock speedup at 65K keys, 98.3% recall
- Router: 100% on 50 test cases (keyword classifier + ChamberTree sub-routing)
- Ternary quantization: 0.003 bpb gap, ~17x compression

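The Voronoi-cell step relies on the standard fast nearest-point decoder for E8, which exploits the decomposition E8 = D8 ∪ (D8 + 1/2). A reference sketch of that decoder; the repo's own implementation may differ in detail:

```python
import numpy as np

def _decode_Dn(x):
    """Nearest point in D_n: integer vectors with an even coordinate sum."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # flip the coordinate that was rounded with the largest error
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def decode_e8(x):
    """Nearest E8 lattice point, via E8 = D8 union (D8 + 1/2)."""
    a = _decode_Dn(x)
    b = _decode_Dn(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

Each decode is just two rounding passes over eight coordinates, so the cell lookup itself is cheap; most of the 20ms retrieval budget presumably goes to the 240-neighbor search and scoring rather than the decode.
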
**Currently training (3 GPU pods in parallel):**
- Code specialist: SmolLM3-3B + QLoRA on 49K code examples (CodeAlpaca + CodeFeedback)
- Math specialist: SmolLM3-3B + QLoRA on 49K math examples (MetaMathQA + GSM8K)
- QA specialist: SmolLM3-3B + QLoRA on 78K QA examples (SQuAD + Natural Questions)
- All three on RunPod RTX 4080 SUPER, ~$10 total, finishing overnight

**The shippable system:** Router (100%, <1ms) -> specialist (SmolLM3-3B + LoRA) -> E8 retrieval (R@5=100%, 20ms) -> MiniLM reranking (R@1=98.5%) -> answer. No GPU, no API, $0/month.

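End to end, that pipeline is a few lines of glue. A control-flow sketch; every interface name here (`router.classify`, `e8_index.retrieve`, `reranker.rank`, `specialists[...]`) is a placeholder for the components listed above, not the project's actual API:

```python
def answer(question, router, specialists, e8_index, reranker, k=5):
    """Route -> retrieve -> rerank -> generate. Interface names are illustrative."""
    domain = router.classify(question)                 # <1ms keyword + chamber routing
    passages = e8_index.retrieve(question, top_k=k)    # E8 lattice lookup, ~20ms
    best = reranker.rank(question, passages)[0]        # reranker picks the top passage
    prompt = f"Context:\n{best}\n\nQuestion: {question}\nAnswer:"
    return specialists[domain].generate(prompt)
```
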
### Roadmap

| Phase | Status | What |
|-------|--------|------|
| **Now** | Training overnight | 3 specialists on GPU (code, math, QA) |
| **Tomorrow** | Next | Wire specialists into router, validate, benchmark |
| **This week** | Planned | GGUF conversion for fast CPU inference, E8 Wikipedia index, end-to-end demo |
| **Next week** | Planned | Add tools (web search, PDF reader, calculator) |
| **After that** | Designed | [Continuous learning loop](OLYMPUS_CONTINUOUS_LEARNING.md) --- system identifies its own gaps and trains new specialists autonomously |

---

## Project Olympus: Frontier-Quality AI on CPU

Everything above is the foundation. **[Project Olympus](PROJECT_OLYMPUS.md)** is the vision: a system that approaches frontier model quality running entirely on CPU, for the billions of people who can't afford GPU compute and API subscriptions.

**The core insight:** Claude Opus memorizes everything in 200B+ params (~400GB). We build 4 focused specialists based on **SmolLM3-3B** (3B ternary each, ~600MB) that know their domain deeply and retrieve everything else from the E8 knowledge index in 20ms. A 3B model that can look up any fact is functionally equivalent to a 200B model that memorized those facts --- for the user, the answer is the same.

**Base model:** SmolLM3-3B-Instruct (Apache 2.0, 11.2T training tokens, 128K context, dual-mode reasoning). The strongest open-source model at this size as of 2025-2026.

**The 4 specialists:**

| Specialist | Fine-tuning | Purpose |
|-----------|-------------|---------|
| General | None (SmolLM3 as-is) | Conversation, instructions, creative |
| Code | The Stack v2 + CodeAlpaca | Code generation, debugging |
| Math | MetaMathQA + GSM8K | Problem solving, reasoning |
| QA | SQuAD + NQ + TriviaQA | Factual answers from retrieved context |

**What's already proven:**
- E8 retrieval: R@5=100% (the answer is always in the results)
- MiniLM reranking: R@1=98.5% (the right answer is picked first)
- H4 geometric routing: <1ms per query (ChamberTree chamber classification)
- Ternary quantization: 17x compression (3B model fits in ~600MB)
- CPU training via QLoRA: 3-5 days for all specialists

**What it enables:**
- Factual QA at 85-90% (retrieval advantage --- looks up facts instead of hallucinating)
- Instruction following at 75-85% (good enough for most tasks)
- $0/month, private, local, runs on any laptop with 32GB RAM
- 100% legally clean: Apache 2.0 models + open datasets, no distillation

**Cost to build:** ~$50-100 in cloud compute (or $0 with a laptop and ~14 days)

**The 14-day plan:** Download SmolLM3 (day 1) -> fine-tune 3 specialists via QLoRA (days 2-4) -> progressive H4 attention swap (days 5-10) -> router + knowledge index + integration (days 11-14).

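For concreteness, a QLoRA configuration sketch in the style of the Hugging Face `transformers` + `peft` stack. The model id, target module names, and hyperparameters are illustrative assumptions, not the configs used for the published specialist checkpoints:

```python
# Sketch only: requires transformers, peft, and bitsandbytes (CUDA GPU for 4-bit).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "HuggingFaceTB/SmolLM3-3B"   # illustrative model id

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)      # only the LoRA adapters are trainable
```

Because the base weights stay frozen in 4-bit and only the low-rank adapters train, per-specialist fine-tuning stays in the few-dollar range quoted above.
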
**Beyond the initial 4 specialists:** The system is designed to grow. **[Continuous Learning](OLYMPUS_CONTINUOUS_LEARNING.md)** describes how the system identifies its own weaknesses (low confidence scores on a domain), automatically curates training data, trains new specialists via QLoRA, validates they outperform the general model, and deploys them --- all autonomously, using the same autoresearch pattern that already ran 42+ experiments without human intervention. Each new specialist costs $2-3 in GPU time.

See **[PROJECT_OLYMPUS.md](PROJECT_OLYMPUS.md)** for the full plan: SmolLM3-3B selection rationale, complete legal audit of every dataset, QLoRA training configs, H4 progressive swap strategy, and honest quality expectations.

*This is not a replacement for frontier models. It's an alternative for the billions of people who can't afford them.*

---

## Citation

```bibtex
@software{mcgirl2026h4polytopic,
  author = {McGirl, Timothy},
  title  = {H4 Polytopic Attention: 4D Geometric Attention with O(log t) Queries via Coxeter Chamber Navigation and E8 Lattice-Indexed RAM},
  year   = {2026},
  url    = {https://github.com/grapheneaffiliate/h4-polytopic-attention},
}
```

---

## License

See repository for license details.

---
license: apache-2.0
tags:
- geometric-attention
- h4-polytope
- ternary-quantization
- project-olympus
- transformer-vm
- hilbert-modular-form
- e8-lattice
- percepta
language:
- en
pipeline_tag: text-generation
---

# H4 Polytopic Attention + Project Olympus

Geometric attention using the 600-cell polytope for O(log t) token lookup, with ternary quantization and multi-specialist architecture.

## What's Here

### Checkpoints

| Model | Size | Description |
|-------|------|-------------|
| h4_fullscale_final.pt | 94MB | 24M ternary H4 LM, PPL 10.0 on TinyStories |
| h4_cross_encoder.pt | 98MB | 80% R@1 cross-encoder for reranking |
| olympus_code/final/ | 116MB | SmolLM3-3B LoRA, code specialist (loss 0.768) |
| olympus_math/final/ | 116MB | SmolLM3-3B LoRA, math specialist (loss 0.235) |
| olympus_qa/final/ | 116MB | SmolLM3-3B LoRA, QA specialist (loss 1.39) |

### Verified Results

| Result | Value |
|--------|-------|
| H4 attention scan ratio | 3.1% at T=2048 (O(log t)) |
| Rust ChamberTree speedup | 10.6x at 65K keys, 98.3% recall |
| Ternary quantization gap | 0.003 bpb |
| Language generation | PPL 10.0 (beats 33M baseline) |
| Router accuracy | 100% on 50 test cases |
| Compiled arithmetic | 30/30 exact |
| Transformer-VM | 10.7K tok/s with OpenBLAS |
| Code verifier | Catches DP backtracking bugs via property checking |

### Mathematical Discovery

**Galois Conjugation Theorem (machine-verified in Lean 4):** For every E8 lattice vector v, the H4 and H4' projected norms are Galois conjugates in Q(sqrt(5)). The E8 theta series decomposes as a Hilbert modular form of weight (4,4) with palindromic coefficients.

Formal proof: [GSMLean/GaloisConjugation.lean](https://github.com/grapheneaffiliate/gsm-lean)

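Stated in symbols, with notation assumed here: pi and pi' denote the two golden-ratio projections of E8 onto the H4 and H4' copies of R^4, and sigma is the nontrivial field automorphism of Q(sqrt(5)):

```latex
\forall v \in E_8:\qquad
\lVert \pi'(v) \rVert^{2} \;=\; \sigma\!\left( \lVert \pi(v) \rVert^{2} \right),
\qquad \sigma(\sqrt{5}) = -\sqrt{5}.
```
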
### Architecture

Three-tier compute: transformer-vm (exact, 10.7K tok/s) -> compiled arithmetic (fallback) -> specialist LLM (language). Code verification via property checking (sprint contract pattern).

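A dispatch sketch of that fallback order; `vm_execute`, `compiled_eval`, and `specialist` are placeholder callables standing in for the three tiers, not functions from this repo:

```python
def three_tier_answer(query, vm_execute, compiled_eval, specialist):
    """Try the exact transformer-VM path first, fall back to compiled
    arithmetic, and use the specialist LLM only for open-ended language."""
    result = vm_execute(query)          # exact symbolic execution (~10.7K tok/s)
    if result is not None:
        return result
    result = compiled_eval(query)       # compiled arithmetic fallback
    if result is not None:
        return result
    return specialist.generate(query)   # natural-language generation
```
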
## Quick Start

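A minimal, illustrative sketch for inspecting one of the `.pt` checkpoints above; the actual model class and checkpoint layout are defined in the GitHub repository, so treat the commented constructor and key names as placeholders:

```python
import torch

# Download h4_fullscale_final.pt from this repo, then inspect the checkpoint.
# Depending on your torch version you may need weights_only=False.
ckpt = torch.load("h4_fullscale_final.pt", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else "")

# Typical next steps (class and key names are placeholders --- see the GitHub repo):
# model = H4LanguageModel(**ckpt["config"])
# model.load_state_dict(ckpt["model"])
# model.eval()
```
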
## Research Plan

See [docs/GEOMETRIC_INFERENCE.md](docs/GEOMETRIC_INFERENCE.md) for the roadmap to Opus-level AI at 50 tok/s on consumer CPU via five multiplicative geometric optimizations.

## License

Apache 2.0. Transformer-VM integration uses Percepta's Apache 2.0 code.