Architecture Guide
System Overview
H4 Polytopic Attention is a transformer executor where:
- Programs compile into analytically constructed transformer weights
- Attention heads operate in 4D using H4 (600-cell) geometry
- Memory is indexed via E8 lattice Voronoi cells
- The E8 -> H4 projection unifies memory and attention into one geometric system
This document covers the internal architecture in detail.
1. Execution Model
Von Neumann Mapping
+------------------+ +------------------+
| FETCH | | StateEncoder |
| Read instr at | --> | Encode IP, regs, |
| current IP | | opcode as d_model|
+------------------+ | vector on S^3 |
+------------------+
|
v
+------------------+ +------------------+
| DECODE | | Attention Heads |
| Determine | --> | H4 ChamberTree |
| operation | | lookup in trace |
+------------------+ +------------------+
|
v
+------------------+ +------------------+
| EXECUTE | | FFN Layers |
| Perform the | --> | Opcode decode + |
| operation | | ALU operations |
+------------------+ +------------------+
|
v
+------------------+ +------------------+
| WRITEBACK | | Update trace, |
| Store results | --> | registers, and |
| | | lattice memory |
+------------------+ +------------------+
Execution Loop (per step)
while not halted:
instr = program[ip]
state_vec = encoder.encode_state(ip, registers, instr, step)
trace.append(state_vec)
# Transformer forward pass over full trace
output = transformer.forward(trace)
# Execute the instruction (updates registers, ip)
execute_instruction(instr)
The transformer's forward pass is run at every step, giving it access to the full execution history. Attention heads look back at prior states to resolve memory lookups and control flow patterns.
2. State Encoding
Vector Layout (d_model = 32)
Byte range Contents Encoding method
--------- -------- ---------------
[0:4] Instruction pointer Golden-angle spiral on S^3
[4:8] Opcode 600-cell vertex (10 vertices apart)
[8:16] Register file (8 regs) tanh(reg / 100) normalization
[16:20] Operand A Golden-angle spiral on S^3
[20:24] Operand B Golden-angle spiral on S^3
[24:28] Destination register Golden-angle spiral on S^3
[28:32] Step counter / phase phi-scaled trigonometric
Golden-Angle Spiral
Each integer (IP, register index, address) maps to a well-separated direction on S^3:
theta1 = n * 2 * pi * phi_inv # golden angle in first plane
theta2 = n * pi * phi_inv * 0.7 # golden angle in second plane
r1, r2 = cos(theta2), sin(theta2)
vec = [r1*cos(theta1), r1*sin(theta1), r2*cos(theta1*phi), r2*sin(theta1*phi)]
The golden angle (2piphi_inv ~ 137.5 degrees) ensures maximum angular separation between consecutive indices, so nearby IPs don't interfere in attention.
Opcode Encoding
Each opcode maps to a specific 600-cell vertex, spaced 10 apart:
LOAD -> vertex[0] STORE_MEM -> vertex[80]
ADD -> vertex[10] LOAD_MEM -> vertex[90]
SUB -> vertex[20]
MUL -> vertex[30]
STORE -> vertex[40]
JMP -> vertex[50]
JNZ -> vertex[60]
HALT -> vertex[70]
With 120 vertices, spacing of 10 ensures ~36 degrees angular separation between opcodes --- well above the Coxeter chamber resolution.
3. Attention Architecture
Head Allocation
With d_model = 32 and d_head = 4, we get 8 attention heads:
Heads 0-1: IP Lookup
Keys encode instruction pointers
Queries find matching IPs in the execution history
Purpose: "What happened last time we were at this IP?"
Heads 2-3: Register Lookup
Keys encode register indices
Queries find register values at prior steps
Purpose: "What was the value of R[a] when we last wrote it?"
Heads 4-5: Operand Fetch
Keys encode operand values
Queries fetch operand data from the trace
Purpose: "What operands were used with this instruction?"
Heads 6-7: Control Flow
Keys encode branch conditions
Queries predict branch direction
Purpose: "Was the branch taken last time?"
ChamberTree Structure
Each head maintains a ChamberTree for its 4D key-value pairs:
Root (4 simple roots of H4)
/ | | \
[0] [1] ... [14] [15] <- 16 buckets (level 0)
|
Rotated roots (by pi/5)
/ | | \
[0] [1] ... [14] [15] <- 16 sub-buckets (level 1)
|
Rotated roots (by pi/5 * phi)
/ | | \
[0] [1] ... [14] [15] <- 16 leaf buckets (level 2)
|
[k1, v1, ts1] <- Leaf entries (max 64 per leaf)
[k2, v2, ts2]
...
Bucket index computation:
fn bucket_index(roots: &[Vec4; 4], key: Vec4) -> usize {
let mut idx = 0;
if key.dot(roots[0]) >= 0.0 { idx |= 1; }
if key.dot(roots[1]) >= 0.0 { idx |= 2; }
if key.dot(roots[2]) >= 0.0 { idx |= 4; }
if key.dot(roots[3]) >= 0.0 { idx |= 8; }
idx // 0..15
}
Approximate query: At each level, visit the primary bucket plus 4 Hamming-1 neighbors (flip one bit each). This gives 5/16 = 31.25% scan per level. Over 3 levels: (5/16)^3 = 3.05%.
4. E8 Lattice Memory (Phase 4)
Address Translation Pipeline
Linear address (u64)
|
| _address_to_embedding()
v
8D golden-angle embedding (Vec8)
|
| decode_to_e8()
v
E8 lattice point (LatticePoint, [i32; 8])
|
| hash
v
HashMap bucket (Vec<MemoryEntry>)
|
| project_to_h4()
v
4D H4 vector (Vec4)
|
| ChamberTree.insert()
v
Attention-queryable cache
E8 Closest-Point Decoder
The E8 lattice decomposes as D8 + (D8 + [1/2]^8). The decoder finds which coset point is closest:
Input: point p in R^8
Coset 1 (D8): integers with even sum
1. f1 = round(p) componentwise
2. If sum(f1) is odd:
- Find component with largest rounding error
- Flip it to the other nearest integer
3. dist1 = ||p - f1||^2
Coset 2 (D8 + [1/2]^8): half-integers with even sum
1. f2 = floor(p) + 0.5 componentwise
2. If sum(f2) is not an integer (parity check):
- Find component with largest error
- Shift it by +-1
3. dist2 = ||p - f2||^2
Return: f1 if dist1 <= dist2, else f2
Representation: All coordinates stored as 2x integers ([i32; 8]) so both integer and half-integer lattice points use the same type. Integer point (1, 0, ...) becomes (2, 0, ...); half-integer (1/2, 1/2, ...) becomes (1, 1, ...).
Memory Operations
STORE_MEM (R[a] -> mem[R[b]]):
- Convert address R[b] to 8D embedding via golden-angle spiral
- Decode embedding to E8 lattice point
- Insert (embedding, value, address, timestamp) into that cell's bucket
- Project embedding to 4D via E8->H4 and insert into ChamberTree
- If bucket exceeds 240 entries (kissing number), evict oldest (LRU)
LOAD_MEM (mem[R[a]] -> R[dest]):
- Convert address R[a] to 8D embedding
- Decode to E8 lattice point (primary cell)
- Search primary cell for closest embedding match
- Search 240 kissing neighbors for closer matches
- Return value of closest match
Utilization Characteristics
With structured (Wasm-like) embeddings at 10,000 steps:
- 106 occupied cells out of ~10,000 entries
- Max bucket size: 240 (saturated cells hit the kissing number bound)
- Primary cell hit rate: 100% (exact queries always find their target in the primary cell)
- Average bucket size: 83.7 entries
The clustering behavior is expected: structured execution traces have correlated embeddings that map to nearby E8 lattice points. The 240 cap prevents any cell from growing unbounded.
5. Phi-Recursive State Encoding
Fibonacci Checkpoint Levels
Long traces are compressed using Fibonacci-spaced checkpoints:
Level 0: checkpoint at every step spacing = F(1) = 1
Level 1: checkpoint at every phi steps spacing = F(2) = 1
Level 2: checkpoint at every phi^2 steps spacing = F(3) = 2
Level 3: checkpoint at every phi^3 steps spacing = F(4) = 3
Level 4: checkpoint at every phi^4 steps spacing = F(5) = 5
...
Level k: checkpoint at every phi^k steps spacing = F(k+1)
Zeckendorf Retrieval
To retrieve state at step t, use Zeckendorf decomposition:
- Every positive integer is a unique sum of non-consecutive Fibonacci numbers
- Decompose distance to target step into Fibonacci components
- Traverse checkpoint levels to reconstruct state
Example: Retrieve step 20 = F(8) + F(4) + F(2) = 21 - 1? No: 20 = 13 + 5 + 2 = F(7) + F(5) + F(3). Jump to level 7 checkpoint closest to target, then level 5, then level 3.
6. Weight Construction
Analytical (No Training)
Transformer weights are computed directly from the geometry:
Attention weights:
# W_K for IP-lookup heads: project to first 4 dims (IP encoding)
W_K[0] = I_4 (4x32, first 4 columns of identity)
W_K[1] = I_4 (duplicate for robustness)
# W_K for register-lookup heads: project to dims 8:16 (register file)
W_K[2][8:12, :] = I_4
W_K[3][12:16, :] = I_4
FFN weights:
# Layer 1: opcode detection
# Each opcode vertex direction becomes a row in W1
# ReLU activation selects the matching opcode
# Layer 2: operation execution
# Maps detected opcode to register update
The key insight is that H4 geometry provides enough angular resolution (14,400 chambers) that each instruction type occupies a distinct Coxeter chamber, making the attention lookup exact rather than approximate.
7. Rust Implementation Details
SIMD Alignment
#[repr(align(32))] // Vec4: one 256-bit AVX2 register
pub struct Vec4(pub [f64; 4]);
#[repr(align(64))] // Vec8: two 256-bit AVX2 registers
pub struct Vec8(pub [f64; 8]);
Dot products are written as single expressions so LLVM can emit vmulpd + horizontal add:
pub fn dot(self, other: Self) -> f64 {
(a[0]*b[0] + a[1]*b[1]) + (a[2]*b[2] + a[3]*b[3])
}
Thread Safety
ChamberTree is Send + Sync for rayon parallelism across attention heads:
unsafe impl Send for ChamberTree {}
unsafe impl Sync for ChamberTree {}
Parallel queries distribute heads across threads:
pub fn query_approx_par(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>> {
self.heads.par_iter().enumerate().map(|(h, head)| {
head.query_approx(query_vec)
}).collect()
}
Build Configuration
[profile.release]
opt-level = 3 # Maximum optimization (enables auto-vectorization)
lto = true # Link-time optimization (enables cross-crate inlining)
8. MCP Server Architecture
Claude Code <--stdio--> h4_mcp_server.py <--import--> weight_compiler.py
<--import--> h4_polytopic_attention.py
The server runs as a subprocess of Claude Code, communicating via stdin/stdout JSON-RPC (MCP protocol). It uses Max plan OAuth --- the compute happens locally, and Claude's API calls use the existing subscription.
Tool Dispatch
@server.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "h4_fibonacci": return await _h4_fibonacci(arguments)
elif name == "h4_compile_and_run": return await _h4_compile_and_run(arguments)
elif name == "h4_geometry_info": return await _h4_geometry_info(arguments)
elif name == "h4_benchmark": return await _h4_benchmark(arguments)
elif name == "h4_lattice_memory": return await _h4_lattice_memory(arguments)
Each tool creates a fresh H4Executor instance, suppresses stdout (the executor is chatty), runs the program, and returns structured JSON results.
9. Phase 5: Trainable Hybrid Attention (PyTorch)
Frozen + Trainable Architecture
The key insight from the Fibonacci proof-of-concept (26 trainable params on a frozen H4 backbone): the geometric backbone provides such strong inductive bias that only small trainable adapters are needed. The learned W_nudge matrix converged to rank-1, aligning 93% with a 600-cell vertex. Phase 5 generalizes this to full language modeling.
Input tokens
|
v
Token Embedding + Golden-Angle Positional Encoding (phi-inverse spacing)
|
v
+-------------------------------------------------------------------+
| H4TransformerBlock (repeated N times) |
| |
| LayerNorm -> H4AttentionLayer -> Residual |
| | |
| +-- FROZEN: 600-cell vertices, simple roots, |
| | E8->H4 projection, ChamberTree |
| | |
| +-- TRAINABLE: W_q/k/v_proj, W_nudge (4x4/head), |
| | chamber_bonus (16/head), W_out |
| | |
| +-- Forward path: |
| 1. Project input -> Q,K,V per head |
| 2. Normalize Q,K to S^3 (unit 4-sphere) |
| 3. Apply W_nudge (small query rotation) |
| 4. ChamberTree top-k lookup OR full attention |
| 5. Soft chamber bonus on keys (differentiable) |
| 6. Softmax over candidates -> weighted V sum |
| 7. Concatenate heads -> W_out |
| |
| LayerNorm -> FFN (Linear -> GELU -> Linear) -> Residual |
+-------------------------------------------------------------------+
|
v
LayerNorm -> LM Head (weight-tied with token embedding)
|
v
Logits (vocab_size)
ChamberTree as Preprocessing Filter
The ChamberTree is not differentiable --- it's a discrete tree traversal. The trick:
- Build: Insert all K vectors into ChamberTree (numpy, per sequence)
- Query: For each Q, tree returns top-k candidate key indices in O(log t)
- Gather: Use indices to select candidate K, V tensors in PyTorch
- Attend: Compute differentiable softmax attention over k candidates only
- Gradients: Flow through W_q_proj, W_k_proj, W_nudge, chamber_bonus, W_v_proj normally
This gives O(k) attention per query where k << t. The tree is just a fast filter.
Soft Chamber Bonus
The chamber_bonus parameter adds a per-key attention bias based on which Coxeter chamber the key lands in. Since softmax is shift-invariant, the bonus must be per-key (not per-query).
For gradient flow, chamber membership uses soft assignment:
k_dots = K @ simple_roots^T # (B, T, H, 4)
soft_signs = sigmoid(k_dots * 3.0) # soft bit per root
chamber_weights = prod over 4 bits # (B, T, H, 16) soft membership
k_bonus = sum(chamber_weights * bonus) # (B, T, H) per-key bonus
scores += k_bonus # differentiable
Golden-Angle Positional Encoding
Instead of sinusoidal or RoPE, positions are encoded using phi-inverse spacing:
Position n -> angle = n * 2*pi * phi^-1
phi^-1 ~ 0.618 is the most irrational number (hardest to approximate by rationals). This guarantees:
- Consecutive positions are maximally separated (~137.5 degrees apart)
- No position vectors repeat or nearly repeat at any scale
- Long-range positions compress via Zeckendorf decomposition (sum of Fibonacci-indexed embeddings)
Frequency scales across dimension pairs use phi powers: phi^(-k/n_pairs), giving geometrically spaced frequencies anchored to the golden ratio.
Diagnostic Metrics
Phase 5 tracks geometric health during training:
| Metric | What it measures | Healthy range |
|---|---|---|
chamber_entropy |
Shannon entropy of chamber utilization | > 2.0 (max ~2.77 for 16 chambers) |
nudge_rank |
Ratio sigma_1/sigma_2 of W_nudge SVD | > 2.0 (rank-1 = focused direction) |
geo_alignment |
Max dot of nudge direction with 600-cell vertex | > 0.9 (geometry attracts learning) |
scan_ratio |
Fraction of keys examined per query (tree mode) | << 1.0 (lower = more pruning) |
Autoresearch Protocol
h4_program.md defines an autonomous experiment loop:
- Agent modifies ONLY
train_cpu.py(trainable adapters + hyperparameters) - Frozen geometry is off-limits
- 2-minute CPU budget per experiment
- Keep/discard based on val_bpb improvement
- Git commit on keep, checkout on discard
- Results tracked in
results.tsv(untracked)
10. Phase 6: BitNet b1.58 Ternary Weights
Why Ternary Fits H4 Geometry
The ChamberTree routing already performs 1-bit quantization: sign(dot(query, root_i)) for 4 roots gives a 4-bit chamber index. BitNet b1.58 extends this to ternary {-1, 0, +1} for all trainable weights.
The key insight: ternary preserves signs, and chamber assignments only depend on signs. Therefore ternary-quantized attention heads navigate the same Coxeter chambers. At initialization this was confirmed at 97.9% chamber preservation. After training, preservation drops to ~76% as the model finds its own geometric routing in the ternary-constrained space --- different from the float routing but equally effective (0.003 bpb gap).
BitLinear: Straight-Through Estimator
Training forward pass:
1. Quantize weights: w_q = RoundClip(w / mean(|w|), -1, +1)
2. STE: w_ste = w + (w_q * scale - w).detach()
β forward sees quantized value, backward sees float shadow
3. Quantize activations: x_q = RoundClip(x * 127 / max(|x|), -127, 127)
4. STE: x_ste = x + (x_q * scale / 127 - x).detach()
5. y = Linear(x_ste, w_ste) β full gradient flow to w and x
Inference forward pass (after freeze()):
1. Load frozen int8 weights (only {-1, 0, +1} values)
2. y = x @ w_frozen.float() * scale β matmul is add/sub only
What Stays Float32
| Component | Reason |
|---|---|
| 600-cell vertices (120 x 4) | Frozen buffer, static lookup |
| H4 simple roots (4 x 4) | Frozen buffer, need exact signs for chamber routing |
| E8->H4 projection (4 x 8) | Frozen buffer |
| chamber_bonus (n_heads x 16) | Too small to quantize (128 values), needs continuous gradients |
| Token embeddings | Lookup table, not a matmul |
| LayerNorm parameters | Small, need float precision |
| LM head | Weight-tied with embeddings |
What Becomes Ternary
| Component | Params (per layer) | Ternary Size (per layer) |
|---|---|---|
| W_q_proj, W_k_proj | 2x d_model*32 | varies by d_model |
| W_v_proj, W_out | 2x d_model*128 | varies by d_model |
| FFN up + down | 2x d_model*d_ffn | varies by d_model |
Example configs from autoresearch (best results):
- d_model=128, 6 layers (float best): ~348K params, ~1.4 MB float32
- d_model=256, 4 layers (ternary best):
1.1M params, ~310 KB at 1.58 bits (17x compression)
Ternary Inference Path
Token β Embedding (float16 lookup)
β BitLinear Q/K/V (ternary: add/sub only)
β Normalize to SΒ³ (float, 4-dim per head)
β W_nudge rotation (float, 4x4 per head β only 16 values)
β ChamberTree lookup (integer sign comparisons, 3.1% scan)
β Softmax over k candidates (float, k << t)
β Weighted sum of values (float x ternary)
β BitLinear output (ternary: add/sub only)
β BitLinear FFN (ternary: add/sub only)
β LM head β next token
Float multiplies in the critical path: root dot products (4x4), softmax, dequant scales (one per layer). Everything else is integer addition/subtraction.
11. Phase 7: Unified Geometric RAG
E8 Retrieval + H4 Ranking + Ternary Generation
Phase 7 unifies document retrieval and generation through the same E8βH4 projection. In standard RAG systems, retrieval (embedding model + vector database) and generation (LLM) are separate systems with different geometries. H4 RAG uses one geometry for both:
- Encode: Documents chunk into 8D embeddings via golden-angle spiral placement, stored in E8 Voronoi cells (
rag/encoder.py) - Retrieve: Questions project to 8D, E8 lattice lookup finds nearest chunks in O(1) + 240 neighbors (
h4_polytopic_attention.py: E8LatticeIndex) - Rank: Question and passage both encode to 4D vectors on SΒ³ via H4 attention. Relevance = dot product in H4 space β the same metric as attention routing (
rag/ranking_model.py) - Generate: Ternary H4 model reads retrieved context and produces answers (
h4_language_model.py)
Contrastive Ranking in H4 Space
The ranker uses InfoNCE loss with in-batch negatives. For a batch of B questions, each paired with its correct passage, the similarity matrix is BΓB where entry [i,j] = dot(q_i, p_j) on SΒ³. The diagonal entries (correct pairs) should dominate each row.
Key finding: ternary models need ~2x higher contrastive temperature (0.15 vs 0.07) because BitLinear produces noisier similarity scores that require softer probability distributions for gradient flow.
Two-Stage Retrieval Pipeline
The production pipeline uses both architectures:
Bi-encoder (fast, broad): Encodes question and passage separately, scores by H4 dot product. At 3.7M params: R@5=100%, MRR=0.93. The answer is always in the top 5 results. Cost: ~20ms for all documents.
Pre-trained cross-encoder (precise, narrow): A pre-trained reranker (ms-marco-MiniLM-L-6-v2, 22M params) scores each candidate by reading question + passage jointly. Achieves 98.5% R@1 on the bi-encoder's top-5 candidates. Cost: ~500ms for 5 candidates.
Pipeline: H4 bi-encoder retrieves top-5 (100% recall, 20ms) β MiniLM reranks top-5 (98.5% precision, ~500ms) β return best.
Our H4 cross-encoder (25M ternary params, 8h overnight on 5.9K SQuAD pairs) reached 80% R@1 peak (69% final eval) β a breakthrough showing H4 cross-attention learns question-to-passage alignment through Coxeter chambers. Trajectory: 29% (1hr) β 52% (3hr) β 80% (7hr). The system ships with MiniLM (98.5%) for production accuracy, with the H4 cross-encoder as a fully geometric, zero-dependency alternative.
Full-Scale Results
- Language modeling: 24M ternary params, PPL 10.0 on TinyStories (8h CPU), beats published 33M baseline
- Bi-encoder retrieval: R@5=100%, MRR=0.93 at 3.7M params (answer always in results)
- Combined reranking: 98.5% R@1 (H4 bi-encoder retrieval + pre-trained MiniLM reranker)
- Passage ranking (min viable): 41.5% R@1 at 870K params (12x random chance)
- Retrieval: 7.8ms per query via E8 lattice
- Cost: $0/month (replaces $500-2,000/month enterprise RAG stack)