	# Architecture Guide

	## System Overview

	H4 Polytopic Attention is a transformer executor where:
	- Programs compile into analytically constructed transformer weights
	- Attention heads operate in 4D using H4 (600-cell) geometry
	- Memory is indexed via E8 lattice Voronoi cells
	- The E8 -> H4 projection unifies memory and attention into one geometric system

	This document covers the internal architecture in detail.

	---

	## 1. Execution Model

	### Von Neumann Mapping

	```
	+------------------+       +------------------+
	| FETCH            |       | StateEncoder     |
	| Read instr at    |  -->  | Encode IP, regs, |
	| current IP       |       | opcode as d_model|
	+------------------+       | vector on S^3    |
	                           +------------------+
	         |
	         v
	+------------------+       +------------------+
	| DECODE           |       | Attention Heads  |
	| Determine        |  -->  | H4 ChamberTree   |
	| operation        |       | lookup in trace  |
	+------------------+       +------------------+
	         |
	         v
	+------------------+       +------------------+
	| EXECUTE          |       | FFN Layers       |
	| Perform the      |  -->  | Opcode decode +  |
	| operation        |       | ALU operations   |
	+------------------+       +------------------+
	         |
	         v
	+------------------+       +------------------+
	| WRITEBACK        |       | Update trace,    |
	| Store results    |  -->  | registers, and   |
	|                  |       | lattice memory   |
	+------------------+       +------------------+
	```

	### Execution Loop (per step)

	```python
	while not halted:
	    instr = program[ip]
	    state_vec = encoder.encode_state(ip, registers, instr, step)
	    trace.append(state_vec)

	    # Transformer forward pass over the full trace so far
	    output = transformer.forward(trace)

	    # Execute the instruction (updates registers, ip)
	    execute_instruction(instr)
	    step += 1
	```

	The transformer's forward pass is run at every step, giving it access to the full execution history. Attention heads look back at prior states to resolve memory lookups and control flow patterns.

	---

	## 2. State Encoding

	### Vector Layout (d_model = 32)

	```
	Dims      Contents                 Encoding method
	-------   ----------------------   ----------------------------------
	[0:4]     Instruction pointer      Golden-angle spiral on S^3
	[4:8]     Opcode                   600-cell vertex (10 indices apart)
	[8:16]    Register file (8 regs)   tanh(reg / 100) normalization
	[16:20]   Operand A                Golden-angle spiral on S^3
	[20:24]   Operand B                Golden-angle spiral on S^3
	[24:28]   Destination register     Golden-angle spiral on S^3
	[28:32]   Step counter / phase     phi-scaled trigonometric
	```

	### Golden-Angle Spiral

	Each integer (IP, register index, address) maps to a well-separated direction on S^3:

	```python
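	from math import cos, sin, pi, sqrt
	phi = (1 + sqrt(5)) / 2              # golden ratio
	phi_inv = 1 / phi

	# n is the integer being encoded (IP, register index, or address)
	theta1 = n * 2 * pi * phi_inv        # golden angle in first plane
	theta2 = n * pi * phi_inv * 0.7      # golden angle in second plane
	r1, r2 = cos(theta2), sin(theta2)    # r1^2 + r2^2 = 1, so vec lies on S^3
	vec = [r1 * cos(theta1), r1 * sin(theta1), r2 * cos(theta1 * phi), r2 * sin(theta1 * phi)]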
	```

	The golden angle (2 * pi * phi_inv ~ 222.5 degrees, which is 137.5 degrees measured in the opposite direction) spreads consecutive indices far apart around the circle, so nearby IPs don't interfere in attention.

	### Opcode Encoding

	Each opcode maps to a specific 600-cell vertex, spaced 10 apart:

	```
	LOAD      -> vertex[0]
	ADD       -> vertex[10]
	SUB       -> vertex[20]
	MUL       -> vertex[30]
	STORE     -> vertex[40]
	JMP       -> vertex[50]
	JNZ       -> vertex[60]
	HALT      -> vertex[70]
	STORE_MEM -> vertex[80]
	LOAD_MEM  -> vertex[90]
	```

	With 120 vertices, a spacing of 10 indices gives ~36 degrees of angular separation between opcodes, well above the Coxeter chamber resolution.

	---

	## 3. Attention Architecture

	### Head Allocation

	With d_model = 32 and d_head = 4, we get 8 attention heads:

	```
	Heads 0-1: IP Lookup
	    Keys encode instruction pointers
	    Queries find matching IPs in the execution history
	    Purpose: "What happened last time we were at this IP?"

	Heads 2-3: Register Lookup
	    Keys encode register indices
	    Queries find register values at prior steps
	    Purpose: "What was the value of R[a] when we last wrote it?"

	Heads 4-5: Operand Fetch
	    Keys encode operand values
	    Queries fetch operand data from the trace
	    Purpose: "What operands were used with this instruction?"

	Heads 6-7: Control Flow
	    Keys encode branch conditions
	    Queries predict branch direction
	    Purpose: "Was the branch taken last time?"
	```

	### ChamberTree Structure

	Each head maintains a ChamberTree for its 4D key-value pairs:

	```
	Root (4 simple roots of H4)
	 /    |         |     \
	[0]  [1]  ...  [14]  [15]      <- 16 buckets (level 0)
	 |
	Rotated roots (by pi/5)
	 /    |         |     \
	[0]  [1]  ...  [14]  [15]      <- 16 sub-buckets (level 1)
	 |
	Rotated roots (by pi/5 * phi)
	 /    |         |     \
	[0]  [1]  ...  [14]  [15]      <- 16 leaf buckets (level 2)
	 |
	[k1, v1, ts1]                  <- Leaf entries (max 64 per leaf)
	[k2, v2, ts2]
	...
	```

	Bucket index computation:
	```rust
	fn bucket_index(roots: &[Vec4; 4], key: Vec4) -> usize {
	    let mut idx = 0;
	    if key.dot(roots[0]) >= 0.0 { idx |= 1; }
	    if key.dot(roots[1]) >= 0.0 { idx |= 2; }
	    if key.dot(roots[2]) >= 0.0 { idx |= 4; }
	    if key.dot(roots[3]) >= 0.0 { idx |= 8; }
	    idx // in 0..=15
	}
	```

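	Approximate query: at each level, visit the primary bucket plus its 4 Hamming-1 neighbors (flip one bit of the index each). That scans 5/16 = 31.25% of the buckets per level. Over 3 levels: (5/16)^3 = 3.05% of the leaf buckets, which is also the expected fraction of keys scanned when keys are spread evenly across buckets.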
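
The neighbor set and scan fraction can be checked with a few lines of Python. This is a minimal sketch; the helper names (`hamming1_neighbors`, `buckets_to_visit`, `scan_fraction`) are illustrative and do not come from the codebase.

```python
def hamming1_neighbors(idx, n_bits=4):
    """Bucket indices that differ from idx in exactly one bit."""
    return [idx ^ (1 << b) for b in range(n_bits)]


def buckets_to_visit(idx, n_bits=4):
    """Primary bucket first, then its Hamming-1 neighbors."""
    return [idx] + hamming1_neighbors(idx, n_bits)


def scan_fraction(levels=3, n_bits=4):
    """Fraction of leaf buckets touched when every level visits n_bits + 1 buckets."""
    return ((n_bits + 1) / 2 ** n_bits) ** levels


print(buckets_to_visit(5))        # [5, 4, 7, 1, 13]
print(f"{scan_fraction():.4%}")   # 3.0518%
```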

	---

	## 4. E8 Lattice Memory (Phase 4)

	### Address Translation Pipeline

	```
	Linear address (u64)
	    |
	    |  _address_to_embedding()
	    v
	8D golden-angle embedding (Vec8)
	    |
	    |  decode_to_e8()
	    v
	E8 lattice point (LatticePoint, [i32; 8])
	    |
	    |  hash
	    v
	HashMap bucket (Vec<MemoryEntry>)
	    |
	    |  project_to_h4()
	    v
	4D H4 vector (Vec4)
	    |
	    |  ChamberTree.insert()
	    v
	Attention-queryable cache
	```

	### E8 Closest-Point Decoder

	The E8 lattice is the union of two cosets, D8 and D8 + [1/2]^8. The decoder finds the closest point in each coset and keeps the nearer one:

	```
	Input: point p in R^8

	Coset 1 (D8): integers with even sum
	  1. f1 = round(p) componentwise
	  2. If sum(f1) is odd:
	     - Find the component with the largest rounding error
	     - Flip it to the other nearest integer
	  3. dist1 = ||p - f1||^2

	Coset 2 (D8 + [1/2]^8): half-integers with even sum
	  1. f2 = floor(p) + 0.5 componentwise
	  2. If sum(f2) is odd (the sum of 8 half-integers is always an integer):
	     - Find the component with the largest error
	     - Shift it by +-1, toward p
	  3. dist2 = ||p - f2||^2

	Return: f1 if dist1 <= dist2, else f2
	```

	Representation: All coordinates stored as 2x integers ([i32; 8]) so both integer and half-integer lattice points use the same type. Integer point (1, 0, ...) becomes (2, 0, ...); half-integer (1/2, 1/2, ...) becomes (1, 1, ...).
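
The decoder and the doubled-integer representation can be sketched in NumPy as follows. This is an illustrative sketch, not the project's `decode_to_e8()`: the function names are hypothetical, and coset 2 is decoded by shifting `p` by 1/2 and reusing the D8 routine, which matches `floor(p) + 0.5` except at exact integer ties.

```python
import numpy as np


def closest_d8(p):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    f = np.rint(p)
    if int(f.sum()) % 2 != 0:
        err = p - f
        i = int(np.argmax(np.abs(err)))
        # Re-round the worst coordinate to its second-nearest integer to fix parity.
        f[i] += 1.0 if err[i] >= 0 else -1.0
    return f


def closest_e8(p):
    """Nearest E8 point: the better of the D8 candidate and the D8 + 1/2 candidate."""
    p = np.asarray(p, dtype=float)
    f1 = closest_d8(p)
    f2 = closest_d8(p - 0.5) + 0.5
    dist1 = np.sum((p - f1) ** 2)
    dist2 = np.sum((p - f2) ** 2)
    return f1 if dist1 <= dist2 else f2


def to_doubled(point):
    """Store coordinates as 2x integers so both cosets share one integer type."""
    return tuple(int(round(2 * c)) for c in point)


print(to_doubled(closest_e8([0.5] * 8)))  # (1, 1, 1, 1, 1, 1, 1, 1)
```

In the doubled representation every valid point has coordinates of one parity (all even or all odd) and a coordinate sum divisible by 4, which makes a cheap sanity check for decoder output.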

	### Memory Operations

	STORE_MEM (R[a] -> mem[R[b]]):
	1. Convert address R[b] to 8D embedding via golden-angle spiral
	2. Decode embedding to E8 lattice point
	3. Insert (embedding, value, address, timestamp) into that cell's bucket
	4. Project embedding to 4D via E8->H4 and insert into ChamberTree
	5. If bucket exceeds 240 entries (kissing number), evict oldest (LRU)

	LOAD_MEM (mem[R[a]] -> R[dest]):
	1. Convert address R[a] to 8D embedding
	2. Decode to E8 lattice point (primary cell)
	3. Search primary cell for closest embedding match
	4. Search 240 kissing neighbors for closer matches
	5. Return value of closest match
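
The bucket bookkeeping behind these two operations can be sketched in Python. The sketch makes several assumptions: `LatticeMemorySketch` and `round_cell` are hypothetical names, plain componentwise rounding stands in for the E8 decoder, the search over the 240 kissing neighbors (LOAD_MEM step 4) and the H4 projection are omitted, and eviction drops the oldest write.

```python
from collections import deque

import numpy as np

MAX_BUCKET = 240  # per-cell cap (the E8 kissing number)


class LatticeMemorySketch:
    def __init__(self, cell_fn):
        self.cell_fn = cell_fn  # maps an 8D embedding to a hashable cell id
        self.cells = {}         # cell id -> deque of entries, oldest first
        self.clock = 0          # monotonically increasing timestamp

    def store(self, embedding, value, address):
        emb = np.asarray(embedding, dtype=float)
        bucket = self.cells.setdefault(self.cell_fn(emb), deque(maxlen=MAX_BUCKET))
        # A full deque silently drops its oldest entry on append.
        bucket.append((emb, value, address, self.clock))
        self.clock += 1

    def load(self, embedding):
        emb = np.asarray(embedding, dtype=float)
        bucket = self.cells.get(self.cell_fn(emb))
        if not bucket:
            return None
        # Closest embedding wins; ties go to the most recent write.
        best = min(bucket, key=lambda e: (float(np.sum((e[0] - emb) ** 2)), -e[3]))
        return best[1]


def round_cell(emb):
    """Stand-in cell function: componentwise rounding instead of E8 decoding."""
    return tuple(int(c) for c in np.rint(emb))
```

Breaking distance ties by timestamp means a second store to the same address shadows the first instead of returning a stale value.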

	### Utilization Characteristics

	With structured (Wasm-like) embeddings at 10,000 steps:
	- 106 occupied cells for ~10,000 stored entries
	- Max bucket size: 240 (saturated cells hit the kissing number bound)
	- Primary cell hit rate: 100% (exact queries always find their target in the primary cell)
	- Average bucket size: 83.7 entries

	The clustering behavior is expected: structured execution traces have correlated embeddings that map to nearby E8 lattice points. The 240 cap prevents any cell from growing unbounded.

	---

	## 5. Phi-Recursive State Encoding

	### Fibonacci Checkpoint Levels

	Long traces are compressed using Fibonacci-spaced checkpoints:

	```
	Level 0: checkpoint at every step          spacing = F(1) = 1
	Level 1: checkpoint at every phi steps     spacing = F(2) = 1
	Level 2: checkpoint at every phi^2 steps   spacing = F(3) = 2
	Level 3: checkpoint at every phi^3 steps   spacing = F(4) = 3
	Level 4: checkpoint at every phi^4 steps   spacing = F(5) = 5
	...
	Level k: checkpoint at every phi^k steps   spacing = F(k+1)
	```

	### Zeckendorf Retrieval

	To retrieve state at step t, use Zeckendorf decomposition:
	- Every positive integer is a unique sum of non-consecutive Fibonacci numbers
	- Decompose distance to target step into Fibonacci components
	- Traverse checkpoint levels to reconstruct state

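	Example: to retrieve step 20, decompose 20 = 13 + 5 + 2 = F(7) + F(5) + F(3). Jump to the checkpoint with spacing F(7) = 13 closest to the target, then the one with spacing F(5) = 5, then the one with spacing F(3) = 2 (levels 6, 4, and 2 in the table above, where level k has spacing F(k+1)).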
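
The decomposition itself is a short greedy loop. The sketch below is illustrative (the function name `zeckendorf` is not from the codebase) and shows only the decomposition, not the checkpoint traversal.

```python
def zeckendorf(n):
    """Greedy Zeckendorf decomposition: non-consecutive Fibonacci numbers summing to n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts


print(zeckendorf(20))  # [13, 5, 2]
```

Taking the largest Fibonacci number that fits at each step guarantees the next term skips at least one Fibonacci index, which is why the result never contains consecutive Fibonacci numbers.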

	---

	## 6. Weight Construction

	### Analytical (No Training)

	Transformer weights are computed directly from the geometry:

	Attention weights:
	```python
	import numpy as np

	d_model, d_head, n_heads = 32, 4, 8
	W_K = np.zeros((n_heads, d_head, d_model))  # one 4x32 key projection per head
	I_4 = np.eye(d_head)

	# W_K for IP-lookup heads: select dims 0:4 (IP encoding)
	W_K[0][:, 0:4] = I_4
	W_K[1][:, 0:4] = I_4  # duplicate for robustness

	# W_K for register-lookup heads: select dims 8:16 (register file)
	W_K[2][:, 8:12] = I_4
	W_K[3][:, 12:16] = I_4
	```

	FFN weights:
	```python
	# Layer 1: opcode detection
	# Each opcode vertex direction becomes a row in W1
	# ReLU activation selects the matching opcode

	# Layer 2: operation execution
	# Maps detected opcode to register update
	```

	The key insight is that H4 geometry provides enough angular resolution (14,400 chambers) that each instruction type occupies a distinct Coxeter chamber, making the attention lookup exact rather than approximate.

	---

	## 7. Rust Implementation Details

	### SIMD Alignment

	```rust
	#[repr(align(32))] // Vec4: one 256-bit AVX2 register
	pub struct Vec4(pub [f64; 4]);

	#[repr(align(64))] // Vec8: two 256-bit AVX2 registers
	pub struct Vec8(pub [f64; 8]);
	```

	Dot products are written as single expressions so LLVM can emit `vmulpd` + horizontal add:
	```rust
	pub fn dot(self, other: Self) -> f64 {
	    let (a, b) = (self.0, other.0);
	    (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3])
	}
	```

	### Thread Safety

	ChamberTree is `Send + Sync` for rayon parallelism across attention heads:
	```rust
	unsafe impl Send for ChamberTree {}
	unsafe impl Sync for ChamberTree {}
	```

	Parallel queries distribute heads across threads:
	```rust
	pub fn query_approx_par(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>> {
	    self.heads.par_iter().enumerate().map(|(h, head)| {
	        // query_vec: head h's 4D query derived from `embedding` (derivation elided here)
	        head.query_approx(query_vec)
	    }).collect()
	}
	```

	### Build Configuration

	```toml
	[profile.release]
	opt-level = 3 # Maximum optimization (enables auto-vectorization)
	lto = true # Link-time optimization (enables cross-crate inlining)
	```

	---

	## 8. MCP Server Architecture

	```
	Claude Code <--stdio--> h4_mcp_server.py <--import--> weight_compiler.py
	                                         <--import--> h4_polytopic_attention.py
	```

	The server runs as a subprocess of Claude Code, communicating via stdin/stdout JSON-RPC (MCP protocol). It uses Max plan OAuth: the compute happens locally, and Claude's API calls use the existing subscription.

	### Tool Dispatch

	```python
	@server.call_tool()
	async def call_tool(name: str, arguments: dict):
	    if name == "h4_fibonacci":
	        return await _h4_fibonacci(arguments)
	    elif name == "h4_compile_and_run":
	        return await _h4_compile_and_run(arguments)
	    elif name == "h4_geometry_info":
	        return await _h4_geometry_info(arguments)
	    elif name == "h4_benchmark":
	        return await _h4_benchmark(arguments)
	    elif name == "h4_lattice_memory":
	        return await _h4_lattice_memory(arguments)
	    raise ValueError(f"Unknown tool: {name}")
	```

	Each tool creates a fresh H4Executor instance, suppresses stdout (the executor is chatty), runs the program, and returns structured JSON results.
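
Suppressing stdout matters because the stdio transport carries the JSON-RPC messages on stdout, so stray prints from the executor would corrupt the protocol stream. A minimal sketch of the pattern, using a hypothetical `ChattyExecutor` as a stand-in for `H4Executor` (whose real interface is not shown in this guide):

```python
import contextlib
import io
import json


class ChattyExecutor:
    """Hypothetical stand-in for H4Executor: prints progress and returns a result dict."""

    def run(self, program):
        print(f"executing {len(program)} instructions ...")
        return {"result": sum(program), "steps": len(program)}


def run_tool(program):
    executor = ChattyExecutor()             # fresh instance per call, no shared state
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        result = executor.run(program)      # prints land in `captured`, not on stdout
    return json.dumps(result)               # structured JSON back to the caller


print(run_tool([1, 2, 3]))  # {"result": 6, "steps": 3}
```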

	---

	## 9. Phase 5: Trainable Hybrid Attention (PyTorch)

	### Frozen + Trainable Architecture

	The key insight from the Fibonacci proof-of-concept (26 trainable params on a frozen H4 backbone): the geometric backbone provides such strong inductive bias that only small trainable adapters are needed. The learned W_nudge matrix converged to rank-1, aligning 93% with a 600-cell vertex. Phase 5 generalizes this to full language modeling.

	```
	Input tokens
	    |
	    v
	Token Embedding + Golden-Angle Positional Encoding (phi-inverse spacing)
	    |
	    v
	+------------------------------------------------------------------+
	| H4TransformerBlock (repeated N times)                            |
	|                                                                  |
	| LayerNorm -> H4AttentionLayer -> Residual                        |
	|   |                                                              |
	|   +-- FROZEN: 600-cell vertices, simple roots,                   |
	|   |           E8->H4 projection, ChamberTree                     |
	|   |                                                              |
	|   +-- TRAINABLE: W_q/k/v_proj, W_nudge (4x4/head),               |
	|   |              chamber_bonus (16/head), W_out                  |
	|   |                                                              |
	|   +-- Forward path:                                              |
	|         1. Project input -> Q,K,V per head                       |
	|         2. Normalize Q,K to S^3 (unit 3-sphere in R^4)           |
	|         3. Apply W_nudge (small query rotation)                  |
	|         4. ChamberTree top-k lookup OR full attention            |
	|         5. Soft chamber bonus on keys (differentiable)           |
	|         6. Softmax over candidates -> weighted V sum             |
	|         7. Concatenate heads -> W_out                            |
	|                                                                  |
	| LayerNorm -> FFN (Linear -> GELU -> Linear) -> Residual          |
	+------------------------------------------------------------------+
	    |
	    v
	LayerNorm -> LM Head (weight-tied with token embedding)
	    |
	    v
	Logits (vocab_size)
	```

	### ChamberTree as Preprocessing Filter

	The ChamberTree is not differentiable: it is a discrete tree traversal. The trick is to use it only to choose candidates:

	1. Build: Insert all K vectors into ChamberTree (numpy, per sequence)
	2. Query: For each Q, tree returns top-k candidate key indices in O(log t)
	3. Gather: Use indices to select candidate K, V tensors in PyTorch
	4. Attend: Compute differentiable softmax attention over k candidates only
	5. Gradients: Flow through W_q_proj, W_k_proj, W_nudge, chamber_bonus, W_v_proj normally

	This gives O(k) attention per query where k << t. The tree is just a fast filter.
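
The gather-then-attend step can be sketched in NumPy for a single query. Two assumptions: a brute-force top-k over all scores stands in for the ChamberTree (so the sketch shows the data flow, not the O(log t) lookup), and the function names are illustrative.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def candidate_attention(q, K, V, k):
    """Attend over k candidate keys only.

    q: (d,) query, K: (t, d) keys, V: (t, d_v) values.
    """
    scores = K @ q                                  # (t,) computed here only to pick candidates
    idx = np.argpartition(-scores, k - 1)[:k]       # stand-in for the tree's top-k indices
    weights = softmax(scores[idx])                  # softmax over the k candidates only
    return weights @ V[idx], idx
```

In a PyTorch version the indices would be treated as constants, and gradients would flow through the gathered K and V rows exactly as in steps 3 to 5 above.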

	### Soft Chamber Bonus

	The chamber_bonus parameter adds a per-key attention bias based on which Coxeter chamber the key lands in. The bonus must be per-key, not per-query: softmax is shift-invariant, so a per-query bonus would add the same constant to every score in a row and cancel out.

	For gradient flow, chamber membership uses soft assignment:
	```
	k_dots = K @ simple_roots^T # (B, T, H, 4)
	soft_signs = sigmoid(k_dots * 3.0) # soft bit per root
	chamber_weights = prod over 4 bits # (B, T, H, 16) soft membership
	k_bonus = sum(chamber_weights * bonus) # (B, T, H) per-key bonus
	scores += k_bonus # differentiable
	```
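
A NumPy sketch of the soft assignment for one head follows. It assumes the bit order of `bucket_index` (bit i is set when the dot product with root i is non-negative); the function name is illustrative, and the real H4 simple roots are not reproduced here, so any 4x4 matrix can be passed as `roots`.

```python
import numpy as np


def soft_chamber_bonus(K, roots, bonus, sharpness=3.0):
    """K: (T, 4) keys, roots: (4, 4) simple roots as rows, bonus: (16,) per-chamber bias."""
    k_dots = K @ roots.T                                    # (T, 4)
    soft_bits = 1.0 / (1.0 + np.exp(-sharpness * k_dots))   # sigmoid: soft bit per root
    # bits[c, i] is bit i of chamber index c.
    bits = (np.arange(16)[:, None] >> np.arange(4)) & 1     # (16, 4)
    per_bit = np.where(bits[None] == 1,
                       soft_bits[:, None, :],
                       1.0 - soft_bits[:, None, :])         # (T, 16, 4)
    chamber_weights = per_bit.prod(axis=-1)                 # (T, 16), each row sums to 1
    return chamber_weights @ bonus, chamber_weights
```

Each row of `chamber_weights` is a probability distribution over the 16 chambers, and as `sharpness` grows it approaches the one-hot vector of the hard bucket index.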

	### Golden-Angle Positional Encoding

	Instead of sinusoidal or RoPE, positions are encoded using phi-inverse spacing:

	```
	Position n -> angle = n * 2 * pi * phi^-1
	```

	phi^-1 ~ 0.618 is the most irrational number (hardest to approximate by rationals). This guarantees:
	- Consecutive positions are maximally separated (~137.5 degrees apart)
	- No position vectors repeat or nearly repeat at any scale
	- Long-range positions compress via Zeckendorf decomposition (sum of Fibonacci-indexed embeddings)

	Frequency scales across dimension pairs use phi powers: phi^(-k/n_pairs), giving geometrically spaced frequencies anchored to the golden ratio.
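
One way to realize this encoding is sketched below. The cos/sin interleaving per dimension pair and the function name are assumptions made for illustration; the guide only specifies the base angle and the phi-power frequency scaling.

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2


def golden_positional_encoding(n_positions, d_model):
    """Position n, pair k -> angle n * (2*pi / phi) * phi^(-k / n_pairs), stored as (cos, sin)."""
    n_pairs = d_model // 2
    positions = np.arange(n_positions)[:, None]        # (n, 1)
    freqs = PHI ** (-np.arange(n_pairs) / n_pairs)     # geometric spacing in powers of phi
    angles = positions * (2 * np.pi / PHI) * freqs     # (n, n_pairs)
    pe = np.empty((n_positions, 2 * n_pairs))
    pe[:, 0::2] = np.cos(angles)
    pe[:, 1::2] = np.sin(angles)
    return pe
```

In the first pair (k = 0), consecutive positions are rotated by 2*pi/phi, which is about 222.5 degrees, so the angle between their vectors is about 137.5 degrees.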

	### Diagnostic Metrics

	Phase 5 tracks geometric health during training:

	| Metric | What it measures | Healthy range |
	|--------|------------------|---------------|
	| `chamber_entropy` | Shannon entropy of chamber utilization (nats) | > 2.0 (max ln 16 ~ 2.77 for 16 chambers) |
	| `nudge_rank` | Ratio sigma_1/sigma_2 of W_nudge SVD | > 2.0 (rank-1 = focused direction) |
	| `geo_alignment` | Max dot of nudge direction with 600-cell vertex | > 0.9 (geometry attracts learning) |
	| `scan_ratio` | Fraction of keys examined per query (tree mode) | << 1.0 (lower = more pruning) |
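
The first two metrics follow directly from their definitions. The sketch below is illustrative: the function signatures are assumptions, and `geo_alignment` is left out because it needs the 600-cell vertex table.

```python
import numpy as np


def chamber_entropy(chamber_ids, n_chambers=16):
    """Shannon entropy (nats) of how often each chamber index is used."""
    counts = np.bincount(chamber_ids, minlength=n_chambers)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def nudge_rank(W_nudge):
    """Ratio of the two largest singular values; large values mean near rank-1."""
    s = np.linalg.svd(W_nudge, compute_uv=False)  # singular values, descending
    return float(s[0] / s[1])


print(round(chamber_entropy(np.arange(16)), 2))  # 2.77 when all 16 chambers are used equally
```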

	### Autoresearch Protocol

	`h4_program.md` defines an autonomous experiment loop:
	- Agent modifies ONLY `train_cpu.py` (trainable adapters + hyperparameters)
	- Frozen geometry is off-limits
	- 2-minute CPU budget per experiment
	- Keep/discard based on val_bpb improvement
	- Git commit on keep, checkout on discard
	- Results tracked in `results.tsv` (untracked)

	---

	## 10. Phase 6: BitNet b1.58 Ternary Weights

	### Why Ternary Fits H4 Geometry

	The ChamberTree routing already performs 1-bit quantization: `sign(dot(query, root_i))` for 4 roots gives a 4-bit chamber index. BitNet b1.58 extends this to ternary {-1, 0, +1} for all trainable weights.

	The key insight: ternary quantization preserves weight signs, and chamber assignments depend only on the signs of dot products with the roots, so ternary-quantized attention heads navigate largely the same Coxeter chambers. At initialization, chamber preservation was measured at 97.9%. After training, preservation drops to ~76% as the model finds its own geometric routing in the ternary-constrained space, different from the float routing but equally effective (0.003 bpb gap).

	### BitLinear: Straight-Through Estimator

	```
	Training forward pass:
	  1. Quantize weights: w_q = RoundClip(w / mean(|w|), -1, +1)
	  2. STE: w_ste = w + (w_q * scale - w).detach()        (scale = mean(|w|))
	     → forward sees quantized value, backward sees float shadow
	  3. Quantize activations: x_q = RoundClip(x * 127 / max(|x|), -127, 127)
	  4. STE: x_ste = x + (x_q * scale / 127 - x).detach()  (scale = max(|x|))
	  5. y = Linear(x_ste, w_ste) → full gradient flow to w and x

	Inference forward pass (after freeze()):
	  1. Load frozen int8 weights (only {-1, 0, +1} values)
	  2. y = x @ w_frozen.float() * scale → matmul is add/sub only
	```
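
The two quantizers and the inference matmul can be sketched in NumPy. The sketch leaves out the straight-through estimator, which needs an autograd framework. Function names, the `eps` guard against division by zero, and the `(out_features, in_features)` weight layout are assumptions.

```python
import numpy as np


def ternary_quantize(w, eps=1e-8):
    """Absmean quantization: w_q = RoundClip(w / mean(|w|), -1, +1)."""
    scale = np.abs(w).mean() + eps
    w_q = np.clip(np.rint(w / scale), -1, 1).astype(np.int8)
    return w_q, scale


def int8_quantize(x, eps=1e-8):
    """Absmax quantization: x_q = RoundClip(x * 127 / max(|x|), -127, 127)."""
    scale = np.abs(x).max() + eps
    x_q = np.clip(np.rint(x * 127 / scale), -127, 127).astype(np.int8)
    return x_q, scale


def bitlinear_infer(x, w_q, w_scale):
    """Inference path: matmul against {-1, 0, +1} weights, then one float rescale."""
    return (x @ w_q.T.astype(np.float32)) * w_scale
```

With ternary weights each output element is a signed sum of selected inputs, so the only float multiply left per layer is the final rescale by `w_scale`.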

	### What Stays Float32

	| Component | Reason |
	|-----------|--------|
	| 600-cell vertices (120 x 4) | Frozen buffer, static lookup |
	| H4 simple roots (4 x 4) | Frozen buffer, need exact signs for chamber routing |
	| E8->H4 projection (4 x 8) | Frozen buffer |
	| chamber_bonus (n_heads x 16) | Too small to quantize (128 values), needs continuous gradients |
	| Token embeddings | Lookup table, not a matmul |
	| LayerNorm parameters | Small, need float precision |
	| LM head | Weight-tied with embeddings |

	### What Becomes Ternary

	| Component | Params (per layer) | Ternary Size (per layer) |
	|-----------|--------------------|--------------------------|
	| W_q_proj, W_k_proj | 2x d_model*32 | varies by d_model |
	| W_v_proj, W_out | 2x d_model*128 | varies by d_model |
	| FFN up + down | 2x d_model*d_ffn | varies by d_model |

	Example configs from autoresearch (best results):
	- d_model=128, 6 layers (float best): ~348K params, ~1.4 MB float32
	- d_model=256, 4 layers (ternary best): ~1.1M params, ~310 KB at 1.58 bits (~17x compression)

	### Ternary Inference Path

	```
	Token → Embedding (float16 lookup)
	      → BitLinear Q/K/V (ternary: add/sub only)
	      → Normalize to S³ (float, 4-dim per head)
	      → W_nudge rotation (float, 4x4 per head, only 16 values)
	      → ChamberTree lookup (integer sign comparisons, 3.1% scan)
	      → Softmax over k candidates (float, k << t)
	      → Weighted sum of values (float x ternary)
	      → BitLinear output (ternary: add/sub only)
	      → BitLinear FFN (ternary: add/sub only)
	      → LM head → next token
	```

	Float multiplies in the critical path: root dot products (4x4), softmax, dequant scales (one per layer). Everything else is integer addition/subtraction.

	---

	## 11. Phase 7: Unified Geometric RAG

	### E8 Retrieval + H4 Ranking + Ternary Generation

	Phase 7 unifies document retrieval and generation through the same E8→H4 projection. In standard RAG systems, retrieval (embedding model + vector database) and generation (LLM) are separate systems with different geometries. H4 RAG uses one geometry for both:

	1. Encode: Documents chunk into 8D embeddings via golden-angle spiral placement, stored in E8 Voronoi cells (`rag/encoder.py`)
	2. Retrieve: Questions project to 8D, E8 lattice lookup finds nearest chunks in O(1) + 240 neighbors (`h4_polytopic_attention.py: E8LatticeIndex`)
	3. Rank: Question and passage both encode to 4D vectors on S³ via H4 attention. Relevance = dot product in H4 space — the same metric as attention routing (`rag/ranking_model.py`)
	4. Generate: Ternary H4 model reads retrieved context and produces answers (`h4_language_model.py`)

	### Contrastive Ranking in H4 Space

	The ranker uses InfoNCE loss with in-batch negatives. For a batch of B questions, each paired with its correct passage, the similarity matrix is B×B where entry [i,j] = dot(q_i, p_j) on S³. The diagonal entries (correct pairs) should dominate each row.

	Key finding: ternary models need ~2x higher contrastive temperature (0.15 vs 0.07) because BitLinear produces noisier similarity scores that require softer probability distributions for gradient flow.
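
The loss can be written out in NumPy to make the role of the temperature concrete. This is a forward-only sketch (no gradients), the function name is illustrative, and the inputs are assumed to be (B, 4) arrays of question and passage vectors.

```python
import numpy as np


def info_nce(q, p, temperature=0.07):
    """In-batch InfoNCE: row i of q should match row i of p; other rows are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)    # put both sides on the unit sphere
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                    # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # diagonal holds the correct pairs
```

Raising the temperature from 0.07 to 0.15 shrinks every logit by roughly half, which flattens the row-wise softmax and keeps gradients from being dominated by a few noisy similarity scores.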

	### Two-Stage Retrieval Pipeline

	The production pipeline uses both architectures:

	1. Bi-encoder (fast, broad): Encodes question and passage separately, scores by H4 dot product. At 3.7M params: R@5=100%, MRR=0.93. The answer is always in the top 5 results. Cost: ~20ms for all documents.

	2. Pre-trained cross-encoder (precise, narrow): A pre-trained reranker (ms-marco-MiniLM-L-6-v2, 22M params) scores each candidate by reading question + passage jointly. Achieves 98.5% R@1 on the bi-encoder's top-5 candidates. Cost: ~500ms for 5 candidates.

	Pipeline: H4 bi-encoder retrieves top-5 (100% recall, 20ms) → MiniLM reranks top-5 (98.5% R@1, ~500ms) → return best.
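
The control flow of the two stages is small enough to sketch directly. Here `rerank_fn` is a hypothetical callable standing in for the cross-encoder (it takes a passage index and returns a relevance score), and the passage vectors are assumed to be precomputed.

```python
import numpy as np


def two_stage_search(query_vec, passage_vecs, rerank_fn, top_k=5):
    """Stage 1: dot-product shortlist. Stage 2: rerank only the shortlist."""
    scores = passage_vecs @ query_vec                        # cheap score for every passage
    candidates = [int(i) for i in np.argsort(-scores)[:top_k]]
    best = max(candidates, key=rerank_fn)                    # expensive scorer sees top_k only
    return best, candidates
```

The reranker can only choose among the shortlisted passages, which is why the bi-encoder's recall at 5 bounds the accuracy of the whole pipeline.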

	Our H4 cross-encoder (25M ternary params, trained 8h overnight on 5.9K SQuAD pairs) reached 80% R@1 at its peak (69% at final eval), which shows that H4 cross-attention can learn question-to-passage alignment through Coxeter chambers. Trajectory: 29% (1hr) → 52% (3hr) → 80% (7hr). The system ships with MiniLM (98.5%) for production accuracy, with the H4 cross-encoder as a fully geometric, zero-dependency alternative.

	### Full-Scale Results

	- Language modeling: 24M ternary params, PPL 10.0 on TinyStories (8h CPU), beats published 33M baseline
	- Bi-encoder retrieval: R@5=100%, MRR=0.93 at 3.7M params (answer always in results)
	- Combined reranking: 98.5% R@1 (H4 bi-encoder retrieval + pre-trained MiniLM reranker)
	- Passage ranking (min viable): 41.5% R@1 at 870K params (12x random chance)
	- Retrieval: 7.8ms per query via E8 lattice
	- Cost: $0/month (replaces $500-2,000/month enterprise RAG stack)