Upload docs/ARCHITECTURE.md with huggingface_hub

bc11ae0 verified 18 days ago

preview code

raw

history blame contribute delete

23.5 kB

Architecture Guide

System Overview

H4 Polytopic Attention is a transformer executor where:

Programs compile into analytically constructed transformer weights
Attention heads operate in 4D using H4 (600-cell) geometry
Memory is indexed via E8 lattice Voronoi cells
The E8 -> H4 projection unifies memory and attention into one geometric system

This document covers the internal architecture in detail.

1. Execution Model

Von Neumann Mapping

+------------------+     +------------------+
|    FETCH         |     |   StateEncoder   |
|  Read instr at   | --> | Encode IP, regs, |
|  current IP      |     | opcode as d_model|
+------------------+     | vector on S^3    |
                          +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    DECODE        |     |  Attention Heads  |
|  Determine       | --> | H4 ChamberTree   |
|  operation       |     | lookup in trace   |
+------------------+     +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    EXECUTE       |     |    FFN Layers     |
|  Perform the     | --> | Opcode decode +  |
|  operation       |     | ALU operations    |
+------------------+     +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    WRITEBACK     |     |  Update trace,   |
|  Store results   | --> | registers, and   |
|                  |     | lattice memory   |
+------------------+     +------------------+

Execution Loop (per step)

while not halted:
    instr = program[ip]
    state_vec = encoder.encode_state(ip, registers, instr, step)
    trace.append(state_vec)

    # Transformer forward pass over full trace
    output = transformer.forward(trace)

    # Execute the instruction (updates registers, ip)
    execute_instruction(instr)

The transformer's forward pass is run at every step, giving it access to the full execution history. Attention heads look back at prior states to resolve memory lookups and control flow patterns.

2. State Encoding

Vector Layout (d_model = 32)

Byte range    Contents                    Encoding method
---------     --------                    ---------------
[0:4]         Instruction pointer         Golden-angle spiral on S^3
[4:8]         Opcode                      600-cell vertex (10 vertices apart)
[8:16]        Register file (8 regs)      tanh(reg / 100) normalization
[16:20]       Operand A                   Golden-angle spiral on S^3
[20:24]       Operand B                   Golden-angle spiral on S^3
[24:28]       Destination register        Golden-angle spiral on S^3
[28:32]       Step counter / phase        phi-scaled trigonometric

Golden-Angle Spiral

Each integer (IP, register index, address) maps to a well-separated direction on S^3:

theta1 = n * 2 * pi * phi_inv        # golden angle in first plane
theta2 = n * pi * phi_inv * 0.7      # golden angle in second plane
r1, r2 = cos(theta2), sin(theta2)
vec = [r1*cos(theta1), r1*sin(theta1), r2*cos(theta1*phi), r2*sin(theta1*phi)]

The golden angle (2piphi_inv ~ 137.5 degrees) ensures maximum angular separation between consecutive indices, so nearby IPs don't interfere in attention.

Opcode Encoding

Each opcode maps to a specific 600-cell vertex, spaced 10 apart:

LOAD      -> vertex[0]      STORE_MEM -> vertex[80]
ADD       -> vertex[10]     LOAD_MEM  -> vertex[90]
SUB       -> vertex[20]
MUL       -> vertex[30]
STORE     -> vertex[40]
JMP       -> vertex[50]
JNZ       -> vertex[60]
HALT      -> vertex[70]

With 120 vertices, spacing of 10 ensures ~36 degrees angular separation between opcodes --- well above the Coxeter chamber resolution.

3. Attention Architecture

Head Allocation

With d_model = 32 and d_head = 4, we get 8 attention heads:

Heads 0-1:  IP Lookup
            Keys encode instruction pointers
            Queries find matching IPs in the execution history
            Purpose: "What happened last time we were at this IP?"

Heads 2-3:  Register Lookup
            Keys encode register indices
            Queries find register values at prior steps
            Purpose: "What was the value of R[a] when we last wrote it?"

Heads 4-5:  Operand Fetch
            Keys encode operand values
            Queries fetch operand data from the trace
            Purpose: "What operands were used with this instruction?"

Heads 6-7:  Control Flow
            Keys encode branch conditions
            Queries predict branch direction
            Purpose: "Was the branch taken last time?"

ChamberTree Structure

Each head maintains a ChamberTree for its 4D key-value pairs:

             Root (4 simple roots of H4)
              /    |    |    \
           [0] [1] ... [14] [15]     <- 16 buckets (level 0)
            |
         Rotated roots (by pi/5)
          /    |    |    \
       [0] [1] ... [14] [15]        <- 16 sub-buckets (level 1)
            |
         Rotated roots (by pi/5 * phi)
          /    |    |    \
       [0] [1] ... [14] [15]        <- 16 leaf buckets (level 2)
            |
         [k1, v1, ts1]              <- Leaf entries (max 64 per leaf)
         [k2, v2, ts2]
         ...

Bucket index computation:

fn bucket_index(roots: &[Vec4; 4], key: Vec4) -> usize {
    let mut idx = 0;
    if key.dot(roots[0]) >= 0.0 { idx |= 1; }
    if key.dot(roots[1]) >= 0.0 { idx |= 2; }
    if key.dot(roots[2]) >= 0.0 { idx |= 4; }
    if key.dot(roots[3]) >= 0.0 { idx |= 8; }
    idx  // 0..15
}

Approximate query: At each level, visit the primary bucket plus 4 Hamming-1 neighbors (flip one bit each). This gives 5/16 = 31.25% scan per level. Over 3 levels: (5/16)^3 = 3.05%.

4. E8 Lattice Memory (Phase 4)

Address Translation Pipeline

Linear address (u64)
    |
    | _address_to_embedding()
    v
8D golden-angle embedding (Vec8)
    |
    | decode_to_e8()
    v
E8 lattice point (LatticePoint, [i32; 8])
    |
    | hash
    v
HashMap bucket (Vec<MemoryEntry>)
    |
    | project_to_h4()
    v
4D H4 vector (Vec4)
    |
    | ChamberTree.insert()
    v
Attention-queryable cache

E8 Closest-Point Decoder

The E8 lattice decomposes as D8 + (D8 + [1/2]^8). The decoder finds which coset point is closest:

Input: point p in R^8

Coset 1 (D8): integers with even sum
  1. f1 = round(p) componentwise
  2. If sum(f1) is odd:
     - Find component with largest rounding error
     - Flip it to the other nearest integer
  3. dist1 = ||p - f1||^2

Coset 2 (D8 + [1/2]^8): half-integers with even sum
  1. f2 = floor(p) + 0.5 componentwise
  2. If sum(f2) is not an integer (parity check):
     - Find component with largest error
     - Shift it by +-1
  3. dist2 = ||p - f2||^2

Return: f1 if dist1 <= dist2, else f2

Representation: All coordinates stored as 2x integers ([i32; 8]) so both integer and half-integer lattice points use the same type. Integer point (1, 0, ...) becomes (2, 0, ...); half-integer (1/2, 1/2, ...) becomes (1, 1, ...).

Memory Operations

STORE_MEM (R[a] -> mem[R[b]]):

Convert address R[b] to 8D embedding via golden-angle spiral
Decode embedding to E8 lattice point
Insert (embedding, value, address, timestamp) into that cell's bucket
Project embedding to 4D via E8->H4 and insert into ChamberTree
If bucket exceeds 240 entries (kissing number), evict oldest (LRU)

LOAD_MEM (mem[R[a]] -> R[dest]):

Convert address R[a] to 8D embedding
Decode to E8 lattice point (primary cell)
Search primary cell for closest embedding match
Search 240 kissing neighbors for closer matches
Return value of closest match

Utilization Characteristics

With structured (Wasm-like) embeddings at 10,000 steps:

106 occupied cells out of ~10,000 entries
Max bucket size: 240 (saturated cells hit the kissing number bound)
Primary cell hit rate: 100% (exact queries always find their target in the primary cell)
Average bucket size: 83.7 entries

The clustering behavior is expected: structured execution traces have correlated embeddings that map to nearby E8 lattice points. The 240 cap prevents any cell from growing unbounded.

5. Phi-Recursive State Encoding

Fibonacci Checkpoint Levels

Long traces are compressed using Fibonacci-spaced checkpoints:

Level 0: checkpoint at every step          spacing = F(1) = 1
Level 1: checkpoint at every phi steps     spacing = F(2) = 1
Level 2: checkpoint at every phi^2 steps   spacing = F(3) = 2
Level 3: checkpoint at every phi^3 steps   spacing = F(4) = 3
Level 4: checkpoint at every phi^4 steps   spacing = F(5) = 5
...
Level k: checkpoint at every phi^k steps   spacing = F(k+1)

Zeckendorf Retrieval

To retrieve state at step t, use Zeckendorf decomposition:

Every positive integer is a unique sum of non-consecutive Fibonacci numbers
Decompose distance to target step into Fibonacci components
Traverse checkpoint levels to reconstruct state

Example: Retrieve step 20 = F(8) + F(4) + F(2) = 21 - 1? No: 20 = 13 + 5 + 2 = F(7) + F(5) + F(3). Jump to level 7 checkpoint closest to target, then level 5, then level 3.

6. Weight Construction

Analytical (No Training)

Transformer weights are computed directly from the geometry:

Attention weights:

# W_K for IP-lookup heads: project to first 4 dims (IP encoding)
W_K[0] = I_4   (4x32, first 4 columns of identity)
W_K[1] = I_4   (duplicate for robustness)

# W_K for register-lookup heads: project to dims 8:16 (register file)
W_K[2][8:12, :] = I_4
W_K[3][12:16, :] = I_4

FFN weights:

# Layer 1: opcode detection
# Each opcode vertex direction becomes a row in W1
# ReLU activation selects the matching opcode

# Layer 2: operation execution
# Maps detected opcode to register update

The key insight is that H4 geometry provides enough angular resolution (14,400 chambers) that each instruction type occupies a distinct Coxeter chamber, making the attention lookup exact rather than approximate.

7. Rust Implementation Details

SIMD Alignment

#[repr(align(32))]   // Vec4: one 256-bit AVX2 register
pub struct Vec4(pub [f64; 4]);

#[repr(align(64))]   // Vec8: two 256-bit AVX2 registers
pub struct Vec8(pub [f64; 8]);

Dot products are written as single expressions so LLVM can emit vmulpd + horizontal add:

pub fn dot(self, other: Self) -> f64 {
    (a[0]*b[0] + a[1]*b[1]) + (a[2]*b[2] + a[3]*b[3])
}

Thread Safety

ChamberTree is Send + Sync for rayon parallelism across attention heads:

unsafe impl Send for ChamberTree {}
unsafe impl Sync for ChamberTree {}

Parallel queries distribute heads across threads:

pub fn query_approx_par(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>> {
    self.heads.par_iter().enumerate().map(|(h, head)| {
        head.query_approx(query_vec)
    }).collect()
}

Build Configuration

[profile.release]
opt-level = 3    # Maximum optimization (enables auto-vectorization)
lto = true       # Link-time optimization (enables cross-crate inlining)

8. MCP Server Architecture

Claude Code <--stdio--> h4_mcp_server.py <--import--> weight_compiler.py
                                          <--import--> h4_polytopic_attention.py

The server runs as a subprocess of Claude Code, communicating via stdin/stdout JSON-RPC (MCP protocol). It uses Max plan OAuth --- the compute happens locally, and Claude's API calls use the existing subscription.

Tool Dispatch

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "h4_fibonacci":       return await _h4_fibonacci(arguments)
    elif name == "h4_compile_and_run": return await _h4_compile_and_run(arguments)
    elif name == "h4_geometry_info":   return await _h4_geometry_info(arguments)
    elif name == "h4_benchmark":       return await _h4_benchmark(arguments)
    elif name == "h4_lattice_memory":  return await _h4_lattice_memory(arguments)

Each tool creates a fresh H4Executor instance, suppresses stdout (the executor is chatty), runs the program, and returns structured JSON results.

9. Phase 5: Trainable Hybrid Attention (PyTorch)

Frozen + Trainable Architecture

The key insight from the Fibonacci proof-of-concept (26 trainable params on a frozen H4 backbone): the geometric backbone provides such strong inductive bias that only small trainable adapters are needed. The learned W_nudge matrix converged to rank-1, aligning 93% with a 600-cell vertex. Phase 5 generalizes this to full language modeling.

Input tokens
    |
    v
Token Embedding + Golden-Angle Positional Encoding (phi-inverse spacing)
    |
    v
+-------------------------------------------------------------------+
|  H4TransformerBlock (repeated N times)                            |
|                                                                    |
|  LayerNorm -> H4AttentionLayer -> Residual                        |
|                |                                                   |
|                +-- FROZEN: 600-cell vertices, simple roots,       |
|                |           E8->H4 projection, ChamberTree         |
|                |                                                   |
|                +-- TRAINABLE: W_q/k/v_proj, W_nudge (4x4/head), |
|                |              chamber_bonus (16/head), W_out      |
|                |                                                   |
|                +-- Forward path:                                  |
|                    1. Project input -> Q,K,V per head             |
|                    2. Normalize Q,K to S^3 (unit 4-sphere)        |
|                    3. Apply W_nudge (small query rotation)        |
|                    4. ChamberTree top-k lookup OR full attention  |
|                    5. Soft chamber bonus on keys (differentiable) |
|                    6. Softmax over candidates -> weighted V sum   |
|                    7. Concatenate heads -> W_out                  |
|                                                                    |
|  LayerNorm -> FFN (Linear -> GELU -> Linear) -> Residual         |
+-------------------------------------------------------------------+
    |
    v
LayerNorm -> LM Head (weight-tied with token embedding)
    |
    v
Logits (vocab_size)

ChamberTree as Preprocessing Filter

The ChamberTree is not differentiable --- it's a discrete tree traversal. The trick:

Build: Insert all K vectors into ChamberTree (numpy, per sequence)
Query: For each Q, tree returns top-k candidate key indices in O(log t)
Gather: Use indices to select candidate K, V tensors in PyTorch
Attend: Compute differentiable softmax attention over k candidates only
Gradients: Flow through W_q_proj, W_k_proj, W_nudge, chamber_bonus, W_v_proj normally

This gives O(k) attention per query where k << t. The tree is just a fast filter.

Soft Chamber Bonus

The chamber_bonus parameter adds a per-key attention bias based on which Coxeter chamber the key lands in. Since softmax is shift-invariant, the bonus must be per-key (not per-query).

For gradient flow, chamber membership uses soft assignment:

k_dots = K @ simple_roots^T            # (B, T, H, 4)
soft_signs = sigmoid(k_dots * 3.0)     # soft bit per root
chamber_weights = prod over 4 bits     # (B, T, H, 16) soft membership
k_bonus = sum(chamber_weights * bonus) # (B, T, H) per-key bonus
scores += k_bonus                      # differentiable

Golden-Angle Positional Encoding

Instead of sinusoidal or RoPE, positions are encoded using phi-inverse spacing:

Position n -> angle = n * 2*pi * phi^-1

phi^-1 ~ 0.618 is the most irrational number (hardest to approximate by rationals). This guarantees:

Consecutive positions are maximally separated (~137.5 degrees apart)
No position vectors repeat or nearly repeat at any scale
Long-range positions compress via Zeckendorf decomposition (sum of Fibonacci-indexed embeddings)

Frequency scales across dimension pairs use phi powers: phi^(-k/n_pairs), giving geometrically spaced frequencies anchored to the golden ratio.

Diagnostic Metrics

Phase 5 tracks geometric health during training:

Metric	What it measures	Healthy range
`chamber_entropy`	Shannon entropy of chamber utilization	> 2.0 (max ~2.77 for 16 chambers)
`nudge_rank`	Ratio sigma_1/sigma_2 of W_nudge SVD	> 2.0 (rank-1 = focused direction)
`geo_alignment`	Max dot of nudge direction with 600-cell vertex	> 0.9 (geometry attracts learning)
`scan_ratio`	Fraction of keys examined per query (tree mode)	<< 1.0 (lower = more pruning)

Autoresearch Protocol

h4_program.md defines an autonomous experiment loop:

Agent modifies ONLY train_cpu.py (trainable adapters + hyperparameters)
Frozen geometry is off-limits
2-minute CPU budget per experiment
Keep/discard based on val_bpb improvement
Git commit on keep, checkout on discard
Results tracked in results.tsv (untracked)

10. Phase 6: BitNet b1.58 Ternary Weights

Why Ternary Fits H4 Geometry

The ChamberTree routing already performs 1-bit quantization: sign(dot(query, root_i)) for 4 roots gives a 4-bit chamber index. BitNet b1.58 extends this to ternary {-1, 0, +1} for all trainable weights.

The key insight: ternary preserves signs, and chamber assignments only depend on signs. Therefore ternary-quantized attention heads navigate the same Coxeter chambers. At initialization this was confirmed at 97.9% chamber preservation. After training, preservation drops to ~76% as the model finds its own geometric routing in the ternary-constrained space --- different from the float routing but equally effective (0.003 bpb gap).

BitLinear: Straight-Through Estimator

Training forward pass:
    1. Quantize weights:  w_q = RoundClip(w / mean(|w|), -1, +1)
    2. STE:  w_ste = w + (w_q * scale - w).detach()
       → forward sees quantized value, backward sees float shadow
    3. Quantize activations: x_q = RoundClip(x * 127 / max(|x|), -127, 127)
    4. STE:  x_ste = x + (x_q * scale / 127 - x).detach()
    5. y = Linear(x_ste, w_ste)  → full gradient flow to w and x

Inference forward pass (after freeze()):
    1. Load frozen int8 weights (only {-1, 0, +1} values)
    2. y = x @ w_frozen.float() * scale  → matmul is add/sub only

What Stays Float32

Component	Reason
600-cell vertices (120 x 4)	Frozen buffer, static lookup
H4 simple roots (4 x 4)	Frozen buffer, need exact signs for chamber routing
E8->H4 projection (4 x 8)	Frozen buffer
chamber_bonus (n_heads x 16)	Too small to quantize (128 values), needs continuous gradients
Token embeddings	Lookup table, not a matmul
LayerNorm parameters	Small, need float precision
LM head	Weight-tied with embeddings

What Becomes Ternary

Component	Params (per layer)	Ternary Size (per layer)
W_q_proj, W_k_proj	2x d_model*32	varies by d_model
W_v_proj, W_out	2x d_model*128	varies by d_model
FFN up + down	2x d_model*d_ffn	varies by d_model

Example configs from autoresearch (best results):

d_model=128, 6 layers (float best): ~348K params, ~1.4 MB float32
d_model=256, 4 layers (ternary best): ~~1.1M params, ~310 KB at 1.58 bits (~~17x compression)

Ternary Inference Path

Token → Embedding (float16 lookup)
      → BitLinear Q/K/V (ternary: add/sub only)
      → Normalize to S³ (float, 4-dim per head)
      → W_nudge rotation (float, 4x4 per head — only 16 values)
      → ChamberTree lookup (integer sign comparisons, 3.1% scan)
      → Softmax over k candidates (float, k << t)
      → Weighted sum of values (float x ternary)
      → BitLinear output (ternary: add/sub only)
      → BitLinear FFN (ternary: add/sub only)
      → LM head → next token

Float multiplies in the critical path: root dot products (4x4), softmax, dequant scales (one per layer). Everything else is integer addition/subtraction.

11. Phase 7: Unified Geometric RAG

E8 Retrieval + H4 Ranking + Ternary Generation

Phase 7 unifies document retrieval and generation through the same E8→H4 projection. In standard RAG systems, retrieval (embedding model + vector database) and generation (LLM) are separate systems with different geometries. H4 RAG uses one geometry for both:

Encode: Documents chunk into 8D embeddings via golden-angle spiral placement, stored in E8 Voronoi cells (rag/encoder.py)
Retrieve: Questions project to 8D, E8 lattice lookup finds nearest chunks in O(1) + 240 neighbors (h4_polytopic_attention.py: E8LatticeIndex)
Rank: Question and passage both encode to 4D vectors on S³ via H4 attention. Relevance = dot product in H4 space — the same metric as attention routing (rag/ranking_model.py)
Generate: Ternary H4 model reads retrieved context and produces answers (h4_language_model.py)

Contrastive Ranking in H4 Space

The ranker uses InfoNCE loss with in-batch negatives. For a batch of B questions, each paired with its correct passage, the similarity matrix is B×B where entry [i,j] = dot(q_i, p_j) on S³. The diagonal entries (correct pairs) should dominate each row.

Key finding: ternary models need ~2x higher contrastive temperature (0.15 vs 0.07) because BitLinear produces noisier similarity scores that require softer probability distributions for gradient flow.

Two-Stage Retrieval Pipeline

The production pipeline uses both architectures:

Bi-encoder (fast, broad): Encodes question and passage separately, scores by H4 dot product. At 3.7M params: R@5=100%, MRR=0.93. The answer is always in the top 5 results. Cost: ~20ms for all documents.
Pre-trained cross-encoder (precise, narrow): A pre-trained reranker (ms-marco-MiniLM-L-6-v2, 22M params) scores each candidate by reading question + passage jointly. Achieves 98.5% R@1 on the bi-encoder's top-5 candidates. Cost: ~500ms for 5 candidates.

Pipeline: H4 bi-encoder retrieves top-5 (100% recall, 20ms) → MiniLM reranks top-5 (98.5% precision, ~500ms) → return best.

Our H4 cross-encoder (25M ternary params, 8h overnight on 5.9K SQuAD pairs) reached 80% R@1 peak (69% final eval) — a breakthrough showing H4 cross-attention learns question-to-passage alignment through Coxeter chambers. Trajectory: 29% (1hr) → 52% (3hr) → 80% (7hr). The system ships with MiniLM (98.5%) for production accuracy, with the H4 cross-encoder as a fully geometric, zero-dependency alternative.

Full-Scale Results

Language modeling: 24M ternary params, PPL 10.0 on TinyStories (8h CPU), beats published 33M baseline
Bi-encoder retrieval: R@5=100%, MRR=0.93 at 3.7M params (answer always in results)
Combined reranking: 98.5% R@1 (H4 bi-encoder retrieval + pre-trained MiniLM reranker)
Passage ranking (min viable): 41.5% R@1 at 870K params (12x random chance)
Retrieval: 7.8ms per query via E8 lattice
Cost: $0/month (replaces $500-2,000/month enterprise RAG stack)