# Architecture Guide

## System Overview

H4 Polytopic Attention is a transformer executor where:
- Programs compile into analytically constructed transformer weights
- Attention heads operate in 4D using H4 (600-cell) geometry
- Memory is indexed via E8 lattice Voronoi cells
- The E8 -> H4 projection unifies memory and attention into one geometric system

This document covers the internal architecture in detail.

---

## 1. Execution Model

### Von Neumann Mapping

```
+------------------+     +------------------+
|    FETCH         |     |   StateEncoder   |
|  Read instr at   | --> | Encode IP, regs, |
|  current IP      |     | opcode as d_model|
+------------------+     | vector on S^3    |
                          +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    DECODE        |     |  Attention Heads  |
|  Determine       | --> | H4 ChamberTree   |
|  operation       |     | lookup in trace   |
+------------------+     +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    EXECUTE       |     |    FFN Layers     |
|  Perform the     | --> | Opcode decode +  |
|  operation       |     | ALU operations    |
+------------------+     +------------------+
                                  |
                                  v
+------------------+     +------------------+
|    WRITEBACK     |     |  Update trace,   |
|  Store results   | --> | registers, and   |
|                  |     | lattice memory   |
+------------------+     +------------------+
```

### Execution Loop (per step)

```python
step = 0
while not halted:
    instr = program[ip]
    state_vec = encoder.encode_state(ip, registers, instr, step)
    trace.append(state_vec)

    # Transformer forward pass over full trace
    output = transformer.forward(trace)

    # Execute the instruction (updates registers, ip, and halted)
    execute_instruction(instr)
    step += 1
```

The transformer's forward pass is run at every step, giving it access to the full execution history. Attention heads look back at prior states to resolve memory lookups and control flow patterns.

---

## 2. State Encoding

### Vector Layout (d_model = 32)

```
Dim range     Contents                    Encoding method
---------     --------                    ---------------
[0:4]         Instruction pointer         Golden-angle spiral on S^3
[4:8]         Opcode                      600-cell vertex (10 vertices apart)
[8:16]        Register file (8 regs)      tanh(reg / 100) normalization
[16:20]       Operand A                   Golden-angle spiral on S^3
[20:24]       Operand B                   Golden-angle spiral on S^3
[24:28]       Destination register        Golden-angle spiral on S^3
[28:32]       Step counter / phase        phi-scaled trigonometric
```
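
To make this layout concrete, here is a minimal numpy sketch of how a state vector could be assembled. `spiral4` and `opcode_vertex` stand in for the golden-angle and 600-cell encoders described below and are passed in rather than defined here; the step-phase formula is illustrative.

```python
import numpy as np

def encode_state(ip, opcode_idx, regs, op_a, op_b, dest, step,
                 spiral4, opcode_vertex, phi=(1 + 5 ** 0.5) / 2):
    """Assemble a 32-dim state vector following the layout table above."""
    v = np.zeros(32)
    v[0:4]   = spiral4(ip)                        # instruction pointer on S^3
    v[4:8]   = opcode_vertex(opcode_idx)          # 600-cell vertex for the opcode
    v[8:16]  = np.tanh(np.asarray(regs) / 100.0)  # 8 registers, squashed
    v[16:20] = spiral4(op_a)                      # operand A
    v[20:24] = spiral4(op_b)                      # operand B
    v[24:28] = spiral4(dest)                      # destination register
    t = step * 2 * np.pi / phi                    # phi-scaled step phase (illustrative)
    v[28:32] = [np.cos(t), np.sin(t), np.cos(t / phi), np.sin(t / phi)]
    return v
```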

### Golden-Angle Spiral

Each integer (IP, register index, address) maps to a well-separated direction on S^3:

```python
from math import cos, sin, pi
phi = (1 + 5 ** 0.5) / 2; phi_inv = 1 / phi   # golden ratio constants

# n: the integer being encoded (IP, register index, address)
theta1 = n * 2 * pi * phi_inv        # golden angle in first plane
theta2 = n * pi * phi_inv * 0.7      # golden angle in second plane
r1, r2 = cos(theta2), sin(theta2)
vec = [r1*cos(theta1), r1*sin(theta1), r2*cos(theta1*phi), r2*sin(theta1*phi)]
```

The golden-angle step (2*pi*phi_inv, equivalent to ~137.5 degrees of circular separation per increment) keeps consecutive indices well separated at every scale, so nearby IPs don't interfere in attention.

### Opcode Encoding

Each opcode maps to a specific 600-cell vertex, spaced 10 apart:

```
LOAD      -> vertex[0]      STORE_MEM -> vertex[80]
ADD       -> vertex[10]     LOAD_MEM  -> vertex[90]
SUB       -> vertex[20]
MUL       -> vertex[30]
STORE     -> vertex[40]
JMP       -> vertex[50]
JNZ       -> vertex[60]
HALT      -> vertex[70]
```

With 120 vertices, spacing of 10 ensures ~36 degrees angular separation between opcodes --- well above the Coxeter chamber resolution.
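
A small sketch of how this table could be built, assuming `vertices` is the (120, 4) array of 600-cell vertices (illustrative, not the compiler's exact code):

```python
def build_opcode_table(vertices):
    """Map each opcode to a 600-cell vertex, spaced 10 indices apart."""
    opcodes = ["LOAD", "ADD", "SUB", "MUL", "STORE", "JMP", "JNZ", "HALT",
               "STORE_MEM", "LOAD_MEM"]
    return {op: vertices[10 * i] for i, op in enumerate(opcodes)}
```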

---

## 3. Attention Architecture

### Head Allocation

With d_model = 32 and d_head = 4, we get 8 attention heads:

```
Heads 0-1:  IP Lookup
            Keys encode instruction pointers
            Queries find matching IPs in the execution history
            Purpose: "What happened last time we were at this IP?"

Heads 2-3:  Register Lookup
            Keys encode register indices
            Queries find register values at prior steps
            Purpose: "What was the value of R[a] when we last wrote it?"

Heads 4-5:  Operand Fetch
            Keys encode operand values
            Queries fetch operand data from the trace
            Purpose: "What operands were used with this instruction?"

Heads 6-7:  Control Flow
            Keys encode branch conditions
            Queries predict branch direction
            Purpose: "Was the branch taken last time?"
```

### ChamberTree Structure

Each head maintains a ChamberTree for its 4D key-value pairs:

```
             Root (4 simple roots of H4)
              /    |    |    \
           [0] [1] ... [14] [15]     <- 16 buckets (level 0)
            |
         Rotated roots (by pi/5)
          /    |    |    \
       [0] [1] ... [14] [15]        <- 16 sub-buckets (level 1)
            |
         Rotated roots (by pi/5 * phi)
          /    |    |    \
       [0] [1] ... [14] [15]        <- 16 leaf buckets (level 2)
            |
         [k1, v1, ts1]              <- Leaf entries (max 64 per leaf)
         [k2, v2, ts2]
         ...
```

**Bucket index computation:**
```rust
fn bucket_index(roots: &[Vec4; 4], key: Vec4) -> usize {
    let mut idx = 0;
    if key.dot(roots[0]) >= 0.0 { idx |= 1; }
    if key.dot(roots[1]) >= 0.0 { idx |= 2; }
    if key.dot(roots[2]) >= 0.0 { idx |= 4; }
    if key.dot(roots[3]) >= 0.0 { idx |= 8; }
    idx  // 0..15
}
```

**Approximate query:** At each level, visit the primary bucket plus 4 Hamming-1 neighbors (flip one bit each). This gives 5/16 = 31.25% scan per level. Over 3 levels: (5/16)^3 = 3.05%.
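
A Python sketch of the same bucket logic plus the Hamming-1 neighborhood visited by an approximate query (it mirrors the Rust `bucket_index` above; `roots` is the level's 4 simple-root vectors):

```python
def bucket_index(roots, key):
    """4-bit chamber bucket: one sign bit per simple root (cf. the Rust version)."""
    idx = 0
    for bit, root in enumerate(roots):
        if sum(k * r for k, r in zip(key, root)) >= 0.0:
            idx |= 1 << bit
    return idx  # 0..15

def buckets_to_visit(roots, key):
    """Primary bucket plus its 4 Hamming-1 neighbors (one flipped sign bit each)."""
    primary = bucket_index(roots, key)
    return [primary] + [primary ^ (1 << bit) for bit in range(4)]
```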

---

## 4. E8 Lattice Memory (Phase 4)

### Address Translation Pipeline

```
Linear address (u64)
    |
    | _address_to_embedding()
    v
8D golden-angle embedding (Vec8)
    |
    | decode_to_e8()
    v
E8 lattice point (LatticePoint, [i32; 8])
    |
    | hash
    v
HashMap bucket (Vec<MemoryEntry>)
    |
    | project_to_h4()
    v
4D H4 vector (Vec4)
    |
    | ChamberTree.insert()
    v
Attention-queryable cache
```

### E8 Closest-Point Decoder

The E8 lattice decomposes as the union of two cosets: D8 and D8 + [1/2]^8. The decoder finds the closest point in each coset and keeps the nearer one:

```
Input: point p in R^8

Coset 1 (D8): integers with even sum
  1. f1 = round(p) componentwise
  2. If sum(f1) is odd:
     - Find component with largest rounding error
     - Flip it to the other nearest integer
  3. dist1 = ||p - f1||^2

Coset 2 (D8 + [1/2]^8): half-integers with even sum
  1. f2 = floor(p) + 0.5 componentwise
  2. If sum(f2) is odd (parity check):
     - Find component with largest error
     - Shift it by +-1
  3. dist2 = ||p - f2||^2

Return: f1 if dist1 <= dist2, else f2
```

**Representation:** All coordinates are stored as 2x integers ([i32; 8]) so that integer and half-integer lattice points share one type. The integer point (1, 1, 0, ..., 0) becomes (2, 2, 0, ..., 0); the half-integer point (1/2, 1/2, ..., 1/2) becomes (1, 1, ..., 1).
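
A numpy sketch of this decoder, returning the doubled-integer representation (illustrative; the Rust implementation differs in detail):

```python
import numpy as np

def decode_to_e8(p):
    """Closest E8 lattice point to p (len-8 array), as doubled integers."""
    p = np.asarray(p, dtype=float)

    def fix_parity(f):
        # Move the worst-rounded coordinate to its second-nearest grid point
        i = np.argmax(np.abs(p - f))
        f = f.copy()
        f[i] += 1.0 if p[i] >= f[i] else -1.0
        return f

    f1 = np.round(p)                         # coset 1: integers
    if int(round(f1.sum())) % 2 != 0:        # need an even coordinate sum
        f1 = fix_parity(f1)

    f2 = np.floor(p) + 0.5                   # coset 2: half-integers
    if int(round(f2.sum() - 4)) % 2 != 0:    # the 8 halves contribute exactly 4
        f2 = fix_parity(f2)

    best = f1 if np.sum((p - f1) ** 2) <= np.sum((p - f2) ** 2) else f2
    return np.rint(2 * best).astype(np.int32)   # doubled-integer representation
```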

### Memory Operations

**STORE_MEM (R[a] -> mem[R[b]]):**
1. Convert address R[b] to 8D embedding via golden-angle spiral
2. Decode embedding to E8 lattice point
3. Insert (embedding, value, address, timestamp) into that cell's bucket
4. Project embedding to 4D via E8->H4 and insert into ChamberTree
5. If bucket exceeds 240 entries (kissing number), evict oldest (LRU)

**LOAD_MEM (mem[R[a]] -> R[dest]):**
1. Convert address R[a] to 8D embedding
2. Decode to E8 lattice point (primary cell)
3. Search primary cell for closest embedding match
4. Search 240 kissing neighbors for closer matches
5. Return value of closest match
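
The neighbor search in step 4 above walks the 240 minimal vectors of E8: 112 vectors with two nonzero +-1 coordinates, plus 128 half-integer sign patterns with an even number of minus signs. A sketch that enumerates them once, in the doubled-integer representation (illustrative; the Rust code may precompute these differently):

```python
from itertools import combinations, product

def e8_kissing_neighbors():
    """The 240 minimal E8 vectors, as doubled integers (halves become odd ints)."""
    neighbors = []
    # 112 vectors: (+-1, +-1) in two coordinates, 0 elsewhere -> (+-2, +-2) doubled
    for i, j in combinations(range(8), 2):
        for si, sj in product((-2, 2), repeat=2):
            v = [0] * 8
            v[i], v[j] = si, sj
            neighbors.append(tuple(v))
    # 128 vectors: (+-1/2)^8 with an even number of minus signs -> (+-1)^8 doubled
    for signs in product((-1, 1), repeat=8):
        if signs.count(-1) % 2 == 0:
            neighbors.append(signs)
    return neighbors

assert len(e8_kissing_neighbors()) == 240
```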

### Utilization Characteristics

With structured (Wasm-like) embeddings at 10,000 steps:
- **106 occupied cells** holding ~10,000 stored entries
- **Max bucket size: 240** (saturated cells hit the kissing number bound)
- **Primary cell hit rate: 100%** (exact queries always find their target in the primary cell)
- **Average bucket size: 83.7** entries

The clustering behavior is expected: structured execution traces have correlated embeddings that map to nearby E8 lattice points. The 240 cap prevents any cell from growing unbounded.

---

## 5. Phi-Recursive State Encoding

### Fibonacci Checkpoint Levels

Long traces are compressed using Fibonacci-spaced checkpoints:

```
Level 0: checkpoint at every step          spacing = F(1) = 1
Level 1: checkpoint at every phi steps     spacing = F(2) = 1
Level 2: checkpoint at every phi^2 steps   spacing = F(3) = 2
Level 3: checkpoint at every phi^3 steps   spacing = F(4) = 3
Level 4: checkpoint at every phi^4 steps   spacing = F(5) = 5
...
Level k: checkpoint at every phi^k steps   spacing = F(k+1)
```

### Zeckendorf Retrieval

To retrieve state at step t, use Zeckendorf decomposition:
- Every positive integer is a unique sum of non-consecutive Fibonacci numbers
- Decompose distance to target step into Fibonacci components
- Traverse checkpoint levels to reconstruct state

**Example:** Retrieve step 20: 20 = 13 + 5 + 2 = F(7) + F(5) + F(3). Jump to the level-7 checkpoint closest to the target, then refine via level 5, then level 3.
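
For reference, a minimal sketch of the greedy Zeckendorf decomposition (the standard algorithm, not the project's code):

```python
def zeckendorf(n):
    """Greedy decomposition of n into non-consecutive Fibonacci numbers."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts

assert zeckendorf(20) == [13, 5, 2]   # matches the example above
```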

---

## 6. Weight Construction

### Analytical (No Training)

Transformer weights are computed directly from the geometry:

**Attention weights:**
```python
# W_K for IP-lookup heads (4x32): identity in columns 0:4 (the IP encoding)
W_K[0][:, 0:4] = np.eye(4)
W_K[1][:, 0:4] = np.eye(4)   # duplicate for robustness

# W_K for register-lookup heads: identity over the register-file dims 8:16
W_K[2][:, 8:12] = np.eye(4)
W_K[3][:, 12:16] = np.eye(4)
```

**FFN weights:**
```python
# Layer 1: opcode detection
# Each opcode vertex direction becomes a row in W1
# ReLU activation selects the matching opcode

# Layer 2: operation execution
# Maps detected opcode to register update
```

The key insight is that H4 geometry provides enough angular resolution (14,400 chambers) that each instruction type occupies a distinct Coxeter chamber, making the attention lookup exact rather than approximate.
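
A hedged numpy sketch of the layer-1 construction: each row of W1 reads the opcode slot (dims 4:8) and compares it against one opcode's vertex, so after ReLU only the matching row stays active. The threshold value and function names are illustrative, not the actual compiler code.

```python
import numpy as np

def build_opcode_detector(opcode_vertices, d_model=32, threshold=0.9):
    """Layer-1 FFN weights: one detector row per opcode, reading dims 4:8."""
    n_ops = len(opcode_vertices)
    W1 = np.zeros((n_ops, d_model))
    b1 = np.full(n_ops, -threshold)   # fires only near the matching vertex
    for i, v in enumerate(opcode_vertices):
        W1[i, 4:8] = v                # dot with the opcode sub-vector
    return W1, b1

def detect_opcode(state_vec, W1, b1):
    return np.maximum(W1 @ state_vec + b1, 0.0)   # ReLU: ~0 except matching row
```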

---

## 7. Rust Implementation Details

### SIMD Alignment

```rust
#[repr(align(32))]   // Vec4: one 256-bit AVX2 register
pub struct Vec4(pub [f64; 4]);

#[repr(align(64))]   // Vec8: two 256-bit AVX2 registers
pub struct Vec8(pub [f64; 8]);
```

Dot products are written as single expressions so LLVM can emit `vmulpd` + horizontal add:
```rust
pub fn dot(self, other: Self) -> f64 {
    let (a, b) = (self.0, other.0);
    (a[0]*b[0] + a[1]*b[1]) + (a[2]*b[2] + a[3]*b[3])
}
```

### Thread Safety

ChamberTree is `Send + Sync` for rayon parallelism across attention heads:
```rust
unsafe impl Send for ChamberTree {}
unsafe impl Sync for ChamberTree {}
```

Parallel queries distribute heads across threads:
```rust
pub fn query_approx_par(&self, embedding: &[f64]) -> Vec<Option<(f64, u64)>> {
    self.heads.par_iter().enumerate().map(|(h, head)| {
        let query_vec = &embedding[h * 4..(h + 1) * 4];  // this head's 4D slice (schematic)
        head.query_approx(query_vec)
    }).collect()
}
```

### Build Configuration

```toml
[profile.release]
opt-level = 3    # Maximum optimization (enables auto-vectorization)
lto = true       # Link-time optimization (enables cross-crate inlining)
```

---

## 8. MCP Server Architecture

```
Claude Code <--stdio--> h4_mcp_server.py <--import--> weight_compiler.py
                                          <--import--> h4_polytopic_attention.py
```

The server runs as a subprocess of Claude Code, communicating via stdin/stdout JSON-RPC (MCP protocol). It uses Max plan OAuth --- the compute happens locally, and Claude's API calls use the existing subscription.

### Tool Dispatch

```python
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "h4_fibonacci":       return await _h4_fibonacci(arguments)
    elif name == "h4_compile_and_run": return await _h4_compile_and_run(arguments)
    elif name == "h4_geometry_info":   return await _h4_geometry_info(arguments)
    elif name == "h4_benchmark":       return await _h4_benchmark(arguments)
    elif name == "h4_lattice_memory":  return await _h4_lattice_memory(arguments)
```

Each tool creates a fresh H4Executor instance, suppresses stdout (the executor is chatty), runs the program, and returns structured JSON results.

---

## 9. Phase 5: Trainable Hybrid Attention (PyTorch)

### Frozen + Trainable Architecture

The key insight from the Fibonacci proof-of-concept (26 trainable params on a frozen H4 backbone): the geometric backbone provides such strong inductive bias that only small trainable adapters are needed. The learned W_nudge matrix converged to rank-1, aligning 93% with a 600-cell vertex. Phase 5 generalizes this to full language modeling.

```
Input tokens
    |
    v
Token Embedding + Golden-Angle Positional Encoding (phi-inverse spacing)
    |
    v
+-------------------------------------------------------------------+
|  H4TransformerBlock (repeated N times)                            |
|                                                                    |
|  LayerNorm -> H4AttentionLayer -> Residual                        |
|                |                                                   |
|                +-- FROZEN: 600-cell vertices, simple roots,       |
|                |           E8->H4 projection, ChamberTree         |
|                |                                                   |
|                +-- TRAINABLE: W_q/k/v_proj, W_nudge (4x4/head), |
|                |              chamber_bonus (16/head), W_out      |
|                |                                                   |
|                +-- Forward path:                                  |
|                    1. Project input -> Q,K,V per head             |
|                    2. Normalize Q,K to S^3 (unit 4-sphere)        |
|                    3. Apply W_nudge (small query rotation)        |
|                    4. ChamberTree top-k lookup OR full attention  |
|                    5. Soft chamber bonus on keys (differentiable) |
|                    6. Softmax over candidates -> weighted V sum   |
|                    7. Concatenate heads -> W_out                  |
|                                                                    |
|  LayerNorm -> FFN (Linear -> GELU -> Linear) -> Residual         |
+-------------------------------------------------------------------+
    |
    v
LayerNorm -> LM Head (weight-tied with token embedding)
    |
    v
Logits (vocab_size)
```

### ChamberTree as Preprocessing Filter

The ChamberTree is not differentiable --- it's a discrete tree traversal. The trick:

1. **Build:** Insert all K vectors into ChamberTree (numpy, per sequence)
2. **Query:** For each Q, tree returns top-k candidate key indices in O(log t)
3. **Gather:** Use indices to select candidate K, V tensors in PyTorch
4. **Attend:** Compute differentiable softmax attention over k candidates only
5. **Gradients:** Flow through W_q_proj, W_k_proj, W_nudge, chamber_bonus, W_v_proj normally

This gives O(k) attention per query where k << t. The tree is just a fast filter.
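
A PyTorch sketch of steps 3-4 for a single head, assuming the tree has already produced `cand_idx` of shape (T, k) with candidate key indices for each query (names are illustrative):

```python
import torch
import torch.nn.functional as F

def sparse_attention_one_head(q, k, v, cand_idx):
    """q, k, v: (T, 4); cand_idx: (T, k) long tensor from the ChamberTree filter."""
    k_cand = k[cand_idx]                      # (T, k, 4) gathered candidate keys
    v_cand = v[cand_idx]                      # (T, k, 4) gathered candidate values
    scores = torch.einsum('td,tkd->tk', q, k_cand) / (4 ** 0.5)
    attn = F.softmax(scores, dim=-1)          # differentiable over k candidates only
    return torch.einsum('tk,tkd->td', attn, v_cand)   # (T, 4) per-query output
```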

### Soft Chamber Bonus

The chamber_bonus parameter adds an attention bias based on which Coxeter chamber each key lands in. Because softmax is invariant to adding a constant across all keys, a per-query bonus would cancel out; the bonus therefore has to vary per key, not per query.

For gradient flow, chamber membership uses soft assignment:
```
k_dots = K @ simple_roots^T            # (B, T, H, 4)
soft_signs = sigmoid(k_dots * 3.0)     # soft bit per root
chamber_weights = prod over 4 bits     # (B, T, H, 16) soft membership
k_bonus = sum(chamber_weights * bonus) # (B, T, H) per-key bonus
scores += k_bonus                      # differentiable
```
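
The same computation in PyTorch, as a sketch: `simple_roots` is the frozen (4, 4) buffer, `bonus` the trainable (H, 16) parameter, and K has shape (B, T, H, 4).

```python
import torch

def soft_chamber_bonus(K, simple_roots, bonus, sharpness=3.0):
    """Differentiable per-key chamber bonus, shape (B, T, H)."""
    k_dots = torch.einsum('bthd,rd->bthr', K, simple_roots)   # (B, T, H, 4)
    soft_bits = torch.sigmoid(k_dots * sharpness)             # soft sign per root
    # Soft membership of each of the 16 chambers = product over the 4 bits
    patterns = K.new_tensor([[(c >> b) & 1 for b in range(4)] for c in range(16)])
    bits = soft_bits.unsqueeze(-2)                            # (B, T, H, 1, 4)
    probs = bits * patterns + (1 - bits) * (1 - patterns)     # (B, T, H, 16, 4)
    chamber_weights = probs.prod(dim=-1)                      # (B, T, H, 16)
    return torch.einsum('bthc,hc->bth', chamber_weights, bonus)
```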

### Golden-Angle Positional Encoding

Instead of sinusoidal or RoPE, positions are encoded using phi-inverse spacing:

```
Position n -> angle = n * 2*pi * phi^-1
```

phi^-1 ~ 0.618 is the most irrational number (hardest to approximate by rationals). This guarantees:
- Consecutive positions are maximally separated (~137.5 degrees apart)
- No position vectors repeat or nearly repeat at any scale
- Long-range positions compress via Zeckendorf decomposition (sum of Fibonacci-indexed embeddings)

Frequency scales across dimension pairs use phi powers: phi^(-k/n_pairs), giving geometrically spaced frequencies anchored to the golden ratio.
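
A numpy sketch of the encoding under these assumptions (angular step 2*pi/phi, frequencies phi^(-k/n_pairs) across dimension pairs); the exact scaling in the code base may differ:

```python
import numpy as np

def golden_angle_positions(seq_len, d_model):
    """Positional encodings with phi-inverse angular steps and phi-power frequencies."""
    phi = (1 + 5 ** 0.5) / 2
    n_pairs = d_model // 2
    pos = np.arange(seq_len)[:, None]                   # (T, 1)
    freqs = phi ** (-np.arange(n_pairs) / n_pairs)      # geometric, phi-anchored
    angles = pos * 2 * np.pi / phi * freqs              # (T, n_pairs)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.cos(angles)
    pe[:, 1::2] = np.sin(angles)
    return pe
```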

### Diagnostic Metrics

Phase 5 tracks geometric health during training:

| Metric | What it measures | Healthy range |
|--------|-----------------|---------------|
| `chamber_entropy` | Shannon entropy of chamber utilization | > 2.0 (max ~2.77 for 16 chambers) |
| `nudge_rank` | Ratio sigma_1/sigma_2 of W_nudge SVD | > 2.0 (rank-1 = focused direction) |
| `geo_alignment` | Max dot of nudge direction with 600-cell vertex | > 0.9 (geometry attracts learning) |
| `scan_ratio` | Fraction of keys examined per query (tree mode) | << 1.0 (lower = more pruning) |
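
For example, `nudge_rank` is just the ratio of the top two singular values of a head's W_nudge (a sketch):

```python
import torch

def nudge_rank(w_nudge):
    """sigma_1 / sigma_2 of a (4, 4) W_nudge; large values indicate rank-1 focus."""
    s = torch.linalg.svdvals(w_nudge)
    return (s[0] / s[1].clamp(min=1e-8)).item()
```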

### Autoresearch Protocol

`h4_program.md` defines an autonomous experiment loop:
- Agent modifies ONLY `train_cpu.py` (trainable adapters + hyperparameters)
- Frozen geometry is off-limits
- 2-minute CPU budget per experiment
- Keep/discard based on val_bpb improvement
- Git commit on keep, checkout on discard
- Results tracked in `results.tsv` (untracked)

---

## 10. Phase 6: BitNet b1.58 Ternary Weights

### Why Ternary Fits H4 Geometry

The ChamberTree routing already performs 1-bit quantization: `sign(dot(query, root_i))` for 4 roots gives a 4-bit chamber index. BitNet b1.58 extends this to ternary {-1, 0, +1} for all trainable weights.

The key insight: ternary preserves signs, and chamber assignments only depend on signs. Therefore ternary-quantized attention heads navigate the same Coxeter chambers. At initialization this was confirmed at **97.9% chamber preservation**. After training, preservation drops to ~76% as the model finds its own geometric routing in the ternary-constrained space --- different from the float routing but equally effective (0.003 bpb gap).

### BitLinear: Straight-Through Estimator

```
Training forward pass:
    1. Quantize weights:  w_q = RoundClip(w / mean(|w|), -1, +1)
    2. STE:  w_ste = w + (w_q * scale - w).detach()
       -> forward sees quantized value, backward sees float shadow
    3. Quantize activations: x_q = RoundClip(x * 127 / max(|x|), -127, 127)
    4. STE:  x_ste = x + (x_q * scale / 127 - x).detach()
    5. y = Linear(x_ste, w_ste)  -> full gradient flow to w and x

Inference forward pass (after freeze()):
    1. Load frozen int8 weights (only {-1, 0, +1} values)
    2. y = x @ w_frozen.float() * scale  -> matmul is add/sub only
```
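
A minimal PyTorch sketch of the weight half of this STE (activation quantization and the frozen inference path omitted; not the project's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Linear):
    """Ternary weights via straight-through estimator (weight path only)."""
    def forward(self, x):
        scale = self.weight.abs().mean().clamp(min=1e-8)
        w_q = (self.weight / scale).round().clamp(-1, 1)   # RoundClip to {-1, 0, +1}
        # STE: forward uses the quantized weights, backward sees the float shadow
        w_ste = self.weight + (w_q * scale - self.weight).detach()
        return F.linear(x, w_ste, self.bias)
```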

### What Stays Float32

| Component | Reason |
|-----------|--------|
| 600-cell vertices (120 x 4) | Frozen buffer, static lookup |
| H4 simple roots (4 x 4) | Frozen buffer, need exact signs for chamber routing |
| E8->H4 projection (4 x 8) | Frozen buffer |
| chamber_bonus (n_heads x 16) | Too small to quantize (128 values), needs continuous gradients |
| Token embeddings | Lookup table, not a matmul |
| LayerNorm parameters | Small, need float precision |
| LM head | Weight-tied with embeddings |

### What Becomes Ternary

| Component | Params (per layer) | Ternary Size (per layer) |
|-----------|-------------------|------------------------|
| W_q_proj, W_k_proj | 2x d_model*32 | varies by d_model |
| W_v_proj, W_out | 2x d_model*128 | varies by d_model |
| FFN up + down | 2x d_model*d_ffn | varies by d_model |

Example configs from autoresearch (best results):
- **d_model=128, 6 layers** (float best): ~348K params, ~1.4 MB float32
- **d_model=256, 4 layers** (ternary best): ~1.1M params, ~310 KB at 1.58 bits (~17x compression)

### Ternary Inference Path

```
Token -> Embedding (float16 lookup)
      -> BitLinear Q/K/V (ternary: add/sub only)
      -> Normalize to S^3 (float, 4-dim per head)
      -> W_nudge rotation (float, 4x4 per head --- only 16 values)
      -> ChamberTree lookup (integer sign comparisons, 3.1% scan)
      -> Softmax over k candidates (float, k << t)
      -> Weighted sum of values (float x ternary)
      -> BitLinear output (ternary: add/sub only)
      -> BitLinear FFN (ternary: add/sub only)
      -> LM head -> next token
```

Float multiplies in the critical path: root dot products (4x4), softmax, dequant scales (one per layer). Everything else is integer addition/subtraction.

---

## 11. Phase 7: Unified Geometric RAG

### E8 Retrieval + H4 Ranking + Ternary Generation

Phase 7 unifies document retrieval and generation through the same E8->H4 projection. In standard RAG systems, retrieval (embedding model + vector database) and generation (LLM) are separate systems with different geometries. H4 RAG uses one geometry for both:

1. **Encode:** Documents chunk into 8D embeddings via golden-angle spiral placement, stored in E8 Voronoi cells (`rag/encoder.py`)
2. **Retrieve:** Questions project to 8D, E8 lattice lookup finds nearest chunks in O(1) + 240 neighbors (`h4_polytopic_attention.py: E8LatticeIndex`)
3. **Rank:** Question and passage both encode to 4D vectors on S^3 via H4 attention. Relevance = dot product in H4 space --- the same metric as attention routing (`rag/ranking_model.py`)
4. **Generate:** Ternary H4 model reads retrieved context and produces answers (`h4_language_model.py`)

### Contrastive Ranking in H4 Space

The ranker uses InfoNCE loss with in-batch negatives. For a batch of B questions, each paired with its correct passage, the similarity matrix is BxB where entry [i,j] = dot(q_i, p_j) on S^3. The diagonal entries (correct pairs) should dominate each row.

Key finding: ternary models need ~2x higher contrastive temperature (0.15 vs 0.07) because BitLinear produces noisier similarity scores that require softer probability distributions for gradient flow.
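
A sketch of the InfoNCE objective with in-batch negatives described above; `q_emb` and `p_emb` stand for the 4D question/passage embeddings, and the temperature default follows the ternary setting noted here:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, p_emb, temperature=0.15):
    """q_emb, p_emb: (B, 4). Diagonal pairs are the positives."""
    q = F.normalize(q_emb, dim=-1)                # project onto S^3
    p = F.normalize(p_emb, dim=-1)
    sim = q @ p.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sim, targets)          # each row should pick its diagonal
```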

### Two-Stage Retrieval Pipeline

The production pipeline uses both architectures:

1. **Bi-encoder** (fast, broad): Encodes question and passage separately, scores by H4 dot product. At 3.7M params: R@5=100%, MRR=0.93. The answer is always in the top 5 results. Cost: ~20ms for all documents.

2. **Pre-trained cross-encoder** (precise, narrow): A pre-trained reranker (ms-marco-MiniLM-L-6-v2, 22M params) scores each candidate by reading question + passage jointly. Achieves **98.5% R@1** on the bi-encoder's top-5 candidates. Cost: ~500ms for 5 candidates.

Pipeline: H4 bi-encoder retrieves top-5 (100% recall, 20ms) -> MiniLM reranks top-5 (98.5% precision, ~500ms) -> return best.

Our H4 cross-encoder (25M ternary params, 8h overnight on 5.9K SQuAD pairs) reached **80% R@1 peak** (69% final eval) --- a breakthrough showing H4 cross-attention learns question-to-passage alignment through Coxeter chambers. Trajectory: 29% (1hr) -> 52% (3hr) -> 80% (7hr). The system ships with MiniLM (98.5%) for production accuracy, with the H4 cross-encoder as a fully geometric, zero-dependency alternative.

### Full-Scale Results

- **Language modeling:** 24M ternary params, PPL 10.0 on TinyStories (8h CPU), beats published 33M baseline
- **Bi-encoder retrieval:** R@5=100%, MRR=0.93 at 3.7M params (answer always in results)
- **Combined reranking:** 98.5% R@1 (H4 bi-encoder retrieval + pre-trained MiniLM reranker)
- **Passage ranking (min viable):** 41.5% R@1 at 870K params (12x random chance)
- **Retrieval:** 7.8ms per query via E8 lattice
- **Cost:** $0/month (replaces $500-2,000/month enterprise RAG stack)