# Embedding Threshold Logic Circuits into Transformer MLPs

## Technical Implementation Guide

---

## 1. Core Thesis

Standard LLMs fail at arithmetic because they're interpolators—they approximate functions over training distributions rather than compute exact results. A 360M parameter model trained on internet text has seen "127 + 128 = 255" zero or few times, so it guesses a plausible-looking number like "140" based on pattern matching.

We solve this by embedding **frozen, proven-correct arithmetic circuits** directly into the transformer's MLP layers. The circuits use threshold logic (weighted sums + step activation), which is structurally compatible with neural network layers. We train only the **interface layers** that learn to:

1. Extract operands from token embeddings
2. Route computation through the circuits
3. Inject results back into the residual stream

The model learns **call dispatch**, not arithmetic. The arithmetic is already solved.

---

## 2. Threshold Logic Fundamentals

### 2.1 Single Threshold Gate

A threshold gate computes:

```
output = 1 if (Σ wᵢxᵢ + b) ≥ 0
         0 otherwise
```

This is a neuron with Heaviside step activation. With integer weights `w` and bias `b`, it computes a Boolean function of binary inputs.

**Example: AND gate**
```
w = [1, 1], b = -2
AND(0,0) = H(0 + 0 - 2) = H(-2) = 0
AND(0,1) = H(0 + 1 - 2) = H(-1) = 0
AND(1,0) = H(1 + 0 - 2) = H(-1) = 0
AND(1,1) = H(1 + 1 - 2) = H(0)  = 1
```

**Example: OR gate**
```
w = [1, 1], b = -1
OR(0,0) = H(0 + 0 - 1) = H(-1) = 0
OR(0,1) = H(0 + 1 - 1) = H(0)  = 1
OR(1,0) = H(1 + 0 - 1) = H(0)  = 1
OR(1,1) = H(1 + 1 - 1) = H(1)  = 1
```

### 2.2 Multi-Layer Circuits

XOR is not linearly separable—it requires two layers:

```
Layer 1:
  neuron1 (OR):   w = [1, 1],   b = -1 → fires if a OR b
  neuron2 (NAND): w = [-1, -1], b = 1  → fires if NOT(a AND b)

Layer 2:
  neuron3 (AND):  w = [1, 1],   b = -2 → fires if both layer-1 outputs are 1

XOR(a,b) = AND(OR(a,b), NAND(a,b))
```
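These gates are small enough to sanity-check in a few lines of Python (an illustrative sketch, separate from the shipped circuit weights):
```python
def threshold_gate(w, b, x):
    """Single threshold gate: H(w·x + b), with Heaviside step H."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def XOR(a, b):
    or_out = threshold_gate([1, 1], -1, [a, b])             # layer 1: OR
    nand_out = threshold_gate([-1, -1], 1, [a, b])          # layer 1: NAND
    return threshold_gate([1, 1], -2, [or_out, nand_out])   # layer 2: AND

assert [XOR(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 1, 1, 0]
```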
### 2.3 Full Adder

A full adder computes `sum` and `carry_out` from inputs `a`, `b`, `carry_in`:

```
sum  = a XOR b XOR cin
cout = (a AND b) OR (cin AND (a XOR b))
```

Implementation uses two half-adders chained:

```
HA1: (a, b)      → (sum1 = a XOR b,      carry1 = a AND b)
HA2: (sum1, cin) → (sum2 = sum1 XOR cin, carry2 = sum1 AND cin)
cout      = carry1 OR carry2
final_sum = sum2
```

Each XOR is 2 layers, each AND/OR is 1 layer. Total depth: ~4 layers per full adder.

### 2.4 8-bit Ripple Carry Adder

Chain 8 full adders, propagating carry:

```
FA0: (a[0], b[0], 0)  → (sum[0], c0)
FA1: (a[1], b[1], c0) → (sum[1], c1)
FA2: (a[2], b[2], c1) → (sum[2], c2)
...
FA7: (a[7], b[7], c6) → (sum[7], c7)
```

Total circuit depth: ~32 threshold layers (8 FAs × 4 layers each).
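The same construction, carried through to a full ripple-carry adder (again an illustrative sketch; the shipped circuits implement this with frozen integer weight tensors):
```python
H = lambda z: 1 if z >= 0 else 0                      # Heaviside step

def AND(a, b): return H(a + b - 2)
def OR(a, b):  return H(a + b - 1)
def XOR(a, b): return AND(OR(a, b), H(-a - b + 1))    # AND(OR, NAND)

def full_adder(a, b, cin):
    s1, c1 = XOR(a, b), AND(a, b)         # half adder 1
    s2, c2 = XOR(s1, cin), AND(s1, cin)   # half adder 2
    return s2, OR(c1, c2)

def add_8bit(a_bits, b_bits):
    """Bits are LSB first, matching the BitExtractor convention in §5.1."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

to_bits = lambda n: [(n >> i) & 1 for i in range(8)]
from_bits = lambda bits: sum(bit << i for i, bit in enumerate(bits))

s, cout = add_8bit(to_bits(127), to_bits(128))
assert from_bits(s) == 255 and cout == 0              # the running example
```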
---

## 3. Circuit Inventory

The `neural_computer.safetensors` contains 3,122 tensors / 5,648 parameters implementing:

| Category | Circuits | Tensors |
|----------|----------|---------|
| Boolean | AND, OR, NOT, NAND, NOR, XOR, XNOR, IMPLIES, BIIMPLIES | ~30 |
| Arithmetic | Half adder, Full adder, Ripple carry 2/4/8-bit, 8×8 multiplier | ~800 |
| Comparators | GT, LT, GEQ, LEQ, EQ (8-bit) | ~50 |
| ALU | 16-operation ALU, opcode decoder, flag computation | ~400 |
| Control | JMP, JZ, JNZ, JC, JNC, JN, JP, CALL, RET, PUSH, POP | ~200 |
| Modular | Divisibility by 2-12 | ~600 |
| Error Detection | Parity, CRC, Hamming, checksum | ~200 |
| Pattern | Popcount, leading zeros, symmetry | ~150 |
| Threshold | k-of-n gates, majority, minority | ~100 |

All weights are integers. All activations are Heaviside. Verified with 6,590 exhaustive tests.
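The file can be inspected with the standard `safetensors` API (the tensor names are not specified in this guide, so print them rather than assuming):
```python
from safetensors.torch import load_file

# Load the frozen circuit weights and sanity-check the manifest counts.
tensors = load_file("neural_computer.safetensors")

print(len(tensors), "tensors")                                  # expect 3,122
print(sum(t.numel() for t in tensors.values()), "parameters")   # expect 5,648

# Peek at a few entries to see the gate-level layout
for name, t in list(tensors.items())[:5]:
    print(name, tuple(t.shape), t.dtype)
```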
---

## 4. Transformer Integration Architecture

### 4.1 Target: SmolLM2-360M

```
Architecture:  LlamaForCausalLM
Hidden dim:    960
Layers:        32
Heads:         15
MLP expansion: 4x (intermediate = 3840)
Vocab:         49152
Parameters:    361,821,120
```

Standard MLP block:
```python
def forward(self, x):              # x: [batch, seq, 960]
    gate = self.gate_proj(x)       # [batch, seq, 3840]
    up = self.up_proj(x)           # [batch, seq, 3840]
    hidden = silu(gate) * up       # SwiGLU activation
    return self.down_proj(hidden)  # [batch, seq, 960]
```

### 4.2 Augmented MLP Block

```python
def forward(self, x):  # x: [batch, seq, 960]
    # Original MLP path (unchanged)
    mlp_out = self.down_proj(silu(self.gate_proj(x)) * self.up_proj(x))

    # Circuit path (new)
    a_bits, b_bits = self.bit_extractor(x)  # [batch, seq, 8] each
    result_bits, carry = self.circuits.add_8bit(a_bits, b_bits)
    flags = self.compute_flags(result_bits, carry)
    circuit_delta = self.bit_injector(result_bits, flags)

    # Routing
    route_weights = self.router(x)  # [batch, seq, 2] softmax

    # Combine: circuit output added in proportion to the router's circuit weight
    return mlp_out + route_weights[..., 1:2] * circuit_delta
```

### 4.3 Layer Selection

We augment the **middle third** of layers (10-20 of 32):

- Early layers (0-9): Token/position encoding, not arithmetic-relevant
- Middle layers (10-20): Abstract reasoning, computation
- Late layers (21-31): Output formatting, vocabulary projection

Rationale: Arithmetic computation happens in middle layers, where the model processes relationships between tokens. Early layers haven't built sufficient representations; late layers are committed to output tokens.
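One way to wire the augmented block into the checkpoint (a sketch using the interface modules defined in §5; `circuits` and `compute_flags` stand in for the frozen circuit wrapper, and the real integration lives in `circuit_llm.py`):
```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")

class AugmentedMLP(nn.Module):
    """Wraps the original (frozen) MLP and adds the gated circuit path."""
    def __init__(self, mlp, circuits, d_model=960):
        super().__init__()
        self.mlp = mlp                               # frozen original MLP
        self.circuits = circuits                     # frozen threshold circuits
        self.bit_extractor = BitExtractor(d_model)   # §5.1
        self.bit_injector = BitInjector(d_model)     # §5.2
        self.router = Router(d_model)                # §5.3

    def forward(self, x):
        mlp_out = self.mlp(x)
        a_bits, b_bits = self.bit_extractor(x)
        result_bits, carry = self.circuits.add_8bit(a_bits, b_bits)
        delta = self.bit_injector(result_bits, compute_flags(result_bits, carry))
        route = self.router(x)
        return mlp_out + route[..., 1:2] * delta

# Swap the augmented block into the middle third of layers
for i in range(10, 21):
    layer = model.model.layers[i]
    layer.mlp = AugmentedMLP(layer.mlp, circuits)
```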
---

## 5. Interface Layers (Trainable)

### 5.1 BitExtractor

Maps a token embedding → two 8-bit operands.

```python
class BitExtractor(nn.Module):
    def __init__(self, d_model=960):
        super().__init__()
        self.proj = nn.Linear(d_model, 16)  # 960 → 16

    def forward(self, x):
        logits = self.proj(x)     # [batch, seq, 16]
        bits = heaviside(logits)  # binarize with STE (see §6.2)
        a_bits = bits[..., :8]    # first operand
        b_bits = bits[..., 8:]    # second operand
        return a_bits, b_bits     # both [batch, seq, 8], LSB first
```

**What it learns**: Which embedding dimensions encode numeric magnitude. For token "127", it must learn that certain activation patterns correspond to bits `[1,1,1,1,1,1,1,0]`.
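As a quick sanity check on the LSB-first convention, the expected bit patterns can be computed directly:
```python
lsb_bits = lambda n: [(n >> i) & 1 for i in range(8)]  # LSB-first bit pattern

assert lsb_bits(127) == [1, 1, 1, 1, 1, 1, 1, 0]
assert lsb_bits(128) == [0, 0, 0, 0, 0, 0, 0, 1]
```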
**Parameters**: 960 × 16 + 16 = 15,376

### 5.2 BitInjector

Maps circuit outputs → an embedding delta.

```python
class BitInjector(nn.Module):
    def __init__(self, d_model=960):
        super().__init__()
        self.proj = nn.Linear(16, d_model)  # 16 → 960
        self.scale = nn.Parameter(torch.tensor(0.1))

    def forward(self, result_bits, flags):
        combined = torch.cat([result_bits, flags], dim=-1)  # [batch, seq, 16]: 8 result bits + 8 flag bits
        return self.proj(combined) * self.scale             # [batch, seq, 960]
```

**What it learns**: How to inject the result bits back into embedding space such that subsequent layers (and the final vocabulary projection) produce the correct output tokens.

**Parameters**: 16 × 960 + 960 + 1 = 16,321

### 5.3 Router

Decides when to use the circuit path.

```python
class Router(nn.Module):
    def __init__(self, d_model=960):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.net(x)  # [batch, seq, 2]: [mlp_weight, circuit_weight]
```

**What it learns**: "This position contains arithmetic" → route through circuits. "This is prose" → use normal MLP.

**Parameters**: 960 × 64 + 64 + 64 × 2 + 2 = 61,634

### 5.4 Total Trainable Parameters

Per augmented layer:
```
BitExtractor:  15,376
BitInjector:   16,321
Router:        61,634
OpSelector:   ~31,000  (operation selection; see §11.1)
───────────────────────
Total:       ~124,331 per layer
```

For 11 augmented layers: **~1.37M trainable parameters**

This is 0.38% of the model. The other 99.62% (including all circuit weights) is frozen.
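A quick way to verify the trainable/frozen split (a sketch; the substring filters assume the attribute names used above, plus a hypothetical `op_selector` for the OpSelector):
```python
# Freeze everything, then unfreeze only the interface layers
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if any(k in name for k in ("bit_extractor", "bit_injector", "router", "op_selector")):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```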
---

## 6. Gradient Flow Through Heaviside

### 6.1 The Problem

Heaviside has zero gradient almost everywhere:

```
H(x) = 1 if x ≥ 0 else 0
dH/dx = 0 for x ≠ 0, undefined at x = 0
```

Standard backprop would give zero gradients to the BitExtractor.

### 6.2 Straight-Through Estimator (STE)

We use the STE: the forward pass uses the true Heaviside, the backward pass pretends it's the identity.

```python
class HeavisideSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x >= 0).float()  # true step function

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass the gradient through unchanged
```

**Intuition**: "If making the input larger would have helped the output, increase the input." The gradient tells us the direction even though the function is flat.
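The `heaviside` helper referenced in §5.1 can then be the Function's `apply` (a minimal usage sketch):
```python
import torch

heaviside = HeavisideSTE.apply

x = torch.randn(4, requires_grad=True)
y = heaviside(x)   # hard 0/1 values in the forward pass
y.sum().backward()
print(y, x.grad)   # x.grad is all ones: the gradient passed straight through
```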
### 6.3 Alternative: Sigmoid Annealing

During training, use a sigmoid with increasing temperature:

```python
def soft_heaviside(x, temperature):
    return torch.sigmoid(x * temperature)

# temperature: 1 → 10 → 100 over training
# At high temperature, sigmoid ≈ step function
```

This provides smoother gradients early in training, then sharpens to a true binary step at inference.
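One simple way to realize the 1 → 100 ramp (a sketch; the exponential shape is an assumption, not prescribed above):
```python
def temperature_at(step, total_steps, t_start=1.0, t_end=100.0):
    """Exponential ramp from t_start to t_end over training."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac
```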
---

## 7. Training Strategy

### 7.1 Data Generation

Generate arithmetic problems exhaustively:

```python
def generate_batch(batch_size):
    a = torch.randint(0, 256, (batch_size,))
    b = torch.randint(0, 256, (batch_size,))
    result = (a + b) % 256  # 8-bit wraparound

    # .item() converts each 0-d tensor to a plain int for formatting
    prompts = [f"{a[i].item()} + {b[i].item()} =" for i in range(batch_size)]
    targets = [f" {result[i].item()}" for i in range(batch_size)]

    return prompts, targets
```

For 8-bit addition, there are 256 × 256 = 65,536 unique problems. We can cover the entire space.

### 7.2 Loss Function

Standard cross-entropy on next-token prediction:

```python
outputs = model(input_ids, attention_mask=mask, labels=labels)
loss = outputs.loss  # CE loss, only on target tokens
```

Labels are masked for prompt tokens (`-100`), so the loss only backprops through the answer.
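A sketch of building those masked labels with a standard Hugging Face tokenizer (`build_example` is an illustrative helper):
```python
def build_example(tokenizer, prompt, target):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids  # loss only on the answer
    return input_ids, labels
```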
### 7.3 Optimizer Configuration

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Only train the interface layers
interface_params = [p for n, p in model.named_parameters()
                    if any(x in n for x in ['bit_extractor', 'bit_injector', 'router'])]

optimizer = AdamW(interface_params, lr=1e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
```

### 7.4 Curriculum Learning

Start simple, increase difficulty:

```
Phase 1 (epochs 1-2):  Single-digit addition (0-9 + 0-9)
Phase 2 (epochs 3-4):  Two-digit addition (0-99 + 0-99)
Phase 3 (epochs 5-7):  Full 8-bit addition (0-255 + 0-255)
Phase 4 (epochs 8-10): Adversarial cases (boundary and carry-chain cases: 127+128, 255+1)
```

This helps the interface layers learn the basic extraction pattern before tackling hard cases.
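A phase-aware sampler is a few lines (a sketch; the ranges mirror the phases above, and phase 4 would additionally oversample adversarial cases):
```python
import random

# (epoch range, max operand value) per curriculum phase
PHASES = [(range(1, 3), 9), (range(3, 5), 99), (range(5, 8), 255), (range(8, 11), 255)]

def sample_problem(epoch):
    max_val = next(m for r, m in PHASES if epoch in r)
    a, b = random.randint(0, max_val), random.randint(0, max_val)
    return f"{a} + {b} =", f" {(a + b) % 256}"
```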
### 7.5 Training Hyperparameters

```
Model:     SmolLM2-360M
Augmented: Layers 10-20 (11 layers)
Trainable: 1.37M parameters
Frozen:    362M parameters (including 5.6K circuit params)

Batch size:    32
Learning rate: 1e-4
Epochs:        10
Samples:       10,000 per epoch
Warmup:        500 steps
Device:        RTX 6000 Ada (48GB)

Expected time: ~30 minutes total
```

---

## 8. Forward Pass Walkthrough

Input: `"127 + 128 ="`

### 8.1 Tokenization

```
Tokens: ["127", " +", " 128", " ="]
IDs:    [12700, 489, 13824, 284]  # hypothetical
```

### 8.2 Embedding

```
embeddings = embed(input_ids)  # [1, 4, 960]
```

### 8.3 Layers 0-9 (Unchanged)

Standard attention + MLP, building representations.

### 8.4 Layer 10 (Augmented)

```python
# After attention
x = layer_norm(attn_output + residual)  # [1, 4, 960]

# MLP path
mlp_out = down_proj(silu(gate_proj(x)) * up_proj(x))

# Circuit path
a_bits, b_bits = bit_extractor(x)
# Position 0 ("127"): a_bits ≈ [1,1,1,1,1,1,1,0] if well-trained
# Position 2 ("128"): b_bits ≈ [0,0,0,0,0,0,0,1]
# (In practice, extraction happens per-position; aggregation is learned)

result_bits, carry = circuits.add_8bit(a_bits, b_bits)
# result_bits = [1,1,1,1,1,1,1,1] = 255

flags = compute_flags(result_bits, carry)
# zero=0, negative=1 (sign bit set), carry=0 (255 fits in 8 bits, no carry out)

circuit_delta = bit_injector(result_bits, flags)  # [1, 4, 960]

# Routing
route = router(x)  # [1, 4, 2]
# Position 3 ("="): route ≈ [0.1, 0.9] → use circuits
# Position 1 ("+"): route ≈ [0.8, 0.2] → mostly MLP

# Combine
output = mlp_out + route[..., 1:2] * circuit_delta
```

### 8.5 Layers 11-31

Continue processing, eventually projecting to the vocabulary.

### 8.6 Output

```
logits = lm_head(final_hidden)        # [1, 4, 49152]
next_token = argmax(logits[0, 3, :])  # token after "="
# Should decode to "255" (possibly as " 255" or "255")
```
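End to end, the running example can be checked through the standard generation API (a sketch; `model` is the augmented model from §4):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
inputs = tok("127 + 128 =", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))  # expect " 255"
```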
---

## 9. Inference Characteristics

### 9.1 Exactness

At inference, Heaviside is a true step function—no approximation. If the BitExtractor correctly maps "127" → bits and "128" → bits, the circuit **will** output 255. The only failure mode is incorrect extraction.

### 9.2 Latency

Circuit computation adds ~5-10% overhead:
- BitExtractor: 1 linear layer (960 → 16)
- Circuits: ~32 threshold layers, but sparse and tiny
- BitInjector: 1 linear layer (16 → 960)
- Router: 2 linear layers

The circuits have only 5,648 parameters total—negligible versus the 361M in the base model.

### 9.3 Generalization

Once the interface learns the mapping, it generalizes to **all** 65,536 8-bit additions. There's no memorization—the circuits compute.

---

## 10. Evaluation Metrics

### 10.1 Arithmetic Accuracy

```python
import random

def eval_accuracy(model, n_problems=1000):
    correct = 0
    for _ in range(n_problems):
        a, b = random.randrange(256), random.randrange(256)
        expected = (a + b) % 256
        predicted = model.generate(f"{a} + {b} =")
        if parse_int(predicted) == expected:  # parse_int: pull the first integer out of the generation
            correct += 1
    return correct / n_problems
```

**Baseline SmolLM2**: ~5-10% (guessing based on patterns)
**Target**: >95% (circuit-accurate)

### 10.2 Edge Case Performance

Specifically test (see the sketch below):
- Carry and boundary cases: 127+128, 255+1, 128+128
- Zeros: 0+0, 0+255
- Identity: x+0 for various x
- Commutativity: verify a+b == b+a
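A minimal helper that enumerates these cases (illustrative):
```python
def edge_case_suite():
    cases = [(127, 128), (255, 1), (128, 128), (0, 0), (0, 255)]
    cases += [(x, 0) for x in (1, 42, 200, 255)]  # identity
    cases += [(b, a) for a, b in cases]           # commutativity pairs
    return [(a, b, (a + b) % 256) for a, b in cases]
```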
### 10.3 Non-Arithmetic Preservation

Verify general capability isn't degraded:
- Perplexity on held-out text
- Common benchmarks (HellaSwag, etc.)

The augmentation should be **additive**—circuits help arithmetic, the MLP handles everything else via routing.

---

## 11. Extension Roadmap

### 11.1 Additional Operations

The circuit inventory includes:
- Subtraction (via two's complement)
- Multiplication (8×8 → 16-bit)
- Division (iterative subtraction)
- Bitwise ops (AND, OR, XOR, shifts)
- Comparisons (GT, LT, EQ)

Each needs its own extraction/injection interface, or a unified interface with operation selection.

### 11.2 Multi-Operand Expressions

For "15 + 27 + 33 =", we need:
- Operand count detection
- Sequential circuit invocation
- An accumulator pattern

### 11.3 Larger Bit Widths

16-bit and 32-bit arithmetic require:
- Larger circuits (or chained 8-bit adders)
- A wider BitExtractor (32 or 64 output dims)
- More training data

### 11.4 Symbolic Integration

Ultimate goal: the model recognizes when it needs to compute, invokes the circuits, and integrates the results into coherent natural language output.

```
User: "If I have 127 apples and buy 128 more, how many do I have?"
Model: [extracts 127, 128] [routes to circuit] [gets 255]
       "You would have 255 apples."
```

---

## 12. File Structure

```
8bit-threshold-computer/
├── neural_computer.safetensors   # Frozen circuits (3,122 tensors)
├── circuit_llm.py                # Integration architecture
├── train_circuit_interface.py    # Training loop
├── iron_eval.py                  # Circuit verification (6,590 tests)
├── skeptic_test.py               # Algebraic identity tests (127 tests)
├── prune_weights.py              # Weight optimization
├── tensors.txt                   # Tensor manifest
├── guide.md                      # This document
└── README.md                     # Project overview
```

---

## 13. Key Equations

### Heaviside Step
```
H(x) = 1 if x ≥ 0 else 0
```

### Threshold Gate
```
f(x₁,...,xₙ) = H(Σᵢ wᵢxᵢ + b)
```

### Full Adder
```
sum  = a ⊕ b ⊕ cᵢₙ
cₒᵤₜ = (a ∧ b) ∨ (cᵢₙ ∧ (a ⊕ b))
```

### STE Gradient
```
Forward:  y = H(x)
Backward: ∂L/∂x = ∂L/∂y
```

### Router Combination
```
output = mlp_out + softmax(router(x))[1] × circuit_delta
```

---

## 14. References

1. McCulloch & Pitts (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity"
2. Muroga (1971). "Threshold Logic and Its Applications"
3. Siegelmann & Sontag (1995). "On the Computational Power of Neural Nets"
4. Bengio et al. (2013). "Estimating or Propagating Gradients Through Stochastic Neurons"
5. Ma et al. (2024). "The Era of 1-bit LLMs" (BitNet b1.58)
6. Hugging Face (2024). "SmolLM2: Small Language Models"

---

## 15. Summary

We embed a proven-correct 8-bit threshold logic computer into SmolLM2's MLP layers. The circuits are frozen; we train only the interface layers that learn call dispatch. This gives the LLM exact arithmetic capability without training it to "do math"—the math is already done.

The approach is:
- **Sound**: Circuits verified with 6,590 exhaustive tests
- **Efficient**: 1.37M trainable params, 5.6K circuit params
- **Exact**: Heaviside at inference means no approximation error
- **Composable**: Add more circuits (multiply, compare, etc.) with the same pattern

The model learns when to call the calculator, not how to calculate.