phanerozoic committed on
Commit
ca9504e
·
verified ·
1 Parent(s): b3392d0

Upload guide.md with huggingface_hub

  1. guide.md +615 -0
guide.md ADDED
@@ -0,0 +1,615 @@
1
+ # Embedding Threshold Logic Circuits into Transformer MLPs
2
+
3
+ ## Technical Implementation Guide
4
+
5
+ ---
6
+
7
+ ## 1. Core Thesis
8
+
9
+ Standard LLMs fail at arithmetic because they are interpolators: they approximate functions over their training distribution rather than compute exact results. A 360M-parameter model trained on internet text has likely seen "127 + 128 = 255" rarely or never, so it may guess a plausible-looking but wrong answer such as "140" based on pattern matching.
10
+
11
+ We solve this by embedding **frozen, proven-correct arithmetic circuits** directly into the transformer's MLP layers. The circuits use threshold logic (weighted sums + step activation), which is structurally compatible with neural network layers. We train only the **interface layers** that learn to:
12
+
13
+ 1. Extract operands from token embeddings
14
+ 2. Route computation through the circuits
15
+ 3. Inject results back into the residual stream
16
+
17
+ The model learns **call dispatch**, not arithmetic. The arithmetic is already solved.
18
+
19
+ ---
20
+
21
+ ## 2. Threshold Logic Fundamentals
22
+
23
+ ### 2.1 Single Threshold Gate
24
+
25
+ A threshold gate computes:
26
+
27
+ ```
28
+ output = 1 if (Σ wᵢxᵢ + b) ≥ 0
29
+          0 otherwise
30
+ ```
31
+
32
+ This is a neuron with Heaviside step activation. With integer weights `w` and bias `b`, it computes a Boolean function of binary inputs.
33
+
34
+ **Example: AND gate**
35
+ ```
36
+ w = [1, 1], b = -2
37
+ AND(0,0) = H(0 + 0 - 2) = H(-2) = 0
38
+ AND(0,1) = H(0 + 1 - 2) = H(-1) = 0
39
+ AND(1,0) = H(1 + 0 - 2) = H(-1) = 0
40
+ AND(1,1) = H(1 + 1 - 2) = H(0) = 1
41
+ ```
42
+
43
+ **Example: OR gate**
44
+ ```
45
+ w = [1, 1], b = -1
46
+ OR(0,0) = H(0 + 0 - 1) = H(-1) = 0
47
+ OR(0,1) = H(0 + 1 - 1) = H(0) = 1
48
+ OR(1,0) = H(1 + 0 - 1) = H(0) = 1
49
+ OR(1,1) = H(1 + 1 - 1) = H(1) = 1
50
+ ```
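
The two truth tables above can be checked mechanically. The following is a minimal sketch in plain Python; `threshold_gate` is an illustrative helper name, not a function taken from the repository.

```python
from itertools import product


def threshold_gate(weights, bias, inputs):
    """Heaviside threshold gate: 1 if the weighted sum plus bias is >= 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


# Weights and biases copied from the AND and OR examples above.
AND_GATE = ([1, 1], -2)
OR_GATE = ([1, 1], -1)

# Evaluate each gate on all four binary input pairs.
and_table = {bits: threshold_gate(*AND_GATE, bits) for bits in product((0, 1), repeat=2)}
or_table = {bits: threshold_gate(*OR_GATE, bits) for bits in product((0, 1), repeat=2)}
```

Because the inputs are binary and the weights are integers, the comparison against zero is exact; no floating-point tolerance is involved.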
51
+
52
+ ### 2.2 Multi-Layer Circuits
53
+
54
+ XOR is not linearly separable—it requires two layers:
55
+
56
+ ```
+ Layer 1:
+   neuron1 (OR):   w=[1,1],   b=-1  → fires if a OR b
+   neuron2 (NAND): w=[-1,-1], b=1   → fires if NOT(a AND b)
+
+ Layer 2:
+   neuron3 (AND):  w=[1,1],   b=-2  → fires if both layer1 outputs are 1
+
+ XOR(a,b) = AND(OR(a,b), NAND(a,b))
+ ```
66
+
67
+ ### 2.3 Full Adder
68
+
69
+ A full adder computes `sum` and `carry_out` from inputs `a`, `b`, `carry_in`:
70
+
71
+ ```
72
+ sum = a XOR b XOR cin
73
+ cout = (a AND b) OR (cin AND (a XOR b))
74
+ ```
75
+
76
+ Implementation uses two half-adders chained:
77
+
78
+ ```
79
+ HA1: (a, b) → (sum1 = a XOR b, carry1 = a AND b)
80
+ HA2: (sum1, cin) → (sum2 = sum1 XOR cin, carry2 = sum1 AND cin)
81
+ cout = carry1 OR carry2
82
+ final_sum = sum2
83
+ ```
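
The half-adder chaining above can be sketched directly from threshold gates. This is a plain-Python illustration using the gate weights from Sections 2.1 and 2.2; the function names are assumptions for this example and do not refer to the frozen circuit tensors.

```python
from itertools import product


def gate(weights, bias, *inputs):
    """Threshold gate with Heaviside activation."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0


def and_(a, b):
    return gate([1, 1], -2, a, b)


def or_(a, b):
    return gate([1, 1], -1, a, b)


def nand(a, b):
    return gate([-1, -1], 1, a, b)


def xor(a, b):
    # Two layers: OR and NAND side by side, then AND of their outputs.
    return and_(or_(a, b), nand(a, b))


def half_adder(a, b):
    return xor(a, b), and_(a, b)


def full_adder(a, b, cin):
    sum1, carry1 = half_adder(a, b)
    sum2, carry2 = half_adder(sum1, cin)
    return sum2, or_(carry1, carry2)


# Exhaustive check against integer addition over all 8 input combinations.
all_ok = all(
    full_adder(a, b, c) == ((a + b + c) % 2, (a + b + c) // 2)
    for a, b, c in product((0, 1), repeat=3)
)
```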
84
+
85
+ Each XOR is 2 layers, each AND/OR is 1 layer. Total depth: ~4 layers per full adder.
86
+
87
+ ### 2.4 8-bit Ripple Carry Adder
88
+
89
+ Chain 8 full adders, propagating carry:
90
+
91
+ ```
92
+ FA0: (a[0], b[0], 0) → (sum[0], c0)
93
+ FA1: (a[1], b[1], c0) → (sum[1], c1)
94
+ FA2: (a[2], b[2], c1) → (sum[2], c2)
95
+ ...
96
+ FA7: (a[7], b[7], c6) → (sum[7], c7)
97
+ ```
98
+
99
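+ Total circuit depth: ~32 threshold layers if the full adders are evaluated strictly one after another (8 FAs × 4 layers each). The carry critical path is shorter, roughly 18 layers, because every `a[i] XOR b[i]` and `a[i] AND b[i]` can be computed in parallel before the carry arrives.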
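
The chain of full adders can be sketched as a simple loop over LSB-first bit lists. This is a plain-Python illustration, not the frozen circuit itself; `add_8bit` here only mirrors the name used later in the guide, and `to_bits`, `from_bits`, and `add_ints` are helper names assumed for this example.

```python
def gate(weights, bias, *inputs):
    """Threshold gate with Heaviside activation."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0


def xor(a, b):
    # AND(OR(a, b), NAND(a, b)), as in Section 2.2.
    return gate([1, 1], -2, gate([1, 1], -1, a, b), gate([-1, -1], 1, a, b))


def full_adder(a, b, cin):
    s1 = xor(a, b)
    carry1 = gate([1, 1], -2, a, b)     # a AND b
    carry2 = gate([1, 1], -2, s1, cin)  # (a XOR b) AND cin
    return xor(s1, cin), gate([1, 1], -1, carry1, carry2)


def add_8bit(a_bits, b_bits):
    """Ripple-carry addition of two LSB-first 8-bit lists; returns (sum_bits, carry_out)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry


def to_bits(n):
    return [(n >> i) & 1 for i in range(8)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def add_ints(a, b):
    """Add two integers in 0..255 through the gate-level adder."""
    bits, carry = add_8bit(to_bits(a), to_bits(b))
    return from_bits(bits), carry
```

The returned carry is the ninth bit of the true sum, so `add_ints(a, b)` should equal `((a + b) % 256, (a + b) // 256)` for every operand pair.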
100
+
101
+ ---
102
+
103
+ ## 3. Circuit Inventory
104
+
105
+ The `neural_computer.safetensors` contains 3,122 tensors / 5,648 parameters implementing:
106
+
107
+ | Category | Circuits | Tensors |
108
+ |----------|----------|---------|
109
+ | Boolean | AND, OR, NOT, NAND, NOR, XOR, XNOR, IMPLIES, BIIMPLIES | ~30 |
110
+ | Arithmetic | Half adder, Full adder, Ripple carry 2/4/8-bit, 8×8 multiplier | ~800 |
111
+ | Comparators | GT, LT, GEQ, LEQ, EQ (8-bit) | ~50 |
112
+ | ALU | 16-operation ALU, opcode decoder, flag computation | ~400 |
113
+ | Control | JMP, JZ, JNZ, JC, JNC, JN, JP, CALL, RET, PUSH, POP | ~200 |
114
+ | Modular | Divisibility by 2-12 | ~600 |
115
+ | Error Detection | Parity, CRC, Hamming, checksum | ~200 |
116
+ | Pattern | Popcount, leading zeros, symmetry | ~150 |
117
+ | Threshold | k-of-n gates, majority, minority | ~100 |
118
+
119
+ All weights are integers. All activations are Heaviside. Verified with 6,590 exhaustive tests.
120
+
121
+ ---
122
+
123
+ ## 4. Transformer Integration Architecture
124
+
125
+ ### 4.1 Target: SmolLM2-360M
126
+
127
+ ```
128
+ Architecture: LlamaForCausalLM
129
+ Hidden dim: 960
130
+ Layers: 32
131
+ Heads: 15
132
+ MLP intermediate: 2560 (≈2.67× hidden)
133
+ Vocab: 49152
134
+ Parameters: 361,821,120
135
+ ```
136
+
137
+ Standard MLP block:
138
+ ```python
+ def forward(self, x):                # x: [batch, seq, 960]
+     gate = self.gate_proj(x)         # [batch, seq, 2560]
+     up = self.up_proj(x)             # [batch, seq, 2560]
+     hidden = F.silu(gate) * up       # SwiGLU activation (F is torch.nn.functional)
+     return self.down_proj(hidden)    # [batch, seq, 960]
+ ```
145
+
146
+ ### 4.2 Augmented MLP Block
147
+
148
+ ```python
+ def forward(self, x):  # x: [batch, seq, 960]
+     # Original MLP path (unchanged); F is torch.nn.functional
+     mlp_out = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
+
+     # Circuit path (new)
+     a_bits, b_bits = self.bit_extractor(x)  # [batch, seq, 8] each
+     result_bits, carry = self.circuits.add_8bit(a_bits, b_bits)
+     flags = self.compute_flags(result_bits, carry)
+     circuit_delta = self.bit_injector(result_bits, flags)  # [batch, seq, 960]
+
+     # Routing
+     route_weights = self.router(x)  # [batch, seq, 2], softmax over [mlp, circuit]
+
+     # Combine: the MLP output is always kept; only the circuit delta is gated
+     return mlp_out + route_weights[..., 1:2] * circuit_delta
+ ```
165
+
166
+ ### 4.3 Layer Selection
167
+
168
+ We augment the **middle third** of layers (10-20 of 32):
169
+
170
+ - Early layers (0-9): Token/position encoding, not arithmetic-relevant
171
+ - Middle layers (10-20): Abstract reasoning, computation
172
+ - Late layers (21-31): Output formatting, vocabulary projection
173
+
174
+ Rationale (a working hypothesis): arithmetic computation is assumed to happen in the middle layers, where the model processes relationships between tokens. Early layers have not yet built sufficient representations; late layers are largely committed to output tokens.
175
+
176
+ ---
177
+
178
+ ## 5. Interface Layers (Trainable)
179
+
180
+ ### 5.1 BitExtractor
181
+
182
+ Maps token embedding → two 8-bit operands.
183
+
184
+ ```python
+ import torch.nn as nn
+
+ class BitExtractor(nn.Module):
+     def __init__(self, d_model=960):
+         super().__init__()  # required before registering submodules
+         self.proj = nn.Linear(d_model, 16)  # 960 → 16
+
+     def forward(self, x):
+         logits = self.proj(x)              # [batch, seq, 16]
+         bits = HeavisideSTE.apply(logits)  # binarize with STE (Section 6.2)
+         a_bits = bits[..., :8]             # first operand
+         b_bits = bits[..., 8:]             # second operand
+         return a_bits, b_bits              # both [batch, seq, 8], LSB first
+ ```
196
+
197
+ **What it learns**: Which embedding dimensions encode numeric magnitude. For token "127", it must learn that certain activation patterns correspond to bits `[1,1,1,1,1,1,1,0]`.
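
The LSB-first convention is easy to get backwards, so a small helper for producing the bit patterns the extractor is expected to emit may be useful when checking its outputs. This is a sketch in plain Python; `int_to_bits` and `bits_to_int` are names assumed for this example.

```python
def int_to_bits(value, width=8):
    """Return the LSB-first bit list of a non-negative integer."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return [(value >> i) & 1 for i in range(width)]


def bits_to_int(bits):
    """Inverse of int_to_bits for an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


target_127 = int_to_bits(127)  # bits 0..6 set, bit 7 clear
target_128 = int_to_bits(128)  # only bit 7 set
```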
198
+
199
+ **Parameters**: 960 × 16 + 16 = 15,376
200
+
201
+ ### 5.2 BitInjector
202
+
203
+ Maps circuit outputs → embedding delta.
204
+
205
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class BitInjector(nn.Module):
+     def __init__(self, d_model=960):
+         super().__init__()  # required before registering submodules
+         self.proj = nn.Linear(16, d_model)  # 16 → 960
+         self.scale = nn.Parameter(torch.tensor(0.1))
+
+     def forward(self, result_bits, flags):
+         # flags must be 8 wide so that the concatenation has 16 features
+         combined = torch.cat([result_bits, flags], dim=-1)  # [batch, seq, 16]
+         return self.proj(combined) * self.scale             # [batch, seq, 960]
+ ```
215
+
216
+ **What it learns**: How to inject the result bits back into embedding space such that subsequent layers (and the final vocabulary projection) produce the correct output tokens.
217
+
218
+ **Parameters**: 16 × 960 + 960 + 1 = 16,321
219
+
220
+ ### 5.3 Router
221
+
222
+ Decides when to use circuit path.
223
+
224
+ ```python
+ class Router(nn.Module):
+     def __init__(self, d_model=960):
+         super().__init__()  # required before registering submodules
+         self.net = nn.Sequential(
+             nn.Linear(d_model, 64),
+             nn.ReLU(),
+             nn.Linear(64, 2),
+             nn.Softmax(dim=-1),
+         )
+
+     def forward(self, x):
+         return self.net(x)  # [batch, seq, 2]: [mlp_weight, circuit_weight]
+ ```
237
+
238
+ **What it learns**: "This position contains arithmetic" → route through circuits. "This is prose" → use normal MLP.
239
+
240
+ **Parameters**: 960 × 64 + 64 + 64 × 2 + 2 = 61,634
241
+
242
+ ### 5.4 Total Trainable Parameters
243
+
244
+ Per augmented layer:
245
+ ```
+ BitExtractor:   15,376
+ BitInjector:    16,321
+ Router:         61,634
+ OpSelector:    ~31,000
+ ───────────────────────
+ Total:        ~124,331 per layer
+ ```
253
+
254
+ For 11 augmented layers: **~1.37M trainable parameters**
255
+
256
+ This is 0.38% of the model. The other 99.62% (including all circuit weights) is frozen.
257
+
258
+ ---
259
+
260
+ ## 6. Gradient Flow Through Heaviside
261
+
262
+ ### 6.1 The Problem
263
+
264
+ Heaviside has zero gradient almost everywhere:
265
+
266
+ ```
267
+ H(x) = 1 if x ≥ 0 else 0
268
+ dH/dx = 0 for x ≠ 0, undefined at x = 0
269
+ ```
270
+
271
+ Standard backprop would give zero gradients to BitExtractor.
272
+
273
+ ### 6.2 Straight-Through Estimator (STE)
274
+
275
+ We use STE: forward pass uses true Heaviside, backward pass pretends it's identity.
276
+
277
+ ```python
+ import torch
+
+ class HeavisideSTE(torch.autograd.Function):
+     @staticmethod
+     def forward(ctx, x):
+         return (x >= 0).float()  # true step function
+
+     @staticmethod
+     def backward(ctx, grad_output):
+         return grad_output  # pass gradient through unchanged
+ ```
287
+
288
+ **Intuition**: "If making the input larger would have helped the output, increase the input." The gradient tells us the direction even though the function is flat.
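
To make the intuition concrete without any framework, the same rule can be written out by hand for a single neuron. This is a toy sketch, not the training code of this project: the data, learning rate, and the name `train_ste` are all assumptions chosen for illustration.

```python
def heaviside(z):
    return 1.0 if z >= 0 else 0.0


def train_ste(xs, ts, lr=0.125, epochs=10):
    """Fit y = H(w*x + b) with squared loss, using the STE for dy/dz."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            y = heaviside(w * x + b)  # forward: true step function
            grad_y = 2.0 * (y - t)    # dL/dy for L = (y - t)^2
            grad_z = grad_y           # STE: pretend dy/dz = 1 (the true value is 0)
            w -= lr * grad_z * x      # dz/dw = x
            b -= lr * grad_z          # dz/db = 1
    return w, b


# Toy task: output 1 for inputs above roughly 0.25, else 0.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ts = [0.0, 0.0, 0.0, 1.0, 1.0]
w, b = train_ste(xs, ts)
predictions = [heaviside(w * x + b) for x in xs]
```

With the exact derivative of the step function, `grad_z` would be zero at every sample and `w` and `b` would never move. With the STE, the update reduces to a perceptron-style correction that is nonzero only on misclassified samples.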
289
+
290
+ ### 6.3 Alternative: Sigmoid Annealing
291
+
292
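+ During training, use a sigmoid whose sharpness increases over time (the `temperature` argument below multiplies the input, so larger values give a steeper curve):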
293
+
294
+ ```python
+ import torch
+
+ def soft_heaviside(x, temperature):
+     return torch.sigmoid(x * temperature)
+
+ # temperature: 1 → 10 → 100 over training
+ # At high temperature, sigmoid ≈ step function
+ ```
301
+
302
+ This provides smoother gradients early in training, then sharpens to true binary at inference.
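
How quickly the sigmoid approaches the step function can be checked numerically. The sketch below uses only the standard library; the multiplier (called `temperature` in the snippet above) is named `sharpness` here, and the sample points are arbitrary choices for illustration.

```python
import math


def soft_heaviside(x, sharpness):
    """Sigmoid of sharpness * x, written to avoid overflow for large |sharpness * x|."""
    z = x * sharpness
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def max_gap_to_step(sharpness, points=(-1.0, -0.1, 0.1, 1.0)):
    """Largest |sigmoid - step| over a few sample inputs away from zero."""
    return max(
        abs(soft_heaviside(x, sharpness) - (1.0 if x >= 0 else 0.0))
        for x in points
    )


# Gap to the true step function at each stage of the 1 → 10 → 100 schedule.
gaps = [max_gap_to_step(k) for k in (1, 10, 100)]
```

One caveat: at exactly `x = 0` the sigmoid returns 0.5 for any multiplier, while the Heaviside convention used in this guide gives 1, so the two never agree at the threshold itself.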
303
+
304
+ ---
305
+
306
+ ## 7. Training Strategy
307
+
308
+ ### 7.1 Data Generation
309
+
310
+ Sample arithmetic problems uniformly from the full 8-bit range:
311
+
312
+ ```python
+ import torch
+
+ def generate_batch(batch_size):
+     a = torch.randint(0, 256, (batch_size,))
+     b = torch.randint(0, 256, (batch_size,))
+     result = (a + b) % 256
+
+     prompts = [f"{a[i]} + {b[i]} =" for i in range(batch_size)]
+     targets = [f" {result[i]}" for i in range(batch_size)]
+
+     return prompts, targets
+ ```
323
+
324
+ For 8-bit addition, there are 256 × 256 = 65,536 unique problems. We can cover the entire space.
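
Since the problem space is this small, it can also be enumerated outright instead of sampled. The following sketch uses only the standard library; `all_addition_problems` and `epoch_batches` are hypothetical helper names, and the prompt and target formats are copied from `generate_batch` above.

```python
import random
from itertools import product


def all_addition_problems():
    """Every (prompt, target) pair for 8-bit addition with wraparound."""
    return [
        (f"{a} + {b} =", f" {(a + b) % 256}")
        for a, b in product(range(256), repeat=2)
    ]


def epoch_batches(problems, batch_size, seed=0):
    """Yield shuffled batches that together visit each problem exactly once."""
    order = list(problems)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


problems = all_addition_problems()
first_batch = next(epoch_batches(problems, batch_size=32))
```

Enumerating the space also makes it straightforward to hold out a fixed subset of operand pairs for evaluation, which random sampling with replacement does not guarantee.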
325
+
326
+ ### 7.2 Loss Function
327
+
328
+ Standard cross-entropy on next-token prediction:
329
+
330
+ ```python
331
+ outputs = model(input_ids, attention_mask=mask, labels=labels)
332
+ loss = outputs.loss # CE loss, only on target tokens
333
+ ```
334
+
335
+ Labels are masked for prompt tokens (`-100`), so loss only backprops through the answer.
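
The masking step can be sketched on plain lists of token IDs. The IDs below are made up for illustration and `build_example` is a hypothetical helper; the value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def build_example(prompt_ids, target_ids):
    """Concatenate prompt and answer token IDs; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Made-up token IDs: four prompt tokens followed by one answer token.
input_ids, labels = build_example([101, 102, 103, 104], [205])
```

The labels are kept aligned with `input_ids` position by position, on the assumption that the model's loss computation shifts them by one internally, as Hugging Face causal language model classes do.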
336
+
337
+ ### 7.3 Optimizer Configuration
338
+
339
+ ```python
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import CosineAnnealingLR
+
+ # Only train interface layers
+ interface_params = [p for n, p in model.named_parameters()
+                     if any(x in n for x in ['bit_extractor', 'bit_injector', 'router'])]
+
+ optimizer = AdamW(interface_params, lr=1e-4, weight_decay=0.01)
+ scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
+ ```
347
+
348
+ ### 7.4 Curriculum Learning
349
+
350
+ Start simple, increase difficulty:
351
+
352
+ ```
353
+ Phase 1 (epochs 1-2): Single-digit addition (0-9 + 0-9)
354
+ Phase 2 (epochs 3-4): Two-digit addition (0-99 + 0-99)
355
+ Phase 3 (epochs 5-7): Full 8-bit addition (0-255 + 0-255)
356
+ Phase 4 (epochs 8-10): Adversarial cases (long carry chains and overflow: 127+1, 255+1)
357
+ ```
358
+
359
+ This helps the interface layers learn the basic extraction pattern before tackling hard cases.
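
One way to wire the schedule into data generation is a sampler keyed on the epoch number. This is a sketch under assumptions: `sample_problem` is a hypothetical name, and the list of hard cases for the final phase is illustrative rather than taken from the training script.

```python
import random

# (last epoch of the phase, largest operand), following the schedule above.
PHASES = [(2, 9), (4, 99), (7, 255)]
# Illustrative hard cases for the final phase; the exact list is an assumption.
HARD_CASES = [(255, 1), (128, 128), (255, 255)]


def sample_problem(epoch, rng=random):
    """Return (a, b) operands appropriate for a 1-indexed epoch."""
    for last_epoch, max_operand in PHASES:
        if epoch <= last_epoch:
            return rng.randint(0, max_operand), rng.randint(0, max_operand)
    return rng.choice(HARD_CASES)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled curriculum reproducible across runs.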
360
+
361
+ ### 7.5 Training Hyperparameters
362
+
363
+ ```
364
+ Model: SmolLM2-360M
365
+ Augmented: Layers 10-20 (11 layers)
366
+ Trainable: 1.37M parameters
367
+ Frozen: 362M parameters (including 5.6K circuit params)
368
+
369
+ Batch size: 32
370
+ Learning rate: 1e-4
371
+ Epochs: 10
372
+ Samples: 10,000 per epoch
373
+ Warmup: 500 steps
374
+ Device: RTX 6000 Ada (48GB)
375
+
376
+ Expected time: ~30 minutes total
377
+ ```
378
+
379
+ ---
380
+
381
+ ## 8. Forward Pass Walkthrough
382
+
383
+ Input: `"127 + 128 ="`
384
+
385
+ ### 8.1 Tokenization
386
+
387
+ ```
388
+ Tokens: ["127", " +", " 128", " ="]
389
+ IDs: [12700, 489, 13824, 284] # hypothetical
390
+ ```
391
+
392
+ ### 8.2 Embedding
393
+
394
+ ```
395
+ embeddings = embed(input_ids) # [1, 4, 960]
396
+ ```
397
+
398
+ ### 8.3 Layers 0-9 (Unchanged)
399
+
400
+ Standard attention + MLP, building representations.
401
+
402
+ ### 8.4 Layer 10 (Augmented)
403
+
404
+ ```python
405
+ # After attention
406
+ x = layer_norm(attn_output + residual) # [1, 4, 960]
407
+
408
+ # MLP path
409
+ mlp_out = down_proj(silu(gate_proj(x)) * up_proj(x))
410
+
411
+ # Circuit path
412
+ a_bits, b_bits = bit_extractor(x)
413
+ # Position 0 ("127"): a_bits ≈ [1,1,1,1,1,1,1,0] if well-trained
414
+ # Position 2 ("128"): b_bits ≈ [0,0,0,0,0,0,0,1]
415
+ # (In practice, extraction happens per-position; aggregation is learned)
416
+
417
+ result_bits, carry = circuits.add_8bit(a_bits, b_bits)
418
+ # result_bits = [1,1,1,1,1,1,1,1] = 255
419
+
420
+ flags = compute_flags(result_bits, carry)
421
+ # zero=0, negative=1, carry=0 (255 fits in 8 bits, so there is no carry out)
422
+
423
+ circuit_delta = bit_injector(result_bits, flags) # [1, 4, 960]
424
+
425
+ # Routing
426
+ route = router(x) # [1, 4, 2]
427
+ # Position 3 ("="): route ≈ [0.1, 0.9] → use circuits
428
+ # Position 1 ("+"): route ≈ [0.8, 0.2] → mostly MLP
429
+
430
+ # Combine
431
+ output = mlp_out + route[..., 1:2] * circuit_delta
432
+ ```
433
+
434
+ ### 8.5 Layers 11-31
435
+
436
+ Continue processing, eventually projecting to vocabulary.
437
+
438
+ ### 8.6 Output
439
+
440
+ ```
441
+ logits = lm_head(final_hidden) # [1, 4, 49152]
442
+ next_token = argmax(logits[0, 3, :]) # token after "="
443
+ # Should decode to "255" (possibly as " 255" or "255")
444
+ ```
445
+
446
+ ---
447
+
448
+ ## 9. Inference Characteristics
449
+
450
+ ### 9.1 Exactness
451
+
452
+ At inference, Heaviside is a true step function, so the circuit introduces no approximation. If BitExtractor correctly maps "127" → bits and "128" → bits, the circuit **will** output 255. The remaining failure modes lie in the trained interface: incorrect extraction, routing, or injection.
453
+
454
+ ### 9.2 Latency
455
+
456
+ Circuit computation is estimated to add ~5-10% overhead:
457
+ - BitExtractor: 1 linear layer (960→16)
458
+ - Circuits: ~32 threshold layers, but sparse and tiny
459
+ - BitInjector: 1 linear layer (16→960)
460
+ - Router: 2 linear layers
461
+
462
+ The circuits have only 5,648 parameters total—negligible versus the 361M in the base model.
463
+
464
+ ### 9.3 Generalization
465
+
466
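+ If the interface learns a correct mapping for every operand, the circuit path is exact on **all** 65,536 8-bit additions. The circuits compute rather than memorize, so any remaining errors come from the interface, not from the arithmetic.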
467
+
468
+ ---
469
+
470
+ ## 10. Evaluation Metrics
471
+
472
+ ### 10.1 Arithmetic Accuracy
473
+
474
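+ ```python
+ import random
+ import re
+
+ def eval_accuracy(generate_fn, n_problems=1000):
+     """generate_fn (assumed wrapper) maps a prompt string to the model's text completion."""
+     correct = 0
+     for _ in range(n_problems):
+         a, b = random.randrange(256), random.randrange(256)  # random 8-bit values
+         expected = (a + b) % 256
+         predicted = generate_fn(f"{a} + {b} =")
+         match = re.search(r"\d+", predicted)  # first integer in the completion
+         if match and int(match.group()) == expected:
+             correct += 1
+     return correct / n_problems
+ ```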
485
+
486
+ **Baseline SmolLM2**: ~5-10% (guessing based on patterns)
487
+ **Target**: >95% (circuit-accurate)
488
+
489
+ ### 10.2 Edge Case Performance
490
+
491
+ Specifically test:
492
+ - Carry propagation and overflow: 127+1, 255+1, 128+128
493
+ - Zeros: 0+0, 0+255
494
+ - Identity: x+0 for various x
495
+ - Commutativity: verify a+b == b+a
496
+
497
+ ### 10.3 Non-Arithmetic Preservation
498
+
499
+ Verify general capability isn't degraded:
500
+ - Perplexity on held-out text
501
+ - Common benchmarks (HellaSwag, etc.)
502
+
503
+ The augmentation should be **additive**—circuits help arithmetic, MLP handles everything else via routing.
504
+
505
+ ---
506
+
507
+ ## 11. Extension Roadmap
508
+
509
+ ### 11.1 Additional Operations
510
+
511
+ The circuit inventory includes:
512
+ - Subtraction (via two's complement)
513
+ - Multiplication (8×8 → 16-bit)
514
+ - Division (iterative subtraction)
515
+ - Bitwise ops (AND, OR, XOR, shifts)
516
+ - Comparisons (GT, LT, EQ)
517
+
518
+ Each needs its own extraction/injection interface, or a unified interface with operation selection.
519
+
520
+ ### 11.2 Multi-Operand Expressions
521
+
522
+ For "15 + 27 + 33 =", need:
523
+ - Operand count detection
524
+ - Sequential circuit invocation
525
+ - Accumulator pattern
526
+
527
+ ### 11.3 Larger Bit Widths
528
+
529
+ 16-bit and 32-bit arithmetic require:
530
+ - Larger circuits (or chained 8-bit)
531
+ - Wider BitExtractor (32 or 64 output dims)
532
+ - More training data
533
+
534
+ ### 11.4 Symbolic Integration
535
+
536
+ Ultimate goal: the model recognizes when it needs to compute, invokes circuits, and integrates results into coherent natural language output.
537
+
538
+ ```
539
+ User: "If I have 127 apples and buy 128 more, how many do I have?"
540
+ Model: [extracts 127, 128] [routes to circuit] [gets 255]
541
+ "You would have 255 apples."
542
+ ```
543
+
544
+ ---
545
+
546
+ ## 12. File Structure
547
+
548
+ ```
549
+ 8bit-threshold-computer/
550
+ ├── neural_computer.safetensors # Frozen circuits (3,122 tensors)
551
+ ├── circuit_llm.py # Integration architecture
552
+ ├── train_circuit_interface.py # Training loop
553
+ ├── iron_eval.py # Circuit verification (6,590 tests)
554
+ ├── skeptic_test.py # Algebraic identity tests (127 tests)
555
+ ├── prune_weights.py # Weight optimization
556
+ ├── tensors.txt # Tensor manifest
557
+ ├── guide.md # This document
558
+ └── README.md # Project overview
559
+ ```
560
+
561
+ ---
562
+
563
+ ## 13. Key Equations
564
+
565
+ ### Heaviside Step
566
+ ```
567
+ H(x) = 1 if x ≥ 0 else 0
568
+ ```
569
+
570
+ ### Threshold Gate
571
+ ```
572
+ f(x₁,...,xₙ) = H(Σᵢ wᵢxᵢ + b)
573
+ ```
574
+
575
+ ### Full Adder
576
+ ```
577
+ sum = a ⊕ b ⊕ cᵢₙ
578
+ cₒᵤₜ = (a ∧ b) ∨ (cᵢₙ ∧ (a ⊕ b))
579
+ ```
580
+
581
+ ### STE Gradient
582
+ ```
583
+ Forward: y = H(x)
584
+ Backward: ∂L/∂x = ∂L/∂y
585
+ ```
586
+
587
+ ### Router Combination
588
+ ```
589
+ output = mlp_out + router(x)[..., 1] × circuit_delta    (router(x) is already softmax-normalized)
590
+ ```
591
+
592
+ ---
593
+
594
+ ## 14. References
595
+
596
+ 1. McCulloch & Pitts (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity"
597
+ 2. Muroga (1971). "Threshold Logic and Its Applications"
598
+ 3. Siegelmann & Sontag (1995). "On the Computational Power of Neural Nets"
599
+ 4. Bengio et al. (2013). "Estimating or Propagating Gradients Through Stochastic Neurons"
600
+ 5. Ma et al. (2024). "The Era of 1-bit LLMs" (BitNet b1.58)
601
+ 6. HuggingFace (2024). "SmolLM2: Small Language Models"
602
+
603
+ ---
604
+
605
+ ## 15. Summary
606
+
607
+ We embed a proven-correct 8-bit threshold logic computer into SmolLM2's MLP layers. The circuits are frozen; we train only the interface layers that learn call dispatch. This gives the LLM exact arithmetic capability without training it to "do math"—the math is already done.
608
+
609
+ The approach is:
610
+ - **Sound**: Circuits verified with 6,590 tests
611
+ - **Efficient**: 1.37M trainable params, 5.6K circuit params
612
+ - **Exact**: Heaviside at inference means no approximation error
613
+ - **Composable**: Add more circuits (multiply, compare, etc.) with same pattern
614
+
615
+ The model learns when to call the calculator, not how to calculate.