Commit 1f44a34 · CharlesCNorton committed

Parent(s): 8a1465b

Add 32-bit ALU support with 1KB memory profile
- build.py: Add --bits {8,16,32} flag for N-bit circuit generation
- build.py: Add 'small' memory profile (1KB, 10-bit addresses)
- build.py: Add 32-bit generators for adder, subtractor, comparators,
  multiplier, divider, bitwise ops, shifts, inc/dec, neg
- eval.py: Add 32-bit test data and comparator testing
- README.md: Document 32-bit support, pivot to from-scratch extractor
32-bit adder verified: 1M + 2M = 3M, 0xDEAD0000 + 0xBEEF = 0xDEADBEEF
TODO:
- Add missing 32-bit eval tests (sub, mul, div, bitwise, shifts)
- Fix 32-bit comparator precision (float32 mantissa overflow on 2^31 weights)
Planned fix: cascaded byte-wise comparison
- README.md +109 -76
- build.py +266 -9
- eval.py +140 -0
- neural_alu32.safetensors +3 -0
README.md
CHANGED
@@ -12,30 +12,31 @@ tags:

# 8bit-threshold-computer

+**A Turing-complete CPU implemented entirely as threshold logic gates, with 8-bit and 32-bit ALU support.**

Every logic gate is a threshold neuron: `output = 1 if (Σ wᵢxᵢ + b) ≥ 0 else 0`

```
+8-bit CPU: 8,290,134 params (full) / 32,397 params (pure ALU)
+32-bit ALU: 202,869 params (1KB scratch memory)
```

---

## What Is This?

+A complete processor where every operation—from Boolean logic to arithmetic to control flow—is implemented using only weighted sums and step functions. No traditional gates.

+| Component | 8-bit CPU | 32-bit ALU |
+|-----------|-----------|------------|
+| Registers | 4 × 8-bit | N/A (pure computation) |
+| Memory | 0B–64KB configurable | 1KB scratch |
+| ALU | 16 ops @ 8-bit | ADD, SUB, MUL, DIV, CMP, bitwise, shifts |
+| Precision | 0–255 | 0–4,294,967,295 |
+| Flags | Z, N, C, V | Carry/overflow |
+| Control | Full ISA | Stateless |

+**Turing complete.** The 8-bit CPU is verified with loops, conditionals, recursion, and self-modification. The 32-bit ALU extends arithmetic to practical ranges (0–4B) where 8-bit (0–255) is insufficient.

---

@@ -583,27 +584,29 @@ Head-to-head on 50 random cases: SmolLM2 got 7/50 (14%), circuits got 50/50 (100%)

**Stage 3: LLM Integration — IN PROGRESS**

+The challenge: train an interface that extracts operands and operations from natural language (not from pre-formatted bit inputs).

```
"47 + 86"
    ↓
+[Language Model / Extractor]
    ↓
+[a_bits, b_bits, op_logits]
    ↓
[Frozen threshold circuits]
    ↓
[Result bits] → 133
```

+**SmolLM2 Approach** (`llm_integration/`):
+
+Initial experiments used SmolLM2-360M-Instruct as the language understanding backbone.

| Mode | Description | Status |
|------|-------------|--------|
| `--mode router` | Train OpRouter with ground truth bits | 100% achieved |
| `--mode interface` | Train BitEncoder + OpRouter | Ready |
+| `--mode llm` | Train from LLM hidden states | Explored |

**LLM Mode Options**:
- `--unfreeze_layers N`: Fine-tune top N transformer layers

@@ -615,95 +618,125 @@ Extractor (must LEARN: hidden → a_bits, b_bits, op_logits)

- `Extractor`: Attention pooling + per-bit MLPs
- `PositionExtractor`: Position-aware (operand A from positions 0-2, B from 5-7)
- `DigitExtractor`: Predicts 3 digits per operand, converts to bits
+- `HybridExtractor`: Digit lookup + MLP fallback for word inputs

**Curriculum Learning**: Training progresses 0-9 → 0-99 → 0-255 over epochs.

+**Observations**: SmolLM2 integration proved challenging—360M parameters of pre-trained representations largely irrelevant to arithmetic parsing, high VRAM requirements, and gradient conflicts between frozen circuits and pre-trained weights.
+
+**Pivot: From-Scratch Extractor**
+
+Given that the task is fundamentally simple—parse `(a, b, op)` from structured text—a lightweight purpose-built model may be more appropriate than adapting a general LLM.
+
+```
+"one thousand plus two thousand"
+    ↓
+[Char-level tokenizer: ~40 tokens]
+    ↓
+[Small transformer: ~1-5M params]
+    ↓
+[3 heads: a_value, b_value, op_idx]
+    ↓
+[Frozen 32-bit threshold circuits]
+    ↓
+3000
+```
+
+**Design principles**:
+- **Minimal Python**: All parsing logic learned in weights, not hardcoded
+- **Character-level input**: No word tokenization; model learns "forty seven" = 47
+- **From-scratch training**: No pre-trained weights to conflict with
+- **32-bit target**: Practical arithmetic range (0–4,294,967,295)
+
+**Planned architecture**:
+- Vocab: ~40 chars (a-z, 0-9, space, operators)
+- Embedding: 40 × 128d
+- Encoder: 2-3 transformer layers
+- Output heads: `a_classifier`, `b_classifier`, `op_classifier`
+- Total: ~1-5M params (vs 360M for SmolLM2)
+
+This approach treats the problem as what it is: a structured parsing task where the frozen circuits handle all computation. The extractor need only learn the mapping from text to operands—no world knowledge required.
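The planned character-level front end above can be sketched in a few lines. This is a minimal illustration only: `CHAR_VOCAB` and `encode` are hypothetical names, not repo code, assuming the ~40-character vocabulary described in the planned architecture (a-z, 0-9, space, operators).

```python
# Hypothetical sketch of the planned char-level tokenizer (not repo code).
# 26 letters + 10 digits + space + 3 operator symbols = 40-entry vocabulary.
CHAR_VOCAB = list("abcdefghijklmnopqrstuvwxyz0123456789") + [" ", "+", "-", "*"]
CHAR_TO_ID = {ch: i for i, ch in enumerate(CHAR_VOCAB)}


def encode(text: str) -> list:
    """One token per character; unknown characters are simply skipped."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


tokens = encode("one thousand plus two thousand")  # 30 character tokens
```

The model downstream of this would only ever see character indices, so all number-word parsing ("one thousand" → 1000) has to be learned in the transformer weights, matching the "Minimal Python" design principle.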
#### Proof of Concept Scope

+- **32-bit operands** (0–4,294,967,295)
- **Six operations**: ADD, SUB, MUL, GT, LT, EQ
+- **Structured input**: Digits ("1000 + 2000") and number words ("one thousand plus two thousand")

**Current Status**:
+- Circuit validation: Complete (100% on 8-bit operations)
+- 32-bit circuits: Built and tested (adder verified on 1M + 2M = 3M, etc.)
+- LLM baseline: Measured (11.90% - establishes control condition)
+- SmolLM2 integration: Infrastructure complete, training explored
+- From-scratch extractor: Design phase

### Extension Roadmap

+#### Completed

+1. **32-bit operations (0–4,294,967,295)** — Full 32-bit ALU implemented via `--bits 32` flag:
+   - 32-bit ripple carry adder (32 chained full adders) — **verified**
+   - 32-bit subtractor (NOT + adder with carry-in)
+   - 32-bit multiplication (1024 partial product ANDs)
+   - 32-bit division (32 restoring stages)
+   - 32-bit comparators (GT, LT, GE, LE, EQ)
+   - 32-bit bitwise ops (AND, OR, XOR, NOT)
+   - 32-bit shifts (SHL, SHR), INC, DEC, NEG

+   **Known issue**: Single-layer 32-bit comparators use weights up to 2³¹, which exceeds float32 mantissa precision (24 bits). Comparisons between large numbers differing only in low bits may fail. Fix planned: cascaded byte-wise comparison (compare MSB first, if equal compare next byte, etc.).

+2. **3-operand addition (15 + 27 + 33 = 75)** — `arithmetic.add3_8bit` chains two 8-bit ripple carry stages. 16 full adders, 144 gates, 240 test cases verified.

+3. **Order of operations (5 + 3 × 2 = 11)** — `arithmetic.expr_add_mul` computes A + (B × C) using shift-add multiplication then addition. 64 AND gates + 64 full adders, 73 test cases verified.

+#### Planned

+1. **Cascaded 32-bit comparators** — Replace single-layer weighted comparison with a multi-layer byte-wise cascade. Each byte comparison uses 8-bit weights (max 128), well within float32 precision. Hardware-accurate and extensible to 64-bit, 128-bit, etc.

+2. **Parenthetical expressions ((5 + 3) × 2 = 16)** — Explicit grouping overrides precedence. The parser must recognize parens and build the correct tree; evaluation proceeds innermost-out.

+3. **Multi-operation chains (a + b - c × d)** — Sequential dispatch through multiple circuits with intermediate result routing. Requires state management in the interface layers.

+4. **Floating point arithmetic** — IEEE 754-style with separate circuits for mantissa and exponent. ADD: align exponents, add mantissas, renormalize. MUL: add exponents, multiply mantissas.

+5. **Full CPU integration** — Enable memory access circuits for stateful computation. Allows multi-step algorithms executed entirely within threshold logic.

+---

+## Build Tool

```bash
+# 8-bit CPU (default)
+python build.py --apply all                      # Full 64KB memory
+python build.py -m none --apply all              # Pure ALU (32K params)
+python build.py -m scratchpad --apply all        # 256-byte scratch

+# 32-bit ALU
+python build.py --bits 32 -m small --apply all   # 1KB scratch (~203K params)
+python build.py --bits 32 -m none --apply all    # Pure 32-bit ALU

+# Custom configurations
+python build.py --bits 16 --addr-bits 6 --apply all   # 16-bit ALU, 64 bytes memory
```

+**Bit widths** (`--bits`):
+
+| Width | Range | Use Case |
+|-------|-------|----------|
+| 8 | 0–255 | Full CPU, legacy |
+| 16 | 0–65,535 | Extended arithmetic |
+| 32 | 0–4,294,967,295 | Practical arithmetic |
+
+**Memory profiles** (`-m`):

+| Profile | Size | Params | Use Case |
+|---------|------|--------|----------|
+| `none` | 0B | ~32K | Pure ALU |
+| `registers` | 16B | ~34K | Minimal state |
+| `scratchpad` | 256B | ~63K | 8-bit scratch |
+| `small` | 1KB | ~123K | 32-bit scratch |
+| `reduced` | 4KB | ~549K | Small programs |
+| `full` | 64KB | ~8.29M | Full CPU |

---
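The float32 mantissa limit behind the known comparator issue, and the planned cascaded byte-wise fix, can both be illustrated in plain Python. This is an illustrative sketch only: `f32` and `gt_bytewise` are hypothetical helpers, not code from this repo.

```python
import struct


def f32(x: float) -> float:
    # Round-trip through IEEE 754 float32, the dtype the gate weights use.
    return struct.unpack("f", struct.pack("f", float(x)))[0]


# A 24-bit significand cannot distinguish 2^31 from 2^31 + 1, so a single-layer
# comparator whose weighted sum reaches 2^31 can silently lose the low bits:
assert f32(2**31 + 1) == f32(2**31)


def gt_bytewise(a: int, b: int, width: int = 4) -> int:
    """Cascaded byte-wise GT (sketch of the planned fix): compare the most
    significant byte first; the first unequal byte decides the result.
    Per-byte weights never exceed 128, well inside float32 precision."""
    for shift in range((width - 1) * 8, -8, -8):
        byte_a = (a >> shift) & 0xFF
        byte_b = (b >> shift) & 0xFF
        if byte_a != byte_b:
            return 1 if byte_a > byte_b else 0
    return 0  # all bytes equal -> not greater
```

Note that `gt_bytewise(2**31 + 1, 2**31)` correctly returns 1, exactly the case where a single-layer float32 weighted sum can fail.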
build.py
CHANGED
@@ -1987,26 +2231,39 @@ def main() -> None:

Memory Profiles:
  full        64KB (16-bit addr) - Full CPU mode
  reduced     4KB (12-bit addr)  - Reduced CPU
  scratchpad  256B (8-bit addr)  - LLM scratchpad
  registers   16B (4-bit addr)   - LLM register file
  none        0B (no memory)     - Pure ALU for LLM

Examples:
  python build.py memory --memory-profile none --apply # LLM-only (no RAM)
-  python build.py memory --memory-profile
-  python build.py
-  python build.py
"""
    )
    parser.add_argument("--model", type=Path, default=MODEL_PATH, help="Model path")
    parser.add_argument("--apply", action="store_true", help="Apply changes (default: dry-run)")
    parser.add_argument("--manifest", action="store_true", help="Write tensors.txt manifest (memory only)")

    mem_group = parser.add_mutually_exclusive_group()
    mem_group.add_argument(
        "--memory-profile", "-m",
        choices=list(MEMORY_PROFILES.keys()),
-        help="Memory size profile (full/reduced/scratchpad/registers/none)"
    )
    mem_group.add_argument(
        "--addr-bits", "-a",

@@ -2018,7 +2275,7 @@ Examples:

    subparsers = parser.add_subparsers(dest="command", help="Subcommands")
    subparsers.add_parser("memory", help="Generate memory circuits (size controlled by --memory-profile or --addr-bits)")
-    subparsers.add_parser("alu", help="Generate ALU extension circuits (
    subparsers.add_parser("inputs", help="Add .inputs metadata tensors")
    subparsers.add_parser("all", help="Run memory, alu, then inputs")
|
|
|
|
| 121 |
MEMORY_PROFILES = {
|
| 122 |
"full": 16, # 64KB - full CPU mode
|
| 123 |
"reduced": 12, # 4KB - reduced CPU
|
| 124 |
+
"small": 10, # 1KB - 32-bit arithmetic scratch
|
| 125 |
"scratchpad": 8, # 256 bytes - LLM scratchpad
|
| 126 |
"registers": 4, # 16 bytes - LLM register file
|
| 127 |
"none": 0, # Pure ALU, no memory
|
| 128 |
}
|
| 129 |
|
| 130 |
+
SUPPORTED_BITS = [8, 16, 32]
|
| 131 |
+
|
| 132 |
|
| 133 |
def load_tensors(path: Path) -> Dict[str, torch.Tensor]:
|
| 134 |
tensors: Dict[str, torch.Tensor] = {}
|
|
|
|
| 677 |
add_gate(tensors, "arithmetic.equality8bit.layer2", [1.0, 1.0], [-2.0])
|
| 678 |
|
| 679 |
|
| 680 |
+
def add_ripple_carry_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 681 |
+
"""Add N-bit ripple carry adder circuit.
|
| 682 |
+
|
| 683 |
+
Creates a chain of full adders for N-bit addition.
|
| 684 |
+
Works for 8, 16, or 32 bits.
|
| 685 |
+
|
| 686 |
+
Inputs: $a[0..N-1], $b[0..N-1] (MSB-first)
|
| 687 |
+
Outputs: fa0-fa{N-1} sum bits, fa{N-1}.carry_or for overflow
|
| 688 |
+
"""
|
| 689 |
+
prefix = f"arithmetic.ripplecarry{bits}bit"
|
| 690 |
+
for bit in range(bits):
|
| 691 |
+
add_full_adder(tensors, f"{prefix}.fa{bit}")
|
| 692 |
+
|
| 693 |
+
|
| 694 |
+
def add_sub_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 695 |
+
"""Add N-bit subtractor circuit (A - B).
|
| 696 |
+
|
| 697 |
+
Uses two's complement: A - B = A + (~B) + 1
|
| 698 |
+
|
| 699 |
+
Structure:
|
| 700 |
+
- NOT gates for each bit of B
|
| 701 |
+
- N-bit ripple carry adder with carry_in = 1
|
| 702 |
+
|
| 703 |
+
The carry_in=1 is handled by the adder's fa0 having cin=#1 instead of #0.
|
| 704 |
+
"""
|
| 705 |
+
prefix = f"arithmetic.sub{bits}bit"
|
| 706 |
+
|
| 707 |
+
for bit in range(bits):
|
| 708 |
+
add_gate(tensors, f"{prefix}.not_b.bit{bit}", [-1.0], [0.0])
|
| 709 |
+
|
| 710 |
+
for bit in range(bits):
|
| 711 |
+
add_full_adder(tensors, f"{prefix}.fa{bit}")
|
| 712 |
+
|
| 713 |
+
|
| 714 |
+
def add_comparators_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 715 |
+
"""Add N-bit comparator circuits (GT, LT, GE, LE, EQ).
|
| 716 |
+
|
| 717 |
+
Uses weighted sum comparison extended to N bits.
|
| 718 |
+
For N=32: weights are 2^31, 2^30, ..., 2^0 for A, negated for B.
|
| 719 |
+
"""
|
| 720 |
+
pos_weights = [float(1 << (bits - 1 - i)) for i in range(bits)]
|
| 721 |
+
neg_weights = [-w for w in pos_weights]
|
| 722 |
+
|
| 723 |
+
gt_weights = pos_weights + neg_weights
|
| 724 |
+
lt_weights = neg_weights + pos_weights
|
| 725 |
+
|
| 726 |
+
add_gate(tensors, f"arithmetic.greaterthan{bits}bit", gt_weights, [-1.0])
|
| 727 |
+
add_gate(tensors, f"arithmetic.greaterorequal{bits}bit", gt_weights, [0.0])
|
| 728 |
+
add_gate(tensors, f"arithmetic.lessthan{bits}bit", lt_weights, [-1.0])
|
| 729 |
+
add_gate(tensors, f"arithmetic.lessorequal{bits}bit", lt_weights, [0.0])
|
| 730 |
+
|
| 731 |
+
add_gate(tensors, f"arithmetic.equality{bits}bit.layer1.geq", gt_weights, [0.0])
|
| 732 |
+
add_gate(tensors, f"arithmetic.equality{bits}bit.layer1.leq", lt_weights, [0.0])
|
| 733 |
+
add_gate(tensors, f"arithmetic.equality{bits}bit.layer2", [1.0, 1.0], [-2.0])
|
| 734 |
+
|
| 735 |
+
|
| 736 |
+
def add_mul_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 737 |
+
"""Add N-bit multiplication circuit.
|
| 738 |
+
|
| 739 |
+
Produces low N bits of the 2N-bit result.
|
| 740 |
+
|
| 741 |
+
Structure:
|
| 742 |
+
- N*N AND gates for partial products P[i][j] = A[i] AND B[j]
|
| 743 |
+
- Shift-add accumulation using existing adder circuits
|
| 744 |
+
|
| 745 |
+
For 32-bit: 1024 AND gates for partial products.
|
| 746 |
+
"""
|
| 747 |
+
for i in range(bits):
|
| 748 |
+
for j in range(bits):
|
| 749 |
+
add_gate(tensors, f"alu.alu{bits}bit.mul.pp.a{i}b{j}", [1.0, 1.0], [-2.0])
|
| 750 |
+
|
| 751 |
+
|
| 752 |
+
def add_div_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 753 |
+
"""Add N-bit division circuit.
|
| 754 |
+
|
| 755 |
+
Uses restoring division algorithm with N iterations.
|
| 756 |
+
"""
|
| 757 |
+
pos_weights = [float(1 << (bits - 1 - i)) for i in range(bits)]
|
| 758 |
+
neg_weights = [-w for w in pos_weights]
|
| 759 |
+
cmp_weights = pos_weights + neg_weights
|
| 760 |
+
|
| 761 |
+
for stage in range(bits):
|
| 762 |
+
add_gate(tensors, f"alu.alu{bits}bit.div.stage{stage}.cmp", cmp_weights, [0.0])
|
| 763 |
+
|
| 764 |
+
for stage in range(bits):
|
| 765 |
+
for bit in range(bits):
|
| 766 |
+
add_gate(tensors, f"alu.alu{bits}bit.div.stage{stage}.mux.bit{bit}.not_sel", [-1.0], [0.0])
|
| 767 |
+
add_gate(tensors, f"alu.alu{bits}bit.div.stage{stage}.mux.bit{bit}.and_a", [1.0, 1.0], [-2.0])
|
| 768 |
+
add_gate(tensors, f"alu.alu{bits}bit.div.stage{stage}.mux.bit{bit}.and_b", [1.0, 1.0], [-2.0])
|
| 769 |
+
add_gate(tensors, f"alu.alu{bits}bit.div.stage{stage}.mux.bit{bit}.or", [1.0, 1.0], [-1.0])
|
| 770 |
+
|
| 771 |
+
|
| 772 |
+
def add_bitwise_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 773 |
+
"""Add N-bit bitwise operation circuits (AND, OR, XOR, NOT).
|
| 774 |
+
|
| 775 |
+
These are simply N copies of the 1-bit gates.
|
| 776 |
+
"""
|
| 777 |
+
for bit in range(bits):
|
| 778 |
+
add_gate(tensors, f"alu.alu{bits}bit.and.bit{bit}", [1.0, 1.0], [-2.0])
|
| 779 |
+
|
| 780 |
+
for bit in range(bits):
|
| 781 |
+
add_gate(tensors, f"alu.alu{bits}bit.or.bit{bit}", [1.0, 1.0], [-1.0])
|
| 782 |
+
|
| 783 |
+
for bit in range(bits):
|
| 784 |
+
add_gate(tensors, f"alu.alu{bits}bit.xor.bit{bit}.layer1.or", [1.0, 1.0], [-1.0])
|
| 785 |
+
add_gate(tensors, f"alu.alu{bits}bit.xor.bit{bit}.layer1.nand", [-1.0, -1.0], [1.0])
|
| 786 |
+
add_gate(tensors, f"alu.alu{bits}bit.xor.bit{bit}.layer2", [1.0, 1.0], [-2.0])
|
| 787 |
+
|
| 788 |
+
for bit in range(bits):
|
| 789 |
+
add_gate(tensors, f"alu.alu{bits}bit.not.bit{bit}", [-1.0], [0.0])
|
| 790 |
+
|
| 791 |
+
|
| 792 |
+
def add_shift_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 793 |
+
"""Add N-bit shift circuits (SHL, SHR by 1 position).
|
| 794 |
+
|
| 795 |
+
SHL: out[i] = in[i+1] for i<N-1, out[N-1] = 0
|
| 796 |
+
SHR: out[0] = 0, out[i] = in[i-1] for i>0
|
| 797 |
+
"""
|
| 798 |
+
for bit in range(bits):
|
| 799 |
+
if bit < bits - 1:
|
| 800 |
+
add_gate(tensors, f"alu.alu{bits}bit.shl.bit{bit}", [2.0], [-1.0])
|
| 801 |
+
else:
|
| 802 |
+
add_gate(tensors, f"alu.alu{bits}bit.shl.bit{bit}", [0.0], [-1.0])
|
| 803 |
+
|
| 804 |
+
for bit in range(bits):
|
| 805 |
+
if bit > 0:
|
| 806 |
+
add_gate(tensors, f"alu.alu{bits}bit.shr.bit{bit}", [2.0], [-1.0])
|
| 807 |
+
else:
|
| 808 |
+
add_gate(tensors, f"alu.alu{bits}bit.shr.bit{bit}", [0.0], [-1.0])
|
| 809 |
+
|
| 810 |
+
|
| 811 |
+
def add_inc_dec_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 812 |
+
"""Add N-bit INC and DEC circuits."""
|
| 813 |
+
for bit in range(bits):
|
| 814 |
+
add_gate(tensors, f"alu.alu{bits}bit.inc.bit{bit}.xor.layer1.or", [1.0, 1.0], [-1.0])
|
| 815 |
+
add_gate(tensors, f"alu.alu{bits}bit.inc.bit{bit}.xor.layer1.nand", [-1.0, -1.0], [1.0])
|
| 816 |
+
add_gate(tensors, f"alu.alu{bits}bit.inc.bit{bit}.xor.layer2", [1.0, 1.0], [-2.0])
|
| 817 |
+
add_gate(tensors, f"alu.alu{bits}bit.inc.bit{bit}.carry", [1.0, 1.0], [-2.0])
|
| 818 |
+
|
| 819 |
+
for bit in range(bits):
|
| 820 |
+
add_gate(tensors, f"alu.alu{bits}bit.dec.bit{bit}.xor.layer1.or", [1.0, 1.0], [-1.0])
|
| 821 |
+
add_gate(tensors, f"alu.alu{bits}bit.dec.bit{bit}.xor.layer1.nand", [-1.0, -1.0], [1.0])
|
| 822 |
+
add_gate(tensors, f"alu.alu{bits}bit.dec.bit{bit}.xor.layer2", [1.0, 1.0], [-2.0])
|
| 823 |
+
add_gate(tensors, f"alu.alu{bits}bit.dec.bit{bit}.not_a", [-1.0], [0.0])
|
| 824 |
+
add_gate(tensors, f"alu.alu{bits}bit.dec.bit{bit}.borrow", [1.0, 1.0], [-2.0])
|
| 825 |
+
|
| 826 |
+
|
| 827 |
+
def add_neg_nbits(tensors: Dict[str, torch.Tensor], bits: int) -> None:
|
| 828 |
+
"""Add N-bit NEG circuit (two's complement negation)."""
|
| 829 |
+
for bit in range(bits):
|
| 830 |
+
add_gate(tensors, f"alu.alu{bits}bit.neg.not.bit{bit}", [-1.0], [0.0])
|
| 831 |
+
add_gate(tensors, f"alu.alu{bits}bit.neg.inc.bit{bit}.xor.layer1.or", [1.0, 1.0], [-1.0])
|
| 832 |
+
add_gate(tensors, f"alu.alu{bits}bit.neg.inc.bit{bit}.xor.layer1.nand", [-1.0, -1.0], [1.0])
|
| 833 |
+
add_gate(tensors, f"alu.alu{bits}bit.neg.inc.bit{bit}.xor.layer2", [1.0, 1.0], [-2.0])
|
| 834 |
+
add_gate(tensors, f"alu.alu{bits}bit.neg.inc.bit{bit}.carry", [1.0, 1.0], [-2.0])
|
| 835 |
+
|
| 836 |
+
|
| 837 |
def update_manifest(tensors: Dict[str, torch.Tensor], addr_bits: int, mem_bytes: int) -> None:
|
| 838 |
tensors["manifest.memory_bytes"] = torch.tensor([float(mem_bytes)], dtype=torch.float32)
|
| 839 |
tensors["manifest.pc_width"] = torch.tensor([float(addr_bits)], dtype=torch.float32)
|
|
|
|
| 2023 |
|
| 2024 |
|
| 2025 |
def cmd_alu(args) -> None:
|
| 2026 |
+
bits = getattr(args, 'bits', 8) or 8
|
| 2027 |
print("=" * 60)
|
| 2028 |
+
print(f" BUILD ALU CIRCUITS ({bits}-bit)")
|
| 2029 |
print("=" * 60)
|
| 2030 |
print(f"\nLoading: {args.model}")
|
| 2031 |
tensors = load_tensors(args.model)
|
| 2032 |
print(f" Loaded {len(tensors)} tensors")
|
| 2033 |
+
|
| 2034 |
+
drop_list = [
|
| 2035 |
"alu.alu8bit.shl.", "alu.alu8bit.shr.",
|
| 2036 |
"alu.alu8bit.mul.", "alu.alu8bit.div.",
|
| 2037 |
"alu.alu8bit.inc.", "alu.alu8bit.dec.",
|
|
|
|
| 2041 |
"arithmetic.equality8bit.", "arithmetic.add3_8bit.", "arithmetic.expr_add_mul.", "arithmetic.expr_paren.",
|
| 2042 |
"control.push.", "control.pop.", "control.ret.",
|
| 2043 |
"combinational.barrelshifter.", "combinational.priorityencoder.",
|
| 2044 |
+
]
|
| 2045 |
+
|
| 2046 |
+
if bits in [16, 32]:
|
| 2047 |
+
drop_list.extend([
|
| 2048 |
+
f"alu.alu{bits}bit.", f"arithmetic.ripplecarry{bits}bit.",
|
| 2049 |
+
f"arithmetic.sub{bits}bit.", f"arithmetic.greaterthan{bits}bit.",
|
| 2050 |
+
f"arithmetic.lessthan{bits}bit.", f"arithmetic.greaterorequal{bits}bit.",
|
| 2051 |
+
f"arithmetic.lessorequal{bits}bit.", f"arithmetic.equality{bits}bit.",
|
| 2052 |
+
])
|
| 2053 |
+
|
| 2054 |
+
print("\nDropping existing ALU extension tensors...")
|
| 2055 |
+
drop_prefixes(tensors, drop_list)
|
| 2056 |
print(f" Now {len(tensors)} tensors")
|
| 2057 |
print("\nGenerating SHL/SHR circuits...")
|
| 2058 |
try:
|
|
|
|
         print("  Added EXPR_PAREN (8 + 64 AND + 56 full adders = 640 gates)")
     except ValueError as e:
         print(f"  EXPR_PAREN already exists: {e}")
+
+    if bits in [16, 32]:
+        print(f"\n{'=' * 60}")
+        print(f"  GENERATING {bits}-BIT CIRCUITS")
+        print(f"{'=' * 60}")
+
+        print(f"\nGenerating {bits}-bit ripple carry adder...")
+        try:
+            add_ripple_carry_nbits(tensors, bits)
+            print(f"  Added {bits}-bit adder ({bits} full adders = {bits * 9} gates)")
+        except ValueError as e:
+            print(f"  {bits}-bit adder already exists: {e}")
+
+        print(f"\nGenerating {bits}-bit subtractor...")
+        try:
+            add_sub_nbits(tensors, bits)
+            print(f"  Added {bits}-bit subtractor ({bits} NOT + {bits} full adders)")
+        except ValueError as e:
+            print(f"  {bits}-bit subtractor already exists: {e}")
+
+        print(f"\nGenerating {bits}-bit comparators...")
+        try:
+            add_comparators_nbits(tensors, bits)
+            print(f"  Added {bits}-bit GT, GE, LT, LE, EQ")
+        except ValueError as e:
+            print(f"  {bits}-bit comparators already exist: {e}")
+
+        print(f"\nGenerating {bits}-bit multiplication...")
+        try:
+            add_mul_nbits(tensors, bits)
+            print(f"  Added {bits}-bit MUL ({bits * bits} partial product AND gates)")
+        except ValueError as e:
+            print(f"  {bits}-bit MUL already exists: {e}")
+
+        print(f"\nGenerating {bits}-bit division...")
+        try:
+            add_div_nbits(tensors, bits)
+            print(f"  Added {bits}-bit DIV ({bits} stages)")
+        except ValueError as e:
+            print(f"  {bits}-bit DIV already exists: {e}")
+
+        print(f"\nGenerating {bits}-bit bitwise ops (AND, OR, XOR, NOT)...")
+        try:
+            add_bitwise_nbits(tensors, bits)
+            print(f"  Added {bits}-bit AND, OR, XOR, NOT")
+        except ValueError as e:
+            print(f"  {bits}-bit bitwise ops already exist: {e}")
+
+        print(f"\nGenerating {bits}-bit shift ops (SHL, SHR)...")
+        try:
+            add_shift_nbits(tensors, bits)
+            print(f"  Added {bits}-bit SHL, SHR")
+        except ValueError as e:
+            print(f"  {bits}-bit shift ops already exist: {e}")
+
+        print(f"\nGenerating {bits}-bit INC/DEC...")
+        try:
+            add_inc_dec_nbits(tensors, bits)
+            print(f"  Added {bits}-bit INC, DEC")
+        except ValueError as e:
+            print(f"  {bits}-bit INC/DEC already exist: {e}")
+
+        print(f"\nGenerating {bits}-bit NEG...")
+        try:
+            add_neg_nbits(tensors, bits)
+            print(f"  Added {bits}-bit NEG")
+        except ValueError as e:
+            print(f"  {bits}-bit NEG already exists: {e}")
+
     if args.apply:
         print(f"\nSaving: {args.model}")
         save_file(tensors, str(args.model))
         print("  Done.")
     else:
         print("\n[DRY-RUN] Use --apply to save.")
+
     print(f"\nTotal: {len(tensors)} tensors")
+    total_params = sum(t.numel() for t in tensors.values())
+    print(f"Total params: {total_params:,}")
     print("=" * 60)

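build.py reports nine threshold gates per full adder; the commit message verifies the resulting 32-bit adder with 1M + 2M = 3M and 0xDEAD0000 + 0xBEEF = 0xDEADBEEF. A pure-Python sketch of the same ripple-carry wiring (simplified to seven gates per full adder, so not the repo's exact circuit) reproduces those cases:

```python
def gate(weights, bias, xs):
    """Threshold neuron: 1 iff sum(w * x) + bias >= 0."""
    return 1 if sum(w * x for w, x in zip(weights, xs)) + bias >= 0 else 0

def xor(x, y):
    # XOR = (x OR y) AND (x NAND y): three threshold gates
    o = gate([1, 1], -1, [x, y])     # OR
    n = gate([-1, -1], 1, [x, y])    # NAND
    return gate([1, 1], -2, [o, n])  # AND

def full_adder(a, b, cin):
    carry = gate([1, 1, 1], -2, [a, b, cin])  # majority-of-3
    return xor(xor(a, b), cin), carry

def ripple_add(x, y, bits=32):
    acc, carry = 0, 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        acc |= s << i
    return acc  # result modulo 2**bits (final carry is dropped)

assert ripple_add(1_000_000, 2_000_000) == 3_000_000
assert ripple_add(0xDEAD0000, 0x0000BEEF) == 0xDEADBEEF
```

Because every operation bottoms out in integer comparisons, this sketch has none of the float32 precision issues that the single-layer comparators face.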
 Memory Profiles:
   full        64KB  (16-bit addr) - Full CPU mode
   reduced     4KB   (12-bit addr) - Reduced CPU
+  small       1KB   (10-bit addr) - 32-bit arithmetic scratch
   scratchpad  256B  (8-bit addr)  - LLM scratchpad
   registers   16B   (4-bit addr)  - LLM register file
   none        0B    (no memory)   - Pure ALU for LLM

+ALU Bit Widths:
+  8   Standard 8-bit ALU (default)
+  16  16-bit ALU (0-65535)
+  32  32-bit ALU (0-4294967295)
+
 Examples:
   python build.py memory --memory-profile none --apply    # LLM-only (no RAM)
+  python build.py memory --memory-profile small --apply   # 1KB for 32-bit scratch
+  python build.py alu --bits 32 --apply                   # 32-bit ALU circuits
+  python build.py all --bits 32 -m small --apply          # Full 32-bit build
 """
     )
     parser.add_argument("--model", type=Path, default=MODEL_PATH, help="Model path")
     parser.add_argument("--apply", action="store_true", help="Apply changes (default: dry-run)")
     parser.add_argument("--manifest", action="store_true", help="Write tensors.txt manifest (memory only)")
+    parser.add_argument(
+        "--bits", "-b",
+        type=int,
+        choices=SUPPORTED_BITS,
+        default=8,
+        help="ALU bit width: 8 (default), 16, or 32"
+    )

     mem_group = parser.add_mutually_exclusive_group()
     mem_group.add_argument(
         "--memory-profile", "-m",
         choices=list(MEMORY_PROFILES.keys()),
+        help="Memory size profile (full/reduced/small/scratchpad/registers/none)"
     )
     mem_group.add_argument(
         "--addr-bits", "-a",
⋮
     subparsers = parser.add_subparsers(dest="command", help="Subcommands")
     subparsers.add_parser("memory", help="Generate memory circuits (size controlled by --memory-profile or --addr-bits)")
+    subparsers.add_parser("alu", help="Generate ALU extension circuits (use --bits for 16/32-bit)")
     subparsers.add_parser("inputs", help="Add .inputs metadata tensors")
     subparsers.add_parser("all", help="Run memory, alu, then inputs")

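Each profile's byte count in the help text above is exactly `2 ** addr_bits`, which is what makes the `--memory-profile` / `--addr-bits` flags mutually exclusive: either one determines the other. A quick consistency check (the dict below mirrors the help table and is illustrative, not build.py's actual `MEMORY_PROFILES` definition):

```python
# Hypothetical mirror of the profile table from the --help epilog.
PROFILE_ADDR_BITS = {
    "full": 16,        # 64KB
    "reduced": 12,     # 4KB
    "small": 10,       # 1KB, added in this commit for 32-bit scratch
    "scratchpad": 8,   # 256B
    "registers": 4,    # 16B
}

for name, addr_bits in PROFILE_ADDR_BITS.items():
    size = 2 ** addr_bits
    print(f"{name:>10}: {addr_bits}-bit addresses -> {size} bytes")
```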
eval.py
CHANGED
|
@@ -968,6 +968,27 @@ class BatchedFitnessEvaluator:
         # Modular test range
         self.mod_test = torch.arange(256, device=d, dtype=torch.long)

+        # 32-bit test values (strategic sampling)
+        self.test_32bit = torch.tensor([
+            0, 1, 2, 255, 256, 65535, 65536,
+            0x7FFFFFFF, 0x80000000, 0xFFFFFFFF,
+            0x12345678, 0xDEADBEEF, 0xCAFEBABE,
+            1000000, 1000000000, 2147483647,
+            0x55555555, 0xAAAAAAAA, 0x0F0F0F0F, 0xF0F0F0F0
+        ], device=d, dtype=torch.long)
+
+        # 32-bit comparator test pairs
+        comp32_tests = [
+            (0, 0), (1, 0), (0, 1), (1000, 999), (999, 1000),
+            (0xFFFFFFFF, 0), (0, 0xFFFFFFFF),
+            (0x80000000, 0x7FFFFFFF), (0x7FFFFFFF, 0x80000000),
+            (1000000, 1000000), (0x12345678, 0x12345678),
+            (0xDEADBEEF, 0xCAFEBABE), (0xCAFEBABE, 0xDEADBEEF),
+            (256, 255), (255, 256), (65536, 65535), (65535, 65536),
+        ]
+        self.comp32_a = torch.tensor([c[0] for c in comp32_tests], device=d, dtype=torch.long)
+        self.comp32_b = torch.tensor([c[1] for c in comp32_tests], device=d, dtype=torch.long)
+
     def _record(self, name: str, passed: int, total: int, failures: List[Tuple] = None):
         """Record a circuit test result."""
         self.results.append(CircuitResult(

@@ -1705,6 +1726,107 @@ class BatchedFitnessEvaluator:

         return scores, total

+    def _test_comparators_nbits(self, pop: Dict, bits: int, debug: bool) -> Tuple[torch.Tensor, int]:
+        """Test N-bit comparator circuits (GT, LT, GE, LE, EQ)."""
+        pop_size = next(iter(pop.values())).shape[0]
+        scores = torch.zeros(pop_size, device=self.device)
+        total = 0
+
+        if debug:
+            print(f"\n=== {bits}-BIT COMPARATORS ===")
+
+        if bits == 32:
+            comp_a = self.comp32_a
+            comp_b = self.comp32_b
+        elif bits == 16:
+            comp_a = self.comp_a.clamp(0, 65535)
+            comp_b = self.comp_b.clamp(0, 65535)
+        else:
+            comp_a = self.comp_a
+            comp_b = self.comp_b
+
+        a_bits = torch.stack([((comp_a >> (bits - 1 - i)) & 1).float() for i in range(bits)], dim=1)
+        b_bits = torch.stack([((comp_b >> (bits - 1 - i)) & 1).float() for i in range(bits)], dim=1)
+        inputs = torch.cat([a_bits, b_bits], dim=1)
+
+        comparators = [
+            (f'arithmetic.greaterthan{bits}bit', lambda a, b: a > b),
+            (f'arithmetic.greaterorequal{bits}bit', lambda a, b: a >= b),
+            (f'arithmetic.lessthan{bits}bit', lambda a, b: a < b),
+            (f'arithmetic.lessorequal{bits}bit', lambda a, b: a <= b),
+        ]
+
+        for name, op in comparators:
+            try:
+                expected = torch.tensor([1.0 if op(a.item(), b.item()) else 0.0
+                                         for a, b in zip(comp_a, comp_b)], device=self.device)
+
+                w = pop[f'{name}.weight']
+                b = pop[f'{name}.bias']
+                out = heaviside(inputs @ w.view(pop_size, -1).T + b.view(pop_size))
+
+                correct = (out == expected.unsqueeze(1)).float().sum(0)
+
+                failures = []
+                if pop_size == 1:
+                    for i in range(len(comp_a)):
+                        if out[i, 0].item() != expected[i].item():
+                            failures.append((
+                                [int(comp_a[i].item()), int(comp_b[i].item())],
+                                expected[i].item(),
+                                out[i, 0].item()
+                            ))
+
+                self._record(name, int(correct[0].item()), len(comp_a), failures)
+                if debug:
+                    r = self.results[-1]
+                    print(f"  {r.name}: {r.passed}/{r.total} {'PASS' if r.success else 'FAIL'}")
+                scores += correct
+                total += len(comp_a)
+            except KeyError:
+                pass
+
+        prefix = f'arithmetic.equality{bits}bit'
+        try:
+            expected = torch.tensor([1.0 if a.item() == b.item() else 0.0
+                                     for a, b in zip(comp_a, comp_b)], device=self.device)
+
+            w_geq = pop[f'{prefix}.layer1.geq.weight']
+            b_geq = pop[f'{prefix}.layer1.geq.bias']
+            w_leq = pop[f'{prefix}.layer1.leq.weight']
+            b_leq = pop[f'{prefix}.layer1.leq.bias']
+
+            h_geq = heaviside(inputs @ w_geq.view(pop_size, -1).T + b_geq.view(pop_size))
+            h_leq = heaviside(inputs @ w_leq.view(pop_size, -1).T + b_leq.view(pop_size))
+            hidden = torch.stack([h_geq, h_leq], dim=-1)

+            w2 = pop[f'{prefix}.layer2.weight']
+            b2 = pop[f'{prefix}.layer2.bias']
+            out = heaviside((hidden * w2.view(pop_size, 1, 2)).sum(-1) + b2.view(pop_size))
+
+            correct = (out == expected.unsqueeze(1)).float().sum(0)
+
+            failures = []
+            if pop_size == 1:
+                for i in range(len(comp_a)):
+                    if out[i, 0].item() != expected[i].item():
+                        failures.append((
+                            [int(comp_a[i].item()), int(comp_b[i].item())],
+                            expected[i].item(),
+                            out[i, 0].item()
+                        ))
+
+            self._record(prefix, int(correct[0].item()), len(comp_a), failures)
+            if debug:
+                r = self.results[-1]
+                print(f"  {r.name}: {r.passed}/{r.total} {'PASS' if r.success else 'FAIL'}")
+            scores += correct
+            total += len(comp_a)
+        except KeyError:
+            pass
+
+        return scores, total
+
     # =========================================================================
     # THRESHOLD GATES
     # =========================================================================

@@ -3399,6 +3521,24 @@ class BatchedFitnessEvaluator:
             total_tests += t
             self.category_scores[f'ripplecarry{bits}'] = (s[0].item() if pop_size == 1 else s.mean().item(), t)

+        # 16/32-bit circuits (if present)
+        for bits in [16, 32]:
+            if f'arithmetic.ripplecarry{bits}bit.fa0.ha1.sum.layer1.or.weight' in population:
+                if debug:
+                    print(f"\n{'=' * 60}")
+                    print(f"  {bits}-BIT CIRCUITS")
+                    print(f"{'=' * 60}")
+
+                s, t = self._test_ripplecarry(population, bits, debug)
+                scores += s
+                total_tests += t
+                self.category_scores[f'ripplecarry{bits}'] = (s[0].item() if pop_size == 1 else s.mean().item(), t)
+
+                s, t = self._test_comparators_nbits(population, bits, debug)
+                scores += s
+                total_tests += t
+                self.category_scores[f'comparators{bits}'] = (s[0].item() if pop_size == 1 else s.mean().item(), t)
+
         # 3-operand adder
         s, t = self._test_add3(population, debug)
         scores += s

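The single-layer comparator tests above feed all `2 * bits` input bits into one threshold neuron, which at 32 bits implies power-of-two weights up to 2³¹; the commit message flags the resulting float32 mantissa overflow as a known failure. A stdlib-only illustration of why (`f32` simply rounds a value through IEEE-754 single precision; no torch required):

```python
import struct

def f32(x) -> float:
    """Round a number through IEEE-754 single precision (float32)."""
    return struct.unpack('f', struct.pack('f', float(x)))[0]

# float32 carries a 24-bit significand, so integer gaps open above 2**24:
assert f32(2**24) == 2**24          # still exact
assert f32(2**24 + 1) == 2**24      # first lost integer: rounds back down
assert f32(2**32 - 1) == 2**32      # 0xFFFFFFFF rounds UP past the true value

# Consequence: a weighted sum of bit weights 2**0 .. 2**31 accumulated in
# float32 cannot separate operands that differ only in low-order bits, so a
# one-neuron 32-bit GT/LT gate mis-ranks close pairs like those in
# comp32_tests (e.g. 65536 vs 65535 after the high bits dominate).
```

Cascading the comparison byte-wise (the fix the commit message plans) keeps every per-stage weight small enough to stay exact in float32.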
neural_alu32.safetensors
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:788a277fbff9e44eb9006f5f76839ced42d90c1ff31513b36b34c9ee604e3d97
+size 4972488