CharlesCNorton committed on
Commit 6c2c63e · 1 Parent(s): 939c167

Rebuild float16 LUT/pow + 16-bit arithmetic; fix neg16bit


- build.py: add float16 LUT match/output generation (sqrt/rsqrt/exp/ln/log2/sin/cos/tan/tanh) and pow via ln*mul->exp; add float16 half conversion helpers and LUT output builders; add 16-bit arithmetic builders (ripplecarry/adc/sbc/sub/cmp/equality/neg/asr/rol/ror/clz) plus comparator/constant vectors; add gate helpers for NOT/AND/OR/XOR/XNOR; extend input inference for new circuits and 16-bit variants; infer multiplier2x2 .andXY inputs; define neg16bit sum0 as NOT(not0).

- eval.py: add float16 LUT and pow tests with direct LUT index evaluation; add 16-bit arithmetic/comparator/shift tests; add topo/alias caching; update negation evaluation to handle sum0/carry0 arity; update orphan/selector tests to 16-bit.

- arithmetic.safetensors: regenerate tensors and .inputs registry with zero missing inputs (626,374 tensors / 208,788 gates).

- README: update circuit list (float16 LUT+pow, 16-bit integer), accuracy notes, counts, and TODO status.
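The LUT generation described above reduces each unary float16 op to 65,536 exact-match threshold gates feeding 16 one-hot output gates. A match gate can be sketched as follows (a minimal illustration of the gate shape; `match_gate` and `bits_of` are hypothetical helpers, not names from this repo):

```python
def match_gate(pattern: int, x_bits: list) -> int:
    """Threshold unit that fires iff x_bits equals the 16-bit pattern.

    Weights are +1 where the pattern bit is 1 and -1 where it is 0;
    bias is -(popcount(pattern) - 0.5), so the pre-activation is +0.5
    on an exact match and <= -0.5 otherwise.
    """
    weights = [1.0 if (pattern >> i) & 1 else -1.0 for i in range(16)]
    bias = -(bin(pattern).count("1") - 0.5)
    activation = sum(w * x for w, x in zip(weights, x_bits)) + bias
    return 1 if activation > 0 else 0

def bits_of(value: int) -> list:
    """Little-endian bit vector of a 16-bit value."""
    return [(value >> i) & 1 for i in range(16)]
```

Each LUT output bit is then an OR over the one-hot match vector (weight 1.0 on every pattern whose result has that bit set, bias -0.5).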

Files changed (4)
  1. README.md +22 -18
  2. arithmetic.safetensors +2 -2
  3. build.py +613 -75
  4. eval.py +601 -39
README.md CHANGED
@@ -22,26 +22,34 @@ Each gate is a threshold logic unit: `output = step(weights · inputs + bias)`.
 
 | File | Description |
 |------|-------------|
-| `arithmetic.safetensors` | 33,451 tensors encoding 11,147 gates |
-| `eval.py` | Test harness (208,637 tests) |
+| `arithmetic.safetensors` | 626,374 tensors encoding 208,788 gates |
+| `eval.py` | Test harness (211,581 tests) |
 | `build.py` | Builds tensors and infers gate connectivity |
 
 ## Circuits
 
 **Float16 (IEEE 754)**
 - `float16.add`, `float16.sub`, `float16.mul`, `float16.div`
+- `float16.sqrt`, `float16.rsqrt`, `float16.pow`
+- `float16.exp`, `float16.ln`, `float16.log2`
+- `float16.sin`, `float16.cos`, `float16.tan`, `float16.tanh`
 - `float16.neg`, `float16.abs`, `float16.cmp`
 - `float16.toint`, `float16.fromint`
 - `float16.pack`, `float16.unpack`, `float16.normalize`
 
 Handles NaN, Inf, zero, subnormals. Mantissa alignment via barrel shifter. Normalization via CLZ.
 
-**8-bit Integer**
-- Adders: half, full, ripple carry (2/4/8 bit), add-with-carry
-- Subtraction: sub8bit, sbc8bit, neg8bit
-- Comparison: cmp8bit, equality8bit
-- Shifts: asr8bit, rol8bit, ror8bit
-- CLZ: 8-bit and 16-bit
+Accuracy/rounding:
+- Unary transcendental ops are LUT-backed over all 65,536 float16 inputs.
+- Outputs match torch.float16 results (round-to-nearest-even); NaNs are canonicalized to 0x7E00.
+- `float16.pow` is defined as exp(b * ln(a)) with float16 rounding at each stage.
+
+**16-bit Integer**
+- Adders: half, full, ripple carry (2/4/16 bit), add-with-carry (adc16bit)
+- Subtraction: sub16bit, sbc16bit, neg16bit
+- Comparison: cmp16bit, equality16bit
+- Shifts: asr16bit, rol16bit, ror16bit
+- CLZ: 16-bit
 
 **Modular Arithmetic**
 - mod2 through mod12 (divisibility testing)
@@ -141,24 +149,20 @@ This began as an attempt to build a complete threshold-logic CPU. The CPU is in 
 - Float16 core (add/sub/mul/div)
 - Float16 utilities (pack/unpack/normalize/conversions)
 - Float16 IEEE-754 half compliance for add/sub/mul/div + toint/fromint (including subnormals)
-- 8-bit integer arithmetic
+- Float16 unary LUTs (sqrt/rsqrt/exp/ln/log2/sin/cos/tan/tanh)
+- Float16 pow via exp(b * ln(a))
+- 16-bit integer arithmetic (add/sub/cmp/shifts/CLZ)
 - Boolean, threshold, modular, pattern recognition, combinational
 
 **Next:**
-- Float16 sqrt, rsqrt, pow
-- Float16 exp, ln, log2
-- Float16 trig (sin, cos, tan via CORDIC)
-- Float16 tanh (ML activation)
+- TBD
 
 **Cleanup:**
-- Rip out 8-bit integer circuits, replace with 16-bit
-- 8-bit was scaffolding for float16 development, not the product
+- None (8-bit arithmetic scaffolding removed)
 
 ## TODO (Unified)
 
-1. Define accuracy/rounding specs and implement float16 sqrt/rsqrt/pow/exp/ln/log2.
-2. Implement float16 trig (sin/cos/tan via CORDIC) and tanh with explicit accuracy targets.
-3. Replace 8-bit integer circuits with 16-bit and remove 8-bit scaffolding.
+None.
 
 ## License
 
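The staged-rounding definition of `float16.pow` (exp of b times ln a, with float16 rounding after each stage) can be reproduced on the host with Python's built-in half-precision codec. A sketch under that assumption; `round_f16` and `f16_pow` are illustrative helpers, not code from this repo:

```python
import math
import struct

def round_f16(x: float) -> float:
    """Round a double to the nearest representable float16 (ties to even)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def f16_pow(a: float, b: float) -> float:
    """pow(a, b) as exp(b * ln(a)), rounding to float16 after each stage."""
    ln_a = round_f16(math.log(a))    # stage 1: ln LUT output
    prod = round_f16(b * ln_a)       # stage 2: float16 multiply
    return round_f16(math.exp(prod)) # stage 3: exp LUT output
```

Because ln(a) is rounded before the multiply, results can differ from a correctly rounded pow by a few ULPs; that is inherent to the three-stage definition, not a circuit bug.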
arithmetic.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:418e84ff00ccfcbeb22920eaa70851d4cd6d6d6945f624fbd9354ab340c11bcb
-size 4281848
+oid sha256:7437f06026058f699dad09dd4b657ed6fe81d4845a2b0dc7213b2bbb048273c6
+size 247445516
build.py CHANGED
@@ -17,8 +17,10 @@ from safetensors import safe_open
17
  from safetensors.torch import save_file
18
  import json
19
  import re
 
 
20
  from collections import defaultdict
21
- from typing import Dict, List, Tuple, Set
22
 
23
  class SignalRegistry:
24
  """Manages signal ID assignments."""
@@ -46,6 +48,77 @@ class SignalRegistry:
46
  return json.dumps(self.id_to_name)
47
 
48
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49
  def extract_gate_name(tensor_name: str) -> str:
50
  """Extract gate name from tensor name (remove .weight or .bias suffix)."""
51
  if tensor_name.endswith('.weight'):
@@ -92,6 +165,85 @@ def infer_boolean_inputs(gate: str, registry: SignalRegistry) -> List[int]:
92
  return []
93
 
94
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
95
  def infer_halfadder_inputs(gate: str, prefix: str, registry: SignalRegistry) -> List[int]:
96
  """Infer inputs for half adder gates."""
97
  registry.register(f"{prefix}.$a")
@@ -264,27 +416,27 @@ def infer_modular_inputs(gate: str, registry: SignalRegistry) -> List[int]:
264
 
265
  def infer_comparator_inputs(gate: str, registry: SignalRegistry) -> List[int]:
266
  """Infer inputs for comparator gates."""
267
- # 8-bit inputs a and b
268
  prefix = gate.rsplit('.', 1)[0] # Remove .comparator
 
269
 
270
  inputs = []
271
- for i in range(8):
272
  registry.register(f"{prefix}.$a[{i}]")
273
  registry.register(f"{prefix}.$b[{i}]")
274
 
275
  # Comparator takes difference of bit pairs
276
- for i in range(8):
277
  inputs.append(registry.get_id(f"{prefix}.$a[{i}]"))
278
- for i in range(8):
279
  inputs.append(registry.get_id(f"{prefix}.$b[{i}]"))
280
 
281
  return inputs
282
 
283
 
284
- def infer_adc_sbc_inputs(gate: str, prefix: str, registry: SignalRegistry) -> List[int]:
285
  """Infer inputs for ADC/SBC (add/subtract with carry) gates."""
286
  # Register inputs
287
- for i in range(8):
288
  registry.register(f"{prefix}.$a[{i}]")
289
  registry.register(f"{prefix}.$b[{i}]")
290
  registry.register(f"{prefix}.$cin")
@@ -346,11 +498,9 @@ def infer_adc_sbc_inputs(gate: str, prefix: str, registry: SignalRegistry) -> Li
346
  return []
347
 
348
 
349
- def infer_sub8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
350
- """Infer inputs for SUB8BIT (subtraction via complement addition)."""
351
- prefix = "arithmetic.sub8bit"
352
-
353
- for i in range(8):
354
  registry.register(f"{prefix}.$a[{i}]")
355
  registry.register(f"{prefix}.$b[{i}]")
356
 
@@ -404,15 +554,22 @@ def infer_sub8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
404
  return []
405
 
406
 
407
- def infer_cmp8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
408
- """Infer inputs for CMP8BIT (compare via subtraction)."""
409
- prefix = "arithmetic.cmp8bit"
410
 
411
- for i in range(8):
 
 
 
 
 
 
 
 
412
  registry.register(f"{prefix}.$a[{i}]")
413
  registry.register(f"{prefix}.$b[{i}]")
414
 
415
- # Similar to sub8bit
416
  if '.notb' in gate:
417
  match = re.search(r'\.notb(\d+)', gate)
418
  if match:
@@ -454,23 +611,28 @@ def infer_cmp8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
454
  return [registry.register(f"{fa_prefix}.and1"),
455
  registry.register(f"{fa_prefix}.and2")]
456
 
457
- # Flag outputs
458
  if '.flags.' in gate:
459
- # Flags take the result bits
460
- return [registry.register(f"{prefix}.fa{i}.sum") for i in range(8)]
461
 
462
  return []
463
 
464
 
465
- def infer_equality8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
466
- """Infer inputs for equality circuit (XNOR chain + AND)."""
467
- prefix = "arithmetic.equality8bit"
468
 
469
- for i in range(8):
 
 
 
 
 
 
 
 
470
  registry.register(f"{prefix}.$a[{i}]")
471
  registry.register(f"{prefix}.$b[{i}]")
472
 
473
- # XNOR gates
474
  match = re.search(r'\.xnor(\d+)\.', gate)
475
  if match:
476
  idx = int(match.group(1))
@@ -484,29 +646,36 @@ def infer_equality8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
484
  nor_out = registry.register(f"{prefix}.xnor{idx}.layer1.nor")
485
  return [and_out, nor_out]
486
 
487
- # Final AND
488
  if '.and' in gate or '.final_and' in gate:
489
- return [registry.register(f"{prefix}.xnor{i}") for i in range(8)]
490
 
491
  return []
492
 
493
 
494
- def infer_neg8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
495
- """Infer inputs for NEG8BIT (two's complement negation)."""
496
- prefix = "arithmetic.neg8bit"
497
 
498
- for i in range(8):
 
 
 
 
 
 
 
 
499
  registry.register(f"{prefix}.$x[{i}]")
500
 
501
- # NOT gates
502
  if '.not' in gate and 'layer' not in gate:
503
  match = re.search(r'\.not(\d+)', gate)
504
  if match:
505
  idx = int(match.group(1))
506
  return [registry.get_id(f"{prefix}.$x[{idx}]")]
507
 
508
- # Increment by 1 (add chain)
509
- if '.sum0' in gate or '.carry0' in gate:
 
510
  return [registry.register(f"{prefix}.not0"), registry.get_id("#1")]
511
 
512
  match = re.search(r'\.xor(\d+)\.', gate)
@@ -538,19 +707,41 @@ def infer_neg8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
538
  return []
539
 
540
 
 
 
 
 
 
 
 
 
 
 
541
  def infer_shift_rotate_inputs(gate: str, registry: SignalRegistry) -> List[int]:
542
  """Infer inputs for ASR, ROL, ROR."""
543
  # Determine which circuit
544
- if 'asr8bit' in gate:
 
 
 
 
 
 
 
 
 
545
  prefix = "arithmetic.asr8bit"
 
546
  elif 'rol8bit' in gate:
547
  prefix = "arithmetic.rol8bit"
 
548
  elif 'ror8bit' in gate:
549
  prefix = "arithmetic.ror8bit"
 
550
  else:
551
  return []
552
 
553
- for i in range(8):
554
  registry.register(f"{prefix}.$x[{i}]")
555
 
556
  # Bit selectors
@@ -558,12 +749,12 @@ def infer_shift_rotate_inputs(gate: str, registry: SignalRegistry) -> List[int]:
558
  if match:
559
  idx = int(match.group(1))
560
  # Each output bit selects from input bits based on shift
561
- return [registry.get_id(f"{prefix}.$x[{i}]") for i in range(8)]
562
 
563
  # Carry/shift out
564
  if '.cout' in gate or '.shiftout' in gate:
565
  if 'rol' in gate:
566
- return [registry.get_id(f"{prefix}.$x[7]")] # MSB shifts out
567
  elif 'ror' in gate:
568
  return [registry.get_id(f"{prefix}.$x[0]")] # LSB shifts out
569
  elif 'asr' in gate:
@@ -603,6 +794,15 @@ def infer_multiplier_inputs(gate: str, registry: SignalRegistry) -> List[int]:
603
  return [registry.get_id(f"{prefix}.$a[{col}]"),
604
  registry.get_id(f"{prefix}.$b[{row}]")]
605
 
 
 
 
 
 
 
 
 
 
606
  # Stage adders
607
  match = re.search(r'\.stage(\d+)\.bit(\d+)\.', gate)
608
  if match:
@@ -661,40 +861,60 @@ def infer_multiplier_inputs(gate: str, registry: SignalRegistry) -> List[int]:
661
 
662
  def infer_incr_decr_inputs(gate: str, registry: SignalRegistry) -> List[int]:
663
  """Infer inputs for incrementer/decrementer."""
664
- if 'incrementer' in gate:
 
 
 
 
 
 
665
  prefix = "arithmetic.incrementer8bit"
 
666
  elif 'decrementer' in gate:
667
  prefix = "arithmetic.decrementer8bit"
 
668
  else:
669
  return []
670
 
671
- for i in range(8):
672
  registry.register(f"{prefix}.$x[{i}]")
673
 
674
  # These typically just reference adder and constant
675
- return [registry.get_id(f"{prefix}.$x[{i}]") for i in range(8)]
676
 
677
 
678
  def infer_minmax_inputs(gate: str, registry: SignalRegistry) -> List[int]:
679
  """Infer inputs for min/max/absolutedifference."""
680
- if 'max8bit' in gate:
 
 
 
 
 
 
 
 
 
681
  prefix = "arithmetic.max8bit"
 
682
  elif 'min8bit' in gate:
683
  prefix = "arithmetic.min8bit"
 
684
  elif 'absolutedifference' in gate:
685
  prefix = "arithmetic.absolutedifference8bit"
 
686
  else:
687
  return []
688
 
689
- for i in range(8):
690
  registry.register(f"{prefix}.$a[{i}]")
691
  registry.register(f"{prefix}.$b[{i}]")
692
 
693
  # Select/diff weights take comparison + both operands
694
  inputs = []
695
- for i in range(8):
696
  inputs.append(registry.get_id(f"{prefix}.$a[{i}]"))
697
- for i in range(8):
698
  inputs.append(registry.get_id(f"{prefix}.$b[{i}]"))
699
  return inputs
700
 
@@ -993,6 +1213,8 @@ def infer_inputs_for_gate(gate: str, registry: SignalRegistry, routing: dict) ->
993
  # Ripple carry adders
994
  if 'ripplecarry8bit' in gate:
995
  return infer_ripplecarry_inputs(gate, 'arithmetic.ripplecarry8bit', 8, registry)
 
 
996
  if 'ripplecarry4bit' in gate:
997
  return infer_ripplecarry_inputs(gate, 'arithmetic.ripplecarry4bit', 4, registry)
998
  if 'ripplecarry2bit' in gate:
@@ -1000,28 +1222,41 @@ def infer_inputs_for_gate(gate: str, registry: SignalRegistry, routing: dict) ->
1000
 
1001
  # ADC/SBC
1002
  if 'adc8bit' in gate:
1003
- return infer_adc_sbc_inputs(gate, 'arithmetic.adc8bit', registry)
 
 
1004
  if 'sbc8bit' in gate:
1005
- return infer_adc_sbc_inputs(gate, 'arithmetic.sbc8bit', registry)
 
 
1006
 
1007
  # SUB
1008
  if 'sub8bit' in gate:
1009
  return infer_sub8bit_inputs(gate, registry)
 
 
1010
 
1011
  # CMP
1012
  if 'cmp8bit' in gate:
1013
  return infer_cmp8bit_inputs(gate, registry)
 
 
1014
 
1015
  # Equality
1016
  if 'equality8bit' in gate:
1017
  return infer_equality8bit_inputs(gate, registry)
 
 
1018
 
1019
  # Negate
1020
  if 'neg8bit' in gate:
1021
  return infer_neg8bit_inputs(gate, registry)
 
 
1022
 
1023
  # Shifts and rotates
1024
- if 'asr8bit' in gate or 'rol8bit' in gate or 'ror8bit' in gate:
 
1025
  return infer_shift_rotate_inputs(gate, registry)
1026
 
1027
  # Multipliers
@@ -1038,7 +1273,9 @@ def infer_inputs_for_gate(gate: str, registry: SignalRegistry, routing: dict) ->
1038
 
1039
  # Comparators
1040
  if 'greaterthan8bit' in gate or 'lessthan8bit' in gate or \
1041
- 'greaterorequal8bit' in gate or 'lessorequal8bit' in gate:
 
 
1042
  return infer_comparator_inputs(gate, registry)
1043
 
1044
  # CLZ (count leading zeros)
@@ -1049,6 +1286,16 @@ def infer_inputs_for_gate(gate: str, registry: SignalRegistry, routing: dict) ->
1049
 
1050
  # Float16 circuits
1051
  if gate.startswith('float16.'):
 
 
 
 
 
 
 
 
 
 
1052
  if 'unpack' in gate:
1053
  return infer_float16_unpack_inputs(gate, registry)
1054
  if 'pack' in gate:
@@ -2786,18 +3033,24 @@ def infer_float16_sub_inputs(gate: str, registry: SignalRegistry) -> List[int]:
2786
  return []
2787
 
2788
 
2789
- def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
2790
- """Infer inputs for float16.mul circuit."""
2791
- prefix = "float16.mul"
 
 
 
 
 
2792
 
2793
- for i in range(16):
2794
- registry.register(f"{prefix}.$a[{i}]")
2795
- registry.register(f"{prefix}.$b[{i}]")
 
2796
 
2797
- exp_a_bits = [f"{prefix}.$a[{10+i}]" for i in range(5)]
2798
- exp_b_bits = [f"{prefix}.$b[{10+i}]" for i in range(5)]
2799
- mant_a_bits = [f"{prefix}.$a[{i}]" for i in range(10)]
2800
- mant_b_bits = [f"{prefix}.$b[{i}]" for i in range(10)]
2801
 
2802
  if '.exp_a_all_ones' in gate:
2803
  return [registry.get_id(b) for b in exp_a_bits]
@@ -2842,11 +3095,11 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
2842
  match = re.search(r'\.mant_a_norm(\d+)$', gate)
2843
  if match:
2844
  i = int(match.group(1))
2845
- return [registry.get_id(f"{prefix}.$a[{i}]")]
2846
  match = re.search(r'\.mant_b_norm(\d+)$', gate)
2847
  if match:
2848
  i = int(match.group(1))
2849
- return [registry.get_id(f"{prefix}.$b[{i}]")]
2850
 
2851
  for i in range(10):
2852
  registry.register(f"{prefix}.mant_a_norm{i}")
@@ -2919,11 +3172,11 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
2919
  registry.register(f"{prefix}.result_is_zero")
2920
 
2921
  if '.result_sign.layer1.or' in gate:
2922
- return [registry.get_id(f"{prefix}.$a[15]"),
2923
- registry.get_id(f"{prefix}.$b[15]")]
2924
  if '.result_sign.layer1.nand' in gate:
2925
- return [registry.get_id(f"{prefix}.$a[15]"),
2926
- registry.get_id(f"{prefix}.$b[15]")]
2927
  if '.result_sign.layer2' in gate:
2928
  return [registry.register(f"{prefix}.result_sign.layer1.or"),
2929
  registry.register(f"{prefix}.result_sign.layer1.nand")]
@@ -2944,11 +3197,11 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
2944
  if i == 10:
2945
  a_bit = registry.get_id(f"{prefix}.implicit_a")
2946
  else:
2947
- a_bit = registry.get_id(f"{prefix}.$a[{i}]")
2948
  if j == 10:
2949
  b_bit = registry.get_id(f"{prefix}.implicit_b")
2950
  else:
2951
- b_bit = registry.get_id(f"{prefix}.$b[{j}]")
2952
  return [a_bit, b_bit]
2953
 
2954
  for i in range(11):
@@ -6810,6 +7063,32 @@ def build_float16_unpack_tensors() -> Dict[str, torch.Tensor]:
6810
  return tensors
6811
 
6812
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6813
  def build_clz16bit_tensors() -> Dict[str, torch.Tensor]:
6814
  """Build tensors for arithmetic.clz16bit circuit.
6815
 
@@ -10595,6 +10874,166 @@ def build_clz8bit_tensors() -> Dict[str, torch.Tensor]:
10595
  return tensors
10596
 
10597
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10598
  def main():
10599
  print("Loading existing tensors...")
10600
  tensors = {}
@@ -10625,6 +11064,31 @@ def main():
10625
  del tensors[k]
10626
  print(f"Removed {len(old_float16_div)} old float16.div tensors")
10627
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10628
  # Remove broken mod2/mod4/mod8 tensors
10629
  old_mod_power2 = [k for k in tensors.keys() if k.startswith('modular.mod2') or
10630
  k.startswith('modular.mod4') or k.startswith('modular.mod8')]
@@ -10647,10 +11111,6 @@ def main():
10647
 
10648
  # Build new circuits
10649
  print("Building new circuits...")
10650
- clz_tensors = build_clz8bit_tensors()
10651
- tensors.update(clz_tensors)
10652
- print(f" CLZ8BIT: {len(clz_tensors)} tensors")
10653
-
10654
  clz16_tensors = build_clz16bit_tensors()
10655
  tensors.update(clz16_tensors)
10656
  print(f" CLZ16BIT: {len(clz16_tensors)} tensors")
@@ -10703,14 +11163,92 @@ def main():
10703
  tensors.update(fromint_tensors)
10704
  print(f" float16.fromint: {len(fromint_tensors)} tensors")
10705
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10706
  mod_power2_tensors = build_modular_power2_tensors()
10707
  tensors.update(mod_power2_tensors)
10708
  print(f" modular.mod2/4/8: {len(mod_power2_tensors)} tensors")
10709
 
10710
- bitwise_tensors = build_bitwise_shift_tensors()
10711
- tensors.update(bitwise_tensors)
10712
- print(f" bitwise shifts: {len(bitwise_tensors)} tensors")
10713
-
10714
  symmetry_tensors = build_symmetry8bit_tensors()
10715
  tensors.update(symmetry_tensors)
10716
  print(f" symmetry8bit: {len(symmetry_tensors)} tensors")
 
17
  from safetensors.torch import save_file
18
  import json
19
  import re
20
+ import struct
21
+ import math
22
  from collections import defaultdict
23
+ from typing import Dict, List, Tuple, Set, Callable, Optional
24
 
25
  class SignalRegistry:
26
  """Manages signal ID assignments."""
 
48
  return json.dumps(self.id_to_name)
49
 
50
 
51
+ def float16_bits_to_float(bits: int) -> float:
52
+ """Interpret 16-bit int as IEEE-754 float16."""
53
+ packed = struct.pack('>H', bits & 0xFFFF)
54
+ return struct.unpack('>e', packed)[0]
55
+
56
+
57
+ def float16_float_to_bits(val: float) -> int:
58
+ """Convert float to IEEE-754 float16 bits with canonical NaN."""
59
+ try:
60
+ packed = struct.pack('>e', float(val))
61
+ return struct.unpack('>H', packed)[0]
62
+ except (OverflowError, struct.error):
63
+ if val == float('inf'):
64
+ return 0x7C00
65
+ if val == float('-inf'):
66
+ return 0xFC00
67
+ if val != val:
68
+ return 0x7E00
69
+ return 0x7BFF if val > 0 else 0xFBFF
70
+
71
+
72
+ def compute_float16_unary_lut_outputs(op_fn: Callable[[torch.Tensor], torch.Tensor]) -> List[int]:
73
+ """Compute output bits for all 65536 float16 inputs using a unary op."""
74
+ outputs: List[int] = [0] * 65536
75
+ for bits in range(65536):
76
+ val = float16_bits_to_float(bits)
77
+ out = op_fn(torch.tensor(val, dtype=torch.float16)).item()
78
+ if out != out:
79
+ outputs[bits] = 0x7E00
80
+ else:
81
+ outputs[bits] = float16_float_to_bits(float(out))
82
+ return outputs
83
+
84
+
85
+ def build_float16_lut_match_tensors(prefix: str) -> Dict[str, torch.Tensor]:
86
+ """Build exact-match gates for all 16-bit patterns under prefix.matchXXXX."""
87
+ tensors: Dict[str, torch.Tensor] = {}
88
+ for bits in range(65536):
89
+ ones = bits.bit_count()
90
+ weights = [1.0 if (bits >> i) & 1 else -1.0 for i in range(16)]
91
+ bias = -(ones - 0.5)
92
+ name = f"{prefix}.match{bits:04x}"
93
+ tensors[f"{name}.weight"] = torch.tensor(weights)
94
+ tensors[f"{name}.bias"] = torch.tensor([bias])
95
+ return tensors
96
+
97
+
98
+ def build_float16_lut_output_tensors(prefix: str, outputs: List[int]) -> Dict[str, torch.Tensor]:
99
+ """Build LUT output gates (prefix.out0..out15) using one-hot match inputs."""
100
+ tensors: Dict[str, torch.Tensor] = {}
101
+ for bit in range(16):
102
+ weights = torch.zeros(65536)
103
+ for idx, out_bits in enumerate(outputs):
104
+ if (out_bits >> bit) & 1:
105
+ weights[idx] = 1.0
106
+ tensors[f"{prefix}.out{bit}.weight"] = weights
107
+ tensors[f"{prefix}.out{bit}.bias"] = torch.tensor([-0.5])
108
+ return tensors
109
+
110
+
111
+ def clone_prefix_tensors(src: Dict[str, torch.Tensor], old_prefix: str,
112
+ new_prefix: str) -> Dict[str, torch.Tensor]:
113
+ """Clone tensors and rewrite the prefix in tensor names."""
114
+ out: Dict[str, torch.Tensor] = {}
115
+ for name, tensor in src.items():
116
+ if name.startswith(old_prefix + "."):
117
+ out_name = new_prefix + name[len(old_prefix):]
118
+ out[out_name] = tensor.clone()
119
+ return out
120
+
121
+
122
  def extract_gate_name(tensor_name: str) -> str:
123
  """Extract gate name from tensor name (remove .weight or .bias suffix)."""
124
  if tensor_name.endswith('.weight'):
 
165
  return []
166
 
167
 
168
+ def get_lut_match_ids(registry: SignalRegistry, match_prefix: str) -> List[int]:
169
+ """Get (and cache) match gate IDs for a LUT prefix."""
170
+ cache = getattr(registry, "_lut_match_ids", None)
171
+ if cache is None:
172
+ cache = {}
173
+ setattr(registry, "_lut_match_ids", cache)
174
+ if match_prefix not in cache:
175
+ cache[match_prefix] = [registry.register(f"{match_prefix}.match{idx:04x}") for idx in range(65536)]
176
+ return cache[match_prefix]
177
+
178
+
179
+ def infer_float16_lut_match_inputs(gate: str, registry: SignalRegistry,
180
+ match_prefix: str, input_bits: List[str]) -> List[int]:
181
+ """Infer inputs for LUT match gates (exact pattern match)."""
182
+ if not gate.startswith(f"{match_prefix}.match"):
183
+ return []
184
+ for name in input_bits:
185
+ registry.register(name)
186
+ return [registry.get_id(name) for name in input_bits]
187
+
188
+
189
+ def infer_float16_lut_out_inputs(gate: str, registry: SignalRegistry, match_prefix: str) -> List[int]:
190
+ """Infer inputs for LUT output gates (one-hot match vector)."""
191
+ match = re.search(r'\.out(\d+)$', gate)
192
+ if not match:
193
+ return []
194
+ return get_lut_match_ids(registry, match_prefix)
195
+
196
+
197
+ def infer_float16_lut_inputs(gate: str, registry: SignalRegistry) -> List[int]:
198
+ """Infer inputs for shared float16.lut match gates."""
199
+ prefix = "float16.lut"
200
+ input_bits = [f"{prefix}.$x[{i}]" for i in range(16)]
201
+ return infer_float16_lut_match_inputs(gate, registry, prefix, input_bits)
202
+
203
+
204
+ def infer_float16_pow_inputs(gate: str, registry: SignalRegistry) -> List[int]:
205
+ """Infer inputs for float16.pow circuit (ln -> mul -> exp)."""
206
+ prefix = "float16.pow"
207
+
208
+ # External inputs
209
+ for i in range(16):
210
+ registry.register(f"{prefix}.$a[{i}]")
211
+ registry.register(f"{prefix}.$b[{i}]")
212
+
213
+ # ln subcircuit (match + outputs)
214
+ ln_prefix = f"{prefix}.ln"
215
+ ln_input_bits = [f"{prefix}.$a[{i}]" for i in range(16)]
216
+ inputs = infer_float16_lut_match_inputs(gate, registry, ln_prefix, ln_input_bits)
217
+ if inputs:
218
+ return inputs
219
+ if gate.startswith(f"{ln_prefix}."):
220
+ return infer_float16_lut_out_inputs(gate, registry, ln_prefix)
221
+
222
+ # mul subcircuit (a = ln.out, b = external b)
223
+ if gate.startswith(f"{prefix}.mul."):
224
+ a_bits = [f"{ln_prefix}.out{i}" for i in range(16)]
225
+ b_bits = [f"{prefix}.$b[{i}]" for i in range(16)]
226
+ return infer_float16_mul_inputs(gate, registry, prefix=f"{prefix}.mul",
227
+ a_bits=a_bits, b_bits=b_bits)
228
+
229
+ # exp subcircuit (match + outputs) with input from mul outputs
230
+ exp_prefix = f"{prefix}.exp"
231
+ exp_input_bits = [f"{prefix}.mul.out{i}" for i in range(16)]
232
+ inputs = infer_float16_lut_match_inputs(gate, registry, exp_prefix, exp_input_bits)
233
+ if inputs:
234
+ return inputs
235
+ if gate.startswith(f"{exp_prefix}."):
236
+ return infer_float16_lut_out_inputs(gate, registry, exp_prefix)
237
+
238
+ # pow outputs (pass-through from exp.out)
239
+ match = re.search(r'\.out(\d+)$', gate)
240
+ if match:
241
+ i = int(match.group(1))
242
+ return [registry.get_id(f"{exp_prefix}.out{i}")]
243
+
244
+ return []
245
+
246
+
247
  def infer_halfadder_inputs(gate: str, prefix: str, registry: SignalRegistry) -> List[int]:
248
  """Infer inputs for half adder gates."""
249
  registry.register(f"{prefix}.$a")
 
416
 
417
  def infer_comparator_inputs(gate: str, registry: SignalRegistry) -> List[int]:
418
  """Infer inputs for comparator gates."""
 
419
  prefix = gate.rsplit('.', 1)[0] # Remove .comparator
420
+ bits = 16 if "16bit" in prefix else 8
421
 
422
  inputs = []
423
+ for i in range(bits):
424
  registry.register(f"{prefix}.$a[{i}]")
425
  registry.register(f"{prefix}.$b[{i}]")
426
 
427
  # Comparator takes difference of bit pairs
428
+ for i in range(bits):
429
  inputs.append(registry.get_id(f"{prefix}.$a[{i}]"))
430
+ for i in range(bits):
431
  inputs.append(registry.get_id(f"{prefix}.$b[{i}]"))
432
 
433
  return inputs
434
 
435
 
436
+ def infer_adc_sbc_inputs(gate: str, prefix: str, registry: SignalRegistry, bits: int = 8) -> List[int]:
437
  """Infer inputs for ADC/SBC (add/subtract with carry) gates."""
438
  # Register inputs
439
+ for i in range(bits):
440
  registry.register(f"{prefix}.$a[{i}]")
441
  registry.register(f"{prefix}.$b[{i}]")
442
  registry.register(f"{prefix}.$cin")
 
498
  return []
499
 
500
 
501
+ def infer_sub_inputs(gate: str, prefix: str, bits: int, registry: SignalRegistry) -> List[int]:
502
+ """Infer inputs for subtractor (complement addition) gates."""
503
+ for i in range(bits):
 
 
504
  registry.register(f"{prefix}.$a[{i}]")
505
  registry.register(f"{prefix}.$b[{i}]")
506
 
 
554
  return []
555
 
556
 
557
+ def infer_sub8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
558
+ """Infer inputs for SUB8BIT (subtraction via complement addition)."""
559
+ return infer_sub_inputs(gate, "arithmetic.sub8bit", 8, registry)
560
 
561
+
562
+ def infer_sub16bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
563
+ """Infer inputs for SUB16BIT (subtraction via complement addition)."""
564
+ return infer_sub_inputs(gate, "arithmetic.sub16bit", 16, registry)
565
+
566
+
567
+ def infer_cmp_inputs(gate: str, prefix: str, bits: int, registry: SignalRegistry) -> List[int]:
568
+ """Infer inputs for comparator via subtraction."""
569
+ for i in range(bits):
570
  registry.register(f"{prefix}.$a[{i}]")
571
  registry.register(f"{prefix}.$b[{i}]")
572
 
 
573
  if '.notb' in gate:
574
  match = re.search(r'\.notb(\d+)', gate)
575
  if match:
 
611
  return [registry.register(f"{fa_prefix}.and1"),
612
  registry.register(f"{fa_prefix}.and2")]
613
 
 
614
  if '.flags.' in gate:
615
+ return [registry.register(f"{prefix}.fa{i}.sum") for i in range(bits)]
 
616
 
617
  return []
618
 
619
 
620
+ def infer_cmp8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
621
+ """Infer inputs for CMP8BIT (compare via subtraction)."""
622
+ return infer_cmp_inputs(gate, "arithmetic.cmp8bit", 8, registry)
623
 
624
+
625
+ def infer_cmp16bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
626
+ """Infer inputs for CMP16BIT (compare via subtraction)."""
627
+ return infer_cmp_inputs(gate, "arithmetic.cmp16bit", 16, registry)
628
+
629
+
630
+ def infer_equality_inputs(gate: str, prefix: str, bits: int, registry: SignalRegistry) -> List[int]:
631
+ """Infer inputs for equality circuit (XNOR chain + AND)."""
632
+ for i in range(bits):
633
  registry.register(f"{prefix}.$a[{i}]")
634
  registry.register(f"{prefix}.$b[{i}]")
635
 
 
636
  match = re.search(r'\.xnor(\d+)\.', gate)
637
  if match:
638
  idx = int(match.group(1))
 
646
  nor_out = registry.register(f"{prefix}.xnor{idx}.layer1.nor")
647
  return [and_out, nor_out]
648
 
 
649
  if '.and' in gate or '.final_and' in gate:
650
+ return [registry.register(f"{prefix}.xnor{i}") for i in range(bits)]
651
 
652
  return []
653
 
654
 
655
+ def infer_equality8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
656
+ """Infer inputs for equality8bit circuit (XNOR chain + AND)."""
657
+ return infer_equality_inputs(gate, "arithmetic.equality8bit", 8, registry)
658
 
659
+
660
+ def infer_equality16bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
661
+ """Infer inputs for equality16bit circuit (XNOR chain + AND)."""
662
+ return infer_equality_inputs(gate, "arithmetic.equality16bit", 16, registry)
663
+
+
+ def infer_neg_inputs(gate: str, prefix: str, bits: int, registry: SignalRegistry) -> List[int]:
+     """Infer inputs for negation (two's complement)."""
+     for i in range(bits):
          registry.register(f"{prefix}.$x[{i}]")

      if '.not' in gate and 'layer' not in gate:
          match = re.search(r'\.not(\d+)', gate)
          if match:
              idx = int(match.group(1))
              return [registry.get_id(f"{prefix}.$x[{idx}]")]

+     if '.sum0' in gate:
+         return [registry.register(f"{prefix}.not0")]
+     if '.carry0' in gate:
          return [registry.register(f"{prefix}.not0"), registry.get_id("#1")]

      match = re.search(r'\.xor(\d+)\.', gate)

      return []


+ def infer_neg8bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
+     """Infer inputs for NEG8BIT (two's complement negation)."""
+     return infer_neg_inputs(gate, "arithmetic.neg8bit", 8, registry)
+
+
+ def infer_neg16bit_inputs(gate: str, registry: SignalRegistry) -> List[int]:
+     """Infer inputs for NEG16BIT (two's complement negation)."""
+     return infer_neg_inputs(gate, "arithmetic.neg16bit", 16, registry)
+
  def infer_shift_rotate_inputs(gate: str, registry: SignalRegistry) -> List[int]:
      """Infer inputs for ASR, ROL, ROR."""
      # Determine which circuit
+     if 'asr16bit' in gate:
+         prefix = "arithmetic.asr16bit"
+         bits = 16
+     elif 'rol16bit' in gate:
+         prefix = "arithmetic.rol16bit"
+         bits = 16
+     elif 'ror16bit' in gate:
+         prefix = "arithmetic.ror16bit"
+         bits = 16
+     elif 'asr8bit' in gate:
          prefix = "arithmetic.asr8bit"
+         bits = 8
      elif 'rol8bit' in gate:
          prefix = "arithmetic.rol8bit"
+         bits = 8
      elif 'ror8bit' in gate:
          prefix = "arithmetic.ror8bit"
+         bits = 8
      else:
          return []

+     for i in range(bits):
          registry.register(f"{prefix}.$x[{i}]")

      # Bit selectors

      if match:
          idx = int(match.group(1))
          # Each output bit selects from input bits based on shift
+         return [registry.get_id(f"{prefix}.$x[{i}]") for i in range(bits)]

      # Carry/shift out
      if '.cout' in gate or '.shiftout' in gate:
          if 'rol' in gate:
+             return [registry.get_id(f"{prefix}.$x[{bits-1}]")]  # MSB shifts out
          elif 'ror' in gate:
              return [registry.get_id(f"{prefix}.$x[0]")]  # LSB shifts out
          elif 'asr' in gate:
 
          return [registry.get_id(f"{prefix}.$a[{col}]"),
                  registry.get_id(f"{prefix}.$b[{row}]")]

+     # Direct AND gates used by multiplier2x2
+     if 'multiplier2x2' in gate:
+         match = re.search(r'\.and(\d)(\d)$', gate)
+         if match:
+             row, col = int(match.group(1)), int(match.group(2))
+             if row < size and col < size:
+                 return [registry.get_id(f"{prefix}.$a[{col}]"),
+                         registry.get_id(f"{prefix}.$b[{row}]")]
+
      # Stage adders
      match = re.search(r'\.stage(\d+)\.bit(\d+)\.', gate)
      if match:

  def infer_incr_decr_inputs(gate: str, registry: SignalRegistry) -> List[int]:
      """Infer inputs for incrementer/decrementer."""
+     if 'incrementer16bit' in gate:
+         prefix = "arithmetic.incrementer16bit"
+         bits = 16
+     elif 'decrementer16bit' in gate:
+         prefix = "arithmetic.decrementer16bit"
+         bits = 16
+     elif 'incrementer' in gate:
          prefix = "arithmetic.incrementer8bit"
+         bits = 8
      elif 'decrementer' in gate:
          prefix = "arithmetic.decrementer8bit"
+         bits = 8
      else:
          return []

+     for i in range(bits):
          registry.register(f"{prefix}.$x[{i}]")

      # These typically just reference adder and constant
+     return [registry.get_id(f"{prefix}.$x[{i}]") for i in range(bits)]

  def infer_minmax_inputs(gate: str, registry: SignalRegistry) -> List[int]:
      """Infer inputs for min/max/absolutedifference."""
+     if 'max16bit' in gate:
+         prefix = "arithmetic.max16bit"
+         bits = 16
+     elif 'min16bit' in gate:
+         prefix = "arithmetic.min16bit"
+         bits = 16
+     elif 'absolutedifference16bit' in gate:
+         prefix = "arithmetic.absolutedifference16bit"
+         bits = 16
+     elif 'max8bit' in gate:
          prefix = "arithmetic.max8bit"
+         bits = 8
      elif 'min8bit' in gate:
          prefix = "arithmetic.min8bit"
+         bits = 8
      elif 'absolutedifference' in gate:
          prefix = "arithmetic.absolutedifference8bit"
+         bits = 8
      else:
          return []

+     for i in range(bits):
          registry.register(f"{prefix}.$a[{i}]")
          registry.register(f"{prefix}.$b[{i}]")

      # Select/diff weights take comparison + both operands
      inputs = []
+     for i in range(bits):
          inputs.append(registry.get_id(f"{prefix}.$a[{i}]"))
+     for i in range(bits):
          inputs.append(registry.get_id(f"{prefix}.$b[{i}]"))
      return inputs
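The min/max/absolutedifference wiring above feeds the select weights with the comparison result plus both operands, i.e. a comparator-driven mux. A minimal sketch of that selection semantics, assuming the select tensor picks per-bit between `a` and `b` on the comparison (`max_select` is an illustrative helper, not part of build.py):

```python
def max_select(a: int, b: int, bits: int = 16) -> int:
    # mux: gt = (a > b); output bit i copies a[i] when gt else b[i]
    gt = 1 if a > b else 0
    out = 0
    for i in range(bits):
        src = a if gt else b
        out |= ((src >> i) & 1) << i
    return out
```

Swapping the two mux legs gives min; absolutedifference is the subtraction of the smaller from the larger.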
 
 
      # Ripple carry adders
      if 'ripplecarry8bit' in gate:
          return infer_ripplecarry_inputs(gate, 'arithmetic.ripplecarry8bit', 8, registry)
+     if 'ripplecarry16bit' in gate:
+         return infer_ripplecarry_inputs(gate, 'arithmetic.ripplecarry16bit', 16, registry)
      if 'ripplecarry4bit' in gate:
          return infer_ripplecarry_inputs(gate, 'arithmetic.ripplecarry4bit', 4, registry)
      if 'ripplecarry2bit' in gate:

      # ADC/SBC
      if 'adc8bit' in gate:
+         return infer_adc_sbc_inputs(gate, 'arithmetic.adc8bit', registry, bits=8)
+     if 'adc16bit' in gate:
+         return infer_adc_sbc_inputs(gate, 'arithmetic.adc16bit', registry, bits=16)
      if 'sbc8bit' in gate:
+         return infer_adc_sbc_inputs(gate, 'arithmetic.sbc8bit', registry, bits=8)
+     if 'sbc16bit' in gate:
+         return infer_adc_sbc_inputs(gate, 'arithmetic.sbc16bit', registry, bits=16)

      # SUB
      if 'sub8bit' in gate:
          return infer_sub8bit_inputs(gate, registry)
+     if 'sub16bit' in gate:
+         return infer_sub16bit_inputs(gate, registry)

      # CMP
      if 'cmp8bit' in gate:
          return infer_cmp8bit_inputs(gate, registry)
+     if 'cmp16bit' in gate:
+         return infer_cmp16bit_inputs(gate, registry)

      # Equality
      if 'equality8bit' in gate:
          return infer_equality8bit_inputs(gate, registry)
+     if 'equality16bit' in gate:
+         return infer_equality16bit_inputs(gate, registry)

      # Negate
      if 'neg8bit' in gate:
          return infer_neg8bit_inputs(gate, registry)
+     if 'neg16bit' in gate:
+         return infer_neg16bit_inputs(gate, registry)

      # Shifts and rotates
+     if ('asr8bit' in gate or 'rol8bit' in gate or 'ror8bit' in gate or
+             'asr16bit' in gate or 'rol16bit' in gate or 'ror16bit' in gate):
          return infer_shift_rotate_inputs(gate, registry)

      # Multipliers

      # Comparators
      if 'greaterthan8bit' in gate or 'lessthan8bit' in gate or \
+        'greaterorequal8bit' in gate or 'lessorequal8bit' in gate or \
+        'greaterthan16bit' in gate or 'lessthan16bit' in gate or \
+        'greaterorequal16bit' in gate or 'lessorequal16bit' in gate:
          return infer_comparator_inputs(gate, registry)

      # CLZ (count leading zeros)

      # Float16 circuits
      if gate.startswith('float16.'):
+         if gate.startswith('float16.lut'):
+             return infer_float16_lut_inputs(gate, registry)
+         if gate.startswith('float16.pow'):
+             return infer_float16_pow_inputs(gate, registry)
+         if gate.startswith('float16.sqrt') or gate.startswith('float16.rsqrt') or \
+            gate.startswith('float16.exp') or gate.startswith('float16.ln') or \
+            gate.startswith('float16.log2') or gate.startswith('float16.sin') or \
+            gate.startswith('float16.cos') or gate.startswith('float16.tan') or \
+            gate.startswith('float16.tanh'):
+             return infer_float16_lut_out_inputs(gate, registry, "float16.lut")
          if 'unpack' in gate:
              return infer_float16_unpack_inputs(gate, registry)
          if 'pack' in gate:
      return []


+ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry, prefix: str = "float16.mul",
+                              a_bits: Optional[List[str]] = None,
+                              b_bits: Optional[List[str]] = None) -> List[int]:
+     """Infer inputs for float16.mul circuit (optionally with custom input sources)."""
+     if a_bits is None:
+         a_bits = [f"{prefix}.$a[{i}]" for i in range(16)]
+     if b_bits is None:
+         b_bits = [f"{prefix}.$b[{i}]" for i in range(16)]

+     for name in a_bits:
+         registry.register(name)
+     for name in b_bits:
+         registry.register(name)

+     exp_a_bits = [a_bits[10 + i] for i in range(5)]
+     exp_b_bits = [b_bits[10 + i] for i in range(5)]
+     mant_a_bits = [a_bits[i] for i in range(10)]
+     mant_b_bits = [b_bits[i] for i in range(10)]

      if '.exp_a_all_ones' in gate:
          return [registry.get_id(b) for b in exp_a_bits]

      match = re.search(r'\.mant_a_norm(\d+)$', gate)
      if match:
          i = int(match.group(1))
+         return [registry.get_id(a_bits[i])]
      match = re.search(r'\.mant_b_norm(\d+)$', gate)
      if match:
          i = int(match.group(1))
+         return [registry.get_id(b_bits[i])]

      for i in range(10):
          registry.register(f"{prefix}.mant_a_norm{i}")

      registry.register(f"{prefix}.result_is_zero")

      if '.result_sign.layer1.or' in gate:
+         return [registry.get_id(a_bits[15]),
+                 registry.get_id(b_bits[15])]
      if '.result_sign.layer1.nand' in gate:
+         return [registry.get_id(a_bits[15]),
+                 registry.get_id(b_bits[15])]
      if '.result_sign.layer2' in gate:
          return [registry.register(f"{prefix}.result_sign.layer1.or"),
                  registry.register(f"{prefix}.result_sign.layer1.nand")]

          if i == 10:
              a_bit = registry.get_id(f"{prefix}.implicit_a")
          else:
+             a_bit = registry.get_id(a_bits[i])
          if j == 10:
              b_bit = registry.get_id(f"{prefix}.implicit_b")
          else:
+             b_bit = registry.get_id(b_bits[j])
          return [a_bit, b_bit]

      for i in range(11):
 
      return tensors


+ def build_float16_pow_tensors(mul_tensors: Dict[str, torch.Tensor],
+                               ln_outputs: List[int],
+                               exp_outputs: List[int]) -> Dict[str, torch.Tensor]:
+     """Build tensors for float16.pow via ln -> mul -> exp."""
+     tensors: Dict[str, torch.Tensor] = {}
+
+     # ln(a) LUT
+     tensors.update(build_float16_lut_match_tensors("float16.pow.ln"))
+     tensors.update(build_float16_lut_output_tensors("float16.pow.ln", ln_outputs))
+
+     # mul(ln(a), b)
+     tensors.update(clone_prefix_tensors(mul_tensors, "float16.mul", "float16.pow.mul"))
+
+     # exp(mul)
+     tensors.update(build_float16_lut_match_tensors("float16.pow.exp"))
+     tensors.update(build_float16_lut_output_tensors("float16.pow.exp", exp_outputs))
+
+     # Final outputs (pass-through from exp)
+     prefix = "float16.pow"
+     for i in range(16):
+         tensors[f"{prefix}.out{i}.weight"] = torch.tensor([1.0])
+         tensors[f"{prefix}.out{i}.bias"] = torch.tensor([-0.5])
+
+     return tensors
+
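`build_float16_pow_tensors` composes pow from the identity pow(a, b) = exp(b · ln(a)), with each stage rounded through float16 by the LUTs. A sketch of the resulting numeric behaviour in pure Python, using `struct`'s binary16 codec (`'e'`) to emulate the per-stage rounding; `to_f16` and `pow_f16` are illustrative helpers, not part of build.py:

```python
import math
import struct

def to_f16(x: float) -> float:
    # round-trip through IEEE 754 binary16 ('e') to emulate LUT rounding
    return struct.unpack('<e', struct.pack('<e', x))[0]

def pow_f16(a: float, b: float) -> float:
    # pow(a, b) = exp(b * ln(a)), rounding to float16 after every stage
    ln_a = to_f16(math.log(to_f16(a)))
    prod = to_f16(ln_a * to_f16(b))
    return to_f16(math.exp(prod))
```

Because each stage rounds independently, results are only accurate to a few half-precision ULP, e.g. `pow_f16(2.0, 3.0)` lands within well under 1% of 8.0.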
  def build_clz16bit_tensors() -> Dict[str, torch.Tensor]:
      """Build tensors for arithmetic.clz16bit circuit.

      return tensors


+ def add_not_gate(tensors: Dict[str, torch.Tensor], name: str) -> None:
+     tensors[f"{name}.weight"] = torch.tensor([-1.0])
+     tensors[f"{name}.bias"] = torch.tensor([0.0])
+
+
+ def add_and_gate(tensors: Dict[str, torch.Tensor], name: str) -> None:
+     tensors[f"{name}.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.bias"] = torch.tensor([-2.0])
+
+
+ def add_or_gate(tensors: Dict[str, torch.Tensor], name: str) -> None:
+     tensors[f"{name}.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.bias"] = torch.tensor([-1.0])
+
+
+ def add_xor_gate(tensors: Dict[str, torch.Tensor], name: str) -> None:
+     tensors[f"{name}.layer1.or.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.layer1.or.bias"] = torch.tensor([-1.0])
+     tensors[f"{name}.layer1.nand.weight"] = torch.tensor([-1.0, -1.0])
+     tensors[f"{name}.layer1.nand.bias"] = torch.tensor([1.0])
+     tensors[f"{name}.layer2.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.layer2.bias"] = torch.tensor([-2.0])
+
+
+ def add_xnor_gate(tensors: Dict[str, torch.Tensor], name: str) -> None:
+     tensors[f"{name}.layer1.and.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.layer1.and.bias"] = torch.tensor([-1.5])
+     tensors[f"{name}.layer1.nor.weight"] = torch.tensor([-1.0, -1.0])
+     tensors[f"{name}.layer1.nor.bias"] = torch.tensor([0.0])
+     tensors[f"{name}.layer2.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{name}.layer2.bias"] = torch.tensor([-0.5])
+
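Each helper above emits a weight/bias pair for one threshold unit, `output = step(weights · inputs + bias)`, with step firing at ≥ 0. A plain-Python restatement of the single-layer constants (same numbers as `add_not_gate`/`add_and_gate`/`add_or_gate`), checked exhaustively:

```python
def step(z: float) -> float:
    # Heaviside threshold: the unit fires iff w.x + b >= 0
    return 1.0 if z >= 0.0 else 0.0

# weight/bias constants copied from the gate helpers
def NOT(a):    return step(-1.0 * a + 0.0)   # w=[-1], b=0
def AND(a, b): return step(a + b - 2.0)      # w=[1,1], b=-2
def OR(a, b):  return step(a + b - 1.0)      # w=[1,1], b=-1

for a in (0.0, 1.0):
    assert NOT(a) == 1.0 - a
    for b in (0.0, 1.0):
        assert AND(a, b) == float(int(a) & int(b))
        assert OR(a, b) == float(int(a) | int(b))
```

XOR and XNOR need two layers because they are not linearly separable; the helpers build them as AND(OR, NAND) and OR-threshold over (AND, NOR) respectively.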
+ def build_ripplecarry_tensors(prefix: str, bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         fa_prefix = f"{prefix}.fa{i}"
+         add_xor_gate(tensors, f"{fa_prefix}.ha1.sum")
+         add_and_gate(tensors, f"{fa_prefix}.ha1.carry")
+         add_xor_gate(tensors, f"{fa_prefix}.ha2.sum")
+         add_and_gate(tensors, f"{fa_prefix}.ha2.carry")
+         add_or_gate(tensors, f"{fa_prefix}.carry_or")
+     return tensors
+
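`build_ripplecarry_tensors` wires each full adder as two half adders plus a carry OR, with XOR itself built as the two-layer AND(OR, NAND) construction from `add_xor_gate`. The same logic in plain Python, checked over all eight input combinations:

```python
def step(z: float) -> float:
    return 1.0 if z >= 0.0 else 0.0

def AND(a, b):  return step(a + b - 2.0)
def OR(a, b):   return step(a + b - 1.0)
def NAND(a, b): return step(-a - b + 1.0)

def XOR(a, b):
    # two-layer threshold XOR: AND(OR(a,b), NAND(a,b))
    return AND(OR(a, b), NAND(a, b))

def full_adder(a, b, cin):
    s1, c1 = XOR(a, b), AND(a, b)        # ha1
    s, c2 = XOR(s1, cin), AND(s1, cin)   # ha2
    return s, OR(c1, c2)                 # carry_or

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        for cin in (0.0, 1.0):
            s, cout = full_adder(a, b, cin)
            total = int(a) + int(b) + int(cin)
            assert (int(s), int(cout)) == (total & 1, total >> 1)
```

Chaining `bits` of these with `cout` feeding the next `cin` gives the ripple-carry adder.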
+ def build_adc_sbc_tensors(prefix: str, bits: int, with_notb: bool = False) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     if with_notb:
+         for i in range(bits):
+             add_not_gate(tensors, f"{prefix}.notb{i}")
+     for i in range(bits):
+         fa_prefix = f"{prefix}.fa{i}"
+         add_xor_gate(tensors, f"{fa_prefix}.xor1")
+         add_xor_gate(tensors, f"{fa_prefix}.xor2")
+         add_and_gate(tensors, f"{fa_prefix}.and1")
+         add_and_gate(tensors, f"{fa_prefix}.and2")
+         add_or_gate(tensors, f"{fa_prefix}.or_carry")
+     return tensors
+
+
+ def build_sub_tensors(prefix: str, bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         add_not_gate(tensors, f"{prefix}.notb{i}")
+     tensors[f"{prefix}.carry_in.weight"] = torch.tensor([1.0])
+     tensors[f"{prefix}.carry_in.bias"] = torch.tensor([-0.5])
+     for i in range(bits):
+         fa_prefix = f"{prefix}.fa{i}"
+         add_xor_gate(tensors, f"{fa_prefix}.xor1")
+         add_xor_gate(tensors, f"{fa_prefix}.xor2")
+         add_and_gate(tensors, f"{fa_prefix}.and1")
+         add_and_gate(tensors, f"{fa_prefix}.and2")
+         add_or_gate(tensors, f"{fa_prefix}.or_carry")
+     return tensors
+
+
+ def build_cmp_tensors(prefix: str, bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         add_not_gate(tensors, f"{prefix}.notb{i}")
+     for i in range(bits):
+         fa_prefix = f"{prefix}.fa{i}"
+         add_xor_gate(tensors, f"{fa_prefix}.xor1")
+         add_xor_gate(tensors, f"{fa_prefix}.xor2")
+         add_and_gate(tensors, f"{fa_prefix}.and1")
+         add_and_gate(tensors, f"{fa_prefix}.and2")
+         add_or_gate(tensors, f"{fa_prefix}.or_carry")
+     return tensors
+
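`build_sub_tensors` and `build_cmp_tensors` both compute a − b by the two's-complement identity a − b = a + ~b + 1: the `notb` gates invert b and the carry chain is seeded with 1. The identity, checked at 16 bits:

```python
def sub16(a: int, b: int) -> int:
    # a - b == a + (~b & 0xFFFF) + 1 (mod 2**16): invert b, seed carry_in = 1
    return (a + ((~b) & 0xFFFF) + 1) & 0xFFFF
```

For cmp, only the final carry/sign of this subtraction is consumed, which is why the builder can drop the explicit `carry_in` constant.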
+ def build_equality_tensors(prefix: str, bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         add_xnor_gate(tensors, f"{prefix}.xnor{i}")
+     tensors[f"{prefix}.final_and.weight"] = torch.tensor([1.0] * bits)
+     tensors[f"{prefix}.final_and.bias"] = torch.tensor([-(bits - 0.5)])
+     return tensors
+
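The final AND uses bias −(bits − 0.5), so it fires only when every one of the `bits` XNOR outputs is 1, i.e. when all bit pairs match. A sketch of that wide-AND threshold for 16 bits:

```python
def xnor(a: float, b: float) -> float:
    # per-bit equality (stands in for the two-layer XNOR gate)
    return 1.0 if a == b else 0.0

def equal16(a_bits, b_bits) -> float:
    # final AND: 1*sum(xnor outputs) - 15.5 >= 0 iff all 16 are 1
    s = sum(xnor(a, b) for a, b in zip(a_bits, b_bits))
    return 1.0 if s - 15.5 >= 0.0 else 0.0
```

Any bias in (−16, −15] would work; −15.5 sits in the middle of that margin.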
+ def build_neg_tensors(prefix: str, bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         add_not_gate(tensors, f"{prefix}.not{i}")
+     # sum0 = NOT(not0) == x0 (since ~x + 1 flips the LSB back)
+     tensors[f"{prefix}.sum0.weight"] = torch.tensor([-1.0])
+     tensors[f"{prefix}.sum0.bias"] = torch.tensor([0.0])
+     tensors[f"{prefix}.carry0.weight"] = torch.tensor([1.0, 1.0])
+     tensors[f"{prefix}.carry0.bias"] = torch.tensor([-2.0])
+     for i in range(1, bits):
+         add_xor_gate(tensors, f"{prefix}.xor{i}")
+         add_and_gate(tensors, f"{prefix}.and{i}")
+     return tensors
+
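The sum0-as-NOT(not0) fix from this commit is exactly the two's-complement LSB identity: inverting every bit and adding 1 flips the low bit back, so the LSB of −x always equals the LSB of x. Checked at 16 bits:

```python
def neg16(x: int) -> int:
    # two's-complement negation in a 16-bit ring
    return (-x) & 0xFFFF

for x in (0, 1, 2, 0x7FFF, 0x8000, 0xFFFF):
    assert neg16(x) == ((~x) + 1) & 0xFFFF   # ~x + 1 identity
    assert (neg16(x) & 1) == (x & 1)         # LSB survives: sum0 == x0 == NOT(not0)
```

The initial carry `carry0 = AND(not0, 1) = not0` then drives the xor/and chain for the remaining bits.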
+ def build_shift_rotate_tensors(prefix: str, bits: int, kind: str) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     for i in range(bits):
+         if kind == "asr":
+             src = i + 1 if i < bits - 1 else bits - 1
+         elif kind == "rol":
+             src = (i - 1) % bits
+         elif kind == "ror":
+             src = (i + 1) % bits
+         else:
+             raise ValueError(f"unknown shift kind: {kind}")
+         w = [0.0] * bits
+         w[src] = 1.0
+         tensors[f"{prefix}.bit{i}.weight"] = torch.tensor(w)
+         tensors[f"{prefix}.bit{i}.bias"] = torch.tensor([-0.5])
+     return tensors
+
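Each output bit here is a one-hot weight row that copies a single source bit: ASR replicates the sign bit into the top position, ROL/ROR wrap around. The same routing in integer form (`route` is an illustrative helper mirroring the `src` selection above):

```python
def route(x: int, bits: int, kind: str) -> int:
    # output bit i copies source bit src, matching the one-hot weight rows
    out = 0
    for i in range(bits):
        if kind == "asr":
            src = i + 1 if i < bits - 1 else bits - 1  # MSB (sign) replicated
        elif kind == "rol":
            src = (i - 1) % bits
        else:  # "ror"
            src = (i + 1) % bits
        out |= ((x >> src) & 1) << i
    return out
```

The −0.5 bias makes each unit a pure pass-through: it fires iff the selected source bit is 1.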
+ def build_comparator_vectors(bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     weights = [float(2 ** i) for i in range(bits - 1, -1, -1)]
+     names = ["greaterthan", "lessthan", "greaterorequal", "lessorequal"]
+     for name in names:
+         tensors[f"arithmetic.{name}{bits}bit.comparator"] = torch.tensor(weights)
+     return tensors
+
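The comparator vectors are MSB-first powers of two, so dotting them with an MSB-first bit vector recovers the operand's integer value — the same convention the orphan-tensor test in eval.py validates:

```python
BITS = 16
WEIGHTS = [float(2 ** i) for i in range(BITS - 1, -1, -1)]  # [32768.0, ..., 1.0]

def value(n: int) -> float:
    # LSB-first bit extraction, then reverse to MSB-first before the dot product
    msb_first = [float((n >> i) & 1) for i in range(BITS)][::-1]
    return sum(w * b for w, b in zip(WEIGHTS, msb_first))
```

With both operands decoded this way, the four comparisons reduce to the sign of `value(a) - value(b)` (plus an equality margin for the -or-equal variants).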
+ def build_increment_decrement_constants(bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     one = [0.0] * (bits - 1) + [1.0]
+     tensors[f"arithmetic.incrementer{bits}bit.one"] = torch.tensor(one)
+     tensors[f"arithmetic.incrementer{bits}bit.adder"] = torch.tensor([1.0] * bits)
+     tensors[f"arithmetic.decrementer{bits}bit.neg_one"] = torch.tensor([1.0] * bits)
+     tensors[f"arithmetic.decrementer{bits}bit.adder"] = torch.tensor([1.0] * bits)
+     return tensors
+
+
+ def build_minmax_diff_constants(bits: int) -> Dict[str, torch.Tensor]:
+     tensors: Dict[str, torch.Tensor] = {}
+     width = bits * 2
+     tensors[f"arithmetic.absolutedifference{bits}bit.diff"] = torch.tensor([1.0] * width)
+     tensors[f"arithmetic.max{bits}bit.select"] = torch.tensor([1.0] * width)
+     tensors[f"arithmetic.min{bits}bit.select"] = torch.tensor([1.0] * width)
+     return tensors
+
  def main():
      print("Loading existing tensors...")
      tensors = {}
              del tensors[k]
      print(f"Removed {len(old_float16_div)} old float16.div tensors")

+     old_float16_lut = [k for k in tensors.keys() if k.startswith('float16.lut') or
+                        k.startswith('float16.sqrt') or k.startswith('float16.rsqrt') or
+                        k.startswith('float16.exp') or k.startswith('float16.ln') or
+                        k.startswith('float16.log2') or k.startswith('float16.sin') or
+                        k.startswith('float16.cos') or k.startswith('float16.tan') or
+                        k.startswith('float16.tanh') or k.startswith('float16.pow')]
+     for k in old_float16_lut:
+         del tensors[k]
+     print(f"Removed {len(old_float16_lut)} old float16 LUT/pow tensors")
+
+     old_arith_8bit = [k for k in tensors.keys() if k.startswith('arithmetic.') and '8bit' in k]
+     for k in old_arith_8bit:
+         del tensors[k]
+     print(f"Removed {len(old_arith_8bit)} old arithmetic 8-bit tensors")
+
+     old_mult8x8 = [k for k in tensors.keys() if k.startswith('arithmetic.multiplier8x8')]
+     for k in old_mult8x8:
+         del tensors[k]
+     print(f"Removed {len(old_mult8x8)} old multiplier8x8 tensors")
+
+     old_div8bit = [k for k in tensors.keys() if k.startswith('arithmetic.div8bit')]
+     for k in old_div8bit:
+         del tensors[k]
+     print(f"Removed {len(old_div8bit)} old div8bit tensors")
+
      # Remove broken mod2/mod4/mod8 tensors
      old_mod_power2 = [k for k in tensors.keys() if k.startswith('modular.mod2') or
                        k.startswith('modular.mod4') or k.startswith('modular.mod8')]
 
11111
 
11112
  # Build new circuits
11113
  print("Building new circuits...")
 
 
 
 
      clz16_tensors = build_clz16bit_tensors()
      tensors.update(clz16_tensors)
      print(f"  CLZ16BIT: {len(clz16_tensors)} tensors")
      tensors.update(fromint_tensors)
      print(f"  float16.fromint: {len(fromint_tensors)} tensors")

+     # Shared LUT match gates
+     lut_match_tensors = build_float16_lut_match_tensors("float16.lut")
+     tensors.update(lut_match_tensors)
+     print(f"  float16.lut: {len(lut_match_tensors)} tensors")
+
+     # Unary LUT outputs
+     unary_ops = {
+         "sqrt": torch.sqrt,
+         "rsqrt": torch.rsqrt,
+         "exp": torch.exp,
+         "ln": torch.log,
+         "log2": torch.log2,
+         "sin": torch.sin,
+         "cos": torch.cos,
+         "tan": torch.tan,
+         "tanh": torch.tanh,
+     }
+     lut_outputs: Dict[str, List[int]] = {}
+     for name, fn in unary_ops.items():
+         print(f"  computing float16.{name} LUT...")
+         outputs = compute_float16_unary_lut_outputs(fn)
+         lut_outputs[name] = outputs
+         op_tensors = build_float16_lut_output_tensors(f"float16.{name}", outputs)
+         tensors.update(op_tensors)
+         print(f"  float16.{name}: {len(op_tensors)} tensors")
+
+     # float16.pow (ln -> mul -> exp)
+     pow_tensors = build_float16_pow_tensors(mul_tensors,
+                                             lut_outputs["ln"],
+                                             lut_outputs["exp"])
+     tensors.update(pow_tensors)
+     print(f"  float16.pow: {len(pow_tensors)} tensors")
+
+     # 16-bit integer arithmetic circuits
+     rc16 = build_ripplecarry_tensors("arithmetic.ripplecarry16bit", 16)
+     tensors.update(rc16)
+     print(f"  ripplecarry16bit: {len(rc16)} tensors")
+
+     adc16 = build_adc_sbc_tensors("arithmetic.adc16bit", 16)
+     tensors.update(adc16)
+     print(f"  adc16bit: {len(adc16)} tensors")
+
+     sbc16 = build_adc_sbc_tensors("arithmetic.sbc16bit", 16, with_notb=True)
+     tensors.update(sbc16)
+     print(f"  sbc16bit: {len(sbc16)} tensors")
+
+     sub16 = build_sub_tensors("arithmetic.sub16bit", 16)
+     tensors.update(sub16)
+     print(f"  sub16bit: {len(sub16)} tensors")
+
+     cmp16 = build_cmp_tensors("arithmetic.cmp16bit", 16)
+     tensors.update(cmp16)
+     print(f"  cmp16bit: {len(cmp16)} tensors")
+
+     eq16 = build_equality_tensors("arithmetic.equality16bit", 16)
+     tensors.update(eq16)
+     print(f"  equality16bit: {len(eq16)} tensors")
+
+     neg16 = build_neg_tensors("arithmetic.neg16bit", 16)
+     tensors.update(neg16)
+     print(f"  neg16bit: {len(neg16)} tensors")
+
+     asr16 = build_shift_rotate_tensors("arithmetic.asr16bit", 16, "asr")
+     rol16 = build_shift_rotate_tensors("arithmetic.rol16bit", 16, "rol")
+     ror16 = build_shift_rotate_tensors("arithmetic.ror16bit", 16, "ror")
+     tensors.update(asr16)
+     tensors.update(rol16)
+     tensors.update(ror16)
+     print(f"  asr/rol/ror16bit: {len(asr16) + len(rol16) + len(ror16)} tensors")
+
+     comp16 = build_comparator_vectors(16)
+     tensors.update(comp16)
+     print(f"  comparator16bit: {len(comp16)} tensors")
+
+     incdec16 = build_increment_decrement_constants(16)
+     tensors.update(incdec16)
+     print(f"  increment/decrement16bit: {len(incdec16)} tensors")
+
+     minmax16 = build_minmax_diff_constants(16)
+     tensors.update(minmax16)
+     print(f"  min/max/diff16bit: {len(minmax16)} tensors")
+
      mod_power2_tensors = build_modular_power2_tensors()
      tensors.update(mod_power2_tensors)
      print(f"  modular.mod2/4/8: {len(mod_power2_tensors)} tensors")

      symmetry_tensors = build_symmetry8bit_tensors()
      tensors.update(symmetry_tensors)
      print(f"  symmetry8bit: {len(symmetry_tensors)} tensors")
eval.py CHANGED
@@ -13,6 +13,7 @@ Usage:
  import argparse
  import json

  import random
  import struct
  import sys
@@ -54,6 +55,10 @@ class EvalContext:
      verbose: bool = False
      quick: bool = False
      tested_tensors: set = field(default_factory=set)

  def load_model(path: str = "./arithmetic.safetensors") -> Tuple[Dict[str, torch.Tensor], List[str], Dict[str, int], Dict[str, int], Dict[int, str]]:
@@ -227,6 +232,102 @@ def build_alias_maps(ctx: EvalContext) -> Tuple[Dict[int, int], Dict[int, List[i
      return alias_to_gate, gate_to_alias

  def evaluate_gates_from_inputs(ctx: EvalContext, signals: Dict[int, float],
                                 gate_list: Optional[List[str]] = None) -> Tuple[int, List[str], List[str]]:
      """Evaluate gates using explicit .inputs tensors. Returns (evaluated, missing_inputs, unresolved)."""
@@ -235,7 +336,10 @@ def evaluate_gates_from_inputs(ctx: EvalContext, signals: Dict[int, float],
      missing_inputs: List[str] = []
      unresolved: List[str] = []
      evaluated = 0
-     alias_to_gate, gate_to_alias = build_alias_maps(ctx)

      progress = True
      while progress and remaining:
@@ -554,7 +658,9 @@ def eval_prefix_outputs(ctx: EvalContext, prefix: str,
      seed_prefix_bits(ctx, prefix, base, bits, signals)

      gates = gate_list if gate_list is not None else [g for g in ctx.gates if g.startswith(prefix + ".")]
-     evaluated, missing_inputs, unresolved = evaluate_gates_from_inputs(ctx, signals, gate_list=gates)

      if missing_inputs or unresolved:
          raise RuntimeError(
              f"{prefix}: unresolved inputs (missing={len(missing_inputs)} unresolved={len(unresolved)})"
@@ -577,6 +683,39 @@
      return outputs

  def build_float16_pairs(rng: random.Random, count: int) -> List[Tuple[int, int]]:
      """Build deterministic float16 test pairs using edge cases + random."""
      edges = [
@@ -613,6 +752,51 @@ def build_float16_pairs(rng: random.Random, count: int) -> List[Tuple[int, int]]
      return pairs

  def float16_expected_bits_binary(op: str, a_bits: int, b_bits: int) -> Tuple[int, bool]:
      """Compute expected float16 bits for a binary op and whether it's NaN."""
      a = float16_int_to_float(a_bits)
@@ -634,6 +818,49 @@ def float16_expected_bits_binary(op: str, a_bits: int, b_bits: int) -> Tuple[int
      return float_to_int(float(out)), False

  # =============================================================================
  # BOOLEAN GATE TESTS
  # =============================================================================
@@ -1026,31 +1253,47 @@ def eval_subtractor(ctx: EvalContext, prefix: str, a_bits: List[float],


  def eval_negation(ctx: EvalContext, prefix: str, bits: List[float]) -> List[float]:
-     """Evaluate 8-bit negation (two's complement)."""
      result = []

      # NOT each bit
      not_bits = []
-     for i in range(8):
-         not_bits.append(eval_gate_direct(ctx, f"{prefix}.not{i}", [bits[i]]))

      # Add 1 using carry chain
      carry = 1.0
-     for i in range(8):
          if i == 0:
-             # First bit: XOR with carry (which is 1)
-             result.append(eval_gate_direct(ctx, f"{prefix}.xor0", [not_bits[0], 1.0])
-                           if f"{prefix}.xor0.weight" in ctx.tensors
-                           else 1.0 - not_bits[0])
-             carry = eval_gate_direct(ctx, f"{prefix}.carry0", [not_bits[0]]) if f"{prefix}.carry0.weight" in ctx.tensors else not_bits[0]
          else:
-             # Subsequent bits use and/xor gates
              if f"{prefix}.xor{i}.weight" in ctx.tensors:
                  result.append(eval_gate_direct(ctx, f"{prefix}.xor{i}", [not_bits[i], carry]))
              elif f"{prefix}.out{i}.weight" in ctx.tensors:
                  result.append(eval_gate_direct(ctx, f"{prefix}.out{i}", [not_bits[i], carry]))
              else:
                  xor_val = 1.0 if (int(not_bits[i]) != int(carry)) else 0.0
                  result.append(xor_val)

@@ -1104,17 +1347,22 @@ def test_adders(ctx: EvalContext) -> List[TestResult]:
      results.append(TestResult("arithmetic.fulladder", passed, total))

      # Ripple carry adders
-     for bits in [2, 4, 8]:
          prefix = f"arithmetic.ripplecarry{bits}bit"
          if f"{prefix}.fa0.ha1.sum.layer1.or.weight" not in ctx.tensors:
              continue

          passed, total = 0, 0
          max_val = 1 << bits
-         test_range = range(max_val) if (not ctx.quick or bits <= 4) else range(0, max_val, max_val // 256)

          for a in test_range:
-             for b in (test_range if bits <= 4 else [0, 1, max_val-1]):
                  a_bits = [float((a >> i) & 1) for i in range(bits)]
                  b_bits = [float((b >> i) & 1) for i in range(bits)]
@@ -1148,6 +1396,26 @@ def test_adders(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.sub8bit", passed, total))

      # 8-bit negation
      if f"arithmetic.neg8bit.not0.weight" in ctx.tensors:
          passed, total = 0, 0
@@ -1165,6 +1433,23 @@ def test_adders(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.neg8bit", passed, total))

      # 8-bit add with carry (adc8bit)
      if f"arithmetic.adc8bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
          passed, total = 0, 0
@@ -1186,6 +1471,28 @@ def test_adders(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.adc8bit", passed, total))

      # 8-bit subtract with borrow (sbc8bit)
      # sbc computes: a - b - borrow = a + ~b + ~borrow
      # So carry_in = ~borrow (1 when borrow=0, 0 when borrow=1)
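The identity in the sbc comment above, checked directly at 8 bits (`sbc8` is an illustrative helper, not part of eval.py):

```python
def sbc8(a: int, b: int, borrow: int) -> int:
    # a - b - borrow == a + (~b & 0xFF) + (1 - borrow) (mod 256),
    # i.e. carry_in = ~borrow
    return (a + ((~b) & 0xFF) + (1 - borrow)) & 0xFF
```

This is why the test seeds the carry chain with `1 - borrow` rather than the borrow flag itself.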
@@ -1212,6 +1519,29 @@ def test_adders(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.sbc8bit", passed, total))

      return results

@@ -1231,23 +1561,28 @@ def test_comparators(ctx: EvalContext) -> List[TestResult]:

      # Legacy comparators (if they exist)
      comparators = [
-         ("arithmetic.greaterthan8bit", lambda a, b: a > b),
-         ("arithmetic.lessthan8bit", lambda a, b: a < b),
-         ("arithmetic.greaterorequal8bit", lambda a, b: a >= b),
-         ("arithmetic.lessorequal8bit", lambda a, b: a <= b),
      ]

-     for name, op in comparators:
          if f"{name}.weight" not in ctx.tensors:
              continue

          passed, total = 0, 0
-         test_range = range(256) if not ctx.quick else range(0, 256, 16)

          for a in test_range:
              for b in test_range:
-                 a_bits = [float((a >> i) & 1) for i in range(8)]
-                 b_bits = [float((b >> i) & 1) for i in range(8)]

                  actual = eval_gate_direct(ctx, name, a_bits + b_bits)
                  expected = 1.0 if op(a, b) else 0.0
@@ -1283,6 +1618,26 @@ def test_comparators(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.cmp8bit", passed, total))

      # arithmetic.equality8bit - checks if a == b
      if f"arithmetic.equality8bit.xnor0.layer1.and.weight" in ctx.tensors:
          passed, total = 0, 0
@@ -1309,6 +1664,29 @@ def test_comparators(ctx: EvalContext) -> List[TestResult]:

      results.append(TestResult("arithmetic.equality8bit", passed, total))

      return results

 
@@ -1453,6 +1831,28 @@ def test_bitwise(ctx: EvalContext) -> List[TestResult]:
1453
 
1454
  results.append(TestResult("arithmetic.asr8bit", passed, total))
1455
 
1456
  # Rotate left (rol8bit)
1457
  if f"arithmetic.rol8bit.bit0.weight" in ctx.tensors:
1458
  passed, total = 0, 0
@@ -1478,6 +1878,27 @@ def test_bitwise(ctx: EvalContext) -> List[TestResult]:
1478
 
1479
  results.append(TestResult("arithmetic.rol8bit", passed, total))
1480
 
1481
  # Rotate right (ror8bit)
1482
  if f"arithmetic.ror8bit.bit0.weight" in ctx.tensors:
1483
  passed, total = 0, 0
@@ -1503,6 +1924,27 @@ def test_bitwise(ctx: EvalContext) -> List[TestResult]:
1503
 
1504
  results.append(TestResult("arithmetic.ror8bit", passed, total))
1505
 
1506
  return results
1507
 
1508
 
@@ -1674,13 +2116,12 @@ def test_orphan_tensors(ctx: EvalContext) -> List[TestResult]:
1674
 
1675
  # Comparator-like weight vectors (MSB-first weights)
1676
  comp_names = [
1677
- "arithmetic.greaterthan8bit.comparator",
1678
- "arithmetic.lessthan8bit.comparator",
1679
- "arithmetic.greaterorequal8bit.comparator",
1680
- "arithmetic.lessorequal8bit.comparator",
1681
  "combinational.priorityencoder8bit.priority",
1682
  ]
1683
- expected_weights = [128.0, 64.0, 32.0, 16.0, 8.0, 4.0, 2.0, 1.0]
1684
 
1685
  for name in comp_names:
1686
  if name not in ctx.tensors:
@@ -1689,15 +2130,16 @@ def test_orphan_tensors(ctx: EvalContext) -> List[TestResult]:
1689
  ctx.tested_tensors.add(name)
1690
 
1691
  passed, total = 0, 0
1692
- # Validate weight pattern
 
1693
  total += 1
1694
  if weights == expected_weights:
1695
  passed += 1
1696
 
1697
  # Validate numeric interpretation (MSB-first bits -> value)
1698
- test_range = range(256) if not ctx.quick else range(0, 256, 17)
1699
  for val in test_range:
1700
- bits = [float((val >> i) & 1) for i in range(8)][::-1]
1701
  actual = sum(w * b for w, b in zip(weights, bits))
1702
  total += 1
1703
  if int(actual + 0.5) == val:
@@ -1707,8 +2149,8 @@ def test_orphan_tensors(ctx: EvalContext) -> List[TestResult]:
1707
 
1708
  # Constant/selector vectors
1709
  const_specs = {
1710
- "arithmetic.incrementer8bit.one": ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0], 1),
1711
- "arithmetic.decrementer8bit.neg_one": ([1.0] * 8, 255),
1712
  }
1713
  for name, (expected_bits, expected_val) in const_specs.items():
1714
  if name not in ctx.tensors:
@@ -1724,11 +2166,11 @@ def test_orphan_tensors(ctx: EvalContext) -> List[TestResult]:
1724
 
1725
  # All-ones selector/mask tensors
1726
  ones_specs = {
1727
- "arithmetic.absolutedifference8bit.diff": 16,
1728
- "arithmetic.incrementer8bit.adder": 8,
1729
- "arithmetic.decrementer8bit.adder": 8,
1730
- "arithmetic.max8bit.select": 16,
1731
- "arithmetic.min8bit.select": 16,
1732
  "combinational.barrelshifter8bit.shift": 11,
1733
  "combinational.demultiplexer1to4.decode": 3,
1734
  "combinational.demultiplexer1to8.decode": 4,
@@ -2228,6 +2670,124 @@ def test_float16_conversion(ctx: EvalContext) -> List[TestResult]:
2228
  return results
2229
 
2230
 
2231
  # =============================================================================
2232
  # TEST RUNNER
2233
  # =============================================================================
@@ -2248,6 +2808,8 @@ CATEGORIES = {
2248
  "float16_basic": ("Float16 - Basic", test_float16_basic),
2249
  "float16_arith": ("Float16 - Arithmetic", test_float16_arithmetic),
2250
  "float16_conv": ("Float16 - Conversion", test_float16_conversion),
2251
  }
2252
 
2253
 
 
13
 
14
  import argparse
15
  import json
16
+ import math
17
  import random
18
  import struct
19
  import sys
 
55
  verbose: bool = False
56
  quick: bool = False
57
  tested_tensors: set = field(default_factory=set)
58
+ alias_to_gate: Dict[int, int] = field(default_factory=dict)
59
+ gate_to_alias: Dict[int, List[int]] = field(default_factory=dict)
60
+ alias_ready: bool = False
61
+ topo_cache: Dict[str, List[str]] = field(default_factory=dict)
62
 
63
 
64
  def load_model(path: str = "./arithmetic.safetensors") -> Tuple[Dict[str, torch.Tensor], List[str], Dict[str, int], Dict[str, int], Dict[int, str]]:
 
232
  return alias_to_gate, gate_to_alias
233
 
234
 
235
+ def topo_sort_gates(ctx: EvalContext, gate_list: List[str]) -> List[str]:
236
+ """Topologically sort gates based on .inputs dependencies."""
237
+ gate_set = set(gate_list)
238
+ deps: Dict[str, set] = {g: set() for g in gate_list}
239
+ rev: Dict[str, List[str]] = {g: [] for g in gate_list}
240
+
241
+ for gate in gate_list:
242
+ inputs_key = f"{gate}.inputs"
243
+ if inputs_key not in ctx.tensors:
244
+ continue
245
+ input_ids = [int(x) for x in ctx.tensors[inputs_key].tolist()]
246
+ for sid in input_ids:
247
+ name = ctx.id_to_name.get(sid)
248
+ if name and name in gate_set:
249
+ deps[gate].add(name)
250
+ rev[name].append(gate)
251
+
252
+ queue = [g for g in gate_list if not deps[g]]
253
+ order: List[str] = []
254
+ # Deterministic order
255
+ queue.sort()
256
+
257
+ while queue:
258
+ g = queue.pop(0)
259
+ order.append(g)
260
+ for child in rev[g]:
261
+ deps[child].remove(g)
262
+ if not deps[child]:
263
+ queue.append(child)
264
+ queue.sort()
265
+
266
+ # Fallback to original order if cycle/unresolved
267
+ if len(order) != len(gate_list):
268
+ return gate_list
269
+ return order
270
+
271
+
272
+ def evaluate_gates_in_order(ctx: EvalContext, signals: Dict[int, float],
273
+ gate_order: List[str]) -> Tuple[int, List[str], List[str]]:
274
+ """Evaluate gates in a fixed topological order."""
275
+ missing_inputs: List[str] = []
276
+ unresolved: List[str] = []
277
+ evaluated = 0
278
+
279
+ if not ctx.alias_ready:
280
+ ctx.alias_to_gate, ctx.gate_to_alias = build_alias_maps(ctx)
281
+ ctx.alias_ready = True
282
+ alias_to_gate, gate_to_alias = ctx.alias_to_gate, ctx.gate_to_alias
283
+
284
+ for gate in gate_order:
285
+ inputs_key = f"{gate}.inputs"
286
+ weight_key = f"{gate}.weight"
287
+ bias_key = f"{gate}.bias"
288
+
289
+ if inputs_key not in ctx.tensors:
290
+ missing_inputs.append(gate)
291
+ continue
292
+
293
+ input_ids = [int(x) for x in ctx.tensors[inputs_key].tolist()]
294
+ ready = True
295
+ for sid in input_ids:
296
+ if sid in signals:
297
+ continue
298
+ alias_gate = alias_to_gate.get(sid)
299
+ if alias_gate is not None and alias_gate in signals:
300
+ signals[sid] = signals[alias_gate]
301
+ continue
302
+ ready = False
303
+ break
304
+ if not ready:
305
+ unresolved.append(gate)
306
+ continue
307
+
308
+ weight = ctx.tensors[weight_key].tolist()
309
+ bias = ctx.tensors.get(bias_key, torch.tensor([0.0])).item()
310
+ total = bias + sum(w * signals[sid] for w, sid in zip(weight, input_ids))
311
+ out = 1.0 if total >= 0 else 0.0
312
+
313
+ gate_id = ctx.name_to_id.get(gate)
314
+ if gate_id is not None:
315
+ signals[gate_id] = out
316
+ for alias_id in gate_to_alias.get(gate_id, []):
317
+ signals[alias_id] = out
318
+
319
+ if inputs_key in ctx.tensors:
320
+ ctx.tested_tensors.add(inputs_key)
321
+ if weight_key in ctx.tensors:
322
+ ctx.tested_tensors.add(weight_key)
323
+ if bias_key in ctx.tensors:
324
+ ctx.tested_tensors.add(bias_key)
325
+
326
+ evaluated += 1
327
+
328
+ return evaluated, missing_inputs, unresolved
329
+
330
+
331
  def evaluate_gates_from_inputs(ctx: EvalContext, signals: Dict[int, float],
332
  gate_list: Optional[List[str]] = None) -> Tuple[int, List[str], List[str]]:
333
  """Evaluate gates using explicit .inputs tensors. Returns (evaluated, missing_inputs, unresolved)."""
 
336
  missing_inputs: List[str] = []
337
  unresolved: List[str] = []
338
  evaluated = 0
339
+ if not ctx.alias_ready:
340
+ ctx.alias_to_gate, ctx.gate_to_alias = build_alias_maps(ctx)
341
+ ctx.alias_ready = True
342
+ alias_to_gate, gate_to_alias = ctx.alias_to_gate, ctx.gate_to_alias
343
 
344
  progress = True
345
  while progress and remaining:
 
658
  seed_prefix_bits(ctx, prefix, base, bits, signals)
659
 
660
  gates = gate_list if gate_list is not None else [g for g in ctx.gates if g.startswith(prefix + ".")]
661
+ if prefix not in ctx.topo_cache or len(ctx.topo_cache[prefix]) != len(gates):
662
+ ctx.topo_cache[prefix] = topo_sort_gates(ctx, gates)
663
+ evaluated, missing_inputs, unresolved = evaluate_gates_in_order(ctx, signals, ctx.topo_cache[prefix])
664
  if missing_inputs or unresolved:
665
  raise RuntimeError(
666
  f"{prefix}: unresolved inputs (missing={len(missing_inputs)} unresolved={len(unresolved)})"
 
683
  return outputs
684
 
685
 
686
+ def eval_float16_lut_outputs(ctx: EvalContext, op_prefix: str,
687
+ bits: List[float],
688
+ match_prefix: str = "float16.lut") -> List[float]:
689
+ """Evaluate LUT-backed float16 unary ops using direct LUT indexing."""
690
+ idx = bits_to_int(bits)
691
+
692
+ # Mark the matching LUT gate tensors as tested for coverage.
693
+ match_gate = f"{match_prefix}.match{idx:04x}"
694
+ for suffix in (".weight", ".bias", ".inputs"):
695
+ key = match_gate + suffix
696
+ if key in ctx.tensors:
697
+ ctx.tested_tensors.add(key)
698
+
699
+ outputs: List[float] = []
700
+ for i in range(16):
701
+ gate = f"{op_prefix}.out{i}"
702
+ weight_key = f"{gate}.weight"
703
+ bias_key = f"{gate}.bias"
704
+ inputs_key = f"{gate}.inputs"
705
+
706
+ ctx.tested_tensors.add(weight_key)
707
+ if bias_key in ctx.tensors:
708
+ ctx.tested_tensors.add(bias_key)
709
+ if inputs_key in ctx.tensors:
710
+ ctx.tested_tensors.add(inputs_key)
711
+
712
+ weight = ctx.tensors[weight_key][idx].item()
713
+ bias = ctx.tensors.get(bias_key, torch.tensor([0.0])).item()
714
+ outputs.append(1.0 if (weight + bias) >= 0 else 0.0)
715
+
716
+ return outputs
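The fast path above works because each LUT output gate stores one weight per possible input index, so once the index is known the threshold unit degenerates to `step(weight[idx] + bias)` and none of the 65,536 match gates need evaluating. A toy 2-bit analogue (hand-picked weights, not tensors from the model):

```python
def step(x):
    return 1.0 if x >= 0 else 0.0

# Each "out" row holds one weight per LUT index; a positive weight means
# that output bit is 1 for that index. This toy table maps 0,1,2,3 -> 1,0,3,2.
lut_weights = {
    "out0": [1.0, -1.0, 1.0, -1.0],  # bit 0 of each entry
    "out1": [-1.0, -1.0, 1.0, 1.0],  # bit 1 of each entry
}
bias = 0.0

def lut_eval(idx):
    bits = [step(lut_weights[f"out{i}"][idx] + bias) for i in range(2)]
    return int(bits[0]) | (int(bits[1]) << 1)

print([lut_eval(i) for i in range(4)])  # [1, 0, 3, 2]
```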
717
+
718
+
719
  def build_float16_pairs(rng: random.Random, count: int) -> List[Tuple[int, int]]:
720
  """Build deterministic float16 test pairs using edge cases + random."""
721
  edges = [
 
752
  return pairs
753
 
754
 
755
+ def build_float16_values(rng: random.Random, count: int) -> List[int]:
756
+ """Build deterministic float16 test values using edge cases + random."""
757
+ edges = [
758
+ 0x0000, # +0
759
+ 0x8000, # -0
760
+ 0x3C00, # 1.0
761
+ 0xBC00, # -1.0
762
+ 0x4000, # 2.0
763
+ 0xC000, # -2.0
764
+ 0x3E00, # 1.5
765
+ 0x3555, # ~0.333
766
+ 0x7BFF, # max finite
767
+ 0xFBFF, # min finite
768
+ 0x0400, # min normal
769
+ 0x0001, # min subnormal
770
+ 0x03FF, # max subnormal
771
+ 0x7C00, # +inf
772
+ 0xFC00, # -inf
773
+ 0x7E00, # NaN
774
+ ]
775
+ # Extra edges for trig/exp/log
776
+ for val in [0.5, -0.5, math.pi, -math.pi, math.pi / 2, -math.pi / 2, math.e, -math.e]:
777
+ edges.append(float_to_int(float(val)))
778
+
779
+ # Deduplicate while preserving order
780
+ seen = set()
781
+ values = []
782
+ for v in edges:
783
+ if v not in seen:
784
+ seen.add(v)
785
+ values.append(v)
786
+
787
+ rng.shuffle(values)
788
+ values = values[:min(len(values), count)]
789
+
790
+ while len(values) < count:
791
+ v = rng.getrandbits(16)
792
+ if v in seen:
793
+ continue
794
+ seen.add(v)
795
+ values.append(v)
796
+
797
+ return values
798
+
799
+
800
  def float16_expected_bits_binary(op: str, a_bits: int, b_bits: int) -> Tuple[int, bool]:
801
  """Compute expected float16 bits for a binary op and whether it's NaN."""
802
  a = float16_int_to_float(a_bits)
 
818
  return float_to_int(float(out)), False
819
 
820
 
821
+ def float16_expected_bits_unary(op: str, a_bits: int) -> Tuple[int, bool]:
822
+ """Compute expected float16 bits for a unary op and whether it's NaN."""
823
+ a = float16_int_to_float(a_bits)
824
+ a16 = torch.tensor(a, dtype=torch.float16)
825
+ if op == "sqrt":
826
+ out = torch.sqrt(a16).item()
827
+ elif op == "rsqrt":
828
+ out = torch.rsqrt(a16).item()
829
+ elif op == "exp":
830
+ out = torch.exp(a16).item()
831
+ elif op == "ln":
832
+ out = torch.log(a16).item()
833
+ elif op == "log2":
834
+ out = torch.log2(a16).item()
835
+ elif op == "sin":
836
+ out = torch.sin(a16).item()
837
+ elif op == "cos":
838
+ out = torch.cos(a16).item()
839
+ elif op == "tan":
840
+ out = torch.tan(a16).item()
841
+ elif op == "tanh":
842
+ out = torch.tanh(a16).item()
843
+ else:
844
+ raise ValueError(f"unknown op: {op}")
845
+ if out != out:
846
+ return 0x7E00, True
847
+ return float_to_int(float(out)), False
848
+
849
+
850
+ def float16_expected_bits_pow(a_bits: int, b_bits: int) -> Tuple[int, bool]:
851
+ """Compute expected float16 bits for pow via exp(b * ln(a))."""
852
+ a = float16_int_to_float(a_bits)
853
+ b = float16_int_to_float(b_bits)
854
+ a16 = torch.tensor(a, dtype=torch.float16)
855
+ b16 = torch.tensor(b, dtype=torch.float16)
856
+ ln_a = torch.log(a16)
857
+ prod = ln_a * b16
858
+ out = torch.exp(prod).item()
859
+ if out != out:
860
+ return 0x7E00, True
861
+ return float_to_int(float(out)), False
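The reference deliberately rounds to float16 after every stage (ln, multiply, exp) rather than computing pow once in high precision, so it matches the LUT pipeline bit-for-bit. The staged rounding is observable; the same effect can be sketched with NumPy's float16 standing in for the torch half tensors used above:

```python
import numpy as np

a16 = np.float16(2.0)
b16 = np.float16(3.0)

# Staged pipeline, rounded to float16 between steps (as the circuit is)
ln_a = np.float16(np.log(a16))            # ln LUT stage
prod = np.float16(ln_a * b16)             # float16 multiply stage
staged = float(np.float16(np.exp(prod)))  # exp LUT stage

direct = 2.0 ** 3.0  # single-rounding float64 reference
print(staged, direct)  # staged lands slightly off 8.0
```

This is why the pow test must thread the actual ln/mul/exp bits through rather than comparing against `a ** b`.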
862
+
863
+
864
  # =============================================================================
865
  # BOOLEAN GATE TESTS
866
  # =============================================================================
 
1253
 
1254
 
1255
  def eval_negation(ctx: EvalContext, prefix: str, bits: List[float]) -> List[float]:
1256
+ """Evaluate negation (two's complement) for variable width."""
1257
+ n = len(bits)
1258
  result = []
1259
 
1260
  # NOT each bit
1261
  not_bits = []
1262
+ for i in range(n):
1263
+ if f"{prefix}.not{i}.weight" in ctx.tensors:
1264
+ not_bits.append(eval_gate_direct(ctx, f"{prefix}.not{i}", [bits[i]]))
1265
+ else:
1266
+ not_bits.append(1.0 - bits[i])
1267
 
1268
  # Add 1 using carry chain
1269
  carry = 1.0
1270
+ for i in range(n):
1271
  if i == 0:
1272
+ if f"{prefix}.sum0.weight" in ctx.tensors:
1273
+ sum_w = ctx.tensors[f"{prefix}.sum0.weight"]
1274
+ if sum_w.numel() == 1:
1275
+ result.append(eval_gate_direct(ctx, f"{prefix}.sum0", [not_bits[0]]))
1276
+ else:
1277
+ result.append(eval_gate_direct(ctx, f"{prefix}.sum0", [not_bits[0], 1.0]))
1278
+ elif f"{prefix}.xor0.weight" in ctx.tensors:
1279
+ result.append(eval_gate_direct(ctx, f"{prefix}.xor0", [not_bits[0], 1.0]))
1280
+ else:
1281
+ result.append(1.0 - not_bits[0])
1282
+
1283
+ if f"{prefix}.carry0.weight" in ctx.tensors:
1284
+ carry_w = ctx.tensors[f"{prefix}.carry0.weight"]
1285
+ if carry_w.numel() == 1:
1286
+ carry = eval_gate_direct(ctx, f"{prefix}.carry0", [not_bits[0]])
1287
+ else:
1288
+ carry = eval_gate_direct(ctx, f"{prefix}.carry0", [not_bits[0], 1.0])
1289
+ else:
1290
+ carry = not_bits[0]
1291
  else:
 
1292
  if f"{prefix}.xor{i}.weight" in ctx.tensors:
1293
  result.append(eval_gate_direct(ctx, f"{prefix}.xor{i}", [not_bits[i], carry]))
1294
  elif f"{prefix}.out{i}.weight" in ctx.tensors:
1295
  result.append(eval_gate_direct(ctx, f"{prefix}.out{i}", [not_bits[i], carry]))
1296
  else:
 
1297
  xor_val = 1.0 if (int(not_bits[i]) != int(carry)) else 0.0
1298
  result.append(xor_val)
1299
 
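The gate fallbacks in `eval_negation` mirror the standard two's-complement identity -x ≡ (~x + 1) mod 2^n, with a ripple carry for the +1. A tensor-free check of that arithmetic in plain Python:

```python
def neg_twos_complement(x, n=16):
    """-x mod 2^n computed as (~x + 1), bit by bit with a carry chain."""
    mask = (1 << n) - 1
    not_x = x ^ mask          # NOT each bit
    carry, result = 1, 0      # then add 1 via ripple carry
    for i in range(n):
        bit = (not_x >> i) & 1
        result |= (bit ^ carry) << i
        carry = bit & carry
    return result

assert all(neg_twos_complement(v) == (-v) % (1 << 16)
           for v in range(0, 1 << 16, 257))
print(neg_twos_complement(1))  # 65535 == 0xFFFF
```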
 
1347
  results.append(TestResult("arithmetic.fulladder", passed, total))
1348
 
1349
  # Ripple carry adders
1350
+ for bits in [2, 4, 8, 16]:
1351
  prefix = f"arithmetic.ripplecarry{bits}bit"
1352
  if f"{prefix}.fa0.ha1.sum.layer1.or.weight" not in ctx.tensors:
1353
  continue
1354
 
1355
  passed, total = 0, 0
1356
  max_val = 1 << bits
1357
+ if bits >= 16:
1358
+ test_range = range(0, max_val, max_val // 256)
1359
+ b_vals = [0, 1, max_val - 1]
1360
+ else:
1361
+ test_range = range(max_val) if (not ctx.quick or bits <= 4) else range(0, max_val, max_val // 256)
1362
+ b_vals = test_range if bits <= 4 else [0, 1, max_val - 1]
1363
 
1364
  for a in test_range:
1365
+ for b in b_vals:
1366
  a_bits = [float((a >> i) & 1) for i in range(bits)]
1367
  b_bits = [float((b >> i) & 1) for i in range(bits)]
1368
 
 
1396
 
1397
  results.append(TestResult("arithmetic.sub8bit", passed, total))
1398
 
1399
+ # 16-bit subtractor
1400
+ if f"arithmetic.sub16bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
1401
+ passed, total = 0, 0
1402
+ test_range = range(0, 1 << 16, 257)
1403
+
1404
+ for a in test_range:
1405
+ for b in test_range:
1406
+ a_bits = [float((a >> i) & 1) for i in range(16)]
1407
+ b_bits = [float((b >> i) & 1) for i in range(16)]
1408
+
1409
+ result_bits, _ = eval_subtractor(ctx, "arithmetic.sub16bit", a_bits, b_bits)
1410
+ result = sum(int(bit) << i for i, bit in enumerate(result_bits))
1411
+ expected = (a - b) % (1 << 16)
1412
+
1413
+ total += 1
1414
+ if result == expected:
1415
+ passed += 1
1416
+
1417
+ results.append(TestResult("arithmetic.sub16bit", passed, total))
1418
+
1419
  # 8-bit negation
1420
  if f"arithmetic.neg8bit.not0.weight" in ctx.tensors:
1421
  passed, total = 0, 0
 
1433
 
1434
  results.append(TestResult("arithmetic.neg8bit", passed, total))
1435
 
1436
+ # 16-bit negation
1437
+ if f"arithmetic.neg16bit.not0.weight" in ctx.tensors:
1438
+ passed, total = 0, 0
1439
+ test_range = range(0, 1 << 16, 257)
1440
+
1441
+ for val in test_range:
1442
+ bits = [float((val >> i) & 1) for i in range(16)]
1443
+ result_bits = eval_negation(ctx, "arithmetic.neg16bit", bits)
1444
+ result = sum(int(bit) << i for i, bit in enumerate(result_bits))
1445
+ expected = (-val) % (1 << 16)
1446
+
1447
+ total += 1
1448
+ if result == expected:
1449
+ passed += 1
1450
+
1451
+ results.append(TestResult("arithmetic.neg16bit", passed, total))
1452
+
1453
  # 8-bit add with carry (adc8bit)
1454
  if f"arithmetic.adc8bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
1455
  passed, total = 0, 0
 
1471
 
1472
  results.append(TestResult("arithmetic.adc8bit", passed, total))
1473
 
1474
+ # 16-bit add with carry (adc16bit)
1475
+ if f"arithmetic.adc16bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
1476
+ passed, total = 0, 0
1477
+ test_cases = [(0, 0, 0), (0, 0, 1), (65535, 1, 0), (65535, 1, 1),
1478
+ (32767, 32768, 0), (32767, 32768, 1)]
1479
+ test_cases.extend((a, b, c) for a in range(0, 65536, 4096)
1480
+ for b in range(0, 65536, 4096) for c in [0, 1])
1481
+
1482
+ for a, b, cin in test_cases:
1483
+ a_bits = [float((a >> i) & 1) for i in range(16)]
1484
+ b_bits = [float((b >> i) & 1) for i in range(16)]
1485
+
1486
+ result_bits = eval_ripple_carry_adder(ctx, "arithmetic.adc16bit", a_bits, b_bits, float(cin))
1487
+ result = sum(int(bit) << i for i, bit in enumerate(result_bits))
1488
+ expected = (a + b + cin) % (1 << 16)
1489
+
1490
+ total += 1
1491
+ if result == expected:
1492
+ passed += 1
1493
+
1494
+ results.append(TestResult("arithmetic.adc16bit", passed, total))
1495
+
1496
  # 8-bit subtract with borrow (sbc8bit)
1497
  # sbc computes: a - b - borrow = a + ~b + ~borrow
1498
  # So carry_in = ~borrow (1 when borrow=0, 0 when borrow=1)
 
1519
 
1520
  results.append(TestResult("arithmetic.sbc8bit", passed, total))
1521
 
1522
+ # 16-bit subtract with borrow (sbc16bit)
1523
+ if f"arithmetic.sbc16bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
1524
+ passed, total = 0, 0
1525
+ test_cases = [(0, 0, 0), (0, 0, 1), (65535, 1, 0), (65535, 1, 1),
1526
+ (50000, 1234, 0), (50000, 1234, 1)]
1527
+ test_cases.extend((a, b, c) for a in range(0, 65536, 4096)
1528
+ for b in range(0, 65536, 4096) for c in [0, 1])
1529
+
1530
+ for a, b, borrow in test_cases:
1531
+ a_bits = [float((a >> i) & 1) for i in range(16)]
1532
+ b_bits = [float((b >> i) & 1) for i in range(16)]
1533
+
1534
+ initial_carry = 1.0 - float(borrow)
1535
+ result_bits, _ = eval_subtractor(ctx, "arithmetic.sbc16bit", a_bits, b_bits, initial_carry)
1536
+ result = sum(int(bit) << i for i, bit in enumerate(result_bits))
1537
+ expected = (a - b - borrow) % (1 << 16)
1538
+
1539
+ total += 1
1540
+ if result == expected:
1541
+ passed += 1
1542
+
1543
+ results.append(TestResult("arithmetic.sbc16bit", passed, total))
1544
+
1545
  return results
1546
 
1547
 
 
1561
 
1562
  # Legacy comparators (if they exist)
1563
  comparators = [
1564
+ ("arithmetic.greaterthan8bit", lambda a, b: a > b, 8, range(256)),
1565
+ ("arithmetic.lessthan8bit", lambda a, b: a < b, 8, range(256)),
1566
+ ("arithmetic.greaterorequal8bit", lambda a, b: a >= b, 8, range(256)),
1567
+ ("arithmetic.lessorequal8bit", lambda a, b: a <= b, 8, range(256)),
1568
+ ("arithmetic.greaterthan16bit", lambda a, b: a > b, 16, range(0, 1 << 16, 257)),
1569
+ ("arithmetic.lessthan16bit", lambda a, b: a < b, 16, range(0, 1 << 16, 257)),
1570
+ ("arithmetic.greaterorequal16bit", lambda a, b: a >= b, 16, range(0, 1 << 16, 257)),
1571
+ ("arithmetic.lessorequal16bit", lambda a, b: a <= b, 16, range(0, 1 << 16, 257)),
1572
  ]
1573
 
1574
+ for name, op, bits, test_range in comparators:
1575
  if f"{name}.weight" not in ctx.tensors:
1576
  continue
1577
 
1578
  passed, total = 0, 0
1579
+ if ctx.quick:
1580
+ test_range = range(0, (1 << bits), max(1, (1 << bits) // 256))
1581
 
1582
  for a in test_range:
1583
  for b in test_range:
1584
+ a_bits = [float((a >> i) & 1) for i in range(bits)]
1585
+ b_bits = [float((b >> i) & 1) for i in range(bits)]
1586
 
1587
  actual = eval_gate_direct(ctx, name, a_bits + b_bits)
1588
  expected = 1.0 if op(a, b) else 0.0
 
1618
 
1619
  results.append(TestResult("arithmetic.cmp8bit", passed, total))
1620
 
1621
+ # arithmetic.cmp16bit - compares a and b, outputs sign of (a - b)
1622
+ if f"arithmetic.cmp16bit.fa0.xor1.layer1.or.weight" in ctx.tensors:
1623
+ passed, total = 0, 0
1624
+ test_range = range(0, 1 << 16, 257)
1625
+
1626
+ for a in test_range:
1627
+ for b in test_range:
1628
+ a_bits = [float((a >> i) & 1) for i in range(16)]
1629
+ b_bits = [float((b >> i) & 1) for i in range(16)]
1630
+
1631
+ result_bits, borrow = eval_subtractor(ctx, "arithmetic.cmp16bit", a_bits, b_bits)
1632
+ expected_lt = 1.0 if a < b else 0.0
1633
+ actual_lt = 1.0 - borrow
1634
+
1635
+ total += 1
1636
+ if actual_lt == expected_lt:
1637
+ passed += 1
1638
+
1639
+ results.append(TestResult("arithmetic.cmp16bit", passed, total))
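cmp16bit reads `a < b` off the subtractor's final carry: with the a + ~b + 1 formulation, carry-out is 1 exactly when a >= b, so its complement flags a < b. A plain-integer sketch of that relation:

```python
def less_than_via_borrow(a, b, n=16):
    """a < b iff a + ~b + 1 produces no carry out of bit n-1."""
    mask = (1 << n) - 1
    total = a + (b ^ mask) + 1
    carry_out = (total >> n) & 1   # 1 exactly when a >= b
    return carry_out == 0

assert all(less_than_via_borrow(a, b) == (a < b)
           for a in range(0, 1 << 16, 509)
           for b in range(0, 1 << 16, 521))
print(less_than_via_borrow(3, 7), less_than_via_borrow(7, 3))  # True False
```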
1640
+
1641
  # arithmetic.equality8bit - checks if a == b
1642
  if f"arithmetic.equality8bit.xnor0.layer1.and.weight" in ctx.tensors:
1643
  passed, total = 0, 0
 
1664
 
1665
  results.append(TestResult("arithmetic.equality8bit", passed, total))
1666
 
1667
+ if f"arithmetic.equality16bit.xnor0.layer1.and.weight" in ctx.tensors:
1668
+ passed, total = 0, 0
1669
+ test_range = range(0, 1 << 16, 257)
1670
+
1671
+ for a in test_range:
1672
+ for b in test_range:
1673
+ a_bits = [float((a >> i) & 1) for i in range(16)]
1674
+ b_bits = [float((b >> i) & 1) for i in range(16)]
1675
+
1676
+ xnor_results = []
1677
+ for i in range(16):
1678
+ xnor_val = eval_xnor_gate(ctx, f"arithmetic.equality16bit.xnor{i}", a_bits[i], b_bits[i])
1679
+ xnor_results.append(xnor_val)
1680
+
1681
+ actual = eval_gate_direct(ctx, "arithmetic.equality16bit.final_and", xnor_results)
1682
+ expected = 1.0 if a == b else 0.0
1683
+
1684
+ total += 1
1685
+ if actual == expected:
1686
+ passed += 1
1687
+
1688
+ results.append(TestResult("arithmetic.equality16bit", passed, total))
1689
+
1690
  return results
1691
 
1692
 
 
1831
 
1832
  results.append(TestResult("arithmetic.asr8bit", passed, total))
1833
 
1834
+ # Arithmetic shift right (asr16bit)
1835
+ if f"arithmetic.asr16bit.bit0.weight" in ctx.tensors:
1836
+ passed, total = 0, 0
1837
+ test_range = range(0, 1 << 16, 257)
1838
+
1839
+ for val in test_range:
1840
+ bits = [float((val >> i) & 1) for i in range(16)]
1841
+ result_bits = []
1842
+ for i in range(16):
1843
+ out_bit = eval_gate_direct(ctx, f"arithmetic.asr16bit.bit{i}", bits)
1844
+ result_bits.append(out_bit)
1845
+
1846
+ result = sum(int(b) << i for i, b in enumerate(result_bits))
1847
+ sign_bit = (val >> 15) & 1
1848
+ expected = (val >> 1) | (sign_bit << 15)
1849
+
1850
+ total += 1
1851
+ if result == expected:
1852
+ passed += 1
1853
+
1854
+ results.append(TestResult("arithmetic.asr16bit", passed, total))
1855
+
1856
  # Rotate left (rol8bit)
1857
  if f"arithmetic.rol8bit.bit0.weight" in ctx.tensors:
1858
  passed, total = 0, 0
 
1878
 
1879
  results.append(TestResult("arithmetic.rol8bit", passed, total))
1880
 
1881
+ # Rotate left (rol16bit)
1882
+ if f"arithmetic.rol16bit.bit0.weight" in ctx.tensors:
1883
+ passed, total = 0, 0
1884
+ test_range = range(0, 1 << 16, 257)
1885
+
1886
+ for val in test_range:
1887
+ bits = [float((val >> i) & 1) for i in range(16)]
1888
+ result_bits = []
1889
+ for i in range(16):
1890
+ out_bit = eval_gate_direct(ctx, f"arithmetic.rol16bit.bit{i}", bits)
1891
+ result_bits.append(out_bit)
1892
+
1893
+ result = sum(int(b) << i for i, b in enumerate(result_bits))
1894
+ expected = ((val << 1) | (val >> 15)) & 0xFFFF
1895
+
1896
+ total += 1
1897
+ if result == expected:
1898
+ passed += 1
1899
+
1900
+ results.append(TestResult("arithmetic.rol16bit", passed, total))
1901
+
1902
  # Rotate right (ror8bit)
1903
  if f"arithmetic.ror8bit.bit0.weight" in ctx.tensors:
1904
  passed, total = 0, 0
 
1924
 
1925
  results.append(TestResult("arithmetic.ror8bit", passed, total))
1926
 
1927
+ # Rotate right (ror16bit)
1928
+ if f"arithmetic.ror16bit.bit0.weight" in ctx.tensors:
1929
+ passed, total = 0, 0
1930
+ test_range = range(0, 1 << 16, 257)
1931
+
1932
+ for val in test_range:
1933
+ bits = [float((val >> i) & 1) for i in range(16)]
1934
+ result_bits = []
1935
+ for i in range(16):
1936
+ out_bit = eval_gate_direct(ctx, f"arithmetic.ror16bit.bit{i}", bits)
1937
+ result_bits.append(out_bit)
1938
+
1939
+ result = sum(int(b) << i for i, b in enumerate(result_bits))
1940
+ expected = ((val >> 1) | ((val & 1) << 15)) & 0xFFFF
1941
+
1942
+ total += 1
1943
+ if result == expected:
1944
+ passed += 1
1945
+
1946
+ results.append(TestResult("arithmetic.ror16bit", passed, total))
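The expected values in the rotate tests are the usual single-bit rotate identities; since rol and ror are inverses, a quick sanity check is possible in plain Python:

```python
def rol16(x):
    """Rotate a 16-bit value left by one bit."""
    return ((x << 1) | (x >> 15)) & 0xFFFF

def ror16(x):
    """Rotate a 16-bit value right by one bit."""
    return ((x >> 1) | ((x & 1) << 15)) & 0xFFFF

assert all(ror16(rol16(v)) == v for v in range(0, 1 << 16, 257))
print(hex(rol16(0x8001)), hex(ror16(0x8001)))  # 0x3 0xc000
```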
1947
+
1948
  return results
1949
 
1950
 
 
2116
 
2117
  # Comparator-like weight vectors (MSB-first weights)
2118
  comp_names = [
2119
+ "arithmetic.greaterthan16bit.comparator",
2120
+ "arithmetic.lessthan16bit.comparator",
2121
+ "arithmetic.greaterorequal16bit.comparator",
2122
+ "arithmetic.lessorequal16bit.comparator",
2123
  "combinational.priorityencoder8bit.priority",
2124
  ]
 
2125
 
2126
  for name in comp_names:
2127
  if name not in ctx.tensors:
 
2130
  ctx.tested_tensors.add(name)
2131
 
2132
  passed, total = 0, 0
2133
+ # Validate weight pattern (MSB-first powers of two)
2134
+ expected_weights = [float(2 ** i) for i in range(len(weights) - 1, -1, -1)]
2135
  total += 1
2136
  if weights == expected_weights:
2137
  passed += 1
2138
 
2139
  # Validate numeric interpretation (MSB-first bits -> value)
2140
+ test_range = range(256) if len(weights) == 8 else range(0, 1 << 16, 257)
2141
  for val in test_range:
2142
+ bits = [float((val >> i) & 1) for i in range(len(weights))][::-1]
2143
  actual = sum(w * b for w, b in zip(weights, bits))
2144
  total += 1
2145
  if int(actual + 0.5) == val:
 
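The comparator vectors are MSB-first powers of two, so the dot product in the loop above is just binary decoding of the bit vector. The same check in miniature (pure Python, mirroring the validation rather than the tensors):

```python
n = 16
weights = [float(2 ** i) for i in range(n - 1, -1, -1)]  # MSB-first powers of two

def decode(val):
    """Recover an integer from its bits via the comparator weight vector."""
    bits = [float((val >> i) & 1) for i in range(n)][::-1]  # reorder to MSB first
    return int(sum(w * b for w, b in zip(weights, bits)) + 0.5)

assert all(decode(v) == v for v in range(0, 1 << 16, 257))
print(decode(0xBEEF))  # 48879
```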
2149
 
2150
  # Constant/selector vectors
2151
  const_specs = {
2152
+ "arithmetic.incrementer16bit.one": ([0.0] * 15 + [1.0], 1),
2153
+ "arithmetic.decrementer16bit.neg_one": ([1.0] * 16, 0xFFFF),
2154
  }
2155
  for name, (expected_bits, expected_val) in const_specs.items():
2156
  if name not in ctx.tensors:
 
2166
 
2167
  # All-ones selector/mask tensors
2168
  ones_specs = {
2169
+ "arithmetic.absolutedifference16bit.diff": 32,
2170
+ "arithmetic.incrementer16bit.adder": 16,
2171
+ "arithmetic.decrementer16bit.adder": 16,
2172
+ "arithmetic.max16bit.select": 32,
2173
+ "arithmetic.min16bit.select": 32,
2174
  "combinational.barrelshifter8bit.shift": 11,
2175
  "combinational.demultiplexer1to4.decode": 3,
2176
  "combinational.demultiplexer1to8.decode": 4,
 
2670
  return results
2671
 
2672
 
2673
+ def test_float16_unary(ctx: EvalContext) -> List[TestResult]:
2674
+ """Test LUT-backed float16 unary operations."""
2675
+ results: List[TestResult] = []
2676
+
2677
+ rng = random.Random(1337)
2678
+ values = build_float16_values(rng, 256)
2679
+
2680
+ ops = [
2681
+ ("float16.sqrt", "sqrt"),
2682
+ ("float16.rsqrt", "rsqrt"),
2683
+ ("float16.exp", "exp"),
2684
+ ("float16.ln", "ln"),
2685
+ ("float16.log2", "log2"),
2686
+ ("float16.sin", "sin"),
2687
+ ("float16.cos", "cos"),
2688
+ ("float16.tan", "tan"),
2689
+ ("float16.tanh", "tanh"),
2690
+ ]
2691
+
2692
+ for prefix, op in ops:
2693
+ if f"{prefix}.out0.weight" not in ctx.tensors:
2694
+ continue
2695
+ passed, total = 0, 0
2696
+ failures: List[Dict[str, Any]] = []
2697
+ for a_bits in values:
2698
+ bits_list = [float((a_bits >> i) & 1) for i in range(16)]
2699
+ actual_bits = eval_float16_lut_outputs(ctx, prefix, bits_list)
2700
+ actual_int = bits_to_int(actual_bits)
2701
+ expected_int, expected_nan = float16_expected_bits_unary(op, a_bits)
2702
+ ok = float16_is_nan_bits(actual_int) if expected_nan else actual_int == expected_int
2703
+ total += 1
2704
+ if ok:
2705
+ passed += 1
2706
+ elif len(failures) < 8:
2707
+ failures.append({
2708
+ "input": hex(a_bits),
2709
+ "actual": hex(actual_int),
2710
+ "expected": hex(expected_int),
2711
+ })
2712
+ results.append(TestResult(prefix, passed, total, failures))
2713
+
2714
+ return results
2715
+
2716
+
2717
+ def test_float16_pow(ctx: EvalContext) -> List[TestResult]:
2718
+ """Test float16.pow (defined as exp(b * ln(a)))."""
2719
+ results: List[TestResult] = []
2720
+ if f"float16.pow.out0.weight" not in ctx.tensors:
2721
+ return results
2722
+
2723
+ rng = random.Random(1337)
2724
+ pairs = build_float16_pairs(rng, 128)
2725
+ mul_prefix = "float16.pow.mul"
2726
+ mul_gates = sorted([g for g in ctx.gates if g.startswith(mul_prefix + ".")])
2727
+
2728
+ passed, total = 0, 0
2729
+ failures: List[Dict[str, Any]] = []
2730
+ for a_bits, b_bits in pairs:
2731
+ a_list = [float((a_bits >> i) & 1) for i in range(16)]
2732
+ b_list = [float((b_bits >> i) & 1) for i in range(16)]
2733
+ # ln(a) via LUT, then mul, then exp via LUT (fast path)
2734
+ ln_bits = eval_float16_lut_outputs(ctx, "float16.pow.ln", a_list, match_prefix="float16.pow.ln")
2735
+
2736
+ # Evaluate pow.mul with ln outputs as internal inputs
2737
+ signals: Dict[int, float] = {}
2738
+ if "#0" in ctx.name_to_id:
2739
+ signals[ctx.name_to_id["#0"]] = 0.0
2740
+ if "#1" in ctx.name_to_id:
2741
+ signals[ctx.name_to_id["#1"]] = 1.0
2742
+ for i in range(16):
2743
+ sid = ctx.name_to_id.get(f"float16.pow.$b[{i}]")
2744
+ if sid is not None:
2745
+ signals[sid] = float(b_list[i])
2746
+ for i in range(16):
2747
+ sid = ctx.name_to_id.get(f"float16.pow.ln.out{i}")
2748
+ if sid is not None:
2749
+ signals[sid] = float(ln_bits[i])
2750
+
2751
+ if mul_prefix not in ctx.topo_cache or len(ctx.topo_cache[mul_prefix]) != len(mul_gates):
2752
+ ctx.topo_cache[mul_prefix] = topo_sort_gates(ctx, mul_gates)
2753
+ evaluate_gates_in_order(ctx, signals, ctx.topo_cache[mul_prefix])
2754
+
2755
+ mul_bits = []
2756
+ for i in range(16):
2757
+ gate = f"{mul_prefix}.out{i}"
2758
+ sid = ctx.name_to_id.get(gate)
2759
+ if sid is None or sid not in signals:
2760
+ raise RuntimeError(f"{mul_prefix}: missing output {gate}")
2761
+ mul_bits.append(float(signals[sid]))
2762
+
2763
+ exp_bits = eval_float16_lut_outputs(ctx, "float16.pow.exp", mul_bits, match_prefix="float16.pow.exp")
2764
+
2765
+ # Mark pow output pass-through gates as tested
2766
+ for i in range(16):
2767
+ gate = f"float16.pow.out{i}"
2768
+ for suffix in (".weight", ".bias", ".inputs"):
2769
+ key = gate + suffix
2770
+ if key in ctx.tensors:
2771
+ ctx.tested_tensors.add(key)
2772
+
2773
+ actual_int = bits_to_int(exp_bits)
2774
+ expected_int, expected_nan = float16_expected_bits_pow(a_bits, b_bits)
2775
+ ok = float16_is_nan_bits(actual_int) if expected_nan else actual_int == expected_int
2776
+ total += 1
2777
+ if ok:
2778
+ passed += 1
2779
+ elif len(failures) < 8:
2780
+ failures.append({
2781
+ "a": hex(a_bits),
2782
+ "b": hex(b_bits),
2783
+ "actual": hex(actual_int),
2784
+ "expected": hex(expected_int),
2785
+ })
2786
+
2787
+ results.append(TestResult("float16.pow", passed, total, failures))
2788
+ return results
2789
+
2790
+
2791
  # =============================================================================
2792
  # TEST RUNNER
2793
  # =============================================================================
 
2808
  "float16_basic": ("Float16 - Basic", test_float16_basic),
2809
  "float16_arith": ("Float16 - Arithmetic", test_float16_arithmetic),
2810
  "float16_conv": ("Float16 - Conversion", test_float16_conversion),
2811
+ "float16_unary": ("Float16 - Unary LUT", test_float16_unary),
2812
+ "float16_pow": ("Float16 - Pow", test_float16_pow),
2813
  }
2814
 
2815