Add multi-bit carry infrastructure for float16.mul/div

- Add col_bit2 (floor/4 mod 2) and col_bit3 (floor/8 mod 2) gates
- Add carry accumulator gates for positions receiving multiple carries
- Update TODO.md with detailed remaining work documentation
- Move completed float16 circuits to Completed section

Mul/div still failing due to carry_acc_carry propagation issue.
Proper fix requires Wallace/Dadda tree or secondary carry chain.
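
The failure described above can be reproduced in a simplified integer model. This is a sketch, not the repository's evaluator: `column_counts` and `product_model` are hypothetical names, plain integer addition stands in for the `prod_fa` ripple chain, and the model assumes the only thing lost is the accumulator's carry-out. The real circuits may fail for other reasons as well.

```python
def column_counts(a: int, b: int, width: int = 11) -> list[int]:
    """Count the 1-valued partial products in each column of a width x width multiplier."""
    counts = [0] * (2 * width - 1)
    for i in range(width):
        for j in range(width):
            counts[i + j] += ((a >> i) & 1) & ((b >> j) & 1)
    return counts


def product_model(a: int, b: int, width: int = 11, keep_acc_carry: bool = True) -> int:
    """Integer model of the column-bit scheme (valid while every column count is < 16)."""
    counts = column_counts(a, b, width)
    incoming = [0] * (len(counts) + 4)  # carry bits arriving at each position
    total = 0
    for col, s in enumerate(counts):
        total += (s & 1) << col            # col_sum stays at position col
        for k in (1, 2, 3):                # col_bit1/2/3 go to positions col+1/2/3
            incoming[col + k] += (s >> k) & 1
    for pos, n in enumerate(incoming):     # n is at most 3
        total += (n & 1) << pos            # carry_acc_sum (or the single carry)
        if keep_acc_carry:
            total += (n >> 1) << (pos + 1)  # carry_acc_carry, one position higher
    return total


print(product_model(15, 15))                        # 225
print(product_model(15, 15, keep_acc_carry=False))  # 161
```

For 15 x 15, column 3 contributes a `bit2` and column 4 a `bit1`, both landing on position 5. Their parity is 0, so without the accumulator carry the weight-64 term disappears and the model returns 161 instead of 225.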

Files changed (3)

TODO.md +73 -21
arithmetic.safetensors +2 -2
convert_to_explicit_inputs.py +356 -34

TODO.md CHANGED

@@ -2,23 +2,60 @@
 ## High Priority
-### Floating Point Circuits
-- [x] `float16.unpack` -- extract sign, exponent, mantissa (16 gates, 63/63 tests)
-- [x] `float16.pack` -- assemble from components (16 gates, 63/63 tests)
-- [x] `float16.cmp` -- comparison a > b (14 gates, 113/113 tests)
-- [x] `float16.normalize` -- CLZ-based shift calculator (51 gates, 14/14 tests)
-- [x] `float16.add` -- IEEE 754 addition (~998 gates, 125/125 tests)
-- [x] `float16.sub` -- IEEE 754 subtraction (via add with -b, 115/115 tests)
-- [ ] `float16.mul` -- IEEE 754 multiplication (766 gates, 13/84 tests, col_sum precision)
-- [ ] `float16.div` -- IEEE 754 division (1854 gates, 5/53 tests, col_sum precision)
-- [x] `float16.toint` -- float16 to int16 (401 gates, 93/93 tests)
-- [x] `float16.fromint` -- int16 to float16 (478 gates, 53/53 tests)
-- [x] `float16.neg` -- sign flip (16 gates, 58/58 tests)
-- [x] `float16.abs` -- clear sign bit (16 gates, 58/58 tests)
-### Supporting Infrastructure
-- [x] `arithmetic.clz8bit` -- 8-bit count leading zeros (30 gates, 256/256 tests)
-- [x] `arithmetic.clz16bit` -- 16-bit count leading zeros (63 gates, 217/217 tests)
 ## Medium Priority
@@ -30,11 +67,6 @@
 - [ ] `arithmetic.gcd8bit` -- greatest common divisor
 - [ ] `arithmetic.lcm8bit` -- least common multiple
-### Evaluator Improvements
-- [x] Full circuit evaluation using .inputs topology
-- [x] Exhaustive testing for boolean, threshold, CLZ, float16, comparator circuits
-- [x] Automatic topological sort from signal registry
 ## Low Priority
 ### Transcendental Approximations
@@ -56,6 +88,26 @@
 ## Completed
 - [x] Boolean gates (AND, OR, NOT, NAND, NOR, XOR, XNOR, IMPLIES, BIIMPLIES)
 - [x] Arithmetic adders (half, full, ripple carry 2/4/8 bit)
 - [x] Arithmetic subtraction (SUB, SBC, NEG)

 ## High Priority
+### Floating Point Circuits - Remaining Work
+#### `float16.mul` -- IEEE 754 multiplication (~800 gates, ~55/84 tests)
+**Problem**: Multi-bit carry propagation in 11x11 mantissa multiplier.
+**Background**: The mantissa multiplier produces a 22-bit product from two 11-bit mantissas (including the implicit leading 1). Each column `i` (0 <= i <= 20) has `min(i+1, 21-i)` partial products (AND gates). Column 10 has the maximum of 11 partial products.
+**Current Implementation**:
+- Column sums computed via threshold gates: `col_sum = parity(PP_0, PP_1, ..., PP_n)`
+- Parity computed as `(ge1 AND NOT ge2) OR (ge3 AND NOT ge4) OR ...`
+- `col_bit1` = floor(sum/2) mod 2 (carry to position i+1)
+- `col_bit2` = floor(sum/4) mod 2 (carry to position i+2)
+- `col_bit3` = floor(sum/8) mod 2 (carry to position i+3)
+- Carry accumulator gates sum incoming carries from multiple columns
+**Remaining Issue**: The carry accumulator at position i can itself produce a carry (`carry_acc_carry`) when the sum of its incoming carry bits is >= 2. This secondary carry must propagate to position i+1, where it can in turn trigger further carries. Handling that cascade requires one of:
+1. A proper CSA (Carry Save Adder) tree structure, or
+2. A secondary FA chain for accumulated carries, or
+3. Iterating until carry stabilization
+**Files**: `convert_to_explicit_inputs.py` lines 5350-5650 (build), lines 2400-2700 (infer)
+---
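
The column-bit encoding listed under "Current Implementation" can be sketched as a small threshold-gate model. Two assumptions here are not stated in the TODO: a gate fires when `w.x + bias >= 0` (inferred from the NOT gate's weight -1 and bias 0 in this commit), and `gate` and `column_bits` are hypothetical helper names, not functions from `convert_to_explicit_inputs.py`.

```python
from typing import Sequence


def gate(weights: Sequence[float], bias: float, inputs: Sequence[int]) -> int:
    """Threshold gate. ASSUMED activation: output 1 when w.x + bias >= 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias >= 0)


def column_bits(pps: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (col_sum, col_bit1, col_bit2, col_bit3) for one column of partial products."""
    count = len(pps)
    # ge[t] = 1 when at least t partial products are set (all weights 1, bias -t).
    ge = {t: gate([1.0] * count, -float(t), pps) for t in range(1, count + 1)}

    def bit(k: int) -> int:
        width = 1 << k  # bit k of the sum is set on runs of `width` consecutive values
        terms = []
        for lo in range(width, count + 1, 2 * width):
            hi = lo + width
            if hi <= count:
                not_hi = gate([-1.0], 0.0, [ge[hi]])                     # NOT ge[hi]
                terms.append(gate([1.0, 1.0], -2.0, [ge[lo], not_hi]))   # ge[lo] AND NOT ge[hi]
            else:
                terms.append(ge[lo])  # the sum cannot reach hi, so ge[lo] alone suffices
        # OR of the range terms; 0 when the column is too small to produce this bit.
        return gate([1.0] * len(terms), -0.5, terms) if terms else 0

    return bit(0), bit(1), bit(2), bit(3)


print(column_bits([1, 1, 1, 0, 1, 1]))  # sum = 5 = 0b0101 -> (1, 0, 1, 0)
```

With `k = 0` the ranges are `(ge1 AND NOT ge2), (ge3 AND NOT ge4), ...`, which is the parity expression above; `k = 1, 2, 3` give the `col_bit1`, `col_bit2`, and `col_bit3` ranges.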
+#### `float16.div` -- IEEE 754 division (~1900 gates, ~5/53 tests)
+**Problem**: Same multi-bit carry issue as multiplication, plus potential issues in the non-restoring division algorithm.
+**Background**: Division uses a non-restoring algorithm with an 11-bit dividend and divisor. The quotient mantissa is computed iteratively, and similar column reduction issues arise.
+**Current Implementation**:
+- NaN output bit 9 fixed (canonical NaN = 0x7E00)
+- Column sum parity gates similar to multiplication
+**Remaining Issues**:
+1. Same multi-bit carry propagation problem as multiplication
+2. There may be additional issues in division-specific logic (partial remainder computation)
+**Files**: `convert_to_explicit_inputs.py` lines 5700-6200 (build), lines 2700-3100 (infer)
+---
+### Potential Solutions for Carry Propagation
+1. **Wallace Tree**: Replace column reduction with proper Wallace tree structure. More gates but handles arbitrary partial product counts correctly.
+2. **Dadda Tree**: Similar to Wallace, but performs only the minimum reduction needed at each stage, which typically uses fewer adders.
+3. **Iterative Carry Resolution**: After initial FA chain, detect remaining carries and iterate until stable. Simple but slow.
+4. **Hybrid Approach**: Use threshold gates for small columns (2-3 PPs) and proper tree reduction for larger columns.
+---
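
The tree-based options above share one idea: compress each column with 3:2 counters (full adders) until at most two bits remain per column, then run a single carry-propagate add. The sketch below is a greedy carry-save reduction in the spirit of Wallace/Dadda, not a faithful Dadda schedule and not code from this repository; `full_adder`, `reduce_columns`, and `multiply` are hypothetical names, and it operates on plain integers rather than threshold gates.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 counter: returns (sum bit, carry bit)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def reduce_columns(columns: list[list[int]]) -> list[list[int]]:
    """Apply 3:2 compression stage by stage until no column holds more than two bits."""
    cols = [list(c) for c in columns]
    while any(len(c) > 2 for c in cols):
        nxt: list[list[int]] = [[] for _ in range(len(cols) + 1)]
        for i, bits in enumerate(cols):
            while len(bits) >= 3:
                s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                nxt[i].append(s)       # sum bit stays in this column
                nxt[i + 1].append(c)   # carry bit moves one column up
            nxt[i].extend(bits)        # leftover 0-2 bits pass through unchanged
        cols = nxt
    return cols


def multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two width-bit integers via partial products and carry-save reduction."""
    columns: list[list[int]] = [[] for _ in range(2 * width - 1)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    reduced = reduce_columns(columns)
    # Final carry-propagate add of the two remaining rows (a ripple chain would do this in gates).
    row0 = sum(c[0] << i for i, c in enumerate(reduced) if len(c) > 0)
    row1 = sum(c[1] << i for i, c in enumerate(reduced) if len(c) > 1)
    return row0 + row1


print(multiply(0x7FF, 0x7FF) == 0x7FF * 0x7FF)  # True
```

Because every full adder preserves the weighted sum of its inputs, no carry is ever dropped, whatever the column height. The cost is more stages and more gates than the threshold-count approach.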
 ## Medium Priority
 - [ ] `arithmetic.gcd8bit` -- greatest common divisor
 - [ ] `arithmetic.lcm8bit` -- least common multiple
 ## Low Priority
 ### Transcendental Approximations
 ## Completed
+### Floating Point Circuits
+- [x] `float16.unpack` -- extract sign, exponent, mantissa (16 gates, 63/63 tests)
+- [x] `float16.pack` -- assemble from components (16 gates, 63/63 tests)
+- [x] `float16.cmp` -- comparison a > b (14 gates, 113/113 tests)
+- [x] `float16.normalize` -- CLZ-based shift calculator (51 gates, 14/14 tests)
+- [x] `float16.add` -- IEEE 754 addition (~998 gates, 125/125 tests)
+- [x] `float16.sub` -- IEEE 754 subtraction (via add with -b, 115/115 tests)
+- [x] `float16.toint` -- float16 to int16 (401 gates, 93/93 tests)
+- [x] `float16.fromint` -- int16 to float16 (478 gates, 53/53 tests)
+- [x] `float16.neg` -- sign flip (16 gates, 58/58 tests)
+- [x] `float16.abs` -- clear sign bit (16 gates, 58/58 tests)
+### Supporting Infrastructure
+- [x] `arithmetic.clz8bit` -- 8-bit count leading zeros (30 gates, 256/256 tests)
+- [x] `arithmetic.clz16bit` -- 16-bit count leading zeros (63 gates, 217/217 tests)
+- [x] Full circuit evaluation using .inputs topology
+- [x] Exhaustive testing for boolean, threshold, CLZ, float16, comparator circuits
+- [x] Automatic topological sort from signal registry
+### Core Circuits
 - [x] Boolean gates (AND, OR, NOT, NAND, NOR, XOR, XNOR, IMPLIES, BIIMPLIES)
 - [x] Arithmetic adders (half, full, ripple carry 2/4/8 bit)
 - [x] Arithmetic subtraction (SUB, SBC, NEG)

arithmetic.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3bb8fad90726b27a8bd9502c5cb4154242a5f8a6d046c4ba69470be55bb98624
-size 2865388

 version https://git-lfs.github.com/spec/v1
+oid sha256:ce88f97552b5471d1c2adb4a88ea64ddaf8e05537884a12624a455da32531910
+size 2980552

convert_to_explicit_inputs.py CHANGED

@@ -2410,28 +2410,192 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
             if 0 <= j < 11:
                 pps.append(f"{prefix}.pp{i}_{j}")
-        if len(pps) == 1:
             if f'.col{col}' in gate and f'.col{col}_' not in gate:
                 return [registry.get_id(pps[0])]
             registry.register(f"{prefix}.col{col}")
-        elif len(pps) > 1:
-            if f'.col{col}_sum' in gate:
-                return [registry.get_id(pp) for pp in pps]
-            registry.register(f"{prefix}.col{col}_sum")
             match = re.search(rf'\.col{col}_ge(\d+)$', gate)
             if match:
                 return [registry.get_id(pp) for pp in pps]
-            for t in range(1, len(pps)):
                 registry.register(f"{prefix}.col{col}_ge{t}")
     if '.prod_fa' in gate:
         match = re.search(r'\.prod_fa(\d+)\.', gate)
         if match:
             i = int(match.group(1))
             fa_prefix = f"{prefix}.prod_fa{i}"
             # Count partial products in each column to determine signal names
             # col 0 and col 20 have 1 PP each, others have more
             def get_col_sum(col):
@@ -2441,26 +2605,43 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
                     return registry.get_id(f"{prefix}.col{col}_sum")
                 return registry.get_id("#0")
-            def get_col_carry(col):
-                # Carry from column = sum >= 2, which is ge2
-                # For columns with count >= 3, ge2 exists
-                # For columns with count == 2, no ge2 (but carry is rare anyway)
-                # For single-bit columns, no carry possible
-                if col == 0 or col == 20:
-                    return registry.get_id("#0")  # No carry from single PP columns
-                elif col < 21:
-                    # ge2 exists for columns with 3+ PPs
-                    # For 2-PP columns (col 1 and col 19), ge2 doesn't exist
-                    # but those columns can only produce carry if both PPs are 1,
-                    # which is relatively rare. For now, we use #0 for 2-PP columns.
-                    ge2_name = f"{prefix}.col{col}_ge2"
-                    ge2_id = registry.get_id(ge2_name)
-                    if ge2_id != -1:
-                        return ge2_id
                     else:
-                        # 2-PP columns: no ge2, return 0 (imprecise but safe)
-                        return registry.get_id("#0")
-                return registry.get_id("#0")
             if i == 0:
                 a_bit = get_col_sum(0)
@@ -2468,7 +2649,7 @@ def infer_float16_mul_inputs(gate: str, registry: SignalRegistry) -> List[int]:
                 cin = registry.get_id("#0")
             else:
                 a_bit = get_col_sum(i) if i < 21 else registry.get_id("#0")
-                b_bit = get_col_carry(i - 1) if i < 22 else registry.get_id("#0")
                 cin = registry.register(f"{prefix}.prod_fa{i-1}.cout")
             if '.xor1.layer1' in gate:
@@ -5360,17 +5541,158 @@ def build_float16_mul_tensors() -> Dict[str, torch.Tensor]:
             tensors[f"{prefix}.col{col}.weight"] = torch.tensor([1.0])
             tensors[f"{prefix}.col{col}.bias"] = torch.tensor([-0.5])
         else:
-            # Multi-bit column: use threshold gate to count
-            # For proper addition we need full adder trees
-            # Simplified: use weighted sum approach
-            tensors[f"{prefix}.col{col}_sum.weight"] = torch.tensor([1.0] * count)
-            tensors[f"{prefix}.col{col}_sum.bias"] = torch.tensor([-0.5])
-            # Carry generation for each threshold level
-            for t in range(1, count):
                 tensors[f"{prefix}.col{col}_ge{t}.weight"] = torch.tensor([1.0] * count)
                 tensors[f"{prefix}.col{col}_ge{t}.bias"] = torch.tensor([-float(t)])
     # Final product assembly using ripple carry
     for i in range(22):
         p = f"{prefix}.prod_fa{i}"

             if 0 <= j < 11:
                 pps.append(f"{prefix}.pp{i}_{j}")
+        count = len(pps)
+        if count == 1:
             if f'.col{col}' in gate and f'.col{col}_' not in gate:
                 return [registry.get_id(pps[0])]
             registry.register(f"{prefix}.col{col}")
+        elif count > 1:
+            # ge{t} gates: threshold >= t
             match = re.search(rf'\.col{col}_ge(\d+)$', gate)
             if match:
                 return [registry.get_id(pp) for pp in pps]
+            for t in range(1, count + 1):
                 registry.register(f"{prefix}.col{col}_ge{t}")
+            # not_ge{t} for even t
+            match = re.search(rf'\.col{col}_not_ge(\d+)$', gate)
+            if match:
+                t = int(match.group(1))
+                return [registry.get_id(f"{prefix}.col{col}_ge{t}")]
+            for t in range(2, count + 1, 2):
+                registry.register(f"{prefix}.col{col}_not_ge{t}")
+            # odd{t} gates: ge{t} AND (NOT ge{t+1} or just ge{t} if t+1 > count)
+            match = re.search(rf'\.col{col}_odd(\d+)$', gate)
+            if match:
+                t = int(match.group(1))
+                if t + 1 <= count:
+                    return [registry.get_id(f"{prefix}.col{col}_ge{t}"),
+                            registry.get_id(f"{prefix}.col{col}_not_ge{t+1}")]
+                else:
+                    return [registry.get_id(f"{prefix}.col{col}_ge{t}")]
+            odd_ranges = []
+            for t in range(1, count + 1, 2):
+                registry.register(f"{prefix}.col{col}_odd{t}")
+                odd_ranges.append(f"{prefix}.col{col}_odd{t}")
+            # col_sum = OR of all odd gates (parity)
+            if f'.col{col}_sum' in gate:
+                return [registry.get_id(r) for r in odd_ranges]
+            registry.register(f"{prefix}.col{col}_sum")
+            # col_bit1 gates (floor(sum/2) mod 2)
+            if count >= 2:
+                match = re.search(rf'\.col{col}_bit1_(\d+)$', gate)
+                if match:
+                    t = int(match.group(1))
+                    upper = t + 2
+                    if upper <= count:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}"),
+                                registry.get_id(f"{prefix}.col{col}_not_ge{upper}")]
+                    else:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}")]
+                bit1_ranges = []
+                for t in range(2, count + 1, 4):
+                    registry.register(f"{prefix}.col{col}_bit1_{t}")
+                    bit1_ranges.append(f"{prefix}.col{col}_bit1_{t}")
+                if f'.col{col}_bit1' in gate and f'.col{col}_bit1_' not in gate:
+                    return [registry.get_id(r) for r in bit1_ranges]
+                if bit1_ranges:
+                    registry.register(f"{prefix}.col{col}_bit1")
+            # col_bit2 gates (floor(sum/4) mod 2)
+            if count >= 4:
+                match = re.search(rf'\.col{col}_bit2_(\d+)$', gate)
+                if match:
+                    t = int(match.group(1))
+                    upper = t + 4
+                    if upper <= count:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}"),
+                                registry.get_id(f"{prefix}.col{col}_not_ge{upper}")]
+                    else:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}")]
+                bit2_ranges = []
+                for t in range(4, count + 1, 8):
+                    registry.register(f"{prefix}.col{col}_bit2_{t}")
+                    bit2_ranges.append(f"{prefix}.col{col}_bit2_{t}")
+                if f'.col{col}_bit2' in gate and f'.col{col}_bit2_' not in gate:
+                    return [registry.get_id(r) for r in bit2_ranges]
+                if bit2_ranges:
+                    registry.register(f"{prefix}.col{col}_bit2")
+            # col_bit3 gates (floor(sum/8) mod 2)
+            if count >= 8:
+                match = re.search(rf'\.col{col}_bit3_(\d+)$', gate)
+                if match:
+                    t = int(match.group(1))
+                    upper = t + 8
+                    if upper <= count:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}"),
+                                registry.get_id(f"{prefix}.col{col}_not_ge{upper}")]
+                    else:
+                        return [registry.get_id(f"{prefix}.col{col}_ge{t}")]
+                bit3_ranges = []
+                for t in range(8, count + 1, 16):
+                    registry.register(f"{prefix}.col{col}_bit3_{t}")
+                    bit3_ranges.append(f"{prefix}.col{col}_bit3_{t}")
+                if f'.col{col}_bit3' in gate and f'.col{col}_bit3_' not in gate:
+                    return [registry.get_id(r) for r in bit3_ranges]
+                if bit3_ranges:
+                    registry.register(f"{prefix}.col{col}_bit3")
+    # Handle carry accumulator gates
+    if '.carry_acc' in gate:
+        match = re.search(r'\.carry_acc(\d+)_', gate)
+        if match:
+            i = int(match.group(1))
+            def get_pp_count(col):
+                if col < 0 or col > 20:
+                    return 0
+                return min(col + 1, 21 - col)
+            # Determine which carry bits come into position i
+            carry_inputs = []
+            if i >= 1 and get_pp_count(i-1) >= 2:
+                carry_inputs.append(registry.get_id(f"{prefix}.col{i-1}_bit1"))
+            if i >= 2 and get_pp_count(i-2) >= 4:
+                carry_inputs.append(registry.get_id(f"{prefix}.col{i-2}_bit2"))
+            if i >= 3 and get_pp_count(i-3) >= 8:
+                carry_inputs.append(registry.get_id(f"{prefix}.col{i-3}_bit3"))
+            n = len(carry_inputs)
+            # ge{t} gates
+            match_ge = re.search(rf'\.carry_acc{i}_ge(\d+)$', gate)
+            if match_ge:
+                return carry_inputs
+            # not_ge{t} gates
+            match_not = re.search(rf'\.carry_acc{i}_not_ge(\d+)$', gate)
+            if match_not:
+                t = int(match_not.group(1))
+                return [registry.get_id(f"{prefix}.carry_acc{i}_ge{t}")]
+            # Register ge gates
+            for t in range(1, n + 1):
+                registry.register(f"{prefix}.carry_acc{i}_ge{t}")
+            for t in range(2, n + 1, 2):
+                registry.register(f"{prefix}.carry_acc{i}_not_ge{t}")
+            # odd{t} gates
+            match_odd = re.search(rf'\.carry_acc{i}_odd(\d+)$', gate)
+            if match_odd:
+                t = int(match_odd.group(1))
+                if t + 1 <= n:
+                    return [registry.get_id(f"{prefix}.carry_acc{i}_ge{t}"),
+                            registry.get_id(f"{prefix}.carry_acc{i}_not_ge{t+1}")]
+                else:
+                    return [registry.get_id(f"{prefix}.carry_acc{i}_ge{t}")]
+            # Register odd gates
+            odd_ranges = []
+            for t in range(1, n + 1, 2):
+                registry.register(f"{prefix}.carry_acc{i}_odd{t}")
+                odd_ranges.append(f"{prefix}.carry_acc{i}_odd{t}")
+            # carry_acc_sum = OR of odd gates
+            if f'.carry_acc{i}_sum' in gate:
+                return [registry.get_id(r) for r in odd_ranges]
+            registry.register(f"{prefix}.carry_acc{i}_sum")
+            # carry_acc_carry = ge2
+            if f'.carry_acc{i}_carry' in gate:
+                return carry_inputs
+            if n >= 2:
+                registry.register(f"{prefix}.carry_acc{i}_carry")
     if '.prod_fa' in gate:
         match = re.search(r'\.prod_fa(\d+)\.', gate)
         if match:
             i = int(match.group(1))
             fa_prefix = f"{prefix}.prod_fa{i}"
+            def get_pp_count(col):
+                if col < 0 or col > 20:
+                    return 0
+                return min(col + 1, 21 - col)
             # Count partial products in each column to determine signal names
             # col 0 and col 20 have 1 PP each, others have more
             def get_col_sum(col):
                     return registry.get_id(f"{prefix}.col{col}_sum")
                 return registry.get_id("#0")
+            def get_b_bit(pos):
+                # Determine incoming carries for position pos
+                carry_inputs = []
+                if pos >= 1 and get_pp_count(pos-1) >= 2:
+                    carry_inputs.append("bit1")
+                if pos >= 2 and get_pp_count(pos-2) >= 4:
+                    carry_inputs.append("bit2")
+                if pos >= 3 and get_pp_count(pos-3) >= 8:
+                    carry_inputs.append("bit3")
+                if len(carry_inputs) == 0:
+                    return registry.get_id("#0")
+                elif len(carry_inputs) == 1:
+                    # Single carry, use it directly
+                    if carry_inputs[0] == "bit1":
+                        return registry.get_id(f"{prefix}.col{pos-1}_bit1")
+                    elif carry_inputs[0] == "bit2":
+                        return registry.get_id(f"{prefix}.col{pos-2}_bit2")
                     else:
+                        return registry.get_id(f"{prefix}.col{pos-3}_bit3")
+                else:
+                    # Multiple carries, use accumulator sum
+                    return registry.register(f"{prefix}.carry_acc{pos}_sum")
+            def get_extra_cin(pos):
+                # Extra carry from accumulator (when sum of carries >= 2)
+                carry_inputs = []
+                if pos >= 1 and get_pp_count(pos-1) >= 2:
+                    carry_inputs.append("bit1")
+                if pos >= 2 and get_pp_count(pos-2) >= 4:
+                    carry_inputs.append("bit2")
+                if pos >= 3 and get_pp_count(pos-3) >= 8:
+                    carry_inputs.append("bit3")
+                if len(carry_inputs) >= 2:
+                    return registry.register(f"{prefix}.carry_acc{pos}_carry")
+                return None
             if i == 0:
                 a_bit = get_col_sum(0)
                 cin = registry.get_id("#0")
             else:
                 a_bit = get_col_sum(i) if i < 21 else registry.get_id("#0")
+                b_bit = get_b_bit(i)
                 cin = registry.register(f"{prefix}.prod_fa{i-1}.cout")
             if '.xor1.layer1' in gate:
             tensors[f"{prefix}.col{col}.weight"] = torch.tensor([1.0])
             tensors[f"{prefix}.col{col}.bias"] = torch.tensor([-0.5])
         else:
+            # Multi-bit column: compute parity (sum mod 2) using threshold gates
+            # parity = (ge1 AND NOT ge2) OR (ge3 AND NOT ge4) OR ...
+            # This captures: sum is odd when it lies in [1,2), [3,4), [5,6), etc.
+            # Threshold gates: ge{t} = 1 if sum >= t
+            for t in range(1, count + 1):
                 tensors[f"{prefix}.col{col}_ge{t}.weight"] = torch.tensor([1.0] * count)
                 tensors[f"{prefix}.col{col}_ge{t}.bias"] = torch.tensor([-float(t)])
+            # NOT gates for even thresholds
+            for t in range(2, count + 1, 2):
+                tensors[f"{prefix}.col{col}_not_ge{t}.weight"] = torch.tensor([-1.0])
+                tensors[f"{prefix}.col{col}_not_ge{t}.bias"] = torch.tensor([0.0])
+            # AND gates for odd ranges: (ge1 AND NOT ge2), (ge3 AND NOT ge4), ...
+            odd_ranges = []
+            for t in range(1, count + 1, 2):
+                if t + 1 <= count:
+                    # ge{t} AND NOT ge{t+1}
+                    tensors[f"{prefix}.col{col}_odd{t}.weight"] = torch.tensor([1.0, 1.0])
+                    tensors[f"{prefix}.col{col}_odd{t}.bias"] = torch.tensor([-2.0])
+                    odd_ranges.append(t)
+                else:
+                    # ge{t} only (no upper bound needed if t is max)
+                    tensors[f"{prefix}.col{col}_odd{t}.weight"] = torch.tensor([1.0])
+                    tensors[f"{prefix}.col{col}_odd{t}.bias"] = torch.tensor([-0.5])
+                    odd_ranges.append(t)
+            # col_sum = OR of all odd ranges (parity = bit 0)
+            num_odd = len(odd_ranges)
+            tensors[f"{prefix}.col{col}_sum.weight"] = torch.tensor([1.0] * num_odd)
+            tensors[f"{prefix}.col{col}_sum.bias"] = torch.tensor([-0.5])
+            # col_bit1 = floor(sum/2) mod 2, which is 1 when sum is in [2,3], [6,7], [10,11], ...
+            # This is (ge2 AND NOT ge4) OR (ge6 AND NOT ge8) OR ...
+            if count >= 2:
+                bit1_ranges = []
+                for t in range(2, count + 1, 4):
+                    upper = t + 2
+                    if upper <= count:
+                        tensors[f"{prefix}.col{col}_bit1_{t}.weight"] = torch.tensor([1.0, 1.0])
+                        tensors[f"{prefix}.col{col}_bit1_{t}.bias"] = torch.tensor([-2.0])
+                        if f"{prefix}.col{col}_not_ge{upper}.weight" not in tensors:
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.weight"] = torch.tensor([-1.0])
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.bias"] = torch.tensor([0.0])
+                    else:
+                        tensors[f"{prefix}.col{col}_bit1_{t}.weight"] = torch.tensor([1.0])
+                        tensors[f"{prefix}.col{col}_bit1_{t}.bias"] = torch.tensor([-0.5])
+                    bit1_ranges.append(t)
+                if bit1_ranges:
+                    tensors[f"{prefix}.col{col}_bit1.weight"] = torch.tensor([1.0] * len(bit1_ranges))
+                    tensors[f"{prefix}.col{col}_bit1.bias"] = torch.tensor([-0.5])
+            # col_bit2 = floor(sum/4) mod 2, which is 1 when sum is in [4,7], [12,15], ...
+            # This is (ge4 AND NOT ge8) OR (ge12 AND NOT ge16) OR ...
+            if count >= 4:
+                bit2_ranges = []
+                for t in range(4, count + 1, 8):
+                    upper = t + 4
+                    if upper <= count:
+                        tensors[f"{prefix}.col{col}_bit2_{t}.weight"] = torch.tensor([1.0, 1.0])
+                        tensors[f"{prefix}.col{col}_bit2_{t}.bias"] = torch.tensor([-2.0])
+                        if f"{prefix}.col{col}_not_ge{upper}.weight" not in tensors:
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.weight"] = torch.tensor([-1.0])
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.bias"] = torch.tensor([0.0])
+                    else:
+                        tensors[f"{prefix}.col{col}_bit2_{t}.weight"] = torch.tensor([1.0])
+                        tensors[f"{prefix}.col{col}_bit2_{t}.bias"] = torch.tensor([-0.5])
+                    bit2_ranges.append(t)
+                if bit2_ranges:
+                    tensors[f"{prefix}.col{col}_bit2.weight"] = torch.tensor([1.0] * len(bit2_ranges))
+                    tensors[f"{prefix}.col{col}_bit2.bias"] = torch.tensor([-0.5])
+            # col_bit3 = floor(sum/8) mod 2 (columns 7-13, which have 8 or more PPs)
+            if count >= 8:
+                bit3_ranges = []
+                for t in range(8, count + 1, 16):
+                    upper = t + 8
+                    if upper <= count:
+                        tensors[f"{prefix}.col{col}_bit3_{t}.weight"] = torch.tensor([1.0, 1.0])
+                        tensors[f"{prefix}.col{col}_bit3_{t}.bias"] = torch.tensor([-2.0])
+                        if f"{prefix}.col{col}_not_ge{upper}.weight" not in tensors:
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.weight"] = torch.tensor([-1.0])
+                            tensors[f"{prefix}.col{col}_not_ge{upper}.bias"] = torch.tensor([0.0])
+                    else:
+                        tensors[f"{prefix}.col{col}_bit3_{t}.weight"] = torch.tensor([1.0])
+                        tensors[f"{prefix}.col{col}_bit3_{t}.bias"] = torch.tensor([-0.5])
+                    bit3_ranges.append(t)
+                if bit3_ranges:
+                    tensors[f"{prefix}.col{col}_bit3.weight"] = torch.tensor([1.0] * len(bit3_ranges))
+                    tensors[f"{prefix}.col{col}_bit3.bias"] = torch.tensor([-0.5])
+    # Carry accumulator for multi-bit carries
+    # For position i, incoming carries are: bit1[i-1], bit2[i-2], bit3[i-3]
+    # We need to sum these and produce: carry_acc_sum (parity), carry_acc_carry (sum >= 2)
+    def get_pp_count(col):
+        if col < 0 or col > 20:
+            return 0
+        return min(col + 1, 21 - col)
+    for i in range(22):
+        # Determine which carry bits come into position i
+        carry_inputs = []
+        # bit1 from col[i-1]
+        if i >= 1 and get_pp_count(i-1) >= 2:
+            carry_inputs.append(f"bit1_{i-1}")
+        # bit2 from col[i-2]
+        if i >= 2 and get_pp_count(i-2) >= 4:
+            carry_inputs.append(f"bit2_{i-2}")
+        # bit3 from col[i-3]
+        if i >= 3 and get_pp_count(i-3) >= 8:
+            carry_inputs.append(f"bit3_{i-3}")
+        if len(carry_inputs) == 0:
+            # No carries, use #0
+            pass
+        elif len(carry_inputs) == 1:
+            # Single carry, no accumulator needed
+            pass
+        else:
+            # Multiple carries, need accumulator
+            n = len(carry_inputs)
+            # Parity (sum mod 2) using threshold gates
+            # ge{t} = sum >= t
+            for t in range(1, n + 1):
+                tensors[f"{prefix}.carry_acc{i}_ge{t}.weight"] = torch.tensor([1.0] * n)
+                tensors[f"{prefix}.carry_acc{i}_ge{t}.bias"] = torch.tensor([-float(t) + 0.5])
+            # NOT gates for even thresholds
+            for t in range(2, n + 1, 2):
+                tensors[f"{prefix}.carry_acc{i}_not_ge{t}.weight"] = torch.tensor([-1.0])
+                tensors[f"{prefix}.carry_acc{i}_not_ge{t}.bias"] = torch.tensor([0.0])
+            # AND gates for odd ranges: (ge1 AND NOT ge2), (ge3 AND NOT ge4), ...
+            odd_ranges = []
+            for t in range(1, n + 1, 2):
+                if t + 1 <= n:
+                    tensors[f"{prefix}.carry_acc{i}_odd{t}.weight"] = torch.tensor([1.0, 1.0])
+                    tensors[f"{prefix}.carry_acc{i}_odd{t}.bias"] = torch.tensor([-2.0])
+                else:
+                    tensors[f"{prefix}.carry_acc{i}_odd{t}.weight"] = torch.tensor([1.0])
+                    tensors[f"{prefix}.carry_acc{i}_odd{t}.bias"] = torch.tensor([-0.5])
+                odd_ranges.append(t)
+            # carry_acc_sum = OR of odd ranges
+            tensors[f"{prefix}.carry_acc{i}_sum.weight"] = torch.tensor([1.0] * len(odd_ranges))
+            tensors[f"{prefix}.carry_acc{i}_sum.bias"] = torch.tensor([-0.5])
+            # carry_acc_carry = ge2 (sum >= 2)
+            if n >= 2:
+                tensors[f"{prefix}.carry_acc{i}_carry.weight"] = torch.tensor([1.0] * n)
+                tensors[f"{prefix}.carry_acc{i}_carry.bias"] = torch.tensor([-1.5])
     # Final product assembly using ripple carry
     for i in range(22):
         p = f"{prefix}.prod_fa{i}"