# ISA reference card — 8-bit threshold-logic CPU

This is the architecture exposed by the safetensors files. Every instruction below is *implemented entirely as threshold neurons*; the same gate-level circuits run whether you simulate in Python (`eval.py` / `play.py` / `test_cpu.py`) or compile the CPU's threshold network through `safetensors2verilog` to FPGA-synthesizable Verilog.

## Architectural state

| Field | Width | Notes |
|---|---|---|
| PC | N bits | program counter; N = address width (0–16) |
| IR | 16 bits | instruction register |
| R0–R3 | 8 bits each | general-purpose registers |
| FLAGS | 4 bits | Z, N, C, V |
| SP | N bits | stack pointer (CALL/RET) |
| CTRL | 4 bits | HALT, MEM_WE, MEM_RE, RESERVED |
| MEM | 2^N × 8 bits | byte-addressable memory |

State tensor layout (MSB-first within each multi-bit field):

```
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
```
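
The layout above can be turned into bit offsets with a few lines of Python. This is a minimal sketch that assumes the fields are packed back to back in exactly the order shown, with no padding. `state_layout` is a hypothetical helper for illustration, not a function from the repository.

```python
from typing import Dict, Tuple


def state_layout(n: int) -> Dict[str, Tuple[int, int]]:
    """Map each field name to (bit offset, bit width) for address width n.

    Assumption: fields are contiguous and in the documented order.
    """
    fields = [
        ("PC", n),
        ("IR", 16),
        ("R0", 8), ("R1", 8), ("R2", 8), ("R3", 8),
        ("FLAGS", 4),
        ("SP", n),
        ("CTRL", 4),
        ("MEM", (2 ** n) * 8),
    ]
    layout: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name, width in fields:
        layout[name] = (offset, width)
        offset += width
    layout["TOTAL"] = (0, offset)   # width entry holds the full state size
    return layout


layout = state_layout(6)            # N = 6 gives 64 bytes of memory
print(layout["MEM"])                # (68, 512)
print(layout["TOTAL"][1])           # 580
```

Under these assumptions the state is `2N + 56 + 8·2^N` bits wide, so memory dominates the tensor for all but the smallest N.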

## Instruction encoding

```
15..12   11..10   9..8   7..0
opcode   rd       rs     imm8
```
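
As an illustration of the field boundaries, the following sketch packs and unpacks a 16-bit instruction word. `encode` and `decode` are hypothetical names chosen here; the assembler in the repository may expose this differently.

```python
from typing import Tuple


def encode(opcode: int, rd: int = 0, rs: int = 0, imm8: int = 0) -> int:
    """Pack opcode[15:12], rd[11:10], rs[9:8], imm8[7:0] into one word."""
    if not (0 <= opcode < 16 and 0 <= rd < 4 and 0 <= rs < 4 and 0 <= imm8 < 256):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word: int) -> Tuple[int, int, int, int]:
    """Split a 16-bit word into (opcode, rd, rs, imm8)."""
    return (word >> 12) & 0xF, (word >> 10) & 0x3, (word >> 8) & 0x3, word & 0xFF


word = encode(0x0, rd=1, rs=2)      # opcode 0x0 with rd = R1, rs = R2
print(hex(word))                    # 0x600
print(decode(word))                 # (0, 1, 2, 0)
```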

| Class | Use of fields |
|---|---|
| **R-type** | `rd = rd op rs` — `imm8` ignored |
| **I-type** | `rd = op rd, imm8` — `rs` ignored |
| **Address-extended** (class `A` in the opcode table) | next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`. |

Address-extended instructions occupy **4 bytes** (a 2-byte instruction word plus a 2-byte address word). A conditional jump that is not taken still skips the address word, so the PC advances by 4 on fall-through.
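
The size rule can be expressed as a small helper. This is a sketch: the opcode numbers are taken from the opcode table in this document, and `instruction_length` and `fall_through_pc` are hypothetical names. It does not model PC wrap-around at the top of memory.

```python
# Opcodes of the address-extended class: LOAD, STORE, JMP, Jcc, CALL.
ADDRESS_EXTENDED = {0xA, 0xB, 0xC, 0xD, 0xE}


def instruction_length(word: int) -> int:
    """Size in bytes of the instruction whose first word is `word`."""
    opcode = (word >> 12) & 0xF
    return 4 if opcode in ADDRESS_EXTENDED else 2


def fall_through_pc(pc: int, word: int) -> int:
    """PC of the next sequential instruction (no branch taken)."""
    return pc + instruction_length(word)


print(fall_through_pc(0, 0xD000))   # 4  (untaken Jcc skips its address word)
print(fall_through_pc(4, 0x0600))   # 6  (a one-word instruction)
```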

## Opcode table

| Opcode | Mnemonic | Class | Operation |
|---|---|---|---|
| 0x0 | ADD     | R | R[rd] = R[rd] + R[rs] |
| 0x1 | SUB     | R | R[rd] = R[rd] - R[rs] |
| 0x2 | AND     | R | R[rd] = R[rd] & R[rs] |
| 0x3 | OR      | R | R[rd] = R[rd] \| R[rs] |
| 0x4 | XOR     | R | R[rd] = R[rd] ^ R[rs] |
| 0x5 | SHL     | R | R[rd] = R[rd] << 1 |
| 0x6 | SHR     | R | R[rd] = R[rd] >> 1 |
| 0x7 | MUL     | R | R[rd] = R[rd] * R[rs]   (low 8 bits) |
| 0x8 | DIV     | R | R[rd] = R[rd] / R[rs] |
| 0x9 | CMP     | R | flags = R[rd] - R[rs]   (no writeback) |
| 0xA | LOAD    | A | R[rd] = M[addr] |
| 0xB | STORE   | A | M[addr] = R[rs] |
| 0xC | JMP     | A | PC = addr |
| 0xD | Jcc     | A | PC = addr if cond.  imm8[2:0] selects condition |
| 0xE | CALL    | A | push PC; PC = addr |
| 0xF | HALT    | – | stop execution |
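
To make the table easier to check against raw memory dumps, here is a sketch of a one-instruction disassembler. The mnemonics and operand roles come from the table above; the textual syntax it prints (for example `LOAD R0, [0x0020]`) and the function name `disassemble` are assumptions made for this example.

```python
from typing import Optional

MNEMONICS = ["ADD", "SUB", "AND", "OR", "XOR", "SHL", "SHR", "MUL",
             "DIV", "CMP", "LOAD", "STORE", "JMP", "Jcc", "CALL", "HALT"]


def disassemble(word: int, addr: Optional[int] = None) -> str:
    """Render one instruction word; `addr` is the following address word, if any."""
    opcode = (word >> 12) & 0xF
    rd = (word >> 10) & 0x3
    rs = (word >> 8) & 0x3
    name = MNEMONICS[opcode]
    if name == "HALT":
        return "HALT"
    if name in ("SHL", "SHR"):          # shift by 1, rs unused
        return f"{name} R{rd}"
    if opcode <= 0x9:                   # remaining R-type instructions
        return f"{name} R{rd}, R{rs}"
    target = "?" if addr is None else f"0x{addr:04X}"
    if name == "LOAD":
        return f"LOAD R{rd}, [{target}]"
    if name == "STORE":
        return f"STORE R{rs}, [{target}]"
    return f"{name} {target}"           # JMP, Jcc, CALL


print(disassemble(0x0600))              # ADD R1, R2
print(disassemble(0xB100, 0x0010))      # STORE R1, [0x0010]
```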

### Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)

| imm8[2:0] | Mnemonic | Fires when |
|---|---|---|
| 0 | JZ | Z flag set (last result was zero) |
| 1 | JNZ | Z flag clear |
| 2 | JC | carry-out set (last add overflowed unsigned) |
| 3 | JNC | carry-out clear |
| 4 | JN | result was negative (sign bit set) |
| 5 | JP | result was non-negative (sign bit clear; includes zero) |
| 6 | JV | signed-overflow flag set |
| 7 | JNV | signed-overflow flag clear |
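
The table has a regular shape: bits 2..1 of the selector pick a flag (Z, C, N, V) and bit 0 inverts the test. A minimal sketch of that decoding, with the hypothetical name `condition_holds`:

```python
def condition_holds(sel: int, z: bool, n: bool, c: bool, v: bool) -> bool:
    """Evaluate the Jcc condition `sel` (imm8[2:0]) against the four flags."""
    flag = (z, c, n, v)[(sel & 0x7) >> 1]   # 0-1: Z, 2-3: C, 4-5: N, 6-7: V
    negate = bool(sel & 1)                  # odd selectors test "flag clear"
    return bool(flag) != negate


# After a CMP of two equal values, Z is set: JZ fires and JNZ does not.
print(condition_holds(0, z=True, n=False, c=False, v=False))   # True
print(condition_holds(1, z=True, n=False, c=False, v=False))   # False
```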

## Worked example: write your own program

The Python assembler in `cpu_programs.py` exposes one-method-per-mnemonic helpers on a tiny `Asm` class. Here's "store the value 7 to address 0x10, then halt":

```python
from cpu_programs import Asm

a = Asm(size=64)          # 64 bytes of memory

# Code at address 0. There is no LDI, so the constant 7 is LOADed from memory.
a.org(0)
a.xor_(0, 0)              # R0 = 0 (not strictly needed; LOAD overwrites R0)
a.load(0, "seven")        # R0 = M[seven] = 7
a.store(0, "dest")        # M[dest] = R0
a.halt()

# Data, placed above the code.
a.org(0x10)
a.label("dest"); a.db(0)  # destination cell at address 0x10

a.org(32)
a.label("seven"); a.db(7) # constant 7 at address 32

bytes_ = a.assemble()
```

Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.

## Using the CPU as a threshold-network forward pass

The CPU is a single tensor program. State in, state out. The driver:

1. Builds an initial state tensor with the program loaded at `MEM[0..]`.
2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
3. After at most `max_cycles` cycles (or earlier, if the HALT control bit is set), reads the final memory contents.
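
The control flow of these three steps can be sketched without the real network. In the example below, `toy_step` is a stand-in for one fetch-decode-execute cycle and the dictionary state is purely illustrative; the actual driver operates on the state tensor described earlier.

```python
from typing import Callable, Dict

State = Dict[str, int]


def run(step: Callable[[State], State], state: State, max_cycles: int) -> State:
    """Apply `step` until the halt bit is set or `max_cycles` is reached."""
    for _ in range(max_cycles):
        if state["halt"]:
            break
        state = step(state)         # state in, state out
    return state


def toy_step(state: State) -> State:
    """Hypothetical stand-in for one CPU cycle: count up, halt at 3."""
    count = state["count"] + 1
    return {"count": count, "halt": int(count >= 3)}


final = run(toy_step, {"count": 0, "halt": 0}, max_cycles=200)
print(final)                        # {'count': 3, 'halt': 1}
```

The cycle cap matters because a program that never reaches `HALT` would otherwise loop forever.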

Concretely, this is what `test_cpu.py` and `play.py` already do; both serve as runnable tutorials. The minimal driver loop is:

```python
from build import ThresholdComputer
from safetensors.torch import load_file

tensors = load_file("variants/neural_computer8_small.safetensors")
cpu = ThresholdComputer(tensors, data_bits=8)
state = cpu.initial_state(memory=bytes_)
state = cpu.run(state, max_cycles=200)
result = cpu.read_memory(state, addr=0x10)
print(result)   # 7
```

## Common pitfalls

- **No load-immediate.** `LOAD` reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and `LOAD` it.
- **Address-extended instructions are 4 bytes wide.** Branch targets must point at the start of an instruction word, not into the middle of one.
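- **`MUL` keeps only the low 8 bits.** The high byte of the product is discarded, so detect overflow in software, for example by using `CMP` to check the result against the value you expect after truncation.
- **`CMP` writes only flags**, never the destination register. Follow it with a `Jcc` to act on the comparison.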
- **`SHL` and `SHR` shift by 1.** No variable-amount shifter; chain them or compose with bit operations.
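
Two of these pitfalls are easy to model on the host before writing assembly. The sketch below assumes unsigned 8-bit register values and that `SHL` drops the bit shifted out of bit 7; both helper names are hypothetical.

```python
from typing import Tuple


def shl_by(value: int, k: int) -> int:
    """Shift left by k using k single-bit shifts, as a chain of SHL would."""
    for _ in range(k):
        value = (value << 1) & 0xFF     # keep the result in 8 bits
    return value


def mul_low8(a: int, b: int) -> Tuple[int, bool]:
    """Return (low 8 bits of a*b, True if the full product did not fit)."""
    product = a * b
    return product & 0xFF, product > 0xFF


print(shl_by(0x03, 4))      # 48  (0x30)
print(mul_low8(20, 13))     # (4, True)   since 260 = 0x104
print(mul_low8(5, 6))       # (30, False)
```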

## Threshold-network artefacts you'll want next

- `python eval_all.py variants/<file>.safetensors` — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
- `python eval_all.py --cpu-program variants/<file>.safetensors` — runs an assembled program through the threshold-gated CPU.
- `python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v` — extract one circuit, dependency-closed, into synthesizable Verilog.
- `python -m safetensors2verilog ... --inspect` — print the port contract for any extracted circuit (which pins exist, what widths).
- `python -m safetensors2verilog ... --equiv-check` — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.