# ISA reference card: 8-bit threshold-logic CPU

This is the architecture exposed by the safetensors files. Every instruction below is *implemented entirely as threshold neurons*; the same gate-level circuits run whether you simulate in Python (`eval.py` / `play.py` / `test_cpu.py`) or compile the CPU's threshold network through `safetensors2verilog` to FPGA-synthesizable Verilog.

## Architectural state

| Field | Width | Notes |
|---|---|---|
| PC | N bits | program counter; N = address width (0–16) |
| IR | 16 bits | instruction register |
| R0–R3 | 8 bits each | general-purpose registers |
| FLAGS | 4 bits | Z, N, C, V |
| SP | N bits | stack pointer (CALL/RET) |
| CTRL | 4 bits | HALT, MEM_WE, MEM_RE, RESERVED |
| MEM | 2^N × 8 bits | byte-addressable memory |

State tensor layout (MSB-first within each multi-bit field):

```
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
```

## Instruction encoding

```
15..12  11..10  9..8  7..0
opcode  rd      rs    imm8
```

| Class | Use of fields |
|---|---|
| **R-type** | `rd = rd op rs`; `imm8` ignored |
| **I-type** | `rd = op rd, imm8`; `rs` ignored |
| **Address-extended** | next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`. |

Address-extended instructions consume **4 bytes** (instruction word + address word). Untaken conditional jumps still skip the address word, so the PC advances by 4 even when the branch is not taken.

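
The bit layout above can be sketched as a pair of pack/unpack helpers. This is a minimal illustration, not the project's assembler: the names `encode` and `decode` are hypothetical, and the sketch works on 16-bit words only, so it makes no claim about how the instruction word is ordered in memory.

```python
def encode(opcode: int, rd: int = 0, rs: int = 0, imm8: int = 0) -> int:
    """Pack opcode[15:12], rd[11:10], rs[9:8], imm8[7:0] into one 16-bit word."""
    if not (0 <= opcode <= 0xF and 0 <= rd <= 3 and 0 <= rs <= 3 and 0 <= imm8 <= 0xFF):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word: int) -> tuple[int, int, int, int]:
    """Split a 16-bit instruction word back into (opcode, rd, rs, imm8)."""
    return (word >> 12) & 0xF, (word >> 10) & 0x3, (word >> 8) & 0x3, word & 0xFF


def encode_extended(opcode: int, addr: int, rd: int = 0, rs: int = 0, imm8: int = 0) -> list[int]:
    """Address-extended form: instruction word followed by a 16-bit address word."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address out of range")
    return [encode(opcode, rd, rs, imm8), addr]


word = encode(0x4, rd=1, rs=2)      # opcode 0x4 with rd=1, rs=2
print(hex(word))                    # 0x4600
print(decode(word))                 # (4, 1, 2, 0)
```

Working in whole words keeps the sketch independent of byte order; a real assembler still has to serialize each word to two bytes.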
## Opcode table

Class: R = R-type, A = address-extended.

| Opcode | Mnemonic | Class | Operation |
|---|---|---|---|
| 0x0 | ADD | R | R[rd] = R[rd] + R[rs] |
| 0x1 | SUB | R | R[rd] = R[rd] - R[rs] |
| 0x2 | AND | R | R[rd] = R[rd] & R[rs] |
| 0x3 | OR | R | R[rd] = R[rd] \| R[rs] |
| 0x4 | XOR | R | R[rd] = R[rd] ^ R[rs] |
| 0x5 | SHL | R | R[rd] = R[rd] << 1 |
| 0x6 | SHR | R | R[rd] = R[rd] >> 1 |
| 0x7 | MUL | R | R[rd] = R[rd] * R[rs] (low 8 bits) |
| 0x8 | DIV | R | R[rd] = R[rd] / R[rs] |
| 0x9 | CMP | R | flags = R[rd] - R[rs] (no writeback) |
| 0xA | LOAD | A | R[rd] = M[addr] |
| 0xB | STORE | A | M[addr] = R[rs] |
| 0xC | JMP | A | PC = addr |
| 0xD | Jcc | A | PC = addr if cond. imm8[2:0] selects condition |
| 0xE | CALL | A | push PC; PC = addr |
| 0xF | HALT | – | stop execution |

### Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)

| imm8[2:0] | Mnemonic | Fires when |
|---|---|---|
| 0 | JZ | Z flag set (last result was zero) |
| 1 | JNZ | Z flag clear |
| 2 | JC | carry-out set (last add overflowed unsigned) |
| 3 | JNC | carry-out clear |
| 4 | JN | result was negative (sign bit set) |
| 5 | JP | result was non-negative (sign bit clear) |
| 6 | JV | signed-overflow flag set |
| 7 | JNV | signed-overflow flag clear |

## Worked example: write your own program

The Python assembler in `cpu_programs.py` exposes one-method-per-mnemonic helpers on a tiny `Asm` class. Here's "store the value 7 to address 0x10, then halt":

```python
from cpu_programs import Asm

a = Asm(size=64)            # 64 bytes of memory

# Code starts at address 0. There is no LDI, so the constant 7 lives in
# memory and is fetched with LOAD.
a.org(0)
a.xor_(0, 0)                # R0 = 0 (optional; LOAD overwrites R0 anyway)
a.load(0, "seven")          # R0 = M[seven] = 7
a.store(0, "dest")          # M[dest] = R0
a.halt()

# Data: place each label after the org() that sets its address.
a.org(0x10)
a.label("dest"); a.db(0)    # destination cell at 0x10
a.org(32)
a.label("seven"); a.db(7)   # memory byte at addr 32 holds the constant 7

bytes_ = a.assemble()
```

Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.

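
To see where each instruction of such a program lands in memory, the size rule can be sketched as a small layout calculator. This is an illustration only: `size_of` and `layout` are hypothetical helpers, not part of `cpu_programs.py`, and the 2-byte size for non-extended instructions is inferred from the 16-bit instruction word.

```python
# Mnemonics that carry a trailing 16-bit address word (4 bytes total).
ADDRESS_EXTENDED = {"LOAD", "STORE", "JMP", "Jcc", "CALL"}


def size_of(mnemonic: str) -> int:
    """Bytes consumed by one instruction: 4 if address-extended, else 2 (assumed)."""
    return 4 if mnemonic in ADDRESS_EXTENDED else 2


def layout(mnemonics: list[str], origin: int = 0) -> tuple[list[tuple[int, str]], int]:
    """Return [(address, mnemonic), ...] and the first free byte after the code."""
    placed = []
    pc = origin
    for m in mnemonics:
        placed.append((pc, m))
        pc += size_of(m)
    return placed, pc


placed, end = layout(["XOR", "LOAD", "STORE", "HALT"])
for addr, m in placed:
    print(f"0x{addr:02X}  {m}")
print("first free byte:", end)   # 12
```

For the four-instruction program above the code occupies bytes 0 through 11, so any data cell must sit at address 12 or higher to avoid overlapping the instructions.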
## Using the CPU as a threshold-network forward pass

The CPU is a single tensor program. State in, state out. The driver:

1. Builds an initial state tensor with the program loaded at `MEM[0..]`.
2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
3. After at most `max_cycles` cycles (or earlier if the HALT control bit fires), reads the final memory contents.

Concretely, this is what `test_cpu.py` and `play.py` already do; both serve as runnable tutorials. The minimal driver loop is:

```python
from build import ThresholdComputer
from safetensors.torch import load_file

tensors = load_file("variants/neural_computer8_small.safetensors")
cpu = ThresholdComputer(tensors, data_bits=8)

state = cpu.initial_state(memory=bytes_)
state = cpu.run(state, max_cycles=200)

result = cpu.read_memory(state, addr=0x10)
print(result)  # 7
```

## Common pitfalls

- **No load-immediate.** `LOAD` reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and `LOAD` it.
- **Address-extended instructions are 4 bytes wide.** Branch targets must point at the start of an instruction word, not into the middle of one.
- **`MUL` keeps only the low 8 bits.** The high byte is discarded; detect overflow by `CMP`-ing the result against the value you expect after truncation.
- **`CMP` writes only flags**, never the destination register. It is normally followed by a `Jcc`.
- **`SHL` and `SHR` shift by 1.** There is no variable-amount shifter; chain them or compose with bit operations.

## Threshold-network artefacts you'll want next

- `python eval_all.py variants/.safetensors`: gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
- `python eval_all.py --cpu-program variants/.safetensors`: assembled program through the threshold-gated CPU.
- `python -m safetensors2verilog .safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v`: extract one circuit, dependency-closed, into synthesizable Verilog.
- `python -m safetensors2verilog ... --inspect`: print the port contract for any extracted circuit (which pins exist, what widths).
- `python -m safetensors2verilog ... --equiv-check`: automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.
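
When cross-checking CPU results by hand, a plain-Python oracle for a few of the semantics described in this card can be useful. The sketch below is an assumption-labelled illustration: `mul8` and `jcc_taken` are hypothetical names that model only the `MUL` truncation rule and the Jcc condition table; they are not taken from the repository.

```python
def mul8(a: int, b: int) -> tuple[int, bool]:
    """Model MUL: return (low 8 bits of a*b, True if the full product exceeded 8 bits)."""
    full = (a & 0xFF) * (b & 0xFF)
    return full & 0xFF, full > 0xFF


def jcc_taken(cond: int, z: bool, n: bool, c: bool, v: bool) -> bool:
    """Model the Jcc table: cond is imm8[2:0]; z, n, c, v are the FLAGS bits."""
    # Index order follows the table: JZ, JNZ, JC, JNC, JN, JP, JV, JNV.
    table = [z, not z, c, not c, n, not n, v, not v]
    return bool(table[cond & 0b111])


print(mul8(7, 3))      # (21, False)
print(mul8(16, 16))    # (0, True)
print(jcc_taken(1, z=False, n=False, c=False, v=False))   # True: JNZ fires when Z is clear
```

Comparing `mul8` against the byte the CPU leaves in `R[rd]` shows whether a product was truncated, which the hardware itself does not report.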