CharlesCNorton
eval_all: hash-keyed result cache (--cache-dir, --no-cache); README: bit-ordering scope rules; docs/ISA.md: opcode reference and end-to-end tutorial; docs/float-pipeline.md: composition gap notes
597e7c2

ISA reference card — 8-bit threshold-logic CPU

This is the architecture exposed by the safetensors files. Every instruction below is implemented entirely as threshold neurons; the same gate-level circuits run whether you simulate in Python (eval.py / play.py / test_cpu.py) or compile the CPU's threshold network through safetensors2verilog to FPGA-synthesizable Verilog.

Architectural state

Field   Width          Notes
PC      N bits         program counter; N = address width (0–16)
IR      16 bits        instruction register
R0–R3   8 bits each    general-purpose registers
FLAGS   4 bits         Z, N, C, V
SP      N bits         stack pointer (CALL/RET)
CTRL    4 bits         HALT, MEM_WE, MEM_RE, RESERVED
MEM     2^N × 8 bits   byte-addressable memory

State tensor layout (MSB-first within each multi-bit field):

[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
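
The layout above can be turned into bit offsets mechanically. The following is a minimal sketch, assuming the state is a flat sequence of 0/1 values in exactly the order shown, starting at offset 0; `state_layout` and `read_field` are hypothetical helper names, not part of the repository.

```python
def state_layout(n):
    """Return ({field: (offset, width)}, total_bits) for address width n.

    Field order and widths follow the state tensor layout shown above.
    """
    fields = [
        ("PC", n), ("IR", 16),
        ("R0", 8), ("R1", 8), ("R2", 8), ("R3", 8),
        ("FLAGS", 4), ("SP", n), ("CTRL", 4),
        ("MEM", (2 ** n) * 8),
    ]
    layout, offset = {}, 0
    for name, width in fields:
        layout[name] = (offset, width)
        offset += width
    return layout, offset


def read_field(bits, layout, name):
    """Decode one MSB-first field from a flat sequence of 0/1 values."""
    offset, width = layout[name]
    value = 0
    for bit in bits[offset:offset + width]:
        value = (value << 1) | int(bit)
    return value


layout, total_bits = state_layout(4)   # N = 4, i.e. 16 bytes of memory
```

With N = 4 the fixed fields occupy the first 64 bits and memory the remaining 128, so `total_bits` is 192. `read_field` is meant for the scalar fields; memory bytes would be read by adding `8 * addr` to the MEM offset.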

Instruction encoding

15..12   11..10   9..8   7..0
opcode   rd       rs     imm8

Class              Use of fields
R-type             rd = rd op rs; imm8 ignored
I-type             rd = op rd, imm8; rs ignored
Address-extended   next 16-bit word is the absolute address (big-endian); imm8 reserved, except for Jcc, which uses imm8[2:0] to select the condition. Applies to LOAD, STORE, JMP, Jcc, CALL.

Address-extended instructions occupy 4 bytes (instruction word + address word). An untaken conditional jump still skips the address word, so the PC advances by 4 whenever the jump is not taken.
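
To make the bit fields concrete, here is a small sketch that packs and unpacks the 16-bit instruction word and computes the fall-through PC. The function names are hypothetical; the field positions and the set of address-extended opcodes are taken from this document. It works on 16-bit integers only and makes no assumption about how the instruction word is split into bytes in memory.

```python
# Opcodes followed by a 16-bit address word: LOAD, STORE, JMP, Jcc, CALL.
ADDRESS_EXTENDED = {0xA, 0xB, 0xC, 0xD, 0xE}


def encode(opcode, rd=0, rs=0, imm8=0):
    """Pack opcode[15:12], rd[11:10], rs[9:8], imm8[7:0] into one word."""
    if not (0 <= opcode <= 0xF and 0 <= rd <= 3
            and 0 <= rs <= 3 and 0 <= imm8 <= 0xFF):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word):
    """Split a 16-bit instruction word back into its four fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 10) & 0x3,
        "rs": (word >> 8) & 0x3,
        "imm8": word & 0xFF,
    }


def instruction_size(opcode):
    """Size in bytes: 4 for address-extended instructions, otherwise 2."""
    return 4 if opcode in ADDRESS_EXTENDED else 2


def fallthrough_pc(pc, opcode):
    """PC after an instruction that does not redirect control flow."""
    return pc + instruction_size(opcode)
```

For example, `encode(0x0, rd=1, rs=2)` yields `0x0600` (ADD R1, R2), and an untaken Jcc at address 8 falls through to address 12.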

Opcode table

Opcode   Mnemonic   Class   Operation
0x0      ADD        R       R[rd] = R[rd] + R[rs]
0x1      SUB        R       R[rd] = R[rd] - R[rs]
0x2      AND        R       R[rd] = R[rd] & R[rs]
0x3      OR         R       R[rd] = R[rd] | R[rs]
0x4      XOR        R       R[rd] = R[rd] ^ R[rs]
0x5      SHL        R       R[rd] = R[rd] << 1
0x6      SHR        R       R[rd] = R[rd] >> 1
0x7      MUL        R       R[rd] = R[rd] * R[rs] (low 8 bits)
0x8      DIV        R       R[rd] = R[rd] / R[rs]
0x9      CMP        R       flags = R[rd] - R[rs] (no writeback)
0xA      LOAD       A       R[rd] = M[addr]
0xB      STORE      A       M[addr] = R[rs]
0xC      JMP        A       PC = addr
0xD      Jcc        A       PC = addr if cond. imm8[2:0] selects condition
0xE      CALL       A       push PC; PC = addr
0xF      HALT               stop execution
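
As a quick way to predict register results by hand, the R-type rows above can be modelled in plain Python. This is a reference sketch only, not the threshold-network implementation. It assumes results wrap modulo 256, SHR is a logical shift, and DIV is unsigned integer division; the table does not say what DIV by zero produces, so this sketch simply raises. `alu` is a hypothetical name, and flags are not modelled.

```python
def alu(opcode, a, b):
    """8-bit result of R-type opcode 0x0..0x8 with a = R[rd], b = R[rs]."""
    ops = {
        0x0: lambda: a + b,    # ADD
        0x1: lambda: a - b,    # SUB
        0x2: lambda: a & b,    # AND
        0x3: lambda: a | b,    # OR
        0x4: lambda: a ^ b,    # XOR
        0x5: lambda: a << 1,   # SHL (b unused)
        0x6: lambda: a >> 1,   # SHR (b unused; logical shift assumed)
        0x7: lambda: a * b,    # MUL (only the low 8 bits survive)
        0x8: lambda: a // b,   # DIV (unsigned assumed; b == 0 raises here)
    }
    if opcode not in ops:
        raise ValueError("not an R-type opcode with register writeback")
    return ops[opcode]() & 0xFF   # registers are 8 bits wide
```

For instance, `alu(0x7, 16, 16)` returns 0, because 256 does not fit in 8 bits.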

Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)

imm8[2:0]   Mnemonic   Fires when
0           JZ         Z flag set (last result was zero)
1           JNZ        Z flag clear
2           JC         carry-out set (last add overflowed unsigned)
3           JNC        carry-out clear
4           JN         result was negative (sign bit set)
5           JP         result was non-negative (sign bit clear; zero also qualifies)
6           JV         signed-overflow flag set
7           JNV        signed-overflow flag clear
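
The condition table has a regular structure: the upper two bits of the 3-bit code pick a flag (Z, C, N, V) and the lowest bit inverts the test. A minimal sketch of that decoding, with hypothetical names, assuming the flags are passed in as 0/1 values:

```python
JCC_MNEMONICS = ("JZ", "JNZ", "JC", "JNC", "JN", "JP", "JV", "JNV")


def jcc_taken(imm8, z, n, c, v):
    """Return True if the Jcc selected by imm8[2:0] fires for these flags."""
    cond = imm8 & 0b111                # only the low three bits matter
    flag = (z, c, n, v)[cond >> 1]     # codes 0-1: Z, 2-3: C, 4-5: N, 6-7: V
    wants_clear = bool(cond & 1)       # odd codes fire when the flag is clear
    return bool(flag) != wants_clear
```

For example, `jcc_taken(5, z=0, n=0, c=0, v=0)` is True: JP fires whenever the sign bit is clear.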

Worked example: write your own program

The Python assembler in cpu_programs.py exposes one-method-per-mnemonic helpers on a tiny Asm class. Here's "store the value 7 to address 0x10, then halt":

from cpu_programs import Asm

a = Asm(size=64)           # 64 bytes of memory

# There is no LDI, so the constant lives in memory and is LOADed.
a.org(32)
a.label("seven"); a.db(7)  # memory byte at addr 32 holds the constant 7

# Destination cell at 0x10, clear of the 12 bytes of code at 0x00-0x0B.
a.org(0x10)
a.label("dest"); a.db(0)

a.org(0)
a.xor_(0, 0)               # R0 = 0 (optional; LOAD overwrites R0 anyway)
a.load(0, "seven")         # R0 = M[seven] = 7
a.store(0, "dest")         # M[dest] = R0
a.halt()

bytes_ = a.assemble()

Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.

Using the CPU as a threshold-network forward pass

The CPU is a single tensor program. State in, state out. The driver:

  1. Builds an initial state tensor with the program loaded at MEM[0..].
  2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
  3. After at most max_cycles cycles (or earlier, if the HALT control bit is set), reads the final memory contents.

Concretely, this is what test_cpu.py and play.py already do; both serve as runnable tutorials. The minimal driver loop is:

from build import ThresholdComputer
from safetensors.torch import load_file

tensors = load_file("variants/neural_computer8_small.safetensors")
cpu = ThresholdComputer(tensors, data_bits=8)
state = cpu.initial_state(memory=bytes_)
state = cpu.run(state, max_cycles=200)
result = cpu.read_memory(state, addr=0x10)
print(result)   # 7

Common pitfalls

  • No load-immediate. LOAD reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and LOAD it.
  • Address-extended instructions are 4 bytes wide. Branch targets must point at the start of an instruction word, not into the middle of one.
  • MUL keeps only the low 8 bits. To detect overflow, CMP the truncated result against the value you expect.
  • CMP writes only flags, never the destination register. Follow it with a Jcc to act on the result.
  • SHL and SHR shift by 1. No variable-amount shifter; chain them or compose with bit operations.
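
Two of these pitfalls can be checked on paper before writing assembly. The sketch below models a multi-bit shift as a chain of single SHL steps, and one possible way to detect MUL overflow: divide the truncated product back by the second operand and compare with the first. The divide-back check is an illustration chosen here, not necessarily the method the MUL bullet has in mind, and it assumes DIV is unsigned integer division. Both function names are hypothetical.

```python
def shl_by(value, count):
    """Shift an 8-bit value left by `count` bits, one SHL at a time."""
    for _ in range(count):
        value = (value << 1) & 0xFF   # one SHL instruction per iteration
    return value


def mul_overflows(a, b):
    """True if a * b does not fit in 8 bits, using only 8-bit operations."""
    low = (a * b) & 0xFF              # what MUL leaves in the register
    if b == 0:
        return False                  # the product is 0; avoid DIV by zero
    return (low // b) != a            # DIV, then CMP against a, then JNZ
```

On the CPU the second function corresponds to a MUL, DIV, CMP, JNZ sequence, with a copy of the first operand kept in a spare register or memory cell.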

Threshold-network artefacts you'll want next

  • python eval_all.py variants/<file>.safetensors — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
  • python eval_all.py --cpu-program variants/<file>.safetensors — assembled program through the threshold-gated CPU.
  • python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v — extract one circuit, dependency-closed, into synthesizable Verilog.
  • python -m safetensors2verilog ... --inspect — print the port contract for any extracted circuit (which pins exist, what widths).
  • python -m safetensors2verilog ... --equiv-check — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.