CharlesCNorton

eval_all: hash-keyed result cache (--cache-dir, --no-cache); README: bit-ordering scope rules; docs/ISA.md: opcode reference and end-to-end tutorial; docs/float-pipeline.md: composition gap notes

597e7c2 5 days ago

ISA reference card — 8-bit threshold-logic CPU

This is the architecture exposed by the safetensors files. Every instruction below is implemented entirely as threshold neurons; the same gate-level circuits run whether you simulate in Python (eval.py / play.py / test_cpu.py) or compile the CPU's threshold network through safetensors2verilog to FPGA-synthesizable Verilog.

Architectural state

Field	Width	Notes
PC	N bits	program counter; N = address width (0–16)
IR	16 bits	instruction register
R0–R3	8 bits each	general-purpose registers
FLAGS	4 bits	Z, N, C, V
SP	N bits	stack pointer (CALL/RET)
CTRL	4 bits	HALT, MEM_WE, MEM_RE, RESERVED
MEM	2^N × 8 bits	byte-addressable memory

State tensor layout (MSB-first within each multi-bit field):

[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
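The layout above fixes the bit offset of every field once N is chosen. The sketch below computes those offsets and reads an MSB-first field out of a flat bit list. The names `state_layout` and `read_field` are hypothetical helpers for illustration, not functions from the repository.

```python
def state_layout(addr_bits: int) -> dict:
    """Return {field: (start_bit, width)} following the documented field order."""
    fields = [
        ("PC", addr_bits),
        ("IR", 16),
        ("R0", 8), ("R1", 8), ("R2", 8), ("R3", 8),
        ("FLAGS", 4),
        ("SP", addr_bits),
        ("CTRL", 4),
        ("MEM", (2 ** addr_bits) * 8),
    ]
    layout, offset = {}, 0
    for name, width in fields:
        layout[name] = (offset, width)
        offset += width
    layout["TOTAL"] = (0, offset)  # width of the whole state tensor
    return layout


def read_field(bits, start: int, width: int) -> int:
    """Interpret bits[start:start+width] as an MSB-first unsigned integer."""
    value = 0
    for b in bits[start:start + width]:
        value = (value << 1) | int(b)
    return value


layout = state_layout(addr_bits=6)   # 64 bytes of memory
print(layout["MEM"])                 # (68, 512)
print(layout["TOTAL"][1])            # 580
```

With N = 6 the memory region starts at bit 68 and the full state is 580 bits wide.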

Instruction encoding

15..12   11..10   9..8   7..0
opcode   rd       rs     imm8

Class	Use of fields
R-type	`rd = rd op rs` — `imm8` ignored
I-type	`rd = op rd, imm8` — `rs` ignored
Address-extended	next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`.

Address-extended instructions occupy 4 bytes (instruction word followed by address word). An untaken conditional jump still skips the address word, so it advances the PC by 4; a taken jump sets the PC to the target address instead.
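As an illustration of the field packing, here is a minimal encoder/decoder for the 16-bit word. The function names are hypothetical. Storing the instruction word itself high byte first is an assumption: the text above only states that the address word is big-endian.

```python
def encode(opcode: int, rd: int = 0, rs: int = 0, imm8: int = 0) -> int:
    """Pack opcode[15:12], rd[11:10], rs[9:8], imm8[7:0] into one 16-bit word."""
    if not (0 <= opcode <= 0xF and 0 <= rd <= 3 and 0 <= rs <= 3 and 0 <= imm8 <= 0xFF):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word: int) -> dict:
    """Split a 16-bit instruction word back into its four fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 10) & 0x3,
        "rs": (word >> 8) & 0x3,
        "imm8": word & 0xFF,
    }


def to_bytes(word: int, addr=None) -> bytes:
    """Serialize an instruction; address-extended forms append a 16-bit address.

    Assumption: the instruction word is stored big-endian, like the address word.
    """
    out = word.to_bytes(2, "big")
    if addr is not None:
        out += addr.to_bytes(2, "big")
    return out


add_r1_r2 = encode(0x0, rd=1, rs=2)          # ADD R1, R2
print(hex(add_r1_r2))                        # 0x600
print(to_bytes(encode(0xA, rd=0), 0x20))     # LOAD R0, [0x0020] -> 4 bytes
```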

Opcode table

Opcode	Mnemonic	Class	Operation
0x0	ADD	R	R[rd] = R[rd] + R[rs]
0x1	SUB	R	R[rd] = R[rd] - R[rs]
0x2	AND	R	R[rd] = R[rd] & R[rs]
0x3	OR	R	R[rd] = R[rd] | R[rs]
0x4	XOR	R	R[rd] = R[rd] ^ R[rs]
0x5	SHL	R	R[rd] = R[rd] << 1
0x6	SHR	R	R[rd] = R[rd] >> 1
0x7	MUL	R	R[rd] = R[rd] * R[rs] (low 8 bits)
0x8	DIV	R	R[rd] = R[rd] / R[rs]
0x9	CMP	R	flags = R[rd] - R[rs] (no writeback)
0xA	LOAD	A	R[rd] = M[addr]
0xB	STORE	A	M[addr] = R[rs]
0xC	JMP	A	PC = addr
0xD	Jcc	A	PC = addr if cond. imm8[2:0] selects condition
0xE	CALL	A	push PC; PC = addr
0xF	HALT	–	stop execution
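The opcode table is enough to walk a byte stream and recover mnemonics. The following is a rough disassembler sketch, not a tool shipped with the repository: `disassemble` is a hypothetical name, the output syntax is invented for readability, and the instruction word is assumed to be stored big-endian.

```python
# (mnemonic, class) per opcode, copied from the table above.
OPCODES = {
    0x0: ("ADD", "R"), 0x1: ("SUB", "R"), 0x2: ("AND", "R"), 0x3: ("OR", "R"),
    0x4: ("XOR", "R"), 0x5: ("SHL", "R"), 0x6: ("SHR", "R"), 0x7: ("MUL", "R"),
    0x8: ("DIV", "R"), 0x9: ("CMP", "R"), 0xA: ("LOAD", "A"), 0xB: ("STORE", "A"),
    0xC: ("JMP", "A"), 0xD: ("Jcc", "A"), 0xE: ("CALL", "A"), 0xF: ("HALT", "-"),
}


def disassemble(code: bytes):
    """Return a list of (address, text) pairs for the instructions in `code`."""
    out, pc = [], 0
    while pc + 2 <= len(code):
        word = int.from_bytes(code[pc:pc + 2], "big")
        mnemonic, cls = OPCODES[word >> 12]
        rd, rs, imm8 = (word >> 10) & 0x3, (word >> 8) & 0x3, word & 0xFF
        size = 4 if cls == "A" else 2      # address-extended forms are 4 bytes
        if pc + size > len(code):
            break                          # truncated trailing instruction
        if cls == "R":
            text = f"{mnemonic} R{rd}, R{rs}"   # rs is printed even for SHL/SHR
        elif cls == "A":
            addr = int.from_bytes(code[pc + 2:pc + 4], "big")
            if mnemonic == "LOAD":
                text = f"LOAD R{rd}, [0x{addr:04X}]"
            elif mnemonic == "STORE":
                text = f"STORE R{rs}, [0x{addr:04X}]"
            elif mnemonic == "Jcc":
                text = f"Jcc cond={imm8 & 0x7}, 0x{addr:04X}"
            else:
                text = f"{mnemonic} 0x{addr:04X}"
        else:
            text = mnemonic
        out.append((pc, text))
        pc += size
    return out


program = bytes.fromhex("4000 A0000020 B0000010 F000")
for addr, text in disassemble(program):
    print(f"{addr:04X}: {text}")
```

A linear walk like this cannot tell code from data, so it should only be pointed at the code region of a program image.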

Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)

imm8[2:0]	Mnemonic	Fires when
0	JZ	Z flag set (last result was zero)
1	JNZ	Z flag clear
2	JC	carry-out set (last add overflowed unsigned)
3	JNC	carry-out clear
4	JN	result was negative (sign bit set)
5	JP	result was non-negative (sign bit clear; zero counts)
6	JV	signed-overflow flag set
7	JNV	signed-overflow flag clear
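The condition table pairs each flag with a "set" and a "clear" variant. A small software model of the selection logic, with `jcc_taken` as a hypothetical helper name and flags passed as a plain dict, might look like this:

```python
# (mnemonic, flag tested, value that makes the jump fire), indexed by imm8[2:0].
CONDITIONS = [
    ("JZ", "Z", True), ("JNZ", "Z", False),
    ("JC", "C", True), ("JNC", "C", False),
    ("JN", "N", True), ("JP", "N", False),
    ("JV", "V", True), ("JNV", "V", False),
]


def jcc_taken(imm8: int, flags: dict) -> bool:
    """Return True if the Jcc selected by imm8[2:0] fires for the given flags."""
    _, flag, fires_when = CONDITIONS[imm8 & 0b111]   # upper bits of imm8 ignored
    return bool(flags[flag]) == fires_when


flags = {"Z": True, "N": False, "C": False, "V": False}   # e.g. after a result of 0
print(jcc_taken(0, flags))   # JZ  -> True
print(jcc_taken(1, flags))   # JNZ -> False
print(jcc_taken(5, flags))   # JP  -> True (sign bit clear)
```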

Worked example: write your own program

The Python assembler in cpu_programs.py exposes one-method-per-mnemonic helpers on a tiny Asm class. Here's "store the value 7 to address 0x10, then halt":

from cpu_programs import Asm

a = Asm(size=64)           # 64 bytes of memory

# Code starts at address 0. There is no LDI, so the constant 7 lives in
# memory and is fetched with LOAD.
a.org(0)
a.xor_(0, 0)               # R0 = 0 (optional: LOAD overwrites R0 anyway)
a.load(0, "seven")         # R0 = M[seven] = 7
a.store(0, "dest")         # M[dest] = R0
a.halt()

# Place each label after its org() so it binds to the intended address.
a.org(0x10)
a.label("dest"); a.db(0)   # destination cell at 0x10

a.org(32)
a.label("seven"); a.db(7)  # constant 7 at address 32

bytes_ = a.assemble()

Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.

Using the CPU as a threshold-network forward pass

The CPU is a single tensor program. State in, state out. The driver:

Builds an initial state tensor with the program loaded at MEM[0..].
Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
After at most max_cycles cycles (or earlier if the HALT control bit fires), reads the final memory contents.
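The three steps above can be mirrored in plain Python to sanity-check a program before handing it to the threshold network. This is a software stand-in written for illustration, not the repository's driver: `run_reference` is a hypothetical name, only a subset of opcodes is modelled, flags are not tracked, and both the big-endian instruction word and the address wrap-around are assumptions.

```python
ALU = {
    0x0: lambda a, b: a + b,   # ADD
    0x1: lambda a, b: a - b,   # SUB
    0x2: lambda a, b: a & b,   # AND
    0x3: lambda a, b: a | b,   # OR
    0x4: lambda a, b: a ^ b,   # XOR
}


def run_reference(program: bytes, mem_size: int = 64, max_cycles: int = 200):
    """Run a program image and return (memory, registers, halted)."""
    if len(program) > mem_size:
        raise ValueError("program does not fit in memory")
    # Step 1: initial state with the program loaded at MEM[0..].
    mem = bytearray(mem_size)
    mem[:len(program)] = program
    regs, pc, halted, cycles = [0, 0, 0, 0], 0, False, 0
    # Step 2: one fetch-decode-execute cycle per iteration.
    while not halted and cycles < max_cycles:
        word = (mem[pc % mem_size] << 8) | mem[(pc + 1) % mem_size]
        opcode, rd, rs = word >> 12, (word >> 10) & 0x3, (word >> 8) & 0x3
        if opcode in ALU:
            regs[rd] = ALU[opcode](regs[rd], regs[rs]) & 0xFF   # wrap to 8 bits
            pc += 2
        elif opcode in (0xA, 0xB, 0xC):
            raw = (mem[(pc + 2) % mem_size] << 8) | mem[(pc + 3) % mem_size]
            addr = raw % mem_size          # assumption: addresses wrap
            if opcode == 0xA:              # LOAD
                regs[rd] = mem[addr]
                pc += 4
            elif opcode == 0xB:            # STORE
                mem[addr] = regs[rs]
                pc += 4
            else:                          # JMP
                pc = addr
        elif opcode == 0xF:                # HALT
            halted = True
        else:
            raise NotImplementedError(f"opcode 0x{opcode:X} not modelled")
        cycles += 1
    # Step 3: the caller reads the final memory contents.
    return mem, regs, halted


image = bytearray(33)
image[0:12] = bytes.fromhex("4000 A0000020 B0000010 F000")
image[32] = 7                              # constant at address 32
mem, regs, halted = run_reference(bytes(image))
print(mem[0x10], halted)                   # 7 True
```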

Concretely, this is what test_cpu.py and play.py already do; both serve as runnable tutorials. The minimal driver loop is:

from build import ThresholdComputer
from safetensors.torch import load_file

tensors = load_file("variants/neural_computer8_small.safetensors")
cpu = ThresholdComputer(tensors, data_bits=8)
state = cpu.initial_state(memory=bytes_)
state = cpu.run(state, max_cycles=200)
result = cpu.read_memory(state, addr=0x10)
print(result)   # 7

Common pitfalls

No load-immediate. LOAD reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and LOAD it.
Address-extended instructions are 4 bytes wide. Branch targets must point at the start of an instruction word, not into the middle of one.
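MUL keeps only the low 8 bits; the high byte of the product is discarded. To detect overflow, check the truncated result in software, for example by dividing it by one operand and using CMP against the other.
CMP writes only flags, never the destination register. Follow it with a Jcc to act on the result.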
SHL and SHR shift by 1. No variable-amount shifter; chain them or compose with bit operations.
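Two of the pitfalls above have simple software workarounds, modelled here in plain Python rather than in assembly. The helper names are hypothetical, and the overflow check assumes DIV performs unsigned floor division, which the opcode table does not spell out.

```python
def mul8(a: int, b: int) -> int:
    """Model MUL: only the low 8 bits of the product survive."""
    return (a * b) & 0xFF


def mul_overflowed(a: int, b: int) -> bool:
    """Detect a truncated product by dividing it back (assumes floor DIV)."""
    if b == 0:
        return False              # product is 0, nothing was lost
    return mul8(a, b) // b != a   # quotient differs only when bits were lost


def shl_n(value: int, n: int) -> int:
    """Shift left by n by chaining n single-bit SHL steps."""
    for _ in range(n):
        value = (value << 1) & 0xFF
    return value


# Exhaustive check of the divide-back test over all 8-bit operand pairs.
ok = all(mul_overflowed(a, b) == (a * b > 0xFF)
         for a in range(256) for b in range(256))
print(ok)                # True
print(shl_n(1, 3))       # 8
print(mul8(16, 16))      # 0 (256 truncated to 8 bits)
```

The divide-back test works because a truncated product is strictly smaller than the true product, so its quotient by `b` can never reach `a`.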

Threshold-network artefacts you'll want next

python eval_all.py variants/<file>.safetensors — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
python eval_all.py --cpu-program variants/<file>.safetensors — assembled program through the threshold-gated CPU.
python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v — extract one circuit, dependency-closed, into synthesizable Verilog.
python -m safetensors2verilog ... --inspect — print the port contract for any extracted circuit (which pins exist, what widths).
python -m safetensors2verilog ... --equiv-check — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.