RV32I CPU in ONNX
A complete RISC-V RV32I integer CPU implemented as a pure ONNX computation graph. It implements all 11 base-ISA opcodes (LUI, AUIPC, JAL, JALR, BRANCH, LOAD, STORE, OP-IMM, OP, MISC-MEM, SYSTEM) with no custom operators, and the standard ONNX Runtime 1.26 CPU EP runs it unmodified.
Real RV32I machine code, cross-compiled from C with riscv-none-elf-gcc,
executes inside the model. The output is a tensor of framebuffer
snapshots, one per outer-loop iteration.
The GIF above is the direct output of Run() on bouncing_demo.onnx:
a uint8[200, 32, 64] tensor returned in one call, with no inputs.
The pixel data was generated by bouncing.c (also in this repo) running
on the RV32I core inside the model.
Two models, one CPU
| File | Purpose | Inputs | Outputs |
|---|---|---|---|
| rv32i_cpu.onnx | Generic RV32I CPU. Load any RV32I binary into RAM. | pc_in, regs_in, ram_in, trip_count | pc_out, regs_out, ram_out |
| bouncing_demo.onnx | Fully self-contained: bouncing ball ELF baked in. | (none) | uint8[200, 32, 64] frames |
How it works
State
The CPU state is three loop-carried tensors:
| Tensor | Shape | Dtype | Holds |
|---|---|---|---|
| pc | scalar | int32 | byte program counter |
| regs | [32] | int32 | x0..x31 (x0 forced to zero on every writeback) |
| ram | [ram_size] | uint8 | program, data, framebuffer, MMIO |
Per-instruction body
Each iteration of the inner Loop fetches, decodes, and executes one
RV32I instruction. The dispatch is branchless: every opcode's
candidate next-state is computed in parallel, then Where cascades
pick the matching one. This trades wasted work for a flat, regular
graph.
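The pattern can be illustrated outside ONNX. The following is a minimal pure-Python sketch, not the model's actual graph: where() is a hypothetical scalar stand-in for the ONNX Where op, only three opcodes are covered, and all function and variable names are made up for illustration.

```python
# Illustrative only: a scalar stand-in for the ONNX Where op.
def where(cond, a, b):
    return a if cond else b

OP_LUI, OP_AUIPC, OP_JAL = 0x37, 0x17, 0x6F  # RV32I major opcodes

def to_i32(x):
    """Wrap a Python int into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def step(opcode, pc, imm_u, imm_j):
    """Return (next_pc, rd_value) for a three-opcode subset."""
    # 1. Every candidate is computed, whether or not it will be used.
    lui_rd = to_i32(imm_u)
    auipc_rd = to_i32(pc + imm_u)
    jal_rd = to_i32(pc + 4)
    jal_pc = to_i32(pc + imm_j)
    seq_pc = to_i32(pc + 4)

    # 2. A cascade of selects picks the candidate matching the opcode.
    rd_value = where(opcode == OP_LUI, lui_rd,
               where(opcode == OP_AUIPC, auipc_rd,
               where(opcode == OP_JAL, jal_rd, 0)))
    next_pc = where(opcode == OP_JAL, jal_pc, seq_pc)
    return next_pc, rd_value
```

In the sketch, as in the description above, the cost of an instruction is the cost of all candidates combined; the select cascade only decides which result survives.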
fetch (Gather 4 bytes at PC → assemble little-endian int32)
↓
decode (BitwiseAnd/BitShift to extract opcode, rd, rs1, rs2, funct3/7
  and the five immediate flavours: I, S, B, U, J)
↓
all 11 opcodes compute candidate (next_pc, next_regs) in parallel:
  LUI · AUIPC · JAL · JALR · BRANCH ·
  OP-IMM (ADDI/SLTI/.../SRAI) · OP (ADD/SUB/.../AND) ·
  LOAD (LB/LH/LW/LBU/LHU) ·
  STORE (SB/SH/SW) ← only candidate that writes ram
  MISC-MEM (FENCE → NOP) · SYSTEM (ECALL/EBREAK → exit Loop)
↓
Where-cascade dispatch: pick the matching candidate by opcode
↓
emit (pc, regs, ram) for next iteration
The body is ~430 ONNX nodes. Model file is ~28 KB. Loop's cond_out is
wired so ECALL/EBREAK exits the loop early.
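For reference, the fetch and decode stages can be sketched in plain Python. This is an illustration of the RV32I instruction formats, not a transcription of the ONNX graph; the function names and the dictionary layout are assumptions made for the example.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def fetch(ram, pc):
    """Assemble a little-endian 32-bit word from the four bytes at pc."""
    b0, b1, b2, b3 = ram[pc:pc + 4]
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def decode(inst):
    """Split an unsigned 32-bit RV32I word into fields and immediates."""
    return {
        "opcode": inst & 0x7F,
        "rd": (inst >> 7) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rs1": (inst >> 15) & 0x1F,
        "rs2": (inst >> 20) & 0x1F,
        "funct7": (inst >> 25) & 0x7F,
        "imm_i": sign_extend(inst >> 20, 12),
        "imm_s": sign_extend(((inst >> 25) << 5) | ((inst >> 7) & 0x1F), 12),
        "imm_b": sign_extend(
            ((inst >> 31) << 12) | (((inst >> 7) & 1) << 11)
            | (((inst >> 25) & 0x3F) << 5) | (((inst >> 8) & 0xF) << 1), 13),
        "imm_u": sign_extend(inst & 0xFFFFF000, 32),
        "imm_j": sign_extend(
            ((inst >> 31) << 20) | (((inst >> 12) & 0xFF) << 12)
            | (((inst >> 20) & 1) << 11) | (((inst >> 21) & 0x3FF) << 1), 21),
    }
```

For example, decode(0x00500093) (the encoding of addi x1, x0, 5) yields opcode 0x13, rd 1 and imm_i 5.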
MMIO layout
The top 16 bytes of RAM are reserved as memory-mapped I/O:
| Offset (from end of RAM) | Purpose |
|---|---|
| -16..-12 | Tick counter |
| -12..-8 | Keys state |
| -8..-4 | putchar port |
| -4..0 | Halt port (any write sets halt flag) |
The framebuffer occupies the FB_BYTES bytes ending at RAM_SIZE - 16.
For the bouncing-ball demo: 64 × 32 = 2048 bytes.
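The layout above translates into absolute addresses as follows. This is a small sketch; the function name and dictionary keys are hypothetical, and it assumes each port starts at the lower bound of its 4-byte range.

```python
def mmio_layout(ram_size, fb_bytes):
    """Return byte addresses for the MMIO ports and the framebuffer range."""
    mmio_base = ram_size - 16
    return {
        "tick": mmio_base,          # offsets -16..-12
        "keys": mmio_base + 4,      # offsets -12..-8
        "putchar": mmio_base + 8,   # offsets -8..-4
        "halt": mmio_base + 12,     # offsets -4..0
        "fb_start": mmio_base - fb_bytes,
        "fb_end": mmio_base,        # exclusive
    }

layout = mmio_layout(ram_size=65536, fb_bytes=64 * 32)
```

With a 64 KB RAM this puts the putchar port at RAM_SIZE - 8 and the framebuffer at RAM_SIZE - 16 - 2048 up to RAM_SIZE - 16.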
Flavour-C wrapper (the movie model)
outer Loop (trip = N_frames)
  body:
    inner Loop (trip = insts_per_frame)
      body:
        one CPU instruction (the per-instruction body above)
    scan-output: framebuffer slice of RAM
↓
stack scan outputs → uint8[N_frames, FB_BYTES]
↓
reshape → uint8[N_frames, FB_H, FB_W]
ONNX's Loop op has two output kinds: carried outputs (state passed
between iterations) and scan outputs (values emitted per iteration
and concatenated along a new leading axis). The movie wrapper uses scan
outputs to emit one framebuffer per outer iteration, so a single Run()
call returns the entire animation as one tensor.
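The carried/scan distinction can be emulated in a few lines of Python. This sketch is a simplification under stated assumptions: the carried state is reduced to ram alone (pc and regs are omitted), cpu_step is a hypothetical callable standing in for the per-instruction body, and none of these names come from the model.

```python
def loop(trip_count, body, carried):
    """Emulate ONNX Loop semantics: carried state plus collected scan outputs."""
    scans = []
    for i in range(trip_count):
        keep_going, carried, scan = body(i, carried)
        scans.append(scan)
        if not keep_going:
            break
    return carried, scans

def make_frame_body(insts_per_frame, fb_start, fb_end, cpu_step):
    """Build the outer-loop body: run one frame's worth of instructions."""
    def outer_body(_frame_index, ram):
        # Inner Loop: a fixed number of CPU instructions; its scan output is unused.
        def inner_body(_i, inner_ram):
            return True, cpu_step(inner_ram), None
        ram, _ = loop(insts_per_frame, inner_body, ram)
        # Scan output: a snapshot of the framebuffer slice after this frame.
        return True, ram, bytes(ram[fb_start:fb_end])
    return outer_body
```

Calling loop(n_frames, make_frame_body(...), initial_ram) returns the final RAM together with a list of n_frames framebuffer snapshots, which corresponds to the stacked scan output before the reshape.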
Usage
Run the bundled bouncing demo
import onnxruntime as ort
sess = ort.InferenceSession("bouncing_demo.onnx",
providers=["CPUExecutionProvider"])
frames, = sess.run(None, {}) # no inputs
print(frames.shape, frames.dtype)
# (200, 32, 64) uint8
Run your own RV32I program on the generic CPU
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("rv32i_cpu.onnx",
providers=["CPUExecutionProvider"])
RAM_SIZE = 65536
ram = np.zeros(RAM_SIZE, dtype=np.uint8)
# Load your RV32I machine code starting at offset 0
code = open("hello.bin", "rb").read()
ram[:len(code)] = np.frombuffer(code, dtype=np.uint8)
regs = np.zeros(32, dtype=np.int32)
pc = np.array(0, dtype=np.int32)
pc, regs, ram = sess.run(None, {
"pc_in": pc, "regs_in": regs, "ram_in": ram,
"trip_count": np.array(50_000, dtype=np.int64),
})
# Read MMIO putchar (last char "printed")
print(f"last printed byte: {ram[RAM_SIZE - 8]}")
# Read framebuffer
fb = ram[RAM_SIZE - 16 - 2048 : RAM_SIZE - 16].reshape(32, 64)
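To eyeball a framebuffer without an image library, it can be rendered as text. This helper is a hypothetical sketch that assumes one byte per pixel in row-major order and treats any nonzero byte as lit; the actual pixel encoding used by a given program may differ.

```python
def render_ascii(fb_bytes, width, height, on="#", off="."):
    """Render a one-byte-per-pixel framebuffer as text; nonzero bytes are 'on'."""
    if len(fb_bytes) != width * height:
        raise ValueError("framebuffer size does not match width * height")
    rows = []
    for y in range(height):
        row = fb_bytes[y * width:(y + 1) * width]
        rows.append("".join(on if px else off for px in row))
    return "\n".join(rows)
```

With the fb array from the snippet above, print(render_ascii(fb.tobytes(), 64, 32)) would print one 64-character line per framebuffer row.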
Build your own RV32I program
The repo bundles crt0.S, link.ld, and example C sources (hello.c,
bouncing.c). With any riscv32-unknown-elf-gcc (or riscv-none-elf-gcc):
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os \
-ffreestanding -nostdlib -nostartfiles \
-Wl,--gc-sections -ffunction-sections -fdata-sections \
-T link.ld crt0.S hello.c -o hello.elf -lgcc
The ELF's PT_LOAD segments are flattened into the ram_in initializer, and
the entry point goes into pc_in. There's a tiny ELF loader in the source
repo at tools/bake_elf.py.
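The flattening step can be sketched with the standard struct module. This is an independent illustration, not the contents of tools/bake_elf.py: the function name is hypothetical, and it assumes a little-endian ELF32 file whose segment physical addresses (p_paddr) are RAM offsets starting at 0.

```python
import struct

PT_LOAD = 1

def flatten_elf32(elf_bytes, ram_size):
    """Copy PT_LOAD segments of a little-endian ELF32 image into a flat RAM image.

    Returns (ram, entry). Assumes segment load addresses are offsets into RAM.
    """
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 1 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF32 file")
    e_entry, e_phoff = struct.unpack_from("<II", elf_bytes, 24)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 42)
    ram = bytearray(ram_size)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, p_offset, _vaddr, p_paddr, p_filesz, p_memsz = struct.unpack_from(
            "<6I", elf_bytes, off)
        if p_type != PT_LOAD:
            continue
        if p_paddr + p_memsz > ram_size:
            raise ValueError("segment does not fit in RAM")
        ram[p_paddr:p_paddr + p_filesz] = elf_bytes[p_offset:p_offset + p_filesz]
        # Bytes between p_filesz and p_memsz (.bss) stay zero.
    return ram, e_entry
```

The returned bytearray can be wrapped with np.frombuffer(ram, dtype=np.uint8) to form ram_in, and the entry value becomes pc_in.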
Performance
Measured on a Windows ARM64 laptop with ONNX Runtime 1.26 CPU EP, opset 21:
| Metric | Value |
|---|---|
| RV32I instructions / sec | ~7,100 |
| Hello-world end-to-end | 2.4 s (160-byte program) |
| Bouncing ball (200 frames × 200 insts) | 5.6 s |
| Per-instruction body size | ~430 ONNX nodes |
| Generic CPU model size | ~28 KB |
| Movie model size (with baked ELF) | ~95 KB |
This is plenty fast for hand-crafted demos. It is not fast enough
for substantial workloads like a full operating system or a real-time
3D game: the per-node overhead of ORT's Loop body interpreter
dominates everything.
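The figures in the table are mutually consistent, which gives a quick way to budget a workload. A rough sketch, assuming the measured rate of about 7,100 instructions per second carries over to other programs on similar hardware (the function name is made up for the example):

```python
def estimated_seconds(instructions, insts_per_sec=7_100):
    """Estimate wall-clock time from an instruction count and a measured rate."""
    return instructions / insts_per_sec

frames, insts_per_frame = 200, 200
print(f"{estimated_seconds(frames * insts_per_frame):.1f} s")
```

For the bouncing-ball demo this gives 40,000 instructions at roughly 5.6 s, matching the measured end-to-end time.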
What this is for
This model is a building block in a larger experiment: how far can the
standard ONNX op set go as a general computation target? CHIP-8 was the
warm-up (anthonypjshaw/chip8-onnx). RV32I is the next step up: a real ISA
with a real cross-compiler chain.
You can now write C, cross-compile it to RV32I, and execute the resulting
machine code by calling Run() on an ONNX model.
The architecture and the toolchain are the deliverable. The bouncing
ball is a demo of the chain; you can swap in any RV32I binary that
fits in 64 KB (or rebuild the model with a larger ram_size).
Implementation notes (gotchas that bit us)
These are things to know if you're building anything similar:
- ORT CPU EP doesn't implement Where(bool, bool, bool), i.e. Where whose data inputs are themselves bool. Per ORT's OperatorKernels.md, CPU EP only supports Where data types {double, float, int32, int64, string, uint8}. The fix is to cast bool data through int32, Where, then cast back to bool. This came up in the RV32I BRANCH dispatch where six comparison-result bools were selected by funct3. Other EPs:
  - CUDA EP has the same gap (no bool data).
  - DirectML EP supports Where on every type including bool; on Windows, switching to DML eliminates the workaround entirely.
- Where(bool, uint8, uint8) is fine on CPU EP: uint8 data is supported; only bool data isn't. So Where over byte-tensors (e.g. framebuffers, RAM slices) doesn't need any cast.
- BitShift only supports unsigned types. Wrap your shift in Cast(int32→uint32) → BitShift → Cast(uint32→int32).
- Branchless dispatch means LOAD/STORE compute addresses unconditionally, so for non-LOAD/non-STORE instructions the computed address can be out of RAM range and the Gather/ScatterND will fault at runtime. Solution: clamp the address with Where(is_load, addr, 0) before the memory access.
- LUI immediate: the mask in inst & 0xFFFFF000 is 4294963200 as a Python int, which does not fit in an int32 constant; use i32(-4096) instead of i32(0xFFFFF000).
- No hardware multiply in RV32I: link with -lgcc to pull in __mulsi3 and friends, or build with -march=rv32im (you'd need to extend the CPU model to implement MUL/DIV).
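Three of these gotchas have direct scalar analogues that are easy to check in plain Python. The helpers below are hypothetical illustrations, not code from the model builder: wrap_i32 shows why 0xFFFFF000 and -4096 are the same 32-bit pattern, srl mirrors the cast-shift-cast wrapper around BitShift, and safe_address mirrors the Where clamp.

```python
MASK32 = 0xFFFFFFFF

def wrap_i32(value):
    """Wrap any Python int into the signed int32 range."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def srl(value, shamt):
    """Logical right shift done on the unsigned view, then reinterpreted as int32."""
    return wrap_i32((value & MASK32) >> (shamt & 31))

def safe_address(is_mem_op, addr):
    """Select address 0 for instructions that do not access memory."""
    return addr if is_mem_op else 0

print(wrap_i32(0xFFFFF000))  # the LUI mask as a signed 32-bit value
```

The same bit pattern is printed as -4096, which is why the signed literal is the one that fits an int32 constant.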
Files
.
├── rv32i_cpu.onnx # Generic RV32I CPU (28 KB)
├── bouncing_demo.onnx # Self-contained bouncing-ball movie (95 KB)
├── example_output.gif # Output of bouncing_demo.onnx
├── hello.c # Smallest demo: MMIO putchar + checkerboard
├── bouncing.c # The bouncing-ball source
├── crt0.S # RV32I bare-metal entry stub
└── link.ld # Linker script for the 64 KB target
License
MIT. The toolchain (xPack RISC-V GNU GCC) is under GPL; it is only used at build time, and the resulting binaries are MIT.
