RV32I CPU in ONNX

A complete RISC-V RV32I integer CPU implemented as a pure ONNX computation graph. All 11 base-ISA opcodes (LUI, AUIPC, JAL, JALR, BRANCH, LOAD, STORE, OP-IMM, OP, MISC-MEM, SYSTEM). No custom operators. Standard ONNX Runtime 1.26 CPU EP runs it unmodified.

Real RV32I machine code, cross-compiled from C with riscv-none-elf-gcc, executes inside the model. The output is a tensor of framebuffer snapshots — one per outer-loop iteration.

The GIF above (example_output.gif) is the direct output of Run() on bouncing_demo.onnx: a uint8[200, 32, 64] tensor returned in one call, with no inputs. The pixel data was generated by bouncing.c (also in this repo) running on the RV32I core inside the model.

Two models, one CPU

File	Purpose	Inputs	Outputs
`rv32i_cpu.onnx`	Generic RV32I CPU. Load any RV32I binary into RAM.	`pc_in`, `regs_in`, `ram_in`, `trip_count`	`pc_out`, `regs_out`, `ram_out`
`bouncing_demo.onnx`	Fully self-contained: bouncing ball ELF baked in.	(none)	`uint8[200, 32, 64]` frames

How it works

State

The CPU state is three loop-carried tensors:

Tensor	Shape	Dtype	Holds
`pc`	scalar	`int32`	byte program counter
`regs`	`[32]`	`int32`	x0..x31 (x0 forced to zero on every writeback)
`ram`	`[ram_size]`	`uint8`	program, data, framebuffer, MMIO

Per-instruction body

Each iteration of the inner Loop fetches, decodes, and executes one RV32I instruction. The dispatch is branchless: every opcode's candidate next-state is computed in parallel, then Where cascades pick the matching one. This trades wasted work for a flat, regular graph.
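
As a rough illustration of this dispatch style, here is a minimal plain-Python sketch. It is not the graph's actual node layout: the `where` helper stands in for ONNX Where on scalars, only three opcodes are modeled, and the names `dispatch` and `to_i32` are hypothetical.

```python
# Major opcode values (bits 6..0) from the RV32I base encoding.
OP_LUI, OP_AUIPC, OP_JAL = 0x37, 0x17, 0x6F

def where(cond, a, b):
    """Stand-in for ONNX Where on scalars: a and b are both already computed."""
    return a if cond else b

def to_i32(x):
    """Wrap a Python int to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def dispatch(opcode, pc, imm_u, imm_j):
    """Return (next_pc, rd_value) for a three-opcode subset."""
    # 1. Every candidate is computed unconditionally, like parallel subgraphs.
    lui = (to_i32(pc + 4), to_i32(imm_u))
    auipc = (to_i32(pc + 4), to_i32(pc + imm_u))
    jal = (to_i32(pc + imm_j), to_i32(pc + 4))
    fallthrough = (to_i32(pc + 4), 0)  # placeholder for opcodes not modeled here

    # 2. A cascade of selects keeps the candidate whose opcode matches.
    out = fallthrough
    out = where(opcode == OP_JAL, jal, out)
    out = where(opcode == OP_AUIPC, auipc, out)
    out = where(opcode == OP_LUI, lui, out)
    return out
```

The cost of this style is visible even here: all three candidates are evaluated on every call, and only one survives the cascade.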

fetch (Gather 4 bytes at PC → assemble little-endian int32)
       ↓
decode (BitwiseAnd/BitShift to extract opcode, rd, rs1, rs2, funct3/7
        and the five immediate flavours: I, S, B, U, J)
       ↓
all 11 opcodes compute candidate (next_pc, next_regs) in parallel:
  LUI · AUIPC · JAL · JALR · BRANCH ·
  OP-IMM (ADDI/SLTI/.../SRAI) · OP (ADD/SUB/.../AND) ·
  LOAD (LB/LH/LW/LBU/LHU) ·
  STORE (SB/SH/SW)         ← only candidate that writes ram
  MISC-MEM (FENCE → NOP) · SYSTEM (ECALL/EBREAK → exit Loop)
       ↓
Where-cascade dispatch: pick the matching candidate by opcode
       ↓
emit (pc, regs, ram) for next iteration

The body is ~430 ONNX nodes, and the model file is ~28 KB. The Loop's cond_out is wired so that ECALL/EBREAK exits the loop early.
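
The fetch and decode stages can be sketched with ordinary Python integers. Field positions follow the RV32I base encoding; the function names are illustrative and not taken from the repo, and `ram` is assumed to be a bytes-like object of Python ints rather than a NumPy array.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def fetch(ram, pc):
    """Assemble a little-endian 32-bit word from the four bytes at pc."""
    b0, b1, b2, b3 = ram[pc:pc + 4]
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def decode(inst):
    """Split an instruction word into register fields and the five immediates."""
    imm_s = ((inst >> 25) << 5) | ((inst >> 7) & 0x1F)
    imm_b = ((((inst >> 31) & 1) << 12) | (((inst >> 7) & 1) << 11)
             | (((inst >> 25) & 0x3F) << 5) | (((inst >> 8) & 0xF) << 1))
    imm_j = ((((inst >> 31) & 1) << 20) | (((inst >> 12) & 0xFF) << 12)
             | (((inst >> 20) & 1) << 11) | (((inst >> 21) & 0x3FF) << 1))
    return {
        "opcode": inst & 0x7F,
        "rd": (inst >> 7) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rs1": (inst >> 15) & 0x1F,
        "rs2": (inst >> 20) & 0x1F,
        "funct7": (inst >> 25) & 0x7F,
        "imm_i": sign_extend(inst >> 20, 12),
        "imm_s": sign_extend(imm_s, 12),
        "imm_b": sign_extend(imm_b, 13),   # branch offsets are multiples of 2
        "imm_u": sign_extend(inst & 0xFFFFF000, 32),
        "imm_j": sign_extend(imm_j, 21),   # jump offsets are multiples of 2
    }
```

For example, `decode(fetch(bytes([0x93, 0x00, 0xF0, 0xFF]), 0))` decodes `addi x1, x0, -1`.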

MMIO layout

The top 16 bytes of RAM are reserved as memory-mapped I/O:

Offset (from end of RAM)	Purpose
`-16..-12`	Tick counter
`-12..-8`	Keys state
`-8..-4`	`putchar` port
`-4..0`	Halt port (any write sets halt flag)

The framebuffer occupies the FB_BYTES bytes ending at RAM_SIZE - 16. For the bouncing-ball demo: 64 × 32 = 2048 bytes.
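
The layout translates into a handful of address constants. The sketch below assumes the 64 KB RAM size used in the usage example and treats each port as a 4-byte little-endian word, which the table's 4-byte ranges suggest but do not state. The constant and function names are hypothetical.

```python
RAM_SIZE = 65536           # assumption: the 64 KB default
FB_W, FB_H = 64, 32        # bouncing-ball framebuffer dimensions
FB_BYTES = FB_W * FB_H     # 2048

MMIO_BASE = RAM_SIZE - 16
TICK_ADDR = RAM_SIZE - 16
KEYS_ADDR = RAM_SIZE - 12
PUTCHAR_ADDR = RAM_SIZE - 8
HALT_ADDR = RAM_SIZE - 4
FB_START = MMIO_BASE - FB_BYTES   # framebuffer ends where MMIO begins

def read_u32(ram, addr):
    """Read a little-endian 32-bit word (assumed port width) from ram."""
    return int.from_bytes(bytes(ram[addr:addr + 4]), "little")

def framebuffer_rows(ram):
    """Return the framebuffer as FB_H rows of FB_W bytes each."""
    fb = bytes(ram[FB_START:MMIO_BASE])
    return [fb[r * FB_W:(r + 1) * FB_W] for r in range(FB_H)]
```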

Flavour-C wrapper (the movie model)

outer Loop (trip = N_frames)
  body:
    inner Loop (trip = insts_per_frame)
      body:
        ← one CPU instruction (the per-instruction body above)
    scan-output: framebuffer slice of RAM
  ↓
stack scan outputs → uint8[N_frames, FB_BYTES]
  ↓
reshape → uint8[N_frames, FB_H, FB_W]

ONNX's Loop op has two output kinds: carried outputs (state passed between iterations) and scan outputs (values emitted per iteration and concatenated along a new leading axis). The movie wrapper uses scan outputs to emit one framebuffer per outer iteration, so a single Run() call returns the entire animation as one tensor.
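
The difference between carried and scan outputs can be modeled in a few lines of plain Python. This is a toy emulation of the Loop semantics, not the model itself: `onnx_loop`, `cpu_step`, and `frame_body` are hypothetical names, and the "CPU step" simply increments every byte so the effect is easy to follow.

```python
def onnx_loop(trip_count, state, body):
    """Toy Loop: body(i, state) returns (keep_going, new_state, scan_value)."""
    scans = []
    for i in range(trip_count):
        keep_going, state, scan_value = body(i, state)
        scans.append(scan_value)   # scan output: one entry per iteration
        if not keep_going:         # a False cond_out ends the loop early
            break
    return state, scans            # carried output, stacked scan outputs

def cpu_step(i, ram):
    """Stand-in for one instruction: bump every byte of the tiny 'RAM'."""
    return True, [(b + 1) % 256 for b in ram], None

def frame_body(i, ram, insts_per_frame=3):
    """Outer body: run the inner loop, then snapshot the framebuffer."""
    ram, _ = onnx_loop(insts_per_frame, ram, cpu_step)
    return True, ram, list(ram)

final_ram, frames = onnx_loop(4, [0, 0], frame_body)
```

Here `final_ram` plays the role of the carried state after the last frame, while `frames` collects one snapshot per outer iteration.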

Usage

Run the bundled bouncing demo

import onnxruntime as ort
sess = ort.InferenceSession("bouncing_demo.onnx",
                            providers=["CPUExecutionProvider"])
frames, = sess.run(None, {})  # no inputs
print(frames.shape, frames.dtype)
# (200, 32, 64) uint8

Run your own RV32I program on the generic CPU

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("rv32i_cpu.onnx",
                            providers=["CPUExecutionProvider"])

RAM_SIZE = 65536
ram = np.zeros(RAM_SIZE, dtype=np.uint8)
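# Load a flat RV32I binary image starting at offset 0 (for example one made with
# `riscv-none-elf-objcopy -O binary hello.elf hello.bin`; this assumes the
# image is linked to start at address 0, matching pc = 0 below)
with open("hello.bin", "rb") as f:
    code = f.read()
ram[:len(code)] = np.frombuffer(code, dtype=np.uint8)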

regs = np.zeros(32, dtype=np.int32)
pc = np.array(0, dtype=np.int32)

pc, regs, ram = sess.run(None, {
    "pc_in": pc, "regs_in": regs, "ram_in": ram,
    "trip_count": np.array(50_000, dtype=np.int64),
})

# Read MMIO putchar (last char "printed")
print(f"last printed byte: {ram[RAM_SIZE - 8]}")
# Read framebuffer
fb = ram[RAM_SIZE - 16 - 2048 : RAM_SIZE - 16].reshape(32, 64)

Build your own RV32I program

The repo bundles crt0.S, link.ld, and example C sources (hello.c, bouncing.c). Build them with any riscv32-unknown-elf-gcc (or riscv-none-elf-gcc):

riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os \
  -ffreestanding -nostdlib -nostartfiles \
  -Wl,--gc-sections -ffunction-sections -fdata-sections \
  -T link.ld crt0.S hello.c -o hello.elf -lgcc

The ELF's PT_LOAD segments are flattened into the ram_in initializer, and the entry point goes into pc_in. A small ELF loader lives in the source repo at tools/bake_elf.py.
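
For readers without the source repo at hand, the flattening step might look roughly like the sketch below. It is not the code in tools/bake_elf.py: `flatten_elf` is a hypothetical function written against the standard ELF32 header layout, and it assumes segment virtual addresses map directly to RAM offsets.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments

def flatten_elf(elf_bytes, ram_size=65536):
    """Return (ram, entry) for a little-endian ELF32 image."""
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 1 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF32 file")
    # ELF32 header offsets: e_entry at 24, e_phoff at 28,
    # e_phentsize at 42, e_phnum at 44.
    e_entry, e_phoff = struct.unpack_from("<II", elf_bytes, 24)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 42)

    ram = bytearray(ram_size)
    for i in range(e_phnum):
        (p_type, p_offset, p_vaddr, _p_paddr,
         p_filesz, p_memsz, _p_flags, _p_align) = struct.unpack_from(
            "<8I", elf_bytes, e_phoff + i * e_phentsize)
        if p_type != PT_LOAD:
            continue
        if p_vaddr + p_memsz > ram_size:
            raise ValueError("segment does not fit in RAM")
        # Copy the file-backed bytes; the rest of p_memsz (.bss) stays zero.
        ram[p_vaddr:p_vaddr + p_filesz] = elf_bytes[p_offset:p_offset + p_filesz]
    return ram, e_entry
```

The returned `ram` and `entry` would then be converted to the uint8 and int32 tensors that the generic model expects as ram_in and pc_in.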

Performance

Measured on a Windows ARM64 laptop with ONNX Runtime 1.26 CPU EP, opset 21:

Metric	Value
RV32I instructions / sec	~7,100
Hello-world end-to-end	2.4 s (160-byte program)
Bouncing ball (200 frames × 200 insts)	5.6 s
Per-instruction body size	~430 ONNX nodes
Generic CPU model size	~28 KB
Movie model size (with baked ELF)	~95 KB

This is plenty fast for hand-crafted demos. It is not fast enough for substantial workloads like a full operating system or a real-time 3D game — the per-node overhead of ORT's Loop body interpreter dominates everything.
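
As a back-of-the-envelope check, the measured rate can be turned into a runtime estimate for a given instruction budget. The sketch assumes throughput stays constant at the figure in the table, which is a simplification; `estimated_seconds` is a hypothetical helper.

```python
INSTS_PER_SEC = 7_100   # measured figure from the table above

def estimated_seconds(instructions, rate=INSTS_PER_SEC):
    """Estimate wall-clock seconds for a given number of RV32I instructions."""
    return instructions / rate

# Bouncing-ball demo: 200 frames x 200 instructions per frame.
bouncing = estimated_seconds(200 * 200)
```

The demo's 40,000 instructions come out at roughly 5.6 s, in line with the measured figure, while a workload of one million instructions would take well over two minutes at the same rate.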

What this is for

This model is a building block in a larger experiment: how far can the standard ONNX op set go as a general computation target? CHIP-8 was the warm-up (anthonypjshaw/chip8-onnx). RV32I is the next step up — a real ISA with a real cross-compiler chain. You can now write C, cross-compile it to RV32I, and execute the resulting machine code by calling Run() on an ONNX model.

The architecture and the toolchain are the deliverable. The bouncing ball is a demo of the chain; you can swap in any RV32I binary that fits in 64 KB (or rebuild the model with a larger ram_size).

Implementation notes (gotchas that bit us)

These are things to know if you're building anything similar:

- ORT CPU EP doesn't implement Where(bool, bool, bool), i.e. Where whose data inputs are themselves bool. Per ORT's OperatorKernels.md, CPU EP only supports the Where data types {double, float, int32, int64, string, uint8}. The fix is to cast the bool data to int32, apply Where, then cast back to bool. This came up in the RV32I BRANCH dispatch, where six comparison-result bools were selected by funct3. Other EPs:
  - CUDA EP has the same gap (no bool data).
  - DirectML EP supports Where on every type including bool, so on Windows, switching to DML eliminates the workaround entirely.
- Where(bool, uint8, uint8) is fine on CPU EP: uint8 data is supported; only bool data isn't. So Where over byte tensors (e.g. framebuffers, RAM slices) doesn't need any cast.
- BitShift only supports unsigned types. Wrap each shift in Cast(int32→uint32) → BitShift → Cast(uint32→int32).
- Branchless dispatch means LOAD/STORE compute addresses unconditionally, so for non-LOAD/non-STORE instructions the computed address can fall outside RAM and the Gather/ScatterND will fault at runtime. Solution: force the address to a safe value with Where(is_load, addr, 0) (and the equivalent select for stores) before the memory access.
- LUI immediate: the mask in inst & 0xFFFFF000 is 4294963200 as a Python int, which does not fit in an int32 constant. Use i32(-4096) instead of i32(0xFFFFF000).
- No hardware multiply in RV32I: link with -lgcc to pull in __mulsi3 and friends, or build with -march=rv32im (you'd need to extend the CPU model to implement MUL/DIV).
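
Two of these notes, the unsigned-only BitShift and the LUI mask constant, come down to reinterpreting 32-bit patterns, which can be sketched in plain Python. The helper names are hypothetical, the casts are assumed to preserve the two's-complement bit pattern, and the `sra` construction is only one possible way to build an arithmetic shift; the source does not say how the model does it.

```python
MASK32 = 0xFFFFFFFF

def as_u32(x):
    """Signed -> unsigned view (the Cast int32 -> uint32 step)."""
    return x & MASK32

def as_i32(x):
    """Unsigned -> signed view (the Cast uint32 -> int32 step)."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sll(value, shamt):
    """Shift left logical: cast to unsigned, shift, cast back."""
    return as_i32(as_u32(value) << (shamt & 31))

def srl(value, shamt):
    """Shift right logical: cast to unsigned, shift, cast back."""
    return as_i32(as_u32(value) >> (shamt & 31))

def sra(value, shamt):
    """Shift right arithmetic: logical shift plus sign-fill bits ORed back in."""
    shamt &= 31
    fill = (MASK32 << (32 - shamt)) & MASK32 if value < 0 else 0
    return as_i32((as_u32(value) >> shamt) | fill)

# The LUI mask as a value that fits in int32: same bits as 0xFFFFF000.
LUI_MASK = as_i32(0xFFFFF000)
```

`LUI_MASK` evaluates to -4096, which is why the signed constant can replace the out-of-range hex literal.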

Files

.
├── rv32i_cpu.onnx           # Generic RV32I CPU (28 KB)
├── bouncing_demo.onnx       # Self-contained bouncing-ball movie (95 KB)
├── example_output.gif       # Output of bouncing_demo.onnx
├── hello.c                  # Smallest demo: MMIO putchar + checkerboard
├── bouncing.c               # The bouncing-ball source
├── crt0.S                   # RV32I bare-metal entry stub
└── link.ld                  # Linker script for the 64 KB target

License

MIT. The toolchain (xPack RISC-V GNU GCC) is licensed under the GPL but is only used at build time; the resulting binaries are MIT.
