Phoenix NPU1 IRON kernels — verified-correct, open-source compiled AIE designs

Device-specific compiled xclbins + IRON designs for the AMD Ryzen 7 7840HS NPU (Phoenix / XDNA 1 / AIE-ML / AIE2). Compiled entirely on Linux with an open-source toolchain (mlir-aie v1.3.2 + llvm-aie/Peano): no license, no Windows, no AMD account required.

⚠️ These artifacts target NPU1 (Phoenix / Hawk Point, AIE2) specifically. They will NOT load on NPU2/NPU4 (Strix / Strix Halo / Krackan) — those use AIE2P and different overlays.

What's here

`kernels/` — pre-compiled xclbins (verified correct, exact vs numpy)

| file | op | shape | cores | throughput |
|---|---|---|---|---|
| `gemm-256-1col` | int16 GEMM (i16→i32) | 256³ | 1 | 30.5 GOPS |
| `gemm-512-4col` | int16 GEMM | 512³ | 16 (4×4) | 264.7 GOPS |
| `gemv-288` | int16 GEMV (batch=1) | 288² | 1 | 0.4 GOPS ⚠️ |
| `fusion/fused-add-add` | fused 2-stage bf16 add | 4096 | 2 | fusion proof |
| `fusion/single-add` | single bf16 add | 4096 | 1 | fusion baseline |

All verified bit-exact against numpy on a real 7840HS NPU (strace-confirmed AMDXDNA_EXEC_CMD).
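
As a rough cross-check on the table, the throughput figures can be converted into an implied time per call. The sketch below assumes GOPS counts two operations (one multiply, one add) per multiply-accumulate, which the table does not state; the helper name `implied_ms` is illustrative only.

```python
def implied_ms(macs: int, gops: float) -> float:
    """Milliseconds per call implied by a throughput figure.

    Assumes 2 ops per multiply-accumulate (an assumption, not stated in the table).
    """
    ops = 2 * macs
    return ops / (gops * 1e9) * 1e3


# MAC counts follow from the shapes in the table: M*N*K for GEMM, M*N for GEMV.
rows = {
    "gemm-256-1col": (256**3, 30.5),
    "gemm-512-4col": (512**3, 264.7),
    "gemv-288": (288**2, 0.4),
}
for name, (macs, gops) in rows.items():
    print(f"{name}: {implied_ms(macs, gops):.3f} ms per call")
```

Under that assumption all three kernels land between about 0.4 and 1.1 ms per call, even though the GEMV performs roughly 1600× fewer multiply-accumulates than the 512³ GEMM. That is what one would expect if a fixed per-call cost dominates the small kernel.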

`designs/` — IRON source + runner

npu_kernel.py — NpuKernel class: load (xclbin, insts), run on numpy arrays. Abstracts XRT; caller never touches pyxrt.
fused_add_add.py / single_add.py — the op-fusion experiment designs
bootstrap.sh / setup-env.sh — reproduce the IRON/Peano toolchain

How to run (on a Phoenix NPU, Linux)

git clone https://github.com/tibrezus/xdna-npu-toolkit
cd xdna-npu-toolkit/iron && ./bootstrap.sh && source setup-env.sh
# then use the NpuKernel runner against these xclbins
python3 -c "
import sys; sys.path.insert(0, 'designs')
import numpy as np
from huggingface_hub import hf_hub_download
from npu_kernel import NpuKernel

# Fetch the compiled design and its instruction stream from the Hub.
repo = 'mrtib/phoenix-npu1-iron-kernels'
xc = hf_hub_download(repo, 'kernels/gemm-512-4col.xclbin')
ins = hf_hub_download(repo, 'kernels/gemm-512-4col.insts.txt')
k = NpuKernel(xc, ins)

# Random int16 inputs; the output is read back as int32.
A = np.random.randint(-100, 100, (512, 512), np.int16)
B = np.random.randint(-100, 100, (512, 512), np.int16)
out = k.run(A, B, out_sizes=[512*512*4], out_dtype=np.int32)[0].reshape(512, 512)

# Compare against a numpy reference computed in int32.
print('PASS' if (out == A.astype(np.int32) @ B.astype(np.int32)).all() else 'FAIL')
"

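The reference in the last line of the check widens both inputs to int32 before multiplying. A minimal sketch of why that matters, using all-100 matrices as a worst case for the value range above (the function name `reference_gemm` is illustrative and not part of the toolkit):

```python
import numpy as np


def reference_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Host-side reference: int16 inputs, int32 accumulation."""
    return a.astype(np.int32) @ b.astype(np.int32)


# Worst case for inputs below 100 in magnitude with K=512:
# each output element is 512 * 100 * 100 = 5,120,000, far beyond the int16 range.
a = np.full((4, 512), 100, dtype=np.int16)
b = np.full((512, 4), 100, dtype=np.int16)

wide = reference_gemm(a, b)
narrow = a @ b  # NumPy keeps int16 here, so the accumulated sum wraps silently

print(wide[0, 0])                     # 5120000
print(narrow.dtype)                   # int16
print(np.array_equal(wide, narrow))   # False
```

Comparing the NPU output against an un-widened `A @ B` would therefore report a spurious mismatch even when the kernel is correct.
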
Honest caveats

These are building blocks (GEMM/GEMV/fusion primitives), not a full embedding model. Full-model NPU serving is work in progress (see xdna-npu-toolkit#12).
The gemv-288 kernel (batch=1) is 8× slower than the CPU because it is bound by the host↔NPU round trip, not by compute. Batched GEMM is the viable path; see the fusion experiment.
Peano is a single-core compiler; the 4-col design achieves multi-column throughput via the array topology, not via a multi-core compiler.
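
The round-trip caveat can be illustrated with a toy cost model: each call pays a fixed host↔NPU overhead plus compute time at some peak rate, so larger batches amortize the overhead. Both constants below are hypothetical placeholders chosen for illustration, not measurements from this repo, and the function name is likewise illustrative.

```python
def effective_gops(batch: int, n: int = 288,
                   overhead_s: float = 0.4e-3,    # hypothetical fixed cost per call
                   peak_gops: float = 30.0) -> float:  # hypothetical compute rate
    """Throughput of `batch` n-by-n GEMVs issued as a single call (toy model)."""
    ops = 2 * n * n * batch                          # 2 ops per multiply-accumulate
    seconds = overhead_s + ops / (peak_gops * 1e9)   # fixed overhead + compute time
    return ops / seconds / 1e9


for batch in (1, 8, 64, 512):
    print(f"batch={batch:4d}: {effective_gops(batch):6.2f} GOPS")
```

In this model throughput rises monotonically with batch size and approaches `peak_gops` as the fixed overhead is spread over more work, which is the reasoning behind preferring batched GEMM over batch=1 GEMV.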

Source

Repo: https://github.com/tibrezus/xdna-npu-toolkit (iron/ directory)
Toolchain: Xilinx/mlir-aie + Xilinx/llvm-aie
Investigation log: issue #8

Apache-2.0 (matching the mlir-aie upstream license).
