Phoenix NPU1 IRON kernels: verified-correct, open-source compiled AIE designs
Device-specific compiled xclbins + IRON designs for the AMD Ryzen 7 7840HS
NPU (Phoenix / XDNA 1 / AIE-ML / AIE2). Compiled entirely on Linux with an
open-source toolchain (mlir-aie v1.3.2 + llvm-aie/Peano): no license, no
Windows, no AMD account required.
⚠️ These artifacts target NPU1 (Phoenix / Hawk Point, AIE2) specifically. They will NOT load on NPU2/NPU4 (Strix / Strix Halo / Krackan); those use AIE2P and different overlays.
What's here
kernels/: pre-compiled xclbins (verified correct, bit-exact against numpy)
| file | op | shape | cores | throughput |
|---|---|---|---|---|
| `gemm-256-1col` | int16 GEMM (i16→i32) | 256³ | 1 | 30.5 GOPS |
| `gemm-512-4col` | int16 GEMM | 512³ | 16 (4×4) | 264.7 GOPS |
| `gemv-288` | int16 GEMV (batch=1) | 288² | 1 | 0.4 GOPS ⚠️ |
| `fusion/fused-add-add` | fused 2-stage bf16 add | 4096 | 2 | fusion proof |
| `fusion/single-add` | single bf16 add | 4096 | 1 | fusion baseline |
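
As a rough cross-check of the throughput column, the sketch below converts each GOPS figure into an implied time per call. It assumes GOPS counts two operations (one multiply, one add) per multiply-accumulate, which this README does not state; the helper names are hypothetical and the results are derived arithmetic, not measurements.

```python
def mac_ops(m: int, n: int, k: int) -> int:
    """Operation count for an (m x k) @ (k x n) product, assuming 2 ops per MAC."""
    return 2 * m * n * k


def implied_latency_ms(ops: int, gops: float) -> float:
    """Time per call implied by a throughput figure given in GOPS."""
    return ops / (gops * 1e9) * 1e3


# (m, n, k, GOPS) taken from the table; a GEMV is treated as a GEMM with n = 1.
rows = {
    "gemm-256-1col": (256, 256, 256, 30.5),
    "gemm-512-4col": (512, 512, 512, 264.7),
    "gemv-288": (288, 1, 288, 0.4),
}

implied = {
    name: implied_latency_ms(mac_ops(m, n, k), gops)
    for name, (m, n, k, gops) in rows.items()
}

for name, ms in implied.items():
    print(f"{name}: ~{ms:.3f} ms per call (derived, not measured)")
```

Under that assumption the three kernels land within the same order of magnitude per call even though their operation counts differ by more than a thousandfold, which is consistent with the GEMV being dominated by fixed per-call cost rather than compute.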
All verified bit-exact against numpy on a real 7840HS NPU (strace-confirmed
AMDXDNA_EXEC_CMD).
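
A bit-exact comparison for the i16→i32 GEMM only makes sense if the numpy reference also accumulates in 32 bits. The CPU-only sketch below (no NPU involved; the all-100 inputs are a hypothetical worst case within the README's value range) shows why the reference widens to int32 before multiplying: a plain int16 matmul stays int16 and wraps around.

```python
import numpy as np

K = 512  # reduction length of the 512-cubed GEMM

# Worst case for inputs bounded by 100 in magnitude: every product is 100 * 100.
A = np.full((1, K), 100, dtype=np.int16)
B = np.full((K, 1), 100, dtype=np.int16)

# Reference with int32 accumulation, matching the i16 -> i32 contract.
exact = A.astype(np.int32) @ B.astype(np.int32)

# Same product without widening: numpy keeps int16 and overflows silently.
wrapped = A @ B

print("int32 reference:", int(exact[0, 0]))
print("int16 result dtype:", wrapped.dtype)
print("matches reference:", bool(wrapped[0, 0] == exact[0, 0]))
```

The true dot product here is 512 × 100 × 100 = 5,120,000, which fits comfortably in int32 but is far outside the int16 range.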
designs/: IRON source + runner
- `npu_kernel.py`: `NpuKernel` class that loads (xclbin, insts) and runs on numpy arrays. Abstracts XRT; the caller never touches `pyxrt`.
- `fused_add_add.py` / `single_add.py`: the op-fusion experiment designs
- `bootstrap.sh` / `setup-env.sh`: reproduce the IRON/Peano toolchain
How to run (on a Phoenix NPU, Linux)
git clone https://github.com/tibrezus/xdna-npu-toolkit
cd xdna-npu-toolkit/iron && ./bootstrap.sh && source setup-env.sh
# then use the NpuKernel runner against these xclbins
python3 -c "
import sys; sys.path.insert(0,'designs')
from npu_kernel import NpuKernel
from huggingface_hub import hf_hub_download
xc = hf_hub_download('mrtib/phoenix-npu1-iron-kernels','kernels/gemm-512-4col.xclbin')
ins= hf_hub_download('mrtib/phoenix-npu1-iron-kernels','kernels/gemm-512-4col.insts.txt')
k=NpuKernel(xc,ins)
import numpy as np
A=np.random.randint(-100,100,(512,512),np.int16); B=np.random.randint(-100,100,(512,512),np.int16)
out=k.run(A,B,out_sizes=[512*512*4],out_dtype=np.int32)[0].reshape(512,512)
print('PASS' if (out==A.astype(np.int32)@B.astype(np.int32)).all() else 'FAIL')
"
Honest caveats
- These are building blocks (GEMM/GEMV/fusion primitives), not a full embedding model. Full-model NPU serving is work in progress (see xdna-npu-toolkit#12).
- The `gemv-288` (batch=1) is 8× slower than CPU because it is bound by the host↔NPU round trip. Batched GEMM is the viable path; see the fusion experiment.
- Single-core compiler (Peano); the 4-col design achieves multi-column throughput via the array topology, not a multi-core compiler.
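
To illustrate why batching is the viable path, the CPU-only numpy sketch below shows that many GEMV requests against the same matrix collapse into one GEMM, so a single host-to-device round trip could serve the whole batch. The batch size of 64 and the variable names are hypothetical, and this is not the NPU code path itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# One 288 x 288 int16 weight matrix, as in gemv-288.
W = rng.integers(-100, 100, size=(288, 288), dtype=np.int16)

# A hypothetical batch of 64 input vectors, one per row.
xs = rng.integers(-100, 100, size=(64, 288), dtype=np.int16)

W32 = W.astype(np.int32)
xs32 = xs.astype(np.int32)

# 64 separate GEMV calls (each would be its own round trip on the NPU).
one_by_one = np.stack([W32 @ x for x in xs32])

# The same work expressed as one GEMM: row i of the result is W @ xs[i].
batched = xs32 @ W32.T

print("identical results:", np.array_equal(one_by_one, batched))
```

The two forms produce the same integers; only the number of calls changes, and the call count is where the round-trip cost is paid.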
Source
- Repo: https://github.com/tibrezus/xdna-npu-toolkit (`iron/` directory)
- Toolchain: Xilinx/mlir-aie + Xilinx/llvm-aie
- Investigation log: issue #8
Apache-2.0 (matching the mlir-aie upstream license).