Phoenix NPU1 IRON kernels: verified-correct, open-source compiled AIE designs

Device-specific compiled xclbins + IRON designs for the AMD Ryzen 7 7840HS NPU (Phoenix / XDNA 1 / AIE-ML / AIE2). Compiled entirely on Linux with an open-source toolchain (mlir-aie v1.3.2 + llvm-aie/Peano): no license, no Windows, and no AMD account required.

⚠️ These artifacts target NPU1 (Phoenix / Hawk Point, AIE2) specifically. They will NOT load on NPU2/NPU4 (Strix / Strix Halo / Krackan); those use AIE2P and different overlays.

What's here

kernels/: pre-compiled xclbins (verified correct, exact vs numpy)

| file | op | shape | cores | throughput / note |
|---|---|---|---|---|
| gemm-256-1col | int16 GEMM (i16→i32) | 256³ | 1 | 30.5 GOPS |
| gemm-512-4col | int16 GEMM | 512³ | 16 (4×4) | 264.7 GOPS |
| gemv-288 | int16 GEMV (batch=1) | 288² | 1 | 0.4 GOPS ⚠️ |
| fusion/fused-add-add | fused 2-stage bf16 add | 4096 | 2 | fusion proof |
| fusion/single-add | single bf16 add | 4096 | 1 | fusion baseline |

All verified bit-exact against numpy on a real 7840HS NPU (strace-confirmed AMDXDNA_EXEC_CMD).
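To put the throughput column in perspective, the short sketch below converts each reported GOPS figure into an implied time per call. It assumes the common convention of counting 2 operations (one multiply, one add) per multiply-accumulate, and that the figures were measured end to end; this README does not state either, so treat the numbers as a rough reading rather than a measurement. The helper names are illustrative.

```python
def gemm_ops(m, n, k):
    """Operation count for an (m x k) @ (k x n) product, assuming 2 ops per MAC."""
    return 2 * m * n * k


def implied_latency_ms(ops, gops):
    """Time per call implied by a throughput figure, in milliseconds."""
    return ops / (gops * 1e9) * 1e3


# (name, m, n, k, reported GOPS) taken from the table; a GEMV is a GEMM with n = 1.
rows = [
    ("gemm-256-1col", 256, 256, 256, 30.5),
    ("gemm-512-4col", 512, 512, 512, 264.7),
    ("gemv-288", 288, 1, 288, 0.4),
]

for name, m, n, k, gops in rows:
    ops = gemm_ops(m, n, k)
    print(f"{name}: {ops} ops, ~{implied_latency_ms(ops, gops):.3f} ms per call")
```

Under these assumptions the GEMV performs roughly 1,600 times less arithmetic than the 512³ GEMM, yet its implied time per call is of the same order of magnitude, which is what a fixed per-call overhead would look like.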

designs/: IRON source + runner

  • npu_kernel.py: NpuKernel class that loads (xclbin, insts) and runs on numpy arrays. Abstracts XRT; the caller never touches pyxrt.
  • fused_add_add.py / single_add.py: the op-fusion experiment designs
  • bootstrap.sh / setup-env.sh: reproduce the IRON/Peano toolchain

How to run (on a Phoenix NPU, Linux)

git clone https://github.com/tibrezus/xdna-npu-toolkit
cd xdna-npu-toolkit/iron && ./bootstrap.sh && source setup-env.sh
# then use the NpuKernel runner against these xclbins
python3 -c "
import sys
import numpy as np
from huggingface_hub import hf_hub_download

sys.path.insert(0, 'designs')  # make designs/npu_kernel.py importable
from npu_kernel import NpuKernel

# Fetch the compiled xclbin and its instruction stream from the Hub.
repo = 'mrtib/phoenix-npu1-iron-kernels'
xc = hf_hub_download(repo, 'kernels/gemm-512-4col.xclbin')
ins = hf_hub_download(repo, 'kernels/gemm-512-4col.insts.txt')
k = NpuKernel(xc, ins)

# Random int16 inputs; the output is int32, so 4 bytes per element.
A = np.random.randint(-100, 100, (512, 512), np.int16)
B = np.random.randint(-100, 100, (512, 512), np.int16)
out = k.run(A, B, out_sizes=[512*512*4], out_dtype=np.int32)[0].reshape(512, 512)

# Compare against the numpy reference computed in int32.
print('PASS' if (out == A.astype(np.int32) @ B.astype(np.int32)).all() else 'FAIL')
"
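The PASS/FAIL check above can be factored into a small reusable helper. The following is a minimal sketch, not part of the repository: verify_gemm and cpu_run are hypothetical names, the run callable is assumed to follow the call shape used in the command above (run(A, B, out_sizes=..., out_dtype=...) returning a list of flat arrays), and out_sizes is assumed to be in bytes. The CPU stand-in lets the check itself be exercised on a machine without an NPU.

```python
import numpy as np


def verify_gemm(run, n, seed=0):
    """Return True if run() reproduces the int32 numpy reference for an n x n int16 GEMM."""
    rng = np.random.default_rng(seed)
    A = rng.integers(-100, 100, (n, n), dtype=np.int16)
    B = rng.integers(-100, 100, (n, n), dtype=np.int16)
    out = run(A, B, out_sizes=[n * n * 4], out_dtype=np.int32)[0]
    ref = A.astype(np.int32) @ B.astype(np.int32)
    return bool(np.array_equal(np.asarray(out).reshape(n, n), ref))


def cpu_run(A, B, out_sizes, out_dtype):
    """Hypothetical CPU stand-in that mimics the run() call shape; not part of the repo."""
    out = (A.astype(out_dtype) @ B.astype(out_dtype)).ravel()
    assert out.nbytes == out_sizes[0]  # out_sizes is assumed to be a size in bytes
    return [out]


print("PASS" if verify_gemm(cpu_run, 64) else "FAIL")
```

On Phoenix hardware the same helper would presumably be called as verify_gemm(NpuKernel(xc, ins).run, 512), with xc and ins downloaded as in the command above.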

Honest caveats

  • These are building blocks (GEMM/GEMV/fusion primitives), not a full embedding model. Full-model NPU serving is work in progress (see xdna-npu-toolkit#12).
  • The gemv-288 (batch=1) is 8× slower than CPU because it is bound by the host↔NPU round trip. Batched GEMM is the viable path; see the fusion experiment.
  • Peano is a single-core compiler; the 4-col design reaches multi-column throughput through the array topology, not through a multi-core compiler.
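The batching point can be illustrated on the CPU alone. The sketch below shows that a batch of GEMVs against the same matrix gives exactly the same int32 results as one GEMM, so a per-call round-trip cost would be paid once per batch instead of once per vector. It uses plain numpy and says nothing about NPU timing; the batch size of 32 and the function names are illustrative assumptions, and none of the xclbins listed here is stated to support this 288×288 by 288×32 shape.

```python
import numpy as np


def gemv_loop(W, xs):
    """One GEMV per input vector: the batch=1 pattern, one call per row of xs."""
    W32 = W.astype(np.int32)
    return np.stack([W32 @ x.astype(np.int32) for x in xs])


def gemm_batched(W, xs):
    """Same results from a single GEMM: stack the vectors as columns, multiply once."""
    return (W.astype(np.int32) @ xs.T.astype(np.int32)).T


rng = np.random.default_rng(0)
W = rng.integers(-100, 100, (288, 288), dtype=np.int16)
xs = rng.integers(-100, 100, (32, 288), dtype=np.int16)  # batch of 32 is illustrative

same = np.array_equal(gemv_loop(W, xs), gemm_batched(W, xs))
print("batched GEMM matches per-vector GEMV:", same)
```

Because the arithmetic is integer, the two paths agree exactly, so batching changes only how many host↔NPU round trips are needed, not the result.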

Source

Apache-2.0 (matching the mlir-aie upstream license).
