benchmark
op
naive C with OpenMP
for for for
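The outline does not say which op is benchmarked, so the following is only a minimal sketch of what a naive triple loop with OpenMP might look like. The function name `mla_naive`, the channel/height/width layout, and the per-channel multiply-accumulate are all assumptions made for illustration, not the original op.

```c
/* Hypothetical op (assumption): out[q][y][x] += in[q][y][x] * scale[q]
 * for a tensor stored as c channels of h rows of w floats. */
void mla_naive(const float* in, float* out, const float* scale,
               int c, int h, int w)
{
    /* Channels are independent, so the outer loop is split across threads.
     * Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int q = 0; q < c; q++)
    {
        const float* inptr = in + q * h * w;
        float* outptr = out + q * h * w;
        const float k = scale[q];

        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                outptr[y * w + x] += inptr[y * w + x] * k;
            }
        }
    }
}
```

Parallelizing only the outermost loop keeps each thread on its own contiguous channel, which leaves the two inner loops free for the later unrolling and SIMD steps.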
unroll, first try
h
register allocation
kernels
unroll, second try
SIMD
NEON intrinsics
optional
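As a rough illustration of the intrinsics step, the sketch below applies a multiply-accumulate to four floats per iteration. The function name `mla_simd` and the op itself are assumptions; the intrinsics used (`vdupq_n_f32`, `vld1q_f32`, `vmlaq_f32`, `vst1q_f32`) are standard `arm_neon.h` calls. The NEON path is guarded so the same file still builds, as plain scalar code, on targets without NEON.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical op (assumption): out[i] += in[i] * k for i in [0, n). */
void mla_simd(const float* in, float* out, float k, int n)
{
    int i = 0;

#if defined(__ARM_NEON)
    /* Broadcast the scalar to all four lanes once, outside the loop. */
    float32x4_t _k = vdupq_n_f32(k);

    for (; i + 3 < n; i += 4)
    {
        float32x4_t _in = vld1q_f32(in + i);
        float32x4_t _out = vld1q_f32(out + i);
        _out = vmlaq_f32(_out, _in, _k); /* _out + _in * _k, per lane */
        vst1q_f32(out + i, _out);
    }
#endif

    /* Scalar loop: handles the leftover elements on NEON targets,
     * and the whole array everywhere else. */
    for (; i < n; i++)
    {
        out[i] += in[i] * k;
    }
}
```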
naive NEON assembly with pld
asm
pipeline optimization, first try
more register load mla
pipeline optimization, second try
interleave load mla
pipeline optimization, third try
loop tail
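One common way to handle a loop tail, sketched here in plain C rather than assembly, is to split the element count into a main-loop count and a remainder before entering the optimized loop. The names `mla_with_tail`, `nn`, and `remain`, and the unroll factor of 4, are assumptions for illustration; the outline does not show the actual code.

```c
/* Hypothetical op (assumption): out[i] += in[i] * k for i in [0, n). */
void mla_with_tail(const float* in, float* out, float k, int n)
{
    int nn = n >> 2;    /* iterations of the 4-wide main loop */
    int remain = n & 3; /* 0..3 elements left for the tail */

    /* Main loop: four elements per iteration. In an assembly version
     * this is the part replaced by the hand-scheduled load/mla/store. */
    for (; nn > 0; nn--)
    {
        out[0] += in[0] * k;
        out[1] += in[1] * k;
        out[2] += in[2] * k;
        out[3] += in[3] * k;
        in += 4;
        out += 4;
    }

    /* Tail: one element at a time, so no read or write goes past n. */
    for (; remain > 0; remain--)
    {
        *out += *in * k;
        in++;
        out++;
    }
}
```

Computing `nn` and `remain` up front keeps the main loop free of bounds checks, which matters once its body is a fixed instruction sequence.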
usual practice, load/save
233
usual practice, unroll
233
usual practice, save register
233