| # benchmark | |
| op | |
| # naive C with OpenMP | |
| for for for | |
| # unroll, first try | |
| h | |
| # register allocation | |
| kernels | |
| # unroll, second try | |
| simd | |
| # NEON intrinsics | |
| optional | |
| # naive NEON assembly with pld | |
| asm | |
| # pipeline optimize, first try | |
| more register load mla | |
| # pipeline optimize, second try | |
| interleave load mla | |
| # pipeline optimize, third try | |
| loop tail | |
| # usual practice, load/save | |
| 233 | |
| # usual practice, unroll | |
| 233 | |
| # usual practice, save register | |
| 233 | |
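
As a rough illustration of the "naive C with OpenMP" starting point ("for for for") in the outline above, here is a minimal sketch. The outline does not say which op is benchmarked, so the row-major matrix multiply, the function name `naive_matmul`, and its parameters are all assumptions chosen only to show the shape of a triple loop with an OpenMP pragma on the outermost loop.

```c
#include <stddef.h>

/* Hypothetical op for illustration only: C = A * B, all row-major.
 * A is m x k, B is k x n, C is m x n. The outline does not name the real op. */
void naive_matmul(const float *a, const float *b, float *c, int m, int n, int k)
{
    /* Split the output rows across threads. The pragma is simply ignored
     * when the compiler is not invoked with OpenMP enabled. */
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.f;
            for (int p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A baseline like this is mainly useful as a correctness reference: each later step (unrolling, intrinsics, hand-written assembly) can be checked against its output before comparing timings.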