# Microkernel naming conventions

This document deciphers XNNPACK's microkernel naming convention.

## General conventions

Microkernel function names follow this convention:

`xnn_<datatype>_<microkernel><activation?>_ukernel_<parameters>__<arch>`
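
Where `<datatype>` can be:

- `cs16` - complex numbers with signed 16-bit integer components
- `f16` - 16-bit half-precision floating point
- `f32` - 32-bit single-precision floating point
- `qc8` - quantized signed 8-bit with per-channel (channelwise) quantization
- `qs8` - quantized signed 8-bit
- `qu8` - quantized unsigned 8-bit
- `s16` - signed 16-bit integer
- `u32` - unsigned 32-bit integer
- `x8` - type-agnostic 8-bit elements
- `x16` - type-agnostic 16-bit elements
- `x24` - type-agnostic 24-bit elements
- `x32` - type-agnostic 32-bit elements
- `xx` - type-agnostic elements of arbitrary size, processed as bytes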

`<microkernel>` is the type of microkernel, such as:

- `gemm`
- `igemm`
- `avgpool`

`<activation>`, if the microkernel supports one, is the activation function
fused into the microkernel:

- `linear`
- `minmax`
- `relu`

`<parameters>` are microkernel-specific and can mean different things depending
on the microkernel (see below for details).

`<arch>` is the architecture the microkernel is optimized for. It can include
further qualifiers, such as additional instruction set extensions supported on
that architecture or a specific target processor:

- `scalar`
- `aarch32_neon_cortex_a55`
- `neonv8_mlal`
- `wasm`
- `avx512`
- `avx512skx`
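
Because the convention is regular, a name can be split into its fields
mechanically. The Python sketch below is a hypothetical helper
(`parse_ukernel_name` is not part of XNNPACK). It assumes the fused activation
is separated from the microkernel type by an underscore, as in the example
names used later in this document, and it treats everything after `__` as
`<arch>`.

```python
import re

# Activations listed above; a hypothetical helper, not part of XNNPACK.
ACTIVATIONS = {"linear", "minmax", "relu"}

# xnn_<datatype>_<microkernel><activation?>_ukernel_<parameters>__<arch>
NAME_RE = re.compile(r"xnn_([a-z0-9]+)_(.+?)_ukernel_(.+?)__(.+)")


def parse_ukernel_name(name):
    """Split a microkernel function name into its naming-convention fields."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a microkernel name: {name!r}")
    datatype, op, parameters, arch = m.groups()
    # Assumption: a fused activation, if present, is the last "_"-separated token.
    microkernel, _, last = op.rpartition("_")
    if microkernel and last in ACTIVATIONS:
        activation = last
    else:
        microkernel, activation = op, None
    return {
        "datatype": datatype,
        "microkernel": microkernel,
        "activation": activation,
        "parameters": parameters,
        "arch": arch,
    }
```

For example, `parse_ukernel_name("xnn_f32_gemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7")`
yields datatype `f32`, microkernel `gemm`, activation `minmax`, parameters `4x8`
and arch `aarch32_neon_cortex_a7`.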

## GEMM and IGEMM microkernels

The `<parameters>` for GEMM and IGEMM microkernels are written `<mr>x<nr>`,
where `mr` and `nr` are the number of rows and columns of the output tile that
the microkernel computes at a time.

E.g. `xnn_f32_gemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7` computes the
output matrix in tiles of 4 rows by 8 columns, i.e. 32 elements per tile.
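
To make the tile arithmetic concrete, here is a small Python sketch with
hypothetical helper names (not XNNPACK functions). It reads `mr` and `nr` from
the parameters field and counts how many `mr` x `nr` tiles are needed to cover
an `m` x `n` output, assuming partial tiles at the bottom and right edges are
counted as whole tiles.

```python
import re


def parse_gemm_parameters(parameters):
    """Return (mr, nr) from a GEMM/IGEMM "<mr>x<nr>" parameter string."""
    m = re.fullmatch(r"(\d+)x(\d+)", parameters)
    if m is None:
        raise ValueError(f"expected '<mr>x<nr>', got {parameters!r}")
    return int(m.group(1)), int(m.group(2))


def output_tiles(m, n, mr, nr):
    """Number of mr x nr tiles covering an m x n output (edge tiles may be partial)."""
    row_tiles = (m + mr - 1) // mr  # ceil(m / mr)
    col_tiles = (n + nr - 1) // nr  # ceil(n / nr)
    return row_tiles * col_tiles


mr, nr = parse_gemm_parameters("4x8")
elements_per_tile = mr * nr            # 32 for a 4x8 microkernel
tiles = output_tiles(10, 20, mr, nr)   # 3 row tiles * 3 column tiles = 9
```

A 10 x 20 output therefore needs 9 tiles from a `4x8` microkernel, the last row
and column of tiles being only partially filled.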

## DWCONV microkernels

These microkernels come in two varieties: uni-pass and multi-pass.

Uni-pass microkernels have `XpYc` in their name, where `X` is the kernel tile
and `Y` is the channel tile. `p` stands for primary, `c` for channel.

Multi-pass microkernels have `UfVmWlXcYsZr` in their name, where `U` is the
first pass tile, `V` is the middle pass tile, `W` is the last pass tile, `X` is
the channel tile, `Y` is the channel subtile, and `Z` is the channel round. `f`
stands for first, `m` for middle, `l` for last, `c` for channel, `s` for
subtile, `r` for round.
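
Both DWCONV parameter formats can be decoded with two regular expressions. The
following Python sketch is a hypothetical helper, not XNNPACK code; the
parameter strings in the usage note are illustrative.

```python
import re

# XpYc: kernel tile, channel tile (uni-pass).
UNIPASS_RE = re.compile(r"(\d+)p(\d+)c")
# UfVmWlXcYsZr: first/middle/last pass tiles, channel tile/subtile/round (multi-pass).
MULTIPASS_RE = re.compile(r"(\d+)f(\d+)m(\d+)l(\d+)c(\d+)s(\d+)r")


def parse_dwconv_parameters(parameters):
    """Decode a DWCONV <parameters> field (hypothetical helper)."""
    m = UNIPASS_RE.fullmatch(parameters)
    if m:
        kernel_tile, channel_tile = map(int, m.groups())
        return {"passes": "uni", "kernel_tile": kernel_tile,
                "channel_tile": channel_tile}
    m = MULTIPASS_RE.fullmatch(parameters)
    if m:
        first, middle, last, tile, subtile, round_ = map(int, m.groups())
        return {"passes": "multi", "first_pass_tile": first,
                "middle_pass_tile": middle, "last_pass_tile": last,
                "channel_tile": tile, "channel_subtile": subtile,
                "channel_round": round_}
    raise ValueError(f"unrecognized DWCONV parameters: {parameters!r}")
```

For instance, `25p8c` decodes to a kernel tile of 25 and a channel tile of 8,
while `5f5m5l8c4s4r` decodes to pass tiles of 5/5/5, a channel tile of 8, a
channel subtile of 4 and a channel round of 4.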

The kernel size must be at least `W+1`. The first pass handles the first `U`
kernel elements, the middle pass runs for as many iterations as possible, and
the last pass handles the remainder (at least 1 element).
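
The pass schedule described above can be sketched as a loop. This is a
hypothetical helper, not XNNPACK's implementation. It assumes the first pass
always consumes exactly `U` kernel elements, and it requires `1 <= V <= W`,
which guarantees the last pass never ends up with fewer than one element.

```python
def dwconv_pass_schedule(kernel_size, first, middle, last):
    """Split kernel_size taps into first/middle/last passes (illustrative).

    Returns (first-pass taps, middle-pass iterations, last-pass taps).
    """
    if not 1 <= middle <= last:
        # With V > W the remainder could fall to 0 or below; not handled here.
        raise ValueError("sketch assumes 1 <= middle tile <= last tile")
    remaining = kernel_size - first
    if remaining < 1:
        raise ValueError("kernel too small: the last pass needs at least 1 tap")
    iterations = 0
    # Keep running the middle pass while more than `last` taps remain.
    while remaining > last:
        remaining -= middle
        iterations += 1
    return first, iterations, remaining
```

With `5f5m5l` and a 5x5 kernel (25 taps), this gives a first pass of 5 taps,
3 middle-pass iterations of 5 taps, and a last pass of 5 taps.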

`c`, `s`, and `r` (channel tile, subtile, and round) affect the tiling of
channels. We run as many tiles of `c` channels as possible, followed by rounds
of `s` channels. The number of `c` tiles is determined after rounding the
number of channels up to a multiple of `r`. `r` is chosen based on the natural
vector width of the instruction set (e.g. SSE/AVX) and the number of elements
we can read out of bounds (`XNN_EXTRA_BYTES`).
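
One way to read the paragraph above is shown in the hypothetical Python helper
below. It only does the bookkeeping: round the channel count up to a multiple
of `r`, take as many `c`-channel tiles as fit, then cover the rest with
`s`-channel subtiles. The generated microkernels may handle the remainder
differently; this is a sketch under that reading, not XNNPACK code.

```python
def dwconv_channel_tiling(channels, tile, subtile, round_):
    """Illustrative channel split for a multi-pass DWCONV microkernel.

    Returns (rounded channel count, number of `c` tiles, number of `s` rounds).
    """
    # Round the channel count up to a multiple of r; the last group may be
    # partial, relying on the allowed out-of-bounds reads.
    rounded = (channels + round_ - 1) // round_ * round_
    # As many full channel tiles as fit.
    tiles, rest = divmod(rounded, tile)
    # Cover whatever is left with channel subtiles.
    rounds = (rest + subtile - 1) // subtile
    return rounded, tiles, rounds
```

With `8c4s4r` and 11 channels, the count is rounded up to 12, covered by one
8-channel tile followed by one 4-channel round.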

## Average Pooling and Global Average Pooling

These microkernels come in two varieties: uni-pass and multi-pass.

Uni-pass microkernels have `Cx` in their name, where `C` is a number. They
process up to and including `C` elements.

Multi-pass microkernels have `CpDx` in their name, where `C` and `D` are
numbers. They process `C` elements in the first pass, `D` elements in each
middle pass (which can run multiple times), and up to `D` elements in the last
pass.

E.g. `xnn_f32_avgpool_minmax_ukernel_9x__neon_c4` is a uni-pass microkernel
that can process up to 9 elements.
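
The choice between the two varieties and the resulting pass schedule can be
sketched in Python. These are hypothetical helpers, not XNNPACK functions, and
they assume the first pass of a `CpDx` microkernel reads `C` elements and every
later pass reads up to `D` (for instance, a `9p8x` microkernel reads 9 elements
first and then up to 8 per pass).

```python
def fits_unipass(pooling_elements, c):
    """True if a uni-pass `Cx` microkernel can handle the window on its own."""
    return 1 <= pooling_elements <= c


def multipass_schedule(pooling_elements, c, d):
    """Pass schedule for a `CpDx` microkernel (illustrative).

    Returns (first-pass elements, middle-pass iterations, last-pass elements).
    """
    if d < 1:
        raise ValueError("D must be positive")
    remaining = pooling_elements - c
    if remaining < 1:
        raise ValueError("window fits in the first pass; use a uni-pass kernel")
    iterations = 0
    # Run the middle pass while more than D elements remain.
    while remaining > d:
        remaining -= d
        iterations += 1
    return c, iterations, remaining
```

A 5x5 window (25 elements) does not fit a `9x` microkernel; under the stated
assumption a `9p8x` microkernel covers it with a 9-element first pass, one
8-element middle pass and an 8-element last pass.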