	# Depthwise convolution microkernels

	This document describes how depthwise convolution (DWCONV) microkernels work.

	All depthwise convolution microkernels live in `src/*-dwconv`, e.g.
	[`src/f32-dwconv`](https://github.com/google/XNNPACK/tree/master/src/f32-dwconv).

	The simplest microkernel to look at is probably
	[`f32-dwconv-up2x3-scalar.c`](../src/f32-dwconv/gen/f32-dwconv-up2x3-scalar.c).

	Key parameters:

	- channel tile, the number of channels the microkernel processes in each iteration
	- kernel tile, the number of kernel elements the microkernel reads in each iteration, where each kernel element
	holds one weight per channel. This can be greater than the actual number of kernel elements.

	## High level description

	Each call to the DWCONV microkernel will produce 1 row of output.

	For each element of this row of output, DWCONV will produce `channel_tile`
	number of outputs in the main loop, with a separate loop to handle remainders
	(remainder loop).

	In each iteration of the main loop, the microkernel will read `channel_tile` biases, `channel_tile * kernel_tile`
	inputs, `channel_tile * kernel_tile` weights, and, optionally, `channel_tile` of per-channel scales,
	perform the convolution, then write `channel_tile` outputs.

	In the remainder loop, the microkernel will read `remainder_channels` biases,
	`remainder_channels * kernel_tile` inputs, `remainder_channels * kernel_tile`
	weights, perform the convolution, and write `remainder_channels` outputs.
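
The loop structure described above can be sketched in scalar C. This is a minimal illustration rather than the XNNPACK source: the name `sketch_dwconv_pixel` and the unpacked `bias`/`kernel` layout are assumptions made for readability, and the tile sizes are chosen to mirror `up2x3`.

```c
#include <assert.h>
#include <stddef.h>

enum { CHANNEL_TILE = 2, KERNEL_TILE = 3 };

// Hypothetical helper: computes all channels of ONE output element.
// inputs[k] points at the `channels` input values seen by kernel element k.
// kernel is stored as kernel[k * channels + c] (an assumption for clarity;
// a real microkernel reads packed weights instead).
static void sketch_dwconv_pixel(size_t channels,
                                const float* const inputs[KERNEL_TILE],
                                const float* bias,
                                const float* kernel,
                                float* output) {
  assert(channels != 0);
  size_t c = 0;
  // Main loop: CHANNEL_TILE channels per iteration.
  for (; c + CHANNEL_TILE <= channels; c += CHANNEL_TILE) {
    float acc[CHANNEL_TILE];
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      acc[t] = bias[c + t];  // CHANNEL_TILE biases
    }
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      for (size_t t = 0; t < CHANNEL_TILE; t++) {
        // CHANNEL_TILE * KERNEL_TILE inputs and weights in total
        acc[t] += inputs[k][c + t] * kernel[k * channels + c + t];
      }
    }
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      output[c + t] = acc[t];  // CHANNEL_TILE outputs
    }
  }
  // Remainder loop: fewer than CHANNEL_TILE channels are left.
  for (; c < channels; c++) {
    float acc = bias[c];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      acc += inputs[k][c] * kernel[k * channels + c];
    }
    output[c] = acc;
  }
}
```

With 3 channels, the main loop runs once (channels 0 and 1) and the remainder loop handles channel 2.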

	## Microkernel arguments

	```c
	void xnn_f32_dwconv_ukernel_up2x3__scalar(
	    size_t channels,
	    size_t output_width,
	    const float** input,
	    const float* weights,
	    float* output,
	    size_t input_stride,
	    size_t output_increment,
	    size_t input_offset,
	    const float* zero,
	    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)]);
	```

	- `channels`, number of output channels to compute
	- `output_width`, number of produced pixels
	- `input`, pointer to input indirection buffer
	- `weights`, pointer to weights
	- `output`, pointer to output
	- `input_stride`, number of bytes to add to the indirection buffer to advance to the input pointers corresponding to the
	next output element
	- `output_increment`, number of bytes to get to the next output element
	- `input_offset`, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
	- `zero`, pointer to zero buffer
	- `params`, microkernel parameters; for the `minmax` variants these hold the min/max values for clamping the output
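
To show how these arguments interact, here is a simplified sketch with a channel tile of 1 and a kernel tile of 3. The name `sketch_dwconv_up1x3` is hypothetical, `params` is omitted (no clamping), and the per-channel weight layout (1 bias followed by 3 kernel values) is an assumption. In this sketch, `output_increment` is applied after the `channels` values just written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical simplified microkernel: channel tile 1, kernel tile 3.
// weights holds, per channel, 1 bias followed by 3 kernel values (assumed).
static void sketch_dwconv_up1x3(size_t channels,
                                size_t output_width,
                                const float** input,
                                const float* weights,
                                float* output,
                                size_t input_stride,
                                size_t output_increment,
                                size_t input_offset,
                                const float* zero) {
  assert(channels != 0);
  assert(output_width != 0);
  do {
    // Read the 3 input pointers for this output element.
    const float* i[3];
    for (size_t k = 0; k < 3; k++) {
      i[k] = input[k];
      // Pointers into the zero buffer are used as-is; all others are
      // shifted by input_offset bytes.
      if (i[k] != zero) {
        i[k] = (const float*) ((uintptr_t) i[k] + input_offset);
      }
    }
    // Advance the indirection buffer by input_stride bytes.
    input = (const float**) ((uintptr_t) input + input_stride);

    const float* w = weights;
    for (size_t c = 0; c < channels; c++) {
      float acc = w[0];  // bias
      for (size_t k = 0; k < 3; k++) {
        acc += i[k][c] * w[1 + k];
      }
      w += 4;
      *output++ = acc;
    }
    // Skip output_increment bytes to reach the next output element.
    output = (float*) ((uintptr_t) output + output_increment);
  } while (--output_width != 0);
}
```

Because `input_offset` is added at run time, the same indirection buffer can be reused for a different input image without being rebuilt, while padding entries keep pointing at the zero buffer.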

	## Packing

	Based on the high level description of the microkernel, we will have to pack the
	weights such that we have:

	- `channel_tile` biases
	- `channel_tile * kernel_tile` weights

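	This group is repeated `round_up(channels, channel_tile) / channel_tile` times, once per channel tile, so the last
	tile is padded when `channels` is not a multiple of `channel_tile`.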
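
A packing routine for this layout might look like the sketch below. The names `sketch_round_up` and `sketch_pack_dwconv` are hypothetical, the unpacked kernel is assumed to be stored as `kernel[k * channels + c]`, and padding slots are zero-filled here as an assumption (the real packing code may treat them differently).

```c
#include <assert.h>
#include <stddef.h>

// Rounds n up to the next multiple of q.
static size_t sketch_round_up(size_t n, size_t q) {
  return (n + q - 1) / q * q;
}

// Hypothetical packing routine.
// packed must hold sketch_round_up(channels, channel_tile) * (1 + kernel_tile)
// floats. Padding slots (channels past the end, unused kernel elements) are
// zero-filled in this sketch.
static void sketch_pack_dwconv(size_t channels, size_t kernel_size,
                               size_t channel_tile, size_t kernel_tile,
                               const float* bias, const float* kernel,
                               float* packed) {
  assert(channel_tile != 0);
  assert(kernel_size <= kernel_tile);
  const size_t padded_channels = sketch_round_up(channels, channel_tile);
  for (size_t c0 = 0; c0 < padded_channels; c0 += channel_tile) {
    // channel_tile biases
    for (size_t t = 0; t < channel_tile; t++) {
      const size_t c = c0 + t;
      *packed++ = c < channels ? bias[c] : 0.0f;
    }
    // channel_tile * kernel_tile weights, grouped by kernel element
    for (size_t k = 0; k < kernel_tile; k++) {
      for (size_t t = 0; t < channel_tile; t++) {
        const size_t c = c0 + t;
        const int valid = c < channels && k < kernel_size;
        *packed++ = valid ? kernel[k * channels + c] : 0.0f;
      }
    }
  }
}
```

For 3 channels with a channel tile of 2 and a kernel tile of 3, this produces 2 groups of 8 floats each, and the second group carries one real channel and one padding channel.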

	## Indirection buffer

	The indirection buffer is packed such that the `kernel_tile` input pointers (one per kernel element) required for
	computing a single output element are adjacent to each other. A simple way to pack it is:

	```
	input   kernel  output

	ABC     ab      WX
	DEF     cd      YZ
	GHI

	uncompressed indirection buffer for first row of output
	ABDEBCEF
	```

	This requires `kernel_tile * output_width` pointers.
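
The simple packing can be reproduced with a few lines of C. In this sketch each `char` stands in for one input pixel (in the real buffer each pointer refers to a pixel's channel values), the name `sketch_indirection_row_first` is hypothetical, and a stride of 1 with no padding is assumed.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: writes kernel_height * kernel_width pointers per
// output element for output row `oy` (stride 1, no padding assumed).
static void sketch_indirection_row_first(const char* input, size_t input_width,
                                         size_t kernel_height, size_t kernel_width,
                                         size_t oy, size_t output_width,
                                         const char** buffer) {
  assert(buffer != NULL);
  for (size_t ox = 0; ox < output_width; ox++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      for (size_t kx = 0; kx < kernel_width; kx++) {
        *buffer++ = &input[(oy + ky) * input_width + (ox + kx)];
      }
    }
  }
}
```

Called on the 3x3 input `"ABCDEFGHI"` with a 2x2 kernel, `oy = 0`, and `output_width = 2`, the 8 pointers dereference to `ABDEBCEF`.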

	We can compress this if we pack the input pointers column first:

	```
	column first uncompressed:
	ADBEBECF
	```

	Notice that `BE` is repeated, so we can elide it, provided that we tell the
	microkernel how far to skip to get to the input pointers for the next
	output element. This skip is no longer just `kernel_tile` pointers; that is
	what `input_stride` is for.

	```
	column first compressed:
	ADBECF
	```
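
The compressed layout and the resulting stride can be sketched as follows. As before, each `char` stands in for one input pixel, the name `sketch_indirection_compressed` is hypothetical, and a stride of 1 with no padding is assumed. The returned byte step is what would be passed as `input_stride` under those assumptions.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: writes one group of kernel_height pointers per input
// column touched by output row `oy`, and returns the byte step between the
// pointer groups of adjacent output elements (stride 1, no padding assumed).
static size_t sketch_indirection_compressed(const char* input, size_t input_width,
                                            size_t kernel_height, size_t kernel_width,
                                            size_t oy, size_t output_width,
                                            const char** buffer) {
  assert(buffer != NULL);
  const size_t columns = output_width + kernel_width - 1;
  for (size_t x = 0; x < columns; x++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      *buffer++ = &input[(oy + ky) * input_width + x];
    }
  }
  // Adjacent output elements share all but one column, so the step is one
  // column of pointers rather than a full kernel's worth.
  return kernel_height * sizeof(const char*);
}
```

For the example above this writes 6 pointers (`ADBECF`) instead of 8, and the step is 2 pointers: the first output element reads `ADBE`, and the second starts 2 pointers later and reads `BECF`.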

	The weights similarly have to be packed column first.
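
As a small illustration of that reordering, the sketch below converts a row-major kernel into column-first order so that kernel element `k` lines up with pointer `k` of the column-first indirection buffer. The name `sketch_kernel_column_first` is hypothetical, and each `char` stands in for one kernel element.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: reorders a row-major kernel (kernel[ky * kernel_width + kx])
// into column-first order.
static void sketch_kernel_column_first(const char* kernel,
                                       size_t kernel_height, size_t kernel_width,
                                       char* taps) {
  assert(taps != NULL);
  for (size_t kx = 0; kx < kernel_width; kx++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      *taps++ = kernel[ky * kernel_width + kx];
    }
  }
}
```

For the 2x2 kernel `abcd` this yields `acbd`, which pairs with the pointers `ADBE` to give `W = A*a + D*c + B*b + E*d`.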