# Table of Contents
1. [Introduction](#1)
2. [Minimal Yaml Configs](#2)
3. [Tensor Override](#3)
4. [Various Operating Modes](#4)
- [4.1 Operating_mode: inspection](#4-1)
- [4.2 Operating mode: full_auto](#4-2)
1. Introduction
This library contains a collection of scripts that are useful for
- improving mAp of quantized models
- designing mixed precision models
- investigation of quantization issues
It is based on ONNX quantizer only. To have good support of 4-bits and 16-bits, it is recommended to set 'quantization.target_opset' to 21 and thus ir_version 10.
The advanced quantization services 'inspection', 'full_auto' are only enabled when 'operation_mode' is 'quantization'.
2. Minimal Yaml Configs
The legacy 'quantization' section is mandatory. The description provided in this readme assumes that we use ONNX quantizer and not LiteRT.
Therefore, 'quantization.quantizer' must be 'Onnx_quantizer'.
If 'quantization.operating_mode' is not defined or is set to 'default', we simply run legacy quantization with the yaml specified parameters.
If some parameters are left undefined, default values are used:
WeightSymmetric: True
ActivationSymmetric: False
CalibMovingAverage: False
QuantizeBias = True
SmoothQuant: False
SmoothQuantAlpha: 0.5
SmoothQuantFolding: True
TensorQuantOverrides: None
op_types_to_quantize = None
nodes_to_quantize = None
nodes_to_exclude = None
calibrate_method = CalibrationMethod.MinMax
weight_type = QuantType.QInt8
activ_type = QuantType.QInt8
Calibration moving average is still controlled in configs.quantization.onnx_extra_options parameters.
'quantization.iterative_quant_parameters' are only taken into account for 'inspection', 'full_auto'.
'iterative_quant_parameters' first defines a parameter called 'inspection_split' used in 'inspection' for defining a subset of the quantization set
on which are performed SNR measurements.
To finish, 'iterative_quant_parameters' also defines 'accuracy_tolerance'. This parameter is only used for mode 'full_auto'.
In object detection, it corresponds to the mAp margin (in percent) we tolerate relatively to full 8 bits performance.
The purpose of this selection will be more detailed in operating modes description in this readme.
3. Tensor Override
It is managed by 'quantization.onnx_extra_options.weights_tensor_override' and 'quantization.onnx_extra_options.activations_tensor_override' parameters.
Examples for setting the override parameters:
weights_tensor_override: [['efficientnetv2-b0_1/block6a_se_reduce_1/convolution/ReadVariableOp:0', {'quant_type': Int16, 'scale': 0.0025, 'zero_point': -12}],
['efficientnetv2-b0_1/stem_conv_1/convolution/merged_input:0', {'quant_type': Int16, "axis": 0}]
]
activations_tensor_override: [['efficientnetv2-b0_1/block4a_se_reduce_1/mul_1:0', {'quant_type': Int16}],
['efficientnetv2-b0_1/block4a_se_expand_1/BiasAdd__79:0', {'quant_type': Int16, 'scale': 0.002, 'zero_point': 0}]
]
Common remarks:
'zero_point' must be:
- one integer scalar or a list of one integer value for per-tensor
- a list of several integer values for per-channel
'scale' must be:
- one float scalar or a list of one floating-point value for per-tensor
- a list of several floating-point values for per-channel
Remarks for the weights:
- When one specifies only 'quant_type' without 'axis': whatever 'quantization.granularity' value, the tensor will be quantized on a per-tensor basis. One 'scale' and one 'zero_point'
can optionally be specified. Since weights are quantized per-tensor, bias will automatically be quantized per-tensor too.
- if 'axis' is specified, then it must be an integer: 0 if the tensor corresponds to the weights of any type of Conv, 1 for a ConvTranspose, 1 too for a Gemm or a MatMul by default.
- when 'axis' is specified (meaning that we intend to quantize the given tensor per-channel), if 'scale' and 'zero_point' are specified too, they must be a list of values whose length corresponds
to the number of output channels for the considered tensor.
Remarks for activations:
- They are not quantized per-channel. So if 'scale' and 'zero_point' are defined, one value for each is enough.
Potential parameters conflict:
- With current onnx and onnxruntime version we must not set 'onnx_quant_parameters.nodes_to_quantize' and 'onnx_extra_options.weights_tensor_override'. It seems they are mutually exclusive.
4. Various Operating Modes
4.1 Operating_mode: inspection
This mode is intended to output in stm32ai_main.log the metrics for estimating quantized tensor quality whether they are weights, bias or activation tensors.
It quantizes the model with the parameters specified in the yaml sections 'quantization', including 'onnx_quant_parameters' and 'onnx_extra_options',
which define among others bit-width and potential tensor overrides.
It has to be noticed that bias is always quantized in INT32 whatever the scheme selected in the config file. So we expect biases are high in quality.
Knowing the tensor quality, one can further investigate potential quantization issues. So it is recommended at the first time to launch this mode without
any tensor override to get the complete view of tensors quantization quality.
Important remark: computing all tensors quality metrics can take time especially if there are many layers and the quantization set is large.
To reduce the latency of this task, it is possible to use the parameter 'inspection_split' in the section 'quantization.iterative_quant_parameters'
to only consider a split of the quantization set.
4.2 Operating mode: full_auto
The goal of this mode is to automatically search for which weight tensors could be quantized in 4 bits or must remain in 8 bits taken into account a mAp
degradation budget of 'quantization.iterative_quant_parameters.accuracy_tolerance' defined in percent with respect to full 8 bits.
Activations are kept in 8 bits and biases in 32 bits in any case.
It is clear that the better the weight quantizer the higher the number of weights tensors could be moved to 4-bits.
However this algorithm can also be run with the basic 'min-max' quantization algorithm embedded into ONNX.
Several steps are sequentially performed:
1. Compute float model mAp for reference
2. Perform a first quantization with parameters in the yaml. This provides a quantized reference mAp typically W8A8
3. Perform a second quantization with 4 bits for all weights: gives an idea how much it degrades
4. Rank the weight tensors with respect to a composite score involving some parameters like total number of weigths and number of weights per output channels.
5. Based on the ranking, search which weights can stay in 4 bits and which one should be on 8 bits
The score computation does not require any training step. It is done statically based on layer parameters.
Among these parameters, we can mention (not restrictive):
a. layer type (conv2D, point-wise, depthwise, dense...)
b. kernel size (2x2, 3x3...)
c. total number of weights
d. number of weights associated to each output channel
Currently, we just consider c. and d.
In a few words, we have a 3 steps ranking procedure:
1. First, rank in decreasing order of each tensor total number of weights
2. To break a possible tie, rank in ascending order of tensor number of weights per output channel
3. To break all remaining ties if any, rank in descending order of execution in the network forward pass
So, the smaller the total number of weights for a given tensor, the lower the ranking of this tensor.
To break a possible tie, we use d. considering that if there is a lot of weights per output channel for a given tensor, it is likely this tensor will be
difficult to quantize on low number of bits, with just one 'scale and one 'zero-point'.
Once the ranking is done, we can start the search which is iterative, by dichotomy. Lower ranking tensors are first candidates for a quantization on 8 bits.
We report the model with the minimum number of layer weights on 8 bits whose mAp is not worse than 'iterative_quant_parameters.accuracy_tolerance' compared to full 8 bits.
To run 'full_auto', and search for which weights could be quantized just on 4 bits, we need to have:
weight_type = Int8
activ_type = Int8
Although, all the parameters of 'quantization' section are taken into account, we recommend not to define 'onnx_quant_parameters.op_types_to_quantize',
'onnx_quant_parameters.nodes_to_quantize', 'onnx_quant_parameters.nodes_to_exclude' to give a change to the algorithm to explore more combinations.
Same remark for 'onnx_extra_options.weights_tensor_override' and 'onnx_extra_options.activations_tensor_override'.
On the contrary, it can be beneficial to enable SmoothQuant algorithm, because it helps balancing the quantization difficulty between the weights and the activations.
In our case, since we would like to have most of the weights on 4-bits, we need to move the quantization complexity towards the activations.
This requires a smoothing parameter 'onnx_extra_options.SmoothQuantAlpha' lower than 0.5. A typical setting would then be:
SmoothQuant: False
SmoothQuantAlpha: 0.1
SmoothQuantFolding: True