ResNet50 v2

Use case : `Image classification`

Model description

ResNets family is a well known architecture that uses skip connections to enable stronger gradients in much deeper networks. This variant has 50 layers.

The model is quantized in int8 using tensorflow lite converter. A mixed precision version is also provided using onnx-runtime and our own quantization scripts.

Network information

Network Information	Value
Framework	TensorFlow Lite
MParams	25.6 M
Quantization	int8
Provenance	https://www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50V2
Paper	https://arxiv.org/abs/1603.05027

The models are quantized using tensorflow lite converter.

Network inputs / outputs

For an image resolution of NxM and P classes

Input Shape	Description
(1, N, M, 3)	Single NxM RGB image with UINT8 values between 0 and 255

Output Shape	Description
(1, P)	Per-class confidence for P classes in FLOAT32

Recommended platforms

Platform	Supported	Recommended
STM32L0	[]	[]
STM32L4	[]	[]
STM32U5	[]	[]
STM32H7	[x]	[]
STM32MP1	[x]	[]
STM32MP2	[x]	[x]
STM32N6	[x]	[x]

Performances

Metrics

Measures are done with default STM32Cube.AI configuration with enabled input / output allocated option.
tfs stands for "training from scratch", meaning that the model weights were randomly initialized before training.
tl stands for "transfer learning", meaning that the model backbone weights were initialized from a pre-trained model, then only the last layer was unfrozen during the training.
fft stands for "full fine-tuning", meaning that the full model weights were initialized from a transfer learning pre-trained model, and all the layers were unfrozen during the training.

Reference NPU memory footprint on food101 and imagenet dataset (see Accuracy for details on dataset)

Model	Dataset	Format	Resolution	Series	Internal RAM	External RAM	Weights Flash	STEdgeAI Core version
ResNet50 v2 fft	food101	Int8	224x224x3	STM32N6	2308.06	3136	23833.67	3.0.0
ResNet50 v2 fft	food101	Int8/Int4	224x224x3	STM32N6	2308.06	2352	13268.39	3.0.0
ResNet50 v2	imagenet	Int8	224x224x3	STM32N6	2308.06	3136.0	25633.61	3.0.0
ResNet50 v2	imagenet	Int8/Int4	224x224x3	STM32N6	2308.06	2352	21154.53	3.0.0

Reference NPU inference time on food101 and imagenet dataset (see Accuracy for details on dataset)

Model	Dataset	Format	Resolution	Board	Execution Engine	Inference time (ms)	Inf / sec	STEdgeAI Core version
ResNet50 v2 fft	food101	Int8	224x224x3	STM32N6570-DK	NPU/MCU	238.49	4.19	3.0.0
ResNet50 v2 fft	food101	Int8/Int4	224x224x3	STM32N6570-DK	NPU/MCU	267.33	3.74	3.0.0
ResNet50 v2	imagenet	Int8	224x224x3	STM32N6570-DK	NPU/MCU	243.04	4.11	3.0.0
ResNet50 v2	imagenet	Int8/Int4	224x224x3	STM32N6570-DK	NPU/MCU	286.06	3.5	3.0.0

Reference MCU memory footprint based on Food-101 and imagenet dataset (see Accuracy for details on dataset)

Model	Dataset	Format	Resolution	Series	Activation RAM	Runtime RAM	Weights Flash	Code Flash	Total RAM	Total Flash	STEdgeAI Core version
ResNet50 v2 fft	food101	Int8	224x224x3	STM32H7	1816.2 KiB	14.56 KiB	23240.96 KiB	169.12 KiB	1830.76 KiB	23410.08 KiB	3.0.0
ResNet50 v2	imagenet	Int8	224x224x3	STM32H7	2142.07 KiB	41.03 KiB	25042.47 KiB	225.32 KiB	2183.1 KiB	25267.79 KiB	3.0.0

Reference MCU inference time based on Food-101 and imagenet dataset (see Accuracy for details on dataset)

Model	Dataset	Format	Resolution	Board	Execution Engine	Frequency	Inference time (ms)	STEdgeAI Core version
ResNet50 v2 fft	food101	Int8	224x224x3	STM32H747I-DISCO	1 CPU	400 MHz	11314.82	3.0.0
ResNet50 v2	imagenet	Int8	224x224x3	STM32H747I-DISCO	1 CPU	400 MHz	11370.07	3.0.0

Accuracy with Food-101 dataset

Dataset details: link, Quotation[1] , Number of classes: 101 , Number of images: 101 000

Model	Format	Resolution	Top 1 Accuracy
ResNet50 v2 fft	Float	224x224x3	82.2 %
ResNet50 v2 fft	Int8	224x224x3	81.03 %
ResNet50 v2 fft	Int8/Int4	224x224x3	80.17 %

Accuracy with imagenet dataset

Dataset details: link, Quotation[4]. Number of classes: 1000. To perform the quantization, we calibrated the activations with a random subset of the training set. For the sake of simplicity, the accuracy reported here was estimated on the 50000 labelled images of the validation set.

model	Format	Resolution	Top 1 Accuracy
ResNet50 v2	Float	224x224x3	68.73 %
ResNet50 v2	Int8	224x224x3	67.99 %
ResNet50 v2	Int8/Int4	224x224x3	67.45 %

Retraining and Integration in a simple example:

Please refer to the stm32ai-modelzoo-services GitHub here

References

[1] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 -- Mining Discriminative Components with Random Forests." European Conference on Computer Vision, 2014.

Downloads last month: -; Downloads are not tracked for this model. How to track

Paper for STMicroelectronics/resnet50v2

Identity Mappings in Deep Residual Networks

Paper • 1603.05027 • Published Mar 16, 2016 • 5