Audio Classification
Commit 3380ead by FBAGSTM · verified · 1 parent: 4485d78

Release AI-ModelZoo-4.0.0

Files changed (1)
  1. README.md +23 -19
README.md CHANGED
@@ -4,7 +4,7 @@ license_link: >-
4
  https://github.st.com/AIS/stm32ai-modelzoo/raw/master/audio_event_detection/LICENSE.md
5
  pipeline_tag: audio-classification
6
  ---
7
- # Quantized miniresnet
8
 
9
  ## **Use case** : `AED`
10
 
@@ -14,20 +14,21 @@ ResNets are well known image classification models, that use skip-connections be
14
 
15
  However, they are also widely used in AED and Audio classification, by converting the audio to a mel-spectrogram, and passing that as input to the model.
16
 
 
17
 
18
- MiniResNet is based on the ResNet implementation found in tensorflow, and is a resized version of a ResNet18 with a custom block function. These blocks are then assembled in stacks, and the user can specify the number of stacks desired, with more stacks resulting in a larger network.
19
 
20
  A note on pooling : In some of our pretrained models, we do not use a pooling function at the end of the convolutional backbone, as is traditionally done. Because of the small number of convolutional blocks, the number of filters is low even for larger model sizes, leading to a low embedding size after pooling.
21
  We found that in many cases we obtain a better performance / model size / inference time tradeoff by not performing any pooling. This makes the linear classification layer larger, but in cases with a relatively low number of classes, this remains cheaper than adding more convolutional blocks.
22
 
23
  Naturally, you are able to set the type of pooling you wish to use when training a model, whether from scratch or using transfer learning.
24
 
25
- The MiniResNet backbones provided in the model zoo are pretrained on [FSD50K](https://zenodo.org/records/4060432)
26
-
27
 
28
  Source implementation : https://keras.io/api/applications/resnet/
29
 
30
  Papers : [ResNet](https://arxiv.org/abs/1512.03385)
 
31
 
32
  ## Network information
33
 
@@ -35,8 +36,8 @@ Papers : [ResNet](https://arxiv.org/abs/1512.03385)
35
  | Network Information | Value |
36
  |-------------------------|-----------------|
37
  | Framework | TensorFlow Lite |
38
- | Params 1 stack | 135K |
39
- | Params 2 stacks | 450K |
40
  | Quantization | int8 |
41
  | Provenance | https://keras.io/api/applications/resnet/ |
42
 
@@ -45,7 +46,7 @@ The pre-trained networks expects patches of shape (64, 50, 1), with 64 mels and
45
 
46
  When training from scratch, you can specify whichever input shape you desire.
47
 
48
- It outputs embedding vectors of size 2048 for the 2 stacks version, and 3548 for the 1 stack version. If you use the model zoo scripts to perform transfer learning or training from scratch, a classification head with the specified number of classes will automatically be added to the network.
49
 
50
  ## Recommended platforms
51
 
@@ -58,26 +59,27 @@ It outputs embedding vectors of size 2048 for the 2 stacks version, and 3548 for
58
 
59
  ## Metrics
60
 
 
61
 
62
- Measures are done with default STEdgeAI Core configuration with enabled input / output allocated option.
63
 
64
 
65
  ### Reference MCU memory footprint based on ESC-10 dataset
66
 
67
 
68
- | Model | Format | Resolution | Series | Activation RAM (KiB) | Runtime RAM (KiB)| Weights Flash (KiB) | Code Flash (KiB) | Total RAM (KiB) | Total Flash (KiB)| STEdgeAI Core version |
69
  |-------------------|--------|------------|---------|----------------|-------------|---------------|------------|-------------|-------------|-----------------------|
70
- | [MiniResNet 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s1_64x50_tl/miniresnetv1_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 59.89 | 1.08 | 123.6 | 32.36 | 60.97 | 155.96 | 3.0.0 |
71
- | [MiniResNet 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s2_64x50_tl/miniresnetv1_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 59.89 | 1.69 | 431.1 | 36.81 | 61.58 | 467.91 | 3.0.0 |
72
 
73
 
74
  ### Reference inference time based on ESC-10 dataset
75
 
76
 
77
- | Model | Format | Resolution | Board | Execution Engine | Frequency | Inference time (ms) | STEdgeAI Core version |
78
- |-------------------|--------|------------|------------------|------------------|-------------|-----------------|-----------------------|
79
- | [MiniResNet 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s1_64x50_tl/miniresnetv1_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 1 CPU | 160 MHz | 91.45 | 3.0.0 |
80
- | [MiniResNet 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s2_64x50_tl/miniresnetv1_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 1 CPU | 160 MHz | 141.82 | 3.0.0 |
81
 
82
 
83
  ### Accuracy with ESC-10 dataset
@@ -88,10 +90,12 @@ The reason this metric is used instead of patch-level accuracy is because patch-
88
 
89
  | Model | Format | Resolution | Clip-level Accuracy |
90
  |-------|--------|------------|----------------|
91
- | [MiniResNet 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s1_64x50_tl/miniresnetv1_s1_64x50_tl.keras) | float32 | 64x50x1 | 90.0% |
92
- | [MiniResNet 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s1_64x50_tl/miniresnetv1_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | 90.0% |
93
- | [MiniResNet 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s2_64x50_tl/miniresnetv1_s2_64x50_tl.keras) | float32 | 64x50x1 | 92.5% |
94
- | [MiniResNet 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv1/esc10/miniresnetv1_s2_64x50_tl/miniresnetv1_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | 92.5% |
 
 
95
 
96
  ## Retraining and Integration in a simple example:
97
 
 
4
  https://github.st.com/AIS/stm32ai-modelzoo/raw/master/audio_event_detection/LICENSE.md
5
  pipeline_tag: audio-classification
6
  ---
7
+ # Quantized miniresnetv2
8
 
9
  ## **Use case** : `AED`
10
 
 
14
 
15
  However, they are also widely used in AED and Audio classification, by converting the audio to a mel-spectrogram, and passing that as input to the model.
16
 
17
+ ResNetv2 changes the order of the skip-connections and ReLU activations in the ordinary ResNet architecture, with the main benefit being faster convergence during training.
18
 
19
+ miniresnetv2 is based on the ResNetv2 implementation found in TensorFlow, and is a scaled-down version of ResNet18v2 with a custom block function. These blocks are assembled in stacks; the user specifies the number of stacks, and more stacks result in a larger network.
20
 
21
  A note on pooling : In some of our pretrained models, we do not use a pooling function at the end of the convolutional backbone, as is traditionally done. Because of the small number of convolutional blocks, the number of filters is low even for larger model sizes, leading to a low embedding size after pooling.
22
  We found that in many cases we obtain a better performance / model size / inference time tradeoff by not performing any pooling. This makes the linear classification layer larger, but in cases with a relatively low number of classes, this remains cheaper than adding more convolutional blocks.
23
 
24
  Naturally, you are able to set the type of pooling you wish to use when training a model, whether from scratch or using transfer learning.
25
 
26
+ The MiniResNetv2 backbones provided in the model zoo are pretrained on [FSD50K](https://zenodo.org/records/4060432)
 
27
 
28
  Source implementation : https://keras.io/api/applications/resnet/
29
 
30
  Papers : [ResNet](https://arxiv.org/abs/1512.03385)
31
+ [ResNetv2](https://arxiv.org/abs/1603.05027)
32
 
33
  ## Network information
34
 
 
36
  | Network Information | Value |
37
  |-------------------------|-----------------|
38
  | Framework | TensorFlow Lite |
39
+ | Params 1 stack | 125K |
40
+ | Params 2 stacks | 440K |
41
  | Quantization | int8 |
42
  | Provenance | https://keras.io/api/applications/resnet/ |
43
 
 
46
 
47
  When training from scratch, you can specify whichever input shape you desire.
48
 
49
+ It outputs embedding vectors of size 2048 for the 2 stacks version, and 3548 for the 1 stack version. If you use the train.py script to perform transfer learning or training from scratch, a classification head with the specified number of classes will automatically be added to the network.
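To make the embedding sizes above concrete, here is a minimal sketch of how large the added classification head would be. It assumes the head is a single fully connected layer with a bias term and that the target dataset is ESC-10 (10 classes); the layout actually built by the training script may differ, and the names `EMBEDDING_SIZES` and `dense_head_params` are illustrative only.

```python
# Embedding sizes quoted in the README; NUM_CLASSES is the ESC-10 class count.
EMBEDDING_SIZES = {"1 stack": 3548, "2 stacks": 2048}
NUM_CLASSES = 10


def dense_head_params(embedding_size: int, num_classes: int) -> int:
    """Parameter count of one dense layer with bias (assumed head layout)."""
    return embedding_size * num_classes + num_classes


head_params = {
    name: dense_head_params(size, NUM_CLASSES)
    for name, size in EMBEDDING_SIZES.items()
}

for name, count in head_params.items():
    print(f"{name}: {count} head parameters")
```

Under these assumptions the 1 stack variant carries the larger head, which fits the README's note that skipping pooling trades a bigger linear layer for fewer convolutional blocks.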
50
 
51
  ## Recommended platforms
52
 
 
59
 
60
  ## Metrics
61
 
62
+ * Measurements are taken with the default STEdgeAI Core configuration, with the input / output allocated option enabled.
63
 
64
+ * `tl` stands for "transfer learning", meaning that the model backbone weights were initialized from a pre-trained model, then only the last layer was unfrozen during the training.
65
 
66
 
67
  ### Reference MCU memory footprint based on ESC-10 dataset
68
 
69
 
70
+ | Model | Format | Resolution | Series | Activation RAM (KiB) | Runtime RAM (KiB) | Weights Flash (KiB) | Code Flash (KiB) | Total RAM (KiB) | Total Flash (KiB) | STEdgeAI Core version |
71
  |-------------------|--------|------------|---------|----------------|-------------|---------------|------------|-------------|-------------|-----------------------|
72
+ | [miniresnet v2 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s1_64x50_tl/miniresnetv2_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 59.89 | 2.84 | 123.98 | 42.76 | 62.73 | 166.74 | 3.0.0 |
73
+ | [miniresnet v2 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s2_64x50_tl/miniresnetv2_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 59.89 | 4.59 | 431.98 | 49.22 | 64.48 | 481.2 | 3.0.0 |
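The totals in the two miniresnetv2 rows appear to be plain sums of their component columns. The sketch below checks that reading against the values copied from the table. The additive relationship is inferred from the numbers rather than stated in the README, and the dictionary keys are illustrative names.

```python
# Component values copied from the miniresnetv2 rows of the footprint table.
ROWS = {
    "miniresnet v2 1 stack": {
        "activation_ram": 59.89, "runtime_ram": 2.84,
        "weights_flash": 123.98, "code_flash": 42.76,
    },
    "miniresnet v2 2 stacks": {
        "activation_ram": 59.89, "runtime_ram": 4.59,
        "weights_flash": 431.98, "code_flash": 49.22,
    },
}


def totals(row: dict) -> tuple:
    """Return (total RAM, total Flash), assuming each total is a simple sum."""
    total_ram = round(row["activation_ram"] + row["runtime_ram"], 2)
    total_flash = round(row["weights_flash"] + row["code_flash"], 2)
    return total_ram, total_flash


footprints = {name: totals(row) for name, row in ROWS.items()}

for name, (ram, flash) in footprints.items():
    print(f"{name}: total RAM {ram}, total Flash {flash}")
```

The computed pairs match the Total RAM and Total Flash columns, which suggests the totals are expressed in the same unit as the columns they are built from.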
74
 
75
 
76
  ### Reference inference time based on ESC-10 dataset
77
 
78
 
79
+ | Model | Format | Resolution | Board | Execution Engine | Frequency (MHz) | Inference time (ms) | STEdgeAI Core version |
80
+ |-------------------|--------|------------|------------------|------------------|--------------|-------|-----------------------|
81
+ | [miniresnet v2 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s1_64x50_tl/miniresnetv2_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 1 CPU | 160 | 187.26 | 3.0.0 |
82
+ | [miniresnet v2 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s2_64x50_tl/miniresnetv2_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | B-U585I-IOT02A | 1 CPU | 160 | 307.34 | 3.0.0 |
83
 
84
 
85
  ### Accuracy with ESC-10 dataset
 
90
 
91
  | Model | Format | Resolution | Clip-level Accuracy |
92
  |-------|--------|------------|----------------|
93
+ | [miniresnet v2 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s1_64x50_tl/miniresnetv2_s1_64x50_tl.keras) | float32 | 64x50x1 | 91.25% |
94
+ | [miniresnet v2 1 stack ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s1_64x50_tl/miniresnetv2_s1_64x50_tl_int8.tflite) | int8 | 64x50x1 | 92.5% |
95
+ | [miniresnet v2 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s2_64x50_tl/miniresnetv2_s2_64x50_tl.keras) | float32 | 64x50x1 | 93.75% |
96
+ | [miniresnet v2 2 stacks ](https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/audio_event_detection/miniresnetv2/esc10/miniresnetv2_s2_64x50_tl/miniresnetv2_s2_64x50_tl_int8.tflite) | int8 | 64x50x1 | 93.75% |
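The accuracy table reports clip-level rather than patch-level results. As a rough illustration of what a clip-level metric can look like, the sketch below averages the per-patch class scores belonging to each clip and takes the argmax. This aggregation rule is an assumption chosen for illustration and may not be the one used by the model zoo; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def clip_level_accuracy(patch_scores, patch_clip_ids, clip_labels):
    """Average per-patch class scores within each clip, then take the argmax."""
    grouped = defaultdict(list)
    for scores, clip_id in zip(patch_scores, patch_clip_ids):
        grouped[clip_id].append(scores)

    correct = 0
    for clip_id, rows in grouped.items():
        num_classes = len(rows[0])
        mean_scores = [
            sum(row[c] for row in rows) / len(rows) for c in range(num_classes)
        ]
        predicted = max(range(num_classes), key=mean_scores.__getitem__)
        correct += int(predicted == clip_labels[clip_id])
    return correct / len(grouped)


# Toy data (made up): four patches from two clips, three classes.
scores = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
clip_ids = ["a", "a", "a", "b"]
labels = {"a": 1, "b": 2}
accuracy = clip_level_accuracy(scores, clip_ids, labels)
print(accuracy)
```

In the toy data the first patch of clip "a" is misclassified on its own, yet the clip as a whole is still labelled correctly once its patches are averaged.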
97
+
98
+
99
 
100
  ## Retraining and Integration in a simple example:
101