---
license: apache-2.0
pipeline_tag: audio-classification
---
# Quantized Yamnet

## **Use case**: `AED`

# Model description

Yamnet is a very well-known audio classification model, pre-trained on AudioSet and released by Google. The default model outputs embedding vectors of size 1024.

As the default Yamnet is a bit too large to fit on most microcontrollers (over 3 million parameters), we provide in this model zoo a heavily downsized version of Yamnet that outputs embeddings of size 256.

We now also provide the original Yamnet (named Yamnet-1024 in this repo), with its original 3.2 million parameters, for use on the STM32N6.

Additionally, the default Yamnet provided by Google expects waveforms as input and uses specific custom layers to perform the conversion to a mel-spectrogram and the extraction of patches.
These custom layers are not included in Yamnet-256 or Yamnet-1024, as STEDGEAI cannot convert them to C code, and more efficient implementations of these operations already exist on microcontrollers.
Thus, Yamnet-256 and Yamnet-1024 expect mel-spectrogram patches of size 64x96, in (n_mels, n_frames) format.
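
As an illustration of how such patches can be produced on the application side, here is a minimal Python sketch using librosa. It is a sketch only: the 16 kHz sample rate is Yamnet's usual input rate, the file name is a placeholder, and the exact preprocessing used by the model zoo scripts may differ (the STFT parameters follow the "Network inputs / outputs" section below).

```python
import numpy as np
import librosa

# Assumed parameters: 16 kHz input (Yamnet's usual rate), 25 ms window,
# 10 ms hop, 64 mel bands, frequencies clipped to [125, 7500] Hz.
SR = 16000
N_FFT = int(0.025 * SR)   # 25 ms window -> 400 samples
HOP = int(0.010 * SR)     # 10 ms hop    -> 160 samples
N_MELS, N_FRAMES = 64, 96

waveform, _ = librosa.load("clip.wav", sr=SR, mono=True)  # placeholder file

mel = librosa.feature.melspectrogram(
    y=waveform, sr=SR, n_fft=N_FFT, hop_length=HOP,
    n_mels=N_MELS, fmin=125, fmax=7500,
)
log_mel = np.log(mel + 1e-6)  # log-mel, shape (n_mels, n_total_frames)

# Cut into non-overlapping (64, 96) patches, one per network inference
n_patches = log_mel.shape[1] // N_FRAMES
patches = [log_mel[:, i * N_FRAMES:(i + 1) * N_FRAMES] for i in range(n_patches)]
```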

The models are quantized to int8, using the TensorFlow Lite converter for Yamnet-256 and the ONNX quantizer for Yamnet-1024.
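
For reference, int8 post-training quantization with the TensorFlow Lite converter typically looks like the sketch below. The representative dataset generator and file names are placeholders, and the settings used by the model zoo quantization scripts may differ; Yamnet-1024 follows an analogous flow with onnxruntime's static quantization in QDQ format.

```python
import numpy as np
import tensorflow as tf

# Placeholder: the float Keras model to quantize
model = tf.keras.models.load_model("yamnet_256_64x96_tl.h5")

def representative_data_gen():
    # Placeholder: in practice, yield a few hundred real (1, 64, 96, 1) mel patches
    for _ in range(100):
        yield [np.random.rand(1, 64, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("yamnet_256_64x96_tl_int8.tflite", "wb") as f:
    f.write(converter.convert())
```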

We provide Yamnet-256 models for two different datasets: ESC-10, a small research dataset, and FSD50K, a large generalist dataset using the AudioSet ontology.
For FSD50K, the model is trained to detect a small subset of the classes included in the dataset: Knock, Glass, Gunshots and gunfire, Crying and sobbing, and Speech.

The inference times and footprints are very similar in both cases, with the FSD50K model being very slightly smaller and faster.

## Network information

Yamnet-256

| Network Information | Value |
|-------------------------|-----------------|
| Framework | TensorFlow Lite |
| Parameters Yamnet-256 | 130 K |
| Quantization | int8 |
| Provenance | https://tfhub.dev/google/yamnet/1 |

Yamnet-1024

| Network Information | Value |
|-------------------------|-----------------|
| Framework | ONNX |
| Parameters Yamnet-1024 | 3.2 M |
| Quantization | int8 |
| Provenance | https://tfhub.dev/google/yamnet/1 |

## Network inputs / outputs

The network expects mel-spectrogram patches of 96 frames and 64 mels, i.e. of shape (64, 96, 1).
Additionally, the original Yamnet converts waveforms to spectrograms using an FFT / window size of 25 ms and a hop length of 10 ms, and clips frequencies to the 125-7500 Hz range.
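
To make the input / output contract concrete, here is a minimal, hypothetical sketch of running the quantized Yamnet-256 on one patch with the TensorFlow Lite interpreter; the patch itself is a random placeholder, and the int8 conversion uses the scale and zero-point stored in the model.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="yamnet_256_64x96_tl_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

patch = np.random.rand(1, 64, 96, 1).astype(np.float32)  # placeholder mel patch

# Quantize the float patch with the scale / zero-point stored in the model
scale, zero_point = inp["quantization"]
q_patch = np.clip(np.round(patch / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], q_patch)
interpreter.invoke()

# Dequantize the int8 output back to float; for Yamnet-256 this is a 256-d embedding
out_scale, out_zero_point = out["quantization"]
embedding = (interpreter.get_tensor(out["index"]).astype(np.float32) - out_zero_point) * out_scale
print(embedding.shape)  # expected: (1, 256)
```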
| 55 |
+
Yamnet-256 outputs embedding vectors of size 256. If you use the model zoo scripts to perform transfer learning, a classification head with the specified number of classes will automatically be added to the network.
|
| 56 |
+
|
| 57 |
+
Yamnet-1024 is the original yamnet without the TF preprocessing layers attached, and outputs embedding vectors of size 1024. If you use the model zoo scripts to perform transfer learning, a classification head with the specified number of classes will automatically be added to the network.
|
| 58 |
+
|
| 59 |
+
|
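
As an illustration of what such a transfer-learning setup can look like (not the model zoo's exact code), the sketch below stacks a small classification head on a frozen Yamnet-256 backbone; the backbone file name and hyperparameters are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 10  # e.g. ESC-10

# Placeholder: the float Yamnet-256 backbone producing 256-d embeddings
backbone = tf.keras.models.load_model("yamnet_256_backbone.h5")
backbone.trainable = False  # freeze the embedding network

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 96, 1)),  # mel patch, (n_mels, n_frames, 1)
    backbone,                           # -> 256-d embedding
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```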

## Recommended platforms

For Yamnet-256

| Platform | Supported | Recommended |
|----------|-----------|-------------|
| STM32U5  | [x]       | [x]         |
| STM32N6  | [x]       | [x]         |

For Yamnet-1024

| Platform | Supported | Recommended |
|----------|-----------|-------------|
| STM32N6  | [x]       | [x]         |

# Performances

## Metrics

Measurements are done with the default STEDGEAI configuration, with the input / output allocated option enabled.

### Reference **NPU** memory footprint based on ESC-10 dataset

| Model | Dataset | Format | Resolution | Series | Internal RAM (KiB) | External RAM (KiB) | Weights Flash (KiB) | STM32Cube.AI version | STEdgeAI Core version |
|----------|------------------|--------|-------------|------------------|------------------|---------------------|-------|----------------------|-------------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | ESC-10 | int8 | 64x96x1 | STM32N6 | 144 | 0.0 | 176.59 | 10.0.0 | 2.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | ESC-10 | int8 | 64x96x1 | STM32N6 | 144 | 0.0 | 3497.24 | 10.0.0 | 2.0.0 |

### Reference **NPU** inference time based on ESC-10 dataset

| Model | Dataset | Format | Resolution | Board | Execution Engine | Inference time (ms) | Inf / sec | STM32Cube.AI version | STEdgeAI Core version |
|--------|------------------|--------|-------------|------------------|------------------|---------------------|-------|----------------------|-------------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | ESC-10 | int8 | 64x96x1 | STM32N6570-DK | NPU/MCU | 1.07 | 934.58 | 10.0.0 | 2.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | ESC-10 | int8 | 64x96x1 | STM32N6570-DK | NPU/MCU | 9.88 | 101.21 | 10.0.0 | 2.0.0 |

### Reference **MCU** memory footprint based on ESC-10 dataset

| Model | Format | Resolution | Board / Series | Activation RAM (kB) | Runtime RAM (kB) | Weights Flash (kB) | Code Flash (kB) | Total RAM (kB) | Total Flash (kB) | STM32Cube.AI version |
|-------------------|--------|------------|---------|----------------|-------------|---------------|------------|-------------|-------------|-----------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | B-U585I-IOT02A | 109.57 | 7.61 | 135.91 | 57.74 | 117.18 | 193.65 | 10.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | int8 | 64x96x1 | STM32N6 | 108.59 | 35.41 | 3162.66 | 334.30 | 144.0 | 3496.96 | 10.0.0 |

### Reference inference time based on ESC-10 dataset

| Model | Format | Resolution | Board | Execution Engine | Frequency | Inference time | STM32Cube.AI version |
|-------------------|--------|------------|------------------|------------------|--------------|-----------------|-----------------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | B-U585I-IOT02A | 1 CPU | 160 MHz | 281.95 ms | 10.0.0 |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | int8 | 64x96x1 | STM32N6 | 1 CPU + 1 NPU | 800 MHz / 1000 MHz | 11.949 ms | 10.0.0 |

### Accuracy with ESC-10 dataset

A note on clip-level accuracy: in a traditional AED data processing pipeline, audio is converted to a spectral representation (in this model zoo, mel-spectrograms), which is then cut into patches. Each patch is fed to the inference network, and a label vector is output for each patch. The labels of these patches are then aggregated based on which clip each patch belongs to, forming a single aggregate label vector per clip. Accuracy is then computed on these aggregate label vectors.

This metric is used instead of patch-level accuracy because patch-level accuracy varies immensely depending on the specific manner in which the spectrogram is cut into patches, and because clip-level accuracy is the metric most often reported in research papers.
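
As a concrete (hypothetical) illustration of this aggregation, the sketch below averages per-patch probability vectors over each clip before taking the argmax; the model zoo may use a different aggregation rule, such as majority voting.

```python
import numpy as np

def clip_level_accuracy(patch_probs, clip_ids, clip_labels):
    """patch_probs: (n_patches, n_classes) per-patch class probabilities,
    clip_ids: (n_patches,) index of the clip each patch belongs to,
    clip_labels: (n_clips,) ground-truth class index of each clip."""
    n_clips, n_classes = len(clip_labels), patch_probs.shape[1]
    agg = np.zeros((n_clips, n_classes))
    counts = np.zeros(n_clips)
    for probs, cid in zip(patch_probs, clip_ids):
        agg[cid] += probs   # accumulate patch probabilities per clip
        counts[cid] += 1
    agg /= counts[:, None]      # mean probability vector per clip
    preds = agg.argmax(axis=1)  # one predicted class per clip
    return float((preds == np.asarray(clip_labels)).mean())
```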

| Model | Format | Resolution | Clip-level Accuracy |
|-------|--------|------------|----------------|
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 94.9% |
| [Yamnet 256](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_256_64x96_tl/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 94.9% |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl.h5) | float32 | 64x96x1 | 100.0% |
| [Yamnet 1024](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/esc10/yamnet_1024_64x96_tl/yamnet_1024_64x96_tl_qdq_int8.onnx) | int8 | 64x96x1 | 100.0% |

### Accuracy with FSD50K dataset - Domestic AED use case

In this use case, the model is trained to detect a small subset of the classes included in the dataset: Knock, Glass, Gunshots and gunfire, Crying and sobbing, and Speech.

Accuracy is again reported at clip level, computed as described in the note in the previous section.

**IMPORTANT NOTE**: The accuracy of the model with the "unknown class" added is significantly lower when performing inference on PC. This is expected: the additional class regroups a large number of other classes (approximately 194 in this specific case), and thus drags performance down somewhat.

However, contrary to what the numbers might suggest, online performance on device is, in this specific case, much improved in practice by this addition.

| Model | Format | Resolution | Clip-level Accuracy |
|-------|--------|------------|----------------|
| [Yamnet 256 without unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/without_unknown_class/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 86.0% |
| [Yamnet 256 without unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/without_unknown_class/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 87.0% |
| [Yamnet 256 with unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/with_unknown_class/yamnet_256_64x96_tl.h5) | float32 | 64x96x1 | 73.0% |
| [Yamnet 256 with unknown class](https://github.com/STMicroelectronics/stm32ai-modelzoo/ST_pretrainedmodel_public_dataset/fsd50k/yamnet_256_64x96_tl/with_unknown_class/yamnet_256_64x96_tl_int8.tflite) | int8 | 64x96x1 | 73.9% |

## Retraining and integration in a simple example

Please refer to the stm32ai-modelzoo-services GitHub repository [here](https://github.com/STMicroelectronics/stm32ai-modelzoo-services).