Title: A Quantized, Low-Memory Footprint, TinyML Object Detection Network for Low Power Microcontrollers

The authors would like to thank armasuisse Science & Technology for funding this research.
URL Source: https://arxiv.org/html/2306.00001
Julian Moosmann, Marco Giordano, Christian Vogt, Michele Magno
Center for Project Based Learning, ETH Zürich
{julian.moosmann, marco.giordano, christian.vogt, michele.magno}@pbl.ee.ethz.ch
Abstract
This paper introduces a highly flexible, quantized, memory-efficient, and ultra-lightweight object detection network called TinyissimoYOLO. It aims to enable object detection on microcontrollers in the milliwatt power domain, with less than 0.5 MB of memory available for storing convolutional neural network (CNN) weights. The proposed quantized network architecture, with 422k parameters, enables real-time object detection on embedded microcontrollers, and it has been evaluated to exploit CNN accelerators. In particular, the proposed network has been deployed on the MAX78000 microcontroller, achieving a high frame rate of up to 180 fps and an ultra-low energy consumption of only 196 μJ per inference, with an inference efficiency of more than 106 MAC/Cycle. TinyissimoYOLO can be trained for any multi-object detection task. However, given the small network size, adding object detection classes increases the size and memory consumption of the network; object detection with up to 3 classes is therefore demonstrated. Furthermore, the network is trained using quantization-aware training and deployed with 8-bit quantization on different microcontrollers, such as the STM32H7A3, STM32L4R9, Apollo4b, and the MAX78000's CNN accelerator. Performance evaluations are presented in this paper.
Index Terms:
YOLO, ML, computer vision, object detection, CNN accelerator, microcontroller, quantization, quantization-aware training
I Introduction
Object detection on edge devices such as microcontroller units (μCs) has the potential to reduce detection latency and increase overall energy efficiency by running network inference on-device [1, 2]. In addition, the transmission of sensitive sensor data is reduced or avoided altogether, keeping privacy issues to a minimum. Furthermore, emerging dedicated hardware accelerators for machine learning (ML) models are enabling edge processing and edge artificial intelligence (AI), reducing the energy consumption of inference and enabling real-time on-board processing on resource-constrained microcontrollers [3]. On such constrained devices, computational power is significantly reduced and does not allow the deployment of classical object detection deep neural networks (DNNs) such as you only look once (YOLO) [4], region-based convolutional neural networks (R-CNNs) [5], or single shot detectors (SSDs) [6], because their memory requirements significantly exceed the few kilobytes typically available on such devices. However, their fundamental ideas remain important for designing new DNNs, especially for the emerging field of edge AI.
In fact, to keep power consumption in the order of a few milliwatts, the computational power of μCs is significantly lower than that of CPUs and GPUs. Object detection networks therefore need to be carefully designed and optimized for such tiny devices, in particular to achieve a small memory footprint and a high number of operations processed per cycle [7]. To overcome the memory constraints, several methods have recently been reported in the literature, such as pruning [8], [9], quantization [10], [11], new frameworks developed for memory-efficient inference [12], patch-based inference scheduling [13], and neural architecture searches with search spaces specialized for memory-constrained devices [14], [15]. These techniques reduce the memory footprint and complexity of existing state-of-the-art networks for deployment on tiny devices while keeping the original network's structure and achieving similar inference accuracy, although performance (especially inference speed) is often neglected [16]. Among these techniques, quantization is one of the most promising and popular, as it both reduces memory requirements and increases the number of operations per second that μCs can perform. Li et al. (2019) [17] have shown that by quantizing an object detection network to 4 bits, they achieve state-of-the-art prediction accuracy with a mean average precision (mAP) loss of only 2% to 5% compared to its full-precision counterpart, while occupying 8x less memory.
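To illustrate the memory side of this trade-off, the following is a minimal sketch of symmetric per-tensor 8-bit weight quantization. It is not the scheme used by the cited works (which rely on quantization-aware training and per-layer scales); it only shows the 4x fp32-to-int8 memory saving and the bounded rounding error.

```python
import numpy as np

# Sketch only: symmetric per-tensor 8-bit quantization of a weight tensor.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                   # one scale per tensor
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64, 3, 3).astype(np.float32)  # toy conv weights
q, scale = quantize_int8(w)
print(f"{w.nbytes // 1024} kB fp32 -> {q.nbytes // 1024} kB int8")
```

The maximum reconstruction error of this scheme is half a quantization step, which is why 8-bit deployment typically costs little accuracy compared to the 4x memory saving.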
To the best of our knowledge, no previous work adopts these techniques to achieve a generalized object detection network, ready for deployment on edge devices and μCs with less than 0.5 MB of weight memory, that achieves accuracy similar to that of larger networks.
This paper proposes a quantized and highly accurate object detection convolutional neural network (CNN) based on the YOLO [4] architecture, suitable for edge processors with limited memory and computational resources. The proposed network is composed of quantized convolutional layers with 3x3 kernels and a fully connected layer at the output, and is designed to have a low memory footprint of less than 0.5 MB. The network is trained and evaluated on the WiderFace dataset [18]. Furthermore, to showcase multi-object detection capability while keeping the network small, it has been trained and evaluated on a subset of the PascalVOC dataset [19] (3 out of the 20 classes, namely: person, chair, and car). Finally, the network is deployed, quantized and memory-efficient, on different μC architectures: the STM32H7A3 and STM32L4R9 from STMicroelectronics, Ambiq's Apollo4b, and the MAX78000, a novel microcontroller from Analog Devices with a built-in CNN accelerator. The performance of the different architectures is compared, showing how the MAX78000 outperforms the other μCs. Furthermore, this paper investigates the effect of the relative object size within images on mAP. This evaluation helps to understand which object size should be chosen when training the generalized object detection network.
II TinyissimoYOLO
The ultra-lightweight object detection network designed in this paper for operation on μCs, named TinyissimoYOLO, is a general object detection network and can be seen in Figure 1. It can be deployed on any hardware with at least 0.5 MB of network parameter memory (i.e., flash) available. Not only can it be deployed on any ARM Cortex-M microcontroller, but also on μCs with built-in hardware accelerators, such as the MAX78000 from Analog Devices, GAP9 [20] from Greenwaves, Sony's Spresense [21], or the Syntiant TinyML board [22], among others. The network can be trained on any object detection class and supports multi-class object detection at the cost of a minimal increase in memory.
To keep the size small and be efficiently deployable on all existing microcontrollers, as well as to exploit the emerging CNN hardware accelerators that are starting to be embedded in recent microcontrollers, TinyissimoYOLO consists only of convolutional layers and fully connected linear layers, both of which are heavily optimized in hardware and software toolchains. As mentioned, the goal of the design is to reach outstanding accuracy within a memory envelope of 0.5 MB, with a minimal number of operations and yet a high number of operations per cycle.
II-A Network design
Figure 1: The TinyissimoYOLO architecture proposed in this paper.
TinyissimoYOLO is inspired by the YOLO v1 [4] architecture. Major design decisions of the TinyissimoYOLO network encompass the following 4 points:
• input image size,
• hidden convolutional layers,
• fully connected linear layer before the output layer,
• output layer size.
The input image size has been chosen such that the network runs on all the previously mentioned μCs. The limiting factor is the CNN accelerator of the MAX78000, which does not support CNN inputs larger than 90x91 without using a specialized mode. An input of 88x88 is therefore chosen as a trade-off between maximizing the image size and being able to apply pooling to the input dimensions without rounding, down to the 4th pooling layer. The parameter memory of a hidden convolutional layer scales with the number of input channels times the number of output channels times the filter size. Even though the original YOLOv1 network prefers layers with high channel counts, the small memory of μCs and accelerators does not allow such layers. For example, a single convolutional layer with 256 input and 256 output channels and a 3x3 kernel already consumes more than 0.5 MB of quantized 8-bit weights. To stay below 0.5 MB of network memory, only convolutional layers with 3x3 filters are used, since they scale the weight memory by a factor of only 9x, instead of 25x (for a 5x5 kernel) or 49x (for a 7x7 kernel). Even though it would be possible to use larger filter sizes (e.g., YOLO v1 used a 7x7 filter in its input layer), we focused on 3x3 convolutions and on keeping the network as simple as possible. Furthermore, we increase the number of channels with increasing network depth, starting with 16 output channels in the CNN's input layer and ending with 128 output channels at the CNN's output. Max pooling with a stride of two halves the spatial resolution every second layer.
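The scaling argument above can be checked with a few lines. This is a sketch: byte counts assume 8-bit weights and ignore biases.

```python
# Sketch: 8-bit weight memory of a convolutional layer scales with
# in_channels * out_channels * kernel_size^2 (biases ignored).
def conv_weight_bytes(c_in, c_out, k, bits=8):
    return c_in * c_out * k * k * bits // 8

# A single 256-in/256-out 3x3 layer already exceeds the 0.5 MB budget:
print(conv_weight_bytes(256, 256, 3))   # 589824 bytes, ~0.59 MB
# Moving from 3x3 to 5x5 or 7x7 kernels scales weight memory by 25/9 or 49/9:
print(conv_weight_bytes(256, 256, 7) / conv_weight_bytes(256, 256, 3))
```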
Thereafter, the fully connected layer predicts the output probabilities and coordinates from the features extracted by the preceding convolutional layers. The fully connected linear layer connecting the CNN to the output layer is chosen to have 256 units, and is therefore larger than the output layer size of 4x4(2xBbox+C), mapping the CNN's features onto the regression and classification outputs. The network's output layer uses the YOLO v1 output of size SxS(BxBbox+C), where SxS is the grid; within each cell, the network computes B bounding boxes (Bbox) of possible detections (each with the object's width, height, centre on the x- and y-axes relative to the grid cell size, and a prediction confidence), as well as the C possibly detected classes. As the network's input image size is set to 88x88 pixels, a smaller output grid size of 4x4 with B=2 is used. The value of B was found as a trade-off between significantly higher memory consumption and lower prediction accuracy: B=3 increased the output layer memory consumption by a factor of 1.5x but did not increase the mAP significantly. Therefore, B=2 has been used for training the network on WiderFace. To ensure deployability on the CNN accelerator of the MAX78000 when training a multi-object detection network on 3 classes of PascalVOC, B=1 has been chosen. The single-class TinyissimoYOLO with B=2 is 106% the size of the multi-class TinyissimoYOLO with B=1. For training TinyissimoYOLO, we use the YOLO loss function introduced in the YOLOv1 paper [4] by Redmon et al. (2016). Evaluation uses the mAP as defined by Everingham et al. (2015) [19].
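The output-size arithmetic above can be made concrete. Following the YOLO v1 convention, each bounding box contributes 5 values (x, y, w, h, confidence):

```python
# Sketch of the S x S x (B*5 + C) YOLO v1 output-layer size used above.
def yolo_output_size(S, B, C):
    return S * S * (B * 5 + C)

# Single-class face detection (WiderFace) with B = 2:
faces = yolo_output_size(S=4, B=2, C=1)    # 4*4*(10+1) = 176 values
# Three-class PascalVOC subset (person, chair, car) with B = 1:
voc = yolo_output_size(S=4, B=1, C=3)      # 4*4*(5+3) = 128 values
```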
II-B Training and Dataset
For training, testing, and validation of TinyissimoYOLO, the WiderFace dataset [18] was used. 90% of the WiderFace training set was used to train the network, while 10% was used for validation. The network was tested on the corresponding WiderFace test set. Due to TinyissimoYOLO's small input size of 88x88x3 and the 4x4 output grid, we evaluated the network's accuracy by restricting the training data to images containing at most 10 or at most 5 objects. This ensures the objects remain visible in the downscaled 88x88 pixel images. With this restriction on the WiderFace training set, we evaluate the influence of the number of objects per image on mAP. Additionally, to show the network's ability for general multi-object detection, it is trained and evaluated on the PascalVOC dataset [19] with a restriction of at most three objects per image, due to the increased difficulty of the multi-class object detection problem. To deploy the network not only on the STM32s and the Apollo4b but also within the MAX78000 CNN accelerator's memory, the multi-object TinyissimoYOLO is trained and evaluated using only 3 of the 20 classes, namely person, chair, and car, and thus stays below the 0.5 MB memory restriction we set. These 3 classes were chosen because of their frequency of occurrence within the dataset, so the network is trained on a balanced dataset. With the network deployed on real hardware, experimental tests can be conducted in many different scenes inside and outside buildings.
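The dataset restriction described above amounts to a simple filter over the annotations. This is a sketch with a hypothetical `(image, boxes)` sample layout; the actual WiderFace loader differs.

```python
# Sketch: keep only images with at most `max_objects` annotated boxes,
# mirroring the <=10 and <=5 object WiderFace subsets described above.
def restrict_dataset(samples, max_objects):
    # `samples` is assumed to be an iterable of (image_path, boxes) pairs
    return [(img, boxes) for img, boxes in samples if len(boxes) <= max_objects]

data = [
    ("crowd.jpg", [[0, 0, 10, 10]] * 12),   # 12 faces -> dropped
    ("pair.jpg", [[0, 0, 10, 10]] * 2),     # 2 faces  -> kept
]
subset = restrict_dataset(data, max_objects=5)
```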
Because the network is designed to be deployable on μCs and on CNN accelerators in the most efficient way, it is quantized from 32-bit floating point to 8-bit integers, further reducing computation cycles and memory requirements. To optimize fine-tuning, training already takes this quantization into account: the network is trained for 350 epochs with floating-point precision and then for another 300 epochs with quantization-aware training. The initial learning rate as well as the weight decay is set to 5 × 10^-4. The optimizer is SGD [23]. The learning rate follows a multi-step schedule for improved learning speed [24].
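The schedule above can be sketched as follows. The milestone epochs and decay factor are assumptions, since the paper only states that a multi-step schedule is used:

```python
# Sketch of the two-phase training schedule described above: 350
# floating-point epochs followed by 300 quantization-aware epochs,
# SGD with lr = weight_decay = 5e-4 and a multi-step LR schedule.
# MILESTONES and GAMMA are assumed values, not taken from the paper.
EPOCHS_FP, EPOCHS_QAT = 350, 300
BASE_LR = WEIGHT_DECAY = 5e-4
MILESTONES = (100, 200, 300)   # assumed decay points
GAMMA = 0.1                    # assumed decay factor

def lr_at(epoch, base_lr=BASE_LR):
    # Multi-step schedule: decay by GAMMA after each passed milestone.
    passed = sum(1 for m in MILESTONES if epoch >= m)
    return base_lr * GAMMA ** passed

for epoch in range(EPOCHS_FP + EPOCHS_QAT):
    quant_aware = epoch >= EPOCHS_FP       # switch to QAT after epoch 349
    lr = lr_at(epoch)
    # ... train_one_epoch(model, lr, quant_aware) ...  (hypothetical)
```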
II-C μC implementation
By experimental evaluation, we demonstrate that TinyissimoYOLO can be efficiently deployed on several low-power devices: the STM32L4R9 with an ARM Cortex-M4F core, the STM32H7 with an ARM Cortex-M7 core, the Apollo4b, the world's most energy-efficient ARM Cortex-M4F μC, and the MAX78000 (using its built-in CNN accelerator). Deploying the designed and quantized network on the μCs' ARM Cortex-M cores is straightforward, as we designed the network to be compatible with TensorFlow Lite Micro [25]. Beyond μC-only deployment, the network is shown to also run well on a μC with a built-in CNN accelerator (MAX78000), deployed with the deployment tools provided by Analog Devices [26]. TinyissimoYOLO fits within the limitations of the MAX78000's CNN accelerator. The most important constraints of the MAX78000 are that the weight memory footprint of the network can be at most 442 kB and, considering the data passed through the architecture, that the input width and height are constrained to less than 90x91 when using the non-streaming mode of the CNN accelerator. All network operations needed by TinyissimoYOLO are covered by the hardware accelerator's design and its software deployment tools. A full pipeline demonstrator with the MAX78000FTHR board and an on-device CameraCubeChip® image sensor was built and used to validate the system with real-life data.
III Experimental Results
The networks have been trained on subsets of the WiderFace [18] and PascalVOC datasets, as explained in Section II-B (Training and Dataset). They have been evaluated on the whole WiderFace test set as well as on the restricted datasets. Table I lists the evaluation of the trained TinyissimoYOLO networks and compares the mAP [19] when trained with different maximum numbers of objects allowed per image. On WiderFace, TinyissimoYOLO achieves a mAP of 45% when evaluated on the whole dataset, and up to 73% mAP when restricting the evaluation dataset to at most 5 objects per image. Restricting the dataset during training to at most 10 or 5 objects per image, a mAP of 46% and 44%, respectively, is achieved when evaluating on the whole dataset. Evaluating the restricted trained networks on the restricted datasets, up to 75% mAP is achieved, an increase of 2% compared to training the network on the whole dataset. The TinyissimoYOLO network trained and evaluated on the PascalVOC dataset for the object detection classes person, chair, and car achieves a mAP of up to 57%, see Table II. Here a global restriction to at most three objects per image is made because of the increased difficulty of the multi-class object detection problem.
TABLE I: TinyissimoYOLO network trained and evaluated on WiderFace (face detection). The maximum number of objects allowed per image has been restricted for training and evaluation.
TABLE II: TinyissimoYOLO network trained and evaluated on PascalVOC with the classes person, chair, and car. The maximum number of objects per image has been restricted to 3 for training and evaluation.
mAP:

| Dataset | person | chair | car | overall |
| --- | --- | --- | --- | --- |
| PascalVOC^a | 57.4% | 30.2% | 65.1% | 58.5% |

^a training dataset with person / chair / car classes
Given TinyissimoYOLO's performance on the datasets, the TinyissimoYOLO networks (trained on WiderFace and PascalVOC) are now compared when deployed, quantized to 8 bits, on the different μCs. Since both networks yield exactly the same deployment results, no distinction between the networks is made for the following metrics:
• power efficiency [μW/MHz],
• inference efficiency [MAC/Cycle] (which indicates how well the device can parallelise the network execution),
• inference latency [ms],
• energy per inference [mJ/Inf.].
The metrics are measured by deploying TinyissimoYOLO on the different target devices and measuring the power consumption of the μCs only, with the following configurations: MAX78000 @ 1.2 V, 50 MHz; STM32H7A3 @ 3.3 V, 192 MHz; STM32L4R9 @ 3.3 V, 120 MHz; and Apollo4b @ 1.8 V, 192 MHz. The metrics are chosen as proposed by Giordano et al. (2022) [27] to evaluate each architecture's latency, power, and energy efficiency, as well as the hardware's capability of parallelising the network's computational workload across its processor(s). Figure 2 a) compares the latency of the network executed on the different μCs. Thanks to its custom CNN hardware accelerator, the MAX78000 runs one inference within 5.5 ms, while the second-fastest, the STM32H7A3, takes 359 ms. The MAX78000 therefore outperforms the others by a factor of more than 65x.
Figure 2 b) shows the inference efficiency of the different architectures. The accelerated CNN hardware of the MAX78000 showcases its ability to parallelise the CNN workload with 107 MAC/Cycle, while all the others need at least 2-4 cycles to execute one MAC. In Figure 2 c), the Apollo4b outperforms all the others by consuming only 59 μW/MHz, making it the most power-efficient μC in this comparison. Figure 2 d) shows the energy efficiency of the devices. Although the Apollo4b has best-in-class power efficiency, the MAX78000 hardware accelerator's fast and inference-efficient execution makes it the most energy-efficient overall, with only 196 μJ per inference, outperforming the second best, the Apollo4b at 6.1 mJ per inference, by a factor of 32x. Finally, the TinyissimoYOLO network consumes 422 kB of memory when trained on 1 object class (e.g., WiderFace) and 398 kB when trained on 3 object classes (e.g., PascalVOC: person, chair, and car).
The 88x88x3 pixel input image consumes an additional 25 kB of memory, while 350 kB of RAM is sufficient for executing an inference when deployed on the ARM Cortex-M cores.
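The reported metrics follow directly from measured power, latency, and clock frequency. A sketch using the MAX78000 numbers quoted above (the MAC count passed to `inference_efficiency` would be the network's actual operation count, which this sketch does not assume):

```python
# Sketch: deriving the comparison metrics from measured quantities.
def energy_per_inference_uj(avg_power_mw, latency_ms):
    return avg_power_mw * latency_ms            # mW * ms = microjoules

def inference_efficiency(total_macs, latency_s, clock_hz):
    return total_macs / (latency_s * clock_hz)  # MACs per clock cycle

def power_efficiency_uw_per_mhz(avg_power_mw, clock_hz):
    return avg_power_mw * 1e3 / (clock_hz / 1e6)

# 196 uJ over a 5.5 ms inference implies ~35.6 mW average power:
avg_power_mw = 196 / 5.5
# A 5.5 ms inference at 50 MHz spans 275_000 clock cycles:
cycles = 5.5e-3 * 50e6
```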
Figure 2: TinyissimoYOLO performance comparison when deployed quantized to 8 bits on the different architectures. The CNN-accelerated MAX78000 μC outperforms the other architectures in terms of latency, inference efficiency, and energy per inference.
IV Conclusion
This work presented TinyissimoYOLO, a multi-object detection network for edge applications with a small number of objects to be detected simultaneously. The network can be deployed on any μC with less than 0.5 MB of flash required for its network parameters.
Evaluations showed that training the network on input images with objects adequate for its input size (restricting the dataset to at most 10 or 5 objects per image) increases network performance not only on a restricted evaluation but also when evaluating on the original validation dataset, achieving up to 45% mAP for WiderFace. Evaluated on a restricted PascalVOC dataset, the network achieves 59% mAP, with 57% / 30% / 65% on the classes person / chair / car, respectively. Comparing low-power μC devices running TinyissimoYOLO reveals that the MAX78000's CNN-accelerated hardware is the most energy-efficient among the evaluated devices and achieves fast, real-time inference on μCs at speeds of up to 180 fps.
References
- [1] D. L. Dutta and S. Bharali, “TinyML meets IoT: A comprehensive survey,” Internet of Things, vol. 16, p. 100461, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2542660521001025
- [2] S. Boner, C. Vogt, and M. Magno, “Tiny TCN model for gesture recognition with multi-point low power ToF-sensors,” in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022, pp. 356–359.
- [3] M. Merenda, C. Porcaro, and D. Iero, “Edge machine learning for AI-enabled IoT devices: A review,” Sensors, vol. 20, no. 9, p. 2533, 2020. [Online]. Available: https://www.mdpi.com/1424-8220/20/9/2533
- [4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
- [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc., 2015.
- [6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot MultiBox detector,” in Computer Vision – ECCV 2016, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Springer International Publishing, 2016, pp. 21–37.
- [7] X. Wang, M. Magno, L. Cavigelli, and L. Benini, “FANN-on-MCU: An open-source toolkit for energy-efficient neural network inference at the edge of the internet of things,” IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4403–4417, 2020.
- [8] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017. [Online]. Available: http://arxiv.org/abs/1710.01878
- [9] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016. [Online]. Available: http://arxiv.org/abs/1611.06440
- [10] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv preprint arXiv:2103.13630, 2021. [Online]. Available: http://arxiv.org/abs/2103.13630
- [11] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149
- [12] J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, “MCUNet: Tiny deep learning on IoT devices,” Advances in Neural Information Processing Systems, 2020. [Online]. Available: http://arxiv.org/abs/2007.10319
- [13] J. Lin, W.-M. Chen, H. Cai, C. Gan, and S. Han, “MCUNetV2: Memory-efficient patch-based inference for tiny deep learning,” arXiv preprint arXiv:2110.15352, 2021. [Online]. Available: http://arxiv.org/abs/2110.15352
- [14] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” arXiv preprint arXiv:1908.09791, 2019. [Online]. Available: http://arxiv.org/abs/1908.09791
- [15] H. Cai, T. Wang, Z. Wu, K. Wang, J. Lin, and S. Han, “On-device image classification with proxyless neural architecture search and quantization-aware fine-tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
- [16] M. Scherer, M. Magno, J. Erb, P. Mayer, M. Eggimann, and L. Benini, “TinyRadarNN: Combining spatial and temporal convolutional neural networks for embedded gesture recognition with short range radars,” IEEE Internet of Things Journal, vol. 8, no. 13, pp. 10336–10346, 2021.
- [17] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, and R. Fan, “Fully quantized network for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2810–2819.
- [18] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “WIDER FACE: A face detection benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
- [19] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015. [Online]. Available: https://doi.org/10.1007/s11263-014-0733-5
- [20] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Survey of machine learning accelerators,” in 2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020, pp. 1–12.
- [21] Overview - Spresense - Sony Developer World. [Online]. Available: https://developer.sony.com/develop/spresense/
- [22] A. Yousefi, L. Franca-Neto, W. McDonald, A. Gupta, M. Moturi, and D. Garrett, “The intelligence of things enabled by Syntiant's tinyML board,” tinyML Summit, 2019.
- [23] L. Bottou, “On-line learning and stochastic approximations,” in On-Line Learning in Neural Networks, 1st ed., D. Saad, Ed. Cambridge University Press, 1998, pp. 9–42.
- [24] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017. [Online]. Available: http://arxiv.org/abs/1706.02677
- [25] R. David, J. Duke, A. Jain, V. Janapa Reddi, N. Jeffries, J. Li, N. Kreeger, I. Nappier, M. Natraj, T. Wang, P. Warden, and R. Rhodes, “TensorFlow Lite Micro: Embedded machine learning for TinyML systems,” Proceedings of Machine Learning and Systems, vol. 3, pp. 800–811, 2021.
- [26] ADI MAX78000/MAX78002 model synthesis. [Online]. Available: https://github.com/MaximIntegratedAI/ai8x-synthesis
- [27] M. Giordano, L. Piccinelli, and M. Magno, “Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge,” in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 94–97.