TinyML Benchmark: Executing Fully Connected Neural Networks on Commodity Microcontrollers

Abstract—Recent advancements in the field of ultra-low-power machine learning (TinyML) promise to unlock an entirely new class of edge applications. However, continued progress is restrained by the lack of benchmarks for Machine Learning (ML) models on TinyML hardware, which are fundamental to this field reaching maturity. In this paper, we design 3 types of fully connected Neural Networks (NNs), train each NN using 10 datasets (producing 30 NNs), and present a benchmark by reporting the onboard model performance on 7 popular MCU boards (similar boards are used to design TinyML hardware). We open-sourced the complete benchmark results and made them freely available online¹ to enable TinyML researchers and developers to systematically compare, evaluate, and improve various aspects of the design phase of ML-powered IoT hardware.


I. INTRODUCTION
TinyML aims to bring ML inference to ultra-low-power IoT devices, typically operating under a milliwatt, thereby breaking the traditional power barrier that prevents widely distributed machine intelligence. By performing offline inference near the data source, TinyML enables greater responsiveness while avoiding the energy cost associated with wireless communication, which is far higher than that of computing. Since TinyML has a significant role to play in future technology, a widely accepted benchmark is required to unlock the full potential of the field.
Neural Networks on MCUs. Mounting interest in TinyML has led to some maturity in the field, resulting in software stacks such as Edge-ML [1], Open-NN [2], RCE-NN [3], Edge2Train [4], and the TensorFlow Micro (TFMicro) inference runtime [5]. TFMicro in particular attracts attention due to its ability to allow portable and straightforward execution of NNs on commodity MCUs. For the TinyML benchmark, we use TFMicro over code generation-based methods such as uTensor [6], as it provides portability across MCU vendors at the cost of a fairly minimal memory overhead. Also, TFMicro uses an interpreter to execute an NN graph, which means the same model graph can be deployed across different hardware platforms such as GPUs, TPUs, and also MCUs. Besides NNs (DNNs, CNNs, RNNs), models and applications such as [12], Edge2Guard [13], and Covidaway [14] can be deployed and executed on a range of IoT devices. Here, the trained ML model is first ported to produce its plain C version, then written/exported inside a .h file (sketched below). When users aim to port tree-based models such as decision trees and random forests, the SRAM-optimized method [15] can be used. In [16]-[18], SRAM-optimized porting of Decision Trees (DTs) and Random Forests (RFs) is performed, and the ported models are efficiently executed on MCU boards.
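As a sketch of the porting step just described, the following writes a converted TFLite model (its raw FlatBuffer bytes) into a .h file as a C byte array, mirroring what a tool like `xxd -i` produces; the function and variable names here are illustrative, not from the cited toolchains.

```python
# A minimal sketch of exporting a converted TFLite model as a C byte
# array inside a .h file. `write_c_header` and `g_model_data` are
# illustrative names, not from the paper's toolchain.
def write_c_header(tflite_bytes, header_path="model_data.h",
                   var_name="g_model_data"):
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(tflite_bytes), 12):
        chunk = tflite_bytes[i:i + 12]  # 12 bytes per source line
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(tflite_bytes)};")
    with open(header_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The resulting header can then be included in the MCU application before compiling and flashing it.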
Use Cases. We categorize the TinyML application landscape into four categories. First are the fairly ubiquitous audio-based always-on ML inference apps, such as context recognition and keyword/wake-word/control-word spotting [19], on consumer devices like wearables [20], action cameras, smart speakers [21,22], etc. Second are the industrial telemetry use cases, where models deployed on MCUs monitor IMUs, motor bearing vibrations, or other sensors to detect anomalies and predict equipment faults [17]. Third are the image-based use cases, such as object counting, text recognition, and visual wake words. Fourth are the physiological/behavioral use cases, such as activity recognition using IMU or EMG data. Object counting and image classification tasks with large label spaces are well suited to always-on low-power edge applications but require more computation power and memory than is available in today's TinyML hardware. Since ultra-low-power inference chipsets are continuously advancing, such resource-demanding use cases, along with other common ML use cases, can be considered futuristic TinyML tasks.

II. TINYML BENCHMARK
Table I presents the MCU boards (B1-B7), datasets (D1-D10), and NNs (3 types) used for the TinyML benchmark. The chosen MCUs are popular example hardware widely used to design IoT devices, and billions of devices with similar specifications exist globally. In TensorFlow, we defined 3 types of fully connected NNs (FC 1x10, FC 10+50, FC 10x10) and trained each using the D1-D10 datasets (a sketch of this pipeline follows below). The resultant 30 models were converted into the TFLite format and then into C byte arrays. These models were then compiled and flashed using the Arduino IDE. Finally, each model was executed on B1-B7, and the experimental results are reported in Fig. 1. For statistical validity, the results correspond to the average of 5 runs. In the remainder of this section, we analyze the results.
Inference Performance on MCUs. Fig. 1a presents the average time taken by the MCUs to infer using the D1-D10 datasets. For all 3 NN types, Teensy 4.0 (B1) is the fastest, performing inference in 3.14 µs, 11.13 µs, and 18.12 µs respectively. For the same data samples, Raspberry Pi Pico (B7) is the slowest (≈99-175x slower than B1), taking 313.77 µs, 1953.96 µs, and 2801.82 µs. Although B7 has a faster clock than Arduino Nano 33 (B6), it is still slower because B6's Cortex-M4 is superior to B7's Cortex-M0+. Although B1-B3 share the same Cortex-M7 processor, B1 is still significantly faster as it has the highest clock speed of 600 MHz. Fig. 1b presents the complete inference time on the second-fastest STM32 Nucleo H7 (B2) for each of the 30 models. Considering FC 1x10, B2 took 5.16 µs to infer for the 4-feature D1, and 872.85 µs for the 74-feature D10. Considering FC 10x10, it took 20.15 µs for D1 and 3369.54 µs for D10. Portenta (B3) and B2 are roughly on par since they share processors from the same ARM Cortex-M7 family, but B2 is faster across all the NN topologies.
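The sketch below illustrates the define-train-convert pipeline described above for one topology. The `build_fc_nn` helper, the reading of "FC 1x10" as one hidden layer of 10 neurons, and the class count used are illustrative assumptions; the exact benchmark topologies are those specified in Table I.

```python
# A minimal sketch of the define-train-convert pipeline. Assumption:
# "FC 1x10" is read here as one hidden layer of 10 neurons; the exact
# benchmark topologies are specified in the paper's Table I.
import tensorflow as tf

def build_fc_nn(num_features, num_classes, hidden_units):
    """Build a fully connected NN with the given hidden-layer widths."""
    inputs = tf.keras.Input(shape=(num_features,))
    x = inputs
    for units in hidden_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., an FC 1x10 model for the 4-feature D1 (the class count is
# illustrative); FC 10x10 would pass hidden_units=[10] * 10 under the
# same assumed reading of the topology names.
model = build_fc_nn(num_features=4, num_classes=3, hidden_units=[10])
# model.fit(x_train, y_train, epochs=100)  # train on the chosen dataset

# Convert the trained model to the TFLite format for MCU deployment.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
```

The resulting `tflite_model` bytes are what the earlier header-export step turns into a C byte array for compilation with the Arduino IDE.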
Onboard Accuracy. We fed the test sets, via the COM port, to each of the 30 models executing on B1-B7 to perform inference. We report that the same models show only 0.4-1.6% variation in onboard accuracy from board to board. Also, during execution on the MCUs, the models show the same accuracy and F1-score levels as their original TFLite versions evaluated on Google Colab.
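A minimal sketch of such a serial-based onboard evaluation is shown below, assuming the flashed firmware reads one comma-separated feature vector per line and replies with the predicted class index; the port name, baud rate, and line protocol are assumptions, not the paper's actual test harness.

```python
# A minimal sketch of scoring onboard accuracy over a COM port. The
# firmware-side protocol (one CSV feature vector in, one class index
# out, per line) is an assumption for illustration.
import serial  # pip install pyserial

def onboard_accuracy(x_test, y_test, port="COM3", baud=115200):
    correct = 0
    with serial.Serial(port, baud, timeout=5) as mcu:
        for features, label in zip(x_test, y_test):
            mcu.write((",".join(f"{v:.6f}" for v in features) + "\n").encode())
            predicted = int(mcu.readline().decode().strip())
            correct += int(predicted == label)
    return correct / len(y_test)
```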
Memory Consumption on MCUs. The run-time variables generated during NN execution are stored in SRAM. The chosen boards have only 192 kB to at most 1 MB of SRAM, which restricts the deployment and execution of large models. SRAM in MCUs is always limited, since adding more memory leads to higher power leakage and manufacturing costs. Before flashing, when compiling the NNs and IoT applications, the memory requirements for the target boards are calculated by the compiler in use (e.g., Atmel Studio, Keil MDK). In Fig. 1c-e, we provide the time taken by the Arduino IDE to compile each of the 30 models for B2, along with the complete flash and SRAM requirements. Models trained on datasets with more features and classes consumed more compilation time and flash memory.
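Since the compiler reports the exact flash and SRAM figures only at build time, a quick pre-flash proxy for each model's flash contribution is the size of its serialized .tflite FlatBuffer; a minimal sketch, assuming a hypothetical `models` dict of trained Keras models:

```python
# A minimal sketch using the serialized FlatBuffer size as a proxy for
# a model's flash footprint (the compiler still reports exact flash and
# SRAM figures at build time). `models` is an assumed {name: model} dict.
import tensorflow as tf

def report_tflite_sizes(models):
    for name, model in models.items():
        tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()
        print(f"{name}: {len(tflite_bytes)} bytes "
              f"(~{len(tflite_bytes) / 1024:.1f} kB of flash for the model alone)")
```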
Price-performance Ratio. Portenta (B3), which costs ≈$100 (the price of an NVIDIA Jetson Nano GPU), is the most expensive board, yet it does not outperform the ≈$20 Teensy 4.0 (B1). Moreover, B1 may be the fastest yet reasonably priced board, as it can be overclocked up to 1 GHz. The ≈$30 STM32 Nucleo H7 (B2) is the second fastest and offers many I/O pins and development features such as an STLink debugger. ESP32 (B5) has the best price-performance ratio, as it costs only ≈$3 and is just ≈17-91 µs slower (see Fig. 1a) than B1. The Raspberry Pi Pico (B7) is as cheap as the ESP32 but ≈292-2692 µs slower.
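To make this comparison concrete, the sketch below computes a naive price-performance metric (price times average FC 1x10 latency, lower is better) from the figures reported above; the metric itself is an illustrative choice, not one used in the paper.

```python
# A naive price-performance metric (USD x microseconds, lower is better),
# combining the approximate board prices with the reported average FC 1x10
# inference times. The metric is an illustrative choice.
boards = {
    # board: (approx. price in USD, avg FC 1x10 inference time in us)
    "Teensy 4.0 (B1)":        (20.0, 3.14),
    "Raspberry Pi Pico (B7)": (3.0,  313.77),
}

for name, (price_usd, latency_us) in boards.items():
    print(f"{name}: {price_usd * latency_us:.1f} USD*us")
```

Under this metric the Teensy's speed advantage outweighs its higher price against the Pico, consistent with the discussion above.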

III. CONCLUSION
TinyML is a rapidly evolving field that requires comparability amongst low-power hardware innovations, particularly when executing neural workloads. In this paper, to enable continued progress and stability in this field, we presented and analyzed the onboard performance of 30 NN models on 7 popular MCU boards. We open-sourced the complete benchmark results that can be utilized to speed up the design phase (going from idea to product) of ML-powered IoT hardware.
ACKNOWLEDGEMENT
This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/16/RC/3918 (Confirm) and also by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2 (Insight), with both grants co-funded by the European Regional Development Fund.