On-device learning, optimization, efficient deployment and execution of machine learning algorithms on resource-constrained IoT hardware
Author: Sudharsan, Bharath
Date: 2022-07-12
Abstract
Edge analytics refers to the application of data analytics and Machine Learning (ML) algorithms on Internet of Things (IoT) devices. The concept of edge analytics is gaining popularity due to its ability to perform AI-based analytics at the device level, enabling autonomous decision-making without depending on the cloud. However, the majority of IoT devices are embedded systems with a low-cost microcontroller unit (MCU) or a small CPU as their brain, which is often incapable of handling complex ML algorithms.
This thesis aims to improve the intelligence of such resource-constrained IoT devices by providing novel algorithms, frameworks, and strategies to: create self-learning ML-based IoT devices; efficiently deploy and execute a range of Neural Networks (NNs) as well as non-NN ML algorithms on IoT devices; and enable communication-efficient distributed ML using IoT devices.
The memory footprint (SRAM, Flash, and EEPROM) of MCU-based devices is often very limited, restricting onboard training of ML models on large training sets with high feature dimensions. To cope with these memory constraints, current edge analytics approaches train high-quality ML models on cloud GPUs (using large volumes of historical data), then deploy deeply optimized versions of the resultant models on edge devices for inference. Such approaches are inefficient in concept drift situations, where the data generated at the device level varies frequently and the trained models have no way to adapt when previously unseen data arrives. The First Contribution of this thesis aims to solve this challenge. We provide the Train++ algorithm and the ML-MCU framework, which train ML models locally at the device level (on MCUs and small CPUs) using the full n samples of high-dimensional data. Train++ and ML-MCU transform even the most resource-constrained MCU-based IoT edge devices into intelligent devices that can locally build their own knowledge base on the fly using live data, thus creating smart, self-learning, and autonomous problem-solving devices.
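To make the notion of device-level training concrete, the following is a minimal sketch of per-sample on-device learning in C: a logistic-regression classifier updated by stochastic gradient descent one sample at a time, so only a single feature vector must reside in SRAM at once. It illustrates the general idea only; it is not the actual Train++ or ML-MCU algorithm, and the feature dimension and function names are invented for illustration.

    /* Minimal sketch of per-sample on-device training: logistic regression
     * updated by stochastic gradient descent, one sample at a time, so
     * only one feature vector needs to live in SRAM. Illustrative only,
     * not the actual Train++/ML-MCU algorithm. */
    #include <math.h>

    #define N_FEATURES 16            /* hypothetical feature dimension */

    static float w[N_FEATURES];      /* model weights, kept in SRAM */
    static float b;                  /* bias term */

    /* Update the model with one labeled sample (y is 0 or 1). */
    void train_one_sample(const float x[N_FEATURES], int y, float lr)
    {
        float z = b;
        for (int i = 0; i < N_FEATURES; i++)
            z += w[i] * x[i];

        float p = 1.0f / (1.0f + expf(-z));   /* sigmoid */
        float err = p - (float)y;             /* log-loss gradient w.r.t. z */

        for (int i = 0; i < N_FEATURES; i++)
            w[i] -= lr * err * x[i];
        b -= lr * err;
    }

Because the update touches one sample at a time, SRAM usage stays constant regardless of how many samples the device eventually learns from.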
As a part of the first contribution, to perform online machine learning (OL) in non-ideal real-world settings, we designed Imbal-OL, an OL plugin that inspects the supplied data stream and balances the class sizes before passing the stream on for learning by Train++, ML-MCU, or other methods.
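In the same illustrative spirit, the sketch below shows one simple way an incoming stream can be class-balanced before learning: majority-class samples are randomly sub-sampled so the learner sees roughly equal counts of each class. This is a hypothetical stand-in for what a plugin like Imbal-OL does, not its actual mechanism.

    /* Sketch of stream-level class balancing (binary case): drop
     * majority-class samples with probability proportional to the
     * imbalance, so classes reach the learner at similar rates.
     * Hypothetical stand-in for Imbal-OL, not its actual mechanism. */
    #include <stdlib.h>

    #define N_FEATURES 16            /* same hypothetical dimension */
    void train_one_sample(const float x[N_FEATURES], int y, float lr);
                                     /* learner from the sketch above */

    static unsigned long seen[2];    /* samples seen so far per class */

    /* Returns 1 if the sample was forwarded to the learner. */
    int balance_and_forward(const float x[N_FEATURES], int y, float lr)
    {
        seen[y]++;
        unsigned long other = seen[1 - y];

        /* Forward with probability other/seen[y] when this class is the
         * majority; minority-class samples always pass through. */
        if (seen[y] > other && other > 0 &&
            (unsigned long)rand() % seen[y] >= other)
            return 0;

        train_one_sample(x, y, lr);
        return 1;
    }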
The hardware resources of IoT devices are orders of magnitude smaller than the resources required for the standalone execution of a large, high-quality NN. Currently, to alleviate the critical issues caused by the poor hardware specifications of IoT devices, NNs are optimized before deployment using methods such as pruning, quantization, sparsification, and model architecture tuning. Even after applying state-of-the-art optimization methods, there are numerous cases where the deeply compressed/optimized models still exceed a device's memory capacity by a margin of just a few bytes, and users cannot optimize further since the model is already compressed to its maximum.
The Second Contribution of this thesis aims to solve this challenge. We propose an approach for the efficient execution of already deeply compressed, large NNs on tiny IoT devices. After NNs are optimized using state-of-the-art deep model compression methods, executing the resultant models on MCUs or small CPUs with the model execution sequence produced by our approach conserves substantially more SRAM, as sketched below.
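As a rough illustration of why the execution sequence matters, the sketch below evaluates one topological order of a tiny branching operator graph: it walks the sequence, allocates each operator's output, frees input buffers once their last reader has run, and reports the peak number of bytes live at once, i.e., the SRAM that order requires. The graph, buffer sizes, and names are invented, and the search over candidate orders (the actual contribution) is not shown.

    /* Sketch: peak SRAM needed by one execution order of a 4-op graph
     * (op 0 feeds two branches, ops 1 and 2, which merge in op 3).
     * All sizes are illustrative. */
    #include <stdio.h>

    #define N_OPS 4

    static const int out_size[N_OPS] = { 4096, 2048, 2048, 1024 };
    static const int users[N_OPS]    = { 2, 1, 1, 0 };  /* later readers */

    /* inputs[i][j]: index of op i's j-th input op, or -1 if unused. */
    static const int inputs[N_OPS][2] = {
        { -1, -1 },  /* op 0: graph input              */
        {  0, -1 },  /* op 1: left branch, reads op 0  */
        {  0, -1 },  /* op 2: right branch, reads op 0 */
        {  1,  2 },  /* op 3: merge, reads ops 1 and 2 */
    };

    int peak_sram(const int order[N_OPS])
    {
        int remaining[N_OPS], live = 0, peak = 0;
        for (int i = 0; i < N_OPS; i++)
            remaining[i] = users[i];

        for (int k = 0; k < N_OPS; k++) {
            int op = order[k];
            live += out_size[op];           /* allocate op's output  */
            if (live > peak)
                peak = live;
            for (int j = 0; j < 2; j++) {   /* free exhausted inputs */
                int in = inputs[op][j];
                if (in >= 0 && --remaining[in] == 0)
                    live -= out_size[in];
            }
        }
        return peak;
    }

    int main(void)
    {
        const int order[N_OPS] = { 0, 1, 2, 3 };  /* one valid order */
        printf("peak SRAM: %d bytes\n", peak_sram(order));
        return 0;
    }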
As a part of the second contribution, we provide an SRAM-optimized approach for porting, stitching, and efficiently deploying non-NN ML classifiers. The proposed method enables large classifiers to be comfortably executed on MCU-based IoT devices and to perform ultra-fast classifications while consuming zero bytes of SRAM.
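One plausible way to reach zero SRAM consumption for the model itself, sketched below, is to emit the trained classifier's parameters as const arrays so the toolchain places them in Flash (e.g., .rodata on ARM Cortex-M targets) and inference reads them in place rather than copying them into RAM. The decision tree here is invented, and this pattern is an assumption about the flavor of the approach, not the thesis method itself.

    /* Sketch of a Flash-resident classifier: the tree is a const array,
     * so on typical MCU toolchains it stays in Flash and inference uses
     * no SRAM for the model. Tree contents are illustrative, as if
     * generated by an offline porting step. */
    #include <stdint.h>

    typedef struct {
        uint8_t feature;   /* input feature index to test         */
        float   threshold; /* split threshold                     */
        int8_t  left;      /* child index, or -(class+1) for leaf */
        int8_t  right;
    } Node;

    static const Node tree[] = {
        { 0, 0.50f,  1,  2 },
        { 1, 0.25f, -1, -2 },  /* leaves: class 0, class 1 */
        { 1, 0.75f, -2, -1 },  /* leaves: class 1, class 0 */
    };

    int classify(const float *x)
    {
        int idx = 0;
        while (idx >= 0) {
            const Node *n = &tree[idx];
            idx = (x[n->feature] <= n->threshold) ? n->left : n->right;
        }
        return -idx - 1;   /* decode leaf code back to class id */
    }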
Training a problem-solving ML model on large datasets is computationally expensive and requires a scalable distributed training platform to complete within a reasonable time frame. In this scenario, communicating model updates among workers has always been a bottleneck. The impact on the quality of the resultant models is greater when training is distributed across devices with low hardware specifications and over uncertain real-world IoT networks, where congestion, latency, and bandwidth issues are common.
The Third Contribution of this thesis aims to solve this challenge. We provide Globe2Train (G2T), a framework with two components, G2T-Cloud (G2T-C) and G2T-Device (G2T-D), that efficiently connects multiple IoT devices and trains them collectively to produce the target ML models at very high speed. The G2T framework components jointly eliminate staleness and improve training scalability and speed by tolerating real-world network uncertainties and by reducing the communication-to-computation ratio. As a part of the third contribution, we provide ElastiQuant, an elastic quantization strategy that further reduces the impact of these limitations in distributed IoT training scenarios.
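To make the communication side concrete, the following is a minimal sketch of gradient quantization, the general family of techniques to which an elastic quantization strategy belongs: each 32-bit gradient is stochastically rounded to an 8-bit level before transmission, cutting update traffic roughly fourfold while remaining unbiased in expectation. This is generic stochastic quantization, not ElastiQuant's actual elastic scheme.

    /* Sketch of stochastic 8-bit gradient quantization for
     * communication-efficient distributed training. Generic technique,
     * not ElastiQuant's actual scheme. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <math.h>

    #define LEVELS 255   /* 8-bit quantization grid */

    /* Quantize g[0..n-1] into q; returns the scale needed to decode. */
    float quantize_gradients(const float *g, uint8_t *q, int n)
    {
        float maxabs = 0.0f;
        for (int i = 0; i < n; i++)
            if (fabsf(g[i]) > maxabs)
                maxabs = fabsf(g[i]);
        if (maxabs == 0.0f)
            maxabs = 1.0f;

        for (int i = 0; i < n; i++) {
            /* map [-maxabs, maxabs] to [0, LEVELS]; stochastic rounding
             * keeps the quantizer unbiased in expectation */
            float u  = (g[i] / maxabs + 1.0f) * 0.5f * LEVELS;
            float lo = floorf(u);
            float r  = (float)rand() / (float)RAND_MAX;
            q[i] = (uint8_t)(lo + ((r < u - lo) ? 1.0f : 0.0f));
        }
        return maxabs;
    }

    /* Decode on the receiving worker. */
    void dequantize_gradients(const uint8_t *q, float *g, int n, float s)
    {
        for (int i = 0; i < n; i++)
            g[i] = ((float)q[i] / LEVELS * 2.0f - 1.0f) * s;
    }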