How to keep smartphones cool when running machine learning models
Researchers from the University of Texas at Austin and Carnegie Mellon University have proposed a new way to run computationally expensive machine learning models on mobile devices such as smartphones, and on low-powered edge devices, without triggering thermal throttling – a common protection mechanism in professional and consumer devices, designed to lower the temperature of the host device by slowing its performance until acceptable operating temperatures are restored.
The new approach could help more complex ML models perform inference and various other types of tasks without threatening the stability of, say, the host smartphone.
The central idea is to use dynamic networks, where the weights of a model can be accessed by both a “low pressure” and a “full intensity” version of the local machine learning model.
In cases where running the locally installed machine learning model would cause a critical rise in the device's temperature, the system dynamically switches to a less demanding version of the model until the temperature has stabilized, then reverts to the full-fledged version.
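The switching logic described above can be sketched in a few lines. The sketch below is illustrative, not the authors' code: the class name, the sensor path (a common Linux sysfs location), and the exact thresholds are assumptions, though the 65°C downshift point echoes the value the researchers used in their NLP tests.

```python
# Illustrative thresholds: downshift before the OS throttles,
# upshift only after the device has cooled off again.
DOWNSHIFT_TEMP_C = 65.0
UPSHIFT_TEMP_C = 55.0

def read_cpu_temp_c():
    """Placeholder for a platform-specific sensor read (Linux sysfs here)."""
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read()) / 1000.0

class DynamicShifter:
    """Alternates between a full model and a lighter version that shares its weights."""

    def __init__(self, full_model, light_model):
        self.full_model = full_model
        self.light_model = light_model
        self.active = full_model

    def step(self, temp_c):
        # Hysteresis: two thresholds prevent rapid oscillation around one cutoff.
        if temp_c >= DOWNSHIFT_TEMP_C:
            self.active = self.light_model
        elif temp_c <= UPSHIFT_TEMP_C:
            self.active = self.full_model
        return self.active

    def infer(self, x):
        self.step(read_cpu_temp_c())
        return self.active(x)
```

Using two thresholds rather than one means the system does not flap between models when the temperature hovers near the cutoff, which is consistent with the smooth temperature ripple the paper reports.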
Researchers performed proof-of-concept tests for computer vision and natural language processing (NLP) models on an Honor V30 Pro 2019 smartphone and a 4GB Raspberry Pi 4B.
From the results for the smartphone, shown in the image below, we can see the temperature of the host device rising and falling with use. The red lines represent a model running without Dynamic Shifting.
Although the results may look similar, they are not: what makes the temperature ripple in the blue lines (i.e. using the new paper's method) is the shifting back and forth between simpler and more complex model versions. At no point during operation is thermal throttling triggered.
What causes the temperature to rise and fall in the case of the red lines is the automatic engagement of thermal throttling in the device, which slows down the operation of the model and increases its latency.
In terms of model usage, we can see in the image below that the latency of the unaided model is significantly higher when thermally throttled:
At the same time, the image above shows almost no latency variation for the model managed by Dynamic Shifting, which remains responsive throughout.
For the end user, high latency can mean increased wait time, which can lead to task abandonment and dissatisfaction with the application hosting it.
In the case of NLP (rather than computer vision) systems, high response times can be even more troubling, since tasks may rely on rapid response (such as machine translation, or utilities to help disabled users).
For truly time-sensitive applications, such as real-time VR/AR, high latency would effectively kill the primary utility of the model.
The researchers state:
“We argue that thermal throttling poses a serious threat to latency-critical mobile ML applications. For example, when rendering real-time visuals for video streaming or gaming, a sudden increase in processing latency per frame will have a substantial negative effect on the user experience. Additionally, modern mobile operating systems often provide special services and applications for the visually impaired, such as VoiceOver on iOS and TalkBack on Android.
“The user typically interacts with mobile phones relying entirely on speech, so the quality of these services is highly dependent on the responsiveness or latency of the application.”
The paper is titled Play It Cool: Dynamic Shifting Prevents Thermal Throttling, and is a collaboration between two UT Austin researchers, one from Carnegie Mellon, and one representing both institutions.
Mobile processor-based AI
Although dynamic shifting and multi-scale architectures are an established and active area of study, most initiatives have focused on high-end computing devices, and current effort is divided between intense optimization of local resources (i.e. neural network-based, usually for inference rather than training purposes) and dedicated mobile hardware enhancement.
The tests performed by the researchers were conducted on CPUs rather than GPUs. Despite growing interest in exploiting local GPU resources in mobile machine learning applications (and even in training directly on mobile devices, which could improve the quality of the final model), GPUs generally consume more power, a critical factor in AI's effort to be independent of cloud services and useful on resource-constrained devices.
Test weight sharing
The networks tested for the project were slimmable networks and DynaBERT, representing a computer vision task and an NLP-based task, respectively.
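The weight sharing that makes this shifting cheap can be illustrated with a toy layer in the spirit of slimmable networks, where a narrower configuration simply reuses a slice of the full weight matrix. The class name, API, and dimensions below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class SlimmableLinear:
    """A linear layer whose narrow configuration reuses a slice of the full
    weight matrix, so both widths share one set of parameters."""

    def __init__(self, in_features, out_features, rng=None):
        rng = rng or np.random.default_rng(0)
        self.weight = rng.standard_normal((out_features, in_features))
        self.width_mult = 1.0  # fraction of output channels currently in use

    def __call__(self, x):
        out = int(self.weight.shape[0] * self.width_mult)
        # The "light" model is just the first `out` rows of the shared weights.
        return self.weight[:out] @ x
```

Setting `width_mult` from 1.0 to 0.5 halves the layer's compute without storing a second set of weights, which is what allows the device to downshift without a costly model reload.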
Although there have been various initiatives to produce iterations of BERT that can run efficiently and economically on mobile devices, some of the attempts have been criticized as tortuous workarounds, and the researchers of the new paper note that the use of BERT in the mobile space is a challenge, and that “BERT models are in general too computationally intensive for mobile phones”.
DynaBERT is a Chinese initiative to optimize Google’s powerful NLP/NLU framework in the context of a resource-poor environment; but even this implementation of BERT, the researchers found, was very demanding.
Nevertheless, the authors conducted two experiments, both on the smartphone and on the Raspberry Pi device. In the computer vision experiment, a single randomly selected image was continuously and repeatedly processed through ResNet50 as a classification task, and the system was able to run stably, without invoking thermal throttling, for the entire hour of the experiment.
The paper states:
“Although it may sacrifice some precision, the proposed dynamic shifting has faster inference speed. More importantly, our Dynamic Shifting approach benefits from consistent inference.”
For the NLP tests, the authors configured the experiment to switch between the two smaller models of the DynaBERT suite, but found that at 1.4X latency, BERT throttles at around 70°C. They therefore set the downshift to occur when the operating temperature reaches 65°C.
The BERT experiment consisted of letting the setup perform streaming inference on a question/answer pair from GLUE's QNLI dataset.
Latency and accuracy tradeoffs were more severe with the BERT-based task than with the computer vision implementation, and accuracy came at the cost of a more pressing need to control device temperature in order to avoid throttling:
The authors observe:
“Dynamic shifting, in general, cannot prevent BERT models from thermal throttling due to the enormous computational intensity of the model. However, under certain limitations, dynamic shifting can still be useful when deploying BERT models on mobile phones.”
The authors found that the BERT models caused the Honor V30 phone's CPU temperature to rise to 80°C in less than 32 seconds, and invoked thermal throttling within six minutes of activity. Therefore, the authors only used half-width BERT models.
The experiments were repeated on the Raspberry Pi setup, and in this environment too the technique was able to prevent thermal throttling from triggering. However, the authors note that the Raspberry Pi does not operate under the same extreme thermal stresses as a compact smartphone, and appear to have added this series of experiments as a further demonstration of the method's effectiveness in modestly equipped processing environments.
First published June 23, 2022.