Efficient AI For Edge Computing — Part 1: Background
Introduction
Artificial Intelligence (AI) is one of the most exciting technologies today, and it could not have gotten to where it is without Deep Learning: artificial neural networks with many (deep) layers that learn to perform tasks such as image recognition in computer vision (CV) and language translation in natural language processing (NLP). Because convolution is the basic computation block used throughout these networks, they are commonly referred to as Convolutional Neural Networks (CNNs). The recent successes of Deep Learning and CNNs have been so overwhelming that the technology has become synonymous with the word AI itself.
However, CNNs typically require high-performance GPUs to support their computation needs: networks usually have millions of parameters, and a GPU's massive multi-threading ability is what makes them practical in the real world. To scale AI to even more applications, models need to run directly on edge devices, which usually do not have high-performance GPUs onboard. Making AI efficient enough without a significant accuracy trade-off has therefore become a hot topic in recent years, and it is what Phiar excels at: our AI has to be highly efficient to run completely on a consumer-grade vehicle's processors, in real time, all while maintaining the high accuracy needed for the related warning and AR in-cockpit experiences. This is the first post in a multipart blog series covering some of the most important concepts and methods for making your AI more efficient, both at a high level and through selective deep dives.
Background
CNNs have been the talk of the town in recent years, but the technology is not new; its development was slow for decades. Training artificial neural networks with back-propagation was proposed in the 1980s, training networks with convolutional layers followed in the 1990s, and many more CNN-related papers appeared in the computer vision field during the 2000s. However, CNNs were notoriously difficult to train: finishing training a network often required weeks or even months due to its high model complexity, mostly from the use of convolutions (more on this in the next part of this series); CNN accuracies were no better than those of statistical machine learning methods, which were much easier to train and compute (minutes to hours); and finally, there was no practical way to deploy trained CNNs in real-world applications, because it took too long for a trained CNN just to produce a single prediction.
These issues hampered the CNN community until 2012, when the seminal AlexNet work solved all of them at once and took first place in the ImageNet image classification competition by a landslide margin over the rest of the field, bringing CNNs into the spotlight, where they have remained ever since. To summarize some of AlexNet's breakthrough contributions in solving the issues that had plagued prior CNN development:
- Training time: using GPUs to optimize a single CNN's millions of parameters in parallel allows a network, even one with many layers, to be trained in days instead of weeks or months.
- Accuracy: introducing the ReLU activation allowed the network to overcome the vanishing gradient issue that comes with many (deep) layers, so the network could be optimized properly to produce highly accurate results (see the sketch after this list).
- Deployment: the same GPUs that speed up training also let a trained network compute its results far more quickly at deployment time (sub-second instead of minutes), which makes CNNs practical in real-world setups equipped with GPUs.
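As a rough illustration of why ReLU helps with vanishing gradients, the minimal NumPy sketch below (my own example, not from the original post) compares the per-layer gradient factors of sigmoid and ReLU: sigmoid's factor never exceeds 0.25 and shrinks toward zero when multiplied across many layers, while ReLU's factor stays at 1 for active units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never exceeds 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for active (positive) units

# Gradient factor contributed by each of 20 stacked layers,
# evaluated at some positive pre-activations.
x = np.abs(np.random.randn(20)) + 0.5
print("sigmoid chain:", np.prod(sigmoid_grad(x)))  # shrinks toward 0 as depth grows
print("relu chain:   ", np.prod(relu_grad(x)))     # stays 1 for active units
```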
While AlexNet solved some of CNNs' most difficult problems, two major restrictions still keep current CNN-based solutions from scaling further in real-world applications (and these restrictions have also created many lucrative opportunities):
- Massive amount of training data: a CNN has millions of parameters (most of the time, even more!), so training one without overfitting requires a very large number of images. For example, the benchmark-setting ImageNet dataset back in 2012 already contained 1.2 million images across 1,000 categories, and even that was not enough for a CNN like AlexNet: the authors had to perform data augmentation (randomly manipulating the given images with rotations, flips, crops, and so on; see the sketch after this list) to artificially grow the training data roughly 10x. Today, most large companies maintain their own datasets of hundreds of millions of training images. This has always been a major issue for deep learning-based approaches, because the development team must spend a great deal of time collecting data covering varied conditions and then getting that data hand-labeled for network training. It is a highly resource-consuming process for any company or research institution, especially for smaller organizations such as startups and academic labs.
- Opportunities in data labeling: while the mass training data requirement is an issue for most, if not all, deep learning/CNN-based approaches, it has also created lucrative business opportunities for companies providing data labeling and management services. The most notable example is Scale AI, which has raised over $600 million in just five years, with an estimated market size of $10 billion over the next five years.
- High-performance GPUs: while high-performance GPUs, with their powerful multi-threaded processing, remain the go-to choice for running a CNN in both training and deployment, they are typically expensive and require desktop- to server-class setups, which makes them better suited to enterprise and cloud-based solutions. However, more and more AI use cases are now mobile, requiring AI models to run on edge devices such as smartphones, vehicles, and cameras, and edge devices typically have neither the physical space nor the BOM (bill of materials) cost budget for such high-performance GPUs.
- Opportunities for GPU makers: because most CNNs require a high-performance GPU for its multi-threading capabilities, companies such as Nvidia have grown quickly by expanding their GPUs from their original gaming applications into AI, growing annual revenue roughly six-fold from $4 billion in 2012 to more than $24 billion in 2022. In fact, I can't think of any other brand of GPU being used to train deep learning networks today.
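To make the data augmentation mentioned in the first item above concrete, here is a short sketch using torchvision transforms (an assumed tooling choice, not something the original post specifies) that generates randomly flipped, cropped, and rotated variants of a single image on the fly.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Each epoch sees a differently flipped/cropped/rotated variant of every image,
# which effectively multiplies the training data without collecting new photos.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

# A random stand-in image; in practice this would be a real training photo.
img = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))

batch = torch.stack([augment(img) for _ in range(8)])  # 8 augmented variants of one image
print(batch.shape)  # torch.Size([8, 3, 224, 224])
```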
Cloud AI vs Edge AI
Because CNNs typically require high-performance GPUs, serious AI deployments have mostly stayed in the cloud, where racks of GPU servers can be set up to provide cloud AI services; such setups are simply not physically possible on edge devices. So why do we care about running AI directly on edge devices such as smartphones, cameras, and vehicles rather than relying on cloud AI to provide the service remotely from its own servers? There are several reasons why running AI at the edge can be more beneficial than running it in the cloud:
- Speed and reliability: if the computation is done in the cloud, data must first be transferred to the cloud and then back to the end device. This round trip introduces potential transmission delays or losses, which can make service performance inconsistent and unreliable. AI running at the edge avoids this issue entirely because no data needs to travel to and from the cloud.
- Data privacy and integrity: because data is transmitted wirelessly to and from the cloud, the system can be exposed to man-in-the-middle (MitM) attacks. This is especially detrimental for safety-critical systems such as ADAS and self-driving-related services. Once again, because AI running at the edge does not transmit data to and from the cloud, the chance of it being hacked and manipulated is greatly reduced.
- Scalability and cost: for a cloud AI service provider, expanding the business means physically expanding processing capacity with more GPUs and servers, which grows ever more costly in both processors and the space they require. If AI is computed directly on the edge devices, the AI service provider only needs to focus on the software, not the hardware.
- Consumer-friendly: cloud-based AI typically operates on a B2B basis, where the cloud AI provider serves a customer who then uses that AI service to build its own business or consumer-facing products. If AI services instead run directly at the edge, on devices such as smartphones and security cameras, then the AI provider can not only continue serving enterprise customers but also work directly with consumer-facing products, reaching billions of end devices.
AI Needs To Be Efficient To Run At The Edge
There are clear benefits to running AI at the edge, but most deep learning CNNs need a high-performance GPU for fast inference (producing results). These GPUs are bulky and expensive and are not practical for AI deployment at the edge, so what do we do? Well, do edge devices come with GPUs of their own? If so, can't we just run the AI on those onboard GPUs? Yes and no. While most edge devices (e.g., smartphones or the compute platforms inside your vehicle) do have GPUs onboard, they are typically orders of magnitude weaker than the GPUs normally used to train a CNN: a state-of-the-art CNN for image recognition that takes about 0.1 seconds per inference on an Nvidia Titan Xp GPU can take more than 1 second on a mobile-grade GPU. And for important applications we usually need a CNN to run in real time (typically 30 images/frames per second (FPS)), meaning each inference must finish in roughly 0.033 seconds (33 ms)! To make matters worse, onboard GPUs are usually already heavily utilized by background and video-related applications, so there is not much compute left to spare. Edge devices nowadays often also come with a newer onboard processor called the Neural Processing Unit (NPU). In general, the NPU is more optimized for signal processing and AI-related computations; however, an NPU is not that much more powerful than its onboard GPU neighbor. It acts more as separate silicon, so that AI computations do not have to contend for compute resources with all the other applications sharing the GPU.
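To make the real-time budget concrete, here is a minimal timing sketch (my own illustration; the ResNet-18 model is just a stand-in for whatever CNN you deploy) that measures average per-frame latency on the current device and compares it against the 33 ms budget implied by 30 FPS.

```python
import time
import torch
import torchvision.models as models

# Stand-in model: any CNN you plan to deploy; resnet18 is only an example.
model = models.resnet18().eval()
frame = torch.randn(1, 3, 224, 224)    # one camera frame, already preprocessed

with torch.no_grad():
    for _ in range(5):                 # warm-up runs
        model(frame)
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(frame)
    latency = (time.perf_counter() - start) / runs

budget = 1.0 / 30                      # ~0.033 s per frame for 30 FPS
print(f"avg latency: {latency*1000:.1f} ms, budget: {budget*1000:.1f} ms, "
      f"real-time capable: {latency <= budget}")
```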
Therefore, the best way forward is to make our AI run much more efficiently, so that it can produce inferences at a much faster rate on whatever compute resources remain available from the onboard GPUs, the NPUs, or even the CPUs (unless you want to wait years for those edge processors to finally become powerful enough, if they ever do). CNNs are slow at inference because their inherent design requires a great deal of computation, but there are a number of ways to significantly improve their efficiency so the end result is edge-capable at satisfactory speeds. We will discuss some of the popular methods, such as network compression, pruning, knowledge distillation, quantization, multi-task learning, and separable convolutions, in the next part of this A Guide To Developing Efficient AI For Edge Computing series.
Source: Medium