Efficient and Online Deep Learning through Model Plasticity and Stability

Description

The rapid advancement of Deep Neural Networks (DNNs), computing, and sensing technology has enabled many new applications, such as self-driving vehicles, surveillance drones, and robotic systems. Compared to conventional edge devices (e.g., cell phones or smart home devices), these emerging devices are required to deal with much more complicated and dynamic situations in real time with bounded computation resources. However, there are several challenges, including but not limited to efficiency, real-time adaptation, model stability, and automation of architecture design.

To tackle the challenges mentioned above, model plasticity and stability are leveraged to achieve efficient and online deep learning, especially in the scenario of learning streaming data at the edge:

First, a dynamic training scheme named Continuous Growth and Pruning (CGaP) is proposed to compress the DNNs through growing important parameters and pruning unimportant ones, achieving up to 98.1% reduction in the number of parameters.

Second, this dissertation presents Progressive Segmented Training (PST), which targets catastrophic forgetting problems in continual learning through importance sampling, model segmentation, and memory-assisted balancing. PST achieves state-of-the-art accuracy with 1.5X FLOPs reduction in the complete inference path.

Third, to facilitate online learning in real applications, acquisitive learning (AL) is further proposed to emphasize both knowledge inheritance and acquisition: the majority of the knowledge is first pre-trained in the inherited model, which is then adapted to acquire new knowledge. The inherited model's stability is monitored by noise injection and the landscape of the loss function, while the acquisition is realized by importance sampling and model segmentation. Compared to a conventional scheme, AL reduces the accuracy drop by >10X on the CIFAR-100 dataset, with a 5X reduction in latency per training image and a 150X reduction in training FLOPs.

Finally, this dissertation presents evolutionary neural architecture search in light of model stability (ENAS-S). ENAS-S uses a novel fitness score, which addresses not only the accuracy but also the model stability, to search for an optimal inherited model for the application of continual learning. ENAS-S outperforms hand-designed DNNs when learning from a data stream at the edge.

In summary, in this dissertation, several algorithms exploiting model plasticity and model stability are presented to improve the efficiency and accuracy of deep neural networks, especially for the scenario of continual learning.
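
As a rough illustration of the pruning half of a grow-and-prune scheme, the sketch below keeps only the largest-magnitude weights of a layer. It assumes weight magnitude as a stand-in for "importance"; the abstract does not state the importance metric CGaP actually uses, and the function name `prune_by_magnitude` and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude fraction of the weights.

    Magnitude is used here only as an assumed proxy for importance.
    """
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    # The threshold is the k-th largest magnitude; ties may keep a few extra weights.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([[0.9, -0.05, 0.3], [-0.7, 0.01, 0.02]])
pruned, mask = prune_by_magnitude(w, keep_fraction=0.5)
print(int(mask.sum()), "of", mask.size, "weights kept")
```

In a training loop, a mask like this would typically be reapplied after each update so that pruned weights stay at zero; the growth step is not shown.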

Contributors: Du, Xiaocong (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)

Created: 2020

Hardware-friendly Deep Learning for Edge Computing

Description

The Internet-of-Things (IoT) is generating a vast and growing amount of streaming data. However, even considering the growth of the cloud computing infrastructure, IoT devices will generate two orders of magnitude more data than centralized data center servers can process or store. This trend inevitably calls for offloading IoT data processing to a decentralized edge computing infrastructure. On the other hand, deep-learning-based applications have made great progress by taking advantage of heavy centralized computing resources to train large models that fit increasingly complicated tasks. Even though large-scale deep learning models perform well in terms of accuracy, their high computational complexity makes it impossible to offload them onto edge devices for real-time inference and timely response. To enable timely IoT services on edge devices, this dissertation addresses the challenge from two perspectives. On the hardware side, a new field-programmable gate array (FPGA)-based framework for binary neural networks and an application-specific integrated circuit (ASIC) accelerator for natural scene text interpretation are proposed, with awareness of the computing resource and power constraints at the edge. On the algorithm side, this work presents methodologies both for building more compact models and for finding better computation-accuracy trade-offs for existing models.

Contributors: Li, Yixing (Author) / Ren, Fengbo (Thesis advisor) / Vrudhula, Sarma (Committee member) / Seo, Jae-Sun (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)

Created: 2021

Processing-in-Memory for Data-Intensive Applications, From Device to Algorithm

Description

Over the past decades, the amount of data that computing systems must process and analyze has increased dramatically, reaching exascale (10^18 bytes/s or ops). However, modern computing platforms' inability to deliver both energy-efficient and high-performance computing solutions leads to a gap between what is delivered and what is needed, especially in resource-constrained Internet of Things (IoT) devices. Unfortunately, this gap will keep widening, mainly due to limitations in both devices and architectures. With this motivation, this dissertation focuses on cross-layer (device/circuit/architecture/application) co-design of energy-efficient and high-performance Processing-in-Memory (PIM) platforms for implementing complex big data applications, i.e., deep learning, bioinformatics, graph processing tasks, and data encryption. The dissertation shows how to leverage innovations from device, circuit, and architecture to integrate memory and logic to break the existing memory and power walls and dramatically increase computing efficiency of today’s non-Von-Neumann computing systems.

The proposed PIM platforms transform current volatile and non-volatile random access memory arrays into computational units capable of working as both memory and low-area-overhead, massively parallel, fast, reconfigurable in-memory logic. Instead of integrating complex logic units in cost-sensitive memory, the explored designs exploit hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in fewer clock cycles, overcoming the multi-cycle logic issue in modern PIM platforms. In addition, new customized in-memory algorithms and mapping methods are developed to convert the crucial, iteratively used functions of big data applications into bit-wise PIM-supported logic.

To quantitatively analyze the performance of various PIM platforms running big data applications, a generic and comprehensive evaluation framework is presented. The overall system computing performance (throughput, latency, energy efficiency) for each application is explored through the developed framework. The device-to-algorithm co-simulation results on neural network acceleration demonstrate that the proposed platforms can obtain 36.8× higher energy efficiency and a 22× speed-up compared to a state-of-the-art Graphics Processing Unit (GPU). In accelerating bioinformatics tasks such as biological sequence alignment, the presented PIM designs achieve ~2×, 43.8×, and 458× more throughput per watt compared to state-of-the-art Application-Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA), and GPU platforms, respectively.

Contributors: Angizi, Shaahin (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Awad, Amro (Committee member) / Zhang, Wei (Committee member) / Arizona State University (Publisher)

Created: 2021

Quantization and Evaluation of AI Algorithms for Hardware Acceleration

Description

Artificial intelligence is one of the leading technologies that mimics the problem-solving and decision-making capabilities of the human brain. Machine learning algorithms, especially deep learning algorithms, are leading the way in terms of performance and robustness. They are used for various purposes, mainly computer vision, speech recognition, and object detection. The algorithms are usually evaluated on accuracy, and they utilize full floating-point precision (32 bits). The hardware would require a high amount of power and area to accommodate many parameters at full precision. In this exploratory work, a convolutional autoencoder is quantized for use with an event-based camera. The model is designed so that the autoencoder can work on-chip, which would sufficiently decrease the processing latency. Different quantization methods are used to quantize and binarize the weights and activations of this neural network model so that it is portable and power efficient. A sparsity term is added to make the model as robust and energy-efficient as possible. By selectively quantizing the layers of the encoder, the network model was able to recoup the accuracy lost to binarizing the weights and activations. This method of recouping the accuracy gives enough flexibility to deploy the network on chip and get real-time processing from systems like event-based cameras.

Lately, computer vision, especially object detection, has made strides in detection accuracy. The algorithms can sufficiently detect and predict objects in real time. However, end-to-end detection is challenging due to the large number of parameters and the processing requirements. A change to the Non-Maximum Suppression algorithm in SSD (Single Shot Detector)-MobileNet-V1 resulted in lower computational complexity without a change in the quality of the output metric. The calculated Mean Average Precision (mAP) suggests that this method can be implemented in the post-processing of other networks.
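
To make the idea of weight binarization concrete, here is a minimal sketch of one common scheme: each weight is replaced by its sign, and a single per-tensor scale equal to the mean absolute weight preserves the overall magnitude. This is an assumption for illustration only; the abstract does not say which binarization or quantization methods the thesis uses, and the function name `binarize` is hypothetical.

```python
import numpy as np

def binarize(weights):
    """Return a per-tensor scale and the +1/-1 sign pattern of the weights."""
    alpha = float(np.mean(np.abs(weights)))    # scale that preserves average magnitude
    signs = np.where(weights >= 0, 1.0, -1.0)  # zero is mapped to +1 by convention
    return alpha, signs

w = np.array([0.5, -1.5, 1.0, -1.0])
alpha, signs = binarize(w)
approx = alpha * signs  # 1-bit approximation of w, stored as signs plus one float
print(alpha, signs, approx)
```

Storing only `signs` (one bit each) and `alpha` (one float) in place of 32-bit weights is what reduces the memory and power cost described above.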

Contributors: Kuzhively, Ajay Balu (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Fan, Delian (Committee member) / Arizona State University (Publisher)

Created: 2021

End-to-End Performance Benchmarking Tool for High-Speed Memory Access in Deep Learning

Description

Due to high DRAM access latency and energy, several convolutional neural network (CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, their memory accesses are increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators, and thereby help reduce the high DRAM access latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, we utilize these simulators to benchmark external memory access in CNN accelerators. Experiments are performed by generating trace files based on the number of parameters and the data precision, and also by using a trace file generated from data for a CNN accelerator on the Altera Arria 10 GX 1150 FPGA, to complete the end-to-end workflow with the mentioned simulators. In addition, the default VAMPIRE code was modified to implement functionalities such as PREA (Precharge All) and REF (Refresh). Then, energies were precalculated for DDR3, DDR4, and HBM based on the Micron model and entered in the DRAM specification file supplied to the VAMPIRE tool. An experimental comparison of DDR3, DDR4, and HBM showed that DDR4 is nearly 31% more energy-efficient than DDR3 and HBM is 54% more energy-efficient than DDR3. Modeling and experimental analysis were also performed on a large set of data, which was then split into smaller sets; the results of the small sets multiplied by the number of sets were nearly the same as the results for the large data set. Finally, a GUI is developed by wrapping both simulators. The GUI provides user-friendly access and lets users analyze the parameters without much prior knowledge or understanding of the underlying tools.

Contributors: Pannala, Manvitha (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2021

FPGA-Based Edge-Computing Acceleration

Description

The rapid growth of Internet-of-Things (IoT) and artificial intelligence applications has called forth a new computing paradigm: edge computing. Edge computing applications, such as video surveillance, autonomous driving, and augmented reality, are highly computationally intensive and require real-time processing. Current edge systems are typically based on commodity general-purpose hardware such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), which are mainly designed for large, non-time-sensitive jobs in the cloud and do not match the needs of edge workloads. Also, these systems are usually power hungry and are not suitable for resource-constrained edge deployments. Such application-hardware mismatch calls for a new computing backbone to support the high-bandwidth, low-latency, and energy-efficient requirements. Also, the new system should be able to support a variety of edge applications with different characteristics. This thesis addresses the above challenges by studying the use of Field Programmable Gate Array (FPGA)-based computing systems for accelerating edge workloads, from three critical angles. First, it investigates the feasibility of FPGAs for edge computing, in comparison to conventional CPUs and GPUs. Second, it studies the acceleration of common algorithmic characteristics, identified as loop patterns, using FPGAs, and develops a benchmark tool for analyzing the performance of these patterns on different accelerators. Third, it designs a new edge computing platform using multiple clustered FPGAs to provide high-bandwidth and low-latency acceleration of convolutional neural networks (CNNs) widely used in edge applications. Finally, it studies the acceleration of emerging neural networks, randomly-wired neural networks, on the multi-FPGA platform.

The experimental results from this work show that the new generation of workloads requires rethinking the current edge-computing architecture. First, through the acceleration of common loops, it demonstrates that FPGAs can outperform GPUs by up to 14 times on specific loop types. Second, it shows the linear scalability of multi-FPGA platforms in accelerating neural networks. Third, it demonstrates the superiority of the new scheduler in optimally placing randomly-wired neural networks on multi-FPGA platforms, with 81.1 times better throughput than the available scheduling mechanisms.

Contributors: Biookaghazadeh, Saman (Author) / Zhao, Ming (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Li, Baoxin (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2021

Continual Learning with Novelty Detection: Algorithms and Applications to Image and Microelectronic Design

Description

Machine learning techniques have found extensive application in dynamic fields like drones, self-driving vehicles, surveillance, and more. Their effectiveness stems from meticulously crafted deep neural networks (DNNs), extensive data gathering efforts, and resource-intensive model training processes. However, due to the unpredictable nature of the environment, these systems will inevitably encounter input samples that deviate from the distribution of their original training data, resulting in instability and performance degradation.

To effectively detect the emergence of out-of-distribution (OOD) data, this dissertation first proposes a novel, self-supervised approach that evaluates the Mahalanobis distance between the in-distribution (ID) and OOD data in gradient space. A binary classifier is then introduced to guide the label selection for gradient calculation, which further boosts the detection performance. Next, to continuously adapt the new OOD data into the existing knowledge base, a unified framework for novelty detection and continual learning is proposed. The binary classifier, trained to distinguish OOD data from ID data, is connected sequentially with the pre-trained model to form an “N + 1” classifier, where “N” represents prior knowledge containing N classes and “1” refers to the newly arrived OOD class. This continual learning process continues as “N+1+1+1+...”, assimilating the knowledge of each new OOD instance into the system. Finally, this dissertation demonstrates the practical implementation of novelty detection and continual learning within the domain of thermal analysis. To rapidly address the impact of voids in thermal interface material (TIM), a continuous adaptation approach is proposed, which integrates trainable nodes into the graph at the locations where abnormal thermal behaviors are detected. With minimal training overhead, the model can quickly adapt to the changes caused by the defects and regenerate accurate thermal predictions.

In summary, this dissertation proposes several algorithms and practical applications in continual learning aimed at enhancing the stability and adaptability of the system. All proposed algorithms are validated through extensive experiments conducted on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet for continual learning, and on real thermal data for thermal analysis.
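
The Mahalanobis-distance test mentioned above can be sketched in a few lines. The example below fits a Gaussian to in-distribution vectors and scores new vectors by their distance from it. It is a generic illustration under stated assumptions: the vectors here are synthetic stand-ins, whereas the dissertation computes the distance in gradient space, and the helper names `fit_gaussian` and `mahalanobis` are hypothetical.

```python
import numpy as np

def fit_gaussian(features):
    """Estimate the mean and inverse covariance of in-distribution vectors."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # A small ridge keeps the covariance invertible (an assumed safeguard).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Distance of x from the fitted distribution, in units of its spread."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
id_features = rng.normal(0.0, 1.0, size=(500, 4))      # synthetic in-distribution data
mean, cov_inv = fit_gaussian(id_features)

near = mahalanobis(np.zeros(4), mean, cov_inv)          # typical of the ID data
far = mahalanobis(np.full(4, 6.0), mean, cov_inv)       # far outside the ID data
print(near, far)
```

A sample would then be flagged as OOD when its distance exceeds a threshold chosen on held-out in-distribution data; how that threshold is set is not described in the abstract.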

Contributors: Sun, Jingbo (Author) / Cao, Yu (Thesis advisor) / Chhabria, Vidya (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2024

ASU Electronic Theses and Dissertations
