Search Content

End-to-End Performance Benchmarking Tool for High-Speed Memory Access in Deep Learning

Description

Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined…

Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators which reduces the high DRAMaccess latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, we utilize these simulators to benchmark external memory access in CNN accelerators. Experiments are performed generating trace files based on the number of parameters and data precision and also using trace file generated for CNN Accelerator Altera Arria 10 GX 1150 FPGA data to complete the end to end workflow using the mentioned simulators. Besides that, certain modifications were made in the default VAMPIRE code to implement certain functionalities such as PREA(Precharge All) and REF(Refresh). Then, precalculated energies were computed for DDR3, DDR4, and HBM based on the micron model to mention it in the dram specification file inputted to the VAMPIRE tool. An experimental study was performed and a comparison is made between DDR3, DDR4, and HBM, it was proved that DDR4 is nearly 31% more energy-efficient than DDR3 and HBMis 54% energy-efficient than DDR3. Performed modeling and experimental analysis on a large set of data and then split it into a set of data and compared the results of the small sets multiplied with the number of sets and the large data set and concluded that the results were nearly the same. Finally, a GUI is developed by wrapping both the simulators. GUI provides user-friendly access and can analyze the parameters without much prior knowledge and understanding of the working.

ContributorsPannala, Manvitha (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2021

Energy Efficient ASIC/FPGA Neural Network Accelerators

Description

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator…

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator using DDR3 and HBM2 memory. An optimized RTL library is implemented to perform training-specific tasks and an RTL compiler is developed to generate FPGA-synthesizable RTL based on user-defined constraints. High Bandwidth Memory(HBM) provides efficient off-chip communication and improves the training performance. The impact of HBM2 on CNN training workloads is analyzed and compressively compared with DDR3. For training ResNet-20/VGG-like CNNs for the CIFAR-10 dataset, the proposed CNN training accelerator on Stratix-10 GX FPGA(DDR3) demonstrates 479 GOPS performance, and on Stratix-10 MX FPGA(HBM) shows 4.5/9.7 X energy-efficiency improvement compared to Tesla V100 GPU. Next, the FPGA online learning accelerator is presented. Adopting model segmentation techniques from Progressive Segmented Training(PST), the online learning accelerator achieved a 4.2X reduction in training latency. Furthermore, this dissertation presents an 8-bit floating-point (FP8) training processor which implements (1) Highly parallel tensor cores that maintain high PE utilization, (2) Hardware-efficient channel gating for dynamic output activation sparsity (3) Dynamic weight sparsity based on group Lasso (4) Gradient skipping based on FP prediction error. The 28nm prototype chip demonstrates significant improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised training and self-supervised training tasks. In addition to the training accelerators, this dissertation also presents a CNN inference accelerator on ASIC(FixyNN) and FPGA(FixyFPGA). FixyNN consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator. In the fixed-weight feature extractor, the network weights are hard-coded into hardware and used as a fixed operand for the multiplication. Experimental results demonstrate FixyNN can achieve very high energy efficiencies up to 26.6 TOPS/W, and FixyFPGA achieves $2.34\times$ higher GOPS on ImageNet classification. In summary, this dissertation comprehensively discusses novel architectures of high-performance and energy-efficient ASIC/FPGA CNN inference/training accelerators.

ContributorsKolala Venkataramaniah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)

Created2022

Accelerating Genome Quantification in FPGA

Description

The growth in speed and density of programmable logic devices, such as Field programmable gate arrays (FPGA), enables sophisticated designs to be created within a short time frame. The flexibility of a programmable device alleviates the difficulty of the integration of a design with a wide range of components on…

The growth in speed and density of programmable logic devices, such as Field programmable gate arrays (FPGA), enables sophisticated designs to be created within a short time frame. The flexibility of a programmable device alleviates the difficulty of the integration of a design with a wide range of components on a single chip. FPGAs bring both performance and power efficiency, especially for compute or data-intensive applications. Efficient and accurate mRNA quantification is an essential step for molecular signature identification, disease outcome prediction, and drug development, which is a typical compute- and data-intensive compute workload. In this work, I propose to accelerate mRNA quantification with FPGA implementation. I analyze the performance of mRNA Quantification with FPGA, which shows better or similar performance compared to that of CPU implementation.

ContributorsKim, Kiju (Author) / Fan, Deliang (Thesis advisor) / Cao, Kevin (Committee member) / Zhang, Wei (Committee member) / Arizona State University (Publisher)

Created2022

QU-Net: A Lightweight U-Net based Region Proposal System

Description

In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of…

In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of deep networks from the cloud to edge devices. Unfortunately, many deep learning models cannot be efficiently implemented on edge devices as these devices are severely resource-constrained. In this thesis, I present QU-Net, a lightweight binary segmentation model based on the U-Net architecture. Traditionally, neural networks consider the entire image to be significant. However, in real-world scenarios, many regions in an image do not contain any objects of significance. These regions can be removed from the original input allowing a network to focus on the relevant regions and thus reduce computational costs. QU-Net proposes the salient regions (binary mask) that the deeper models can use as the input. Experiments show that QU-Net helped achieve a computational reduction of 25% on the Microsoft Common Objects in Context (MS COCO) dataset and 57% on the Cityscapes dataset. Moreover, QU-Net is a generalizable model that outperforms other similar works, such as Dynamic Convolutions.

ContributorsSanthosh Kumar Varma, Rahul (Author) / Yang, Yezhou (Thesis advisor) / Fan, Deliang (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)

Created2021

Exploration of Security and Privacy Challenges through Adversarial Weight Perturbation in Deep Learning Models

Description

Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In…

Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In this thesis, the first weight perturbation attack introduced is called Bit-Flip Attack (BFA), which can maliciously flip a small number of bits within a computer’s main memory system storing the DNN weight parameter to achieve malicious objectives. Our developed algorithm can achieve three specific attack objectives: I) Un-targeted accuracy degradation attack, ii) Targeted attack, & iii) Trojan attack. Moreover, BFA utilizes the rowhammer technique to demonstrate the bit-flip attack in an actual computer prototype. While the bit-flip attack is conducted in a white-box setting, the subsequent contribution of this thesis is to develop another novel weight perturbation attack in a black-box setting. Consequently, this thesis discusses a new study of DNN model vulnerabilities in a multi-tenant Field Programmable Gate Array (FPGA) cloud under a strict black-box framework. This newly developed attack framework injects faults in the malicious tenant by duplicating specific DNN weight packages during data transmission between off-chip memory and on-chip buffer of a victim FPGA. The proposed attack is also experimentally validated in a multi-tenant cloud FPGA prototype. In the final part, the focus shifts toward deep learning model privacy, popularly known as model extraction, that can steal partial DNN weight parameters remotely with the aid of a memory side-channel attack. In addition, a novel training algorithm is designed to utilize the partially leaked DNN weight bit information, making the model extraction attack more effective. The algorithm effectively leverages the partial leaked bit information and generates a substitute prototype of the victim model with almost identical performance to the victim.

ContributorsRakin, Adnan Siraj (Author) / Fan, Deliang (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2022

In-Memory Computing Based Hardware Accelerators Advancing the Implementation of AI/ML Tasks in Edge Devices

Description

Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However they are very computationintensive and require transfer of large volume of data from memory to the…

Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However they are very computationintensive and require transfer of large volume of data from memory to the computation units. This memory access time constitute significant part of the computational latency and a performance bottleneck. To address this limitation and the ever-growing demand for implementation in hand-held and edge-devices, In-memory computing (IMC) based AI/ML hardware accelerators have emerged. First, the dissertation presents an IMC static random access memory (SRAM) based hardware modeling and optimization framework. A unified systematic study closely models the IMC hardware, and investigates how a number of design variables and non-idealities (e.g. device mismatch and ADC quantization) affect the Deep Neural Network (DNN) accuracy of the IMC design. The framework allows co-optimized selection of different design variables accounting for sources of noise in IMC hardware and robust implementation of a high accuracy DNN. Next, it presents a kNN hardware accelerator in 65nm Complementary Metal-Oxide-Semiconductor (CMOS) technology. The accelerator combines an IMC SRAM that is developed for binarized deep neural networks and other digital hardware that performs top-k sorting. The simulated k Nearest Neighbor accelerator design processes up to 17.9 million query vectors per second while consuming 11.8 mW, demonstrating >4.8× energy-efficiency improvement over prior works. This dissertation also presents a novel floating-point precision IMC (FP-IMC) macro with a hybrid architecture that configurably supports two Floating Point (FP) precisions. Implementing FP precision MAC has been a challenge owing to its complexity. The design is implemented on 28nm CMOS, and taped-out on chip demonstrating 12.1 TFLOPS/W and 66.1 TFLOPS/W for 8-bit Floating Point (FP8) and Block Floating point (BF8) respectively. Finally, another iteration of the FP design is presented that is modeled to support multiple precision modes from FP8 up to FP32. Two approaches to the architectural design were compared illustrating the throughput-area overhead trade-off. The simulated design shows a 2.1 × normalized energy-efficiency compared to the on-chip implementation of the FP-IMC.

ContributorsSaikia, Jyotishman (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Fan, Deliang (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2023

Collaborative Learning and Optimization for Edge Intelligence

Description

With the proliferation of mobile computing and Internet-of-Things (IoT), billions of mobile and IoT devices are connected to the Internet, generating zillions of Bytes of data at the network edge. Driving by this trend, there is an urgent need to push the artificial intelligence (AI) frontiers to the network edge…

With the proliferation of mobile computing and Internet-of-Things (IoT), billions of mobile and IoT devices are connected to the Internet, generating zillions of Bytes of data at the network edge. Driving by this trend, there is an urgent need to push the artificial intelligence (AI) frontiers to the network edge to unleash the potential of the edge big data fully. This dissertation aims to comprehensively study collaborative learning and optimization algorithms to build a foundation of edge intelligence. Under this common theme, this dissertation is broadly organized into three parts. The first part of this study focuses on model learning with limited data and limited computing capability at the network edge. A global model initialization is first obtained by running federated learning (FL) across many edge devices, based on which a semi-supervised algorithm is devised for an edge device to carry out quick adaptation, aiming to address the insufficiency of labeled data and to learn a personalized model efficiently. In the second part of this study, collaborative learning between the edge and the cloud is studied to achieve real-time edge intelligence. More specifically, a distributionally robust optimization (DRO) approach is proposed to enable the synergy between local data processing and cloud knowledge transfer. Two attractive uncertainty models are investigated corresponding to the cloud knowledge transfer: the distribution uncertainty set based on the cloud data distribution and the prior distribution of the edge model conditioned on the cloud model. Collaborative learning algorithms are developed along this line. The final part focuses on developing an offline model-based safe Inverse Reinforcement Learning (IRL) algorithm for connected Autonomous Vehicles (AVs). A reward penalty is introduced to penalize unsafe states, and a risk-measure-based approach is proposed to mitigate the model uncertainty introduced by offline training. The experimental results demonstrate the improvement of the proposed algorithm over the existing baselines in terms of cumulative rewards.

ContributorsZhang, Zhaofeng (Author) / Zhang, Junshan (Thesis advisor) / Zhang, Yanchao (Thesis advisor) / Dasarathy, Gautam (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)

Created2023

Increasing the Efficiency of Heterogeneous System Operation: from Scheduling to Implementation of Split Federated Learning

Description

This thesis addresses the problems of (a) scheduling multiple streaming jobs with soft deadline constraints to minimize the risk/energy consumption in heterogeneous Systems-on-chip (SoCs), and (b) training a neural network model with high accuracy and low training time using split federated learning (SFL) with heterogeneous clients. Designing a scheduler for…

This thesis addresses the problems of (a) scheduling multiple streaming jobs with soft deadline constraints to minimize the risk/energy consumption in heterogeneous Systems-on-chip (SoCs), and (b) training a neural network model with high accuracy and low training time using split federated learning (SFL) with heterogeneous clients. Designing a scheduler for heterogeneous SoC SoCs built with different types of processing elements (PEs) is quite challenging, especially when it has to balance the conflicting requirements of low energy consumption, low risk, and high throughput for randomly streaming jobs at run time. Two probabilistic deadline-aware schedulers are designed for heterogeneous SoCs for such jobs with soft deadline constraints with the goals of optimizing job-level risk and energy efficiency. The key idea of the probabilistic scheduler is to calculate the task-to-PE allocation probabilities when a job arrives in the system. This allocation probability, generated by manually designed or neural network (NN) based allocation function, is used to compute the intra-job and inter-job contentions to derive the task-level slack. The tasks are allocated to the PEs that can complete the task within the task-level slack with minimum risk or minimum energy consumption. SFL is an edge-friendly decentralized NN training scheme, where the model is split and only a small client-side model is trained in the clients. The communication overhead in SFL is significant since the intermediate activations and gradients of every sample are transmitted in every epoch. Two communication reduction methods have been proposed, namely, loss-aware selective updating to reduce the number of training epochs and bottleneck layer (BL) to reduce the feature size.Next, the SFL system is trained with heterogeneous clients having different data rates and operating on non-IID data. The communication time of clients in low-end group with slow data rates dominates the training time. To reduce the training time without sacrificing accuracy significantly, HeteroSFL is built with HetBL and bi- directional knowledge sharing (BDKS). HetBL compresses data with different factors in low- and high-end groups using narrow and wide bottleneck layers respectively. BDKS is proposed to mitigate the label distribution skew across different groups. BDKS can also be applied in Federated Learning to address the label distribution skew.

ContributorsChen, Xing (Author) / Chakrabarti, Chaitali (Thesis advisor, Committee member) / Ogras, Umit (Committee member) / Fan, Deliang (Committee member) / Zhang, Jeff (Committee member) / Arizona State University (Publisher)

Created2023

On Addressing Security Issues in Edge Neural Network Systems

Description

In recent years, the proliferation of deep neural networks (DNNs) has revolutionized the field of artificial intelligence, enabling advancements in various domains. With the emergence of efficient learning techniques such as quantization and distributed learning, DNN systems have become increasingly accessible for deployment on edge devices. This accessibility brings significant…

In recent years, the proliferation of deep neural networks (DNNs) has revolutionized the field of artificial intelligence, enabling advancements in various domains. With the emergence of efficient learning techniques such as quantization and distributed learning, DNN systems have become increasingly accessible for deployment on edge devices. This accessibility brings significant benefits, including real-time inference on the edge, which mitigates communication latency, and on-device learning, which addresses privacy concerns and enables continuous improvement. However, the resource limitations of edge devices pose challenges in equipping them with robust safety protocols, making them vulnerable to various attacks. Two notable attacks that affect edge DNN systems are Bit-Flip Attacks (BFA) and architecture stealing attacks. BFA compromises the integrity of DNN models, while architecture stealing attacks aim to extract valuable intellectual property by reverse engineering the model's architecture. Furthermore, in Split Federated Learning (SFL) scenarios, where training occurs on distributed edge devices, Model Inversion (MI) attacks can reconstruct clients' data, and Model Extraction (ME) attacks can extract sensitive model parameters. This thesis aims to address these four attack scenarios and develop effective defense mechanisms. To defend against BFA, both passive and active defensive strategies are discussed. Furthermore, for both model inference and training, architecture stealing attacks are mitigated through novel defense techniques, ensuring the integrity and confidentiality of edge DNN systems. In the context of SFL, the thesis showcases defense mechanisms against MI attacks for both supervised and self-supervised learning applications. Additionally, the research investigates ME attacks in SFL and proposes countermeasures to enhance resistance against potential ME attackers. By examining and addressing these attack scenarios, this research contributes to the security and privacy enhancement of edge DNN systems. The proposed defense mechanisms enable safer deployment of DNN models on resource-constrained edge devices, facilitating the advancement of real-time applications, preserving data privacy, and fostering the widespread adoption of edge computing technologies.

ContributorsLi, Jingtao (Author) / Chakrabarti, Chaitali (Thesis advisor) / Fan, Deliang (Committee member) / Cao, Yu (Committee member) / Trieu, Ni (Committee member) / Arizona State University (Publisher)

Created2023

Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning

Description

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers…

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budget. Thus, the urgent demand for enhancing the \textbf{Efficiency} of DNN has drawn significant research interests across various communities. Motivated by the aforementioned concerns, this doctoral research has been mainly focusing on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of AI model while simultaneously achieving improved speed on edge devices. Additionally, due to the fact that diverse resource constraints of different edge devices, this dissertation further explores dynamic inference, which allows for real-time tuning of inference model size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer learning training. Based on this analysis, a novel framework called "Reprogramming Network'' (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. The Rep-Net enables on-device transferlearning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed that the intermediate features are highly correlated between tasks with the self-supervised pre-trained model. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed to gradually freeze a subset of layers with the highest correlation ratios for each task leading to training efficiency.

ContributorsYang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created2023