This collection includes both ASU Theses and Dissertations, submitted by graduate students, and the Barrett, Honors College theses submitted by undergraduate students. 

Displaying 1 - 10 of 57
Filtering by

Clear all filters

Description
Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined

Due to high DRAM access latency and energy, several convolutional neural network(CNN) accelerators face performance and energy efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, memory accesses of these emerging applications are increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators which reduces the high DRAMaccess latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, we utilize these simulators to benchmark external memory access in CNN accelerators. Experiments are performed generating trace files based on the number of parameters and data precision and also using trace file generated for CNN Accelerator Altera Arria 10 GX 1150 FPGA data to complete the end to end workflow using the mentioned simulators. Besides that, certain modifications were made in the default VAMPIRE code to implement certain functionalities such as PREA(Precharge All) and REF(Refresh). Then, precalculated energies were computed for DDR3, DDR4, and HBM based on the micron model to mention it in the dram specification file inputted to the VAMPIRE tool. An experimental study was performed and a comparison is made between DDR3, DDR4, and HBM, it was proved that DDR4 is nearly 31% more energy-efficient than DDR3 and HBMis 54% energy-efficient than DDR3. Performed modeling and experimental analysis on a large set of data and then split it into a set of data and compared the results of the small sets multiplied with the number of sets and the large data set and concluded that the results were nearly the same. Finally, a GUI is developed by wrapping both the simulators. GUI provides user-friendly access and can analyze the parameters without much prior knowledge and understanding of the working.
ContributorsPannala, Manvitha (Author) / Cao, Yu (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2021
171744-Thumbnail Image.png
Description
Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator

Convolutional neural networks(CNNs) achieve high accuracy on large datasets but requires significant computation and storage requirement for training/testing. While many applications demand low latency and energy-efficient processing of the images, deploying these complex algorithms on the hardware is a challenging task. This dissertation first presents a compiler-based CNN training accelerator using DDR3 and HBM2 memory. An optimized RTL library is implemented to perform training-specific tasks and an RTL compiler is developed to generate FPGA-synthesizable RTL based on user-defined constraints. High Bandwidth Memory(HBM) provides efficient off-chip communication and improves the training performance. The impact of HBM2 on CNN training workloads is analyzed and compressively compared with DDR3. For training ResNet-20/VGG-like CNNs for the CIFAR-10 dataset, the proposed CNN training accelerator on Stratix-10 GX FPGA(DDR3) demonstrates 479 GOPS performance, and on Stratix-10 MX FPGA(HBM) shows 4.5/9.7 X energy-efficiency improvement compared to Tesla V100 GPU. Next, the FPGA online learning accelerator is presented. Adopting model segmentation techniques from Progressive Segmented Training(PST), the online learning accelerator achieved a 4.2X reduction in training latency. Furthermore, this dissertation presents an 8-bit floating-point (FP8) training processor which implements (1) Highly parallel tensor cores that maintain high PE utilization, (2) Hardware-efficient channel gating for dynamic output activation sparsity (3) Dynamic weight sparsity based on group Lasso (4) Gradient skipping based on FP prediction error. The 28nm prototype chip demonstrates significant improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised training and self-supervised training tasks. In addition to the training accelerators, this dissertation also presents a CNN inference accelerator on ASIC(FixyNN) and FPGA(FixyFPGA). FixyNN consists of a fixed-weight feature extractor that generates ubiquitous CNN features and a conventional programmable CNN accelerator. In the fixed-weight feature extractor, the network weights are hard-coded into hardware and used as a fixed operand for the multiplication. Experimental results demonstrate FixyNN can achieve very high energy efficiencies up to 26.6 TOPS/W, and FixyFPGA achieves $2.34\times$ higher GOPS on ImageNet classification. In summary, this dissertation comprehensively discusses novel architectures of high-performance and energy-efficient ASIC/FPGA CNN inference/training accelerators.
ContributorsKolala Venkataramaniah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Chakrabarti, Chaitali (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)
Created2022
171895-Thumbnail Image.png
Description
Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In

Adversarial threats of deep learning are increasingly becoming a concern due to the ubiquitous deployment of deep neural networks(DNNs) in many security-sensitive domains. Among the existing threats, adversarial weight perturbation is an emerging class of threats that attempts to perturb the weight parameters of DNNs to breach security and privacy.In this thesis, the first weight perturbation attack introduced is called Bit-Flip Attack (BFA), which can maliciously flip a small number of bits within a computer’s main memory system storing the DNN weight parameter to achieve malicious objectives. Our developed algorithm can achieve three specific attack objectives: I) Un-targeted accuracy degradation attack, ii) Targeted attack, & iii) Trojan attack. Moreover, BFA utilizes the rowhammer technique to demonstrate the bit-flip attack in an actual computer prototype. While the bit-flip attack is conducted in a white-box setting, the subsequent contribution of this thesis is to develop another novel weight perturbation attack in a black-box setting. Consequently, this thesis discusses a new study of DNN model vulnerabilities in a multi-tenant Field Programmable Gate Array (FPGA) cloud under a strict black-box framework. This newly developed attack framework injects faults in the malicious tenant by duplicating specific DNN weight packages during data transmission between off-chip memory and on-chip buffer of a victim FPGA. The proposed attack is also experimentally validated in a multi-tenant cloud FPGA prototype. In the final part, the focus shifts toward deep learning model privacy, popularly known as model extraction, that can steal partial DNN weight parameters remotely with the aid of a memory side-channel attack. In addition, a novel training algorithm is designed to utilize the partially leaked DNN weight bit information, making the model extraction attack more effective. The algorithm effectively leverages the partial leaked bit information and generates a substitute prototype of the victim model with almost identical performance to the victim.
ContributorsRakin, Adnan Siraj (Author) / Fan, Deliang (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Seo, Jae-Sun (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2022
190780-Thumbnail Image.png
Description
Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However they are very computationintensive and require transfer of large volume of data from memory to the

Artificial Intelligence (AI) and Machine Learning (ML) techniques have come a long way since their inception and have been used to build intelligent systems for a wide range of applications in everyday life. However they are very computationintensive and require transfer of large volume of data from memory to the computation units. This memory access time constitute significant part of the computational latency and a performance bottleneck. To address this limitation and the ever-growing demand for implementation in hand-held and edge-devices, In-memory computing (IMC) based AI/ML hardware accelerators have emerged. First, the dissertation presents an IMC static random access memory (SRAM) based hardware modeling and optimization framework. A unified systematic study closely models the IMC hardware, and investigates how a number of design variables and non-idealities (e.g. device mismatch and ADC quantization) affect the Deep Neural Network (DNN) accuracy of the IMC design. The framework allows co-optimized selection of different design variables accounting for sources of noise in IMC hardware and robust implementation of a high accuracy DNN. Next, it presents a kNN hardware accelerator in 65nm Complementary Metal-Oxide-Semiconductor (CMOS) technology. The accelerator combines an IMC SRAM that is developed for binarized deep neural networks and other digital hardware that performs top-k sorting. The simulated k Nearest Neighbor accelerator design processes up to 17.9 million query vectors per second while consuming 11.8 mW, demonstrating >4.8× energy-efficiency improvement over prior works. This dissertation also presents a novel floating-point precision IMC (FP-IMC) macro with a hybrid architecture that configurably supports two Floating Point (FP) precisions. Implementing FP precision MAC has been a challenge owing to its complexity. The design is implemented on 28nm CMOS, and taped-out on chip demonstrating 12.1 TFLOPS/W and 66.1 TFLOPS/W for 8-bit Floating Point (FP8) and Block Floating point (BF8) respectively. Finally, another iteration of the FP design is presented that is modeled to support multiple precision modes from FP8 up to FP32. Two approaches to the architectural design were compared illustrating the throughput-area overhead trade-off. The simulated design shows a 2.1 × normalized energy-efficiency compared to the on-chip implementation of the FP-IMC.
ContributorsSaikia, Jyotishman (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Fan, Deliang (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2023
189353-Thumbnail Image.png
Description
In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers

In recent years, Artificial Intelligence (AI) (e.g., Deep Neural Networks (DNNs), Transformer) has shown great success in real-world applications due to its superior performance in various cognitive tasks. The impressive performance achieved by AI models normally accompanies the cost of enormous model size and high computational complexity, which significantly hampers their implementation on resource-limited Cyber-Physical Systems (CPS), Internet-of-Things (IoT), or Edge systems due to their tightly constrained energy, computing, size, and memory budget. Thus, the urgent demand for enhancing the \textbf{Efficiency} of DNN has drawn significant research interests across various communities. Motivated by the aforementioned concerns, this doctoral research has been mainly focusing on Enabling Deep Learning at Edge: From Efficient and Dynamic Inference to On-Device Learning. Specifically, from the inference perspective, this dissertation begins by investigating a hardware-friendly model compression method that effectively reduces the size of AI model while simultaneously achieving improved speed on edge devices. Additionally, due to the fact that diverse resource constraints of different edge devices, this dissertation further explores dynamic inference, which allows for real-time tuning of inference model size, computation, and latency to accommodate the limitations of each edge device. Regarding efficient on-device learning, this dissertation starts by analyzing memory usage during transfer learning training. Based on this analysis, a novel framework called "Reprogramming Network'' (Rep-Net) is introduced that offers a fresh perspective on the on-device transfer learning problem. The Rep-Net enables on-device transferlearning by directly learning to reprogram the intermediate features of a pre-trained model. Lastly, this dissertation studies an efficient continual learning algorithm that facilitates learning multiple tasks without the risk of forgetting previously acquired knowledge. In practice, through the exploration of task correlation, an interesting phenomenon is observed that the intermediate features are highly correlated between tasks with the self-supervised pre-trained model. Building upon this observation, a novel approach called progressive task-correlated layer freezing is proposed to gradually freeze a subset of layers with the highest correlation ratios for each task leading to training efficiency.
ContributorsYang, Li (Author) / Fan, Deliang (Thesis advisor) / Seo, Jae-Sun (Committee member) / Zhang, Junshan (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created2023
187583-Thumbnail Image.png
Description
Modern-day automobiles are becoming more connected and reliant on wireless connectivity. Thus, automotive electronics can be both a cause of and highly sensitive to electromagnetic interference (EMI), and the consequences of failure can be fatal. Technology advancements in engineering have brought several features into the automotive field but at the

Modern-day automobiles are becoming more connected and reliant on wireless connectivity. Thus, automotive electronics can be both a cause of and highly sensitive to electromagnetic interference (EMI), and the consequences of failure can be fatal. Technology advancements in engineering have brought several features into the automotive field but at the expense of electromagnetic compatibility issues. Automotive EMC problems are the result of the emissions from electronic assemblies inside a vehicle and the susceptibility of the electronics when exposed to external EMI sources. In both cases, automotive EMC problems can cause unintended changes in the automotive system operation. Robustness to electromagnetic interference (EMI) is one of the primary design aspects of state-of-the-art automotive ICs like System Basis Chips (SBCs) which provide a wide range of analog, power regulation and digital functions on the same die. One of the primary sources of conducted EMI on the Local Interconnect Network (LIN) driver output is an integrated switching DC-DC regulator noise coupling through the parasitic substrate capacitance of the SBC. In this dissertation an adaptive active EMI cancellation technique to cancel the switching noise of the DC-DC regulator on the LIN driver output to ensure electromagnetic compatibility (EMC) is presented. The proposed active EMI cancellation circuit synthesizes a phase synchronized cancellation pulse which is then injected onto the LIN driver output using an on-chip tunable capacitor array to cancel the switching noise injected via the substrate. The proposed EMI reduction technique can track and cancel substrate noise independent of process technology and device parasitics, input voltage, duty cycle, and loading conditions of the DC-DC switching regulator. The EMI cancellation system is designed and fabricated on a 180nm Bipolar-CMOS-DMOS (BCD) process with an integrated power stage of a DC-DC buck regulator at a switching frequency of 2MHz along with an automotive LIN driver. The EMI cancellation circuit occupies an area of 0.7 mm2, which is less than 3% of the overall area in a standard SBC and consumes 12.5 mW of power and achieves 25 dB reduction of conducted EMI in the LIN driver output’s power spectrum at the switching frequency and its harmonics.
ContributorsRay, Abhishek (Author) / Bakkaloglu, Bertan (Thesis advisor) / Garrity, Douglas (Committee member) / Kitchen, Jennifer (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2023
171380-Thumbnail Image.png
Description
Deep neural networks (DNNs), as a main-stream algorithm for various AI tasks, achieve higher accuracy at the cost of increased computational complexity and model size, posing great challenges to hardware platforms. This dissertation first tackles the design challenges of resistive random-access-memory (RRAM) based in-memory computing (IMC) architectures. A new metric,

Deep neural networks (DNNs), as a main-stream algorithm for various AI tasks, achieve higher accuracy at the cost of increased computational complexity and model size, posing great challenges to hardware platforms. This dissertation first tackles the design challenges of resistive random-access-memory (RRAM) based in-memory computing (IMC) architectures. A new metric, model stability from the loss landscape, is proposed to help shed light on accuracy under variations and model compression and guide a novel variation-aware training (VAT) solution. The proposed method effectively improves post-mapping accuracy of multiple datasets. Next, a hybrid RRAM/SRAM IMC DNN inference accelerator is developed, that integrates an RRAM-based IMC macro, a reconfigurable SRAM-based multiply-accumulate (MAC) macro, and a programmable shifter. The hybrid IMC accelerator fully recovers the inference accuracy post the mapping. Furthermore, this dissertation researches on architectural optimizations for high IMC utilization, low on-chip communication cost, and low energy-delay product (EDP), including on-chip interconnect design, PE array utilization, and tile-to-router mapping and scheduling. The optimal choice of on-chip interconnect results in up to 6x improvement in energy-delay-area product for RRAM IMC architectures. Furthermore, the PE and NoC optimizations show up to 62% improvement in PE utilization, 78% reduction in area, and 78% lower energy-area product for a wide range of modern DNNs. Finally, this dissertation proposes a novel chiplet-based IMC benchmarking simulator, SIAM, and a heterogeneous chiplet IMC architecture to address the limitations of a monolithic DNN accelerator. SIAM utilizes model-based and cycle-accurate simulation to provide a scalable and flexible architecture. SIAM is calibrated against a published silicon result, SIMBA, from Nvidia. The heterogeneous architecture utilizes a custom mapping with a bank of big and little chiplets, and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. The proposed big-little chiplet-based RRAM IMC architecture significantly improves energy efficiency at lower area, compared to conventional GPUs. In summary, this dissertation comprehensively investigates novel methods that encompass device, circuits, architecture, packaging, and algorithm to design scalable high-performance and energy-efficient IMC architectures.
ContributorsKrishnan, Gokul (Author) / Cao, Yu (Thesis advisor) / Seo, Jae-Sun (Committee member) / Chakrabarti, Chaitali (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2022
171616-Thumbnail Image.png
Description
Computer vision is becoming an essential component of embedded system applications such as smartphones, wearables, autonomous systems and internet-of-things (IoT). These applications are generally deployed into environments with limited energy, memory bandwidth and computational resources. This trend is driving the development of energy-effi cient image processing solutions from sensing to

Computer vision is becoming an essential component of embedded system applications such as smartphones, wearables, autonomous systems and internet-of-things (IoT). These applications are generally deployed into environments with limited energy, memory bandwidth and computational resources. This trend is driving the development of energy-effi cient image processing solutions from sensing to computation. In this thesis, diff erent alternatives are explored to implement energy-efficient computer vision systems. First, I present a fi eld programmable gate array (FPGA) implementation of an adaptive subsampling algorithm for region-of-interest (ROI) -based object tracking. By implementing the computationally intensive sections of this algorithm on an FPGA, I aim to offl oad computing resources from energy-ineffi cient graphics processing units (GPUs) and/or general-purpose central processing units (CPUs). I also present a working system executing this algorithm in near real-time latency implemented on a standalone embedded device. Secondly, I present a neural network-based pipeline to improve the performance of event-based cameras in non-ideal optical conditions. Event-based cameras or dynamic vision sensors (DVS) are bio-inspired sensors that measure logarithmic per-pixel brightness changes in a scene. Their advantages include high dynamic range, low latency and ultra-low power when compared to standard frame-based cameras. Several tasks have been proposed to take advantage of these novel sensors but they rely on perfectly calibrated optical lenses that are in-focus. In this work I propose a methodto reconstruct events captured with an out-of-focus event-camera so they can be fed into an intensity reconstruction task. The network is trained with a dataset generated by simulating defocus blur in sequences from object tracking datasets such as LaSOT and OTB100. I also test the generalization performance of this network in scenes captured with a DAVIS event-based sensor equipped with an out-of-focus lens.
ContributorsTorres Muro, Victor Isaac (Author) / Jayasuriya, Suren (Thesis advisor) / Spanias, Andreas (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2022
171408-Thumbnail Image.png
Description
A remarkable phenomenon in contemporary physics is quantum scarring in classically chaoticsystems, where the wave functions tend to concentrate on classical periodic orbits. Quantum scarring has been studied for more than four decades, but the problem of efficiently detecting quantum scars has remained to be challenging, relying mostly on human visualization of wave

A remarkable phenomenon in contemporary physics is quantum scarring in classically chaoticsystems, where the wave functions tend to concentrate on classical periodic orbits. Quantum scarring has been studied for more than four decades, but the problem of efficiently detecting quantum scars has remained to be challenging, relying mostly on human visualization of wave function patterns. This paper develops a machine learning approach to detecting quantum scars in an automated and highly efficient manner. In particular, this paper exploits Meta learning. The first step is to construct a few-shot classification algorithm, under the requirement that the one-shot classification accuracy be larger than 90%. Then propose a scheme based on a combination of neural networks to improve the accuracy. This paper shows that the machine learning scheme can find the correct quantum scars from thousands images of wave functions, without any human intervention, regardless of the symmetry of the underlying classical system. This will be the first application of Meta learning to quantum systems. Interacting spin networks are fundamental to quantum computing. Data-based tomography oftime-independent spin networks has been achieved, but an open challenge is to ascertain the structures of time-dependent spin networks using time series measurements taken locally from a small subset of the spins. Physically, the dynamical evolution of a spin network under time-dependent driving or perturbation is described by the Heisenberg equation of motion. Motivated by this basic fact, this paper articulates a physics-enhanced machine learning framework whose core is Heisenberg neural networks. This paper demonstrates that, from local measurements, not only the local Hamiltonian can be recovered but the Hamiltonian reflecting the interacting structure of the whole system can also be faithfully reconstructed. Using Heisenberg neural machine on spin networks of a variety of structures. In the extreme case where measurements are taken from only one spin, the achieved tomography fidelity values can reach about 90%. The developed machine learning framework is applicable to any time-dependent systems whose quantum dynamical evolution is governed by the Heisenberg equation of motion.
ContributorsHan, Chendi (Author) / Lai, Ying-Cheng (Thesis advisor) / Yu, Hongbin (Committee member) / Dasarathy, Gautam (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created2022
168530-Thumbnail Image.png
Description
Edge computing applications have recently gained prominence as the world of internet-of-things becomes increasingly embedded into people's lives. Performing computations at the edge addresses multiple issues, such as memory bandwidth-latency bottlenecks, exposure of sensitive data to external attackers, etc. It is important to protect the data collected and processed by

Edge computing applications have recently gained prominence as the world of internet-of-things becomes increasingly embedded into people's lives. Performing computations at the edge addresses multiple issues, such as memory bandwidth-latency bottlenecks, exposure of sensitive data to external attackers, etc. It is important to protect the data collected and processed by edge devices, and also to prevent unauthorized access to such data. It is also important to ensure that the computing hardware fits well within the tight energy and area budgets for the edge devices which are being progressively scaled-down in size. Firstly, a novel low-power smart security prototype chip that combines multiple entropy sources, such as real-time electrocardiogram (ECG) data, and SRAM-based physical unclonable functions (PUF), for authentication and cryptography applications is proposed. Up to ~12X improvement in the equal error rate compared to a prior ECG-only authentication system is achieved by combining feature vectors obtained from ECG, heart rate variability, and SRAM PUF. The resulting vectors can also be utilized for secure cryptography applications. Secondly, a novel in-memory computing (IMC) hardware noise-aware training algorithms that make DNNs more robust to hardware noise is developed and evaluated. Up to 17% accuracy was recovered in deep neural networks (DNNs) deployed on IMC prototype hardware. The noise-aware training principles are also used to improve the adversarial robustness of DNNs, and successfully defend against both adversarial input and weight attacks. Up to ~10\% improvement in robustness against adversarial input attacks, and up to 33% improvement in robustness against adversarial weight attacks are achieved. Finally, a DNN training algorithm that pursues and optimises both activation and weight sparsity simultaneously is proposed and evaluated to obtain highly compressed DNNs. This lead to up to 4.7x reduction in the total number of flops required to perform complex image recognition tasks. A custom sparse inference accelerator is designed and synthesized to evaluate the benefits of the above flop reduction. A speedup of 4.24x is achieved. In summary, this dissertation contains innovative algorithm and hardware design techniques aided by machine learning, which enhance the security and efficiency of edge computing applications.
ContributorsCherupally, Sai Kiran (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Kevin) (Committee member) / Fan, Deliang (Committee member) / Arizona State University (Publisher)
Created2022